Commit 5b09123

Merge pull request #165 from daeisbae/164-convert-to-fastapi-backend
Convert to fastapi backend with docker support (#164)
2 parents ee8384f + 61e2ff8 commit 5b09123

77 files changed

+2185
-2354
lines changed

.env.example

Lines changed: 3 additions & 1 deletion
@@ -29,4 +29,6 @@ TOKEN_PROCESSING_CHARACTER_LIMIT=30000 # Approx useful for 64k context window, a
 # Maximum retries for trying to input the code. If it is still beyond the limit, it will be try max retries then stop. To prevent huge input token billing
 TOKEN_PROCESSING_MAX_RETRIES=3
 # Reduce the number of characters per retry. You can think it as PROCESSOR_CHAR_LIMIT - REDUCE_CHAR_PER_RETRY * retries of characters will be processed in each retry
-TOKEN_PROCESSING_REDUCE_CHAR_PER_RETRY=3000 # Approx useful for 64k context window
+TOKEN_PROCESSING_REDUCE_CHAR_PER_RETRY=3000 # Approx useful for 64k context window
+
+NEXT_PUBLIC_API_ENDPOINT=
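
The retry comment in `.env.example` describes a shrinking character budget: limit minus reduction times retry count. A minimal sketch of that arithmetic follows. It assumes the first attempt counts as retry 0 and that a non-positive budget ends the loop; the function name `retry_char_budgets` is hypothetical and is not taken from the repository.

```python
from typing import List


def retry_char_budgets(char_limit: int, reduce_per_retry: int, max_retries: int) -> List[int]:
    """Return the character budget for each attempt (hypothetical helper)."""
    budgets: List[int] = []
    for retry in range(max_retries):
        # Budget shrinks linearly with the retry index, as the comment describes.
        budget = char_limit - reduce_per_retry * retry
        if budget <= 0:
            # Assumption: nothing useful can be sent with a non-positive budget.
            break
        budgets.append(budget)
    return budgets


# Values from .env.example: 30000 limit, 3000 reduction, 3 retries.
print(retry_char_budgets(30000, 3000, 3))
```

With the example values, the budgets shrink by 3000 characters per attempt, so the third attempt would send at most 24000 characters.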

README.md

Lines changed: 9 additions & 1 deletion
@@ -19,8 +19,15 @@
 - PostgreSQL (For storing the summarized repository information)
 - Github API Key (To get more quota requesting the repository data)
 - Amazon S3 (You can ignore the parameters if you are going to use it locally. You need to use certificate for your Database if you are going to host it.)
+- Docker (If you are hosting locally)

-### Configuration
+### Configuration (Local)
+
+1. Copy `.env.example` to `.env`
+2. Configure all the variables given in `.env`
+3. Run `docker compose up` or `docker compose up -d` to hide the output
+
+### Configuration (Cloud)

 1. Create PostgreSQL instance
 2. Copy `.env.example` to `.env`
@@ -30,6 +37,7 @@
 6. Build the server (`npm run build`)
 7. Run (`npm start`)

+
 #### Ollama Configuration Guide

 - It's recommended if you can run bigger LLM than 14b parameter.

backend/Dockerfile

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+FROM python:3.12-slim
+
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    libpq-dev \
+    && rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt .
+
+RUN pip install --upgrade pip
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY . .
+
+EXPOSE 8080
+
+CMD ["bash", "-c", "python db/scripts/init_db.py && uvicorn main:app --host 0.0.0.0 --port 8080"]
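
The `CMD` chains the database init script and the server with `&&`, so the server only starts if the init script exits with status 0. The sketch below mirrors that short-circuit behaviour in Python for illustration only. `run_in_order` is a hypothetical helper, not part of the commit, and the commands are stand-ins for `init_db.py` and `uvicorn`.

```python
import subprocess
import sys
from typing import List, Sequence


def run_in_order(commands: Sequence[List[str]]) -> int:
    """Run commands one after another and stop at the first failure, like `a && b`."""
    for command in commands:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            # A failing step prevents every later step from running.
            return completed.returncode
    return 0


if __name__ == "__main__":
    # Stand-ins: the first command plays the init script, the second the server.
    status = run_in_order([
        [sys.executable, "-c", "print('init ok')"],
        [sys.executable, "-c", "print('server would start here')"],
    ])
    print(status)
```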

backend/__init__.py

Whitespace-only changes.

backend/agent/__init__.py

Whitespace-only changes.

backend/agent/code_splitter.py

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+from enum import Enum
+from typing import Optional, List
+
+from langchain_core.documents import Document
+from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
+
+from loguru import logger
+
+def get_language_from_extension(extension: str) -> Optional[Language]:
+    """
+    Retrieves the programming language associated with a given file extension.
+
+    :param extension: The file extension excluding the dot (e.g., 'js', 'py').
+    :return: The corresponding Language enum or None if not supported.
+    """
+    extension_to_language_map = {
+        'py': Language.PYTHON,
+        'js': Language.JS,
+        'jsx': Language.JS,
+        'ts': Language.TS,
+        'tsx': Language.TS,
+        'mjs': Language.JS,
+        'cjs': Language.JS,
+        'go': Language.GO,
+        'rb': Language.RUBY,
+        'rs': Language.RUST,
+        'php': Language.PHP,
+        'cpp': Language.CPP,
+        'cc': Language.CPP,
+        'c': Language.C,
+        'cxx': Language.CPP,
+        'hpp': Language.CPP,
+        'hxx': Language.CPP,
+        'h': Language.C,
+        'java': Language.JAVA,
+        'kt': Language.KOTLIN,
+        'cs': Language.CSHARP,
+        'scala': Language.SCALA,
+        'swift': Language.SWIFT,
+        'lua': Language.LUA,
+        'pl': Language.PERL,
+        'hs': Language.HASKELL,
+        'lhs': Language.HASKELL,
+        'md': Language.MARKDOWN
+    }
+    return extension_to_language_map.get(extension.lower())
+
+
+class LineNumberTextSplitter(RecursiveCharacterTextSplitter):
+    """
+    A custom text splitter that tracks and annotates line numbers for each chunk.
+    """
+
+    def create_documents(self, texts: List[str], **kwargs) -> List[Document]:
+        documents = []
+        current_line = 1  # Initialize the starting line number
+
+        for text in texts:
+            # Split the text into chunks using the parent class's method
+            chunks = self.split_text(text)
+            for chunk in chunks:
+                # Calculate the number of lines in the chunk
+                num_lines = chunk.count('\n') + 1
+                doc = Document(
+                    page_content=chunk,
+                    metadata={
+                        'loc': {
+                            'lines': {
+                                'from': current_line,
+                                'to': current_line + num_lines - 1
+                            }
+                        }
+                    }
+                )
+                documents.append(doc)
+                current_line += num_lines  # Update the current line number
+
+        return documents
+
+class CodeSplitter:
+    def __init__(self, chunk_size: int, chunk_overlap: int):
+        """
+        Constructor for CodeSplitter.
+
+        :param chunk_size: The size of each chunk.
+        :param chunk_overlap: The number of overlapping characters between chunks.
+        """
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+
+    def split_code(self, file_extension: str, code: str) -> Optional[str]:
+        """
+        Splits the provided code into chunks based on the file extension.
+
+        :param file_extension: The file extension indicating the programming language.
+        :param code: The code content to be split.
+        :return: The code with line numbers or None if the language is not supported.
+        """
+        language = get_language_from_extension(file_extension)
+        if not language:
+            logger.warning(f"Unsupported language for extension: {file_extension}")
+            return None
+
+        separators = RecursiveCharacterTextSplitter.get_separators_for_language(language.value)
+        splitter = LineNumberTextSplitter(
+            separators=separators,
+            chunk_size=self.chunk_size,
+            chunk_overlap=self.chunk_overlap,
+            length_function=len
+        )
+
+        try:
+            docs = splitter.create_documents([code])
+        except Exception as e:
+            logger.critical(f"Error during splitting: {e}")
+            return None
+
+        doc_with_metadata = ''
+        for doc in docs:
+            loc = doc.metadata.get('loc', {})
+            lines = loc.get('lines', {})
+            from_line = lines.get('from', 'unknown')
+            to_line = lines.get('to', 'unknown')
+            doc_with_metadata += f'# Lines {from_line} - {to_line}\n{doc.page_content}\n\n'
+        return doc_with_metadata
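
`LineNumberTextSplitter` advances a running counter by each chunk's line count, which is only exact when chunks are contiguous, non-overlapping, and each ends at a line boundary. With a non-zero `chunk_overlap`, or when several chunks come from the same line, the reported ranges may drift upward. One possible alternative, sketched below, locates each chunk in the source text and counts newlines before it. It assumes every chunk is an exact substring of the text and that chunks appear in order; `locate_chunk_lines` is a hypothetical name and is not part of the commit.

```python
from typing import List, Tuple


def locate_chunk_lines(text: str, chunks: List[str]) -> List[Tuple[int, int]]:
    """Return 1-based (from_line, to_line) for each chunk, found by searching in text."""
    ranges: List[Tuple[int, int]] = []
    search_from = 0
    for chunk in chunks:
        start = text.find(chunk, search_from)
        if start == -1:
            # Fall back to a search from the beginning before giving up.
            start = text.find(chunk)
        if start == -1:
            raise ValueError("chunk is not a substring of the source text")
        # Newlines before the chunk give its first line; newlines inside give its span.
        from_line = text.count('\n', 0, start) + 1
        to_line = from_line + chunk.count('\n')
        ranges.append((from_line, to_line))
        # Overlapping chunks may begin inside this one, so advance only past its start.
        search_from = start + 1
    return ranges


source = "a\nb\nc\nd\ne"
print(locate_chunk_lines(source, ["a\nb", "b\nc\nd", "e"]))
# → [(1, 2), (2, 4), (5, 5)]
```

For the same three chunks, the running counter in the commit would report lines 1-2, 3-5, and 6-6, because the overlapping line `b` is counted twice.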

backend/agent/index.py

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+from typing import Optional, List
+
+from agent.prompt import CodePrompt, FolderPrompt
+from agent.schema_parser import SchemaParser
+from agent.schema_factory import FileSchema, FolderSchema
+from agent.prompt_generator import PromptGenerator, FilePromptTemplateVariables, PromptTemplateConfig, PromptType, \
+    RepoInfo, \
+    FolderPromptTemplateVariables
+from agent.code_splitter import CodeSplitter
+from llm.llm_provider import LLMProvider
+
+
+# Base Processor
+class BaseProcessor:
+    def __init__(self, llm: LLMProvider):
+        self.llm = llm
+        self.schema_parser: Optional[SchemaParser] = None
+        self.prompt_generator: Optional[PromptGenerator] = None
+
+    async def process(self, prompt: str) -> dict:
+        response = await self.llm.run(prompt)
+        return self.schema_parser.parse(response)
+
+
+# Code Processor
+class CodeProcessor(BaseProcessor):
+    def __init__(self, llm: LLMProvider):
+        super().__init__(llm)
+        self.code_splitter = CodeSplitter(200, 25)
+        self.schema_parser = SchemaParser(FileSchema)
+        self.prompt_generator = PromptGenerator(
+            PromptTemplateConfig(
+                template=(
+                    'The following instruction is given:\n{requirements}\n{format_instructions}\n'
+                    'The given repository owner is {repo_owner} with repository name of {repo_name}\n'
+                    'The commit SHA referenced is {commit_sha}\n'
+                    'The path of the file is {path}\n'
+                    'Below is the code for your task: {code}'
+                )
+            ),
+            PromptType.FILE
+        )
+
+    async def generate(self, code: str, repo_info: dict[str, str]) -> dict:
+        extension = repo_info.get('path').split('.').pop()
+        splitted_code = self.code_splitter.split_code(extension, code)
+        variables = FilePromptTemplateVariables(
+            requirements=CodePrompt,
+            format_instructions=self.schema_parser.format_instructions,
+            code=splitted_code,
+            repo_name=repo_info.get('repo_name'),
+            repo_owner=repo_info.get('repo_owner'),
+            commit_sha=repo_info.get('commit_sha'),
+            path=repo_info.get('path'),
+        )
+        prompt = await self.prompt_generator.generate(variables, code=variables.code)
+        return await self.process(prompt)
+
+
+# Folder Processor
+class FolderProcessor(BaseProcessor):
+    def __init__(self, llm: LLMProvider):
+        super().__init__(llm)
+        self.schema_parser = SchemaParser(FolderSchema)
+        self.prompt_generator = PromptGenerator(
+            PromptTemplateConfig(
+                template=(
+                    'The following instruction is given:\n{requirements}\n{format_instructions}\n'
+                    'The given repository owner is {repo_owner} with repository name of {repo_name}\n'
+                    'The commit SHA referenced is {commit_sha}\n'
+                    'The path of the folder is {path}\n'
+                    'Below are the summaries for the codebase:\n{ai_summaries}'
+                )
+            ),
+            PromptType.FOLDER
+        )
+
+    async def generate(self, ai_summaries: List[str], repo_info: dict[str, str]) -> dict:
+        variables = FolderPromptTemplateVariables(
+            requirements=FolderPrompt,
+            format_instructions=self.schema_parser.format_instructions,
+            ai_summaries='\n'.join(ai_summaries),
+            repo_owner=repo_info.get('repo_owner'),
+            commit_sha=repo_info.get('commit_sha'),
+            path=repo_info.get('path'),
+            repo_name=repo_info.get('repo_name'),
+        )
+        prompt = await self.prompt_generator.generate(variables, ai_summaries=ai_summaries)
+        return await self.process(prompt)
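
`CodeProcessor.generate` derives the extension with `path.split('.').pop()`. For a path without a dot in the file name, such as `Makefile` or `dir.v2/README`, that expression returns the whole path or a fragment containing a slash rather than an extension. The lookup in `get_language_from_extension` would then presumably miss, and `split_code` would return `None`. A possible alternative using only the standard library is sketched below; `extension_of` is a hypothetical helper and is not part of the commit.

```python
from pathlib import PurePosixPath
from typing import Optional


def extension_of(path: str) -> Optional[str]:
    """Return the lowercase extension without the dot, or None when there is none."""
    # PurePosixPath treats '/' as the separator, matching repository paths.
    suffix = PurePosixPath(path).suffix
    return suffix[1:].lower() if suffix else None


for example in ["backend/agent/index.py", "src/App.TSX", "Makefile", "dir.v2/README"]:
    print(example, extension_of(example))
```

A caller could then skip splitting when `extension_of` returns `None` instead of passing a non-extension string to the language lookup.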

backend/agent/prompt.py

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+CodePrompt: str = """
+You are an expert software engineer and your task is to deeply analyze a provided codebase from a GitHub repository. Your goal is to generate a comprehensive and structured summary of the codebase that is suitable for a developer-friendly wiki page in markdown format but without backticks.
+
+**Input:**
+
+You will receive the following information, extracted from a GitHub repository:
+
+1. **Repository Description:**
+    * 'description': (A textual description of the repository, although it may not be available or be correct)
+2. **Code File:**
+    * The raw content of code files within the repository.
+    * The owner of the repository.
+    * The repository name.
+    * The commit sha of the repository.
+    * The path to the code file within the repository.
+
+**Analysis Tasks:**
+
+1. **High-Level Overview:**
+    * Provide a concise summary of the file responsibilities and functionalities based on it\'s content.
+    * Explain its role in the overall system.
+    * Identify its dependencies on other modules/components.
+    * Highlight any important classes, functions, or data structures.
+    * Link all the code blocks (Class,Function,Enum,Exception) that are referenced using the following markdown link format: [`Description of Code Block`](Full github url of the file including the start line with optional ending line#L{startLine}-L{endLine}). This is in the form of "https://github.com/{owner}/{repo}/blob/{commitSha}/{path}#L{lineStart}-L{lineEnd}".
+2. **Code-Level Insights:**
+    * Analyze the code files to understand the implementation details.
+    * Identify core algorithms, data structures, and design patterns used.
+    * Provide a summary of how data flows between different parts of the system.
+3. **Dependencies and Relationships:**
+    * Clearly document the relationships between different modules, classes, and functions.
+    * Explain how different parts of the codebase interact with each other.
+
+**Output:**
+"""
+
+FolderPrompt: str = """
+You are an expert software engineer and your task is to deeply analyze a provided codebase from a GitHub repository. Your goal is to generate a comprehensive and structured summary of the codebase that is suitable for a developer-friendly wiki page in markdown format but without backticks.
+
+**Input:**
+
+You will receive the following information, summarized from the expert software engineer:
+
+1. **Repository Description:**
+    * 'description': (A textual description of the repository, although it may not be available or be correct)
+2. **Code Files:**
+    * The summary of code files within the repository.
+    * The owner of the repository.
+    * The repository name.
+    * The commit sha of the repository.
+    * The path to the code file within the repository.
+
+**Analysis Tasks:**
+
+1. **High-Level Overview:**
+    * Start by providing the core functionality among the folders or files. (ex. the folder name \"core\", \"src\" or folder with the same repository name usually contains the core functionality of the system. You can ignore utility folders unless they contain important information or there are nothing to explain.)
+    * Provide a concise summary of the folder's responsibilities and functionalities based on it\'s sub-files and sub-folders summaries.
+    * Explain its role in the overall system.
+    * Identify its dependencies on other modules/components/folder.
+    * Highlight any important classes, functions, or data structures in it's sub-files and sub-folders.
+    * Link all the code blocks that are referenced using the following markdown link format: [`Description of Code Block`](Full github url of the file including the start line with optional ending line#L{startLine}-L{endLine}). This is in the form of "https://github.com/{owner}/{repo}/blob/{commitSha}/{path}#L{lineStart}-L{lineEnd}".
+2. **Dependencies and Relationships:**
+    * Clearly document the relationships between different folders and files.
+    * Explain how different parts of the codebase interact with each other.
+
+**Output:**
+"""

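Both prompts ask the model to emit links of the form `https://github.com/{owner}/{repo}/blob/{commitSha}/{path}#L{lineStart}-L{lineEnd}`. The sketch below builds such a link programmatically, which could be useful for checking model output against the expected shape. `build_permalink` and the example owner, repository, and SHA values are hypothetical and are not part of the commit.

```python
from typing import Optional
from urllib.parse import quote


def build_permalink(owner: str, repo: str, commit_sha: str, path: str,
                    line_start: int, line_end: Optional[int] = None) -> str:
    """Build a GitHub blob link pinned to a commit, with a line or line range anchor."""
    # quote() keeps '/' by default, so nested paths stay readable.
    url = f"https://github.com/{owner}/{repo}/blob/{commit_sha}/{quote(path)}#L{line_start}"
    if line_end is not None and line_end != line_start:
        url += f"-L{line_end}"
    return url


# Placeholder values for illustration only.
print(build_permalink("example-owner", "example-repo", "abc1234",
                      "backend/agent/prompt.py", 1, 34))
```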