Feat/cog-544 eval on swe bench #232
Conversation
Walkthrough: The changes introduce two new files.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (9)
evals/eval_swe_bench.py (3)

21-21: Consider parameterizing the code length limit
The line code_text = dataset[0]["text"][:100000] truncates the text to 100,000 characters. If this limit is important, consider defining it as a constant or parameter to improve code readability and maintainability. Additionally, a comment explaining the rationale behind the specific limit would be helpful.
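As a rough illustration of that suggestion, the limit could live in a named module-level constant; MAX_CODE_TEXT_LENGTH and the helper below are hypothetical names, not part of the PR.

```python
# Hypothetical sketch: name the truncation limit instead of hardcoding 100000 inline.
MAX_CODE_TEXT_LENGTH = 100_000  # assumed rationale: keep the prompt within the LLM context budget

def truncate_code_text(text: str, limit: int = MAX_CODE_TEXT_LENGTH) -> str:
    """Return at most `limit` characters of the instance text."""
    return text[:limit]

# Usage at the original call site would become:
# code_text = truncate_code_text(dataset[0]["text"])
```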
97-97: Handle potential None value for model_patch
When writing pred["model_patch"] to the patch_file, if model_patch is None, it defaults to an empty string. Ensure that this behavior is intentional. If not, consider adding a check or handling the None case appropriately.
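One hedged sketch of how the None case could be handled explicitly; the skip-and-warn policy and the helper name are assumptions for illustration, not the PR's actual behavior.

```python
import logging

logger = logging.getLogger(__name__)

def write_patch_file(pred: dict, patch_path: str) -> bool:
    """Write the predicted patch to disk; skip predictions that have no patch."""
    model_patch = pred.get("model_patch")
    if not model_patch:  # covers both None and empty string
        logger.warning("No model_patch for instance %s; skipping", pred.get("instance_id"))
        return False
    with open(patch_path, "w") as patch_file:
        patch_file.write(model_patch)
    return True
```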
101-101: Review the use of shell=True in subprocess.run
Using shell=True can pose security risks if the command string contains untrusted input. Ensure that all commands in test_spec.repo_script_list are from trusted sources. If possible, consider using a list of arguments without shell=True to enhance security.
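A minimal sketch of the argument-list alternative, assuming each entry in test_spec.repo_script_list is a plain command without shell operators such as pipes or &&; commands that rely on shell features would still need shell=True or a different approach.

```python
import shlex
import subprocess

def run_repo_script(command: str) -> subprocess.CompletedProcess:
    """Run one setup command as an argument list instead of a shell string."""
    args = shlex.split(command)  # e.g. "git checkout abc123" -> ["git", "checkout", "abc123"]
    return subprocess.run(args, check=True, capture_output=True, text=True)
```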
evals/deepeval_on_swe_bench.py (6)

41-41: Move module-level imports to the top of the file
Module-level import statements should be placed at the top of the file, after any module comments and docstrings, to comply with Python's PEP 8 style guidelines.
Apply this diff to move the import statements:

+from cognee.infrastructure.llm.get_llm_client import get_llm_client
+import logging
+import os
+from cognee.base_config import get_base_config
+from cognee.infrastructure.databases.vector import get_vector_engine
 from typing import List, Dict, Type
 from swebench.harness.utils import load_swebench_dataset
 from deepeval.dataset import EvaluationDataset
 from deepeval.test_case import LLMTestCase
 from pydantic import BaseModel
-from deepeval.synthesizer import Synthesizer
-# ... other code ...
-# Import statements not at the top
-from cognee.infrastructure.llm.get_llm_client import get_llm_client
-import logging
-import os
-from cognee.base_config import get_base_config
-from cognee.infrastructure.databases.vector import get_vector_engine

Also applies to: 46-46, 85-87
81-81: Remove or utilize the unused variable graph
The variable graph is assigned but not used, which may indicate redundant code or missing functionality. Apply this diff to remove the unused variable assignment:

 async def run_cognify_base_rag():
     from cognee.api.v1.add import add
     from cognee.api.v1.prune import prune
     from cognee.api.v1.cognify.cognify import cognify

     await prune.prune_system()
     await add("data://test_datasets", "initial_test")
-    graph = await cognify("initial_test")
+    await cognify("initial_test")
     pass
148-149: Remove duplicate code for loading datasets
The code loads swe_dataset and test_dataset at lines 148-149 and again at lines 162-163. This seems redundant; consider removing the duplication to avoid unnecessary processing. Apply this diff to remove the duplicate code outside the main block:

-swe_dataset = load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
-test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

 # Other code...

 if __name__ == "__main__":
     import asyncio

     async def main():
         # await run_cognify_base_rag()
         # await cognify_search_base_rag("show_all_processes", "context")
         await cognify_search_graph("show_all_processes", "context")

     asyncio.run(main())

     # run_cognify_base_rag_and_search()
     #
     # Data preprocessing before setting the dataset test cases
     swe_dataset = load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
     test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

Also applies to: 162-163
60-67: Add type hint for context parameter in get_answer function
The context parameter in get_answer lacks a type annotation. Adding a type hint improves code readability and type checking. Apply this diff to add the type hint:

def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):
    try:
        return get_answer_base(
            content,
            context,
            model
        )
117-117: Avoid redundant type conversion when accessing model fields
The str() conversion around get_answer(...).model_dump()['response'] may be unnecessary if 'response' is already a string. Ensure that type conversions are necessary to prevent unexpected results. If 'response' is already a string, you can remove the str() conversion:

actual_output = str(get_answer(case.input, case.context).model_dump()['response']),
# Change to:
actual_output = get_answer(case.input, case.context).model_dump()['response'],

Also applies to: 138-138
68-70: Avoid catching broad exceptions
Catching the base Exception class can make debugging difficult and may mask other issues. Consider catching more specific exceptions. If you expect specific exceptions, it's better to catch those explicitly. For example:

 def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):
     try:
         return get_answer_base(
             content,
             context,
             model
         )
-    except Exception as error:
+    except SomeSpecificException as error:
         logger.error("Error extracting cognitive layers from content: %s", error, exc_info=True)
         raise error

Replace SomeSpecificException with the actual exception types you expect.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
evals/deepeval_on_swe_bench.py (1 hunks)
evals/eval_swe_bench.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py
41-41: Module level import not at top of file (E402)
46-46: Module level import not at top of file (E402)
81-81: Local variable graph is assigned to but never used. Remove assignment to unused variable graph (F841)
85-85: Module level import not at top of file (E402)
86-86: Module level import not at top of file (E402)
87-87: Module level import not at top of file (E402)
evals/eval_swe_bench.py
30-30: f-string without any placeholders. Remove extraneous f prefix (F541)
31-31: f-string without any placeholders. Remove extraneous f prefix (F541)
32-32: f-string without any placeholders. Remove extraneous f prefix (F541)
33-33: f-string without any placeholders. Remove extraneous f prefix (F541)
124-124: Local variable results is assigned to but never used. Remove assignment to unused variable results (F841)
🔇 Additional comments (2)
evals/eval_swe_bench.py (1)
61-61: Address the TODO: Verify the system_prompt parameter
There's a TODO comment indicating uncertainty about using prompt as the system_prompt. Please verify whether setting system_prompt = prompt is correct, or whether adjustments are needed to align with the LLM client's expectations.
Would you like assistance in verifying the correct usage of the system_prompt parameter?
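Purely as an illustration of one arrangement to verify against the LLM client's expectations: keep a short instruction as the system prompt and pass the retrieved context alongside the user input. The helper name and prompt wording below are assumptions, not a confirmed fix.

```python
def predict_patch(llm_client, problem_statement: str, context_text: str) -> str:
    # Assumed split: short instruction as system prompt, context appended to the user input.
    instructions = "Respond with a single patch file that can be applied with git apply."
    return llm_client.create_structured_output(
        text_input="\n".join([problem_statement, "Relevant context:", context_text]),
        system_prompt=instructions,
        response_model=str,
    )
```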
evals/deepeval_on_swe_bench.py (1)
127-127: Confirm if limiting to the first 4 items in swe_dataset is intentional
The code processes only the first 4 items of swe_dataset. Please verify if this limitation is intentional or if the entire dataset should be processed.
If you intend to process the entire dataset, apply this diff:
def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
deepeval_dataset = EvaluationDataset()
- for datum in swe_dataset[:4]:
+ for datum in swe_dataset:
input = datum["problem_statement"]
expected_output = datum["patch"]
context = [datum["text"]]
evals/eval_swe_bench.py
Outdated
cognee_preds, cognee_preds_dataset = await get_preds(dataset, with_cognee=True)
# nocognee_preds = await get_preds(dataset, with_cognee=False)
test_specs = list(map(make_test_spec, test_data))
results = await evaluate(test_specs, cognee_preds)
Remove unused variable results
The variable results is assigned but never used. This may lead to confusion and unnecessary memory usage. If the returned value from evaluate is not needed, consider removing the assignment.
Apply this diff to fix the issue:
- results = await evaluate(test_specs, cognee_preds)
+ await evaluate(test_specs, cognee_preds)
evals/eval_swe_bench.py
Outdated
result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
print(result)

subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, capture_output=True, text=True)
🛠️ Refactor suggestion
Add check=True to capture subprocess errors
In the subprocess.run call for applying the patch, check=True is omitted. This means that if git apply fails, the error won't raise an exception. Adding check=True will ensure that any errors are caught and can be handled appropriately.
Apply this diff to fix the issue:
- subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, capture_output=True, text=True)
+ subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, check=True, capture_output=True, text=True)
evals/eval_swe_bench.py
Outdated
f"I need you to solve this issue by looking at the provided knowledge graph and by " | ||
+ f"generating a single patch file that I can apply directly to this repository " | ||
+ f"using git apply. Please respond with a single patch " | ||
+ f"file in the following format." |
🛠️ Refactor suggestion
Remove unnecessary f prefixes from static strings
The strings in lines 30-33 do not contain any placeholder expressions, so the f prefixes are unnecessary. Removing them will clean up the code without affecting functionality.
Apply this diff to fix the issue:
instructions = (
- f"I need you to solve this issue by looking at the provided knowledge graph and by "
- + f"generating a single patch file that I can apply directly to this repository "
- + f"using git apply. Please respond with a single patch "
- + f"file in the following format."
+ "I need you to solve this issue by looking at the provided knowledge graph and by "
+ "generating a single patch file that I can apply directly to this repository "
+ "using git apply. Please respond with a single patch "
+ "file in the following format."
)
evals/deepeval_on_swe_bench.py
Outdated
async def cognify_search_graph(content:str, context:str):
    from cognee.api.v1.search import search, SearchType
    params = {'query': 'Donald Trump'}

    results = await search(SearchType.INSIGHTS, params)
    print("results", results)
    return results
Function cognify_search_graph does not use its parameters content and context
The function cognify_search_graph accepts content and context as parameters but uses a hardcoded query 'Donald Trump' instead of the content parameter.
Apply this diff to use the content parameter:
Apply this diff to use the content
parameter:
async def cognify_search_graph(content: str, context: str):
from cognee.api.v1.search import search, SearchType
- params = {'query': 'Donald Trump'}
+ params = {'query': content}
results = await search(SearchType.INSIGHTS, params)
print("results", results)
return results
evals/deepeval_on_swe_bench.py
Outdated
def get_answer(content: str,context, model: Type[BaseModel]= AnswerModel):

    try:
        return (get_answer_base(
            content,
            context,
            model
        ))
    except Exception as error:
        logger.error("Error extracting cognitive layers from content: %s", error, exc_info = True)
        raise error
Type mismatch for context parameter passed to get_answer_base
The context parameter is passed as a list in some cases, but get_answer_base expects it to be a string. This may lead to unexpected behavior when converting the list to a string.
Consider standardizing the context parameter's type across functions. If context is intended to be a list of strings, you can join the list into a single string before passing it to get_answer_base.
Apply this diff to modify get_answer:
def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):
try:
+ if isinstance(context, list):
+ context_str = ' '.join(context)
+ else:
+ context_str = str(context)
return get_answer_base(
content,
- context,
+ context_str,
model
)
except Exception as error:
logger.error("Error extracting cognitive layers from content: %s", error, exc_info=True)
raise error
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
evals/deepeval_on_swe_bench.py (4)
15-20: Improve documentation of the DeepEval example
The commented example uses hardcoded paths and lacks explanation. Consider adding more context about its purpose and how to adapt it for different environments.

-# DeepEval dataset for reference
-# synthesizer = Synthesizer()
-# synthesizer.generate_goldens_from_docs(
-#     document_paths=['/app/.data/short_stories/soldiers_home.pdf'],
-#     include_expected_output=True
-# )
+# Example: Generating a DeepEval dataset from documents
+# Usage:
+# synthesizer = Synthesizer()
+# synthesizer.generate_goldens_from_docs(
+#     document_paths=['path/to/your/docs/*.pdf'],  # Adjust path as needed
+#     include_expected_output=True
+# )
57-57: Make system prompt configurable
The system prompt is hardcoded. Consider making it configurable through environment variables or configuration files.

+SYSTEM_PROMPT_TEMPLATE = os.getenv('SYSTEM_PROMPT_TEMPLATE', "THIS IS YOUR CONTEXT:{context}")
+
-    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
+    system_prompt = SYSTEM_PROMPT_TEMPLATE.format(context=str(context))
85-85: Remove unused variable assignment
The graph variable is assigned but never used.

-    graph = await cognify("initial_test")
+    await cognify("initial_test")
157-174: Improve organization of main execution block
The main block mixes async and sync code, and contains commented-out code. Consider organizing it better by:

- Separating async and sync operations into different functions
- Removing or documenting commented code
- Adding proper logging for the evaluation results

 if __name__ == "__main__":
     import asyncio
+    import logging
+
+    logging.basicConfig(level=logging.INFO)
+
+    async def run_search():
+        try:
+            await cognify_search_graph("show_all_processes", "context")
+        except Exception as e:
+            logger.error("Search failed: %s", e, exc_info=True)
+            raise
+
+    def run_evaluation():
+        try:
+            test_dataset = convert_swe_to_deepeval_testcases(
+                load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
+            )
+            metric = HallucinationMetric()
+            eval_result = test_dataset.evaluate([metric])
+            logger.info("Evaluation results: %s", eval_result)
+        except Exception as e:
+            logger.error("Evaluation failed: %s", e, exc_info=True)
+            raise
+
+    # Run async search
+    asyncio.run(run_search())
+
+    # Run sync evaluation
+    run_evaluation()

evals/eval_swe_bench.py (2)
29-34: Simplify string concatenation by removing unnecessary '+' operators
The concatenation of strings using '+' within parentheses is unnecessary. In Python, adjacent string literals within parentheses are automatically concatenated. Removing the '+' operators enhances readability and follows best practices. Apply this diff to simplify the string construction:

 instructions = (
-    "I need you to solve this issue by looking at the provided knowledge graph and by "
-    + "generating a single patch file that I can apply directly to this repository "
-    + "using git apply. Please respond with a single patch "
-    + "file in the following format."
+    "I need you to solve this issue by looking at the provided knowledge graph and by "
+    "generating a single patch file that I can apply directly to this repository "
+    "using git apply. Please respond with a single patch "
+    "file in the following format."
 )
92-92: Move import statement to the top of the file
The import statement from datasets import Dataset is inside a conditional block. For clarity and to follow PEP 8 guidelines, it's better to place all import statements at the top of the file unless there's a specific reason to defer the import. Apply this diff to move the import statement:

+from datasets import Dataset

 if filepath.exists():
-    from datasets import Dataset
     dataset = Dataset.load_from_disk(filepath)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
evals/deepeval_on_swe_bench.py (1 hunks)
evals/eval_swe_bench.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py
85-85: Local variable graph is assigned to but never used. Remove assignment to unused variable graph (F841)
🔇 Additional comments (3)
evals/deepeval_on_swe_bench.py (2)
62-73: Fix type mismatch in context parameter handling
The existing issue with type mismatch in the context parameter is still present. Please refer to the previous review comment for the fix.
103-109: Fix unused parameters in cognify_search_graph
The function still uses a hardcoded query instead of the content parameter. Please refer to the previous review comment for the fix.
evals/eval_swe_bench.py (1)
102-108: 🛠️ Refactor suggestion
Add check=True to subprocess.run to ensure errors are caught
In the subprocess.run call for running the evaluation script, check=True is omitted. This means that if the subprocess fails, an error won't raise an exception, potentially leading to unnoticed failures. Adding check=True ensures that any errors are caught and can be handled appropriately.
Apply this diff to add check=True:
subprocess.run([
"python", "-m", "swebench.harness.run_evaluation",
"--dataset_name", 'princeton-nlp/SWE-bench',
"--split", "test",
"--predictions_path", "withcognee.json",
"--max_workers", "1",
"--instance_ids", test_data[0]["instance_id"],
"--run_id", "with_cognee"
+], check=True)
-])
async def main():
    # await run_cognify_base_rag()
    # await cognify_search_base_rag("show_all_processes", "context")
    await cognify_search_graph("show_all_processes", "context")
asyncio.run(main())
🛠️ Refactor suggestion
Add error handling to async main function
The async main function lacks error handling which could lead to unhandled exceptions.
async def main():
+ try:
# await run_cognify_base_rag()
# await cognify_search_base_rag("show_all_processes", "context")
await cognify_search_graph("show_all_processes", "context")
+ except Exception as e:
+ logger.error("Error in main: %s", e, exc_info=True)
+ raise
asyncio.run(main())
Committable suggestion skipped: line range outside the PR's diff.
swe_dataset = load_swebench_dataset(
    'princeton-nlp/SWE-bench_bm25_13K', split='test')
deepeval_dataset = convert_swe_to_deepeval(swe_dataset)
🛠️ Refactor suggestion
Remove duplicate dataset loading
This dataset loading is duplicated at lines 153-155. Consider consolidating the loading logic into a single function.
-swe_dataset = load_swebench_dataset(
- 'princeton-nlp/SWE-bench_bm25_13K', split='test')
-deepeval_dataset = convert_swe_to_deepeval(swe_dataset)
+def load_and_convert_dataset(split='test'):
+ swe_dataset = load_swebench_dataset(
+ 'princeton-nlp/SWE-bench_bm25_13K', split=split)
+ return convert_swe_to_deepeval(swe_dataset)
Committable suggestion skipped: line range outside the PR's diff.
def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
    deepeval_dataset = EvaluationDataset()
    for datum in swe_dataset[:4]:
🛠️ Refactor suggestion
Make dataset limit configurable
The hardcoded limit of 4 items should be configurable for different testing scenarios.
+DEFAULT_TEST_LIMIT = int(os.getenv('TEST_LIMIT', '4'))
+
- for datum in swe_dataset[:4]:
+ for datum in swe_dataset[:DEFAULT_TEST_LIMIT]:
Committable suggestion skipped: line range outside the PR's diff.
Force-pushed from 8d7171d to 98e3445
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (10)
evals/eval_utils.py (4)
60-60: Remove unnecessary f-string prefix in line 60
The f-string in line 60 does not contain any placeholders, so the f prefix is unnecessary and can be removed to improve code clarity. Apply the following change:

-patch = "\n".join([f"<patch>", instance["patch"], "</patch>"])
+patch = "\n".join(["<patch>", instance["patch"], "</patch>"])
17-23: Add exception handling in ingest_files to handle file read errors
Currently, if a file cannot be opened or read, the function will raise an exception and halt the entire process. Adding exception handling will allow the function to handle file read errors gracefully and continue processing other files.
Consider modifying the function as follows:

def ingest_files(filenames):
    files_dict = dict()
    for filename in filenames:
        try:
            with open(filename) as f:
                content = f.read()
            files_dict[filename] = content
        except Exception as e:
            logging.warning(f"Failed to read {filename}: {e}")
    return files_dict
27-47: Optimize working directory restoration in ingest_repos
The working directory is reset to orig_dir inside the finally block after processing each instance. This could result in unnecessary calls to os.chdir, impacting performance. Consider moving the restoration of the working directory outside the loop to reset it only once after all instances are processed. You might adjust the function as follows:

def ingest_repos(input_instances):
    orig_dir = os.getcwd()
    try:
        with TemporaryDirectory(
            dir="/scratch" if os.path.exists("/scratch") else "/tmp"
        ) as root_dir:
            for instance in tqdm(
                input_instances.values(),
                total=len(input_instances),
                desc="Downloading repos on specific commits",
            ):
                with AutoContextManager(
                    instance, root_dir
                ) as cm:
                    readmes = cm.get_readme_files()
                    instance["readmes"] = ingest_files(readmes)
                    instance["file_contents"] = ingest_directory_contents(
                        cm.repo_path
                    )
    finally:
        # Restore the original working directory after all instances are processed
        os.chdir(orig_dir)
    return input_instances
17-23: Add docstrings to functions for better documentation
The following functions lack docstrings, which makes it harder for other developers to understand their purpose and usage. Adding docstrings will enhance code readability and maintainability.

- ingest_files (lines 17-23)
- ingest_repos (lines 26-49)
- extract_fields (lines 52-62)
- create_dataset (lines 64-88)

Consider adding docstrings like this:

def ingest_files(filenames):
    """
    Reads the contents of the given list of filenames and returns a dictionary
    mapping filenames to their contents.

    Args:
        filenames (list): List of filenames to read.

    Returns:
        dict: Dictionary with filenames as keys and file contents as values.
    """
    # Function implementation remains the same

Also applies to: 26-49, 52-62, 64-88
evals/eval_swe_bench.py (1)
67-80: Add type hints and documentation
The function would benefit from type hints and a docstring explaining the return format.
Apply this diff:

-async def get_preds(dataset, with_cognee=True):
+async def get_preds(dataset: list, with_cognee: bool = True) -> list[dict]:
+    """Generate predictions using either cognee pipeline or direct LLM.
+
+    Args:
+        dataset: List of instances to process
+        with_cognee: If True, use cognee pipeline, otherwise use direct LLM
+
+    Returns:
+        List of predictions in the format:
+        [{"instance_id": str, "model_patch": str, "model_name_or_path": str}]
+    """
15-20
: Remove commented example codeThe commented DeepEval dataset example doesn't provide value and should be removed to improve code clarity.
28-28
: Clean up commented codeRemove commented code that's not providing value or add a comment explaining why it's kept for future reference.
Also applies to: 36-36
22-39
: Add return type hint to the functionAdd a return type hint to improve code clarity and enable better type checking:
-def convert_swe_to_deepeval(swe_dataset: List[Dict]): +def convert_swe_to_deepeval(swe_dataset: List[Dict]) -> EvaluationDataset:
57-57: Improve system prompt construction
The current system prompt construction is basic and could be improved:

- Add separation between context and content
- Consider using a template for better structure

-    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
+    system_prompt = f"""
+CONTEXT:
+{str(context)}
+
+Please provide your response based on the above context.
+"""
128-150: Optimize LLM calls with batching
Consider batching LLM calls for better performance when processing multiple test cases:

+async def batch_get_answers(inputs: List[str], contexts: List[str],
+                            batch_size: int = 5) -> List[str]:
+    results = []
+    for i in range(0, len(inputs), batch_size):
+        batch_inputs = inputs[i:i + batch_size]
+        batch_contexts = contexts[i:i + batch_size]
+        # Assuming get_answer can be modified to support batching
+        batch_results = await asyncio.gather(*[
+            get_answer(input, context)
+            for input, context in zip(batch_inputs, batch_contexts)
+        ])
+        results.extend(batch_results)
+    return results

-def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
+async def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
     deepeval_dataset = EvaluationDataset()
-    for datum in swe_dataset[:4]:
+    data = swe_dataset[:4]
+    inputs = [d["problem_statement"] for d in data]
+    contexts = [[d["text"]] for d in data]
+
+    responses = await batch_get_answers(inputs, contexts)
+
+    for datum, response in zip(data, responses):
         input = datum["problem_statement"]
         expected_output = datum["patch"]
         context = [datum["text"]]
         deepeval_dataset.add_test_case(
             LLMTestCase(
                 input=input,
-                actual_output=str(get_answer(
-                    input, context).model_dump()['response']),
+                actual_output=str(response.model_dump()['response']),
                 expected_output=expected_output,
                 context=context,
             )
         )
     return deepeval_dataset
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
evals/deepeval_on_swe_bench.py (1 hunks)
evals/eval_swe_bench.py (1 hunks)
evals/eval_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py
85-85: Local variable graph is assigned to but never used. Remove assignment to unused variable graph (F841)
evals/eval_utils.py
60-60: f-string without any placeholders. Remove extraneous f prefix (F541)
🔇 Additional comments (1)
evals/eval_swe_bench.py (1)
1-14: LGTM: Imports are well-organized and relevant.
The imports are properly organized following the standard convention (stdlib → third-party → local) and all seem necessary for the implementation.
if __name__ == "__main__":
    import asyncio
    asyncio.run(main(), debug=True)
Remove debug mode from asyncio.run.
Debug mode in production could expose sensitive information through detailed tracebacks.
Apply this diff:
if __name__ == "__main__":
import asyncio
- asyncio.run(main(), debug=True)
+ asyncio.run(main())
async def llm_on_preprocessed_data(dataset):
    problem_statement = dataset[0]['problem_statement']
    prompt = dataset[0]["text"]

    llm_client = get_llm_client()
    answer_prediction = llm_client.create_structured_output(
        text_input=problem_statement,
        system_prompt=prompt,
        response_model=str,
    )
    return answer_prediction
🛠️ Refactor suggestion
Add consistent error handling with cognee_and_llm function.
For consistency with the previous function, add input validation and error handling.
Apply this diff:
async def llm_on_preprocessed_data(dataset):
+ if not dataset:
+ raise ValueError("Dataset cannot be empty")
+
problem_statement = dataset[0]['problem_statement']
prompt = dataset[0]["text"]
- llm_client = get_llm_client()
- answer_prediction = llm_client.create_structured_output(
- text_input=problem_statement,
- system_prompt=prompt,
- response_model=str,
- )
- return answer_prediction
+ try:
+ llm_client = get_llm_client()
+ answer_prediction = llm_client.create_structured_output(
+ text_input=problem_statement,
+ system_prompt=prompt,
+ response_model=str,
+ )
+ return answer_prediction
+ except Exception as e:
+ raise RuntimeError(f"Failed to generate LLM prediction: {str(e)}")
async def cognee_and_llm(dataset, search_type=SearchType.CHUNKS):
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    dataset_name = "SWE_test_data"
    code_text = dataset[0]["text"]
    await cognee.add([code_text], dataset_name)
    await code_graph_pipeline([dataset_name])
    graph_engine = await get_graph_engine()
    with open(graph_engine.filename, "r") as f:
        graph_str = f.read()

    problem_statement = dataset[0]['problem_statement']
    instructions = (
        "I need you to solve this issue by looking at the provided knowledge graph and by "
        + "generating a single patch file that I can apply directly to this repository "
        + "using git apply. Please respond with a single patch "
        + "file in the following format."
    )

    prompt = "\n".join([
        instructions,
        "<patch>",
        PATCH_EXAMPLE,
        "</patch>",
        "This is the knowledge graph:",
        graph_str
    ])

    llm_client = get_llm_client()
    answer_prediction = llm_client.create_structured_output(
        text_input=problem_statement,
        system_prompt=prompt,
        response_model=str,
    )
    return answer_prediction
Add input validation and error handling.
Several potential issues need to be addressed:
- Missing validation for empty dataset
- Hard-coded dataset name could cause conflicts
- Graph file handling needs proper cleanup
- Missing error handling for pipeline failures
Apply this diff to improve robustness:
async def cognee_and_llm(dataset, search_type=SearchType.CHUNKS):
+ if not dataset:
+ raise ValueError("Dataset cannot be empty")
+
await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)
- dataset_name = "SWE_test_data"
+ dataset_name = f"SWE_test_data_{id(dataset)}" # Unique name per dataset
code_text = dataset[0]["text"]
- await cognee.add([code_text], dataset_name)
- await code_graph_pipeline([dataset_name])
- graph_engine = await get_graph_engine()
- with open(graph_engine.filename, "r") as f:
- graph_str = f.read()
+ try:
+ await cognee.add([code_text], dataset_name)
+ await code_graph_pipeline([dataset_name])
+ graph_engine = await get_graph_engine()
+ with open(graph_engine.filename, "r") as f:
+ graph_str = f.read()
+ except Exception as e:
+ raise RuntimeError(f"Failed to process code graph: {str(e)}")
async def main():
    swe_dataset = load_swebench_dataset(
        'princeton-nlp/SWE-bench', split='test')
    swe_dataset_preprocessed = load_swebench_dataset(
        'princeton-nlp/SWE-bench_bm25_13K', split='test')
    test_data = swe_dataset[:1]
    test_data_preprocessed = swe_dataset_preprocessed[:1]
    assert test_data[0]["instance_id"] == test_data_preprocessed[0]["instance_id"]
    filepath = Path("SWE-bench_testsample")
    if filepath.exists():
        from datasets import Dataset
        dataset = Dataset.load_from_disk(filepath)
    else:
        dataset = download_instances(test_data, filepath)

    cognee_preds = await get_preds(dataset, with_cognee=True)
    # nocognee_preds = await get_preds(dataset, with_cognee=False)
    with open("withcognee.json", "w") as file:
        json.dump(cognee_preds, file)

    subprocess.run(["python", "-m", "swebench.harness.run_evaluation",
                    "--dataset_name", 'princeton-nlp/SWE-bench',
                    "--split", "test",
                    "--predictions_path", "withcognee.json",
                    "--max_workers", "1",
                    "--instance_ids", test_data[0]["instance_id"],
                    "--run_id", "with_cognee"])
Address security and robustness concerns in main function.
Several issues need attention:
- Debug mode in asyncio.run could expose sensitive information
- Subprocess call needs error handling
- Hard-coded file paths should use Path objects consistently
- Commented out code should be removed if not needed
Apply this diff:
async def main():
+ output_dir = Path("output")
+ output_dir.mkdir(exist_ok=True)
+ predictions_file = output_dir / "withcognee.json"
+
swe_dataset = load_swebench_dataset(
'princeton-nlp/SWE-bench', split='test')
swe_dataset_preprocessed = load_swebench_dataset(
'princeton-nlp/SWE-bench_bm25_13K', split='test')
test_data = swe_dataset[:1]
test_data_preprocessed = swe_dataset_preprocessed[:1]
assert test_data[0]["instance_id"] == test_data_preprocessed[0]["instance_id"]
filepath = Path("SWE-bench_testsample")
if filepath.exists():
from datasets import Dataset
dataset = Dataset.load_from_disk(filepath)
else:
dataset = download_instances(test_data, filepath)
cognee_preds = await get_preds(dataset, with_cognee=True)
- # nocognee_preds = await get_preds(dataset, with_cognee=False)
- with open("withcognee.json", "w") as file:
+ with open(predictions_file, "w") as file:
json.dump(cognee_preds, file)
- subprocess.run(["python", "-m", "swebench.harness.run_evaluation",
- "--dataset_name", 'princeton-nlp/SWE-bench',
- "--split", "test",
- "--predictions_path", "withcognee.json",
- "--max_workers", "1",
- "--instance_ids", test_data[0]["instance_id"],
- "--run_id", "with_cognee"])
+ try:
+ result = subprocess.run(
+ ["python", "-m", "swebench.harness.run_evaluation",
+ "--dataset_name", 'princeton-nlp/SWE-bench',
+ "--split", "test",
+ "--predictions_path", str(predictions_file),
+ "--max_workers", "1",
+ "--instance_ids", test_data[0]["instance_id"],
+ "--run_id", "with_cognee"],
+ check=True,
+ capture_output=True,
+ text=True
+ )
+ print(result.stdout)
+ except subprocess.CalledProcessError as e:
+ print(f"Evaluation failed: {e.stderr}")
+ raise
async def run_cognify_base_rag():
    from cognee.api.v1.add import add
    from cognee.api.v1.prune import prune
    from cognee.api.v1.cognify.cognify import cognify

    await prune.prune_system()

    await add("data://test_datasets", "initial_test")

    graph = await cognify("initial_test")
    pass
Add error handling to async functions
The async functions lack proper error handling which could lead to unhandled exceptions.
Add try-catch blocks to handle potential errors:
async def run_cognify_base_rag():
+ try:
from cognee.api.v1.add import add
from cognee.api.v1.prune import prune
from cognee.api.v1.cognify.cognify import cognify
await prune.prune_system()
await add("data://test_datasets", "initial_test")
await cognify("initial_test")
+ except Exception as e:
+ logger.error("Error in run_cognify_base_rag: %s", e, exc_info=True)
+ raise
Apply similar error handling to cognify_search_base_rag and cognify_search_graph.
Also applies to: 89-101, 103-110
def convert_swe_to_deepeval(swe_dataset: List[Dict]):
    deepeval_dataset = EvaluationDataset()
    for datum in swe_dataset:
        input = datum["problem_statement"]
        expected_output = datum["patch"]
        context = [datum["text"]]
        # retrieval_context = datum.get(retrieval_context_key_name)

        deepeval_dataset.add_test_case(
            LLMTestCase(
                input=input,
                actual_output=None,
                expected_output=expected_output,
                context=context,
                # retrieval_context=retrieval_context,
            )
        )
    return deepeval_dataset
🛠️ Refactor suggestion
Consolidate duplicate conversion logic
There's significant code duplication between convert_swe_to_deepeval and convert_swe_to_deepeval_testcases. Consider consolidating the logic:
+def create_llm_test_case(datum: Dict, generate_output: bool = False) -> LLMTestCase:
+ input = datum["problem_statement"]
+ expected_output = datum["patch"]
+ context = [datum["text"]]
+
+ kwargs = {
+ "input": input,
+ "expected_output": expected_output,
+ "context": context,
+ }
+
+ if generate_output:
+ kwargs["actual_output"] = str(get_answer(
+ input, context).model_dump()['response'])
+ else:
+ kwargs["actual_output"] = None
+
+ return LLMTestCase(**kwargs)
def convert_swe_to_deepeval(swe_dataset: List[Dict]) -> EvaluationDataset:
deepeval_dataset = EvaluationDataset()
for datum in swe_dataset:
- input = datum["problem_statement"]
- expected_output = datum["patch"]
- context = [datum["text"]]
- deepeval_dataset.add_test_case(
- LLMTestCase(
- input=input,
- actual_output=None,
- expected_output=expected_output,
- context=context,
- )
- )
+ test_case = create_llm_test_case(datum, generate_output=False)
+ deepeval_dataset.add_test_case(test_case)
return deepeval_dataset
def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]) -> EvaluationDataset:
deepeval_dataset = EvaluationDataset()
for datum in swe_dataset[:4]:
- input = datum["problem_statement"]
- expected_output = datum["patch"]
- context = [datum["text"]]
- deepeval_dataset.add_test_case(
- LLMTestCase(
- input=input,
- actual_output=str(get_answer(
- input, context).model_dump()['response']),
- expected_output=expected_output,
- context=context,
- )
- )
+ test_case = create_llm_test_case(datum, generate_output=True)
+ deepeval_dataset.add_test_case(test_case)
return deepeval_dataset
Also applies to: 128-150
    graph = await cognify("initial_test")
    pass
🛠️ Refactor suggestion
Remove unused variable assignment and pass statement
The graph variable is assigned but never used, and the pass statement is unnecessary.
- graph = await cognify("initial_test")
- pass
+ await cognify("initial_test")
def get_answer_base(content: str, context: str, response_model: Type[BaseModel]):
    llm_client = get_llm_client()

    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)

    return llm_client.create_structured_output(content, system_prompt, response_model)
Add input validation for LLM interactions
The function passes raw input to the LLM without validation. Consider adding input validation:
def get_answer_base(content: str, context: str, response_model: Type[BaseModel]):
+ if not content or not context:
+ raise ValueError("Content and context must not be empty")
+
+ # Sanitize inputs to prevent prompt injection
+ content = content.strip()
+ context = context.strip()
+
llm_client = get_llm_client()
system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
return llm_client.create_structured_output(content, system_prompt, response_model)
swe_dataset = load_swebench_dataset(
    'princeton-nlp/SWE-bench_bm25_13K', split='test')
test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)
🛠️ Refactor suggestion
Consolidate dataset loading
The dataset is loaded multiple times throughout the file. Consider loading it once and reusing the instance:
+def load_dataset(split='test'):
+ return load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split=split)
if __name__ == "__main__":
import asyncio
async def main():
await cognify_search_graph("show_all_processes", "context")
asyncio.run(main())
- swe_dataset = load_swebench_dataset(
- 'princeton-nlp/SWE-bench_bm25_13K', split='test')
+ swe_dataset = load_dataset()
test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
evals/eval_utils.py (7)
24-26: Improve portability of temporary directory path
The hardcoded paths might not work on all systems. Consider using tempfile.gettempdir() for better portability.

 with TemporaryDirectory(
-    dir="/scratch" if os.path.exists("/scratch") else "/tmp"
+    dir=tempfile.gettempdir()
 ) as root_dir:
32-43: Add error logging for repository processing failures
The function silently continues on repository processing failures. Consider adding logging to track failures.

+import logging
+
+            try:
                 with AutoContextManager(
                     instance, root_dir
                 ) as cm:
                     readmes = cm.get_readme_files()
                     instance["readmes"] = ingest_files(readmes)
                     instance["file_contents"] = ingest_directory_contents(
                         cm.repo_path
                     )
+            except Exception as e:
+                logging.error(f"Failed to process repository for instance {instance.get('instance_id')}: {e}")
             finally:
                 os.chdir(orig_dir)
55-55: Remove commented code
Remove the commented line as it appears to be an unused alternative implementation.
- # text_inputs = code_text
56-56: Simplify string concatenation
The f-string is used without any placeholders and can be simplified.

-    patch = "\n".join([f"<patch>", instance["patch"], "</patch>"])
+    patch = "\n".join(["<patch>", instance["patch"], "</patch>"])
60-84: Add validation for required fields
Consider validating critical fields that are essential for dataset creation. This will help catch data issues early.

 def create_dataset(input_instances):
     columns = [
         "instance_id",
         "text",
         "repo",
         "base_commit",
         "problem_statement",
         # ... other columns
     ]
+    required_fields = {"instance_id", "repo", "base_commit"}  # Add critical fields

     data_table = {key: list() for key in columns}
     for instance in input_instances.values():
+        # Validate required fields
+        missing_fields = required_fields - set(instance.keys())
+        if missing_fields:
+            raise ValueError(f"Missing required fields: {missing_fields}")
+
         datum = extract_fields(instance)
         for key in columns:
             data_table[key].append(datum[key] if key in datum else "")
87-98: Improve function documentation
The docstring should include return type information and document all parameters.

 def download_instances(
     input_data,
     path=Path("SWE-bench_testsample"),
     verbose=False,
 ):
-    """Downloads code from github.
+    """Downloads code from github and creates a dataset.

     Args:
-    - input_data: dictionary with unprocessed input instances.
-    - verbose: set ContextManager verbose to True
+        input_data: Dictionary with unprocessed input instances
+        path: Path object specifying where to save the dataset (default: Path("SWE-bench_testsample"))
+        verbose: Set ContextManager verbose to True (default: False)
+
+    Returns:
+        Dataset: The created dataset containing processed instances
+
+    Raises:
+        ValueError: If required fields are missing in input instances
     """
101-102: Add progress tracking for dataset saving
Consider wrapping the dataset saving operation with a progress indicator for better user feedback.

-    dataset = create_dataset(input_instances_with_text)
-    dataset.save_to_disk(path)
+    print("Creating dataset...")
+    dataset = create_dataset(input_instances_with_text)
+    print(f"Saving dataset to {path}...")
+    dataset.save_to_disk(path)
+    print("Dataset saved successfully!")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
evals/eval_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/eval_utils.py
56-56: f-string without any placeholders. Remove extraneous f prefix (F541)
def ingest_files(filenames):
    files_dict = dict()
    for filename in filenames:
        with open(filename) as f:
            content = f.read()
        files_dict[filename] = content
    return files_dict
🛠️ Refactor suggestion
Add error handling for file operations
The function should handle potential file-related errors and specify encoding for better reliability.
def ingest_files(filenames):
files_dict = dict()
for filename in filenames:
- with open(filename) as f:
- content = f.read()
- files_dict[filename] = content
+ try:
+ with open(filename, encoding='utf-8') as f:
+ content = f.read()
+ files_dict[filename] = content
+ except (IOError, UnicodeDecodeError) as e:
+ print(f"Error reading file {filename}: {e}")
+ files_dict[filename] = ""
return files_dict
GitGuardian id | GitGuardian status | Secret | Commit | Filename |
---|---|---|---|---|---
9573981 | Triggered | Generic Password | d07ebbe | notebooks/cognee_llama_index.ipynb | View secret
13947935 | Triggered | Generic High Entropy Secret | d07ebbe | cognee/shared/utils.py | View secret
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials (a minimal example of reading a secret from the environment is sketched after this list)
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
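A minimal sketch of reading a credential from the environment instead of hardcoding it; COGNEE_API_KEY is a placeholder variable name, not one used by this repository.

```python
import os

api_key = os.environ.get("COGNEE_API_KEY")  # placeholder name for the rotated secret
if api_key is None:
    raise RuntimeError("Set COGNEE_API_KEY in the environment instead of hardcoding the secret.")
```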
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Summary by CodeRabbit
New Features
Bug Fixes