Feat/cog-544 eval on swe bench #232

Merged 12 commits on Nov 23, 2024

Conversation

@alekszievr (Contributor) commented Nov 15, 2024

  • Run cognee code graph pipeline on SWE-bench dataset + LLM call after
  • LLM call on pre-processed SWE-bench dataset
  • Convert output to SWE-bench format

Summary by CodeRabbit

  • New Features

    • Introduced functionality to convert Software Engineering datasets for evaluation using the DeepEval framework.
    • Added asynchronous functions for evaluating datasets with a language model and knowledge graph, enabling the generation of code patches based on predictions.
    • Implemented functions for processing code repositories and creating structured datasets from them.
  • Bug Fixes

    • Enhanced error handling for file operations and subprocess commands during dataset evaluation.

coderabbitai bot commented Nov 15, 2024

Walkthrough

The changes introduce three new files, evals/deepeval_on_swe_bench.py, evals/eval_swe_bench.py, and evals/eval_utils.py, which implement functionality for evaluating the SWE-bench dataset using the DeepEval framework and a language model (LLM) combined with knowledge graph techniques. The first file converts the SWE dataset into a format suitable for DeepEval and includes functions for data processing and interaction with an LLM. The second file provides asynchronous functions for evaluating the dataset, generating predictions, and applying code patches based on those predictions. The third file contains helper functions for downloading repositories at specific commits and building structured datasets from them.

Changes

  • evals/deepeval_on_swe_bench.py: Added functions for converting the SWE dataset to DeepEval format, generating answers with an LLM, and managing dataset interactions. Functions include convert_swe_to_deepeval, get_answer_base, get_answer, run_cognify_base_rag, cognify_search_base_rag, cognify_search_graph, convert_goldens_to_test_cases, and convert_swe_to_deepeval_testcases.
  • evals/eval_swe_bench.py: Introduced asynchronous functions for evaluating the dataset with an LLM and knowledge graph techniques. Functions include cognee_and_llm, llm_on_preprocessed_data, get_preds, and main.
  • evals/eval_utils.py: New file with functions for processing code repositories and creating datasets, including ingest_files, ingest_repos, extract_fields, create_dataset, and download_instances.

Possibly related issues

  • [COG-622] Integrate SWE-bench into Cognee: The changes in this PR facilitate the integration of SWE-bench into the Cognee system, aligning with the objective of running evaluations on SWE-bench.

🐰 "In the meadow, we hop and play,
With SWE-bench now here to stay.
DeepEval's magic, LLM's grace,
Together we’ll win this coding race!
Hopping high, our tests will soar,
In the world of code, we’ll explore!
🌼"



@coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (9)
evals/eval_swe_bench.py (3)

21-21: Consider parameterizing the code length limit

The line code_text = dataset[0]["text"][:100000] truncates the text to 100,000 characters. If this limit is important, consider defining it as a constant or parameter to improve code readability and maintainability. Additionally, adding a comment to explain the rationale behind the specific limit would be helpful.
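
A minimal sketch of this suggestion (the constant name, the helper, and the default value are illustrative assumptions, not code from the PR):

# Maximum number of characters of repository text passed on to the LLM;
# keeping the value in one named constant documents the truncation rationale.
MAX_CODE_TEXT_CHARS = 100_000  # assumed default, mirroring the current hardcoded limit

def truncate_code_text(text: str, limit: int = MAX_CODE_TEXT_CHARS) -> str:
    # Keep only the first `limit` characters so the prompt stays within the model's context budget.
    return text[:limit]

code_text = truncate_code_text(dataset[0]["text"])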


97-97: Handle potential None value for model_patch

When writing pred["model_patch"] to the patch_file, if model_patch is None, it defaults to an empty string. Ensure that this behavior is intentional. If not, consider adding a check or handling the None case appropriately.
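
One possible way to make the behaviour explicit instead of silently defaulting to an empty string (the skip-and-log handling shown here is an assumption for illustration, not what the PR currently does):

model_patch = pred.get("model_patch")
if model_patch is None:
    # Either skip the instance (as shown) or deliberately write an empty patch.
    print(f"No model_patch for instance {pred['instance_id']}; skipping patch file")
else:
    with open(patch_file, "w") as f:
        f.write(model_patch)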


101-101: Review the use of shell=True in subprocess.run

Using shell=True can pose security risks if the command string contains untrusted input. Ensure that all commands in test_spec.repo_script_list are from trusted sources. If possible, consider using a list of arguments without shell=True to enhance security.
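
A sketch of the argument-list form, assuming the commands in test_spec.repo_script_list do not rely on shell features such as pipes, redirection, or &&:

import shlex
import subprocess

for command in test_spec.repo_script_list:
    # shlex.split turns the command string into an argument list,
    # so no shell is spawned and the string is never interpreted by one.
    subprocess.run(shlex.split(command), check=True, capture_output=True, text=True)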

evals/deepeval_on_swe_bench.py (6)

41-41: Move module-level imports to the top of the file

Module-level import statements should be placed at the top of the file, after any module comments and docstrings, to comply with Python's PEP 8 style guidelines.

Apply this diff to move the import statements:

+from cognee.infrastructure.llm.get_llm_client import get_llm_client
+import logging
+import os
+from cognee.base_config import get_base_config
+from cognee.infrastructure.databases.vector import get_vector_engine

 from typing import List, Dict, Type
 from swebench.harness.utils import load_swebench_dataset
 from deepeval.dataset import EvaluationDataset
 from deepeval.test_case import LLMTestCase
 from pydantic import BaseModel

-from deepeval.synthesizer import Synthesizer

-# ... other code ...

-# Import statements not at the top
-from cognee.infrastructure.llm.get_llm_client import get_llm_client
-import logging
-import os
-from cognee.base_config import get_base_config
-from cognee.infrastructure.databases.vector import get_vector_engine

Also applies to: 46-46, 85-87

🧰 Tools
🪛 Ruff

41-41: Module level import not at top of file

(E402)


81-81: Remove or utilize the unused variable graph

The variable graph is assigned but not used, which may indicate redundant code or missing functionality.

Apply this diff to remove the unused variable assignment:

 async def run_cognify_base_rag():
     from cognee.api.v1.add import add
     from cognee.api.v1.prune import prune
     from cognee.api.v1.cognify.cognify import cognify

     await prune.prune_system()

     await add("data://test_datasets", "initial_test")

-    graph = await cognify("initial_test")
+    await cognify("initial_test")
     pass
🧰 Tools
🪛 Ruff

81-81: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)


148-149: Remove duplicate code for loading datasets

The code loads swe_dataset and test_dataset at lines 148-149 and again at lines 162-163. This seems redundant; consider removing the duplication to avoid unnecessary processing.

Apply this diff to remove the duplicate code outside the main block:

-swe_dataset = load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
-test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

 # Other code...

 if __name__ == "__main__":

     import asyncio

     async def main():
         # await run_cognify_base_rag()
         # await cognify_search_base_rag("show_all_processes", "context")
         await cognify_search_graph("show_all_processes", "context")
     asyncio.run(main())
     # run_cognify_base_rag_and_search()
     # # Data preprocessing before setting the dataset test cases
     swe_dataset = load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
     test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

Also applies to: 162-163


60-67: Add type hint for context parameter in get_answer function

The context parameter in get_answer lacks a type annotation. Adding a type hint improves code readability and type checking.

Apply this diff to add the type hint:

 def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):

     try:
         return get_answer_base(
             content,
             context,
             model
         )

117-117: Avoid redundant type conversion when accessing model fields

The str() conversion around get_answer(...).model_dump()['response'] may be unnecessary if 'response' is already a string. Apply type conversions only where they are actually needed, so that unexpected types are not silently masked.

If 'response' is already a string, you can remove the str() conversion:

 actual_output= str(get_answer(case.input, case.context).model_dump()['response']),

# Change to:

 actual_output = get_answer(case.input, case.context).model_dump()['response'],

Also applies to: 138-138


68-70: Avoid catching broad exceptions

Catching the base Exception class can make debugging difficult and may mask other issues. Consider catching more specific exceptions.

If you expect specific exceptions, it's better to catch those explicitly. For example:

 def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):

     try:
         return get_answer_base(
             content,
             context,
             model
         )
-    except Exception as error:
+    except SomeSpecificException as error:
         logger.error("Error extracting cognitive layers from content: %s", error, exc_info=True)
         raise error

Replace SomeSpecificException with the actual exception types you expect.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8c0c0b9 and ed08cdb.

📒 Files selected for processing (2)
  • evals/deepeval_on_swe_bench.py (1 hunks)
  • evals/eval_swe_bench.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py

41-41: Module level import not at top of file

(E402)


46-46: Module level import not at top of file

(E402)


81-81: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)


85-85: Module level import not at top of file

(E402)


86-86: Module level import not at top of file

(E402)


87-87: Module level import not at top of file

(E402)

evals/eval_swe_bench.py

30-30: f-string without any placeholders

Remove extraneous f prefix

(F541)


31-31: f-string without any placeholders

Remove extraneous f prefix

(F541)


32-32: f-string without any placeholders

Remove extraneous f prefix

(F541)


33-33: f-string without any placeholders

Remove extraneous f prefix

(F541)


124-124: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)

🔇 Additional comments (2)
evals/eval_swe_bench.py (1)

61-61: Address the TODO: Verify the system_prompt parameter

There's a TODO comment indicating uncertainty about using prompt as the system_prompt. Please verify if setting system_prompt = prompt is correct, or if adjustments are needed to align with the LLM client's expectations.

Would you like assistance in verifying the correct usage of the system_prompt parameter?

evals/deepeval_on_swe_bench.py (1)

127-127: Confirm if limiting to the first 4 items in swe_dataset is intentional

The code processes only the first 4 items of swe_dataset. Please verify if this limitation is intentional or if the entire dataset should be processed.

If you intend to process the entire dataset, apply this diff:

 def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
     deepeval_dataset = EvaluationDataset()
-    for datum in swe_dataset[:4]:
+    for datum in swe_dataset:
         input = datum["problem_statement"]
         expected_output = datum["patch"]
         context = [datum["text"]]

cognee_preds, cognee_preds_dataset = await get_preds(dataset, with_cognee=True)
# nocognee_preds = await get_preds(dataset, with_cognee=False)
test_specs = list(map(make_test_spec, test_data))
results = await evaluate(test_specs, cognee_preds)

⚠️ Potential issue

Remove unused variable results

The variable results is assigned but never used. This may lead to confusion and unnecessary memory usage. If the returned value from evaluate is not needed, consider removing the assignment.

Apply this diff to fix the issue:

-        results = await evaluate(test_specs, cognee_preds)
+        await evaluate(test_specs, cognee_preds)
🧰 Tools
🪛 Ruff

124-124: Local variable results is assigned to but never used

Remove assignment to unused variable results

(F841)

result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
print(result)

subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, capture_output=True, text=True)

🛠️ Refactor suggestion

Add check=True to capture subprocess errors

In the subprocess.run call for applying the patch, check=True is omitted. This means that if git apply fails, the error won't raise an exception. Adding check=True will ensure that any errors are caught and can be handled appropriately.

Apply this diff to fix the issue:

-        subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, capture_output=True, text=True)
+        subprocess.run("git apply --allow-empty -v logs/patch.diff", shell=True, check=True, capture_output=True, text=True)

Comment on lines 30 to 33
f"I need you to solve this issue by looking at the provided knowledge graph and by "
+ f"generating a single patch file that I can apply directly to this repository "
+ f"using git apply. Please respond with a single patch "
+ f"file in the following format."

🛠️ Refactor suggestion

Remove unnecessary f prefixes from static strings

The strings in lines 30-33 do not contain any placeholder expressions, so the f prefixes are unnecessary. Removing them will clean up the code without affecting functionality.

Apply this diff to fix the issue:

         instructions = (
-            f"I need you to solve this issue by looking at the provided knowledge graph and by "
-            + f"generating a single patch file that I can apply directly to this repository "
-            + f"using git apply. Please respond with a single patch "
-            + f"file in the following format."  
+            "I need you to solve this issue by looking at the provided knowledge graph and by "
+            "generating a single patch file that I can apply directly to this repository "
+            "using git apply. Please respond with a single patch "
+            "file in the following format."  
         )
🧰 Tools
🪛 Ruff

30-30: f-string without any placeholders

Remove extraneous f prefix

(F541)


31-31: f-string without any placeholders

Remove extraneous f prefix

(F541)


32-32: f-string without any placeholders

Remove extraneous f prefix

(F541)


33-33: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines 102 to 108
async def cognify_search_graph(content:str, context:str):
from cognee.api.v1.search import search, SearchType
params = {'query': 'Donald Trump'}

results = await search(SearchType.INSIGHTS, params)
print("results", results)
return results

⚠️ Potential issue

Function cognify_search_graph does not use its parameters content and context

The function cognify_search_graph accepts content and context as parameters but uses a hardcoded query 'Donald Trump' instead of the content parameter.

Apply this diff to use the content parameter:

 async def cognify_search_graph(content: str, context: str):
     from cognee.api.v1.search import search, SearchType
-    params = {'query': 'Donald Trump'}
+    params = {'query': content}

     results = await search(SearchType.INSIGHTS, params)
     print("results", results)
     return results

Comment on lines 60 to 70
def get_answer(content: str,context, model: Type[BaseModel]= AnswerModel):

try:
return (get_answer_base(
content,
context,
model
))
except Exception as error:
logger.error("Error extracting cognitive layers from content: %s", error, exc_info = True)
raise error

⚠️ Potential issue

Type mismatch for context parameter passed to get_answer_base

The context parameter is passed as a list in some cases, but get_answer_base expects it to be a string. This may lead to unexpected behavior when converting the list to a string.

Consider standardizing the context parameter's type across functions. If context is intended to be a list of strings, you can join the list into a single string before passing it to get_answer_base.

Apply this diff to modify get_answer:

 def get_answer(content: str, context: str, model: Type[BaseModel] = AnswerModel):

     try:
+        if isinstance(context, list):
+            context_str = ' '.join(context)
+        else:
+            context_str = str(context)
         return get_answer_base(
             content,
-            context,
+            context_str,
             model
         )
     except Exception as error:
         logger.error("Error extracting cognitive layers from content: %s", error, exc_info=True)
         raise error

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (6)
evals/deepeval_on_swe_bench.py (4)

15-20: Improve documentation of the DeepEval example

The commented example uses hardcoded paths and lacks explanation. Consider adding more context about its purpose and how to adapt it for different environments.

-# DeepEval dataset for reference
-# synthesizer = Synthesizer()
-# synthesizer.generate_goldens_from_docs(
-#     document_paths=['/app/.data/short_stories/soldiers_home.pdf'],
-#     include_expected_output=True
-# )
+# Example: Generating a DeepEval dataset from documents
+# Usage:
+#   synthesizer = Synthesizer()
+#   synthesizer.generate_goldens_from_docs(
+#     document_paths=['path/to/your/docs/*.pdf'],  # Adjust path as needed
+#     include_expected_output=True
+#   )

57-57: Make system prompt configurable

The system prompt is hardcoded. Consider making it configurable through environment variables or configuration files.

+SYSTEM_PROMPT_TEMPLATE = os.getenv('SYSTEM_PROMPT_TEMPLATE', "THIS IS YOUR CONTEXT:{context}")
+
-    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
+    system_prompt = SYSTEM_PROMPT_TEMPLATE.format(context=str(context))

85-85: Remove unused variable assignment

The graph variable is assigned but never used.

-    graph = await cognify("initial_test")
+    await cognify("initial_test")
🧰 Tools
🪛 Ruff

85-85: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)


157-174: Improve organization of main execution block

The main block mixes async and sync code, and contains commented-out code. Consider organizing it better by:

  1. Separating async and sync operations into different functions
  2. Removing or documenting commented code
  3. Adding proper logging for the evaluation results
 if __name__ == "__main__":
     import asyncio
+    import logging
+
+    logging.basicConfig(level=logging.INFO)
+
+    async def run_search():
+        try:
+            await cognify_search_graph("show_all_processes", "context")
+        except Exception as e:
+            logger.error("Search failed: %s", e, exc_info=True)
+            raise
+
+    def run_evaluation():
+        try:
+            test_dataset = convert_swe_to_deepeval_testcases(
+                load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
+            )
+            metric = HallucinationMetric()
+            eval_result = test_dataset.evaluate([metric])
+            logger.info("Evaluation results: %s", eval_result)
+        except Exception as e:
+            logger.error("Evaluation failed: %s", e, exc_info=True)
+            raise
+
+    # Run async search
+    asyncio.run(run_search())
+
+    # Run sync evaluation
+    run_evaluation()
evals/eval_swe_bench.py (2)

29-34: Simplify string concatenation by removing unnecessary '+' operators

The concatenation of strings using '+' within parentheses is unnecessary. In Python, adjacent string literals within parentheses are automatically concatenated. Removing the '+' operators enhances readability and follows best practices.

Apply this diff to simplify the string construction:

 instructions = (
-    "I need you to solve this issue by looking at the provided knowledge graph and by "
-    + "generating a single patch file that I can apply directly to this repository "
-    + "using git apply. Please respond with a single patch "
-    + "file in the following format."
+    "I need you to solve this issue by looking at the provided knowledge graph and by "
+    "generating a single patch file that I can apply directly to this repository "
+    "using git apply. Please respond with a single patch "
+    "file in the following format."
 )

92-92: Move import statement to the top of the file

The import statement from datasets import Dataset is inside a conditional block. For clarity and to follow PEP 8 guidelines, it's better to place all import statements at the top of the file unless there's a specific reason to defer the import.

Apply this diff to move the import statement:

+from datasets import Dataset

 if filepath.exists():
-    from datasets import Dataset
     dataset = Dataset.load_from_disk(filepath)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ed08cdb and 8d7171d.

📒 Files selected for processing (2)
  • evals/deepeval_on_swe_bench.py (1 hunks)
  • evals/eval_swe_bench.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py

85-85: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)

🔇 Additional comments (3)
evals/deepeval_on_swe_bench.py (2)

62-73: Fix type mismatch in context parameter handling

The existing issue with type mismatch in the context parameter is still present. Please refer to the previous review comment for the fix.


103-109: Fix unused parameters in cognify_search_graph

The function still uses a hardcoded query instead of the content parameter. Please refer to the previous review comment for the fix.

evals/eval_swe_bench.py (1)

102-108: 🛠️ Refactor suggestion

Add check=True to subprocess.run to ensure errors are caught

In the subprocess.run call for running the evaluation script, check=True is omitted. This means that if the subprocess fails, an error won't raise an exception, potentially leading to unnoticed failures. Adding check=True ensures that any errors are caught and can be handled appropriately.

Apply this diff to add check=True:

 subprocess.run([
     "python", "-m", "swebench.harness.run_evaluation",
     "--dataset_name", 'princeton-nlp/SWE-bench',
     "--split", "test",
     "--predictions_path",  "withcognee.json",
     "--max_workers", "1",
     "--instance_ids", test_data[0]["instance_id"],
     "--run_id", "with_cognee"
+], check=True)
-])

Comment on lines +161 to +165
async def main():
# await run_cognify_base_rag()
# await cognify_search_base_rag("show_all_processes", "context")
await cognify_search_graph("show_all_processes", "context")
asyncio.run(main())

🛠️ Refactor suggestion

Add error handling to async main function

The async main function lacks error handling which could lead to unhandled exceptions.

     async def main():
+        try:
             # await run_cognify_base_rag()
             # await cognify_search_base_rag("show_all_processes", "context")
             await cognify_search_graph("show_all_processes", "context")
+        except Exception as e:
+            logger.error("Error in main: %s", e, exc_info=True)
+            raise
     asyncio.run(main())

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +42 to +44
swe_dataset = load_swebench_dataset(
'princeton-nlp/SWE-bench_bm25_13K', split='test')
deepeval_dataset = convert_swe_to_deepeval(swe_dataset)

🛠️ Refactor suggestion

Remove duplicate dataset loading

This dataset loading is duplicated at lines 153-155. Consider consolidating the loading logic into a single function.

-swe_dataset = load_swebench_dataset(
-    'princeton-nlp/SWE-bench_bm25_13K', split='test')
-deepeval_dataset = convert_swe_to_deepeval(swe_dataset)

+def load_and_convert_dataset(split='test'):
+    swe_dataset = load_swebench_dataset(
+        'princeton-nlp/SWE-bench_bm25_13K', split=split)
+    return convert_swe_to_deepeval(swe_dataset)

Committable suggestion skipped: line range outside the PR's diff.


def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
deepeval_dataset = EvaluationDataset()
for datum in swe_dataset[:4]:

🛠️ Refactor suggestion

Make dataset limit configurable

The hardcoded limit of 4 items should be configurable for different testing scenarios.

+DEFAULT_TEST_LIMIT = int(os.getenv('TEST_LIMIT', '4'))
+
-    for datum in swe_dataset[:4]:
+    for datum in swe_dataset[:DEFAULT_TEST_LIMIT]:

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (10)
evals/eval_utils.py (4)

60-60: Remove unnecessary f-string prefix in line 60

The f-string in line 60 does not contain any placeholders, so the f prefix is unnecessary and can be removed to improve code clarity.

Apply the following change:

-patch = "\n".join([f"<patch>", instance["patch"], "</patch>"])
+patch = "\n".join(["<patch>", instance["patch"], "</patch>"])
🧰 Tools
🪛 Ruff

60-60: f-string without any placeholders

Remove extraneous f prefix

(F541)


17-23: Add exception handling in ingest_files to handle file read errors

Currently, if a file cannot be opened or read, the function will raise an exception and halt the entire process. Adding exception handling will allow the function to handle file read errors gracefully and continue processing other files.

Consider modifying the function as follows:

def ingest_files(filenames):
    files_dict = dict()
    for filename in filenames:
        try:
            with open(filename) as f:
                content = f.read()
            files_dict[filename] = content
        except Exception as e:
            logging.warning(f"Failed to read {filename}: {e}")
    return files_dict

27-47: Optimize working directory restoration in ingest_repos

The working directory is reset to orig_dir inside the finally block after processing each instance. This could result in unnecessary calls to os.chdir, impacting performance. Consider moving the restoration of the working directory outside the loop to reset it only once after all instances are processed.

You might adjust the function as follows:

def ingest_repos(input_instances):
    orig_dir = os.getcwd()
    try:
        with TemporaryDirectory(
            dir="/scratch" if os.path.exists("/scratch") else "/tmp"
        ) as root_dir:
            for instance in tqdm(
                input_instances.values(),
                total=len(input_instances),
                desc="Downloading repos on specific commits",
            ):
                with AutoContextManager(
                    instance, root_dir
                ) as cm:
                    readmes = cm.get_readme_files()
                    instance["readmes"] = ingest_files(readmes)
                    instance["file_contents"] = ingest_directory_contents(
                        cm.repo_path
                    )
    finally:
        # Restore the original working directory after all instances are processed
        os.chdir(orig_dir)

    return input_instances

17-23: Add docstrings to functions for better documentation

The following functions lack docstrings, which makes it harder for other developers to understand their purpose and usage. Adding docstrings will enhance code readability and maintainability.

  • ingest_files (lines 17-23)
  • ingest_repos (lines 26-49)
  • extract_fields (lines 52-62)
  • create_dataset (lines 64-88)

Consider adding docstrings like this:

def ingest_files(filenames):
    """
    Reads the contents of the given list of filenames and returns a dictionary
    mapping filenames to their contents.

    Args:
        filenames (list): List of filenames to read.

    Returns:
        dict: Dictionary with filenames as keys and file contents as values.
    """
    # Function implementation remains the same

Also applies to: 26-49, 52-62, 64-88

evals/eval_swe_bench.py (1)

67-80: Add type hints and documentation.

The function would benefit from type hints and a docstring explaining the return format.

Apply this diff:

-async def get_preds(dataset, with_cognee=True):
+async def get_preds(dataset: list, with_cognee: bool = True) -> list[dict]:
+    """Generate predictions using either cognee pipeline or direct LLM.
+
+    Args:
+        dataset: List of instances to process
+        with_cognee: If True, use cognee pipeline, otherwise use direct LLM
+
+    Returns:
+        List of predictions in the format:
+        [{"instance_id": str, "model_patch": str, "model_name_or_path": str}]
+    """
evals/deepeval_on_swe_bench.py (5)

15-20: Remove commented example code

The commented DeepEval dataset example doesn't provide value and should be removed to improve code clarity.


28-28: Clean up commented code

Remove commented code that's not providing value or add a comment explaining why it's kept for future reference.

Also applies to: 36-36


22-39: Add return type hint to the function

Add a return type hint to improve code clarity and enable better type checking:

-def convert_swe_to_deepeval(swe_dataset: List[Dict]):
+def convert_swe_to_deepeval(swe_dataset: List[Dict]) -> EvaluationDataset:

57-57: Improve system prompt construction

The current system prompt construction is basic and could be improved:

  1. Add separation between context and content
  2. Consider using a template for better structure
-    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
+    system_prompt = f"""
+CONTEXT:
+{str(context)}
+
+Please provide your response based on the above context.
+"""

128-150: Optimize LLM calls with batching

Consider batching LLM calls for better performance when processing multiple test cases:

+async def batch_get_answers(inputs: List[str], contexts: List[str], 
+                          batch_size: int = 5) -> List[str]:
+    results = []
+    for i in range(0, len(inputs), batch_size):
+        batch_inputs = inputs[i:i + batch_size]
+        batch_contexts = contexts[i:i + batch_size]
+        # Assuming get_answer can be modified to support batching
+        batch_results = await asyncio.gather(*[
+            get_answer(input, context) 
+            for input, context in zip(batch_inputs, batch_contexts)
+        ])
+        results.extend(batch_results)
+    return results

-def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
+async def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]):
     deepeval_dataset = EvaluationDataset()
-    for datum in swe_dataset[:4]:
+    data = swe_dataset[:4]
+    inputs = [d["problem_statement"] for d in data]
+    contexts = [[d["text"]] for d in data]
+    
+    responses = await batch_get_answers(inputs, contexts)
+    
+    for datum, response in zip(data, responses):
         input = datum["problem_statement"]
         expected_output = datum["patch"]
         context = [datum["text"]]
         
         deepeval_dataset.add_test_case(
             LLMTestCase(
                 input=input,
-                actual_output=str(get_answer(
-                    input, context).model_dump()['response']),
+                actual_output=str(response.model_dump()['response']),
                 expected_output=expected_output,
                 context=context,
             )
         )
     return deepeval_dataset
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8d7171d and 98e3445.

📒 Files selected for processing (3)
  • evals/deepeval_on_swe_bench.py (1 hunks)
  • evals/eval_swe_bench.py (1 hunks)
  • evals/eval_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/deepeval_on_swe_bench.py

85-85: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)

evals/eval_utils.py

60-60: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
evals/eval_swe_bench.py (1)

1-14: LGTM: Imports are well-organized and relevant.

The imports are properly organized following the standard convention (stdlib → third-party → local) and all seem necessary for the implementation.

Comment on lines +110 to +112
if __name__ == "__main__":
import asyncio
asyncio.run(main(), debug=True)

⚠️ Potential issue

Remove debug mode from asyncio.run.

Debug mode in production could expose sensitive information through detailed tracebacks.

Apply this diff:

 if __name__ == "__main__":
     import asyncio
-    asyncio.run(main(), debug=True)
+    asyncio.run(main())

Comment on lines +54 to +65
async def llm_on_preprocessed_data(dataset):
problem_statement = dataset[0]['problem_statement']
prompt = dataset[0]["text"]

llm_client = get_llm_client()
answer_prediction = llm_client.create_structured_output(
text_input=problem_statement,
system_prompt=prompt,
response_model=str,
)
return answer_prediction


🛠️ Refactor suggestion

Add consistent error handling with cognee_and_llm function.

For consistency with the previous function, add input validation and error handling.

Apply this diff:

 async def llm_on_preprocessed_data(dataset):
+    if not dataset:
+        raise ValueError("Dataset cannot be empty")
+
     problem_statement = dataset[0]['problem_statement']
     prompt = dataset[0]["text"]
 
-    llm_client = get_llm_client()
-    answer_prediction = llm_client.create_structured_output(
-        text_input=problem_statement,
-        system_prompt=prompt,
-        response_model=str,
-    )
-    return answer_prediction
+    try:
+        llm_client = get_llm_client()
+        answer_prediction = llm_client.create_structured_output(
+            text_input=problem_statement,
+            system_prompt=prompt,
+            response_model=str,
+        )
+        return answer_prediction
+    except Exception as e:
+        raise RuntimeError(f"Failed to generate LLM prediction: {str(e)}")

Comment on lines +16 to +52
async def cognee_and_llm(dataset, search_type=SearchType.CHUNKS):
await cognee.prune.prune_data()
await cognee.prune.prune_system(metadata=True)

dataset_name = "SWE_test_data"
code_text = dataset[0]["text"]
await cognee.add([code_text], dataset_name)
await code_graph_pipeline([dataset_name])
graph_engine = await get_graph_engine()
with open(graph_engine.filename, "r") as f:
graph_str = f.read()

problem_statement = dataset[0]['problem_statement']
instructions = (
"I need you to solve this issue by looking at the provided knowledge graph and by "
+ "generating a single patch file that I can apply directly to this repository "
+ "using git apply. Please respond with a single patch "
+ "file in the following format."
)

prompt = "\n".join([
instructions,
"<patch>",
PATCH_EXAMPLE,
"</patch>",
"This is the knowledge graph:",
graph_str
])

llm_client = get_llm_client()
answer_prediction = llm_client.create_structured_output(
text_input=problem_statement,
system_prompt=prompt,
response_model=str,
)
return answer_prediction


⚠️ Potential issue

Add input validation and error handling.

Several potential issues need to be addressed:

  1. Missing validation for empty dataset
  2. Hard-coded dataset name could cause conflicts
  3. Graph file handling needs proper cleanup
  4. Missing error handling for pipeline failures

Apply this diff to improve robustness:

 async def cognee_and_llm(dataset, search_type=SearchType.CHUNKS):
+    if not dataset:
+        raise ValueError("Dataset cannot be empty")
+
     await cognee.prune.prune_data()
     await cognee.prune.prune_system(metadata=True)
 
-    dataset_name = "SWE_test_data"
+    dataset_name = f"SWE_test_data_{id(dataset)}"  # Unique name per dataset
     code_text = dataset[0]["text"]
-    await cognee.add([code_text], dataset_name)
-    await code_graph_pipeline([dataset_name])
-    graph_engine = await get_graph_engine()
-    with open(graph_engine.filename, "r") as f:
-        graph_str = f.read()
+    try:
+        await cognee.add([code_text], dataset_name)
+        await code_graph_pipeline([dataset_name])
+        graph_engine = await get_graph_engine()
+        with open(graph_engine.filename, "r") as f:
+            graph_str = f.read()
+    except Exception as e:
+        raise RuntimeError(f"Failed to process code graph: {str(e)}")

Comment on lines +82 to +109
async def main():
swe_dataset = load_swebench_dataset(
'princeton-nlp/SWE-bench', split='test')
swe_dataset_preprocessed = load_swebench_dataset(
'princeton-nlp/SWE-bench_bm25_13K', split='test')
test_data = swe_dataset[:1]
test_data_preprocessed = swe_dataset_preprocessed[:1]
assert test_data[0]["instance_id"] == test_data_preprocessed[0]["instance_id"]
filepath = Path("SWE-bench_testsample")
if filepath.exists():
from datasets import Dataset
dataset = Dataset.load_from_disk(filepath)
else:
dataset = download_instances(test_data, filepath)

cognee_preds = await get_preds(dataset, with_cognee=True)
# nocognee_preds = await get_preds(dataset, with_cognee=False)
with open("withcognee.json", "w") as file:
json.dump(cognee_preds, file)

subprocess.run(["python", "-m", "swebench.harness.run_evaluation",
"--dataset_name", 'princeton-nlp/SWE-bench',
"--split", "test",
"--predictions_path", "withcognee.json",
"--max_workers", "1",
"--instance_ids", test_data[0]["instance_id"],
"--run_id", "with_cognee"])


⚠️ Potential issue

Address security and robustness concerns in main function.

Several issues need attention:

  1. Debug mode in asyncio.run could expose sensitive information
  2. Subprocess call needs error handling
  3. Hard-coded file paths should use Path objects consistently
  4. Commented out code should be removed if not needed

Apply this diff:

 async def main():
+    output_dir = Path("output")
+    output_dir.mkdir(exist_ok=True)
+    predictions_file = output_dir / "withcognee.json"
+
     swe_dataset = load_swebench_dataset(
         'princeton-nlp/SWE-bench', split='test')
     swe_dataset_preprocessed = load_swebench_dataset(
         'princeton-nlp/SWE-bench_bm25_13K', split='test')
     test_data = swe_dataset[:1]
     test_data_preprocessed = swe_dataset_preprocessed[:1]
     assert test_data[0]["instance_id"] == test_data_preprocessed[0]["instance_id"]
     filepath = Path("SWE-bench_testsample")
     if filepath.exists():
         from datasets import Dataset
         dataset = Dataset.load_from_disk(filepath)
     else:
         dataset = download_instances(test_data, filepath)
 
     cognee_preds = await get_preds(dataset, with_cognee=True)
-    # nocognee_preds = await get_preds(dataset, with_cognee=False)
-    with open("withcognee.json", "w") as file:
+    with open(predictions_file, "w") as file:
         json.dump(cognee_preds, file)
 
-    subprocess.run(["python", "-m", "swebench.harness.run_evaluation",
-                    "--dataset_name", 'princeton-nlp/SWE-bench',
-                    "--split", "test",
-                    "--predictions_path",  "withcognee.json",
-                    "--max_workers", "1",
-                    "--instance_ids", test_data[0]["instance_id"],
-                    "--run_id", "with_cognee"])
+    try:
+        result = subprocess.run(
+            ["python", "-m", "swebench.harness.run_evaluation",
+             "--dataset_name", 'princeton-nlp/SWE-bench',
+             "--split", "test",
+             "--predictions_path", str(predictions_file),
+             "--max_workers", "1",
+             "--instance_ids", test_data[0]["instance_id"],
+             "--run_id", "with_cognee"],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        print(result.stdout)
+    except subprocess.CalledProcessError as e:
+        print(f"Evaluation failed: {e.stderr}")
+        raise

Comment on lines +76 to +87
async def run_cognify_base_rag():
from cognee.api.v1.add import add
from cognee.api.v1.prune import prune
from cognee.api.v1.cognify.cognify import cognify

await prune.prune_system()

await add("data://test_datasets", "initial_test")

graph = await cognify("initial_test")
pass


⚠️ Potential issue

Add error handling to async functions

The async functions lack proper error handling which could lead to unhandled exceptions.

Add try-catch blocks to handle potential errors:

 async def run_cognify_base_rag():
+    try:
         from cognee.api.v1.add import add
         from cognee.api.v1.prune import prune
         from cognee.api.v1.cognify.cognify import cognify
         await prune.prune_system()
         await add("data://test_datasets", "initial_test")
         await cognify("initial_test")
+    except Exception as e:
+        logger.error("Error in run_cognify_base_rag: %s", e, exc_info=True)
+        raise

Apply similar error handling to cognify_search_base_rag and cognify_search_graph.

Also applies to: 89-101, 103-110

🧰 Tools
🪛 Ruff

85-85: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)

Comment on lines +22 to +39
def convert_swe_to_deepeval(swe_dataset: List[Dict]):
    deepeval_dataset = EvaluationDataset()
    for datum in swe_dataset:
        input = datum["problem_statement"]
        expected_output = datum["patch"]
        context = [datum["text"]]
        # retrieval_context = datum.get(retrieval_context_key_name)

        deepeval_dataset.add_test_case(
            LLMTestCase(
                input=input,
                actual_output=None,
                expected_output=expected_output,
                context=context,
                # retrieval_context=retrieval_context,
            )
        )
    return deepeval_dataset

🛠️ Refactor suggestion

Consolidate duplicate conversion logic

There's significant code duplication between convert_swe_to_deepeval and convert_swe_to_deepeval_testcases. Consider consolidating the logic:

+def create_llm_test_case(datum: Dict, generate_output: bool = False) -> LLMTestCase:
+    input = datum["problem_statement"]
+    expected_output = datum["patch"]
+    context = [datum["text"]]
+    
+    kwargs = {
+        "input": input,
+        "expected_output": expected_output,
+        "context": context,
+    }
+    
+    if generate_output:
+        kwargs["actual_output"] = str(get_answer(
+            input, context).model_dump()['response'])
+    else:
+        kwargs["actual_output"] = None
+        
+    return LLMTestCase(**kwargs)

 def convert_swe_to_deepeval(swe_dataset: List[Dict]) -> EvaluationDataset:
     deepeval_dataset = EvaluationDataset()
     for datum in swe_dataset:
-        input = datum["problem_statement"]
-        expected_output = datum["patch"]
-        context = [datum["text"]]
-        deepeval_dataset.add_test_case(
-            LLMTestCase(
-                input=input,
-                actual_output=None,
-                expected_output=expected_output,
-                context=context,
-            )
-        )
+        test_case = create_llm_test_case(datum, generate_output=False)
+        deepeval_dataset.add_test_case(test_case)
     return deepeval_dataset

 def convert_swe_to_deepeval_testcases(swe_dataset: List[Dict]) -> EvaluationDataset:
     deepeval_dataset = EvaluationDataset()
     for datum in swe_dataset[:4]:
-        input = datum["problem_statement"]
-        expected_output = datum["patch"]
-        context = [datum["text"]]
-        deepeval_dataset.add_test_case(
-            LLMTestCase(
-                input=input,
-                actual_output=str(get_answer(
-                    input, context).model_dump()['response']),
-                expected_output=expected_output,
-                context=context,
-            )
-        )
+        test_case = create_llm_test_case(datum, generate_output=True)
+        deepeval_dataset.add_test_case(test_case)
     return deepeval_dataset

Also applies to: 128-150

Comment on lines +85 to +86
    graph = await cognify("initial_test")
    pass

🛠️ Refactor suggestion

Remove unused variable assignment and pass statement

The graph variable is assigned but never used, and the pass statement is unnecessary.

-    graph = await cognify("initial_test")
-    pass
+    await cognify("initial_test")
🧰 Tools
🪛 Ruff

85-85: Local variable graph is assigned to but never used

Remove assignment to unused variable graph

(F841)

Comment on lines +54 to +59
def get_answer_base(content: str, context: str, response_model: Type[BaseModel]):
    llm_client = get_llm_client()

    system_prompt = "THIS IS YOUR CONTEXT:" + str(context)

    return llm_client.create_structured_output(content, system_prompt, response_model)

⚠️ Potential issue

Add input validation for LLM interactions

The function passes raw input to the LLM without validation. Consider adding input validation:

 def get_answer_base(content: str, context: str, response_model: Type[BaseModel]):
+    if not content or not context:
+        raise ValueError("Content and context must not be empty")
+    
+    # Normalize whitespace (note: strip() alone does not guard against prompt injection)
+    content = content.strip()
+    context = context.strip()
+    
     llm_client = get_llm_client()
     system_prompt = "THIS IS YOUR CONTEXT:" + str(context)
     return llm_client.create_structured_output(content, system_prompt, response_model)

Comment on lines +168 to +170
    swe_dataset = load_swebench_dataset(
        'princeton-nlp/SWE-bench_bm25_13K', split='test')
    test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

🛠️ Refactor suggestion

Consolidate dataset loading

The dataset is loaded multiple times throughout the file. Consider loading it once and reusing the instance:

+def load_dataset(split='test'):
+    return load_swebench_dataset('princeton-nlp/SWE-bench_bm25_13K', split=split)

 if __name__ == "__main__":
     import asyncio
     async def main():
         await cognify_search_graph("show_all_processes", "context")
     asyncio.run(main())
 
-    swe_dataset = load_swebench_dataset(
-        'princeton-nlp/SWE-bench_bm25_13K', split='test')
+    swe_dataset = load_dataset()
     test_dataset = convert_swe_to_deepeval_testcases(swe_dataset)

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)
evals/eval_utils.py (7)

24-26: Improve portability of temporary directory path

The hardcoded paths might not work on all systems. Consider using tempfile.gettempdir() for better portability (note that this needs import tempfile if only TemporaryDirectory is currently imported).

     with TemporaryDirectory(
-        dir="/scratch" if os.path.exists("/scratch") else "/tmp"
+        dir=tempfile.gettempdir()
     ) as root_dir:

32-43: Add error logging for repository processing failures

The function silently continues on repository processing failures. Consider adding logging to track failures.

+import logging
+
     try:
         with AutoContextManager(
             instance, root_dir
         ) as cm:
             readmes = cm.get_readme_files()
             instance["readmes"] = ingest_files(readmes)
             instance["file_contents"] = ingest_directory_contents(
                 cm.repo_path
             )
+    except Exception as e:
+        logging.error(f"Failed to process repository for instance {instance.get('instance_id')}: {e}")
     finally:
         os.chdir(orig_dir)

55-55: Remove commented code

Remove the commented line, as it appears to be an unused alternative implementation.

-    # text_inputs = code_text

56-56: Simplify string concatenation

The f-string is used without any placeholders and can be simplified.

-    patch = "\n".join([f"<patch>", instance["patch"], "</patch>"])
+    patch = "\n".join(["<patch>", instance["patch"], "</patch>"])
🧰 Tools
🪛 Ruff

56-56: f-string without any placeholders

Remove extraneous f prefix

(F541)


60-84: Add validation for required fields

Consider validating the fields that are essential for dataset creation; this helps catch data issues early.

 def create_dataset(input_instances):
     columns = [
         "instance_id",
         "text",
         "repo",
         "base_commit",
         "problem_statement",
         # ... other columns
     ]
+    required_fields = {"instance_id", "repo", "base_commit"}  # Add critical fields
 
     data_table = {key: list() for key in columns}
     for instance in input_instances.values():
+        # Validate required fields
+        missing_fields = required_fields - set(instance.keys())
+        if missing_fields:
+            raise ValueError(f"Missing required fields: {missing_fields}")
+
         datum = extract_fields(instance)
         for key in columns:
             data_table[key].append(datum[key] if key in datum else "")

87-98: Improve function documentation

The docstring should include return type information and document all parameters.

 def download_instances(
     input_data,
     path=Path("SWE-bench_testsample"),
     verbose=False,
 ):
-    """Downloads code from github.
+    """Downloads code from github and creates a dataset.
 
     Args:
-    - input_data: dictionary with unprocessed input instances.
-    - verbose: set ContextManager verbose to True
+        input_data: Dictionary with unprocessed input instances
+        path: Path object specifying where to save the dataset (default: Path("SWE-bench_testsample"))
+        verbose: Set ContextManager verbose to True (default: False)
+
+    Returns:
+        Dataset: The created dataset containing processed instances
+
+    Raises:
+        ValueError: If required fields are missing in input instances
     """

101-102: Add progress tracking for dataset saving

Consider wrapping the dataset saving operation with a progress indicator for better user feedback.

-    dataset = create_dataset(input_instances_with_text)
-    dataset.save_to_disk(path)
+    print("Creating dataset...")
+    dataset = create_dataset(input_instances_with_text)
+    print(f"Saving dataset to {path}...")
+    dataset.save_to_disk(path)
+    print("Dataset saved successfully!")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 98e3445 and d986e7c.

📒 Files selected for processing (1)
  • evals/eval_utils.py (1 hunks)
🧰 Additional context used
🪛 Ruff
evals/eval_utils.py

56-56: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +13 to +19
def ingest_files(filenames):
    files_dict = dict()
    for filename in filenames:
        with open(filename) as f:
            content = f.read()
        files_dict[filename] = content
    return files_dict

🛠️ Refactor suggestion

Add error handling for file operations

The function should handle potential file-related errors and specify encoding for better reliability.

 def ingest_files(filenames):
     files_dict = dict()
     for filename in filenames:
-        with open(filename) as f:
-            content = f.read()
-        files_dict[filename] = content
+        try:
+            with open(filename, encoding='utf-8') as f:
+                content = f.read()
+            files_dict[filename] = content
+        except (IOError, UnicodeDecodeError) as e:
+            print(f"Error reading file {filename}: {e}")
+            files_dict[filename] = ""
     return files_dict

gitguardian bot commented Nov 18, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id | Status | Secret | Commit | Filename
9573981 | Triggered | Generic Password | d07ebbe | notebooks/cognee_llama_index.ipynb
13947935 | Triggered | Generic High Entropy Secret | d07ebbe | cognee/shared/utils.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, for example in environment variables or a secret manager (see the sketch after this list).
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
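
As a concrete illustration of point 2, a common pattern is to read such values from the environment at runtime instead of hardcoding them. This is a generic sketch, not code from this repository, and the variable name LLM_API_KEY is hypothetical:

import os

# Hypothetical: fetch the secret from an environment variable set by the shell,
# a .env file loaded by your tooling, or a secret manager integration.
api_key = os.environ.get("LLM_API_KEY")
if api_key is None:
    raise RuntimeError("LLM_API_KEY is not set; refusing to start without credentials.")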

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@borisarzentar borisarzentar merged commit 9973aff into main Nov 23, 2024
26 of 28 checks passed
@borisarzentar borisarzentar deleted the feat/COG-544-eval-on-swe-bench branch November 23, 2024 13:07