Cog 417 chunking unit tests #205

Merged: 42 commits, Nov 18, 2024

Conversation

@0xideas (Contributor) commented Nov 13, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced text chunking functionalities across various methods, including chunking by paragraph, sentence, and word.
    • Introduced optional parameters for improved control over chunking behavior, such as maximum length for sentences.
    • Added methods for creating transcripts from audio and image documents, improving overall structure and readability.
  • Bug Fixes

    • Improved error handling and data validation in chunking methods to ensure accuracy in output.
  • Tests

    • Added multiple unit tests to validate the functionality of new chunking methods and ensure robustness across different input scenarios.
    • Introduced integration tests for audio and image document processing, ensuring accurate transcription results.
    • Updated existing tests to align with changes in chunking logic and expected outputs.

coderabbitai bot commented Nov 13, 2024

Walkthrough

The pull request introduces significant modifications to the text chunking functionality across several modules. The TextChunker class now uses a local variable for paragraph chunks, enhancing its state management. The chunk_by_paragraph, chunk_by_sentence, and chunk_by_word functions have been restructured with updated signatures and improved logic for chunk handling. New test cases have been added to validate these changes, ensuring that the chunking processes yield correct results. Additionally, minor adjustments were made to existing tests and data structures to align with the new functionality.
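
To make the described contract concrete, here is a minimal usage sketch assembled from the summaries below; the import path, sample text, and length limit are illustrative assumptions, not code from the PR.

from cognee.tasks.chunks import chunk_by_paragraph, chunk_by_sentence, chunk_by_word

text = "First paragraph.\nSecond one, a little longer."  # assumed sample input

# chunk_by_word yields (chunk, chunk_type) pairs that join back to the input.
assert "".join(chunk for chunk, _ in chunk_by_word(text)) == text

# chunk_by_sentence yields (paragraph_id, chunk_index, sentence, word_count, end_type)
# and now accepts an optional word-count cap, checked here the same way the tests do.
for _, _, sentence, word_count, end_type in chunk_by_sentence(text, maximum_length=8):
    assert len(list(chunk_by_word(sentence))) <= 8

# chunk_by_paragraph yields dicts carrying text, word_count, and cut_type.
chunks = list(chunk_by_paragraph(text, 12, batch_paragraphs=True))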

Changes

  • cognee/modules/chunking/TextChunker.py: Modified TextChunker to use a local paragraph_chunks variable in the read method instead of a class-level variable, altering chunk handling logic and yielding DocumentChunk instances based on the local variable.
  • cognee/tasks/chunks/chunk_by_paragraph.py: Updated function signature with type hints. Replaced paragraph with current_chunk, enhanced word count handling, and refined chunk yielding logic based on paragraph_length.
  • cognee/tasks/chunks/chunk_by_sentence.py: Added optional maximum_length parameter to function signature. Updated yielding logic to include checks for maximum_length, simplifying sentence handling.
  • cognee/tasks/chunks/chunk_by_word.py: Refactored to use global regex for chunking, added is_real_paragraph_end helper function, and improved whitespace handling. Updated control flow for yielding chunks based on new conditions.
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py: Introduced three new tests to validate chunk_by_paragraph functionality, ensuring reconstruction matches original text and checking chunk lengths.
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py: Updated GROUND_TRUTH and INPUT_TEXT dictionaries to reflect changes in expected output and input for chunking tests.
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py: Added two new tests for chunk_by_sentence, validating reconstruction and chunk length against maximum limits.
  • cognee/tests/unit/processing/chunks/chunk_by_word_test.py: Created new test file with two parameterized tests for chunk_by_word, verifying reconstructed text and ensuring chunks do not contain spaces.
  • cognee/tests/unit/processing/chunks/test_input.py: Introduced INPUT_TEXTS dictionary for diverse predefined input texts for testing purposes.
  • cognee/infrastructure/engine/models/DataPoint.py: Updated get_embeddable_data method to improve handling of attributes based on type, applying strip() to string attributes.
  • cognee/tests/integration/documents/PdfDocument_test.py: Updated GROUND_TRUTH values for expected test results and renamed pdf_doc variable to document for clarity.
  • cognee/tests/integration/documents/TextDocument_test.py: Introduced new test file for TextDocument class, validating functionality against predefined ground truth data with parameterized tests.
  • cognee/modules/data/processing/document_types/AudioDocument.py: Added create_transcript method to improve transcript creation process, enhancing readability and reusability.
  • cognee/modules/data/processing/document_types/ImageDocument.py: Added transcribe_image method to improve image transcription functionality, enhancing code readability and maintainability.
  • cognee/tests/integration/documents/AudioDocument_test.py: Introduced new test function to validate AudioDocument functionality, using mocking for the create_transcript method and comparing against ground truth data.
  • cognee/tests/integration/documents/ImageDocument_test.py: Introduced new test function to validate ImageDocument functionality, using mocking for the transcribe_image method and comparing against ground truth data.
  • cognee/tests/unit/modules/graph/cognee_graph_elements_test.py: Reordered imports and added blank lines for readability; no functional changes to tests.
  • cognee/tests/unit/modules/graph/cognee_graph_test.py: Reordered imports and added blank lines for readability; no functional changes to tests.

Possibly related PRs

  • Cog 170 pgvector adapter #158: The changes in the main PR regarding the TextChunker class's handling of paragraph chunks may relate to the modifications in the chunk_by_paragraph function, which also deals with chunking text into paragraphs and has undergone significant restructuring.
  • COG-485 - Fix/integration test warnings #176: The integration test warnings addressed in this PR may relate to the overall functionality of the TextChunker and its interaction with chunking processes, ensuring that the changes made in the main PR do not introduce new issues.
  • Cog 577 add unit test task #197: The addition of unit tests in this PR could include tests for the TextChunker class or related chunking functionalities, ensuring that the changes made in the main PR are covered by tests.

Suggested reviewers

  • borisarzentar
  • dexters1

Poem

🐰 In the land of text where chunks do play,
A rabbit hops and finds a way.
With paragraphs neat and sentences bright,
We chunk and yield from morning till night.
Each test a hop, each function a leap,
In the garden of code, our secrets we keep! 🌼



@0xideas self-assigned this Nov 13, 2024
@coderabbitai bot left a comment

Actionable comments posted: 12

🧹 Outside diff range and nitpick comments (18)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (2)

1-1: Remove unused import.

The numpy module is imported but not used in this test file.

-import numpy as np

8-16: Consider adding edge cases to test parameterization.

While the current test cases cover diverse input types, consider adding edge cases such as:

  • Empty text
  • Very long text
  • Text with special characters
  • Text with multiple consecutive newlines
 @pytest.mark.parametrize(
     "input_text",
     [
         INPUT_TEXTS["english_text"],
         INPUT_TEXTS["english_lists"],
         INPUT_TEXTS["python_code"],
         INPUT_TEXTS["chinese_text"],
+        "",  # Empty text
+        INPUT_TEXTS["english_text"] * 1000,  # Very long text
+        INPUT_TEXTS["special_characters"],  # Text with special characters
+        INPUT_TEXTS["multiple_newlines"],  # Text with multiple consecutive newlines
     ],
 )
cognee/tasks/chunks/chunk_by_sentence.py (1)

Add tests for the maximum_length parameter in chunk_by_sentence.

  • Ensure maximum_length is properly tested in cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py
🔗 Analysis chain

Line range hint 8-30: Verify text reconstruction and usage compatibility.

The changes to chunking logic need verification to ensure:

  1. Text can be properly reconstructed from chunks (isomorphism)
  2. All callers are compatible with the new parameter

Let's check the usage and test coverage:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify function usage and test coverage

# Check for test cases testing isomorphism
echo "Checking for isomorphism tests..."
rg -l "test.*isomorphism" "cognee/tests"

# Check for all usages of chunk_by_sentence
echo "Checking all usages of chunk_by_sentence..."
rg "chunk_by_sentence\(" --type python

# Check if there are any direct tests for maximum_length
echo "Checking for maximum_length tests..."
rg "maximum_length.*chunk_by_sentence" "cognee/tests"

Length of output: 608


Script:

#!/bin/bash
# Description: Verify all usages of chunk_by_sentence with corrected file type

echo "Checking all usages of chunk_by_sentence..."
rg "chunk_by_sentence\(" --type py

Length of output: 840

cognee/tests/unit/documents/PdfDocument_test.py (1)

7-8: Document the expected chunk format

Consider adding comments to explain:

  • What constitutes a "word" in the word count
  • How the text length is calculated (e.g., handling of whitespace)
  • What determines a "sentence_end" cut type

Add documentation above the GROUND_TRUTH list:

+# Ground truth values for PDF chunks:
+# - word_count: Number of words in chunk (words are split by whitespace)
+# - len_text: Total length of normalized text including whitespace
+# - cut_type: How the chunk was terminated (e.g., "sentence_end" for complete sentences)
 GROUND_TRUTH = [
     {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
     {"word_count": 953, "len_text": 6363, "cut_type": "sentence_end"},
 ]
cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)

8-22: Add type hints to improve code clarity.

The test implementation is solid with good parametrization and helpful error messages. Consider adding type hints to make the function signature more explicit.

-def test_chunk_by_word_isomorphism(input_text):
+def test_chunk_by_word_isomorphism(input_text: str) -> None:

11-15: Consider adding edge case test inputs.

While the current test inputs cover various scenarios, consider adding edge cases such as:

  • Empty string
  • String with only whitespace
  • Very long words
  • Special characters and emojis

Would you like me to help create these additional test cases?

cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)

51-54: Enhance test coverage with parameterized tests

Consider improving the test suite with:

  1. Parameterized tests for different chunk sizes
  2. Edge cases (empty text, single word paragraphs)
  3. Error conditions (chunk size <= 0)

Example implementation using pytest.mark.parametrize:

import pytest

@pytest.mark.parametrize("chunk_size,expected_chunks", [
    (5, [...]),  # smaller chunks
    (12, GROUND_TRUTH["whole_text"]),  # current case
    (20, [...]),  # larger chunks
])
def test_chunking_with_different_sizes(chunk_size, expected_chunks):
    chunks = list(chunk_by_paragraph(INPUT_TEXT["whole_text"], chunk_size, batch_paragraphs=True))
    assert chunks == expected_chunks

def test_chunking_edge_cases():
    # Empty text
    assert list(chunk_by_paragraph("", 12)) == []
    
    # Single word paragraph
    result = list(chunk_by_paragraph("word\n", 12))
    assert len(result) == 1
    assert result[0]["word_count"] == 1

def test_chunking_errors():
    with pytest.raises(ValueError):
        list(chunk_by_paragraph(INPUT_TEXT["whole_text"], 0))
cognee/tasks/chunks/chunk_by_word.py (4)

3-4: Consider enhancing regex patterns for Unicode support

The regex patterns could be expanded to handle Unicode punctuation marks and line endings for better international text support.

-SENTENCE_ENDINGS = r"[.;!?…]"
-PARAGRAPH_ENDINGS = r"[\n\r]"
+SENTENCE_ENDINGS = r"[.;!?…\u2026\u3002\uFF01\uFF1F]"  # Include CJK punctuation
+PARAGRAPH_ENDINGS = r"[\n\r\u2028\u2029]"  # Include Unicode line/paragraph separators

6-11: Enhance type hints and docstring

The function would benefit from complete type annotations and example usage in the docstring.

-def chunk_by_word(data: str):
+from typing import Iterator, Tuple
+
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+
+    Args:
+        data: Input text to be chunked
+
+    Returns:
+        Iterator yielding tuples of (chunk, chunk_type) where chunk_type is one of:
+        "word", "sentence_end", or "paragraph_end"
+
+    Example:
+        >>> list(chunk_by_word("Hello world."))
+        [("Hello ", "word"), ("world.", "sentence_end")]
     """

16-22: Optimize leading whitespace handling

The current implementation performs multiple regex matches in a loop. This could be optimized using a single regex match.

-    while i < len(data) and (re.match(PARAGRAPH_ENDINGS, data[i]) or data[i] == " "):
-        current_chunk += data[i]
-        i += 1
+    leading_space_match = re.match(rf"(?:{PARAGRAPH_ENDINGS}| )+", data)
+    if leading_space_match:
+        current_chunk = leading_space_match.group(0)
+        i = len(current_chunk)

63-67: Simplify whitespace handling after sentence endings

The current implementation uses a while loop to handle whitespace. This could be simplified using regex.

-            next_i = i + 1
-            while next_i < len(data) and data[next_i] == " ":
-                current_chunk += data[next_i]
-                next_i += 1
+            next_i = i + 1
+            space_match = re.match(r" +", data[next_i:])
+            if space_match:
+                current_chunk += space_match.group(0)
+                next_i += len(space_match.group(0))
cognee/modules/chunking/TextChunker.py (3)

19-27: Add input validation and error handling.

While moving to method-level state management is good, consider adding:

  1. Input validation for content_text
  2. Error handling for chunk_by_paragraph generator
  3. Validation for chunk_data dictionary structure
 def read(self):
     paragraph_chunks = []
     for content_text in self.get_text():
+        if not content_text:
+            continue
+        try:
             for chunk_data in chunk_by_paragraph(
                 content_text,
                 self.max_chunk_size,
                 batch_paragraphs = True,
             ):
+                if not all(k in chunk_data for k in ["word_count", "text", "chunk_id", "cut_type"]):
+                    raise ValueError("Invalid chunk data structure")
                 if self.chunk_size + chunk_data["word_count"] <= self.max_chunk_size:
                     paragraph_chunks.append(chunk_data)
                     self.chunk_size += chunk_data["word_count"]
+        except Exception as e:
+            print(f"Error processing content: {str(e)}")
+            continue

Line range hint 30-54: Critical improvements needed in error handling and array access.

Several issues need attention:

  1. The bare except is too broad and masks errors
  2. Unsafe array access could raise IndexError
  3. Complex nested conditions could be simplified

Apply these improvements:

-                    if len(paragraph_chunks) == 0:
+                    if not paragraph_chunks:
                         yield DocumentChunk(
                             id = chunk_data["chunk_id"],
                             text = chunk_data["text"],
                             word_count = chunk_data["word_count"],
                             is_part_of = self.document,
                             chunk_index = self.chunk_index,
                             cut_type = chunk_data["cut_type"],
                         )
                         paragraph_chunks = []
                         self.chunk_size = 0
                     else:
                         chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)
                         try:
+                            if not paragraph_chunks:
+                                raise ValueError("Empty paragraph chunks")
+                            last_chunk = paragraph_chunks[-1]  # Safer array access
                             yield DocumentChunk(
                                 id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
                                 text = chunk_text,
                                 word_count = self.chunk_size,
                                 is_part_of = self.document,
                                 chunk_index = self.chunk_index,
-                                cut_type = paragraph_chunks[len(paragraph_chunks) - 1]["cut_type"],
+                                cut_type = last_chunk["cut_type"],
                             )
-                        except Exception as e:
-                            print(e)
+                        except Exception as e:
+                            print(f"Error creating DocumentChunk: {str(e)}")
+                            # Re-raise to handle at higher level if needed
+                            raise

Line range hint 1-67: Consider architectural improvements for better maintainability.

Several architectural improvements could enhance the code:

  1. Replace print statements with proper logging
  2. Consider splitting the chunk management logic into a separate class
  3. Add type hints for better code clarity and maintainability

Example improvements:

import logging
from typing import Any, Callable, Dict, Iterator, List, Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ChunkData:
    text: str
    word_count: int
    cut_type: str
    chunk_id: Optional[str] = None

class ChunkManager:
    """Handles the logic of accumulating and managing chunks."""
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.current_size = 0
        self.chunks: List[ChunkData] = []

    def can_add(self, chunk: ChunkData) -> bool:
        return self.current_size + chunk.word_count <= self.max_size

class TextChunker:
    def __init__(self, document: Any, get_text: Callable[[], Iterator[str]], chunk_size: int = 1024):
        self.document = document
        self.max_chunk_size = chunk_size
        self.get_text = get_text
        self.chunk_manager = ChunkManager(chunk_size)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)

37-42: Consider enhancing error reporting for better debugging.

While the test is well-structured, the error message could be more helpful by showing a diff of the mismatched sections when the assertion fails.

Consider this enhancement:

-    assert (
-        reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    if reconstructed_text != input_text:
+        # Find the first difference
+        pos = next((i for i in range(min(len(input_text), len(reconstructed_text)))
+                    if input_text[i] != reconstructed_text[i]),
+                   min(len(input_text), len(reconstructed_text)))  # prefix case
+        context = slice(max(0, pos-50), pos+50)
+        assert False, (
+            f"Texts differ at position {pos}.\n"
+            f"Input context: {input_text[context]!r}\n"
+            f"Reconstructed context: {reconstructed_text[context]!r}\n"
+            f"Lengths: {len(input_text) = }, {len(reconstructed_text) = }"
+        )
cognee/tests/unit/processing/chunks/test_input.py (2)

3-275: Add docstring to describe test cases and their purposes.

The test input dictionary is well-structured with diverse test cases. Consider adding a module-level docstring to:

  1. Describe the purpose of each test case
  2. Document the expected chunking behavior for each case
  3. Explain any special characters or formatting that should be preserved

This documentation will help other developers understand the test coverage and maintain the test suite.

Example docstring:

"""Test input data for text chunking functionality.

Test cases:
- english_lists: Structured content with nested lists to verify list handling
- python_code: Python source with type annotations to verify code block handling
- chinese_text: Chinese narrative to verify Unicode and CJK character handling
- english_text: Literary text with line numbers to verify formatting preservation
"""

3-275: Add edge cases to improve test coverage.

While the current test cases are comprehensive, consider adding these edge cases to strengthen the test suite:

# Add to INPUT_TEXTS
"empty_text": "",
"single_char": "x",
"whitespace_only": "   \n\t   \n   ",
"mixed_content": """
# Python code
def hello():
    print("Hello")

# Markdown
## Header
Regular text
""",

These additions will help verify the chunker's behavior with boundary conditions.

cognee/tasks/chunks/chunk_by_paragraph.py (1)

21-27: Refactor duplicate chunk_dict creation into a helper function

The code for creating chunk_dict is repeated multiple times. Extracting this logic into a helper function would improve maintainability and reduce code duplication.

Also applies to: 48-55, 66-72

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7a72aa4 and 9b2fb09.

📒 Files selected for processing (10)
  • cognee/modules/chunking/TextChunker.py (2 hunks)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_word.py (1 hunks)
  • cognee/tests/unit/documents/PdfDocument_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tests/unit/processing/chunks/chunk_by_word_test.py

40-40: Avoid equality comparisons to False; use if not space_test: for false checks

Replace with not space_test

(E712)

🔇 Additional comments (11)
cognee/tasks/chunks/chunk_by_sentence.py (1)

5-5: LGTM! Clean type annotation and parameter addition.

The Optional type import and the new maximum_length parameter are properly implemented with correct type annotations.

Also applies to: 8-8

cognee/tests/unit/documents/PdfDocument_test.py (1)

7-8: Verify updated ground truth values against the test file

The changes to the ground truth values (text length and word count) should be validated against the actual content of "artificial-intelligence.pdf".

Let's verify the test file content:

cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)

1-6: LGTM! Clean and well-organized imports.

The imports are appropriate and well-structured, with test inputs properly separated into a dedicated file.


25-40: 🛠️ Refactor suggestion

Improve boolean comparison and add type hints.

The test logic is good, but there are a few improvements to make:

-def test_chunk_by_word_splits(input_text):
+def test_chunk_by_word_splits(input_text: str) -> None:
     chunks = np.array(list(chunk_by_word(input_text)))
     space_test = np.array([" " not in chunk[0].strip() for chunk in chunks])
 
     assert np.all(
-        space_test
-    ), f"These chunks contain spaces within them: {chunks[space_test == False]}"
+        space_test
+    ), f"These chunks contain spaces within them: {chunks[~space_test]}"

Let's verify the chunk_by_word implementation to ensure we're testing the right contract:

🧰 Tools
🪛 Ruff

40-40: Avoid equality comparisons to False; use if not space_test: for false checks

Replace with not space_test

(E712)

cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)

42-43: LGTM! Improved test data formatting

The removal of leading whitespace and consistent newline usage makes the test data cleaner and more maintainable.

Also applies to: 45-46


35-35: Verify the change in cut_type from "sentence_cut" to "word"

The change in cut_type suggests a fundamental change in how incomplete paragraphs are handled. This could affect downstream processing that relies on the cut_type.

✅ Verification successful

Change in cut_type Verified: The alteration from "sentence_cut" to "word" in cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py is isolated to the test file and does not affect downstream processing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any code that depends on the cut_type value
rg "sentence_cut|cut_type.*sentence" -A 5

Length of output: 1809


Script:

#!/bin/bash
# Search for all instances of "word" in cut_type contexts
rg '"cut_type"\s*:\s*"word"' -A 5

# Search for functions or methods that handle "cut_type" values
rg 'def .*cut_type' -A 10

# Check if "word" is used in any comments or documentation related to cut_type
rg 'cut_type.*word' --ignore-case -A 3

Length of output: 854

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)

1-6: LGTM! Well-organized imports.

The imports are appropriate and well-structured, with good separation of test input data.


8-73: Consider adding edge cases to the test matrix.

While the current test matrix is comprehensive for normal cases, consider adding these edge cases:

# Add to both parameter lists:
(INPUT_TEXTS["empty_text"], 64, True),  # Empty text
(INPUT_TEXTS["single_char"], 64, True),  # Single character
(INPUT_TEXTS["very_long_word"], 64, True),  # Word longer than chunk size
(INPUT_TEXTS["all_whitespace"], 64, True),  # Text with only whitespace

Let's check if these edge cases are already covered in the test input file:

cognee/tests/unit/processing/chunks/test_input.py (2)

1-2: LGTM!

The import is clean and appropriate for a test file.


3-275: Verify test data security and performance impact.

Please ensure that:

  1. The test data doesn't contain any sensitive information or PII
  2. The large text blocks won't impact test performance or memory usage in CI/CD pipelines

Consider adding a maximum size limit for test inputs to prevent memory issues.
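
A sketch of such a guard, assuming INPUT_TEXTS is importable as in the tests above and that 64 KiB is an acceptable ceiling:

from cognee.tests.unit.processing.chunks.test_input import INPUT_TEXTS

MAX_TEST_INPUT_BYTES = 64 * 1024  # assumed ceiling; tune to CI memory/time budgets

def test_input_sizes_are_bounded():
    for name, text in INPUT_TEXTS.items():
        size = len(text.encode("utf-8"))
        assert size <= MAX_TEST_INPUT_BYTES, f"{name} is {size} bytes"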

✅ Verification successful

Test data is secure and within performance limits.

No sensitive information found in the test data.

File size is 9012 bytes, which is acceptable and unlikely to impact performance.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for potential sensitive information in test data
rg -i "password|secret|key|token|credential" cognee/tests/unit/processing/chunks/test_input.py

# Calculate file size
ls -l cognee/tests/unit/processing/chunks/test_input.py | awk '{print $5}'

Length of output: 513

cognee/tasks/chunks/chunk_by_paragraph.py (1)

5-79: Function updates enhance clarity and maintainability

The refactored chunk_by_paragraph function shows improved logic and type clarity. The use of descriptive variable names and structured processing enhances readability and understanding.

Inline review threads (resolved):
  • cognee/tasks/chunks/chunk_by_sentence.py
  • cognee/tasks/chunks/chunk_by_sentence.py (outdated)
  • cognee/tasks/chunks/chunk_by_word.py (outdated)
  • cognee/tasks/chunks/chunk_by_word.py (outdated)
  • cognee/modules/chunking/TextChunker.py
  • cognee/tasks/chunks/chunk_by_paragraph.py (outdated)
@coderabbitai bot left a comment

Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (19)
cognee/tasks/chunks/chunk_by_sentence.py (1)

8-8: Add comprehensive documentation

Consider adding a docstring with:

  • Function description
  • Parameter descriptions with types
  • Return value description
  • Examples of usage

Here's a suggested docstring:

def chunk_by_sentence(data: str, maximum_length: Optional[int] = None) -> Iterator[Tuple[UUID, int, str, int, str]]:
    """Split text into sentence chunks with optional length control.

    Args:
        data: Input text to be chunked
        maximum_length: Optional maximum number of words per chunk. If specified, must be positive.

    Yields:
        Tuple containing:
        - paragraph_id: UUID identifying the paragraph
        - chunk_index: Index of the chunk within the paragraph
        - sentence: The chunked text
        - word_count: Number of words in the chunk
        - end_type: Type of break ("paragraph_end", "sentence_end", "length_cut", or "sentence_cut")

    Raises:
        ValueError: If maximum_length is not positive
    """
cognee/tests/unit/documents/PdfDocument_test.py (1)

7-8: Document the rationale for ground truth values.

Consider adding a comment explaining why these specific numbers are expected. This would help future maintainers understand if changes to the chunking logic should affect these values.

 GROUND_TRUTH = [
+    # Values reflect sentence-end chunking with improved whitespace handling and word tokenization
     {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
     {"word_count": 953, "len_text": 6363, "cut_type": "sentence_end"},
 ]
cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)

17-22: Enhance error message for better debugging.

While the test logic is solid, the error message could be more helpful for debugging failures.

Consider this improvement:

     assert (
         reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    ), f"texts are not identical:\nExpected: {input_text!r}\nActual: {reconstructed_text!r}\nLengths: {len(input_text) = }, {len(reconstructed_text) = }"

35-40: Consider simplifying the space validation logic.

The current implementation uses numpy arrays for a simple boolean check. While this works, a more straightforward approach might be clearer.

Consider this alternative:

-    chunks = np.array(list(chunk_by_word(input_text)))
-    space_test = np.array([" " not in chunk[0] for chunk in chunks])
-
-    assert np.all(
-        space_test
-    ), f"These chunks contain spaces within them: {chunks[space_test == False]}"
+    chunks = list(chunk_by_word(input_text))
+    chunks_with_spaces = [chunk for chunk in chunks if " " in chunk[0]]
+    assert not chunks_with_spaces, f"These chunks contain spaces: {chunks_with_spaces}"

This approach:

  • Removes unnecessary numpy usage for simple boolean checks
  • Makes the code more readable
  • Directly captures the problematic chunks
🧰 Tools
🪛 Ruff

40-40: Avoid equality comparisons to False; use if not space_test: for false checks

Replace with not space_test

(E712)

cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)

52-52: Consider adding test coverage for non-batch mode

The test now explicitly uses batch_paragraphs=True, but there's no corresponding test for batch_paragraphs=False. Consider adding test coverage for both modes to ensure complete functionality verification.

 def test_chunking_whole_text():
     run_chunking_test(INPUT_TEXT["whole_text"], GROUND_TRUTH["whole_text"])
 
+def test_chunking_whole_text_no_batch():
+    chunks = []
+    for chunk_data in chunk_by_paragraph(INPUT_TEXT["whole_text"], 12, batch_paragraphs=False):
+        chunks.append(chunk_data)
+    # Add appropriate assertions for non-batch mode
+
 def test_chunking_cut_text():
     run_chunking_test(INPUT_TEXT["cut_text"], GROUND_TRUTH["cut_text"])
+
+def test_chunking_cut_text_no_batch():
+    chunks = []
+    for chunk_data in chunk_by_paragraph(INPUT_TEXT["cut_text"], 12, batch_paragraphs=False):
+        chunks.append(chunk_data)
+    # Add appropriate assertions for non-batch mode

Line range hint 1-70: Consider parameterizing the test cases

The current test structure duplicates logic between whole_text and cut_text cases. Consider using a parameterized test approach (e.g., pytest.mark.parametrize) to make the tests more maintainable and easier to extend with new test cases.

Example refactor:

import pytest

TEST_CASES = [
    ("whole_text", True),
    ("whole_text", False),
    ("cut_text", True),
    ("cut_text", False),
]

@pytest.mark.parametrize("text_type,batch_mode", TEST_CASES)
def test_chunking(text_type, batch_mode):
    chunks = []
    for chunk_data in chunk_by_paragraph(
        INPUT_TEXT[text_type], 12, batch_paragraphs=batch_mode
    ):
        chunks.append(chunk_data)
    
    expected = GROUND_TRUTH[text_type]
    assert len(chunks) == len(expected)
    
    for expected_chunk, actual_chunk in zip(expected, chunks):
        for key in ["text", "word_count", "cut_type"]:
            assert actual_chunk[key] == expected_chunk[key]
cognee/modules/chunking/TextChunker.py (2)

Line range hint 30-39: Improve error handling consistency and chunk index management

  1. The error handling pattern here differs from other DocumentChunk creation blocks.
  2. The chunk_index increment logic could be consolidated.

Consider applying these improvements:

 if len(paragraph_chunks) == 0:
+    try:
         yield DocumentChunk(
             id = chunk_data["chunk_id"],
             text = chunk_data["text"],
             word_count = chunk_data["word_count"],
             is_part_of = self.document,
             chunk_index = self.chunk_index,
             cut_type = chunk_data["cut_type"],
         )
+    except Exception as e:
+        print(f"Error creating DocumentChunk: {e}")
     paragraph_chunks = []
     self.chunk_size = 0

42-54: Enhance error handling and optimize text concatenation

  1. The error handling should use proper logging instead of print statements
  2. The text concatenation could be optimized
  3. The cut_type access could be simplified

Consider these improvements:

-        chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)
+        chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)  # Consider using a StringBuilder pattern for large chunks
         try:
             yield DocumentChunk(
                 id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
                 text = chunk_text,
                 word_count = self.chunk_size,
                 is_part_of = self.document,
                 chunk_index = self.chunk_index,
-                cut_type = paragraph_chunks[len(paragraph_chunks) - 1]["cut_type"],
+                cut_type = paragraph_chunks[-1]["cut_type"],
             )
         except Exception as e:
-            print(e)
+            import logging
+            logging.error(f"Failed to create DocumentChunk: {e}")
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (3)

37-42: Enhance the assertion error message.

The current error message only shows lengths. Consider including:

  • A diff of the first mismatch
  • The chunk boundaries where the mismatch occurred
-    assert (
-        reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    assert reconstructed_text == input_text, (
+        f"Texts are not identical:\n"
+        f"Input length: {len(input_text)}, Reconstructed length: {len(reconstructed_text)}\n"
+        f"First difference at position: {next((i for i in range(min(len(input_text), len(reconstructed_text))) if input_text[i] != reconstructed_text[i]), -1)}\n"
+        f"Input text preview: {input_text[:100]}...\n"
+        f"Reconstructed text preview: {reconstructed_text[:100]}..."
+    )

74-79: Optimize chunk processing efficiency.

The code converts chunks to a list twice:

  1. list(chunk_by_paragraph(...))
  2. len(list(chunk_by_word(...)))

Consider storing intermediate results to avoid repeated conversions.

-    chunks = list(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs))
-
-    chunk_lengths = np.array(
-        [len(list(chunk_by_word(chunk["text"]))) for chunk in chunks]
-    )
+    chunks = list(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs))
+    word_chunks = [list(chunk_by_word(chunk["text"])) for chunk in chunks]
+    chunk_lengths = np.array([len(words) for words in word_chunks])

81-84: Consider adding minimum chunk size validation.

The test only verifies that chunks don't exceed the maximum length. Consider also testing that chunks meet a minimum size to prevent over-fragmentation.

     larger_chunks = chunk_lengths[chunk_lengths > paragraph_length]
+    too_small_chunks = chunk_lengths[chunk_lengths < paragraph_length * 0.25]  # Example threshold
     assert np.all(
         chunk_lengths <= paragraph_length
     ), f"{paragraph_length = }: {larger_chunks} are too large"
+    assert len(too_small_chunks) == 0, (
+        f"Found chunks smaller than {paragraph_length * 0.25} words: {too_small_chunks}"
+    )
cognee/tests/unit/processing/chunks/test_input.py (2)

3-275: Consider organizing test data for better maintainability

While the test data provides good coverage with different text types, consider these improvements for better maintainability:

  1. Move each test input to a separate file in a test_data directory
  2. Use a helper function to load test data from files
  3. Add docstrings describing the purpose of each test input

Example structure:

from pathlib import Path

def load_test_data(filename: str) -> str:
    """Load test data from the test_data directory."""
    data_dir = Path(__file__).parent / "test_data"
    with open(data_dir / filename, "r", encoding="utf-8") as f:
        return f.read()

INPUT_TEXTS = {
    "english_lists": load_test_data("english_lists.txt"),
    "python_code": load_test_data("python_code.py"),
    "chinese_text": load_test_data("chinese_text.txt"),
    "english_text": load_test_data("english_text.txt"),
}

133-134: Address TODO comment in test data

The TODO comment in the Python code test data suggests incomplete type hints. Consider either:

  1. Implementing the suggested overloads, or
  2. Using a simpler code example without TODOs
cognee/tasks/chunks/chunk_by_word.py (1)

58-60: Clarify ellipsis handling for improved readability

In the ellipsis (...) handling, adding '..' to current_chunk after the first '.' can be confusing. For clarity, consider appending the exact substring representing the ellipsis.

Apply this diff to make the ellipsis handling clearer:

-                current_chunk += ".."
+                current_chunk += data[i+1:i+3]

This change appends the next two dots directly, making it evident that a full ellipsis is being processed.

cognee/tasks/chunks/chunk_by_paragraph.py (5)

59-59: Avoid resetting chunk_index to 0 in non-batch mode

Resetting chunk_index to 0 on line 59 may lead to duplicate chunk_index values across different chunks, which can cause issues if chunk_index is intended to be a unique identifier within the entire data set. Consider removing the reset to maintain a consistent incrementing of chunk_index.

Apply this diff to remove the reset of chunk_index:

-            chunk_index = 0

21-27: Refactor duplicate chunk creation code into a helper function

The code for creating chunk_dict is repeated in multiple places (lines 21-27, 48-55, and 67-73). Refactoring this code into a helper function can improve maintainability and reduce redundancy.

Define a helper function to create chunk_dict:

def create_chunk_dict(text, word_count, chunk_index, cut_type, chunk_id, paragraph_id, batch_paragraphs):
    chunk_dict = {
        "text": text.strip(),
        "word_count": word_count,
        "chunk_id": chunk_id,
        "chunk_index": chunk_index,
        "cut_type": cut_type,
    }
    # Pass batch_paragraphs explicitly so the helper has no free variables.
    chunk_dict["id"] = chunk_id if batch_paragraphs else paragraph_id
    return chunk_dict

Then replace the duplicated code with calls to create_chunk_dict().

Also applies to: 48-55, 67-73


14-14: Initialize last_cut_type to an empty string

The variable last_cut_type is used when creating chunk_dict; if the loop never executes it remains None, which downstream string handling may not expect. Initializing it to an empty string keeps the value a str in all cases.

Apply this diff:

-    last_cut_type = None
+    last_cut_type = ''

10-12: Consider initializing last_paragraph_id to an empty string

Similar to last_cut_type, last_paragraph_id is used when assigning chunk_dict["id"] in non-batch mode and remains None if the loop never executes. Initializing it to an empty string avoids passing None through.

Apply this diff:

-    last_paragraph_id = None
+    last_paragraph_id = ''

16-16: Name the ignored loop variable to enhance readability

In the for-loop on line 16, an underscore _ is used as a placeholder for an unused variable. Giving it a descriptive name clarifies what is being discarded.

Consider updating the loop to clarify the purpose of the ignored value:

-    for paragraph_id, _, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length):
+    for paragraph_id, _unused, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7a72aa4 and 9b2fb09.

📒 Files selected for processing (10)
  • cognee/modules/chunking/TextChunker.py (2 hunks)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_word.py (1 hunks)
  • cognee/tests/unit/documents/PdfDocument_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tests/unit/processing/chunks/chunk_by_word_test.py

40-40: Avoid equality comparisons to False; use if not space_test: for false checks

Replace with not space_test

(E712)

🔇 Additional comments (20)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1)

1-1: Verify numpy import usage

The numpy import appears to be unused in this test file.

cognee/tasks/chunks/chunk_by_sentence.py (1)

5-5: LGTM: Type hint import is appropriate

The addition of Optional from typing aligns with Python's type hinting best practices.

cognee/tests/unit/documents/PdfDocument_test.py (1)

7-8: LGTM! Verify the updated ground truth values.

The changes to the expected values appear to align with improvements in text processing logic. However, let's verify these numbers are correct.

✅ Verification successful

Ground truth values have been successfully verified. No further action is needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the ground truth values by examining the test data and related chunking logic

# Check if the test file exists and get its size
echo "Checking test file..."
ls -l $(find . -name "artificial-intelligence.pdf")

# Look for related chunking tests that might validate these numbers
echo "Looking for related test cases..."
rg -l "artificial-intelligence.pdf" --type py

# Check for any other occurrences of these specific numbers
echo "Checking for other occurrences of these numbers..."
rg "5607|6363|879|953" --type py

Length of output: 844

cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1)

1-6: LGTM! Clean and well-organized imports.

The imports are properly structured with standard libraries first, followed by local imports. Good practice separating test inputs into a dedicated file.

cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)

42-43: LGTM: Input text formatting is consistent

The input text formatting changes align properly with the modifications in GROUND_TRUTH.

Also applies to: 45-46


6-7: Verify alignment with chunking implementation

The changes to the test data include:

  1. Addition of newline characters affecting word counts
  2. Change of cut_type from "sentence_cut" to "word"

Let's verify these changes align with the actual implementation.

Also applies to: 11-12, 35-35

✅ Verification successful

Test data aligns with chunking implementation changes

The cut_type changes in the test data correctly reflect the updates in the chunking implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the chunking implementation matches the test expectations

# Check the chunking implementation for the cut_type logic
ast-grep --pattern 'def chunk_by_paragraph($$$)' -A 20

# Look for other tests that might need similar updates
fd -e py -E 'venv' -x rg -l 'sentence_cut|word.*cut_type'

Length of output: 316


Script:

#!/bin/bash
# Search for the definition and usage of 'cut_type' in chunk_by_paragraph.py
rg 'cut_type' cognee/tasks/chunks/chunk_by_paragraph.py

# Examine related test files for 'cut_type' usage
fd -e py --type f cognee/tests/unit/processing/chunks/ -x rg 'cut_type'

Length of output: 752


Script:

#!/bin/bash
# Search for the definition and usage of 'cut_type' in chunk_by_paragraph.py
rg 'cut_type' cognee/tasks/chunks/chunk_by_paragraph.py

# Examine related test files for 'cut_type' usage with corrected fd command
fd . 'cognee/tests/unit/processing/chunks/' -e py -x rg 'cut_type'

Length of output: 586

cognee/modules/chunking/TextChunker.py (2)

19-19: LGTM: Good improvement in state management

Moving paragraph_chunks to method scope prevents unintended state persistence between read() calls, making the code more predictable and thread-safe.
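
As an illustration of the pattern (simplified names, not the PR's code):

class StatefulReader:
    paragraph_chunks = []  # class-level: shared across instances and calls

    def read(self, items):
        for item in items:
            self.paragraph_chunks.append(item)  # leftovers survive into the next read()
        return self.paragraph_chunks

class StatelessReader:
    def read(self, items):
        paragraph_chunks = []  # method-local: every read() starts clean
        for item in items:
            paragraph_chunks.append(item)
        return paragraph_chunks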


Line range hint 19-67: Verify chunk size constraints are properly enforced

Let's ensure that the chunking logic properly handles edge cases and maintains chunk size constraints.

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)

1-6: LGTM! Well-organized imports.

The imports are clean and well-structured, with clear separation between external libraries and internal modules.


8-36: LGTM! Comprehensive test parametrization.

The test covers a good range of scenarios with different:

  • Text types (English, lists, Python code, Chinese)
  • Paragraph lengths (64, 256, 1024)
  • Batch modes (True, False)
cognee/tests/unit/processing/chunks/test_input.py (1)

1-2: LGTM!

Clean and minimal import section with just the required pytest framework.

cognee/tasks/chunks/chunk_by_word.py (7)

3-4: Good practice: Define regex patterns as module-level constants

Defining SENTENCE_ENDINGS and PARAGRAPH_ENDINGS at the module level enhances readability and allows for easier maintenance if these patterns need to be updated in the future.


7-11: Enhancement: Added descriptive docstring

Including a detailed docstring for chunk_by_word improves code documentation and aids in understanding the function's purpose and usage.


17-23: Correct handling of leading whitespace and paragraph endings

The added loop effectively handles leading whitespace and paragraph endings, ensuring that any initial spaces or paragraph breaks are correctly yielded as separate chunks. This maintains the integrity of the original text when reconstructed.


39-46: Logical handling of paragraph endings

The updated logic correctly processes paragraph ending characters by yielding the current chunk and appropriately tagging the character based on whether it's a real paragraph end.
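
For readers without the diff open, a hypothetical reconstruction of what such a helper could look like; the PR's actual heuristic may differ:

import re

PARAGRAPH_ENDINGS = r"[\n\r]"

def is_real_paragraph_end(index: int, data: str) -> bool:
    # Treat a newline as a real paragraph end only when it is followed,
    # after optional spaces, by another newline or the end of the input.
    j = index + 1
    while j < len(data) and data[j] == " ":
        j += 1
    return j >= len(data) or re.match(PARAGRAPH_ENDINGS, data[j]) is not None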


47-55: Accurate processing of spaces and words

The code accurately yields words upon encountering spaces and resets the current_chunk, ensuring that words and whitespace are correctly separated.


68-69: Ensure correct detection of paragraph ends after sentence endings

The logic for determining is_paragraph_end appropriately checks for paragraph endings following sentence-ending punctuation, enhancing the accuracy of chunk tagging.


76-77: Final yield of any remaining text

Yielding any remaining current_chunk at the end ensures that all text is processed and included in the output, preventing loss of data.

cognee/tasks/chunks/chunk_by_paragraph.py (2)

29-33: Ensure consistent id assignment between batch and non-batch modes

In batch mode, chunk_dict["id"] is set to chunk_dict["chunk_id"] (lines 30 and 76), while in non-batch mode, it's set to last_paragraph_id (lines 32 and 78). This difference might lead to inconsistencies in how chunks are identified downstream. Verify that this behavior is intentional and that downstream processes handle the id field correctly in both modes.

Please confirm that the differing id assignments align with the expected usage in downstream components.

Also applies to: 74-78


46-46: Review the inclusion of 'sentence_cut' in end type handling

The condition on line 46 checks for end_type in ("paragraph_end", "sentence_cut"). Including 'sentence_cut' may cause the function to yield chunks prematurely if sentence_cut doesn't signify the end of a paragraph. Verify whether 'sentence_cut' should be included or if the condition should only check for 'paragraph_end'.

Confirm that including 'sentence_cut' aligns with the intended chunking logic.

Inline review threads (resolved):
  • cognee/tasks/chunks/chunk_by_sentence.py (outdated)
  • cognee/tasks/chunks/chunk_by_sentence.py
  • cognee/tasks/chunks/chunk_by_word.py (outdated)
  • cognee/tasks/chunks/chunk_by_paragraph.py (outdated)
  • cognee/tasks/chunks/chunk_by_paragraph.py (outdated)
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (2)

46-54: Simplify test implementation and add documentation

The test implementation could be simplified and made more maintainable:

  1. Remove unnecessary list conversion
  2. Simplify the length checking logic
  3. Add proper documentation
 def test_paragraph_chunk_length(input_text, maximum_length):
+    """Test that chunk_by_sentence respects the maximum length constraint.
+    
+    Args:
+        input_text: Text to be chunked
+        maximum_length: Maximum allowed length for each chunk
+    
+    The test verifies that no chunk exceeds the specified maximum length
+    when split into words.
+    """
-    chunks = list(chunk_by_sentence(input_text, maximum_length))
+    chunks = chunk_by_sentence(input_text, maximum_length)
 
-    chunk_lengths = np.array([len(list(chunk_by_word(chunk[2]))) for chunk in chunks])
+    for i, chunk in enumerate(chunks):
+        words = list(chunk_by_word(chunk[2]))
+        chunk_length = len(words)
+        assert chunk_length <= maximum_length, (
+            f"Chunk {i} exceeds maximum length {maximum_length}:\n"
+            f"Length: {chunk_length}\n"
+            f"Content: {chunk[2]}"
+        )
-    larger_chunks = chunk_lengths[chunk_lengths > maximum_length]
-    assert np.all(
-        chunk_lengths <= maximum_length
-    ), f"{maximum_length = }: {larger_chunks} are too large"

11-23: Add edge cases to test parameterization

While the current test cases cover various languages and text types, consider adding these edge cases to test_input.py and including them in the parameterization:

  1. Empty string
  2. Single character/word
  3. Special characters and emojis
  4. Mixed language text
  5. Very long sentences
     [
         (INPUT_TEXTS["english_text"], None),
         (INPUT_TEXTS["english_text"], 8),
         (INPUT_TEXTS["english_text"], 64),
         (INPUT_TEXTS["english_lists"], None),
         (INPUT_TEXTS["english_lists"], 8),
         (INPUT_TEXTS["english_lists"], 64),
         (INPUT_TEXTS["python_code"], None),
         (INPUT_TEXTS["python_code"], 8),
         (INPUT_TEXTS["python_code"], 64),
         (INPUT_TEXTS["chinese_text"], None),
         (INPUT_TEXTS["chinese_text"], 8),
         (INPUT_TEXTS["chinese_text"], 64),
+        (INPUT_TEXTS["empty_string"], None),
+        (INPUT_TEXTS["single_word"], None),
+        (INPUT_TEXTS["special_chars"], None),
+        (INPUT_TEXTS["mixed_language"], None),
+        (INPUT_TEXTS["long_sentence"], None),
     ],
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9ea2634 and fdec9a6.

📒 Files selected for processing (1)
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
cognee/tests/unit/processing/chunks/test_input.py (1)

4-83: Consider adding docstring documentation

The test data structure is well-organized and provides comprehensive coverage. Consider adding a docstring to document the purpose and structure of each test case category.

Example addition:

 INPUT_TEXTS = {
+    # Test data for validating text chunking system's handling of various content structures
+    # and edge cases. Each key represents a specific test scenario or category.
     "english_lists": """Let me think through the key attributes...""",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fdec9a6 and b787407.

📒 Files selected for processing (4)
  • cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • cognee/tasks/chunks/chunk_by_sentence.py
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py
  • cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py
🔇 Additional comments (4)
cognee/tests/unit/processing/chunks/test_input.py (4)

1-2: LGTM!

The import is clean and appropriate for a test file.


84-172: LGTM!

The Python code segment provides an excellent test case with complex nested structures, type hints, and various syntax elements.


174-274: Previous review feedback remains unaddressed

The English text still contains line numbers (e.g., [ 5 ], [ 10 ]) which could interfere with chunking tests. This was pointed out in a previous review.

Consider cleaning up the text by removing these line numbers to ensure accurate chunking results.


275-282: LGTM! Previous feedback well addressed

Excellent implementation of the suggested edge cases, providing comprehensive coverage for:

  • Empty text
  • Single-character text
  • Whitespace variations
  • Unicode/special characters
  • Mixed line endings
  • HTML content
  • URLs and emails

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
cognee/tasks/chunks/chunk_by_word.py (2)

18-23: Enhance function documentation with type hints and complete docstring

The function documentation needs improvement to follow standard Python practices.

-def chunk_by_word(data: str):
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+    
+    Args:
+        data: The input text to be chunked
+    
+    Returns:
+        Iterator yielding tuples of (chunk, chunk_type) where chunk_type is one of:
+        - "word": A word with its trailing whitespace
+        - "sentence_end": End of sentence marker
+        - "paragraph_end": End of paragraph marker
+    
+    Raises:
+        TypeError: If input is not a string
     """

28-34: Optimize leading whitespace handling

The current implementation can be optimized by combining the regex checks.

-    while i < len(data) and (re.match(PARAGRAPH_ENDINGS, data[i]) or data[i] == " "):
+    WHITESPACE_PATTERN = re.compile(f"{PARAGRAPH_ENDINGS}|\\s")
+    while i < len(data) and WHITESPACE_PATTERN.match(data[i]):
         current_chunk += data[i]
         i += 1
cognee/tasks/chunks/chunk_by_paragraph.py (2)

5-9: Enhance function documentation with parameter descriptions

While the docstring explains the purpose well, it would be more helpful to include descriptions of the parameters and return value.

Add parameter descriptions like this:

 def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
     """
     Chunks text by paragraph while preserving exact text reconstruction capability.
     When chunks are joined with empty string "", they reproduce the original text exactly.
+
+    Args:
+        data: Input text to be chunked
+        paragraph_length: Maximum number of words per chunk
+        batch_paragraphs: If True, treats multiple paragraphs as a single chunk
+
+    Returns:
+        Iterator of dictionaries containing chunk information
     """

45-45: Consider using constants for end types

Using string literals for end types could lead to maintenance issues. Consider defining constants for these values.

Add constants at the module level:

PARAGRAPH_END = "paragraph_end"
SENTENCE_CUT = "sentence_cut"

Then update the condition:

-if end_type in ("paragraph_end", "sentence_cut") and not batch_paragraphs:
+if end_type in (PARAGRAPH_END, SENTENCE_CUT) and not batch_paragraphs:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b787407 and 45a60b7.

📒 Files selected for processing (2)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_word.py (1 hunks)
🔇 Additional comments (4)
cognee/tasks/chunks/chunk_by_word.py (2)

58-61: **** The existing comment about improving ellipsis handling robustness is still valid.


76-77: LGTM!

The final chunk handling is correct and ensures no content is lost.

cognee/tasks/chunks/chunk_by_paragraph.py (2)

10-14: LGTM: Clear state initialization

The state variables are well-defined and properly initialized for tracking chunk content, word count, and indices.


41-41: Add spaces between sentences

The current implementation concatenates sentences without spaces, which affects text reconstruction accuracy.
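
A minimal illustration of the concern, assuming the sentences arrive without trailing whitespace:

sentences = ["First sentence.", "Second sentence."]
print("".join(sentences))   # First sentence.Second sentence.  <- words fuse
print(" ".join(sentences))  # First sentence. Second sentence.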

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)
cognee/infrastructure/engine/models/DataPoint.py (1)

23-27: Add validation and improve robustness of data handling

While the addition of type checking and string stripping is good, consider these improvements for better robustness:

  1. Add None validation
  2. Handle empty strings
  3. Consider processing all index fields, not just the first one

Here's a suggested improvement:

    def get_embeddable_data(self):
+        """Retrieves and processes data based on configured index fields.
+        
+        Returns:
+            Processed attribute value(s) based on index_fields configuration.
+            For string values, returns stripped version.
+            For other types, returns as-is.
+            Returns None if attribute is None or empty string.
+        """
        if self._metadata and len(self._metadata["index_fields"]) > 0 \
            and hasattr(self, self._metadata["index_fields"][0]):
            attribute = getattr(self, self._metadata["index_fields"][0])
+            if attribute is None:
+                return None
            if isinstance(attribute, str):
-                return(attribute.strip())
+                stripped = attribute.strip()
+                return stripped if stripped else None
            else:
-                return (attribute)
+                return attribute
cognee/tasks/chunks/chunk_by_word.py (3)

19-23: Enhance function documentation

While the docstring explains the basic functionality, it could be more comprehensive:

  • Add type hints for parameters and return value
  • Document the yield tuple format
  • Include usage examples

Apply this enhancement:

 def chunk_by_word(data: str):
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+    
+    Args:
+        data: Input text to be chunked
+    
+    Yields:
+        tuple[str, str]: A tuple containing (chunk, chunk_type) where chunk_type is one of:
+            - "word": A word or whitespace
+            - "sentence_end": End of a sentence
+            - "paragraph_end": End of a paragraph
+    
+    Examples:
+        >>> list(chunk_by_word("Hello world."))
+        [("Hello ", "word"), ("world.", "sentence_end")]
     """

25-27: Consider using a dataclass for state management

The function maintains multiple state variables (last_processed_character, current_chunk, i). Consider encapsulating these in a dataclass to make the code more maintainable and easier to test.

Example implementation:

from dataclasses import dataclass

@dataclass
class ChunkerState:
    last_char: str = ""
    current_chunk: str = ""
    position: int = 0

31-36: Simplify paragraph ending logic

The paragraph ending logic is complex and could be simplified. Consider extracting the logic into a separate method that handles both the yield and state updates.

Example:

def handle_paragraph_ending(self, char: str, state: ChunkerState) -> Iterator[Tuple[str, str]]:
    if state.current_chunk:
        yield (state.current_chunk, "word")
    yield (char, "paragraph_end" if is_real_paragraph_end(state.last_char, state.position, self.data) else "word")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 45a60b7 and e794bb8.

📒 Files selected for processing (3)
  • cognee/infrastructure/engine/models/DataPoint.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_word.py (1 hunks)
  • cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tests/unit/processing/chunks/test_input.py
🔇 Additional comments (2)
cognee/infrastructure/engine/models/DataPoint.py (1)

23-27: Verify the impact of string stripping on downstream consumers

The new string stripping behavior might affect systems that depend on the original whitespace-preserved strings.

Let's check for potential consumers:

✅ Verification successful

Based on the gathered information, I can now provide the final response:

String stripping in get_embeddable_data() is safe to implement

The string stripping behavior is safe to add because:

  • The embeddable data is only used by vector database adapters (Weaviate, PGVector, LanceDB, QDrant) for text embedding
  • The embedding process through LiteLLMEmbeddingEngine doesn't rely on whitespace preservation
  • No tests verify exact string content or have assertions that would be affected by whitespace trimming
  • Vector databases use the text purely for semantic search, where leading/trailing whitespace is not meaningful
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for direct usage of get_embeddable_data
rg -l "get_embeddable_data" | while read -r file; do
  echo "=== $file ==="
  rg -A 5 "get_embeddable_data" "$file"
done

# Search for tests that might need updating
fd -e py test_ | while read -r file; do
  echo "=== $file ==="
  rg -A 5 "DataPoint.*get_embeddable_data" "$file"
done

Length of output: 4403


Script:

#!/bin/bash
# Let's check if there are any tests that verify the exact string content
# This will help us understand if there are any expectations around whitespace

# Search in test files for assertions involving DataPoint
fd -e py test_ | while read -r file; do
  echo "=== $file ==="
  rg -B 2 -A 2 "assert.*DataPoint" "$file"
done

# Also search for any string comparisons in tests that might be affected
fd -e py test_ | while read -r file; do
  echo "=== $file ==="
  rg -B 2 -A 2 "assertEqual.*str" "$file"
done

# Let's also check the embedding engines to see how they handle the text
rg -l "class.*EmbeddingEngine" | while read -r file; do
  echo "=== $file ==="
  rg -B 2 -A 5 "def embed" "$file"
done

Length of output: 3641

cognee/tasks/chunks/chunk_by_word.py (1)

3-4: LGTM! Well-defined regex constants.

The regex patterns are clear, appropriately named, and include comprehensive character sets for both sentence and paragraph endings.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)

61-61: Remove unused variable assignment

The variable last_paragraph_id is assigned but never used in the code.

Apply this diff to remove the unused assignment:

-        last_paragraph_id = paragraph_id
🧰 Tools
🪛 Ruff

61-61: Local variable last_paragraph_id is assigned to but never used

Remove assignment to unused variable last_paragraph_id

(F841)


5-9: Add input validation and error handling

The function would benefit from proper input validation and error handling:

  • Validate that data is not None or empty
  • Ensure paragraph_length is positive
  • Document the exceptions that can be raised

Consider adding these validations:

def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
    """
    Chunks text by paragraph while preserving exact text reconstruction capability.
    
    Args:
        data: Input text to chunk
        paragraph_length: Maximum length of each chunk (must be positive)
        batch_paragraphs: Whether to batch paragraphs together
        
    Returns:
        Iterator of chunk dictionaries
        
    Raises:
        ValueError: If data is empty or paragraph_length is not positive
    """
    if not data:
        raise ValueError("Input text cannot be empty")
    if paragraph_length <= 0:
        raise ValueError("paragraph_length must be positive")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f2206a0 and 57d8149.

📒 Files selected for processing (1)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tasks/chunks/chunk_by_paragraph.py

61-61: Local variable last_paragraph_id is assigned to but never used

Remove assignment to unused variable last_paragraph_id

(F841)

🔇 Additional comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)

1-5: LGTM! Improved type safety with proper annotations.

The addition of type hints and return type annotation enhances code clarity and maintainability.


39-41: ⚠️ Potential issue

Add spaces between concatenated sentences

Currently, sentences are concatenated without spaces, which could affect text reconstruction and readability.

Apply this diff to fix the concatenation:

-            current_chunk += sentence
+            current_chunk += sentence + " "

Remember to strip any trailing whitespace when creating the chunk dictionary:

-                "text": current_chunk,
+                "text": current_chunk.strip(),

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_word.py (2)

3-4: Consider enhancing regex patterns for better internationalization

The current patterns could be expanded to handle more cases:

  • ASCII three-dot ellipsis (...) in SENTENCE_ENDINGS (the Unicode ellipsis … is already covered)
  • Different line ending styles in PARAGRAPH_ENDINGS
-SENTENCE_ENDINGS = r"[.;!?…]"
-PARAGRAPH_ENDINGS = r"[\n\r]"
+SENTENCE_ENDINGS = r"[.;!?…]|\.{3}"
+PARAGRAPH_ENDINGS = r"[\n\r\f\v]|\r\n"

35-40: Add type hints to function signature

The docstring is good but the function is missing return type annotation.

-def chunk_by_word(data: str):
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:

(Note: this also requires from typing import Iterator, Tuple at the top of the module.)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 57d8149 and b4d509e.

📒 Files selected for processing (5)
  • cognee/tasks/chunks/chunk_by_word.py (1 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • cognee/tests/integration/documents/AudioDocument_test.py
  • cognee/tests/integration/documents/ImageDocument_test.py
  • cognee/tests/integration/documents/TextDocument_test.py
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
🔇 Additional comments (2)
cognee/tasks/chunks/chunk_by_word.py (2)

6-33: Well-structured helper function with proper documentation

The implementation includes:

  • Clear type hints
  • Comprehensive docstring
  • Proper bounds checking
  • Clear logic flow

55-66: Simplify sentence ending detection logic

The current implementation has nested conditions and multiple continue statements. A previous review comment suggested a more streamlined approach which is still valid.

0xideas and others added 2 commits November 14, 2024 14:40
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
cognee/tasks/chunks/chunk_by_sentence.py (2)

19-29: Optimize the section end detection logic.

The current implementation continues checking characters even after section_end is set to "sentence_cut".

Consider this optimization:

  if word_type in ["paragraph_end", "sentence_end"]:
      section_end = word_type
  else:
+     if section_end:  # Skip if we already have a section end type
+         continue
      for character in word:
          if character.isalpha():
              section_end = "sentence_cut"
              break

Line range hint 30-36: Add validation for maximum_length parameter.

The function should validate that maximum_length is positive when provided.

Add this validation at the start of the function:

 def chunk_by_sentence(data: str, maximum_length: Optional[int] = None):
+    if maximum_length is not None and maximum_length <= 0:
+        raise ValueError("maximum_length must be a positive integer")
     sentence = ""
     paragraph_id = uuid4()
cognee/tests/integration/documents/AudioDocument_test.py (1)

34-36: Consider optimizing the iteration pattern

The current implementation reads the document twice - once for creating the list and once for iteration. Consider using a more efficient approach.

-        for ground_truth, paragraph_data in zip(
-            GROUND_TRUTH, document.read(chunk_size=64)
-        ):
+        chunks = list(document.read(chunk_size=64))
+        assert len(chunks) == len(GROUND_TRUTH), "Unexpected number of chunks"
+        for ground_truth, paragraph_data in zip(GROUND_TRUTH, chunks):
cognee/tasks/chunks/chunk_by_paragraph.py (1)

73-73: Fix excessive indentation

Remove the excessive indentation on this empty line.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b4d509e and 8b68152.

📒 Files selected for processing (4)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_sentence.py (2 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tests/integration/documents/TextDocument_test.py
🧰 Additional context used
🪛 Ruff
cognee/tests/integration/documents/AudioDocument_test.py

33-33: Ambiguous variable name: l

(E741)


33-33: Local variable l is assigned to but never used

Remove assignment to unused variable l

(F841)

🔇 Additional comments (9)
cognee/tasks/chunks/chunk_by_sentence.py (2)

13-13: LGTM: Clear variable initialization.

The section_end variable is well-named and properly initialized.


43-43: LGTM: Improved final chunk handling.

Using section_end instead of word_type correctly identifies the type of the final chunk.

cognee/tests/integration/documents/AudioDocument_test.py (2)

1-25: LGTM! Well-structured test data

The test data is realistic and well-organized, with clear mapping between GROUND_TRUTH and TEST_TEXT content.


29-31: Previous concern about empty raw_data_location remains valid

As noted in the previous review, empty raw_data_location handling needs verification. Consider adding validation in AudioDocument or documenting why an empty value is acceptable.
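
One possible shape for such a check, as a standalone helper that AudioDocument could call (the helper itself is hypothetical):

def validate_raw_data_location(raw_data_location: str) -> str:
    # Hypothetical guard; rejects empty or whitespace-only locations
    # instead of silently accepting "" as in the current test fixture.
    if not raw_data_location or not raw_data_location.strip():
        raise ValueError("raw_data_location must be a non-empty path")
    return raw_data_location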

cognee/tasks/chunks/chunk_by_paragraph.py (5)

2-9: LGTM! Well-documented function signature.

The addition of type hints and clear documentation about text reconstruction capability improves code clarity and maintainability.


10-15: LGTM! Clear variable initialization.

The variables are well-named and their purposes are clear, setting up the necessary state for the chunking process.


40-40: Add spaces between concatenated sentences

This issue was previously identified and is still present. Sentences are concatenated without spaces, which could affect text reconstruction and readability.


22-29: 🛠️ Refactor suggestion

Reduce code duplication in chunk dictionary creation

The chunk dictionary creation logic is duplicated in three places. This violates the DRY principle and makes maintenance harder.

Create a helper method to reduce duplication:

from typing import Any, Dict, Optional
from uuid import NAMESPACE_OID, UUID, uuid5

def create_chunk_dict(
    text: str,
    word_count: int,
    chunk_index: int,
    cut_type: str,
    paragraph_ids: list,
    chunk_id: Optional[UUID] = None
) -> Dict[str, Any]:
    return {
        "text": text,
        "word_count": word_count,
        "chunk_id": chunk_id or uuid5(NAMESPACE_OID, text),
        "paragraph_ids": paragraph_ids,
        "chunk_index": chunk_index,
        "cut_type": cut_type
    }

Also applies to: 46-53, 64-71


16-17: Verify handling of long sentences

The function should be tested with sentences that exceed the paragraph_length to ensure proper handling.

Let's verify if there are any extremely long sentences in the test files:

✅ Verification successful

Let's check the actual test cases and implementation to verify how long sentences are handled:


Based on the implementation and test data review, I can now provide the final response:

Long sentences are properly handled - no action needed

The implementation in chunk_by_sentence.py correctly handles long sentences by enforcing the maximum_length parameter at the word level. When a sentence reaches the maximum length, it's automatically yielded regardless of finding a sentence boundary. This is verified by:

  1. The implementation explicitly checks: if word_type in ["paragraph_end", "sentence_end"] or (maximum_length and (word_count == maximum_length))
  2. Test cases include comprehensive validation through test_paragraph_chunk_length() which asserts that no chunks exceed the maximum length
  3. The test data includes various challenging inputs including long prose (Paradise Lost), complex multilingual text, and structured content
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test files and analyze sentence lengths
fd -e py -e txt test |
while read -r file; do
  echo "Analyzing $file:"
  # Look for long strings that might be test sentences
  rg -l '.{1024,}' "$file" || true
done

Length of output: 5038


Script:

#!/bin/bash
# Check test cases for chunk_by_paragraph
rg -A 10 "def test" "cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py" || true

# Check the implementation of chunk_by_sentence since it's used by chunk_by_paragraph
rg -l "def chunk_by_sentence" || true

Length of output: 427


Script:

#!/bin/bash
# Check the implementation of chunk_by_sentence and its test cases
rg -A 10 "def chunk_by_sentence" "cognee/tasks/chunks/chunk_by_sentence.py" || true

# Check test cases for chunk_by_sentence
rg -A 10 "def test" "cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py" || true

# Check test input data
rg -A 10 "whole_text|cut_text" "cognee/tests/unit/processing/chunks/test_input.py" || true

Length of output: 1337


Script:

#!/bin/bash
# Check the actual implementation of chunk_by_sentence to see how it handles long sentences
cat "cognee/tasks/chunks/chunk_by_sentence.py"

# Check the test input data file content
cat "cognee/tests/unit/processing/chunks/test_input.py"

Length of output: 10230
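
To pin this behavior down explicitly, a sketch of a dedicated regression test (the import path and the chunk[2] text index are assumptions carried over from the tests quoted above):

from cognee.tasks.chunks import chunk_by_sentence, chunk_by_word

def test_single_long_sentence_respects_maximum_length():
    # One long sentence with no sentence-ending punctuation until the end.
    data = "word " * 300 + "end."
    for chunk in chunk_by_sentence(data, maximum_length=64):
        # chunk[2] holds the chunk text, mirroring chunk_by_sentence_test.py.
        assert len(list(chunk_by_word(chunk[2]))) <= 64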

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)

10-15: Consider adding type hints to initialized variables.

Adding type hints to the initialized variables would improve code clarity and maintainability.

-    current_chunk = ""
-    current_word_count = 0
-    chunk_index = 0
-    paragraph_ids = []
-    last_cut_type = None
+    current_chunk: str = ""
+    current_word_count: int = 0
+    chunk_index: int = 0
+    paragraph_ids: list[str] = []
+    last_cut_type: str | None = None

73-74: Fix inconsistent indentation.

The indentation of these empty lines is inconsistent with the surrounding code.

-            
-        yield chunk_dict
+        yield chunk_dict
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8b68152 and 7cf8c74.

📒 Files selected for processing (2)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
🔇 Additional comments (3)
cognee/tasks/chunks/chunk_by_paragraph.py (3)

1-9: LGTM! Good improvements to type safety and documentation.

The addition of type hints and the clear docstring about text reconstruction capability are valuable improvements.


62-72: LGTM! Good handling of remaining text.

The logic for handling the final chunk properly considers the last cut type and ensures no text is lost.


22-29: 🛠️ Refactor suggestion

Extract duplicated chunk dictionary creation logic.

The chunk dictionary creation pattern is repeated three times in the code. This violates the DRY principle and makes maintenance harder.

Previous suggestion to create a helper method still applies. This would reduce duplication and improve maintainability.

Also applies to: 46-53, 64-71

@@ -9,26 +9,25 @@ class TextChunker():

     chunk_index = 0
     chunk_size = 0
-    paragraph_chunks = []
Contributor

I don't know if paragraph_chunks is used somewhere else. It can be deleted, but from a debugging point of view it's good to be able to see what the chunks were; as a local variable it's gone once the method finishes. But maybe @borisarzentar has more context here.

Contributor Author

Within TextChunker, paragraph_chunks is only used within read. The attribute is never accessed from outside the class in the repo, and for a dynamic attribute like this one, it would also not be good practice to do so and break encapsulation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)

13-22: Consider enhancing error reporting and metadata validation.

While the isomorphism test is well-designed, consider these improvements:

  1. Make the error message more descriptive by showing the first differing position
  2. Add validation for chunk metadata (e.g., start/end positions)
 def test_chunk_by_paragraph_isomorphism(input_text, paragraph_length, batch_paragraphs):
     chunks = chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)
     reconstructed_text = "".join([chunk["text"] for chunk in chunks])
+    # Validate chunk metadata
+    prev_end = 0
+    for chunk in chunks:
+        assert chunk["start"] == prev_end, f"Gap between chunks at position {prev_end}"
+        assert chunk["end"] == prev_end + len(chunk["text"]), f"Invalid end position for chunk"
+        prev_end = chunk["end"]
+
     assert (
         reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    ), (
+        f"Texts differ at position {next((i for i, (a, b) in enumerate(zip(input_text, reconstructed_text)) if a != b), -1)}\n"
+        f"Input length: {len(input_text)}, Reconstructed length: {len(reconstructed_text)}"
+    )

42-53: Consider memory optimization and enhanced error reporting.

While the chunk numbering test is effective, consider these improvements:

 def test_chunk_by_paragraph_chunk_numbering(
     input_text, paragraph_length, batch_paragraphs
 ):
-    chunks = chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)
-    chunk_indices = np.array([chunk["chunk_index"] for chunk in chunks])
-    assert np.all(
-        chunk_indices == np.arange(len(chunk_indices))
-    ), f"{chunk_indices = } are not monotonically increasing"
+    prev_index = -1
+    for i, chunk in enumerate(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)):
+        current_index = chunk["chunk_index"]
+        assert current_index == i, (
+            f"Invalid chunk index at position {i}.\n"
+            f"Expected index: {i}, got: {current_index}\n"
+            f"Chunk text: {chunk['text'][:100]}..."
+        )
+        assert current_index > prev_index, (
+            f"Non-monotonic chunk index at position {i}.\n"
+            f"Previous index: {prev_index}, Current index: {current_index}"
+        )
+        prev_index = current_index
cognee/tests/unit/modules/graph/cognee_graph_test.py (1)

Line range hint 1-82: Consider adding edge cases to strengthen test coverage.

While the current test suite is well-structured and covers the main functionality, consider adding these test cases to improve coverage:

  1. Test with empty/null node IDs
  2. Test edge creation with the same source and target node
  3. Test concurrent modifications to the graph
  4. Test with large numbers of nodes/edges for performance

Would you like me to help generate these additional test cases?
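
As a starting point, a sketch of case 2; the Node, Edge, and CogneeGraph signatures here are assumptions, not taken from the suite:

def test_self_loop_edge():
    # Hypothetical API: constructor and accessor names are illustrative.
    graph = CogneeGraph()
    node = Node("node1")
    graph.add_node(node)
    edge = Edge(node, node)  # same source and target node
    graph.add_edge(edge)
    assert graph.get_edges("node1"), "self-loop edge should be registered"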

cognee/tasks/chunks/chunk_by_paragraph.py (2)

6-9: Enhance docstring with parameter and return value documentation.

While the current docstring explains the core functionality, it would be more helpful with:

  • Parameter descriptions (data, paragraph_length, batch_paragraphs)
  • Return value format
  • Example usage
 def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
     """
     Chunks text by paragraph while preserving exact text reconstruction capability.
     When chunks are joined with empty string "", they reproduce the original text exactly.
+
+    Args:
+        data: Input text to be chunked
+        paragraph_length: Maximum number of words per chunk
+        batch_paragraphs: If True, combines paragraphs up to paragraph_length
+
+    Returns:
+        Iterator yielding dictionaries containing:
+            - text: The chunk text
+            - word_count: Number of words in the chunk
+            - chunk_id: UUID5 identifier for the chunk
+            - paragraph_ids: List of paragraph identifiers included
+            - chunk_index: Sequential index of the chunk
+            - cut_type: Type of cut that ended this chunk
+
+    Example:
+        >>> text = "First paragraph. Second one."
+        >>> chunks = list(chunk_by_paragraph(text, paragraph_length=5))
     """

60-70: LGTM with a minor formatting issue.

The final chunk handling logic is correct, but there's unnecessary whitespace at line 71 that should be removed.

         }
-        
-            
         yield chunk_dict

Also applies to: 72-72

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7cf8c74 and 928e107.

📒 Files selected for processing (5)
  • cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
  • cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
  • cognee/tests/unit/modules/graph/cognee_graph_elements_test.py (10 hunks)
  • cognee/tests/unit/modules/graph/cognee_graph_test.py (5 hunks)
  • cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • cognee/tests/unit/modules/graph/cognee_graph_elements_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/tasks/chunks/chunk_by_sentence.py
🧰 Additional context used
📓 Learnings (1)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
🔇 Additional comments (9)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (3)

1-11: LGTM! Well-structured test setup.

The imports and test parameters are well organized. The use of itertools.product for test parameterization is a good practice, providing comprehensive coverage across different chunk sizes and batching configurations.


25-39: Previous optimization suggestions are still applicable.

The current implementation could benefit from the memory optimization and enhanced error reporting suggestions made in the previous review.


9-10: Verify test parameters against current chunking implementation.

Based on previous learnings, ensure that the paragraph lengths (64, 256, 1024) align with the current chunking implementation in cognee/tasks/chunks.

✅ Verification successful

Test parameters align with chunking implementation

The test values are correct as they test various scenarios around the default paragraph length of 1024 (as seen in the implementation). The test covers:

  • Small chunks (64 words)
  • Medium chunks (256 words)
  • Default size chunks (1024 words)
  • Both batched and non-batched paragraph modes
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the test parameters align with the chunking implementation

# Search for relevant configuration or constants in the chunking implementation
rg -A 5 "DEFAULT_PARAGRAPH_LENGTH|MAX_PARAGRAPH_LENGTH|MIN_PARAGRAPH_LENGTH" cognee/tasks/chunks/

Length of output: 99


Script:

#!/bin/bash
# Let's try a broader search for paragraph-related configurations and implementations

# Search for files containing paragraph-related code in chunks directory
fd . cognee/tasks/chunks/ --type f --exec rg -l "paragraph" {}

# Search for actual paragraph length usage in the implementation
rg -A 3 "paragraph.*length" cognee/tasks/chunks/

# Look for the chunking implementation file
ast-grep --pattern 'def chunk_by_paragraph($$$)'

Length of output: 1811

cognee/tests/unit/modules/graph/cognee_graph_test.py (2)

4-4: LGTM! Import reordering is safe.

The reordering of Edge and Node imports is a cosmetic change that doesn't affect functionality.


Line range hint 12-82: LGTM! Improved test structure and readability.

The added newlines between test functions improve code readability by providing clear visual separation between test cases. The consistent spacing and well-documented test functions follow Python testing best practices.

cognee/tasks/chunks/chunk_by_paragraph.py (4)

2-2: LGTM! Improved type safety with explicit type hints.

The addition of type hints enhances code clarity and maintainability.

Also applies to: 5-5


10-15: LGTM! Clear and well-organized state initialization.

The state variables are properly initialized with descriptive names that clearly indicate their purpose.


20-27: Duplicate comment: Extract chunk dictionary creation logic.

A past review already suggested extracting the duplicated chunk dictionary creation into a helper method. This suggestion is still valid and would improve code maintainability.

Also applies to: 44-51, 62-69


16-58: LGTM! Well-structured chunking logic.

The implementation correctly handles:

  • Word count limits
  • Batch vs non-batch modes
  • Paragraph boundaries


gitguardian bot commented Nov 17, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | Status    | Secret                      | Commit  | Filename                           |
|----------------|-----------|-----------------------------|---------|------------------------------------|
| 13947935       | Triggered | Generic High Entropy Secret | 22a0e43 | cognee/shared/utils.py             |
| 9573981        | Triggered | Generic Password            | 22a0e43 | notebooks/cognee_llama_index.ipynb |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, for example via environment variables (see the sketch after this list), and follow the linked best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
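
For step 2, the usual pattern is to move the value into an environment variable; a minimal sketch (the variable name is a placeholder):

import os

# Read the token from the environment instead of committing it;
# COGNEE_API_TOKEN is an illustrative name, not the actual secret's.
API_TOKEN = os.environ.get("COGNEE_API_TOKEN")
if not API_TOKEN:
    raise RuntimeError("COGNEE_API_TOKEN is not set")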

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@0xideas 0xideas merged commit ced5385 into main Nov 18, 2024
26 of 28 checks passed
@0xideas 0xideas deleted the COG-417-chunking-unit-tests branch November 18, 2024 11:52