Cog 417 chunking unit tests #205
Conversation
Walkthrough
The pull request introduces significant modifications to the text chunking functionality across several modules.
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (18)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (2)
1-1: Remove unused import. The numpy module is imported but not used in this test file.

-import numpy as np
8-16: Consider adding edge cases to test parameterization. While the current test cases cover diverse input types, consider adding edge cases such as:
- Empty text
- Very long text
- Text with special characters
- Text with multiple consecutive newlines

 @pytest.mark.parametrize(
     "input_text",
     [
         INPUT_TEXTS["english_text"],
         INPUT_TEXTS["english_lists"],
         INPUT_TEXTS["python_code"],
         INPUT_TEXTS["chinese_text"],
+        "",  # Empty text
+        INPUT_TEXTS["english_text"] * 1000,  # Very long text
+        INPUT_TEXTS["special_characters"],  # Text with special characters
+        INPUT_TEXTS["multiple_newlines"],  # Text with multiple consecutive newlines
     ],
 )

cognee/tasks/chunks/chunk_by_sentence.py (1)
Add tests for the maximum_length parameter in chunk_by_sentence.
- Ensure maximum_length is properly tested in cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py
🔗 Analysis chain
Line range hint 8-30: Verify text reconstruction and usage compatibility. The changes to the chunking logic need verification to ensure:
- Text can be properly reconstructed from chunks (isomorphism)
- All callers are compatible with the new parameter

Let's check the usage and test coverage:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify function usage and test coverage

# Check for test cases testing isomorphism
echo "Checking for isomorphism tests..."
rg -l "test.*isomorphism" "cognee/tests"

# Check for all usages of chunk_by_sentence
echo "Checking all usages of chunk_by_sentence..."
rg "chunk_by_sentence\(" --type python

# Check if there are any direct tests for maximum_length
echo "Checking for maximum_length tests..."
rg "maximum_length.*chunk_by_sentence" "cognee/tests"

Length of output: 608
Script:
#!/bin/bash
# Description: Verify all usages of chunk_by_sentence with corrected file type

echo "Checking all usages of chunk_by_sentence..."
rg "chunk_by_sentence\(" --type py

Length of output: 840
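For context, an isomorphism test of the kind referenced above could look like the sketch below (the import path and the yielded tuple shape are assumptions, inferred from the file layout and the docstring suggestion later in this review):

import pytest
from cognee.tasks.chunks import chunk_by_sentence  # import path assumed

@pytest.mark.parametrize("maximum_length", [None, 8, 64])
def test_chunk_by_sentence_isomorphism(maximum_length):
    input_text = "First sentence. Second!\nA new paragraph follows here."
    # The third tuple element is assumed to be the sentence text;
    # joining all pieces with "" should reproduce the input exactly.
    chunks = list(chunk_by_sentence(input_text, maximum_length=maximum_length))
    assert "".join(chunk[2] for chunk in chunks) == input_text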
cognee/tests/unit/documents/PdfDocument_test.py (1)
7-8: Document the expected chunk format. Consider adding comments to explain:
- What constitutes a "word" in the word count
- How the text length is calculated (e.g., handling of whitespace)
- What determines a "sentence_end" cut type

Add documentation above the GROUND_TRUTH list:

+# Ground truth values for PDF chunks:
+# - word_count: Number of words in chunk (words are split by whitespace)
+# - len_text: Total length of normalized text including whitespace
+# - cut_type: How the chunk was terminated (e.g., "sentence_end" for complete sentences)
 GROUND_TRUTH = [
     {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
     {"word_count": 953, "len_text": 6363, "cut_type": "sentence_end"},
 ]

cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)
8-22: Add type hints to improve code clarity. The test implementation is solid, with good parametrization and helpful error messages. Consider adding type hints to make the function signature more explicit.

-def test_chunk_by_word_isomorphism(input_text):
+def test_chunk_by_word_isomorphism(input_text: str) -> None:
11-15: Consider adding edge case test inputs. While the current test inputs cover various scenarios, consider adding edge cases such as:
- Empty string
- String with only whitespace
- Very long words
- Special characters and emojis

Would you like me to help create these additional test cases?
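As a starting point, those cases might look like the sketch below (the literal inputs are synthetic examples, and the import path is an assumption based on the file layout above):

import pytest
from cognee.tasks.chunks import chunk_by_word  # import path assumed

@pytest.mark.parametrize(
    "input_text",
    [
        "",                    # empty string
        "   \n\t  ",           # whitespace only
        "a" * 500,             # a very long single word
        "naïve café 😀 → ∑",   # special characters and emojis
    ],
)
def test_chunk_by_word_edge_cases(input_text):
    # chunk[0] is the chunk text; joining with "" must reproduce the input.
    reconstructed = "".join(chunk[0] for chunk in chunk_by_word(input_text))
    assert reconstructed == input_text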
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (1)
51-54: Enhance test coverage with parameterized tests. Consider improving the test suite with:
- Parameterized tests for different chunk sizes
- Edge cases (empty text, single word paragraphs)
- Error conditions (chunk size <= 0)

Example implementation using pytest.mark.parametrize:

import pytest

@pytest.mark.parametrize("chunk_size,expected_chunks", [
    (5, [...]),   # smaller chunks
    (12, GROUND_TRUTH["whole_text"]),  # current case
    (20, [...]),  # larger chunks
])
def test_chunking_with_different_sizes(chunk_size, expected_chunks):
    chunks = list(chunk_by_paragraph(INPUT_TEXT["whole_text"], chunk_size, batch_paragraphs=True))
    assert chunks == expected_chunks

def test_chunking_edge_cases():
    # Empty text
    assert list(chunk_by_paragraph("", 12)) == []
    # Single word paragraph
    result = list(chunk_by_paragraph("word\n", 12))
    assert len(result) == 1
    assert result[0]["word_count"] == 1

def test_chunking_errors():
    with pytest.raises(ValueError):
        list(chunk_by_paragraph(INPUT_TEXT["whole_text"], 0))

cognee/tasks/chunks/chunk_by_word.py (4)
3-4: Consider enhancing regex patterns for Unicode support. The regex patterns could be expanded to handle Unicode punctuation marks and line endings for better international text support.

-SENTENCE_ENDINGS = r"[.;!?…]"
-PARAGRAPH_ENDINGS = r"[\n\r]"
+SENTENCE_ENDINGS = r"[.;!?…\u2026\u3002\uFF01\uFF1F]"  # Include CJK punctuation
+PARAGRAPH_ENDINGS = r"[\n\r\u2028\u2029]"  # Include Unicode line/paragraph separators
6-11: Enhance type hints and docstring. The function would benefit from complete type annotations and example usage in the docstring.

-def chunk_by_word(data: str):
+from typing import Iterator, Tuple
+
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+
+    Args:
+        data: Input text to be chunked
+
+    Returns:
+        Iterator yielding tuples of (chunk, chunk_type) where chunk_type is one of:
+        "word", "sentence_end", or "paragraph_end"
+
+    Example:
+        >>> list(chunk_by_word("Hello world."))
+        [("Hello ", "word"), ("world.", "sentence_end")]
     """
16-22: Optimize leading whitespace handling. The current implementation performs multiple regex matches in a loop. This could be optimized using a single regex match.

-    while i < len(data) and (re.match(PARAGRAPH_ENDINGS, data[i]) or data[i] == " "):
-        current_chunk += data[i]
-        i += 1
+    leading_space_match = re.match(f"^([{PARAGRAPH_ENDINGS}| ]+)", data)
+    if leading_space_match:
+        current_chunk = leading_space_match.group(1)
+        i = len(current_chunk)
63-67: Simplify whitespace handling after sentence endings. The current implementation uses a while loop to handle whitespace. This could be simplified using regex.

-    next_i = i + 1
-    while next_i < len(data) and data[next_i] == " ":
-        current_chunk += data[next_i]
-        next_i += 1
+    if i + 1 < len(data):
+        space_match = re.match(r"( +)", data[i+1:])
+        if space_match:
+            current_chunk += space_match.group(1)
+            next_i = i + 1 + len(space_match.group(1))

cognee/modules/chunking/TextChunker.py (3)
19-27: Add input validation and error handling. While moving to method-level state management is good, consider adding:
- Input validation for content_text
- Error handling for the chunk_by_paragraph generator
- Validation for the chunk_data dictionary structure

 def read(self):
     paragraph_chunks = []
     for content_text in self.get_text():
+        if not content_text:
+            continue
+        try:
             for chunk_data in chunk_by_paragraph(
                 content_text,
                 self.max_chunk_size,
                 batch_paragraphs = True,
             ):
+                if not all(k in chunk_data for k in ["word_count", "text", "chunk_id", "cut_type"]):
+                    raise ValueError("Invalid chunk data structure")
                 if self.chunk_size + chunk_data["word_count"] <= self.max_chunk_size:
                     paragraph_chunks.append(chunk_data)
                     self.chunk_size += chunk_data["word_count"]
+        except Exception as e:
+            print(f"Error processing content: {str(e)}")
+            continue
Line range hint 30-54: Critical improvements needed in error handling and array access. Several issues need attention:
- The bare except is too broad and masks errors
- Unsafe array access could raise IndexError
- Complex nested conditions could be simplified

Apply these improvements:

-            if len(paragraph_chunks) == 0:
+            if not paragraph_chunks:
                 yield DocumentChunk(
                     id = chunk_data["chunk_id"],
                     text = chunk_data["text"],
                     word_count = chunk_data["word_count"],
                     is_part_of = self.document,
                     chunk_index = self.chunk_index,
                     cut_type = chunk_data["cut_type"],
                 )
                 paragraph_chunks = []
                 self.chunk_size = 0
             else:
                 chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)
                 try:
+                    if not paragraph_chunks:
+                        raise ValueError("Empty paragraph chunks")
+                    last_chunk = paragraph_chunks[-1]  # Safer array access
                     yield DocumentChunk(
                         id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
                         text = chunk_text,
                         word_count = self.chunk_size,
                         is_part_of = self.document,
                         chunk_index = self.chunk_index,
-                        cut_type = paragraph_chunks[len(paragraph_chunks) - 1]["cut_type"],
+                        cut_type = last_chunk["cut_type"],
                     )
-                except Exception as e:
-                    print(e)
+                except Exception as e:
+                    print(f"Error creating DocumentChunk: {str(e)}")
+                    # Re-raise to handle at higher level if needed
+                    raise
Line range hint 1-67: Consider architectural improvements for better maintainability. Several architectural improvements could enhance the code:
- Replace print statements with proper logging
- Consider splitting the chunk management logic into a separate class
- Add type hints for better code clarity and maintainability

Example improvements:

import logging
from typing import Any, Dict, Iterator, List, Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class ChunkData:
    text: str
    word_count: int
    cut_type: str
    chunk_id: Optional[str] = None

class ChunkManager:
    """Handles the logic of accumulating and managing chunks."""
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.current_size = 0
        self.chunks: List[ChunkData] = []

    def can_add(self, chunk: ChunkData) -> bool:
        return self.current_size + chunk.word_count <= self.max_size

class TextChunker:
    def __init__(self, document: Any, get_text: callable, chunk_size: int = 1024):
        self.document = document
        self.max_chunk_size = chunk_size
        self.get_text = get_text
        self.chunk_manager = ChunkManager(chunk_size)

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
37-42: Consider enhancing error reporting for better debugging. While the test is well-structured, the error message could be more helpful by showing a diff of the mismatched sections when the assertion fails.

Consider this enhancement:

-    assert (
-        reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    if reconstructed_text != input_text:
+        # Find the first difference
+        pos = next(i for i in range(min(len(input_text), len(reconstructed_text)))
+                   if input_text[i] != reconstructed_text[i])
+        context = slice(max(0, pos - 50), pos + 50)
+        assert False, (
+            f"Texts differ at position {pos}.\n"
+            f"Input context: {input_text[context]!r}\n"
+            f"Reconstructed context: {reconstructed_text[context]!r}\n"
+            f"Lengths: {len(input_text) = }, {len(reconstructed_text) = }"
+        )

cognee/tests/unit/processing/chunks/test_input.py (2)
3-275: Add a docstring to describe test cases and their purposes. The test input dictionary is well-structured with diverse test cases. Consider adding a module-level docstring to:
- Describe the purpose of each test case
- Document the expected chunking behavior for each case
- Explain any special characters or formatting that should be preserved

This documentation will help other developers understand the test coverage and maintain the test suite.

Example docstring:

"""Test input data for text chunking functionality.

Test cases:
- english_lists: Structured content with nested lists to verify list handling
- python_code: Python source with type annotations to verify code block handling
- chinese_text: Chinese narrative to verify Unicode and CJK character handling
- english_text: Literary text with line numbers to verify formatting preservation
"""

3-275: Add edge cases to improve test coverage. While the current test cases are comprehensive, consider adding these edge cases to strengthen the test suite:

# Add to INPUT_TEXTS
"empty_text": "",
"single_char": "x",
"whitespace_only": " \n\t \n ",
"mixed_content": """
# Python code
def hello():
    print("Hello")

# Markdown
## Header
Regular text
""",

These additions will help verify the chunker's behavior with boundary conditions.

cognee/tasks/chunks/chunk_by_paragraph.py (1)

21-27: Refactor duplicate chunk_dict creation into a helper function. The code for creating chunk_dict is repeated multiple times. Extracting this logic into a helper function would improve maintainability and reduce code duplication.

Also applies to: 48-55, 66-72
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
- cognee/modules/chunking/TextChunker.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
- cognee/tasks/chunks/chunk_by_word.py (1 hunks)
- cognee/tests/unit/documents/PdfDocument_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tests/unit/processing/chunks/chunk_by_word_test.py
40-40: Avoid equality comparisons to False; use "if not space_test:" for false checks. Replace with "not space_test" (E712)
🔇 Additional comments (11)
cognee/tasks/chunks/chunk_by_sentence.py (1)
5-5: LGTM! Clean type annotation and parameter addition. The Optional type import and the new maximum_length parameter are properly implemented with correct type annotations.

Also applies to: 8-8
cognee/tests/unit/documents/PdfDocument_test.py (1)
7-8: Verify updated ground truth values against the test file. The changes to the ground truth values (text length and word count) should be validated against the actual content of "artificial-intelligence.pdf".

Let's verify the test file content:
cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)
1-6
: LGTM! Clean and well-organized imports.
The imports are appropriate and well-structured, with test inputs properly separated into a dedicated file.
25-40: 🛠️ Refactor suggestion: Improve boolean comparison and add type hints. The test logic is good, but there are a few improvements to make:

-def test_chunk_by_word_splits(input_text):
+def test_chunk_by_word_splits(input_text: str) -> None:
     chunks = np.array(list(chunk_by_word(input_text)))
     space_test = np.array([" " not in chunk[0].strip() for chunk in chunks])
     assert np.all(
-        space_test
-    ), f"These chunks contain spaces within them: {chunks[space_test == False]}"
+        space_test
+    ), f"These chunks contain spaces within them: {chunks[~space_test]}"

Let's verify the chunk_by_word implementation to ensure we're testing the right contract:
🧰 Tools
🪛 Ruff
40-40: Avoid equality comparisons to False; use "if not space_test:" for false checks. Replace with "not space_test" (E712)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)
42-43: LGTM! Improved test data formatting. The removal of leading whitespace and consistent newline usage makes the test data cleaner and more maintainable.

Also applies to: 45-46
35-35: Verify the change in cut_type from "sentence_cut" to "word". The change in cut_type suggests a fundamental change in how incomplete paragraphs are handled. This could affect downstream processing that relies on the cut_type.
✅ Verification successful
Change in cut_type verified: the alteration from "sentence_cut" to "word" in cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py is isolated to the test file and does not affect downstream processing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any code that depends on the cut_type value
rg "sentence_cut|cut_type.*sentence" -A 5
Length of output: 1809
Script:
#!/bin/bash
# Search for all instances of "word" in cut_type contexts
rg '"cut_type"\s*:\s*"word"' -A 5
# Search for functions or methods that handle "cut_type" values
rg 'def .*cut_type' -A 10
# Check if "word" is used in any comments or documentation related to cut_type
rg 'cut_type.*word' --ignore-case -A 3
Length of output: 854
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)
1-6: LGTM! Well-organized imports. The imports are appropriate and well-structured, with good separation of test input data.
8-73: Consider adding edge cases to the test matrix. While the current test matrix is comprehensive for normal cases, consider adding these edge cases:
# Add to both parameter lists:
(INPUT_TEXTS["empty_text"], 64, True), # Empty text
(INPUT_TEXTS["single_char"], 64, True), # Single character
(INPUT_TEXTS["very_long_word"], 64, True), # Word longer than chunk size
(INPUT_TEXTS["all_whitespace"], 64, True), # Text with only whitespace
Let's check if these edge cases are already covered in the test input file:
cognee/tests/unit/processing/chunks/test_input.py (2)
1-2: LGTM! The import is clean and appropriate for a test file.
3-275: Verify test data security and performance impact.
Please ensure that:
- The test data doesn't contain any sensitive information or PII
- The large text blocks won't impact test performance or memory usage in CI/CD pipelines
Consider adding a maximum size limit for test inputs to prevent memory issues.
✅ Verification successful
Test data is secure and within performance limits.
No sensitive information found in the test data.
File size is 9012 bytes, which is acceptable and unlikely to impact performance.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for potential sensitive information in test data
rg -i "password|secret|key|token|credential" cognee/tests/unit/processing/chunks/test_input.py
# Calculate file size
ls -l cognee/tests/unit/processing/chunks/test_input.py | awk '{print $5}'
Length of output: 513
cognee/tasks/chunks/chunk_by_paragraph.py (1)
5-79: Function updates enhance clarity and maintainability. The refactored chunk_by_paragraph function shows improved logic and type clarity. The use of descriptive variable names and structured processing enhances readability and understanding.
Actionable comments posted: 13
🧹 Outside diff range and nitpick comments (19)
cognee/tasks/chunks/chunk_by_sentence.py (1)
8-8: Add comprehensive documentation. Consider adding a docstring with:
- Function description
- Parameter descriptions with types
- Return value description
- Examples of usage
Here's a suggested docstring:
def chunk_by_sentence(data: str, maximum_length: Optional[int] = None) -> Iterator[Tuple[UUID, int, str, int, str]]:
    """Split text into sentence chunks with optional length control.

    Args:
        data: Input text to be chunked
        maximum_length: Optional maximum number of words per chunk. If specified, must be positive.

    Yields:
        Tuple containing:
        - paragraph_id: UUID identifying the paragraph
        - chunk_index: Index of the chunk within the paragraph
        - sentence: The chunked text
        - word_count: Number of words in the chunk
        - word_type: Type of break ("paragraph_end", "sentence_end", "length_cut", or "sentence_cut")

    Raises:
        ValueError: If maximum_length is not positive
    """

cognee/tests/unit/documents/PdfDocument_test.py (1)
7-8: Document the rationale for ground truth values. Consider adding a comment explaining why these specific numbers are expected. This would help future maintainers understand if changes to the chunking logic should affect these values.

 GROUND_TRUTH = [
+    # Values reflect sentence-end chunking with improved whitespace handling and word tokenization
     {"word_count": 879, "len_text": 5607, "cut_type": "sentence_end"},
     {"word_count": 953, "len_text": 6363, "cut_type": "sentence_end"},
 ]
cognee/tests/unit/processing/chunks/chunk_by_word_test.py (2)
17-22: Enhance error message for better debugging. While the test logic is solid, the error message could be more helpful for debugging failures.

Consider this improvement:

     assert (
         reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    ), f"texts are not identical:\nExpected: {input_text!r}\nActual: {reconstructed_text!r}\nLengths: {len(input_text) = }, {len(reconstructed_text) = }"
35-40: Consider simplifying the space validation logic. The current implementation uses numpy arrays for a simple boolean check. While this works, a more straightforward approach might be clearer.

Consider this alternative:

-    chunks = np.array(list(chunk_by_word(input_text)))
-    space_test = np.array([" " not in chunk[0] for chunk in chunks])
-
-    assert np.all(
-        space_test
-    ), f"These chunks contain spaces within them: {chunks[space_test == False]}"
+    chunks = list(chunk_by_word(input_text))
+    chunks_with_spaces = [chunk for chunk in chunks if " " in chunk[0]]
+    assert not chunks_with_spaces, f"These chunks contain spaces: {chunks_with_spaces}"

This approach:
- Removes unnecessary numpy usage for simple boolean checks
- Makes the code more readable
- Directly captures the problematic chunks
🧰 Tools
🪛 Ruff
40-40: Avoid equality comparisons to False; use "if not space_test:" for false checks. Replace with "not space_test" (E712)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)
52-52: Consider adding test coverage for non-batch mode. The test now explicitly uses batch_paragraphs=True, but there's no corresponding test for batch_paragraphs=False. Consider adding test coverage for both modes to ensure complete functionality verification.

 def test_chunking_whole_text():
     run_chunking_test(INPUT_TEXT["whole_text"], GROUND_TRUTH["whole_text"])

+def test_chunking_whole_text_no_batch():
+    chunks = []
+    for chunk_data in chunk_by_paragraph(INPUT_TEXT["whole_text"], 12, batch_paragraphs=False):
+        chunks.append(chunk_data)
+    # Add appropriate assertions for non-batch mode
+
 def test_chunking_cut_text():
     run_chunking_test(INPUT_TEXT["cut_text"], GROUND_TRUTH["cut_text"])
+
+def test_chunking_cut_text_no_batch():
+    chunks = []
+    for chunk_data in chunk_by_paragraph(INPUT_TEXT["cut_text"], 12, batch_paragraphs=False):
+        chunks.append(chunk_data)
+    # Add appropriate assertions for non-batch mode
Line range hint 1-70: Consider parameterizing the test cases. The current test structure duplicates logic between the whole_text and cut_text cases. Consider using a parameterized test approach (e.g., pytest.mark.parametrize) to make the tests more maintainable and easier to extend with new test cases.

Example refactor:

import pytest

TEST_CASES = [
    ("whole_text", True),
    ("whole_text", False),
    ("cut_text", True),
    ("cut_text", False),
]

@pytest.mark.parametrize("text_type,batch_mode", TEST_CASES)
def test_chunking(text_type, batch_mode):
    chunks = []
    for chunk_data in chunk_by_paragraph(
        INPUT_TEXT[text_type], 12, batch_paragraphs=batch_mode
    ):
        chunks.append(chunk_data)
    expected = GROUND_TRUTH[text_type]
    assert len(chunks) == len(expected)
    for expected_chunk, actual_chunk in zip(expected, chunks):
        for key in ["text", "word_count", "cut_type"]:
            assert actual_chunk[key] == expected_chunk[key]

cognee/modules/chunking/TextChunker.py (2)
Line range hint 30-39: Improve error handling consistency and chunk index management.
- The error handling pattern here differs from other DocumentChunk creation blocks.
- The chunk_index increment logic could be consolidated.

Consider applying these improvements:

 if len(paragraph_chunks) == 0:
+    try:
         yield DocumentChunk(
             id = chunk_data["chunk_id"],
             text = chunk_data["text"],
             word_count = chunk_data["word_count"],
             is_part_of = self.document,
             chunk_index = self.chunk_index,
             cut_type = chunk_data["cut_type"],
         )
+    except Exception as e:
+        print(f"Error creating DocumentChunk: {e}")
     paragraph_chunks = []
     self.chunk_size = 0
42-54: Enhance error handling and optimize text concatenation.
- The error handling should use proper logging instead of print statements
- The text concatenation could be optimized
- The cut_type access could be simplified

Consider these improvements:

-    chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)
+    chunk_text = " ".join(chunk["text"] for chunk in paragraph_chunks)  # Consider a StringBuilder pattern for large chunks
     try:
         yield DocumentChunk(
             id = uuid5(NAMESPACE_OID, f"{str(self.document.id)}-{self.chunk_index}"),
             text = chunk_text,
             word_count = self.chunk_size,
             is_part_of = self.document,
             chunk_index = self.chunk_index,
-            cut_type = paragraph_chunks[len(paragraph_chunks) - 1]["cut_type"],
+            cut_type = paragraph_chunks[-1]["cut_type"],
         )
     except Exception as e:
-        print(e)
+        import logging
+        logging.error(f"Failed to create DocumentChunk: {e}")

cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (3)
37-42: Enhance the assertion error message. The current error message only shows lengths. Consider including:
- A diff of the first mismatch
- The chunk boundaries where the mismatch occurred

-    assert (
-        reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    assert reconstructed_text == input_text, (
+        f"Texts are not identical:\n"
+        f"Input length: {len(input_text)}, Reconstructed length: {len(reconstructed_text)}\n"
+        f"First difference at position: {next((i for i in range(min(len(input_text), len(reconstructed_text))) if input_text[i] != reconstructed_text[i]), -1)}\n"
+        f"Input text preview: {input_text[:100]}...\n"
+        f"Reconstructed text preview: {reconstructed_text[:100]}..."
+    )
74-79: Optimize chunk processing efficiency. The code converts chunks to a list twice:
- list(chunk_by_paragraph(...))
- len(list(chunk_by_word(...)))

Consider storing intermediate results to avoid repeated conversions.

-    chunks = list(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs))
-
-    chunk_lengths = np.array(
-        [len(list(chunk_by_word(chunk["text"]))) for chunk in chunks]
-    )
+    chunks = list(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs))
+    word_chunks = [list(chunk_by_word(chunk["text"])) for chunk in chunks]
+    chunk_lengths = np.array([len(words) for words in word_chunks])
81-84: Consider adding minimum chunk size validation. The test only verifies that chunks don't exceed the maximum length. Consider also testing that chunks meet a minimum size to prevent over-fragmentation.

     larger_chunks = chunk_lengths[chunk_lengths > paragraph_length]
+    too_small_chunks = chunk_lengths[chunk_lengths < paragraph_length * 0.25]  # Example threshold
     assert np.all(
         chunk_lengths <= paragraph_length
     ), f"{paragraph_length = }: {larger_chunks} are too large"
+    assert len(too_small_chunks) == 0, (
+        f"Found chunks smaller than {paragraph_length * 0.25} words: {too_small_chunks}"
+    )

cognee/tests/unit/processing/chunks/test_input.py (2)
3-275: Consider organizing test data for better maintainability. While the test data provides good coverage with different text types, consider these improvements:
- Move each test input to a separate file in a test_data directory
- Use a helper function to load test data from files
- Add docstrings describing the purpose of each test input

Example structure:

from pathlib import Path

def load_test_data(filename: str) -> str:
    """Load test data from the test_data directory."""
    data_dir = Path(__file__).parent / "test_data"
    with open(data_dir / filename, "r", encoding="utf-8") as f:
        return f.read()

INPUT_TEXTS = {
    "english_lists": load_test_data("english_lists.txt"),
    "python_code": load_test_data("python_code.py"),
    "chinese_text": load_test_data("chinese_text.txt"),
    "english_text": load_test_data("english_text.txt"),
}
133-134: Address TODO comment in test data. The TODO comment in the Python code test data suggests incomplete type hints. Consider either:
- Implementing the suggested overloads, or
- Using a simpler code example without TODOs
cognee/tasks/chunks/chunk_by_word.py (1)
58-60: Clarify ellipsis handling for improved readability. In the ellipsis ("...") handling, adding '..' to current_chunk after the first '.' can be confusing. For clarity, consider appending the exact substring representing the ellipsis.

Apply this diff to make the ellipsis handling clearer:

-    current_chunk += ".."
+    current_chunk += data[i+1:i+3]

This change appends the next two dots directly, making it evident that a full ellipsis is being processed.

cognee/tasks/chunks/chunk_by_paragraph.py (5)

59-59: Avoid resetting chunk_index to 0 in non-batch mode. Resetting chunk_index to 0 on line 59 may lead to duplicate chunk_index values across different chunks, which can cause issues if chunk_index is intended to be a unique identifier within the entire data set. Consider removing the reset to maintain a consistent incrementing of chunk_index.

Apply this diff to remove the reset of chunk_index:

-    chunk_index = 0

21-27: Refactor duplicate chunk creation code into a helper function. The code for creating chunk_dict is repeated in multiple places (lines 21-27, 48-55, and 67-73). Refactoring this code into a helper function can improve maintainability and reduce redundancy.

Define a helper function to create chunk_dict:

def create_chunk_dict(text, word_count, chunk_index, cut_type, chunk_id, paragraph_id):
    chunk_dict = {
        "text": text.strip(),
        "word_count": word_count,
        "chunk_id": chunk_id,
        "chunk_index": chunk_index,
        "cut_type": cut_type,
    }
    chunk_dict["id"] = chunk_id if batch_paragraphs else paragraph_id
    return chunk_dict

Then replace the duplicated code with calls to create_chunk_dict().

Also applies to: 48-55, 67-73

14-14: Initialize last_cut_type before its first use. The variable last_cut_type is used when creating chunk_dict, but it might not be initialized if the loop doesn't execute, potentially leading to a NameError. To prevent this, initialize last_cut_type before the loop.

Apply this diff to initialize last_cut_type:

-    last_cut_type = None
+    last_cut_type = ''

10-12: Consider initializing last_paragraph_id before the loop. Similar to last_cut_type, last_paragraph_id is used when assigning chunk_dict["id"] in non-batch mode but might not be initialized if the loop doesn't execute. Initialize it before the loop to prevent any unexpected errors.

Apply this diff to initialize last_paragraph_id:

-    last_paragraph_id = None
+    last_paragraph_id = ''

16-16: Use a descriptive name for the underscore variable to enhance readability. In the for-loop on line 16, an underscore _ is used as a placeholder for an unused variable. A descriptive name or a comment can improve code readability and maintainability.

Consider updating the loop to clarify the purpose of the ignored value:

-    for paragraph_id, _, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length):
+    for paragraph_id, _unused, sentence, word_count, end_type in chunk_by_sentence(data, maximum_length=paragraph_length):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
- cognee/modules/chunking/TextChunker.py (2 hunks)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
- cognee/tasks/chunks/chunk_by_word.py (1 hunks)
- cognee/tests/unit/documents/PdfDocument_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tests/unit/processing/chunks/chunk_by_word_test.py
40-40: Avoid equality comparisons to False; use "if not space_test:" for false checks. Replace with "not space_test" (E712)
🔇 Additional comments (20)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1)
1-1: Verify numpy import usage. The numpy import appears to be unused in this test file.
cognee/tasks/chunks/chunk_by_sentence.py (1)
5-5: LGTM: Type hint import is appropriate. The addition of Optional from typing aligns with Python's type hinting best practices.
cognee/tests/unit/documents/PdfDocument_test.py (1)
7-8: LGTM! Verify the updated ground truth values. The changes to the expected values appear to align with improvements in the text processing logic. However, let's verify these numbers are correct.
✅ Verification successful
Ground truth values have been successfully verified. No further action is needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the ground truth values by examining the test data and related chunking logic
# Check if the test file exists and get its size
echo "Checking test file..."
ls -l $(find . -name "artificial-intelligence.pdf")
# Look for related chunking tests that might validate these numbers
echo "Looking for related test cases..."
rg -l "artificial-intelligence.pdf" --type py
# Check for any other occurrences of these specific numbers
echo "Checking for other occurrences of these numbers..."
rg "5607|6363|879|953" --type py
Length of output: 844
cognee/tests/unit/processing/chunks/chunk_by_word_test.py (1)
1-6: LGTM! Clean and well-organized imports. The imports are properly structured with standard libraries first, followed by local imports. Good practice separating test inputs into a dedicated file.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2)
42-43: LGTM: Input text formatting is consistent. The input text formatting changes align properly with the modifications in GROUND_TRUTH.

Also applies to: 45-46
6-7: Verify alignment with chunking implementation. The changes to the test data include:
- Addition of newline characters affecting word counts
- Change of cut_type from "sentence_cut" to "word"
Let's verify these changes align with the actual implementation.
Also applies to: 11-12, 35-35
✅ Verification successful
Test data aligns with chunking implementation changes: the cut_type changes in the test data correctly reflect the updates in the chunking implementation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the chunking implementation matches the test expectations
# Check the chunking implementation for the cut_type logic
ast-grep --pattern 'def chunk_by_paragraph($$$)' -A 20
# Look for other tests that might need similar updates
fd -e py -E 'venv' -x rg -l 'sentence_cut|word.*cut_type'
Length of output: 316
Script:
#!/bin/bash
# Search for the definition and usage of 'cut_type' in chunk_by_paragraph.py
rg 'cut_type' cognee/tasks/chunks/chunk_by_paragraph.py
# Examine related test files for 'cut_type' usage
fd -e py --type f cognee/tests/unit/processing/chunks/ -x rg 'cut_type'
Length of output: 752
Script:
#!/bin/bash
# Search for the definition and usage of 'cut_type' in chunk_by_paragraph.py
rg 'cut_type' cognee/tasks/chunks/chunk_by_paragraph.py
# Examine related test files for 'cut_type' usage with corrected fd command
fd . 'cognee/tests/unit/processing/chunks/' -e py -x rg 'cut_type'
Length of output: 586
cognee/modules/chunking/TextChunker.py (2)
19-19: LGTM: Good improvement in state management. Moving paragraph_chunks to method scope prevents unintended state persistence between read() calls, making the code more predictable and thread-safe.
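To illustrate the hazard being avoided, here is a minimal sketch with a hypothetical class (not TextChunker itself):

class Accumulator:
    def __init__(self):
        self.chunks = []  # instance-level list: survives across read() calls

    def read_buggy(self):
        self.chunks.append("chunk")  # earlier calls' chunks are still present
        return self.chunks

    def read_fixed(self):
        chunks = []  # method-local list: fresh state on every call
        chunks.append("chunk")
        return chunks

acc = Accumulator()
assert len(acc.read_buggy()) == 1
assert len(acc.read_buggy()) == 2   # state leaked between calls
assert len(acc.read_fixed()) == 1   # always fresh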
Line range hint 19-67: Verify chunk size constraints are properly enforced. Let's ensure that the chunking logic properly handles edge cases and maintains chunk size constraints.
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)
1-6: LGTM! Well-organized imports. The imports are clean and well-structured, with clear separation between external libraries and internal modules.
8-36: LGTM! Comprehensive test parametrization. The test covers a good range of scenarios (a sketch for generating this matrix follows the list), varying:
- Text types (English, lists, Python code, Chinese)
- Paragraph lengths (64, 256, 1024)
- Batch modes (True, False)
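One compact way to generate exactly this matrix (a sketch; it assumes the INPUT_TEXTS keys and the chunk_by_paragraph import shown elsewhere in this review):

import itertools
import pytest

# 4 text types x 3 paragraph lengths x 2 batch modes = 24 parameter combinations.
TEXT_KEYS = ["english_text", "english_lists", "python_code", "chinese_text"]
MATRIX = list(itertools.product(TEXT_KEYS, [64, 256, 1024], [True, False]))

@pytest.mark.parametrize("text_key,paragraph_length,batch_paragraphs", MATRIX)
def test_chunk_by_paragraph_matrix(text_key, paragraph_length, batch_paragraphs):
    chunks = list(chunk_by_paragraph(INPUT_TEXTS[text_key], paragraph_length, batch_paragraphs))
    assert chunks  # each combination should produce at least one chunk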
cognee/tests/unit/processing/chunks/test_input.py (1)
1-2: LGTM! Clean and minimal import section with just the required pytest framework.
cognee/tasks/chunks/chunk_by_word.py (7)
3-4: Good practice: Define regex patterns as module-level constants. Defining SENTENCE_ENDINGS and PARAGRAPH_ENDINGS at the module level enhances readability and allows for easier maintenance if these patterns need to be updated in the future.

7-11: Enhancement: Added descriptive docstring. Including a detailed docstring for chunk_by_word improves code documentation and aids in understanding the function's purpose and usage.

17-23: Correct handling of leading whitespace and paragraph endings. The added loop effectively handles leading whitespace and paragraph endings, ensuring that any initial spaces or paragraph breaks are correctly yielded as separate chunks. This maintains the integrity of the original text when reconstructed.

39-46: Logical handling of paragraph endings. The updated logic correctly processes paragraph ending characters by yielding the current chunk and appropriately tagging the character based on whether it's a real paragraph end.

47-55: Accurate processing of spaces and words. The code accurately yields words upon encountering spaces and resets the current_chunk, ensuring that words and whitespace are correctly separated.

68-69: Ensure correct detection of paragraph ends after sentence endings. The logic for determining is_paragraph_end appropriately checks for paragraph endings following sentence-ending punctuation, enhancing the accuracy of chunk tagging.

76-77: Final yield of any remaining text. Yielding any remaining current_chunk at the end ensures that all text is processed and included in the output, preventing loss of data.
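Taken together, the behavior these comments describe can be summarized in a short usage sketch (the expected tuples are taken from the docstring example suggested earlier in this review, not from a run of the code, and the import path is assumed):

from cognee.tasks.chunks import chunk_by_word  # import path assumed

text = "Hello world. Goodbye.\n"
chunks = list(chunk_by_word(text))
# Per the documented contract, each item is a (chunk, chunk_type) tuple, e.g.
# ("Hello ", "word"), ("world.", "sentence_end"), ..., ("\n", "paragraph_end"),
# and joining the chunk texts with "" reproduces the input exactly.
assert "".join(chunk for chunk, _ in chunks) == text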
cognee/tasks/chunks/chunk_by_paragraph.py (2)
29-33: Ensure consistent id assignment between batch and non-batch modes. In batch mode, chunk_dict["id"] is set to chunk_dict["chunk_id"] (lines 30 and 76), while in non-batch mode it's set to last_paragraph_id (lines 32 and 78). This difference might lead to inconsistencies in how chunks are identified downstream. Verify that this behavior is intentional and that downstream processes handle the id field correctly in both modes.

Please confirm that the differing id assignments align with the expected usage in downstream components.

Also applies to: 74-78
46-46: Review the inclusion of 'sentence_cut' in end type handling. The condition on line 46 checks for end_type in ("paragraph_end", "sentence_cut"). Including 'sentence_cut' may cause the function to yield chunks prematurely if sentence_cut doesn't signify the end of a paragraph. Verify whether 'sentence_cut' should be included or if the condition should only check for 'paragraph_end'.

Confirm that including 'sentence_cut' aligns with the intended chunking logic.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (2)
46-54: Simplify test implementation and add documentation. The test implementation could be simplified and made more maintainable:
- Remove unnecessary list conversion
- Simplify the length checking logic
- Add proper documentation

 def test_paragraph_chunk_length(input_text, maximum_length):
+    """Test that chunk_by_sentence respects the maximum length constraint.
+
+    Args:
+        input_text: Text to be chunked
+        maximum_length: Maximum allowed length for each chunk
+
+    The test verifies that no chunk exceeds the specified maximum length
+    when split into words.
+    """
-    chunks = list(chunk_by_sentence(input_text, maximum_length))
+    chunks = chunk_by_sentence(input_text, maximum_length)
-    chunk_lengths = np.array([len(list(chunk_by_word(chunk[2]))) for chunk in chunks])
+    for i, chunk in enumerate(chunks):
+        words = list(chunk_by_word(chunk[2]))
+        chunk_length = len(words)
+        assert chunk_length <= maximum_length, (
+            f"Chunk {i} exceeds maximum length {maximum_length}:\n"
+            f"Length: {chunk_length}\n"
+            f"Content: {chunk[2]}"
+        )
-    larger_chunks = chunk_lengths[chunk_lengths > maximum_length]
-    assert np.all(
-        chunk_lengths <= maximum_length
-    ), f"{maximum_length = }: {larger_chunks} are too large"
11-23: Add edge cases to test parameterization. While the current test cases cover various languages and text types, consider adding these edge cases to test_input.py and including them in the parameterization:
- Empty string
- Single character/word
- Special characters and emojis
- Mixed language text
- Very long sentences

     [
         (INPUT_TEXTS["english_text"], None),
         (INPUT_TEXTS["english_text"], 8),
         (INPUT_TEXTS["english_text"], 64),
         (INPUT_TEXTS["english_lists"], None),
         (INPUT_TEXTS["english_lists"], 8),
         (INPUT_TEXTS["english_lists"], 64),
         (INPUT_TEXTS["python_code"], None),
         (INPUT_TEXTS["python_code"], 8),
         (INPUT_TEXTS["python_code"], 64),
         (INPUT_TEXTS["chinese_text"], None),
         (INPUT_TEXTS["chinese_text"], 8),
         (INPUT_TEXTS["chinese_text"], 64),
+        (INPUT_TEXTS["empty_string"], None),
+        (INPUT_TEXTS["single_word"], None),
+        (INPUT_TEXTS["special_chars"], None),
+        (INPUT_TEXTS["mixed_language"], None),
+        (INPUT_TEXTS["long_sentence"], None),
     ],
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
🧰 Additional context used
📓 Learnings (1)
cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
cognee/tests/unit/processing/chunks/test_input.py (1)
4-83: Consider adding docstring documentation. The test data structure is well-organized and provides comprehensive coverage. Consider adding a docstring to document the purpose and structure of each test case category.

Example addition:

 INPUT_TEXTS = {
+    # Test data for validating the text chunking system's handling of various content
+    # structures and edge cases. Each key represents a specific test scenario or category.
     "english_lists": """Let me think through the key attributes...""",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- cognee/tasks/chunks/chunk_by_sentence.py
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py
- cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py
🔇 Additional comments (4)
cognee/tests/unit/processing/chunks/test_input.py (4)
1-2: LGTM! The import is clean and appropriate for a test file.
84-172: LGTM! The Python code segment provides an excellent test case with complex nested structures, type hints, and various syntax elements.
174-274: Previous review feedback remains unaddressed. The English text still contains line numbers (e.g., [ 5 ], [ 10 ]) which could interfere with chunking tests. This was pointed out in a previous review. Consider cleaning up the text by removing these line numbers to ensure accurate chunking results.
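One way to strip them during cleanup (a sketch; the bracketed-number pattern is inferred from the examples above):

import re

def strip_line_numbers(text: str) -> str:
    # Remove bracketed markers like "[ 5 ]" or "[ 10 ]", plus one trailing space.
    return re.sub(r"\[\s*\d+\s*\]\s?", "", text)

assert strip_line_numbers("To be [ 5 ] or not") == "To be or not"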
275-282: LGTM! Previous feedback well addressed. Excellent implementation of the suggested edge cases, providing comprehensive coverage for:
- Empty text
- Single-character text
- Whitespace variations
- Unicode/special characters
- Mixed line endings
- HTML content
- URLs and emails
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (4)
cognee/tasks/chunks/chunk_by_word.py (2)
18-23: Enhance function documentation with type hints and complete docstring. The function documentation needs improvement to follow standard Python practices.

-def chunk_by_word(data: str):
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+
+    Args:
+        data: The input text to be chunked
+
+    Returns:
+        Iterator yielding tuples of (chunk, chunk_type) where chunk_type is one of:
+        - "word": A word with its trailing whitespace
+        - "sentence_end": End of sentence marker
+        - "paragraph_end": End of paragraph marker
+
+    Raises:
+        TypeError: If input is not a string
     """
28-34: Optimize leading whitespace handling. The current implementation can be optimized by combining the regex checks.

-    while i < len(data) and (re.match(PARAGRAPH_ENDINGS, data[i]) or data[i] == " "):
+    WHITESPACE_PATTERN = re.compile(f"{PARAGRAPH_ENDINGS}|\\s")
+    while i < len(data) and WHITESPACE_PATTERN.match(data[i]):
         current_chunk += data[i]
         i += 1

cognee/tasks/chunks/chunk_by_paragraph.py (2)
5-9: Enhance function documentation with parameter descriptions. While the docstring explains the purpose well, it would be more helpful to include descriptions of the parameters and return value.

Add parameter descriptions like this:

def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
    """
    Chunks text by paragraph while preserving exact text reconstruction capability.
    When chunks are joined with empty string "", they reproduce the original text exactly.

    Args:
        data: Input text to be chunked
        paragraph_length: Maximum number of words per chunk
        batch_paragraphs: If True, treats multiple paragraphs as a single chunk

    Returns:
        Iterator of dictionaries containing chunk information
    """
45-45: Consider using constants for end types. Using string literals for end types could lead to maintenance issues. Consider defining constants for these values.

Add constants at the module level:

PARAGRAPH_END = "paragraph_end"
SENTENCE_CUT = "sentence_cut"

Then update the condition:

-if end_type in ("paragraph_end", "sentence_cut") and not batch_paragraphs:
+if end_type in (PARAGRAPH_END, SENTENCE_CUT) and not batch_paragraphs:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunks/chunk_by_word.py (1 hunks)
🔇 Additional comments (4)
cognee/tasks/chunks/chunk_by_word.py (2)
58-61: The existing comment about improving ellipsis handling robustness is still valid.
76-77: LGTM! The final chunk handling is correct and ensures no content is lost.
cognee/tasks/chunks/chunk_by_paragraph.py (2)
10-14: LGTM: Clear state initialization. The state variables are well-defined and properly initialized for tracking chunk content, word count, and indices.
41-41: Add spaces between sentences. The current implementation concatenates sentences without spaces, which affects text reconstruction accuracy.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
cognee/infrastructure/engine/models/DataPoint.py (1)
23-27: Add validation and improve robustness of data handling. While the addition of type checking and string stripping is good, consider these improvements for better robustness:
- Add None validation
- Handle empty strings
- Consider processing all index fields, not just the first one
Here's a suggested improvement:
 def get_embeddable_data(self):
+    """Retrieves and processes data based on configured index fields.
+
+    Returns:
+        Processed attribute value(s) based on index_fields configuration.
+        For string values, returns stripped version.
+        For other types, returns as-is.
+        Returns None if attribute is None or empty string.
+    """
     if self._metadata and len(self._metadata["index_fields"]) > 0 \
         and hasattr(self, self._metadata["index_fields"][0]):
         attribute = getattr(self, self._metadata["index_fields"][0])
+        if attribute is None:
+            return None
         if isinstance(attribute, str):
-            return(attribute.strip())
+            stripped = attribute.strip()
+            return stripped if stripped else None
         else:
-            return (attribute)
+            return attribute

cognee/tasks/chunks/chunk_by_word.py (3)
19-23: Enhance function documentation. While the docstring explains the basic functionality, it could be more comprehensive:
- Add type hints for parameters and return value
- Document the yield tuple format
- Include usage examples
Apply this enhancement:
 def chunk_by_word(data: str):
     """
     Chunks text into words and endings while preserving whitespace.
     Whitespace is included with the preceding word.
     Outputs can be joined with "" to recreate the original input.
+
+    Args:
+        data: Input text to be chunked
+
+    Yields:
+        tuple[str, str]: A tuple containing (chunk, chunk_type) where chunk_type is one of:
+        - "word": A word or whitespace
+        - "sentence_end": End of a sentence
+        - "paragraph_end": End of a paragraph
+
+    Examples:
+        >>> list(chunk_by_word("Hello world."))
+        [("Hello ", "word"), ("world.", "sentence_end")]
     """
25-27: Consider using a dataclass for state management. The function maintains multiple state variables (last_processed_character, current_chunk, i). Consider encapsulating these in a dataclass to make the code more maintainable and easier to test.

Example implementation:

from dataclasses import dataclass

@dataclass
class ChunkerState:
    last_char: str = ""
    current_chunk: str = ""
    position: int = 0
31-36: Simplify paragraph ending logic. The paragraph ending logic is complex and could be simplified. Consider extracting the logic into a separate method that handles both the yield and state updates.

Example:

def handle_paragraph_ending(self, char: str, state: ChunkerState) -> Iterator[Tuple[str, str]]:
    if state.current_chunk:
        yield (state.current_chunk, "word")
    yield (char, "paragraph_end" if is_real_paragraph_end(state.last_char, state.position, self.data) else "word")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- cognee/infrastructure/engine/models/DataPoint.py (1 hunks)
- cognee/tasks/chunks/chunk_by_word.py (1 hunks)
- cognee/tests/unit/processing/chunks/test_input.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/unit/processing/chunks/test_input.py
🔇 Additional comments (2)
cognee/infrastructure/engine/models/DataPoint.py (1)
23-27: Verify the impact of string stripping on downstream consumers. The new string stripping behavior might affect systems that depend on the original whitespace-preserved strings.
Let's check for potential consumers:
✅ Verification successful
Based on the gathered information, I can now provide the final response:
String stripping in get_embeddable_data() is safe to implement
The string stripping behavior is safe to add because:
- The embeddable data is only used by vector database adapters (Weaviate, PGVector, LanceDB, QDrant) for text embedding
- The embedding process through LiteLLMEmbeddingEngine doesn't rely on whitespace preservation
- No tests verify exact string content or have assertions that would be affected by whitespace trimming
- Vector databases use the text purely for semantic search, where leading/trailing whitespace is not meaningful
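For illustration, the stripping behavior under discussion amounts to this minimal sketch (DataPoint internals are simplified and hypothetical, not the real class):

class DataPointSketch:
    def __init__(self, text):
        self.text = text
        self._metadata = {"index_fields": ["text"]}

    def get_embeddable_data(self):
        # Look up the first configured index field and strip strings.
        attribute = getattr(self, self._metadata["index_fields"][0])
        return attribute.strip() if isinstance(attribute, str) else attribute

assert DataPointSketch("  hello  ").get_embeddable_data() == "hello"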
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for direct usage of get_embeddable_data
rg -l "get_embeddable_data" | while read -r file; do
echo "=== $file ==="
rg -A 5 "get_embeddable_data" "$file"
done
# Search for tests that might need updating
fd -e py test_ | while read -r file; do
echo "=== $file ==="
rg -A 5 "DataPoint.*get_embeddable_data" "$file"
done
Length of output: 4403
Script:
#!/bin/bash
# Let's check if there are any tests that verify the exact string content
# This will help us understand if there are any expectations around whitespace
# Search in test files for assertions involving DataPoint
fd -e py test_ | while read -r file; do
echo "=== $file ==="
rg -B 2 -A 2 "assert.*DataPoint" "$file"
done
# Also search for any string comparisons in tests that might be affected
fd -e py test_ | while read -r file; do
echo "=== $file ==="
rg -B 2 -A 2 "assertEqual.*str" "$file"
done
# Let's also check the embedding engines to see how they handle the text
rg -l "class.*EmbeddingEngine" | while read -r file; do
echo "=== $file ==="
rg -B 2 -A 5 "def embed" "$file"
done
Length of output: 3641
cognee/tasks/chunks/chunk_by_word.py (1)
3-4: LGTM! Well-defined regex constants. The regex patterns are clear, appropriately named, and include comprehensive character sets for both sentence and paragraph endings.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)
61-61: Remove unused variable assignment. The variable last_paragraph_id is assigned but never used in the code.

Apply this diff to remove the unused assignment:

-    last_paragraph_id = paragraph_id
🧰 Tools
🪛 Ruff
61-61: Local variable last_paragraph_id is assigned to but never used. Remove assignment to unused variable last_paragraph_id (F841)
5-9: Add input validation and error handling. The function would benefit from proper input validation and error handling:
- Validate that data is not None or empty
- Ensure paragraph_length is positive
- Document the exceptions that can be raised

Consider adding these validations:

def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
    """
    Chunks text by paragraph while preserving exact text reconstruction capability.

    Args:
        data: Input text to chunk
        paragraph_length: Maximum length of each chunk (must be positive)
        batch_paragraphs: Whether to batch paragraphs together

    Returns:
        Iterator of chunk dictionaries

    Raises:
        ValueError: If data is empty or paragraph_length is not positive
    """
    if not data:
        raise ValueError("Input text cannot be empty")
    if paragraph_length <= 0:
        raise ValueError("paragraph_length must be positive")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
🧰 Additional context used
🪛 Ruff
cognee/tasks/chunks/chunk_by_paragraph.py
61-61: Local variable `last_paragraph_id` is assigned to but never used. Remove assignment to unused variable `last_paragraph_id` (F841)
🔇 Additional comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)
`1-5`: **LGTM! Improved type safety with proper annotations.** The addition of type hints and return type annotation enhances code clarity and maintainability.
`39-41`: **Add spaces between concatenated sentences.** Currently, sentences are concatenated without spaces, which could affect text reconstruction and readability. Apply this diff to fix the concatenation:

```diff
- current_chunk += sentence
+ current_chunk += sentence + " "
```

Remember to strip any trailing whitespace when creating the chunk dictionary:

```diff
- "text": current_chunk,
+ "text": current_chunk.strip(),
```
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_word.py (2)
`3-4`: **Consider enhancing regex patterns for better internationalization.** The current patterns could be expanded to handle more cases:

- Unicode ellipsis (…) in `SENTENCE_ENDINGS`
- Different line ending styles in `PARAGRAPH_ENDINGS`

```diff
-SENTENCE_ENDINGS = r"[.;!?…]"
-PARAGRAPH_ENDINGS = r"[\n\r]"
+SENTENCE_ENDINGS = r"[.;!?…]|\.{3}"
+PARAGRAPH_ENDINGS = r"[\n\r\f\v]|\r\n"
```
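A quick sanity check of the suggested patterns (a sketch; these extended alternations are the review's proposal, not necessarily the merged constants):

```python
import re

SENTENCE_ENDINGS = r"[.;!?…]|\.{3}"
PARAGRAPH_ENDINGS = r"[\n\r\f\v]|\r\n"

# The extended patterns match the Unicode ellipsis, a three-dot
# ellipsis, and Windows-style line endings.
assert re.search(SENTENCE_ENDINGS, "Wait…")
assert re.search(SENTENCE_ENDINGS, "Wait...")
assert re.search(PARAGRAPH_ENDINGS, "line one\r\nline two")
```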
`35-40`: **Add type hints to function signature.** The docstring is good, but the function is missing a return type annotation.

```diff
-def chunk_by_word(data: str):
+def chunk_by_word(data: str) -> Iterator[Tuple[str, str]]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- cognee/tasks/chunks/chunk_by_word.py (1 hunks)
- cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
- cognee/tests/integration/documents/ImageDocument_test.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- cognee/tests/integration/documents/AudioDocument_test.py
- cognee/tests/integration/documents/ImageDocument_test.py
- cognee/tests/integration/documents/TextDocument_test.py
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
🔇 Additional comments (2)
cognee/tasks/chunks/chunk_by_word.py (2)
`6-33`: **Well-structured helper function with proper documentation.** The implementation includes:

- Clear type hints
- Comprehensive docstring
- Proper bounds checking
- Clear logic flow
`55-66`: **Simplify sentence ending detection logic.** The current implementation has nested conditions and multiple continue statements. A previous review comment suggested a more streamlined approach, which is still valid.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (4)
cognee/tasks/chunks/chunk_by_sentence.py (2)
`19-29`: **Optimize the section end detection logic.** The current implementation continues checking characters even after `section_end` is set to "sentence_cut". Consider this optimization:

```diff
 if word_type in ["paragraph_end", "sentence_end"]:
     section_end = word_type
 else:
+    if section_end:  # Skip if we already have a section end type
+        continue
     for character in word:
         if character.isalpha():
             section_end = "sentence_cut"
             break
```
Line range hint `30-36`: **Add validation for the maximum_length parameter.** The function should validate that `maximum_length` is positive when provided. Add this validation at the start of the function:

```diff
 def chunk_by_sentence(data: str, maximum_length: Optional[int] = None):
+    if maximum_length is not None and maximum_length <= 0:
+        raise ValueError("maximum_length must be a positive integer")
     sentence = ""
     paragraph_id = uuid4()
```

cognee/tests/integration/documents/AudioDocument_test.py (1)
`34-36`: **Consider optimizing the iteration pattern.** The current implementation reads the document twice, once for creating the list and once for iteration. Consider using a more efficient approach.

```diff
-    for ground_truth, paragraph_data in zip(
-        GROUND_TRUTH, document.read(chunk_size=64)
-    ):
+    chunks = list(document.read(chunk_size=64))
+    assert len(chunks) == len(GROUND_TRUTH), "Unexpected number of chunks"
+    for ground_truth, paragraph_data in zip(GROUND_TRUTH, chunks):
```

cognee/tasks/chunks/chunk_by_paragraph.py (1)
`73-73`: **Fix excessive indentation.** Remove the excessive indentation on this empty line.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunks/chunk_by_sentence.py (2 hunks)
- cognee/tests/integration/documents/AudioDocument_test.py (1 hunks)
- cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/integration/documents/TextDocument_test.py
🧰 Additional context used
🪛 Ruff
cognee/tests/integration/documents/AudioDocument_test.py
33-33: Ambiguous variable name: `l` (E741)

33-33: Local variable `l` is assigned to but never used. Remove assignment to unused variable `l` (F841)
🔇 Additional comments (9)
cognee/tasks/chunks/chunk_by_sentence.py (2)
`13-13`: **LGTM: Clear variable initialization.** The `section_end` variable is well-named and properly initialized.

`43-43`: **LGTM: Improved final chunk handling.** Using `section_end` instead of `word_type` correctly identifies the type of the final chunk.
cognee/tests/integration/documents/AudioDocument_test.py (2)
`1-25`: **LGTM! Well-structured test data.** The test data is realistic and well-organized, with a clear mapping between GROUND_TRUTH and TEST_TEXT content.

`29-31`: **Previous concern about empty raw_data_location remains valid.** As noted in the previous review, empty `raw_data_location` handling needs verification. Consider adding validation in AudioDocument or documenting why an empty value is acceptable.
cognee/tasks/chunks/chunk_by_paragraph.py (5)
`2-9`: **LGTM! Well-documented function signature.** The addition of type hints and clear documentation about text reconstruction capability improves code clarity and maintainability.
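The documented guarantee can be exercised directly; a minimal sketch, assuming the import path below and the ""-join reconstruction property stated in the docstring:

```python
from cognee.tasks.chunks import chunk_by_paragraph

text = "First paragraph.\nSecond paragraph, somewhat longer than the first."
chunks = chunk_by_paragraph(text, paragraph_length=8, batch_paragraphs=False)

# Joining the chunk texts with "" must reproduce the input exactly.
assert "".join(chunk["text"] for chunk in chunks) == text
```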
`10-15`: **LGTM! Clear variable initialization.** The variables are well-named and their purposes are clear, setting up the necessary state for the chunking process.
`40`: **Add spaces between concatenated sentences.** This issue was previously identified and is still present. Sentences are concatenated without spaces, which could affect text reconstruction and readability.
`22-29`: 🛠️ **Refactor suggestion: Reduce code duplication in chunk dictionary creation.** The chunk dictionary creation logic is duplicated in three places. This violates the DRY principle and makes maintenance harder. Create a helper method to reduce duplication:

```python
from typing import Any, Dict, Optional
from uuid import NAMESPACE_OID, UUID, uuid5

def create_chunk_dict(
    text: str,
    word_count: int,
    chunk_index: int,
    cut_type: str,
    paragraph_ids: list,
    chunk_id: Optional[UUID] = None,
) -> Dict[str, Any]:
    # Derive a deterministic chunk_id from the text unless one is given.
    return {
        "text": text,
        "word_count": word_count,
        "chunk_id": chunk_id or uuid5(NAMESPACE_OID, text),
        "paragraph_ids": paragraph_ids,
        "chunk_index": chunk_index,
        "cut_type": cut_type,
    }
```

Also applies to: 46-53, 64-71
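If the helper above were adopted, each of the three call sites would collapse to a single call. For illustration (values are made up; this assumes the helper and imports shown above):

```python
chunk = create_chunk_dict(
    text="First paragraph.",
    word_count=2,
    chunk_index=0,
    cut_type="paragraph_end",
    paragraph_ids=[],
)
# The id is derived deterministically from the text.
assert chunk["chunk_id"] == uuid5(NAMESPACE_OID, "First paragraph.")
```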
`16-17`: **Verify handling of long sentences.** The function should be tested with sentences that exceed the `paragraph_length` to ensure proper handling. Let's verify whether there are any extremely long sentences in the test files:
✅ Verification successful
Let's check the actual test cases and implementation to verify how long sentences are handled:
Based on the implementation and test data review, I can now provide the final response:
**Long sentences are properly handled; no action needed.** The implementation in `chunk_by_sentence.py` correctly handles long sentences by enforcing the `maximum_length` parameter at the word level. When a sentence reaches the maximum length, it is yielded automatically regardless of whether a sentence boundary was found. This is verified by:

- The implementation explicitly checks: `if word_type in ["paragraph_end", "sentence_end"] or (maximum_length and (word_count == maximum_length))`
- Test cases include comprehensive validation through `test_paragraph_chunk_length()`, which asserts that no chunks exceed the maximum length
- The test data includes various challenging inputs, including long prose (Paradise Lost), complex multilingual text, and structured content
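The kind of guard described above can be sketched like this (a simplified stand-in for the suite's length test; the import path and word-count key are taken from the chunk dictionary format documented in this review):

```python
from cognee.tasks.chunks import chunk_by_paragraph

def test_no_chunk_exceeds_paragraph_length():
    # A long run of words with no sentence terminators forces the
    # maximum-length cut path.
    text = "word " * 5000
    for chunk in chunk_by_paragraph(text, paragraph_length=64, batch_paragraphs=True):
        assert chunk["word_count"] <= 64
```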
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for test files and analyze sentence lengths
fd -e py -e txt test |
while read -r file; do
  echo "Analyzing $file:"
  # Look for long strings that might be test sentences
  rg -l '.{1024,}' "$file" || true
done
```
Length of output: 5038
Script:
```bash
#!/bin/bash
# Check test cases for chunk_by_paragraph
rg -A 10 "def test" "cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py" || true

# Check the implementation of chunk_by_sentence since it's used by chunk_by_paragraph
rg -l "def chunk_by_sentence" || true
```
Length of output: 427
Script:
```bash
#!/bin/bash
# Check the implementation of chunk_by_sentence and its test cases
rg -A 10 "def chunk_by_sentence" "cognee/tasks/chunks/chunk_by_sentence.py" || true

# Check test cases for chunk_by_sentence
rg -A 10 "def test" "cognee/tests/unit/processing/chunks/chunk_by_sentence_test.py" || true

# Check test input data
rg -A 10 "whole_text|cut_text" "cognee/tests/unit/processing/chunks/test_input.py" || true
```
Length of output: 1337
Script:
```bash
#!/bin/bash
# Check the actual implementation of chunk_by_sentence to see how it handles long sentences
cat "cognee/tasks/chunks/chunk_by_sentence.py"

# Check the test input data file content
cat "cognee/tests/unit/processing/chunks/test_input.py"
```
Length of output: 10230
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
cognee/tasks/chunks/chunk_by_paragraph.py (2)
`10-15`: **Consider adding type hints to initialized variables.** Adding type hints to the initialized variables would improve code clarity and maintainability.

```diff
-    current_chunk = ""
-    current_word_count = 0
-    chunk_index = 0
-    paragraph_ids = []
-    last_cut_type = None
+    current_chunk: str = ""
+    current_word_count: int = 0
+    chunk_index: int = 0
+    paragraph_ids: list[str] = []
+    last_cut_type: str | None = None
```
`73-74`: **Fix inconsistent indentation.** The indentation of these empty lines is inconsistent with the surrounding code.

```diff
-
-    yield chunk_dict
+    yield chunk_dict
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py
🔇 Additional comments (3)
cognee/tasks/chunks/chunk_by_paragraph.py (3)
`1-9`: **LGTM! Good improvements to type safety and documentation.** The addition of type hints and the clear docstring about text reconstruction capability are valuable improvements.
`62-72`: **LGTM! Good handling of remaining text.** The logic for handling the final chunk properly considers the last cut type and ensures no text is lost.
`22-29`: 🛠️ **Refactor suggestion: Extract duplicated chunk dictionary creation logic.** The chunk dictionary creation pattern is repeated three times in the code. This violates the DRY principle and makes maintenance harder. The previous suggestion to create a helper method still applies; this would reduce duplication and improve maintainability.
Also applies to: 46-53, 64-71
```diff
@@ -9,26 +9,25 @@ class TextChunker():
     chunk_index = 0
     chunk_size = 0
     paragraph_chunks = []
```
I don't know if `paragraph_chunks` is used somewhere else. I mean, it can be deleted, but I think it's good to be able to see what the chunks were from a debugging point of view, because this way they're going to be local and lost once the method finishes. But maybe @borisarzentar has more context here.
Within `TextChunker`, `paragraph_chunks` is only used within `read`. The attribute is never accessed from outside the class in the repo, and for a dynamic attribute like this one, doing so would not be good practice anyway, since it would break encapsulation.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (2)
`13-22`: **Consider enhancing error reporting and metadata validation.** While the isomorphism test is well-designed, consider these improvements:

- Make the error message more descriptive by showing the first differing position
- Add validation for chunk metadata (e.g., start/end positions)

```diff
 def test_chunk_by_paragraph_isomorphism(input_text, paragraph_length, batch_paragraphs):
     chunks = chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)
     reconstructed_text = "".join([chunk["text"] for chunk in chunks])
+    # Validate chunk metadata
+    prev_end = 0
+    for chunk in chunks:
+        assert chunk["start"] == prev_end, f"Gap between chunks at position {prev_end}"
+        assert chunk["end"] == prev_end + len(chunk["text"]), f"Invalid end position for chunk"
+        prev_end = chunk["end"]
+
     assert (
         reconstructed_text == input_text
-    ), f"texts are not identical: {len(input_text) = }, {len(reconstructed_text) = }"
+    ), (
+        f"Texts differ at position {next((i for i, (a, b) in enumerate(zip(input_text, reconstructed_text)) if a != b), -1)}\n"
+        f"Input length: {len(input_text)}, Reconstructed length: {len(reconstructed_text)}"
+    )
```
`42-53`: **Consider memory optimization and enhanced error reporting.** While the chunk numbering test is effective, consider these improvements:

```diff
 def test_chunk_by_paragraph_chunk_numbering(
     input_text, paragraph_length, batch_paragraphs
 ):
-    chunks = chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)
-    chunk_indices = np.array([chunk["chunk_index"] for chunk in chunks])
-    assert np.all(
-        chunk_indices == np.arange(len(chunk_indices))
-    ), f"{chunk_indices = } are not monotonically increasing"
+    prev_index = -1
+    for i, chunk in enumerate(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)):
+        current_index = chunk["chunk_index"]
+        assert current_index == i, (
+            f"Invalid chunk index at position {i}.\n"
+            f"Expected index: {i}, got: {current_index}\n"
+            f"Chunk text: {chunk['text'][:100]}..."
+        )
+        assert current_index > prev_index, (
+            f"Non-monotonic chunk index at position {i}.\n"
+            f"Previous index: {prev_index}, Current index: {current_index}"
+        )
+        prev_index = current_index
```
def test_chunk_by_paragraph_chunk_numbering( input_text, paragraph_length, batch_paragraphs ): - chunks = chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs) - chunk_indices = np.array([chunk["chunk_index"] for chunk in chunks]) - assert np.all( - chunk_indices == np.arange(len(chunk_indices)) - ), f"{chunk_indices = } are not monotonically increasing" + prev_index = -1 + for i, chunk in enumerate(chunk_by_paragraph(input_text, paragraph_length, batch_paragraphs)): + current_index = chunk["chunk_index"] + assert current_index == i, ( + f"Invalid chunk index at position {i}.\n" + f"Expected index: {i}, got: {current_index}\n" + f"Chunk text: {chunk['text'][:100]}..." + ) + assert current_index > prev_index, ( + f"Non-monotonic chunk index at position {i}.\n" + f"Previous index: {prev_index}, Current index: {current_index}" + ) + prev_index = current_indexcognee/tests/unit/modules/graph/cognee_graph_test.py (1)
Line range hint `1-82`: **Consider adding edge cases to strengthen test coverage.** While the current test suite is well-structured and covers the main functionality, consider adding these test cases to improve coverage:
- Test with empty/null node IDs
- Test edge creation with the same source and target node
- Test concurrent modifications to the graph
- Test with large numbers of nodes/edges for performance
Would you like me to help generate these additional test cases?
cognee/tasks/chunks/chunk_by_paragraph.py (2)
`6-9`: **Enhance docstring with parameter and return value documentation.** While the current docstring explains the core functionality, it would be more helpful with:

- Parameter descriptions (`data`, `paragraph_length`, `batch_paragraphs`)
- Return value format
- Example usage

```diff
 def chunk_by_paragraph(data: str, paragraph_length: int = 1024, batch_paragraphs: bool = True) -> Iterator[Dict[str, Any]]:
     """
     Chunks text by paragraph while preserving exact text reconstruction capability.
     When chunks are joined with empty string "", they reproduce the original text exactly.
+
+    Args:
+        data: Input text to be chunked
+        paragraph_length: Maximum number of words per chunk
+        batch_paragraphs: If True, combines paragraphs up to paragraph_length
+
+    Returns:
+        Iterator yielding dictionaries containing:
+        - text: The chunk text
+        - word_count: Number of words in the chunk
+        - chunk_id: UUID5 identifier for the chunk
+        - paragraph_ids: List of paragraph identifiers included
+        - chunk_index: Sequential index of the chunk
+        - cut_type: Type of cut that ended this chunk
+
+    Example:
+        >>> text = "First paragraph. Second one."
+        >>> chunks = list(chunk_by_paragraph(text, paragraph_length=5))
     """
```
`60-70`: **LGTM with a minor formatting issue.** The final chunk handling logic is correct, but there's unnecessary whitespace at line 71 that should be removed.

```diff
         }
-
-        yield chunk_dict
```

Also applies to: 72-72
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- cognee/tasks/chunks/chunk_by_paragraph.py (1 hunks)
- cognee/tasks/chunks/chunk_by_sentence.py (1 hunks)
- cognee/tests/unit/modules/graph/cognee_graph_elements_test.py (10 hunks)
- cognee/tests/unit/modules/graph/cognee_graph_test.py (5 hunks)
- cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cognee/tests/unit/modules/graph/cognee_graph_elements_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/tasks/chunks/chunk_by_sentence.py
🧰 Additional context used
📓 Learnings (1)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (1)
Learnt from: 0xideas
PR: topoteretes/cognee#205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
🔇 Additional comments (9)
cognee/tests/unit/processing/chunks/chunk_by_paragraph_2_test.py (3)
`1-11`: **LGTM! Well-structured test setup.** The imports and test parameters are well organized. The use of `itertools.product` for test parameterization is a good practice, providing comprehensive coverage across different chunk sizes and batching configurations.
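For reference, the pattern looks roughly like this (a schematic, not the test file verbatim; the parameter values are those named in this review):

```python
from itertools import product

import pytest

@pytest.mark.parametrize(
    "paragraph_length,batch_paragraphs",
    list(product([64, 256, 1024], [True, False])),
)
def test_chunking_configuration(paragraph_length, batch_paragraphs):
    # Each of the six (length, batching) combinations runs as its own case.
    assert paragraph_length > 0
```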
`25-39`: **Previous optimization suggestions are still applicable.** The current implementation could benefit from the memory optimization and enhanced error reporting suggestions made in the previous review.
`9-10`: **Verify test parameters against current chunking implementation.** Based on previous learnings, ensure that the paragraph lengths (64, 256, 1024) align with the current chunking implementation in `cognee/tasks/chunks`.
✅ Verification successful
Test parameters align with chunking implementation
The test values are correct as they test various scenarios around the default paragraph length of 1024 (as seen in the implementation). The test covers:
- Small chunks (64 words)
- Medium chunks (256 words)
- Default size chunks (1024 words)
- Both batched and non-batched paragraph modes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if the test parameters align with the chunking implementation
# Search for relevant configuration or constants in the chunking implementation
rg -A 5 "DEFAULT_PARAGRAPH_LENGTH|MAX_PARAGRAPH_LENGTH|MIN_PARAGRAPH_LENGTH" cognee/tasks/chunks/
```
Length of output: 99
Script:
```bash
#!/bin/bash
# Let's try a broader search for paragraph-related configurations and implementations
# Search for files containing paragraph-related code in chunks directory
fd . cognee/tasks/chunks/ --type f --exec rg -l "paragraph" {}

# Search for actual paragraph length usage in the implementation
rg -A 3 "paragraph.*length" cognee/tasks/chunks/

# Look for the chunking implementation file
ast-grep --pattern 'def chunk_by_paragraph($$$)'
```
Length of output: 1811
cognee/tests/unit/modules/graph/cognee_graph_test.py (2)
`4-4`: **LGTM! Import reordering is safe.** The reordering of Edge and Node imports is a cosmetic change that doesn't affect functionality.
Line range hint `12-82`: **LGTM! Improved test structure and readability.** The added newlines between test functions improve code readability by providing clear visual separation between test cases. The consistent spacing and well-documented test functions follow Python testing best practices.
cognee/tasks/chunks/chunk_by_paragraph.py (4)
`2-2`: **LGTM! Improved type safety with explicit type hints.** The addition of type hints enhances code clarity and maintainability.

Also applies to: 5-5
`10-15`: **LGTM! Clear and well-organized state initialization.** The state variables are properly initialized with descriptive names that clearly indicate their purpose.
`20-27`: **Duplicate comment: Extract chunk dictionary creation logic.** A past review already suggested extracting the duplicated chunk dictionary creation into a helper method. This suggestion is still valid and would improve code maintainability.

Also applies to: 44-51, 62-69
`16-58`: **LGTM! Well-structured chunking logic.** The implementation correctly handles (a small sketch follows this list):

- Word count limits
- Batch vs non-batch modes
- Paragraph boundaries
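A rough illustration of the two modes, assuming the import path below and the mode semantics described in this review; exact chunk boundaries depend on the implementation:

```python
from cognee.tasks.chunks import chunk_by_paragraph

text = "One.\nTwo.\nThree.\n"
batched = list(chunk_by_paragraph(text, paragraph_length=1024, batch_paragraphs=True))
unbatched = list(chunk_by_paragraph(text, paragraph_length=1024, batch_paragraphs=False))

# Batched mode packs several paragraphs into one chunk up to the word
# limit; non-batched mode yields per-paragraph chunks, so it should
# produce at least as many chunks.
assert len(batched) <= len(unbatched)
```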
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 13947935 | Triggered | Generic High Entropy Secret | 22a0e43 | cognee/shared/utils.py | View secret |
| 9573981 | Triggered | Generic Password | 22a0e43 | notebooks/cognee_llama_index.ipynb | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn here the best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:

- following these best practices for managing and storing secrets, including API keys and other credentials (a minimal example follows this list)
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.
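For instance, reading credentials from the environment rather than hardcoding them is a small change (a sketch; the variable name is illustrative, not from this repository):

```python
import os

# Read the API key from the environment instead of embedding it in
# source; this raises KeyError if the variable is missing.
llm_api_key = os.environ["LLM_API_KEY"]
```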
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests