
test(robot): migrate test_replica_auto_balance_node_least_effort #2163

Open
yangchiu wants to merge 2 commits into master from auto-balance-least-effort

Conversation

yangchiu (Member) commented Nov 13, 2024

Which issue(s) this PR fixes:

Issue longhorn/longhorn#9776

What this PR does / why we need it:

Migrate test_replica_auto_balance_node_least_effort from the Python integration test suite (manager/integration/tests/test_scheduling.py) to the Robot Framework e2e regression suite (e2e/tests/regression/test_scheduling.robot).

Special notes for your reviewer:

Additional documentation or context

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced keywords for validating volume replica counts, including checks for specific nodes and ensuring no additional scheduling occurs.
    • Added methods to wait for a specific number of replicas for a volume across different classes, enhancing volume management capabilities.
  • Tests

    • Implemented a new test case for validating replica auto-balancing behavior under specific conditions, ensuring proper distribution and health of volumes.
  • Chores

    • Added a new resource for node-related keywords to support the new test case.

@yangchiu yangchiu self-assigned this Nov 13, 2024
@yangchiu yangchiu requested a review from a team as a code owner November 13, 2024 05:14

coderabbitai bot commented Nov 13, 2024

Walkthrough

The pull request introduces new functionalities for managing volume replicas. It adds several keywords for validating the number of running replicas on specific nodes and a new method wait_for_replica_count across multiple classes, including volume_keywords, CRD, Rest, and Volume. Additionally, a new test case for validating replica auto-balancing behavior is included, ensuring that replicas are distributed correctly across nodes under specified conditions.

Changes

  • e2e/keywords/volume.resource: Added new keywords for validating volume replica counts and node-specific checks.
  • e2e/libs/keywords/volume_keywords.py: Introduced the wait_for_replica_count method in the volume_keywords class.
  • e2e/libs/volume/crd.py: Added the wait_for_replica_count method in the CRD class to check replica counts on nodes.
  • e2e/libs/volume/rest.py: Implemented the wait_for_replica_count method in the Rest class to monitor running replicas.
  • e2e/libs/volume/volume.py: Added the wait_for_replica_count method in the Volume class to wait for specific replica counts.
  • e2e/tests/regression/test_scheduling.robot: Introduced a new test case "Test Replica Auto Balance Node Least Effort" and added a resource for it.
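
To make the delegation described above concrete, here is a minimal, self-contained Python sketch of the call chain (volume_keywords -> Volume -> CRD -> Rest). The class and method names follow the walkthrough, but the bodies are simplified stand-ins: the real Rest implementation polls the Longhorn API (see the rest.py diff later in this review), while this sketch counts replicas from an injected snapshot so it can run on its own.

class Rest:
    def __init__(self, replica_snapshot):
        # Stand-in for the Longhorn API lookup used by the real implementation.
        self.replica_snapshot = replica_snapshot

    def wait_for_replica_count(self, volume_name, node_name=None, replica_count=None):
        return sum(
            1 for r in self.replica_snapshot
            if r["running"] and (not node_name or r["hostId"] == node_name)
        )


class CRD:
    # The CRD backend reuses the REST-based check, as in the PR.
    def __init__(self, rest):
        self.rest = rest

    def wait_for_replica_count(self, volume_name, node_name, replica_count):
        return self.rest.wait_for_replica_count(volume_name, node_name, replica_count)


class Volume:
    # Facade that forwards to the selected backend (CRD or Rest).
    def __init__(self, backend):
        self.volume = backend

    def wait_for_replica_count(self, volume_name, node_name, replica_count):
        return self.volume.wait_for_replica_count(volume_name, node_name, replica_count)


class volume_keywords:
    # Keyword layer that the Robot Framework resource files call into.
    def __init__(self, volume):
        self.volume = volume

    def wait_for_replica_count(self, volume_name, node_name=None, replica_count=None):
        return self.volume.wait_for_replica_count(volume_name, node_name, replica_count)


snapshot = [{"running": True, "hostId": "node-0"}, {"running": True, "hostId": "node-0"}]
keywords = volume_keywords(Volume(CRD(Rest(snapshot))))
print(keywords.wait_for_replica_count("vol-0", node_name="node-0", replica_count=2))  # -> 2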

Assessment against linked issues

Objective Addressed Explanation
Migrate test_scheduling.test_replica_auto_balance_node_least_effort (#9776)

Possibly related PRs

Suggested reviewers

  • chriscchien

Poem

🐇 In fields of data, we hop and play,
New keywords sprout, brightening the day.
With replicas dancing, in nodes they align,
Auto-balance whispers, all will be fine!
So let’s cheer for changes, both big and small,
For in our code garden, we nurture them all! 🌼


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 36aaa0b and 406ae36.

📒 Files selected for processing (2)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/volume/crd.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/libs/volume/crd.py
🔇 Additional comments (4)
e2e/keywords/volume.resource (4)

241-244: LGTM: Basic replica count verification looks good!

The implementation correctly verifies the replica count using the wait_for_replica_count method.


245-252: LGTM: Node-specific replica count verification is well implemented!

The keyword correctly verifies the replica count on a specific node and properly sets test variables for reuse in subsequent keywords.


253-260: LGTM: Node-specific replica verification without count is correct!

The keyword correctly verifies replicas on a specific node without enforcing a specific count.


261-277: 🛠️ Refactor suggestion

Refactor scheduling verification keywords to improve maintainability

The scheduling verification keywords have duplicate code and use hardcoded values. Consider extracting the common logic into a reusable keyword.

Apply this refactor to improve maintainability and configurability:

+Verify No Additional Scheduling For Node
+    [Arguments]    ${volume_name}    ${node_name}    ${replica_count}    ${retry_count}=3    ${sleep_duration}=5
+    FOR    ${i}    IN RANGE    ${retry_count}
+        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
+        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${replica_count}
+        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
+        Sleep    ${sleep_duration}
+    END

 Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
     Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
-        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
-        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
-    END
+    Verify No Additional Scheduling For Node    ${volume_name}    ${node_name}    ${expected_replica_count}

 Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
     Volume ${volume_id} should have replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
-        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
-        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
-    END
+    Verify No Additional Scheduling For Node    ${volume_name}    ${node_name}    ${replica_count}

This refactor:

  1. Extracts common scheduling verification logic into a reusable keyword
  2. Makes retry count and sleep duration configurable
  3. Improves progress indication in log messages
  4. Reduces code duplication

Likely invalid or redundant comment.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (10)
e2e/tests/regression/test_scheduling.robot (3)

56-57: Add test documentation

Please add a [Documentation] section to describe:

  • Test purpose and objectives
  • Preconditions and assumptions
  • Expected behavior of replica auto-balancing
  • Significance of using 6 replicas

Example:

[Documentation]    Test replica auto-balancing with least-effort strategy.
...    
...    This test verifies that:
...    1. Replicas are correctly distributed when nodes become schedulable
...    2. Auto-balancing maintains the total replica count
...    3. No unnecessary scheduling occurs after balancing
...    
...    Uses 6 replicas to ensure multiple replicas can be balanced across nodes.

72-78: Move wait condition comments to keywords

The comments about loop behavior and wait conditions should be encapsulated in the keywords themselves rather than as inline comments.

Consider creating a new keyword that encapsulates this logic:

*** Keywords ***
Wait For Stable Replica Distribution
    [Arguments]    ${volume_name}    ${node}    ${expected_count}
    FOR    ${i}    IN RANGE    3
        Volume ${volume_name} should have ${expected_count} replicas running on node ${node}
        Sleep    5s
    END

63-66: Consider parameterizing the replica count

The value 6 for numberOfReplicas appears to be significant for this test. Consider making it a variable to improve maintainability and clarity.

*** Variables ***
${REPLICA_COUNT_FOR_BALANCE_TEST}    6    # Ensures multiple replicas per node for balance testing

*** Test Cases ***
    ...
    And Create volume 0 with    numberOfReplicas=${REPLICA_COUNT_FOR_BALANCE_TEST}    dataEngine=${DATA_ENGINE}
e2e/libs/volume/volume.py (1)

128-130: LGTM! Consider adding docstring documentation.

The new method follows the established pattern of delegating to the underlying implementation and aligns well with the class's existing functionality for replica management.

Consider adding a docstring to document the method's purpose and parameters:

 def wait_for_replica_count(self, volume_name, node_name, replica_count):
+    """Wait until the specified number of replicas are present on the given node.
+
+    Args:
+        volume_name (str): Name of the volume
+        node_name (str): Name of the node to check replicas on
+        replica_count (int): Expected number of replicas
+
+    Returns:
+        bool: True if the desired replica count is reached
+    """
     return self.volume.wait_for_replica_count(volume_name, node_name, replica_count)
e2e/libs/keywords/volume_keywords.py (1)

239-241: Add documentation and logging for consistency

The implementation looks good and follows the class patterns. However, for better maintainability and consistency with other methods in the class, consider these improvements:

 def wait_for_replica_count(self, volume_name, node_name=None, replica_count=None):
+    """Wait for the volume to have the specified number of replicas.
+
+    Args:
+        volume_name: Name of the volume to check
+        node_name: Optional node name to check replicas on a specific node
+        replica_count: Optional specific number of replicas to wait for
+
+    Returns:
+        Result from volume.wait_for_replica_count
+    """
+    logging(f'Waiting for volume {volume_name} to have {replica_count} replicas' + 
+           (f' on node {node_name}' if node_name else ''))
     return self.volume.wait_for_replica_count(volume_name, node_name, replica_count)
e2e/keywords/volume.resource (2)

234-237: Remove unused variable assignment

The ${replica_count} variable is assigned but never used. Consider removing the assignment for cleaner code.

-    ${replica_count} =    wait_for_replica_count    ${volume_name}    node_name=    replica_count=${expected_replica_count}
+    wait_for_replica_count    ${volume_name}    node_name=    replica_count=${expected_replica_count}

246-253: Consider refactoring to reduce code duplication

This keyword is very similar to Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}. Consider refactoring to reuse logic:

Volume ${volume_id} should have replicas running on node ${node_id}
    ${volume_name} =    generate_name_with_suffix    volume    ${volume_id}
    ${node_name} =    get_node_by_index    ${node_id}
-    ${replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
-    Set Test Variable    ${volume_name}
-    Set Test Variable    ${node_name}
-    Set Test Variable    ${replica_count}
+    Volume ${volume_id} should have ${1} replicas running on node ${node_id}
e2e/libs/volume/rest.py (2)

234-249: Simplify conditional logic and add error handling

The implementation can be improved in two ways:

  1. Combine the conditional checks for better readability
  2. Add error handling for the volume client call

Apply this diff to improve the code:

 def wait_for_replica_count(self, volume_name, node_name, replica_count):
     for i in range(self.retry_count):
-        running_replica_count = 0
-        volume = get_longhorn_client().by_id_volume(volume_name)
-        for r in volume.replicas:
-            if node_name and r.hostId == node_name and r.running:
-                running_replica_count += 1
-            elif not node_name and r.running:
-                running_replica_count += 1
-        logging(f"Waiting for {replica_count if replica_count else ''} replicas for volume {volume_name} running on {node_name if node_name else 'nodes'}, currently it's {running_replica_count} ... ({i})")
-        if replica_count and running_replica_count == int(replica_count):
-            break
-        elif not replica_count and running_replica_count:
-            break
+        try:
+            running_replica_count = 0
+            volume = get_longhorn_client().by_id_volume(volume_name)
+            for r in volume.replicas:
+                if r.running and (not node_name or r.hostId == node_name):
+                    running_replica_count += 1
+            logging(f"Waiting for {replica_count if replica_count else ''} replicas for volume {volume_name} running on {node_name if node_name else 'nodes'}, currently it's {running_replica_count} ... ({i})")
+            if (replica_count and running_replica_count == int(replica_count)) or \
+               (not replica_count and running_replica_count):
+                break
+        except Exception as e:
+            logging(f"Failed to get volume {volume_name} with error: {e}")
         time.sleep(self.retry_interval)
     return running_replica_count
🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)


Line range hint 1-332: Consider architectural improvements for better maintainability

The Rest class currently handles both volume operations and monitoring. Consider the following architectural improvements:

  1. Split the class into separate concerns:
    • VolumeOperations for basic CRUD operations
    • VolumeMonitor for monitoring and waiting operations
  2. Extract common retry mechanism into a decorator or base class

This would improve maintainability and make the code more modular.
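
To picture the second suggestion, here is a hedged sketch of a reusable retry decorator; the names retry_until, retry_count, and retry_interval are illustrative assumptions rather than existing framework code, and the monitor class below only inspects a plain dict for demonstration.

import functools
import time


def retry_until(retry_count=150, retry_interval=1):
    # Re-run the wrapped check until it returns a truthy value or attempts run out.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = None
            for _ in range(retry_count):
                result = func(*args, **kwargs)
                if result:
                    return result
                time.sleep(retry_interval)
            return result
        return wrapper
    return decorator


class VolumeMonitor:
    # Sketch of the monitoring half of the suggested VolumeOperations/VolumeMonitor split.
    @retry_until(retry_count=3, retry_interval=0.1)
    def is_volume_healthy(self, volume):
        return volume.get("robustness") == "healthy"


print(VolumeMonitor().is_volume_healthy({"robustness": "healthy"}))  # -> True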

🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)

e2e/libs/volume/crd.py (1)

267-269: Add documentation and type hints.

The implementation looks good and follows the established delegation pattern. Consider adding documentation and type hints for better maintainability:

-    def wait_for_replica_count(self, volume_name, node_name, replica_count):
+    def wait_for_replica_count(self, volume_name: str, node_name: str, replica_count: int) -> bool:
+        """Wait until the specified number of replicas are running on the given node.
+
+        Args:
+            volume_name: Name of the volume to check
+            node_name: Name of the node to check
+            replica_count: Expected number of replicas
+
+        Returns:
+            bool: True if the replica count matches, False otherwise
+        """
         return Rest().wait_for_replica_count(volume_name, node_name, replica_count)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a360760 and 2053fd1.

📒 Files selected for processing (6)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (1 hunks)
  • e2e/libs/volume/crd.py (1 hunks)
  • e2e/libs/volume/rest.py (1 hunks)
  • e2e/libs/volume/volume.py (1 hunks)
  • e2e/tests/regression/test_scheduling.robot (2 hunks)
🧰 Additional context used
🪛 Ruff
e2e/libs/volume/rest.py

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)

🔇 Additional comments (4)
e2e/tests/regression/test_scheduling.robot (1)

13-13: LGTM: Resource addition is appropriate

The addition of node.resource aligns with the test's requirements for node management operations.

e2e/libs/keywords/volume_keywords.py (1)

239-241: Implementation aligns well with test migration needs

The method provides the necessary functionality for the migrated test case test_replica_auto_balance_node_least_effort to verify replica counts and distribution during auto-balancing scenarios.

e2e/keywords/volume.resource (1)

238-245: LGTM! Well-structured keyword with proper variable management

The implementation correctly validates node-specific replica count and sets test variables for reuse.

e2e/libs/volume/rest.py (1)

Line range hint 251-332: LGTM! Enhanced logging improves observability

The added logging statements improve the observability of the replica rebuilding process, making it easier to track progress and debug issues.

🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)

Comment on lines +73 to +75
Then Volume 0 should have replicas running on node 1
And Volume 0 should have 6 replicas running

🛠️ Refactor suggestion

Add timeout parameters to wait conditions

The test should include explicit timeout parameters for waiting for replica distribution to ensure test stability.

Then Volume 0 should have replicas running on node 1    timeout=300
And Volume 0 should have ${REPLICA_COUNT_FOR_BALANCE_TEST} replicas running    timeout=300

Also applies to: 84-85

Comment on lines 58 to 95
Given Set setting replica-soft-anti-affinity to true
And Set setting replica-auto-balance to least-effort

When Disable node 1 scheduling
And Disable node 2 scheduling
And Create volume 0 with    numberOfReplicas=6    dataEngine=${DATA_ENGINE}
And Attach volume 0
And Wait for volume 0 healthy
And Write data to volume 0
Then Volume 0 should have 6 replicas running on node 0
And Volume 0 should have 0 replicas running on node 1
And Volume 0 should have 0 replicas running on node 2

When Enable node 1 scheduling
# wait for auto balance
Then Volume 0 should have replicas running on node 1
And Volume 0 should have 6 replicas running
# loop 3 times with 5-second wait and compare the replica count to:
# ensure no additional scheduling occurs
# the replica count remains unchanged
And Volume 0 should have replicas running on node 0 and no additional scheduling occurs
And Volume 0 should have replicas running on node 1 and no additional scheduling occurs
And Volume 0 should have 0 replicas running on node 2 and no additional scheduling occurs

When Enable node 2 scheduling
# wait for auto balance
Then Volume 0 should have replicas running on node 2
And Volume 0 should have 6 replicas running
# loop 3 times with 5-second wait and compare the replica count to:
# ensure no additional scheduling occurs
# the replica count remains unchanged
And Volume 0 should have replicas running on node 0 and no additional scheduling occurs
And Volume 0 should have replicas running on node 1 and no additional scheduling occurs
And Volume 0 should have replicas running on node 2 and no additional scheduling occurs

And Wait for volume 0 healthy
And Check volume 0 data is intact

💡 Codebase verification

Missing Test Coverage for Edge Cases
The following edge cases are not covered by the existing tests:

  • Node becomes unschedulable during balancing
  • Volume operations during balancing
  • Multiple volumes being balanced simultaneously
🔗 Analysis chain

Verify test coverage for edge cases

The test covers the happy path but should verify additional scenarios:

  1. Node becomes unschedulable during balancing
  2. Volume operations during balancing
  3. Multiple volumes being balanced simultaneously
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing test coverage of these scenarios
rg -l "Test.*Auto.*Balance" "e2e/tests"

# Look for related test implementations
ast-grep --pattern 'Test.*Auto.*Balance.*\n.*\[Documentation\].*\n.*' 

Length of output: 154


Script:

#!/bin/bash
# Check for tests where a node becomes unschedulable during balancing
rg -l "unschedulable.*balancing" "e2e/tests"

# Check for tests involving volume operations during balancing
rg -l "volume.*operations.*balancing" "e2e/tests"

# Check for tests handling multiple volumes being balanced simultaneously
rg -l "multiple.*volumes.*balancing" "e2e/tests"

# Look for related test implementations with documentation
ast-grep --pattern 'Test.*unschedulable.*balancing.*\n.*\[Documentation\].*\n.*' 
ast-grep --pattern 'Test.*volume.*operations.*balancing.*\n.*\[Documentation\].*\n.*' 
ast-grep --pattern 'Test.*multiple.*volumes.*balancing.*\n.*\[Documentation\].*\n.*' 

Length of output: 401


Script:

#!/bin/bash
# Broaden search for tests where a node becomes unschedulable during balancing
rg -il "node.*unschedulable.*balance" "e2e/tests"

# Broaden search for tests involving volume operations during balancing
rg -il "volume.*operation.*balance" "e2e/tests"

# Broaden search for tests handling multiple volumes being balanced simultaneously
rg -il "multiple.*volume.*balance" "e2e/tests"

# Look for any tests related to node scheduling changes with documentation
ast-grep --pattern 'Test.*node.*scheduling.*\n.*\[Documentation\].*\n.*' 

# Look for any tests related to volume operations with documentation
ast-grep --pattern 'Test.*volume.*operation.*\n.*\[Documentation\].*\n.*' 

# Look for any tests related to multiple volumes with documentation
ast-grep --pattern 'Test.*multiple.*volumes.*\n.*\[Documentation\].*\n.*' 

Length of output: 371

Comment on lines +263 to +270
Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have replicas running on node ${node_id}
    FOR    ${i}    IN RANGE    3
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
        Sleep    5
    END

🛠️ Refactor suggestion

Refactor to reduce duplication and improve robustness

This keyword duplicates most of the logic from the previous keyword. Consider refactoring both keywords to share common logic and apply the same improvements regarding configurable retry counts and sleep duration.

Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
+    ${retry_count} =    Set Variable    3
+    ${sleep_duration} =    Set Variable    5
+    FOR    ${i}    IN RANGE    ${retry_count}
+        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
         ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
         Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
+        Sleep    ${sleep_duration}
     END

Consider creating a new keyword to handle the scheduling verification:

Verify No Additional Scheduling For Volume ${volume_id} On Node ${node_id} With Count ${expected_count}
    [Arguments]    ${retry_count}=3    ${sleep_duration}=5
    ${node_name} =    get_node_by_index    ${node_id}
    FOR    ${i}    IN RANGE    ${retry_count}
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_count}
        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
        Sleep    ${sleep_duration}
    END

Comment on lines +254 to +261
Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
    FOR    ${i}    IN RANGE    3
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
        Sleep    5
    END

🛠️ Refactor suggestion

Improve robustness of scheduling verification

A few suggestions to improve this keyword:

  1. The iteration count (3) and sleep duration (5) are hardcoded. Consider making these configurable.
  2. The logging message uses ${node_name} which might not be set if the first keyword fails.
Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
+    ${node_name} =    get_node_by_index    ${node_id}    # Set node_name for logging
+    ${retry_count} =    Set Variable    3
+    ${sleep_duration} =    Set Variable    5
+    FOR    ${i}    IN RANGE    ${retry_count}
+        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
         ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
         Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
+        Sleep    ${sleep_duration}
     END

Committable suggestion skipped: line range outside the PR's diff.

@@ -51,3 +52,43 @@ Test Soft Anti Affinity Scheduling
Then Wait until volume 0 replicas rebuilding completed
And Wait for volume 0 healthy
And Check volume 0 data is intact

Test Replica Auto Balance Node Least Effort
Collaborator

Can we add a brief description?

Member Author

Added.

Comment on lines 78 to 80
And Volume 0 should have replicas running on node 0 and no additional scheduling occurs
And Volume 0 should have replicas running on node 1 and no additional scheduling occurs
Collaborator
Should we also verify that 3 replicas are running on node 0 and node 1 as part of this check?

Member Author

Added.

Comment on lines 89 to 92
And Volume 0 should have replicas running on node 0 and no additional scheduling occurs
And Volume 0 should have replicas running on node 1 and no additional scheduling occurs
And Volume 0 should have replicas running on node 2 and no additional scheduling occurs
Collaborator

Same as above comment, including the replica count.

Member Author

Added.


mergify bot commented Nov 14, 2024

This pull request is now in conflict. Could you fix it @yangchiu? 🙏

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
e2e/tests/regression/test_scheduling.robot (2)

56-58: Enhance test case documentation

The documentation should include:

  • Purpose of testing replica auto-balance with least-effort strategy
  • Expected behavior of replica distribution
  • Key test steps and assertions

Example documentation:

[Documentation]    Test replica auto-balance behavior with least-effort strategy
...    
...    This test verifies that replicas are automatically balanced across nodes when:
...    1. Nodes become available for scheduling
...    2. replica-auto-balance is set to least-effort
...    3. replica-soft-anti-affinity is enabled
...    
...    Expected behavior:
...    - Initially all replicas should be on node 0 (only schedulable node)
...    - When node 1 becomes schedulable, replicas should be balanced between node 0 and 1
...    - When node 2 becomes schedulable, replicas should be balanced across all nodes
...    - Total replica count should remain constant during balancing

74-82: Reduce code duplication with a helper keyword

The verification blocks are duplicated. Consider creating a helper keyword to improve maintainability.

Example:

*** Keywords ***
Verify Replica Distribution
    [Arguments]    ${volume_name}    ${expected_nodes}    ${timeout}=300
    FOR    ${node}    IN    @{expected_nodes}
        Volume ${volume_name} should have replicas running on node ${node}    timeout=${timeout}
        Volume ${volume_name} should have replicas running on node ${node} and no additional scheduling occurs
    END
    Volume ${volume_name} should have 6 replicas running    timeout=${timeout}

Usage:

Then Verify Replica Distribution    0    ${expected_nodes}=['0', '1']
...
Then Verify Replica Distribution    0    ${expected_nodes}=['0', '1', '2']

Also applies to: 84-92

e2e/libs/volume/rest.py (2)

220-236: Add docstring documentation for the new method.

The method lacks documentation describing its parameters, return value, and behavior. This is especially important for test automation code.

Add a docstring like this:

 def wait_for_replica_count(self, volume_name, node_name, replica_count):
+    """Wait for the specified number of running replicas for a volume.
+
+    Args:
+        volume_name: Name of the volume to check
+        node_name: If provided, only count replicas on this node
+        replica_count: If provided, wait for this exact count, otherwise wait for any running replica
+
+    Returns:
+        int: The final count of running replicas
+    """
🧰 Tools
🪛 Ruff

225-228: Combine if branches using logical or operator

Combine if branches

(SIM114)


230-233: Combine if branches using logical or operator

Combine if branches

(SIM114)


225-228: Simplify replica counting conditions.

The conditional logic for counting replicas can be simplified using a logical OR operator.

-                if node_name and r.hostId == node_name and r.running:
-                    running_replica_count += 1
-                elif not node_name and r.running:
-                    running_replica_count += 1
+                if r.running and (not node_name or r.hostId == node_name):
+                    running_replica_count += 1
🧰 Tools
🪛 Ruff

225-228: Combine if branches using logical or operator

Combine if branches

(SIM114)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2053fd1 and 433fd33.

📒 Files selected for processing (6)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (1 hunks)
  • e2e/libs/volume/crd.py (1 hunks)
  • e2e/libs/volume/rest.py (1 hunks)
  • e2e/libs/volume/volume.py (1 hunks)
  • e2e/tests/regression/test_scheduling.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • e2e/libs/keywords/volume_keywords.py
  • e2e/libs/volume/crd.py
  • e2e/libs/volume/volume.py
🧰 Additional context used
🪛 Ruff
e2e/libs/volume/rest.py

225-228: Combine if branches using logical or operator

Combine if branches

(SIM114)


230-233: Combine if branches using logical or operator

Combine if branches

(SIM114)

🔇 Additional comments (10)
e2e/tests/regression/test_scheduling.robot (4)

13-13: LGTM!

The node resource import is properly added and aligns with the test requirements.


62-71: Add timeout parameters to wait conditions

For test stability, explicit timeout parameters should be added to the wait conditions.

Apply this pattern to the following lines:

And Wait for volume 0 healthy    timeout=300
Then Volume 0 should have 6 replicas running on node 0    timeout=300
And Volume 0 should have 0 replicas running on node 1    timeout=300
And Volume 0 should have 0 replicas running on node 2    timeout=300

94-95: LGTM!

Good practice to verify volume health and data integrity after balancing operations.


59-61: Verify settings persistence

The test relies on these settings. Let's ensure they persist throughout the test.

✅ Verification successful

Based on the search results, I can see that these settings are used in two test files:

  1. test_scheduling.robot: Sets both settings as part of the test setup
  2. node_reboot.robot: Sets replica-soft-anti-affinity to false

Let me verify if there are any teardown or cleanup steps for these settings.


Let me check what's included in the Cleanup test resources keyword since both test files use it.


Based on the investigation:

  1. Both test files use Cleanup test resources in their teardown
  2. The cleanup function doesn't explicitly reset these settings
  3. The setting.resource file shows no default value restoration
  4. The settings are modified in different tests with different values

Let me check if there are any default settings or global setup that might handle this.


I notice there's a reset_settings() method in the Python code. Let me check if and where it's used, and also look at the Setting class implementation.


Settings are properly managed through global reset mechanism

The investigation reveals that settings persistence is properly handled:

  1. The reset_settings() method in Setting class systematically resets all non-readonly settings to their default values
  2. This reset is available through the Cleanup test resources keyword which is used in the test teardown for both files
  3. The settings replica-soft-anti-affinity and replica-auto-balance are not in the exclusion list of reset_settings()
  4. Different test files setting different values won't interfere as each test has its own setup and teardown
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if there are any other tests that might modify these settings
rg -B 2 -A 2 "Set setting (replica-soft-anti-affinity|replica-auto-balance)" "e2e/tests"

# Check if settings are reset in teardown
rg "Set setting.*(replica-soft-anti-affinity|replica-auto-balance)" "e2e/keywords/common.resource"

Length of output: 1612


Script:

#!/bin/bash
# Check for teardown or cleanup sections in these files
rg -B 5 -A 5 "^\[Teardown\]|Test Teardown|Suite Teardown" "e2e/tests/regression/test_scheduling.robot" "e2e/tests/negative/node_reboot.robot"

# Check for any global teardown in robot framework resource files
fd -e resource -x rg -B 5 -A 5 "^\[Teardown\]|Test Teardown|Suite Teardown" {}

Length of output: 1663


Script:

#!/bin/bash
# Search for the definition of Cleanup test resources
rg -B 5 -A 10 "Cleanup test resources" "e2e/keywords/"

# Also check setting.resource for any default settings restoration
fd "setting.resource" -x cat {}

Length of output: 1493


Script:

#!/bin/bash
# Check for any global setup or default settings
rg -B 5 -A 5 "Set test environment" "e2e/keywords/"

# Check Python implementation for any default settings
fd "setting_keywords.py" -x cat {}

Length of output: 1139


Script:

#!/bin/bash
# Check for usage of reset_settings
rg "reset_settings" "e2e/"

# Look for the Setting class implementation
fd "setting.py" -x cat {}

Length of output: 4992

e2e/libs/volume/rest.py (1)

220-236: Verify integration with replica auto-balancing tests.

The new method is part of migrating replica auto-balancing tests. Let's verify its usage in test cases.

✅ Verification successful

Integration with replica auto-balancing tests verified successfully

The wait_for_replica_count method is properly integrated with the auto-balancing test migration:

  1. It's part of the commit 433fd33 that migrates the test_replica_auto_balance_node_least_effort test
  2. The method is implemented across the volume handling layers (rest.py, crd.py, volume.py) and is used in Robot Framework tests
  3. The test_scheduling.robot file contains the auto-balance test that uses this functionality
  4. The implementation supports both specific node counting (when node_name is provided) and total replica counting (when node_name is None), which is essential for auto-balancing verification

The method is correctly implemented and integrated for its intended use in replica auto-balancing tests.
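
As a hedged illustration of the dual counting behavior noted in point 4 above, the core rule can be reduced to a small pure function; the replica dicts here are simplified stand-ins for the Longhorn API objects, not the PR's actual data structures.

def count_running_replicas(replicas, node_name=None):
    # With node_name given, only running replicas on that node count;
    # with node_name None, every running replica counts toward the total.
    return sum(1 for r in replicas
               if r["running"] and (not node_name or r["hostId"] == node_name))


replicas = [
    {"running": True, "hostId": "node-0"},
    {"running": True, "hostId": "node-1"},
    {"running": False, "hostId": "node-1"},
]
print(count_running_replicas(replicas, node_name="node-1"))  # -> 1 (node-specific mode)
print(count_running_replicas(replicas))                      # -> 2 (total running replicas)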

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find test files using this method and verify integration
# Expected: Find usage in auto-balancing test files

# Search for test files using wait_for_replica_count
echo "Searching for test files using wait_for_replica_count..."
rg -l "wait_for_replica_count.*node.*replica" --type python

# Search for auto-balance test files
echo "Searching for auto-balance test files..."
rg -l "test_replica_auto_balance" --type python

Length of output: 385


Script:

#!/bin/bash
# Let's try without the --type flag since python type is not recognized
# Search for test files using wait_for_replica_count
echo "Searching for test files using wait_for_replica_count..."
rg -l "wait_for_replica_count.*node.*replica"

# Search for auto-balance test files
echo "Searching for auto-balance test files..."
rg -l "test_replica_auto_balance"

# Also search for the class definition to understand the context
echo "Searching for the class containing wait_for_replica_count..."
rg -B5 "def wait_for_replica_count"

Length of output: 2850


Script:

#!/bin/bash
# Let's check how this method is used in the auto-balance test files
echo "Checking usage in test_zone.py..."
rg -B5 -A5 "wait_for_replica_count" manager/integration/tests/test_zone.py

echo "Checking usage in test_scheduling.py..."
rg -B5 -A5 "wait_for_replica_count" manager/integration/tests/test_scheduling.py

# Let's also check if there are any recent changes or additions related to auto-balancing
echo "Checking recent git history for auto-balance related changes..."
git log --grep="auto.balance\|autobalance" --since="3 months ago" --pretty=format:"%h %s"

Length of output: 623


Script:

#!/bin/bash
# Let's examine the commit that migrated the auto-balance test
echo "Checking the auto-balance test migration commit..."
git show 433fd33 --name-status

# Let's also look for Robot Framework test files that might be using this method
echo "Checking Robot Framework test files for auto-balance tests..."
rg -l "Test Replica Auto Balance" --type robot || true

# And check the actual usage in Robot Framework files
echo "Checking usage of wait_for_replica_count in Robot files..."
rg -B5 -A5 "Wait For Replica Count" --type robot || true

Length of output: 999


Script:

#!/bin/bash
# Let's examine the test_scheduling.robot file for the auto-balance test
echo "Checking test_scheduling.robot for auto-balance test..."
rg -B10 -A10 "Test Replica Auto Balance" e2e/tests/regression/test_scheduling.robot

# Let's also check the implementation in the modified files
echo "Checking volume.resource file..."
cat e2e/keywords/volume.resource

Length of output: 18193

🧰 Tools
🪛 Ruff

225-228: Combine if branches using logical or operator

Combine if branches

(SIM114)


230-233: Combine if branches using logical or operator

Combine if branches

(SIM114)

e2e/keywords/volume.resource (5)

222-225: LGTM!

The implementation is correct and provides a clear way to validate the total replica count for a volume.


226-233: LGTM!

The implementation correctly validates node-specific replica counts and properly sets test variables for reuse in subsequent keywords.


234-241: LGTM!

The implementation correctly validates the presence of replicas on a specific node without requiring an exact count.


242-249: 🛠️ Refactor suggestion

Consider refactoring to improve robustness and reduce duplication

The implementation could be improved by:

  1. Making retry count and sleep duration configurable
  2. Adding iteration count to logging message
  3. Creating a shared keyword for scheduling verification

Apply this diff to improve the implementation:

Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
+    ${retry_count} =    Set Variable    3
+    ${sleep_duration} =    Set Variable    5
+    FOR    ${i}    IN RANGE    ${retry_count}
+        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
         ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
         Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
+        Sleep    ${sleep_duration}
     END

Consider creating a new shared keyword for scheduling verification:

Verify No Additional Scheduling
    [Arguments]    ${volume_name}    ${node_name}    ${expected_count}    ${retry_count}=3    ${sleep_duration}=5
    FOR    ${i}    IN RANGE    ${retry_count}
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_count}
        Should Be Equal As Integers    ${expected_count}    ${new_replica_count}
        Sleep    ${sleep_duration}
    END

251-258: 🛠️ Refactor suggestion

Refactor to reduce code duplication

This keyword is nearly identical to the previous one. Consider using the same shared keyword for scheduling verification to reduce duplication.

Apply this diff:

Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
-        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
-        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
-    END
+    Verify No Additional Scheduling    ${volume_name}    ${node_name}    ${replica_count}

Comment on lines +230 to +248
            if replica_count and running_replica_count == int(replica_count):
                break
            elif not replica_count and running_replica_count:
                break
            time.sleep(self.retry_interval)

⚠️ Potential issue

Add assertion for expected replica count.

The method should assert that the expected count is met before returning. Currently, it silently returns even if the count doesn't match expectations.

             if replica_count and running_replica_count == int(replica_count):
+                logging(f"Successfully found {running_replica_count} running replicas")
                 break
             elif not replica_count and running_replica_count:
+                logging(f"Successfully found {running_replica_count} running replicas")
                 break
             time.sleep(self.retry_interval)
+        if replica_count:
+            assert running_replica_count == int(replica_count), \
+                f"Expected {replica_count} running replicas but got {running_replica_count}"
         return running_replica_count

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

230-233: Combine if branches using logical or operator

Combine if branches

(SIM114)

@yangchiu yangchiu force-pushed the auto-balance-least-effort branch 2 times, most recently from 9bdeade to a429022 on November 14, 2024 00:31
Signed-off-by: Yang Chiu <yang.chiu@suse.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
e2e/tests/regression/test_scheduling.robot (3)

58-60: Enhance test case documentation.

While the scenario is mentioned, the documentation could be more comprehensive by including:

  • Prerequisites (e.g., required node count, initial state)
  • Expected behavior of replica auto-balancing with least-effort setting
  • Validation steps and expected outcomes at each stage

Example documentation structure:

[Documentation]    Test replica auto-balance behavior with least-effort setting
...    Prerequisites:
...    - At least 3 nodes available
...    - Initial state: All nodes schedulable
...    
...    Expected Behavior:
...    - With least-effort, replicas should gradually redistribute when nodes become schedulable
...    - Distribution should maintain volume health while minimizing replica movement
...    
...    Steps:
...    1. Configure soft anti-affinity and least-effort auto-balance
...    2. Disable scheduling on nodes 1 & 2
...    3. Create and verify volume with 6 replicas on node 0
...    4. Enable node scheduling one by one
...    5. Verify gradual replica redistribution
...    6. Ensure data integrity throughout the process

64-68: Add volume creation verification steps.

Consider adding explicit verification steps after volume creation:

 And Create volume 0 with    numberOfReplicas=6    dataEngine=${DATA_ENGINE}
+And Wait for volume 0 state running
 And Attach volume 0
 And Wait for volume 0 healthy
 And Write data to volume 0

77-78: Consolidate duplicate comments.

The same comments are repeated in two places. Consider moving these to a [Documentation] section or creating a custom keyword that encapsulates this behavior.

Example:

*** Keywords ***
Verify No Additional Scheduling Occurs
    [Arguments]    ${volume_index}    ${node_index}    ${expected_count}
    [Documentation]    Verifies that the replica count remains stable
    ...    by checking multiple times with delays
    Volume ${volume_index} should have ${expected_count} replicas running on node ${node_index} and no additional scheduling occurs

Also applies to: 88-89

e2e/libs/volume/rest.py (1)

239-242: Simplify conditional logic using logical operators.

The nested if conditions can be simplified for better readability.

-                if node_name and r.hostId == node_name and r.running:
-                    running_replica_count += 1
-                elif not node_name and r.running:
-                    running_replica_count += 1
+                if r.running and (not node_name or r.hostId == node_name):
+                    running_replica_count += 1
🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 433fd33 and 36aaa0b.

📒 Files selected for processing (6)
  • e2e/keywords/volume.resource (1 hunks)
  • e2e/libs/keywords/volume_keywords.py (1 hunks)
  • e2e/libs/volume/crd.py (1 hunks)
  • e2e/libs/volume/rest.py (1 hunks)
  • e2e/libs/volume/volume.py (1 hunks)
  • e2e/tests/regression/test_scheduling.robot (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • e2e/libs/keywords/volume_keywords.py
  • e2e/libs/volume/crd.py
  • e2e/libs/volume/volume.py
🧰 Additional context used
🪛 Ruff
e2e/libs/volume/rest.py

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)

🔇 Additional comments (5)
e2e/tests/regression/test_scheduling.robot (1)

13-13: LGTM!

The addition of node.resource follows the existing import convention and is required for the node-related operations in the new test case.

e2e/keywords/volume.resource (3)

234-237: LGTM: Clean and focused implementation

The keyword provides a clear way to validate the total replica count for a volume.


238-245: LGTM: Well-structured with proper variable management

The keyword correctly validates node-specific replica counts and sets test variables for reuse in subsequent keywords.


246-253: LGTM: Consistent with established patterns

The keyword maintains consistency with the previous implementation while providing flexibility for unspecified replica counts.

e2e/libs/volume/rest.py (1)

Line range hint 251-319: LGTM! Improved logging for better observability.

The added logging statements enhance the ability to track replica rebuilding progress without changing the core functionality.

🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator

Combine if branches

(SIM114)


244-247: Combine if branches using logical or operator

Combine if branches

(SIM114)

Comment on lines +83 to +86
When Enable node 2 scheduling
# wait for auto balance
Then Volume 0 should have replicas running on node 2
And Volume 0 should have 6 replicas running

🛠️ Refactor suggestion

Add delay after enabling scheduling.

There might be a race condition between enabling scheduling and checking replica distribution. Consider adding a small delay or wait condition.

 When Enable node 2 scheduling
+And Sleep    5s    # Allow scheduling to take effect
 # wait for auto balance
 Then Volume 0 should have replicas running on node 2
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
When Enable node 2 scheduling
And Sleep    5s    # Allow scheduling to take effect
# wait for auto balance
Then Volume 0 should have replicas running on node 2
And Volume 0 should have 6 replicas running

Comment on lines +254 to +270
Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
    FOR    ${i}    IN RANGE    3
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
        Sleep    5
    END

Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
    Volume ${volume_id} should have replicas running on node ${node_id}
    FOR    ${i}    IN RANGE    3
        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
        Sleep    5
    END

🛠️ Refactor suggestion

Extract common scheduling verification logic

Both keywords share similar scheduling verification logic. To improve maintainability and configurability:

  1. Extract the common scheduling verification logic into a new keyword
  2. Make retry count and sleep duration configurable
  3. Improve logging with progress indication

Apply this refactor:

+Verify No Additional Scheduling
+    [Arguments]    ${volume_name}    ${node_name}    ${replica_count}    ${retry_count}=3    ${sleep_duration}=5
+    FOR    ${i}    IN RANGE    ${retry_count}
+        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i + 1}/${retry_count})
+        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${replica_count}
+        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
+        Sleep    ${sleep_duration}
+    END

 Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id} and no additional scheduling occurs
     Volume ${volume_id} should have ${expected_replica_count} replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
-        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}    replica_count=${expected_replica_count}
-        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
-    END
+    Verify No Additional Scheduling    ${volume_name}    ${node_name}    ${expected_replica_count}

 Volume ${volume_id} should have replicas running on node ${node_id} and no additional scheduling occurs
     Volume ${volume_id} should have replicas running on node ${node_id}
-    FOR    ${i}    IN RANGE    3
-        Log to console    Ensuring there's no additional scheduling for node ${node_name} ... (${i})
-        ${new_replica_count} =    wait_for_replica_count    ${volume_name}    node_name=${node_name}
-        Should Be Equal As Integers    ${replica_count}    ${new_replica_count}
-        Sleep    5
-    END
+    Verify No Additional Scheduling    ${volume_name}    ${node_name}    ${replica_count}
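
If the team would rather keep the retry loop in the Python keyword library than in Robot, an equivalent helper could look roughly like the sketch below; the function name, arguments, and defaults are assumptions for illustration, not code from this PR.

import time

def verify_no_additional_scheduling(get_replica_count_on_node, node_name,
                                    expected_count, retry_count=3,
                                    sleep_seconds=5):
    """Re-check the replica count a few times and fail if it ever drifts."""
    for attempt in range(1, retry_count + 1):
        current = get_replica_count_on_node(node_name)
        assert current == expected_count, (
            f"replica count on {node_name} changed: expected {expected_count}, "
            f"got {current} (check {attempt}/{retry_count})")
        time.sleep(sleep_seconds)

Either way, exposing the retry count and sleep duration as parameters keeps the stability check tunable per test.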

Comment on lines +234 to +249
    def wait_for_replica_count(self, volume_name, node_name, replica_count):
        for i in range(self.retry_count):
            running_replica_count = 0
            volume = get_longhorn_client().by_id_volume(volume_name)
            for r in volume.replicas:
                if node_name and r.hostId == node_name and r.running:
                    running_replica_count += 1
                elif not node_name and r.running:
                    running_replica_count += 1
            logging(f"Waiting for {replica_count if replica_count else ''} replicas for volume {volume_name} running on {node_name if node_name else 'nodes'}, currently it's {running_replica_count} ... ({i})")
            if replica_count and running_replica_count == int(replica_count):
                break
            elif not replica_count and running_replica_count:
                break
            time.sleep(self.retry_interval)
        return running_replica_count

🛠️ Refactor suggestion

Add assertion for expected replica count.

The method should validate that the expected count is met before returning, as suggested in the past review. This ensures that test failures are caught early with clear error messages.

     def wait_for_replica_count(self, volume_name, node_name, replica_count):
         for i in range(self.retry_count):
             running_replica_count = 0
             volume = get_longhorn_client().by_id_volume(volume_name)
             for r in volume.replicas:
-                if node_name and r.hostId == node_name and r.running:
-                    running_replica_count += 1
-                elif not node_name and r.running:
-                    running_replica_count += 1
+                if r.running and (not node_name or r.hostId == node_name):
+                    running_replica_count += 1
             logging(f"Waiting for {replica_count if replica_count else ''} replicas for volume {volume_name} running on {node_name if node_name else 'nodes'}, currently it's {running_replica_count} ... ({i})")
             if replica_count and running_replica_count == int(replica_count):
+                logging(f"Successfully found {running_replica_count} running replicas")
                 break
             elif not replica_count and running_replica_count:
+                logging(f"Successfully found {running_replica_count} running replicas")
                 break
             time.sleep(self.retry_interval)
+        if replica_count:
+            assert running_replica_count == int(replica_count), \
+                f"Expected {replica_count} running replicas but got {running_replica_count}"
         return running_replica_count
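
To illustrate the combined condition proposed above, here is a small self-contained check with fake replica objects (SimpleNamespace stands in for the Longhorn replica objects; this is not the project's test code): with no node filter every running replica counts, otherwise only replicas on the given node.

from types import SimpleNamespace

def count_running(replicas, node_name=None):
    # Same predicate as in the suggestion: no node filter counts every
    # running replica, otherwise only replicas on node_name.
    return sum(1 for r in replicas
               if r.running and (not node_name or r.hostId == node_name))

replicas = [
    SimpleNamespace(hostId="node-1", running=True),
    SimpleNamespace(hostId="node-2", running=True),
    SimpleNamespace(hostId="node-2", running=False),
]

assert count_running(replicas) == 2
assert count_running(replicas, "node-2") == 1

Note that the Robot keywords earlier in this review already compare the returned count with Should Be Equal As Integers, so the in-library assert mainly fails earlier and with a clearer error message.
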
🧰 Tools
🪛 Ruff

239-242: Combine if branches using logical or operator (SIM114)

244-247: Combine if branches using logical or operator (SIM114)
