
test(robot): add test case Test Longhorn components recovery #2143

Merged
merged 2 commits on Nov 11, 2024

Conversation

chriscchien
Contributor

@chriscchien chriscchien commented Oct 15, 2024

Which issue(s) this PR fixes:

Issue #9536

What this PR does / why we need it:

Automate the manual test case Test Longhorn components recovery into the sub test cases below:

  • Test Longhorn components recovery
  • Test Longhorn volume recovery
  • Test Longhorn backing image volume recovery
  • Test Longhorn dynamic provisioned RWX volume recovery
  • Test Longhorn dynamic provisioned RWO volume recovery

Special notes for your reviewer:

Tested on my local env

Additional documentation or context

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new keywords for managing backing images, Longhorn components, and share managers, enhancing operational capabilities.
    • Added a new test suite for validating the resilience of Longhorn components and volumes under failure conditions.
  • Improvements

    • Enhanced pod management functionality with new methods for waiting on pod statuses and retrieving pods from specific namespaces.
    • Improved logging and error handling in various components for better operational visibility.
  • Bug Fixes

    • Adjusted method signatures to allow for namespace specifications, increasing flexibility in managing resources.

@chriscchien chriscchien self-assigned this Oct 15, 2024
@chriscchien chriscchien requested a review from a team as a code owner October 15, 2024 09:02
@chriscchien chriscchien requested a review from a team October 22, 2024 06:46

coderabbitai bot commented Oct 30, 2024

Walkthrough

This pull request introduces multiple enhancements across various resource management functionalities in the Longhorn system. New keywords related to backing images, Longhorn components, and share managers have been added to facilitate operations such as deletion, waiting for operational status, and ensuring recovery. Additionally, modifications to existing methods improve their flexibility by allowing namespace specifications. A new test suite has been created to validate the resilience of Longhorn components and volumes under failure conditions, incorporating various test scenarios.

Changes

File Path Change Summary
e2e/keywords/backing_image.resource Added keywords: Delete backing image managers and wait for recreation, Wait backing image managers running.
e2e/keywords/longhorn.resource Added keywords: Delete instance-manager of volume, Delete instance-manager of deployment volume, Wait for Longhorn components all running.
e2e/keywords/sharemanager.resource Added keywords: Delete sharemanager of deployment and wait for recreation, Wait for sharemanager of deployment running.
e2e/keywords/workload.resource Updated Delete Longhorn pod on node to include conditional logic for label_selector; added Delete Longhorn pod.
e2e/libs/backing_image/backing_image.py Added methods: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager.
e2e/libs/backing_image/base.py Added abstract methods: wait_all_backing_image_managers_running, list_backing_image_manager, delete_backing_image_manager, wait_backing_image_manager_restart.
e2e/libs/backing_image/crd.py Introduced class CRD with methods for managing backing images, including create, get, delete, and state management methods.
e2e/libs/backing_image/rest.py Added methods: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager.
e2e/libs/k8s/k8s.py Added function: wait_for_namespace_pods_running(namespace); modified wait_all_pods_evicted to include logging.
e2e/libs/keywords/backing_image_keywords.py Added methods for backing image management: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager, delete_all_backing_image_managers_and_wait_for_recreation.
e2e/libs/keywords/k8s_keywords.py Added method: wait_for_namespace_pods_running(namespace).
e2e/libs/keywords/sharemanager_keywords.py Added methods: delete_sharemanager, delete_sharemanager_and_wait_for_recreation, wait_for_share_manager_running.
e2e/libs/keywords/workload_keywords.py Updated several methods to include namespace and label_selector parameters for improved workload management.
e2e/libs/sharemanager/base.py Added abstract methods: get(name), delete(name), wait_for_running(name), wait_for_restart(name, last_creation_time).
e2e/libs/sharemanager/crd.py Updated constructor and added methods for managing share managers: get, delete, wait_for_running, wait_for_restart.
e2e/libs/sharemanager/rest.py Added methods: get, delete, wait_for_running, wait_for_restart.
e2e/libs/sharemanager/sharemanager.py Added methods that delegate functionality to the sharemanager instance: delete, wait_for_running, get, wait_for_restart.
e2e/libs/workload/workload.py Updated get_workload_pod_names and get_workload_pods to include namespace parameter for flexibility.
e2e/tests/negative/component_resilience.robot Introduced a new test suite for negative test cases on Longhorn resilience, including multiple keywords and test cases for component recovery.

Possibly related PRs

Suggested reviewers

  • yangchiu

🐰 In the meadow where changes bloom,
New keywords sprout, dispelling gloom.
Backing images, share managers too,
All wait and delete, just like new!
With tests to ensure they stand tall,
Longhorn's resilience will conquer all! 🌼




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 27

🧹 Outside diff range and nitpick comments (24)
e2e/libs/sharemanager/base.py (1)

21-23: Consider adding type hints and docstring for wait_for_restart.

The wait_for_restart method would benefit from type hints and documentation, especially for the last_creation_time parameter, as its type and format might not be immediately obvious to implementers.

     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str) -> bool:
+        """Wait for a share manager to restart after the specified creation time.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp to compare against
+
+        Returns:
+            bool: True if the share manager has restarted, False otherwise
+        """
         return NotImplemented
e2e/libs/sharemanager/rest.py (4)

14-15: Add type hints and docstring to get method

The get method lacks type hints and documentation. This information is crucial for maintainability and usage understanding.

-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Get share manager details by name.
+        
+        Args:
+            name: Name of the share manager
+            
+        Returns:
+            Share manager details
+            
+        Raises:
+            ApiException: If the API request fails
+        """

17-18: Add type hints and docstring to delete method

The delete method lacks type hints and documentation.

-    def delete(self, name):
+    def delete(self, name: str) -> None:
+        """Delete a share manager.
+        
+        Args:
+            name: Name of the share manager to delete
+            
+        Raises:
+            ApiException: If the deletion fails
+        """

20-21: Add type hints and docstring to wait_for_running method

The wait_for_running method lacks type hints and documentation.

-    def wait_for_running(self, name):
+    def wait_for_running(self, name: str, timeout: int = 300) -> None:
+        """Wait for share manager to reach running state.
+        
+        Args:
+            name: Name of the share manager
+            timeout: Maximum time to wait in seconds
+            
+        Raises:
+            TimeoutError: If the share manager doesn't reach running state
+            ApiException: If the status check fails
+        """

23-24: Add type hints and docstring to wait_for_restart method

The wait_for_restart method lacks type hints and documentation.

-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> None:
+        """Wait for share manager to restart.
+        
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp
+            timeout: Maximum time to wait in seconds
+            
+        Raises:
+            TimeoutError: If the share manager doesn't restart
+            ApiException: If the status check fails
+        """
e2e/libs/sharemanager/sharemanager.py (1)

20-31: Consider adding docstrings for better maintainability.

While the implementation is clean and follows the strategy pattern correctly, adding docstrings would improve maintainability and help other developers understand the purpose and expected behavior of each method.

Example improvement:

 def delete(self, name):
+    """Delete a share manager instance.
+    
+    Args:
+        name (str): Name of the share manager to delete
+        
+    Returns:
+        The result from the underlying implementation
+    """
     return self.sharemanager.delete(name)
e2e/keywords/backing_image.resource (1)

29-30: Consider adding documentation for the keyword.

The keyword is well-implemented, but adding documentation would help users understand its purpose and expected behavior.

-Wait backing image managers running
+Wait backing image managers running
+    [Documentation]    Waits until all backing image managers are in running state.
e2e/libs/backing_image/base.py (1)

38-40: Consider using plural form in method name for consistency.

The method list_backing_image_manager returns a collection of managers, so consider renaming it to list_backing_image_managers for consistency with typical naming conventions for methods returning collections.

-    def list_backing_image_manager(self):
+    def list_backing_image_managers(self):
         return NotImplemented
e2e/keywords/sharemanager.resource (3)

24-27: Add documentation for the new keyword.

The implementation looks good and aligns with the PR objectives. Consider adding documentation to describe the purpose, parameters, and expected behavior of this keyword.

 Delete sharemanager of deployment ${deployment_id} and wait for recreation
+    [Documentation]    Deletes the sharemanager associated with the given deployment ID and waits for it to be recreated.
+    ...                
+    ...                Arguments:
+    ...                - deployment_id: The ID of the deployment whose sharemanager should be deleted
     ${deployment_name} =   generate_name_with_suffix    deployment    ${deployment_id}
     ${volume_name} =    get_workload_volume_name    ${deployment_name}
     delete_sharemanager_and_wait_for_recreation    ${volume_name}

29-32: Add documentation and consider timeout parameter.

The implementation looks good but could benefit from some enhancements:

  1. Add documentation to describe the keyword's purpose and parameters
  2. Consider adding a timeout parameter to control how long to wait
 Wait for sharemanager of deployment ${deployment_id} running
+    [Documentation]    Waits for the sharemanager associated with the given deployment ID to be in running state.
+    ...                
+    ...                Arguments:
+    ...                - deployment_id: The ID of the deployment whose sharemanager should be monitored
+    [Arguments]    ${deployment_id}    ${timeout}=300
     ${deployment_name} =   generate_name_with_suffix    deployment    ${deployment_id}
     ${volume_name} =    get_workload_volume_name    ${deployment_name}
-    wait_for_share_manager_running    ${volume_name}
+    wait_for_share_manager_running    ${volume_name}    timeout=${timeout}

24-32: Implementation aligns well with PR objectives.

The new keywords provide essential functionality for testing sharemanager recovery scenarios:

  1. Delete sharemanager of deployment enables testing recovery by triggering sharemanager recreation
  2. Wait for sharemanager of deployment running allows verification of successful recovery

These additions will effectively support the automation of Longhorn components recovery test cases as outlined in the PR objectives.

Consider adding error handling keywords to handle cases where recovery fails or times out, which would make the test suite more robust.

e2e/libs/keywords/backing_image_keywords.py (1)

23-42: Consider implementing a retry mechanism for resilience testing.

Since this code is part of component recovery testing, consider implementing a retry mechanism with exponential backoff for the wait operations. This would make the tests more resilient and better simulate real-world recovery scenarios.

Key recommendations:

  1. Create a common retry decorator/utility for all wait operations
  2. Add configurable retry parameters (max attempts, backoff factor)
  3. Implement detailed logging of retry attempts for test debugging
  4. Consider adding assertions about the time taken for recovery

Would you like me to provide an example implementation of the retry mechanism?
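For reference, one possible shape for such a retry utility is sketched below. This is a minimal, generic sketch that assumes nothing beyond the Python standard library; the names retry_with_backoff, max_attempts, and backoff_factor are illustrative and do not exist in the test framework today.

# Illustrative sketch: a generic retry decorator with exponential backoff for wait operations.
import functools
import time

def retry_with_backoff(max_attempts=5, backoff_factor=2, initial_delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise
                    # Log each failed attempt so test debugging can see the retry history.
                    print(f"Attempt {attempt}/{max_attempts} failed: {exc}; retrying in {delay}s")
                    time.sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator

# Example usage: wrap any wait/poll helper so transient failures are retried.
@retry_with_backoff(max_attempts=3, backoff_factor=2)
def wait_until_ready():
    ...  # replace with a real readiness check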

e2e/libs/keywords/sharemanager_keywords.py (3)

51-52: Add documentation and error handling.

Consider adding a docstring and basic error handling to improve test maintainability and debugging:

 def delete_sharemanager(self, name):
+    """Delete a share manager instance by name.
+    
+    Args:
+        name (str): Name of the share manager to delete
+    
+    Returns:
+        The result of the deletion operation
+    
+    Raises:
+        Exception: If deletion fails
+    """
+    try:
         return self.sharemanager.delete(name)
+    except Exception as e:
+        logging(f"Failed to delete share manager {name}: {str(e)}")
+        raise

60-61: Add documentation and error handling for wait operation.

Consider adding a docstring and error handling to improve test reliability:

 def wait_for_share_manager_running(self, name):
+    """Wait for a share manager to reach running state.
+    
+    Args:
+        name (str): Name of the share manager
+    
+    Raises:
+        TimeoutError: If the share manager doesn't reach running state
+        ValueError: If name is empty
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+
+    try:
         return self.sharemanager.wait_for_running(name)
+    except Exception as e:
+        logging(f"Failed waiting for share manager {name} to run: {str(e)}")
+        raise

50-61: Consider architectural improvements for better maintainability.

The new methods are well-integrated, but consider these improvements:

  1. Extract common timeout and wait logic into a base method to avoid duplication
  2. Add integration tests to verify the recovery scenarios
  3. Consider using a configuration object for timeouts and retry settings

Example of a base wait method:

def _wait_with_timeout(self, operation, timeout=300, interval=2):
    """Base method for wait operations with timeout.
    
    Args:
        operation (callable): Function to execute
        timeout (int): Maximum wait time in seconds
        interval (int): Sleep interval between retries
    """
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            return operation()
        except Exception as e:
            if time.time() - start_time >= timeout:
                raise TimeoutError(f"Operation timed out: {str(e)}")
            time.sleep(interval)
e2e/libs/sharemanager/crd.py (1)

63-66: Consider extracting timestamp comparison logic

The datetime parsing and comparison logic could be moved to a utility function for reuse across other test cases, especially since this PR involves multiple recovery test scenarios.

Consider creating a utility function like:

def is_newer_timestamp(new_time: str, old_time: str, fmt: str = "%Y-%m-%dT%H:%M:%SZ") -> bool:
    return datetime.strptime(new_time, fmt) > datetime.strptime(old_time, fmt)
e2e/libs/keywords/k8s_keywords.py (1)

83-84: Consider adding a docstring for better maintainability.

The method implementation looks good and follows the class's pattern of wrapping k8s module functions. However, adding a docstring would improve maintainability by documenting the method's purpose and parameters.

Consider adding documentation like this:

 def wait_for_namespace_pods_running(self, namespace):
+    """Wait for all pods in the specified namespace to be in running state.
+
+    Args:
+        namespace (str): The namespace to check for running pods
+
+    Returns:
+        bool: True if all pods are running, False otherwise
+    """
     return wait_for_namespace_pods_running(namespace)
e2e/libs/backing_image/rest.py (1)

113-113: Fix whitespace consistency

There are extra blank lines around the new methods.

Apply this diff to maintain consistent spacing:

-    
     def delete_backing_image_manager(self, name):
e2e/libs/keywords/workload_keywords.py (2)

64-70: LGTM: Enhanced pod selection capabilities.

The addition of namespace and label_selector parameters improves the flexibility of pod selection and deletion.

Consider adding docstring to document the parameters, especially the format expected for label_selector:

def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
    """Delete workload pod on specific node.
    
    Args:
        workload_name (str): Name of the workload
        node_name (str): Name of the node
        namespace (str, optional): Kubernetes namespace. Defaults to "default"
        label_selector (str, optional): Kubernetes label selector (e.g. "app=nginx"). Defaults to ""
    """

49-51: LGTM: Consistent namespace parameter implementation.

The addition of namespace parameters across methods follows a consistent pattern and improves the test framework's flexibility while maintaining backward compatibility.

Consider creating a base class or configuration object to store common parameters like default namespace. This would make it easier to modify defaults across all methods and reduce parameter repetition.

Example:

class WorkloadConfig:
    DEFAULT_NAMESPACE = "default"

class workload_keywords:
    def __init__(self):
        self.config = WorkloadConfig()
        # ... rest of init ...

    def delete_pod(self, pod_name, namespace=None):
        namespace = namespace or self.config.DEFAULT_NAMESPACE
        # ... rest of method ...

Also applies to: 64-70, 71-72

e2e/keywords/workload.resource (3)

190-201: Consider improving maintainability and consistency.

A few suggestions to enhance the code:

  1. Remove unnecessary empty lines for consistency with the rest of the file.
  2. Consider using a mapping for label selectors to improve maintainability.

Apply this diff to implement the suggestions:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
-    
    ${node_name} =    get_node_by_index    ${node_id}
-    
    IF    '${workload_name}' == 'engine-image'
        ${label_selector} =    Set Variable    longhorn.io/component=engine-image       
    ELSE IF    '${workload_name}' == 'instance-manager'
        ${label_selector} =    Set Variable    longhorn.io/component=instance-manager
    ELSE
        ${label_selector} =    Set Variable    ${EMPTY}
    END
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system    ${label_selector}

Additionally, consider creating a variable at the top of the file to map workload names to their label selectors:

*** Variables ***
&{LONGHORN_COMPONENT_LABELS}    engine-image=longhorn.io/component=engine-image    instance-manager=longhorn.io/component=instance-manager

Then simplify the keyword:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
    ${node_name} =    get_node_by_index    ${node_id}
    ${label_selector} =    Get From Dictionary    ${LONGHORN_COMPONENT_LABELS}    ${workload_name}    ${EMPTY}
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system    ${label_selector}

202-205: Add documentation and error handling.

The keyword would benefit from:

  1. Documentation explaining its purpose and usage.
  2. Error handling for cases where the pod doesn't exist.
  3. Verification that the pod was successfully deleted.

Apply this diff to implement the suggestions:

Delete Longhorn ${workload_kind} ${workload_name} pod
+    [Documentation]    Deletes a Longhorn pod of specified workload kind and name from the longhorn-system namespace.
+    ...    Logs the pod name before deletion and verifies successful deletion.
+    ...    
+    ...    Arguments:
+    ...    - workload_kind: The kind of workload (e.g., deployment, statefulset)
+    ...    - workload_name: The name of the workload to delete
     ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
+    Should Not Be Empty    ${pod_name}    msg=No pod found for workload ${workload_name}
     Log    ${pod_name}
     delete_pod    ${pod_name}     longhorn-system
+    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system

190-205: Consider adding more verification steps for component recovery testing.

Given that these keywords are part of automating test cases for Longhorn components recovery, consider adding more verification steps:

  1. Verify that all associated resources are cleaned up after pod deletion.
  2. Add wait conditions to ensure the system is in a known state before proceeding with recovery tests.

Would you like me to provide examples of additional verification steps that could be added?

e2e/libs/workload/workload.py (1)

Line range hint 24-45: Consider standardizing namespace handling across all functions

While the core pod retrieval functions now support custom namespaces, several other functions in this file still hardcode the 'default' namespace (e.g., write_pod_random_data, write_pod_large_data). Consider:

  1. Adding namespace parameters consistently across all pod-related functions
  2. Creating a module-level default namespace configuration
  3. Updating all exec/stream operations to use the specified namespace

Example pattern to consider:

# At module level
DEFAULT_NAMESPACE = "default"

def write_pod_random_data(pod_name, size_in_mb, file_name,
                         data_directory="/data", namespace=DEFAULT_NAMESPACE):
    # ... use namespace parameter in api calls ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 942e57e and 7597939.

📒 Files selected for processing (19)
  • e2e/keywords/backing_image.resource (1 hunks)
  • e2e/keywords/longhorn.resource (1 hunks)
  • e2e/keywords/sharemanager.resource (1 hunks)
  • e2e/keywords/workload.resource (1 hunks)
  • e2e/libs/backing_image/backing_image.py (2 hunks)
  • e2e/libs/backing_image/base.py (1 hunks)
  • e2e/libs/backing_image/crd.py (1 hunks)
  • e2e/libs/backing_image/rest.py (1 hunks)
  • e2e/libs/k8s/k8s.py (3 hunks)
  • e2e/libs/keywords/backing_image_keywords.py (1 hunks)
  • e2e/libs/keywords/k8s_keywords.py (2 hunks)
  • e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
  • e2e/libs/keywords/workload_keywords.py (2 hunks)
  • e2e/libs/sharemanager/base.py (1 hunks)
  • e2e/libs/sharemanager/crd.py (2 hunks)
  • e2e/libs/sharemanager/rest.py (1 hunks)
  • e2e/libs/sharemanager/sharemanager.py (1 hunks)
  • e2e/libs/workload/workload.py (1 hunks)
  • e2e/tests/negative/component_resilience.robot (1 hunks)
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/k8s/k8s.py

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)


178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/sharemanager/crd.py

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🔇 Additional comments (20)
e2e/libs/sharemanager/base.py (1)

8-23: LGTM! Well-structured abstract interface for share manager operations.

The new abstract methods provide a clean and comprehensive interface for share manager lifecycle operations, which aligns well with the PR's objective of testing component recovery. The method signatures are clear and follow consistent patterns.

e2e/libs/sharemanager/sharemanager.py (3)

21-22: LGTM! Clean delegation to strategy implementation.

The delete method follows the strategy pattern correctly and maintains a clean interface.


24-25: LGTM! Consistent with recovery testing requirements.

The wait_for_running method aligns well with the PR's objective of testing component recovery.


27-28: LGTM! Simple and focused getter implementation.

The get method provides a clean interface to retrieve share manager instances.

e2e/keywords/backing_image.resource (1)

26-27: LGTM! The keyword follows Robot Framework conventions.

The keyword is well-named and properly maps to its underlying implementation for testing backing image manager recovery.

e2e/libs/backing_image/base.py (1)

33-48: Well-structured additions for recovery testing!

The new abstract methods form a comprehensive interface for managing backing image managers, which aligns well with the PR's objective of testing Longhorn components recovery. The methods provide the necessary operations for:

  • Monitoring manager status (wait_all_backing_image_managers_running)
  • Managing lifecycle (delete_backing_image_manager, wait_backing_image_manager_restart)
  • Retrieving state (list_backing_image_manager)

These additions will enable thorough testing of recovery scenarios.
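For readers without the diff open, the interface described above is roughly the following shape; this sketch is reconstructed from the method names listed in this review and may not match e2e/libs/backing_image/base.py exactly.

# Approximate shape of the abstract interface, reconstructed from this review.
from abc import ABC, abstractmethod

class Base(ABC):

    @abstractmethod
    def wait_all_backing_image_managers_running(self):
        return NotImplemented

    @abstractmethod
    def list_backing_image_manager(self):
        return NotImplemented

    @abstractmethod
    def delete_backing_image_manager(self, name):
        return NotImplemented

    @abstractmethod
    def wait_backing_image_manager_restart(self, name, last_creation_time):
        return NotImplemented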

e2e/keywords/sharemanager.resource (1)

23-23: LGTM! Good spacing.

The added empty line improves readability by properly separating keyword definitions.

e2e/libs/keywords/backing_image_keywords.py (1)

36-42: ⚠️ Potential issue

Improve robustness of the manager deletion and recreation process.

The current implementation has several potential issues:

  1. No error handling for API responses
  2. Possible race conditions during iteration
  3. No timeout for the complete operation
  4. Lack of logging for test debugging

Here's a suggested improvement:

     def delete_all_backing_image_managers_and_wait_for_recreation(self):
+        """Delete all backing image managers and ensure they are recreated.
+        
+        Raises:
+            TimeoutError: If operation exceeds timeout
+            RuntimeError: If API calls fail
+        """
+        try:
             backing_image_managers = self.backing_image.list_backing_image_manager()
+            if not backing_image_managers.get("items"):
+                return
+            
+            # Store all managers first to avoid race conditions
+            managers_to_delete = [
+                (m["metadata"]["name"], m["metadata"]["creationTimestamp"])
+                for m in backing_image_managers["items"]
+            ]
+            
-            for backing_image in backing_image_managers["items"]:
-                name = backing_image["metadata"]["name"]
-                last_creation_time = backing_image["metadata"]["creationTimestamp"]
+            for name, last_creation_time in managers_to_delete:
+                print(f"Deleting backing image manager: {name}")
                 self.backing_image.delete_backing_image_manager(name)
                 self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+        except Exception as e:
+            raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")

Let's verify if there are any existing error handling patterns in the codebase:

✅ Verification successful

Let me gather more information about the logging patterns and timeout handling in the codebase.


Based on the analysis of the codebase, I can now provide the final response:

Add error handling and logging to backing image manager operations

The codebase shows consistent patterns of error handling and logging across other components. Let's align with these patterns:

     def delete_all_backing_image_managers_and_wait_for_recreation(self):
+        """Delete all backing image managers and wait for their recreation.
+        
+        Raises:
+            RuntimeError: If API calls fail
+        """
+        logging(f'Cleaning up backing image managers')
+        try:
             backing_image_managers = self.backing_image.list_backing_image_manager()
+            if not backing_image_managers.get("items"):
+                return
+
             for backing_image in backing_image_managers["items"]:
                 name = backing_image["metadata"]["name"]
                 last_creation_time = backing_image["metadata"]["creationTimestamp"]
+                logging(f'Deleting backing image manager: {name}')
                 self.backing_image.delete_backing_image_manager(name)
                 self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+        except Exception as e:
+            raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")

The codebase shows:

  1. Consistent error handling with try-except blocks raising RuntimeError
  2. Extensive use of logging for operations and their progress
  3. Null checks before processing lists
  4. No explicit timeout handling in similar wait operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for error handling patterns in similar test files
rg -A 3 "try:|except:|raise" "e2e/libs/keywords/"

Length of output: 1531


Script:

#!/bin/bash
# Search for logging patterns
rg -A 2 "logging\(" "e2e/libs/keywords/"

# Search for timeout patterns and wait operations
rg -A 2 "wait.*timeout|timeout.*wait" "e2e/libs/keywords/"

Length of output: 25649

e2e/libs/backing_image/backing_image.py (1)

3-3: LGTM: Import statement is correctly placed

The CRD import is appropriately grouped with other backing_image related imports.

e2e/libs/sharemanager/crd.py (1)

2-14: LGTM: Well-structured initialization with configurable retries

The addition of retry configuration through get_retry_count_and_interval() is a good practice for test automation, allowing for environment-specific tuning.

e2e/libs/keywords/k8s_keywords.py (1)

12-12: LGTM!

The import follows the established pattern and is properly placed with other k8s imports.

e2e/keywords/longhorn.resource (1)

70-84: Verify test coverage for recovery scenarios

The new keywords align well with the PR objectives for testing Longhorn components recovery. Let's verify the test coverage:

✅ Verification successful

Test coverage for recovery scenarios is comprehensive

The implementation provides thorough test coverage for Longhorn components recovery:

  • Test cases cover all key recovery scenarios:
    • Component recovery (CSI plugins, managers, UI, etc.)
    • Volume recovery for different volume types (RWO, RWX)
    • Backing image volume recovery
    • Recovery during replica rebuilding
    • Share manager recovery for RWX volumes

The new keywords are effectively utilized in well-structured test cases that verify both the failure injection and recovery verification, with data integrity checks.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are corresponding test files using these keywords
# and verify the test scenarios cover all recovery cases mentioned in PR objectives

echo "Checking for test files using the new keywords..."
rg -l "Delete instance-manager of (volume|deployment)" --type robot

echo "Verifying test scenarios for different volume types..."
rg -l "Test.*recovery.*(RWO|RWX|backing image)" --type robot

Length of output: 406


Script:

#!/bin/bash
# Let's examine the actual test scenarios in the component_resilience.robot file
echo "Checking test scenarios in component_resilience.robot..."
cat e2e/tests/negative/component_resilience.robot

# Also check for any other related test files that might be using these keywords
echo -e "\nChecking for other potential test files..."
rg -l "longhorn.resource" --type robot

Length of output: 11001

e2e/libs/backing_image/rest.py (1)

114-124: ⚠️ Potential issue

Implement backing image manager methods following existing patterns

The newly added methods are currently returning NotImplemented. Based on the existing patterns in the class and the PR objectives for testing Longhorn components recovery, these methods should be implemented.

Here's a suggested implementation following the class patterns:

def delete_backing_image_manager(self, name):
-    return NotImplemented
+    logging(f"Deleting backing image manager {name}")
+    manager = get_longhorn_client().by_id_backing_image_manager(name)
+    get_longhorn_client().delete(manager)

def wait_all_backing_image_managers_running(self):
-    return NotImplemented
+    for i in range(self.retry_count):
+        logging(f"Waiting for all backing image managers running... ({i})")
+        managers = get_longhorn_client().list_backing_image_manager()
+        all_running = True
+        for manager in managers:
+            if manager.status.state != "running":
+                all_running = False
+                break
+        if all_running:
+            return
+        time.sleep(self.retry_interval)
+    assert False, "Timeout waiting for all backing image managers running"

def wait_backing_image_manager_restart(self, name, last_creation_time):
-    return NotImplemented
+    for i in range(self.retry_count):
+        logging(f"Waiting for backing image manager {name} restart... ({i})")
+        manager = get_longhorn_client().by_id_backing_image_manager(name)
+        if manager.metadata.creationTimestamp > last_creation_time:
+            return
+        time.sleep(self.retry_interval)
+    assert False, f"Timeout waiting for backing image manager {name} restart"

def list_backing_image_manager(self):
-    return NotImplemented
+    return get_longhorn_client().list_backing_image_manager()

The implementation:

  1. Follows existing error handling and logging patterns
  2. Uses the retry mechanism consistently
  3. Maintains similar assertion patterns for timeouts
  4. Utilizes the Longhorn client methods for operations

Let's verify the Longhorn client API methods exist:

e2e/libs/keywords/workload_keywords.py (2)

49-51: LGTM: Namespace parameter addition is well-implemented.

The addition of the namespace parameter with a default value maintains backward compatibility while enabling better resource isolation for tests.


71-72: Consider handling multiple pods scenario.

While the namespace parameter addition is good, returning the first pod from the list might be problematic if there are multiple pods and the order matters.

Consider either:

  1. Documenting that this method should only be used with single-pod workloads, or
  2. Adding a parameter to specify which pod to return
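A minimal sketch of option 2 is shown below. get_workload_pods mirrors the helper discussed in this PR, but the pod_index parameter, the import path, and the error messages are illustrative assumptions rather than existing API.

# Sketch of option 2 (as a workload_keywords method): let callers choose which pod to return.
from workload.workload import get_workload_pods  # assumed import path

def get_workload_pod_name(self, workload_name, namespace="default", pod_index=0):
    pods = get_workload_pods(workload_name, namespace=namespace)
    if not pods:
        raise AssertionError(f"No pods found for workload {workload_name} in namespace {namespace}")
    if pod_index >= len(pods):
        raise IndexError(f"pod_index {pod_index} out of range; workload has {len(pods)} pod(s)")
    return pods[pod_index].metadata.name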

Let's verify if this method is used with multi-pod workloads:

e2e/libs/workload/workload.py (3)

17-18: LGTM: Clean namespace parameter addition

The namespace parameter addition with a default value maintains backward compatibility while enabling cross-namespace pod operations.


Line range hint 24-45: Verify RBAC permissions for cross-namespace operations

The addition of cross-namespace support requires appropriate RBAC permissions. Please ensure:

  1. The test service account has necessary permissions to access pods across namespaces
  2. The test framework validates namespace accessibility before operations
#!/bin/bash
# Description: Check for RBAC configurations and namespace-related permissions

# Search for RBAC definitions
rg "Role|ClusterRole|ServiceAccount" --type yaml -B 5 -A 5

# Look for namespace-related test configurations
rg "namespace.*test" --type py -B 2 -A 2

Line range hint 24-45: LGTM: Namespace support properly implemented

The namespace parameter is correctly integrated with proper error handling and logging.

Let's verify that all callers of this function handle the namespace parameter correctly:

✅ Verification successful

The previous search might have failed due to syntax. Let's try a different approach to find the callers and verify the namespace parameter usage.


Based on the shell script results, I can now provide a final response:

All callers properly handle the namespace parameter

The code analysis shows that all callers of get_workload_pods() properly handle the namespace parameter:

  1. get_workload_pod_names() correctly propagates the namespace parameter
  2. wait_for_workload_pods_running() properly passes the namespace parameter
  3. wait_for_workload_pods_stable() correctly uses the namespace parameter
  4. wait_for_workload_pod_kept_in_state() properly handles the namespace parameter
  5. is_workload_pods_has_annotations() correctly uses the namespace parameter

The function's default namespace value of "default" maintains backward compatibility while allowing explicit namespace specification when needed. All callers either use the default value or explicitly pass the namespace parameter.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any direct calls to get_workload_pods that might need namespace parameter updates

# Search for function calls
rg "get_workload_pods\(" --type py -B 2 -A 2

# Search for potential namespace-related issues in test files
rg "namespace.*=.*default" --type py -B 2 -A 2

Length of output: 6894


Script:

#!/bin/bash
# Let's try a more specific search for get_workload_pods calls
rg "get_workload_pods\(" --type py -l | xargs cat

# Also search for the function definition to understand its full context
rg "def get_workload_pods" --type py -B 5 -A 20

Length of output: 27502

e2e/tests/negative/component_resilience.robot (2)

121-123: Consistent use of replica node variable

Ensure that the variable replica node used in deleting volume replicas is consistently defined and accessible within the test scope.

Confirm that replica node refers to a valid and correctly assigned node in the test environment.

Also applies to: 146-148, 174-176, 202-204


81-91: Ensure node selection is appropriately defined

In the steps where Longhorn component pods are deleted (lines 81-91), the node is specified as node 1. Verify that node 1 accurately references an existing node in your cluster.

Run the following script to list all nodes in the cluster and confirm node 1 exists:

Comment on lines +17 to +19
@abstractmethod
def wait_for_running(self, name):
return NotImplemented

🛠️ Refactor suggestion

Consider adding timeout parameter to wait_for_running.

For robustness in test scenarios, consider adding an optional timeout parameter to control how long the method should wait before giving up.

     @abstractmethod
-    def wait_for_running(self, name):
+    def wait_for_running(self, name: str, timeout: int = 300) -> bool:
+        """Wait for a share manager to reach running state.
+
+        Args:
+            name: Name of the share manager
+            timeout: Maximum time to wait in seconds (default: 300)
+
+        Returns:
+            bool: True if running state is reached, False if timeout occurs
+        """
         return NotImplemented

Comment on lines +14 to +24
def get(self, name):
return NotImplemented

def delete(self, name):
return NotImplemented

def wait_for_running(self, name):
return NotImplemented

def wait_for_restart(self, name, last_creation_time):
return NotImplemented

⚠️ Potential issue

Implementation required: Methods return NotImplemented

All newly added methods currently return NotImplemented. This seems inconsistent with the PR objectives which state that the tests have been executed locally. These methods are crucial for testing Longhorn components recovery and should be properly implemented.

Please implement these methods using the longhorn_client. The implementation should:

  1. Handle API interactions with the share manager
  2. Include appropriate error handling
  3. Follow similar patterns as the existing list method

Example implementation structure for get:

def get(self, name: str) -> dict:
    """Get share manager details by name.
    
    Args:
        name: Name of the share manager
        
    Returns:
        Share manager details
        
    Raises:
        ApiException: If the API request fails
    """
    return self.longhorn_client.get_share_manager(name)

💡 Codebase verification

Error handling strategy needs to be implemented in Rest class

The review comment is valid. The code inspection reveals:

  • The base class defines abstract methods without error handling guidance
  • The Rest class inherits from Base and uses longhorn_client but has unimplemented methods returning NotImplemented
  • The CRD implementation shows error handling patterns with try-except blocks and logging
  • No error handling exists for API calls, timeouts, or resource states in the Rest class

Key areas needing error handling:

  • list() method's longhorn_client.list_share_manager() call
  • Unimplemented methods should handle API errors and resource states
  • Wait operations need timeout handling and state validation
  • Consistent error logging similar to CRD implementation
🔗 Analysis chain

Define error handling strategy

The methods need a clear error handling strategy for various scenarios:

  • API errors from longhorn_client
  • Timeouts during wait operations
  • Resource not found cases
  • Invalid state transitions

Let's verify the error handling in the base class:

Consider implementing a common error handling decorator or base method that can:

  1. Handle common API exceptions
  2. Implement retries for transient failures
  3. Provide consistent error messages
  4. Log relevant details for debugging
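One possible shape for such a decorator is sketched below, assuming the wrappers ultimately call the Kubernetes Python client (whose ApiException carries status and reason); the decorator name, retry defaults, and 404-handling policy are illustrative assumptions, not existing project code.

# Illustrative sketch: centralize API error handling for the share manager wrappers.
import functools
import time

from kubernetes.client.exceptions import ApiException

def handle_api_errors(retries=3, interval=5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except ApiException as e:
                    if e.status == 404:
                        # Resource not found yet (e.g. being recreated): treat as transient and retry.
                        print(f"{func.__name__}: resource not found, retry {attempt + 1}/{retries}")
                        time.sleep(interval)
                        continue
                    # Other API errors are surfaced with a consistent, debuggable message.
                    raise RuntimeError(f"{func.__name__} failed: {e.status} {e.reason}") from e
            raise TimeoutError(f"{func.__name__} still failing after {retries} retries")
        return wrapper
    return decorator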
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check error handling patterns in the base class
rg -A 5 "raise|except" "e2e/libs/sharemanager/base.py"

Length of output: 54


Script:

#!/bin/bash
# Let's first check if the file exists and see its content
fd "base.py" "e2e/libs/sharemanager" --exec cat {}

# Also check the rest.py implementation for context
fd "rest.py" "e2e/libs/sharemanager" --exec cat {}

# Look for any error handling patterns in related files
rg -A 5 "raise|except|error" "e2e/libs/sharemanager/"

Length of output: 1736

Comment on lines +30 to +31
def wait_for_restart(self, name, last_creation_time):
return self.sharemanager.wait_for_restart(name, last_creation_time)

💡 Codebase verification

Based on the results, I can now generate the final response since I have sufficient information about the restart detection implementations across different classes.

Critical: REST strategy has unimplemented restart detection

The verification revealed important implementation gaps:

  • The REST strategy (rest.py) returns NotImplemented for wait_for_restart
  • Only the CRD strategy (crd.py) has a concrete implementation with retry logic and error handling
  • The base class (base.py) defines the interface but returns NotImplemented

This could lead to runtime failures when using the REST strategy for recovery testing. The CRD implementation looks solid with proper retry and error handling, but the REST strategy needs to be implemented to match this behavior.
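A rough sketch of what the missing REST-side wait_for_restart could look like, following the retry pattern the CRD strategy already uses. The by_id_share_manager accessor and the creationTimestamp field are assumptions about the Longhorn client objects, and retry_count/retry_interval are assumed to come from get_retry_count_and_interval() as elsewhere in the framework.

# Rough sketch for e2e/libs/sharemanager/rest.py; client accessor names are assumptions.
import time

class Rest:

    def __init__(self, longhorn_client, retry_count=150, retry_interval=2):
        self.longhorn_client = longhorn_client
        self.retry_count = retry_count
        self.retry_interval = retry_interval

    def wait_for_restart(self, name, last_creation_time):
        for i in range(self.retry_count):
            try:
                sm = self.longhorn_client.by_id_share_manager(name)  # hypothetical accessor
                # ISO-8601 timestamps compare correctly as strings; a shared helper could also be used.
                if sm and sm.metadata.creationTimestamp > last_creation_time:
                    return
            except Exception as e:
                print(f"Waiting for share manager {name} to be recreated ({i}): {e}")
            time.sleep(self.retry_interval)
        raise AssertionError(f"Share manager {name} was not recreated in time")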

🔗 Analysis chain

Verify restart detection logic in strategy implementations.

The wait_for_restart method seems critical for recovery testing. Let's verify the underlying implementation handles edge cases properly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the restart detection implementation in strategy classes

# Look for wait_for_restart implementations in CRD and Rest classes
echo "Checking CRD implementation:"
ast-grep --pattern 'def wait_for_restart($_, $_) { $$$ }' 

echo -e "\nChecking error handling patterns:"
rg -A 5 "wait_for_restart"

Length of output: 1552

Comment on lines +24 to +34
def delete_backing_image_manager(self, name):
self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
return self.backing_image.list_backing_image_manager()

🛠️ Refactor suggestion

Add docstrings and error handling to the new methods.

These test automation keywords would benefit from:

  1. Docstrings explaining the purpose, parameters, and expected behavior
  2. Timeout parameters for the wait methods to prevent hanging tests
  3. Basic input validation and error handling

Here's a suggested improvement for one of the methods as an example:

     def wait_backing_image_manager_restart(self, name, last_creation_time):
+        """Wait for a backing image manager to restart after deletion.
+        
+        Args:
+            name (str): Name of the backing image manager
+            last_creation_time (str): Previous creation timestamp
+            
+        Raises:
+            TimeoutError: If manager doesn't restart within timeout
+            ValueError: If name or timestamp is invalid
+        """
+        if not name or not last_creation_time:
+            raise ValueError("Name and last_creation_time must be provided")
         self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

Committable suggestion was skipped due to low confidence.

Comment on lines +34 to +48
def delete_backing_image_manager(self, name):
self.backing_image = CRD()
return self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
self.backing_image = CRD()
return self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
self.backing_image = CRD()
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
self.backing_image = CRD()
return self.backing_image.list_backing_image_manager()

💡 Codebase verification

⚠️ Potential issue

The shell script results provide clear evidence about the strategy pattern implementation across the codebase. Let me generate the final response:

Strategy pattern implementation needs revision to match codebase standards

The current implementation in backing_image.py deviates from the established pattern used consistently across other modules. All other modules:

  1. Initialize the strategy object only once in __init__
  2. Handle both REST and CRD cases properly with if-else blocks
  3. Use the initialized strategy object throughout their methods

Here's the correct implementation matching the codebase standards:

def __init__(self):
    if self._strategy == LonghornOperationStrategy.CRD:
        self.backing_image = CRD()
    else:
        self.backing_image = Rest()

Remove the redundant CRD instantiation from:

  • delete_backing_image_manager
  • wait_all_backing_image_managers_running
  • wait_backing_image_manager_restart
  • list_backing_image_manager
🔗 Analysis chain

Strategy pattern implementation needs revision

The current implementation has several architectural concerns:

  1. The new methods bypass the strategy pattern by directly instantiating CRD, which violates the pattern's purpose.
  2. Creating a new CRD instance per method call is inefficient and could lead to resource management issues.
  3. Overwriting self.backing_image could affect other methods expecting the REST implementation.

Consider refactoring to:

  1. Respect the strategy pattern:
 class BackingImage(Base):
     _strategy = LonghornOperationStrategy.REST
 
     def __init__(self):
         if self._strategy == LonghornOperationStrategy.REST:
             self.backing_image = Rest()
+        else:
+            self.backing_image = CRD()
 
     def delete_backing_image_manager(self, name):
-        self.backing_image = CRD()
         return self.backing_image.delete_backing_image_manager(name)
  2. If both REST and CRD implementations are needed, consider using separate attributes:
def __init__(self):
    self.rest_client = Rest()
    self.crd_client = CRD()

The wait_backing_image_manager_restart method doesn't return the result from the delegate call:

 def wait_backing_image_manager_restart(self, name, last_creation_time):
     self.backing_image = CRD()
-    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+    return self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

Consider renaming list_backing_image_manager to list_backing_image_managers for consistency with other method names.

Let's verify the strategy pattern usage across the codebase:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if other classes follow the same pattern
rg -l "LonghornOperationStrategy" | xargs rg -A 5 "_strategy\s*="

Length of output: 5660

Comment on lines +75 to +84
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except Exception as e:
logging(f"Finding backing image manager {name} failed with error {e}")
continue

🛠️ Refactor suggestion

Handle exceptions more specifically and improve logging

Catching all exceptions with a bare except can obscure unexpected errors. It's better to catch specific exceptions. Additionally, consider logging the stack trace for better debugging.

Apply this diff to catch specific exceptions and log the stack trace:

-            except Exception as e:
-                logging(f"Finding backing image manager {name} failed with error {e}")
+            except client.exceptions.ApiException as e:
+                logging(f"Failed to find backing image manager {name}: {e}")
+            except Exception as e:
+                logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except Exception as e:
logging(f"Finding backing image manager {name} failed with error {e}")
continue
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except client.exceptions.ApiException as e:
logging(f"Failed to find backing image manager {name}: {e}")
except Exception as e:
logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
continue

Resource ../keywords/backing_image.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
Resource ../keywords/k8s.resource
@coderabbitai coderabbitai bot Oct 30, 2024

⚠️ Potential issue

Remove duplicate resource import

The resource file ../keywords/k8s.resource is imported twice on lines 11 and 17. This duplication is unnecessary and could lead to confusion or maintenance issues.

Apply this diff to remove the duplicate import:

- Resource    ../keywords/k8s.resource

Also applies to: 17-17

Member

@chriscchien This review is valid.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

And Wait for volume 0 healthy
And Check volume 0 data is intact
And Wait for deployment 0 pods stable
And And Check deployment 0 data in file data.txt is intact
@coderabbitai coderabbitai bot Oct 30, 2024

⚠️ Potential issue

Correct duplicate 'And' keywords in test steps

There are typographical errors on lines 96 and 100 where the keyword And is duplicated.

Apply this diff to fix the typos:

-    And And Check deployment 0 data in file data.txt is intact
+    And Check deployment 0 data in file data.txt is intact
-    And And Check deployment 1 data in file data.txt is intact
+    And Check deployment 1 data in file data.txt is intact

Also applies to: 100-100

Member

@chriscchien This review is valid.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +68 to +79
IF '${DATA_ENGINE}' == 'v1'
When Create backing image bi with url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2
And Create volume 1 with backingImage=bi dataEngine=${DATA_ENGINE}
And Attach volume 1
And Wait for volume 1 healthy
And Write data to volume 1

When Create storageclass longhorn-test-1 with dataEngine=${DATA_ENGINE}
And Create persistentvolumeclaim 1 using RWX volume with longhorn-test-1 storageclass
And Create deployment 1 with persistentvolumeclaim 1
And Write 100 MB data to file data.txt in deployment 1
END


🛠️ Refactor suggestion

Simplify conditional checks for ${DATA_ENGINE}

The variable ${DATA_ENGINE} is set to v1 at line 27. The conditional IF '${DATA_ENGINE}' == 'v1' will always evaluate to true throughout the test cases.

Consider removing the conditional checks or making ${DATA_ENGINE} configurable if different engine versions are intended to be tested. This will simplify the code and improve readability.

-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END

Also applies to: 97-102, 138-153, 167-183

And Attach volume 0
And Wait for volume 0 healthy
And Write data to volume 0
Then Delete instance-manager of volume 0 and wait for recover

⚠️ Potential issue

Handle potential race conditions during instance manager deletion

Deleting the instance manager while a volume is rebuilding may lead to inconsistent states or test flakiness.

Consider adding a wait or verification step to ensure the rebuild process has properly initiated before deleting the instance manager. This can help in making the test more reliable.

+           And Wait for replica rebuilding progress

Also applies to: 123-123, 145-145, 148-148, 172-172, 176-176, 200-200, 204-204

Resource ../keywords/backing_image.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
Resource ../keywords/k8s.resource
Member

@chriscchien This review is valid.

And Wait for volume 0 healthy
And Check volume 0 data is intact
And Wait for deployment 0 pods stable
And And Check deployment 0 data in file data.txt is intact
Member

@chriscchien This review is valid.


mergify bot commented Nov 10, 2024

This pull request is now in conflict. Could you fix it @chriscchien? 🙏

@chriscchien
Contributor Author

Updated.

cc @yangchiu , @PhanLe1010 , @shuo-wu

@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (12)
e2e/libs/sharemanager/base.py (2)

9-11: Add type hints and docstring to get method.

For better maintainability and implementation guidance, consider adding type hints and documentation.

     @abstractmethod
-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Retrieve share manager details.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            dict: Share manager details
+        """
         return NotImplemented

13-15: Enhance delete method signature with type hints and force option.

Consider adding return type and force delete option for test scenarios where cleanup is critical.

     @abstractmethod
-    def delete(self, name):
+    def delete(self, name: str, force: bool = False) -> bool:
+        """Delete a share manager.
+
+        Args:
+            name: Name of the share manager
+            force: Whether to force delete even if resources are in use
+
+        Returns:
+            bool: True if deletion was successful
+        """
         return NotImplemented
e2e/libs/sharemanager/crd.py (1)

Line range hint 10-68: Consider architectural improvements for better maintainability

To improve the maintainability and reusability of this class, consider these architectural improvements:

  1. Move common CRD configurations to class attributes:
class CRD(Base):
    GROUP = "longhorn.io"
    VERSION = "v1beta2"
    NAMESPACE = "longhorn-system"
    PLURAL = "sharemanagers"
  2. Create a decorator for common error handling:
def handle_k8s_errors(func):
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except client.rest.ApiException as e:
            logging(f"Operation failed: {e}")
            raise
    return wrapper

This would make the code more DRY and easier to maintain.
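
As a usage illustration only (assuming the decorator and the class attributes above live alongside the existing CRD class), a method body would then shrink to the API call itself:

class CRD(Base):
    @handle_k8s_errors
    def get(self, name):
        # Errors from the API server are logged and re-raised by the decorator.
        return self.obj_api.get_namespaced_custom_object(
            group=self.GROUP,
            version=self.VERSION,
            namespace=self.NAMESPACE,
            plural=self.PLURAL,
            name=name,
        )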

Would you like me to help implement these improvements?

🧰 Tools
🪛 Ruff

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/keywords/longhorn.resource (1)

70-84: Consider implementing a more comprehensive recovery testing pattern.

The current implementation provides basic building blocks for recovery testing. Consider enhancing it with:

  1. A structured recovery testing pattern that includes:

    • Pre-failure state validation
    • Controlled failure injection
    • Recovery monitoring
    • Post-recovery state validation
  2. Metrics collection during recovery:

    • Recovery time measurements
    • Component state transitions
    • Resource usage during recovery
  3. Recovery scenarios matrix:

    • Single component failures
    • Multiple component failures
    • Cascading failures
    • Network partition scenarios

This would provide more thorough validation of Longhorn's recovery capabilities.
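
As a sketch of the recovery-time measurement idea, a small helper (hypothetical, not part of the current keyword libraries) could wrap any existing blocking wait keyword:

import time

def measure_recovery_time(wait_fn, *args, **kwargs):
    # wait_fn is an existing wait, e.g. wait_all_backing_image_managers_running;
    # it is expected to raise if the component never recovers.
    start = time.time()
    wait_fn(*args, **kwargs)
    elapsed = time.time() - start
    print(f"component recovered after {elapsed:.1f}s")
    return elapsed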

e2e/libs/keywords/workload_keywords.py (2)

64-70: Consider adding error handling for pod retrieval.

While the implementation is correct, consider adding error handling for cases where no pods are found matching the criteria. This would improve the robustness of the test automation.

 def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
     pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector)
+    if not pods:
+        logging(f'No pods found for workload {workload_name} on node {node_name}')
+        return
     for pod in pods:
         if pod.spec.node_name == node_name:
             logging(f'Deleting pod {pod.metadata.name} on node {node_name}')
             delete_pod(pod.metadata.name, namespace=namespace)

Line range hint 1-190: Consider architectural improvements for better async/sync separation.

The class mixes synchronous and asynchronous methods, along with different concurrency approaches (multiprocessing and asyncio). Consider:

  1. Separating sync and async methods into different classes or modules
  2. Standardizing on either multiprocessing or asyncio for concurrent operations
  3. Grouping related methods (e.g., pod operations, volume operations) together

This would improve maintainability and make the testing framework's behavior more predictable.
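
One possible shape for that separation, sketched with hypothetical class names rather than the repository's actual layout:

import asyncio

class WorkloadKeywords:
    """Synchronous, one-shot operations such as pod deletion."""

    def delete_pod(self, name, namespace="default"):
        ...  # direct API call; returns once the delete request is accepted


class AsyncWorkloadKeywords:
    """Long-running waits grouped together and driven by asyncio."""

    async def wait_for_pods_stable(self, workload_name, namespace="default"):
        ...  # poll pod status without blocking other concurrent waits

    async def wait_for_all_stable(self, workload_names, namespace="default"):
        # Run the per-workload waits concurrently instead of serially.
        await asyncio.gather(
            *(self.wait_for_pods_stable(n, namespace) for n in workload_names)
        )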

e2e/keywords/workload.resource (3)

190-190: Remove unnecessary empty lines.

These empty lines don't serve any purpose and should be removed for better code organization.

Also applies to: 192-192


193-201: Add documentation for label selector logic.

The conditional logic for label selectors is well-implemented, but would benefit from documentation explaining:

  • Why different components need different label selectors
  • What each label selector accomplishes
  • When to use empty label selector

Add documentation above the IF block:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
    ${node_name} =    get_node_by_index    ${node_id}
+    # Select appropriate label for Longhorn components:
+    # - engine-image: Targets engine image pods
+    # - instance-manager: Targets instance manager pods
+    # - others: No specific label selection
    IF    '${workload_name}' == 'engine-image'

202-205: Add documentation and improve error handling.

The keyword implementation is clean but would benefit from:

  1. Documentation explaining its purpose and when to use it vs. the node-specific version
  2. Error handling for cases where the pod doesn't exist

Consider this enhancement:

+[Documentation]    Deletes a Longhorn workload pod from the longhorn-system namespace.
+...               Use this keyword when the node location is not relevant.
 Delete Longhorn ${workload_kind} ${workload_name} pod
     ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
+    Should Not Be Empty    ${pod_name}    msg=Pod ${workload_name} not found in longhorn-system namespace
     Log    ${pod_name}
     delete_pod    ${pod_name}     longhorn-system
e2e/tests/negative/component_resilience.robot (3)

118-118: Use 'When' instead of 'Then' for actions in test steps

In the test steps at lines 118, 123, 143, 175, 199, and 203, the keyword Then is used to perform actions (Delete instance-manager...), whereas When is more appropriate for action steps. Then is generally used for assertions or verifying outcomes.

Apply this diff to replace Then with When:

-Then Delete instance-manager of volume 0 and wait for recover
+When Delete instance-manager of volume 0 and wait for recover

Also applies to: 123-123, 143-143, 175-175, 199-199, 203-203


68-68: Avoid hardcoding URLs; consider parameterizing the backing image URL

The backing image URL is hardcoded at lines 68 and 138. Hardcoding URLs can lead to maintenance issues if the URL changes. Consider defining the URL as a variable to improve flexibility and maintainability.

Apply this diff to define the URL as a variable:

 *** Variables ***
 ${LOOP_COUNT}    1
 ${RETRY_COUNT}    300
 ${RETRY_INTERVAL}    1
 ${RWX_VOLUME_FAST_FAILOVER}    false
 ${DATA_ENGINE}    v1
+${BACKING_IMAGE_URL}    https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2

 *** Test Cases ***
 Test Longhorn components recovery
     ...
-    When Create backing image bi with    url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2
+    When Create backing image bi with    url=${BACKING_IMAGE_URL}

Also applies to: 138-138


43-101: Extract common test steps into reusable keywords to improve maintainability

The test cases from lines 43 to 203 contain several similar steps that are repeated across different tests. Extracting these common steps into reusable keywords will reduce code duplication, enhance readability, and make the tests easier to maintain.

Consider defining new keywords for recurring actions, such as:

  • Setup Test Environment: Creating volumes, storage classes, and deployments.
  • Write Data: Writing data to volumes or deployments.
  • Delete Instance Manager and Wait for Recovery: Deleting instance managers and verifying recovery.
  • Verify Data Integrity: Checking that data is intact after recovery.

Applying this structure will make your test cases cleaner and more modular.

Also applies to: 102-123, 124-152, 153-182, 183-203

🛑 Comments failed to post (7)
e2e/libs/sharemanager/base.py (1)

21-23: 🛠️ Refactor suggestion

Enhance wait_for_restart method signature.

For consistency with wait_for_running and better type safety:

  1. Add timeout and polling interval parameters
  2. Properly type the timestamp parameter
  3. Add comprehensive documentation
     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(
+        self,
+        name: str,
+        last_creation_time: datetime,
+        timeout: int = 300,
+        poll_interval: int = 5
+    ) -> bool:
+        """Wait for a share manager to restart after a specific time.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Timestamp to compare against for restart detection
+            timeout: Maximum time to wait in seconds (default: 300)
+            poll_interval: Time between checks in seconds (default: 5)
+
+        Returns:
+            bool: True if restart is detected, False if timeout occurs
+        """
         return NotImplemented

Don't forget to add the datetime import at the top of the file:

+from datetime import datetime
 from abc import ABC, abstractmethod
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    @abstractmethod
    def wait_for_restart(
        self,
        name: str,
        last_creation_time: datetime,
        timeout: int = 300,
        poll_interval: int = 5
    ) -> bool:
        """Wait for a share manager to restart after a specific time.

        Args:
            name: Name of the share manager
            last_creation_time: Timestamp to compare against for restart detection
            timeout: Maximum time to wait in seconds (default: 300)
            poll_interval: Time between checks in seconds (default: 5)

        Returns:
            bool: True if restart is detected, False if timeout occurs
        """
        return NotImplemented
e2e/libs/keywords/sharemanager_keywords.py (2)

51-52: 🛠️ Refactor suggestion

Add input validation and error handling.

The method should include basic validation and error handling to ensure robustness:

 def delete_sharemanager(self, name):
+    """Delete a share manager.
+    
+    Args:
+        name (str): Name of the share manager to delete
+        
+    Raises:
+        ValueError: If name is empty
+        RuntimeError: If deletion fails
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+    try:
         return self.sharemanager.delete(name)
+    except Exception as e:
+        raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def delete_sharemanager(self, name):
        """Delete a share manager.
        
        Args:
            name (str): Name of the share manager to delete
            
        Raises:
            ValueError: If name is empty
            RuntimeError: If deletion fails
        """
        if not name:
            raise ValueError("Share manager name cannot be empty")
        try:
            return self.sharemanager.delete(name)
        except Exception as e:
            raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")

60-61: 🛠️ Refactor suggestion

Add timeout and error handling for robustness.

The method should include timeout, error handling, and proper documentation:

-def wait_for_share_manager_running(self, name):
-    return self.sharemanager.wait_for_running(name)
+def wait_for_share_manager_running(self, name, timeout=300):
+    """Wait for a share manager to reach running state.
+    
+    Args:
+        name (str): Name of the share manager
+        timeout (int): Maximum time to wait in seconds
+        
+    Raises:
+        ValueError: If name is empty
+        TimeoutError: If share manager doesn't reach running state within timeout
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+    try:
+        return self.sharemanager.wait_for_running(name, timeout=timeout)
+    except Exception as e:
+        raise TimeoutError(f"Share manager {name} failed to reach running state: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

e2e/libs/backing_image/crd.py (3)

37-54: 🛠️ Refactor suggestion

Add configuration options and error handling for manager operations.

Consider the following improvements:

  1. Make the namespace configurable instead of hardcoding "longhorn-system"
  2. Add error handling for API operations
  3. Add logging for successful deletion
 def list_backing_image_manager(self):
     label_selector = 'longhorn.io/component=backing-image-manager'
+    try:
         return self.obj_api.list_namespaced_custom_object(
             group="longhorn.io",
             version="v1beta2",
-            namespace="longhorn-system",
+            namespace=self.namespace,  # Add to __init__
             plural="backingimagemanagers",
             label_selector=label_selector)
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to list backing image managers: {e}")
+        raise

 def delete_backing_image_manager(self, name):
     logging(f"deleting backing image manager {name} ...")
+    try:
         self.obj_api.delete_namespaced_custom_object(
             group="longhorn.io",
             version="v1beta2",
-            namespace="longhorn-system",
+            namespace=self.namespace,  # Add to __init__
             plural="backingimagemanagers",
             name=name
         )
+        logging(f"Successfully deleted backing image manager {name}")
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to delete backing image manager {name}: {e}")
+        raise

Committable suggestion skipped: line range outside the PR's diff.


56-69: 🛠️ Refactor suggestion

Improve error handling and assertions in wait_all_backing_image_managers_running.

The method needs several improvements:

  1. Safe dictionary access (as noted in previous review)
  2. Better error handling
  3. Proper assertion usage
  4. Unused loop variable fix
-    def wait_all_backing_image_managers_running(self):
-        for i in range(self.retry_count):
+    def wait_all_backing_image_managers_running(self):
+        for _ in range(self.retry_count):
             all_running = True
-            backing_image_managers = self.list_backing_image_manager()            
+            try:
+                backing_image_managers = self.list_backing_image_manager()
+            except Exception as e:
+                logging(f"Failed to list backing image managers: {e}")
+                time.sleep(self.retry_interval)
+                continue
+
             for backing_image_manager in backing_image_managers["items"]:
-                current_state = backing_image_manager["status"]["currentState"]
                 name = backing_image_manager["metadata"]["name"]
+                current_state = backing_image_manager.get("status", {}).get("currentState", "unknown")
                 logging(f"backing image manager {name} currently in {current_state} state")
                 if current_state != "running":
                     all_running = False
             if all_running is True:
                 return
             time.sleep(self.retry_interval)
-        assert False, f"Waiting all backing image manager in running state timeout"
+        raise AssertionError("Timeout waiting for all backing image managers to reach running state")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def wait_all_backing_image_managers_running(self):
        for _ in range(self.retry_count):
            all_running = True
            try:
                backing_image_managers = self.list_backing_image_manager()
            except Exception as e:
                logging(f"Failed to list backing image managers: {e}")
                time.sleep(self.retry_interval)
                continue

            for backing_image_manager in backing_image_managers["items"]:
                name = backing_image_manager["metadata"]["name"]
                current_state = backing_image_manager.get("status", {}).get("currentState", "unknown")
                logging(f"backing image manager {name} currently in {current_state} state")
                if current_state != "running":
                    all_running = False
            if all_running is True:
                return
            time.sleep(self.retry_interval)
        raise AssertionError("Timeout waiting for all backing image managers to reach running state")
🧰 Tools
🪛 Ruff

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


71-91: 🛠️ Refactor suggestion

Improve error handling and datetime comparison in wait_backing_image_manager_restart.

The method needs several improvements:

  1. Better exception handling
  2. Proper assertion usage
  3. More robust datetime comparison
  4. Unused loop variable fix
-    def wait_backing_image_manager_restart(self, name, last_creation_time):
-        for i in range(self.retry_count):
+    def wait_backing_image_manager_restart(self, name, last_creation_time):
+        fmt = "%Y-%m-%dT%H:%M:%SZ"
+        try:
+            last_creation_dt = datetime.strptime(last_creation_time, fmt)
+        except ValueError as e:
+            raise ValueError(f"Invalid last_creation_time format: {e}")
+
+        for _ in range(self.retry_count):
             time.sleep(self.retry_interval)            
             try:
                 backing_image_manager = self.obj_api.get_namespaced_custom_object(
                     group="longhorn.io",
                     version="v1beta2",
                     namespace="longhorn-system",
                     plural="backingimagemanagers",
                     name=name
                     )
-            except Exception as e:
+            except client.exceptions.ApiException as e:
                 logging(f"Finding backing image manager {name} failed with error {e}")
                 continue
             
             creation_time = backing_image_manager["metadata"]["creationTimestamp"]
-            fmt = "%Y-%m-%dT%H:%M:%SZ"
-            if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
-                return
+            try:
+                current_creation_dt = datetime.strptime(creation_time, fmt)
+                if current_creation_dt > last_creation_dt:
+                    return
+            except ValueError as e:
+                logging(f"Invalid creation_time format from API: {e}")
+                continue

-        assert False, f"Wait backing image manager {name} restart failed ..."
+        raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def wait_backing_image_manager_restart(self, name, last_creation_time):
        fmt = "%Y-%m-%dT%H:%M:%SZ"
        try:
            last_creation_dt = datetime.strptime(last_creation_time, fmt)
        except ValueError as e:
            raise ValueError(f"Invalid last_creation_time format: {e}")

        for _ in range(self.retry_count):
            time.sleep(self.retry_interval)            
            try:
                backing_image_manager = self.obj_api.get_namespaced_custom_object(
                    group="longhorn.io",
                    version="v1beta2",
                    namespace="longhorn-system",
                    plural="backingimagemanagers",
                    name=name
                    )
            except client.exceptions.ApiException as e:
                logging(f"Finding backing image manager {name} failed with error {e}")
                continue
            
            creation_time = backing_image_manager["metadata"]["creationTimestamp"]
            try:
                current_creation_dt = datetime.strptime(creation_time, fmt)
                if current_creation_dt > last_creation_dt:
                    return
            except ValueError as e:
                logging(f"Invalid creation_time format from API: {e}")
                continue

        raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")
🧰 Tools
🪛 Ruff

72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/keywords/workload_keywords.py (1)

71-72: ⚠️ Potential issue

Handle potential IndexError in get_workload_pod_name.

The method assumes at least one pod exists and may raise an IndexError if no pods are found. Consider adding error handling.

 def get_workload_pod_name(self, workload_name, namespace="default"):
-    return get_workload_pod_names(workload_name, namespace)[0]
+    pod_names = get_workload_pod_names(workload_name, namespace)
+    if not pod_names:
+        raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}")
+    return pod_names[0]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def get_workload_pod_name(self, workload_name, namespace="default"):
        pod_names = get_workload_pod_names(workload_name, namespace)
        if not pod_names:
            raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}")
        return pod_names[0]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (15)
e2e/libs/sharemanager/base.py (2)

9-11: Add type hints and documentation for the get method.

Consider adding type hints and docstring to better define the interface contract.

     @abstractmethod
-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Retrieve share manager details.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            dict: Share manager details
+        """
         return NotImplemented

13-15: Add type hints and documentation for the delete method.

Consider adding type hints and docstring to better define the interface contract.

     @abstractmethod
-    def delete(self, name):
+    def delete(self, name: str) -> bool:
+        """Delete a share manager.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            bool: True if deletion was successful, False otherwise
+        """
         return NotImplemented
e2e/libs/sharemanager/crd.py (1)

43-52: Consider enhancing logging and reusability.

While the basic functionality is correct, consider these improvements:

  1. Add more detailed logging (e.g., retry attempt number, time elapsed)
  2. Extract the retry logic into a common utility function that can be reused by other wait methods

Example enhancement:

def wait_for_condition(self, name, condition_fn, error_msg):
    start_time = time.time()
    for attempt in range(self.retry_count):
        logging(f"Attempt {attempt + 1}/{self.retry_count} ({time.time() - start_time:.1f}s elapsed)")
        if condition_fn():
            return
        time.sleep(self.retry_interval)
    raise AssertionError(error_msg)
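
With a helper like that in place, wait_for_running could reduce to a single call (illustrative only; it assumes the share manager CR exposes a status.state field that reaches "running"):

def wait_for_running(self, name):
    # Re-evaluate the CR state on every retry attempt via the lambda.
    self.wait_for_condition(
        name,
        lambda: self.get(name)["status"]["state"] == "running",
        f"share manager {name} did not reach the running state",
    )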
🧰 Tools
🪛 Ruff

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/backing_image/crd.py (3)

14-36: Add docstrings to NotImplemented methods.

These methods lack documentation describing their intended functionality, parameters, and return values. This documentation would be valuable for future implementation and maintenance.

Example docstring format:

def create(self, bi_name, source_type, url, expected_checksum):
    """Create a new backing image resource.

    Args:
        bi_name (str): Name of the backing image
        source_type (str): Type of the source
        url (str): URL of the source
        expected_checksum (str): Expected checksum for validation

    Returns:
        NotImplemented: Currently not implemented
    """
    return NotImplemented

37-45: Make namespace and API version configurable.

The namespace "longhorn-system" and API version "v1beta2" are hardcoded. Consider making these configurable through class initialization or environment variables for better flexibility.

 def __init__(self):
     self.obj_api = client.CustomObjectsApi()
     self.retry_count, self.retry_interval = get_retry_count_and_interval()
+    self.namespace = os.getenv("LONGHORN_NAMESPACE", "longhorn-system")
+    self.api_version = os.getenv("LONGHORN_API_VERSION", "v1beta2")

86-89: Add buffer time to creation timestamp comparison.

The direct timestamp comparison might be unreliable if the timestamps are very close. Consider adding a small buffer time to ensure the restart is detected correctly.

-            if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
+            new_time = datetime.strptime(creation_time, fmt)
+            old_time = datetime.strptime(last_creation_time, fmt)
+            # Add a small buffer (e.g., 1 second) to avoid race conditions
+            if (new_time - old_time).total_seconds() > 1:
e2e/keywords/workload.resource (2)

200-211: Improve maintainability and documentation of the pod deletion keyword.

The implementation looks good but could benefit from some improvements:

  1. Consider using a dictionary for label selectors to make it more maintainable:
COMPONENT_LABELS = {
    'engine-image': 'longhorn.io/component=engine-image',
    'instance-manager': 'longhorn.io/component=instance-manager'
}
  2. Add documentation to explain the purpose and parameters:
[Documentation]    Deletes a Longhorn pod of specified workload type on a given node
...    ${workload_kind} - Type of the workload
...    ${workload_name} - Name of the workload (engine-image/instance-manager)
...    ${node_id} - Target node ID

212-215: Add error handling and documentation to the pod deletion keyword.

The implementation is good but could be enhanced with:

  1. Documentation to explain the purpose and parameters:
[Documentation]    Deletes a Longhorn pod of specified workload type
...    ${workload_kind} - Type of the workload
...    ${workload_name} - Name of the workload
  2. Error handling and deletion verification:
Delete Longhorn ${workload_kind} ${workload_name} pod
    ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
    Should Not Be Empty    ${pod_name}    msg=Pod name not found for workload ${workload_name}
    Log    Deleting pod: ${pod_name}
    delete_pod    ${pod_name}    longhorn-system
    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system
e2e/libs/workload/workload.py (5)

Line range hint 17-40: LGTM! Consider enhancing error message clarity.

The namespace parameter addition and pod filtering logic look good. The error handling is improved with namespace context.

Consider making the log message more explicit when no pods are found:

-        logging(f"No pods found for workload {workload_name} in namespace {namespace}")
+        logging(f"No running pods found for workload '{workload_name}' in namespace '{namespace}'. Either the workload doesn't exist or all pods were terminated by kubelet.")

Line range hint 54-67: Enhance error handling for PVC retrieval.

While the namespace support is correctly implemented, the error handling could be more robust.

Consider these improvements:

 def get_workload_persistent_volume_claim_names(workload_name, namespace="default"):
     claim_names = []
     api = client.CoreV1Api()
     label_selector = f"app={workload_name}"
-    claim = api.list_namespaced_persistent_volume_claim(
-        namespace=namespace,
-        label_selector=label_selector
-    )
+    try:
+        claim = api.list_namespaced_persistent_volume_claim(
+            namespace=namespace,
+            label_selector=label_selector
+        )
+    except client.exceptions.ApiException as e:
+        if e.status == 404:
+            raise ValueError(f"Namespace '{namespace}' not found")
+        raise RuntimeError(f"Failed to list PVCs for workload '{workload_name}' in namespace '{namespace}': {e}")

     for item in claim.items:
         claim_names.append(item.metadata.name)
     claim_names.sort()

-    assert len(claim_names) > 0, f"Failed to get PVC names for workload {workload_name}"
+    assert len(claim_names) > 0, f"No PVCs found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'"
     return claim_names

Line range hint 171-187: Improve logging for better debugging.

The namespace support and retry logic are well implemented, but the logging could be more informative.

Consider these improvements:

     for i in range(retry_count):
         pods = get_workload_pods(workload_name, namespace=namespace)
+        total_pods = len(pods)
+        running_pods = []
         if len(pods) > 0:
-            running_pods = []
             for pod in pods:
                 if pod.status.phase == "Running":
                     running_pods.append(pod.metadata.name)
-            if len(running_pods) == len(pods):
+            if len(running_pods) == total_pods:
                 return

-        logging(f"Waiting for {workload_name} pods {running_pods} running, retry ({i}) ...")
+        logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running ({len(running_pods)}/{total_pods} ready), retry {i+1}/{retry_count}")
         time.sleep(retry_interval)

-    assert False, f"Timeout waiting for {workload_name} pods running"
+    assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running after {retry_count} retries"

Line range hint 232-270: Improve code structure and readability.

While the namespace support and state tracking logic are correct, the code structure could be improved.

Consider these improvements:

+    def is_pod_in_expected_state(pod, expect_state):
+        if expect_state == "ContainerCreating":
+            return pod.status.phase == "Pending"
+        if expect_state == "Terminating":
+            return hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running"
+        if expect_state == "Running":
+            return pod.status.phase == "Running"
+        return False

     def count_pod_in_specifc_state_duration(count_pod_in_state_duration, pods, expect_state):
         for pod in pods:
             pod_name = pod.metadata.name
             if pod_name not in count_pod_in_state_duration:
                 count_pod_in_state_duration[pod_name] = 0
-            elif (expect_state == "ContainerCreating" and pod.status.phase == "Pending") or \
-                ((expect_state == "Terminating" and hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running")) or \
-                (expect_state == "Running" and pod.status.phase == "Running"):
+            elif is_pod_in_expected_state(pod, expect_state):
                 count_pod_in_state_duration[pod_name] += 1
             else:
                 count_pod_in_state_duration[pod_name] = 0

Line range hint 277-282: Rename function and improve error handling.

While the namespace and label selector support are correctly implemented, there are some improvements needed.

Consider these improvements:

-def is_workload_pods_has_annotations(workload_name, annotation_key, namespace="default", label_selector=""):
+def do_workload_pods_have_annotation(workload_name, annotation_key, namespace="default", label_selector=""):
     pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector)
+    if not pods:
+        logging(f"No pods found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'")
+        return False
+
     for pod in pods:
         if not (pod.metadata.annotations and annotation_key in pod.metadata.annotations):
+            logging(f"Pod '{pod.metadata.name}' in namespace '{namespace}' is missing annotation '{annotation_key}'")
             return False
     return True
e2e/tests/negative/component_resilience.robot (2)

29-41: Refactor similar keywords to reduce duplication

The keywords Delete instance-manager of volume ${volume_id} and wait for recover and Delete instance-manager of deployment ${deployment_id} volume and wait for recover have similar steps. Refactoring common actions into a single, parameterized keyword will improve maintainability and reduce code duplication.


43-203: Extract repeated test steps into reusable keywords

The test cases include repeated sequences such as creating volumes, attaching them, writing data, and deleting instance managers. Consider extracting these sequences into reusable keywords to enhance readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa1885 and beebd8c.

📒 Files selected for processing (19)
  • e2e/keywords/backing_image.resource (1 hunks)
  • e2e/keywords/longhorn.resource (1 hunks)
  • e2e/keywords/sharemanager.resource (1 hunks)
  • e2e/keywords/workload.resource (1 hunks)
  • e2e/libs/backing_image/backing_image.py (2 hunks)
  • e2e/libs/backing_image/base.py (1 hunks)
  • e2e/libs/backing_image/crd.py (1 hunks)
  • e2e/libs/backing_image/rest.py (1 hunks)
  • e2e/libs/k8s/k8s.py (3 hunks)
  • e2e/libs/keywords/backing_image_keywords.py (1 hunks)
  • e2e/libs/keywords/k8s_keywords.py (2 hunks)
  • e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
  • e2e/libs/keywords/workload_keywords.py (2 hunks)
  • e2e/libs/sharemanager/base.py (1 hunks)
  • e2e/libs/sharemanager/crd.py (2 hunks)
  • e2e/libs/sharemanager/rest.py (1 hunks)
  • e2e/libs/sharemanager/sharemanager.py (1 hunks)
  • e2e/libs/workload/workload.py (1 hunks)
  • e2e/tests/negative/component_resilience.robot (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • e2e/keywords/backing_image.resource
  • e2e/keywords/sharemanager.resource
  • e2e/libs/backing_image/backing_image.py
  • e2e/libs/backing_image/base.py
  • e2e/libs/backing_image/rest.py
  • e2e/libs/keywords/backing_image_keywords.py
  • e2e/libs/keywords/k8s_keywords.py
  • e2e/libs/keywords/sharemanager_keywords.py
  • e2e/libs/keywords/workload_keywords.py
  • e2e/libs/sharemanager/rest.py
  • e2e/libs/sharemanager/sharemanager.py
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/k8s/k8s.py

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)


178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/sharemanager/crd.py

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🔇 Additional comments (12)
e2e/libs/sharemanager/base.py (1)

17-19: Add type hints and documentation for the wait_for_running method.

Consider adding type hints and docstring to better define the interface contract.

Note: A previous review already suggested adding a timeout parameter to this method.

e2e/libs/sharemanager/crd.py (3)

2-14: LGTM: Clean imports and initialization.

The new imports and retry parameter initialization are well-structured and necessary for the added functionality.


24-31: Duplicate of previous review comment.

A previous review already suggested adding error handling and making group, version, and namespace configurable.


33-41: Duplicate of previous review comment.

A previous review already suggested adding error handling for the delete operation.

e2e/keywords/longhorn.resource (3)

70-74: Add error handling and verification steps.

The previous review comment about error handling is still applicable. Additionally, consider verifying the volume state after pod deletion to ensure the operation's success.

#!/bin/bash
# Description: Check if there are any existing error handling patterns in the codebase
# that we should follow for consistency.

# Test: Look for similar pod deletion patterns with error handling
rg -A 5 "delete_pod.*longhorn-system"

# Test: Look for volume state verification patterns
ast-grep --pattern 'get_volume_state'

76-81: Reduce code duplication and enhance error handling.

The previous review comment about code duplication and error handling improvements is still applicable.


83-84: Enhance component recovery verification.

  1. The previous review comment about adding a configurable timeout parameter is still applicable.
  2. Ensure the keyword properly handles all Longhorn component types during recovery scenarios.
#!/bin/bash
# Description: Verify that all Longhorn component types are properly handled.

# Test: Check for any component types defined in the codebase that we should explicitly wait for
rg -A 5 "longhorn-system.*component.*type"

# Test: Look for any existing component recovery patterns
ast-grep --pattern 'wait_for.*component.*running'
e2e/libs/backing_image/crd.py (1)

1-13: LGTM! Class structure and imports are well-organized.

The class follows good practices with appropriate inheritance and initialization.

e2e/libs/k8s/k8s.py (1)

8-9: Skip comment - covered by past review.

🧰 Tools
🪛 Ruff

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)

e2e/keywords/workload.resource (1)

200-215: Verify coverage of all Longhorn component types.

The implementation supports the PR's objective of testing Longhorn components recovery. Let's verify if all component types are covered.

✅ Verification successful

Implementation correctly covers all Longhorn component types

Based on the verification results, the implementation adequately handles all major Longhorn component types:

  • The code supports specific component deletion through labels for:
    • engine-image
    • instance-manager
    • Other components without specific labels (CSI components, UI, etc.)

This is validated by:

  1. Test cases in component_resilience.robot showing comprehensive coverage of components: CSI plugin, attacher, provisioner, resizer, snapshotter, manager, engine-image, instance-manager, UI, and driver-deployer
  2. Consistent label handling in the codebase for component identification (longhorn.io/component)
  3. Support for both node-specific and general pod deletion scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all Longhorn component types are covered in the test files
# Expected: Find references to all major Longhorn component types in the tests

echo "Checking for Longhorn component types in test files..."
rg -l "longhorn.io/component=" --type yaml

echo "Checking for component deletion test cases..."
rg "Delete Longhorn .* pod" --type robot

echo "Checking for recovery test implementations..."
rg "Wait.*recovery" --type robot

Length of output: 1645


Script:

#!/bin/bash
# Let's check the component labels in the Longhorn YAML manifests
echo "Checking component labels in Longhorn manifests..."
rg "component: " --type yaml -B 2 -A 2

# Let's also check for any other potential component types in the codebase
echo -e "\nChecking for other component references..."
rg "longhorn.io/component" -B 2 -A 2

Length of output: 3465

e2e/libs/workload/workload.py (1)

Line range hint 190-229: LGTM! Consider improving error messages.

The async implementation and stability checking logic are solid. The namespace support is correctly implemented.

Let's verify the async implementation is used correctly throughout the codebase:

Consider improving the error messages:

-        logging(f"Waiting for {workload_name} pods {wait_for_stable_pod} stable, retry ({i}) ...")
+        logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Unstable pods: {wait_for_stable_pod}, retry {i+1}/{retry_count}")

-    assert False, f"Timeout waiting for {workload_name} pods {wait_for_stable_pod} stable)"
+    assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Remaining unstable pods: {wait_for_stable_pod}"
✅ Verification successful

Async implementation is correctly used across the codebase

The verification shows that wait_for_workload_pods_stable is:

  • Properly defined as an async function
  • Correctly awaited in workload_keywords.py using await and asyncio.create_task()
  • Used in robot framework keywords (deployment.resource and statefulset.resource) which handle async/await internally
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the async function is properly awaited
# Expected: All calls to wait_for_workload_pods_stable should use await

rg -l "wait_for_workload_pods_stable" | xargs rg "wait_for_workload_pods_stable" -B 2 -A 2

Length of output: 2600

e2e/tests/negative/component_resilience.robot (1)

26-26: Consider making ${DATA_ENGINE} configurable or remove redundant conditionals

${DATA_ENGINE} is set to 'v1' at line 26. All conditionals checking if ${DATA_ENGINE} == 'v1' will always evaluate to True. If testing different engine versions is not intended, consider removing these conditionals to simplify the code. Alternatively, make ${DATA_ENGINE} configurable to allow testing with other engine versions.

Comment on lines +21 to +23
@abstractmethod
def wait_for_restart(self, name, last_creation_time):
return NotImplemented

🛠️ Refactor suggestion

Add type hints, documentation, and timeout parameter for consistency.

The method should follow similar patterns to wait_for_running, including a timeout parameter for consistency.

     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool:
+        """Wait for a share manager to restart.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp to compare against
+            timeout: Maximum time to wait in seconds (default: 300)
+
+        Returns:
+            bool: True if restart is confirmed, False if timeout occurs
+        """
         return NotImplemented
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@abstractmethod
def wait_for_restart(self, name, last_creation_time):
return NotImplemented
@abstractmethod
def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool:
"""Wait for a share manager to restart.
Args:
name: Name of the share manager
last_creation_time: Previous creation timestamp to compare against
timeout: Maximum time to wait in seconds (default: 300)
Returns:
bool: True if restart is confirmed, False if timeout occurs
"""
return NotImplemented

Comment on lines +54 to +68
def wait_for_restart(self, name, last_creation_time):
for i in range(self.retry_count):
time.sleep(self.retry_interval)
try:
sharemanager = self.get(name)
except Exception as e:
logging(f"Finding sharemanager {name} failed with error {e}")
continue

creation_time = sharemanager["metadata"]["creationTimestamp"]
fmt = "%Y-%m-%dT%H:%M:%SZ"
if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
return

assert False, f"Wait share manager {name} restart failed ..."

🛠️ Refactor suggestion

Simplify timestamp comparison logic.

The datetime comparison logic could be simplified and made more robust:

  1. Parse timestamps once at the start
  2. Use UTC consistently
  3. Add validation for timestamp format

Here's a suggested improvement:

     def wait_for_restart(self, name, last_creation_time):
+        fmt = "%Y-%m-%dT%H:%M:%SZ"
+        try:
+            last_time = datetime.strptime(last_creation_time, fmt)
+        except ValueError as e:
+            raise ValueError(f"Invalid last_creation_time format: {e}")
+
         for i in range(self.retry_count):
             try:
                 sharemanager = self.get(name)                    
                 creation_time = sharemanager["metadata"]["creationTimestamp"]
-                fmt = "%Y-%m-%dT%H:%M:%SZ"
-                if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
+                current_time = datetime.strptime(creation_time, fmt)
+                if current_time > last_time:
                     return
             except Exception as e:
                 logging(f"Finding sharemanager {name} failed with error {e}")
                 continue

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

Comment on lines +46 to +54
def delete_backing_image_manager(self, name):
logging(f"deleting backing image manager {name} ...")
self.obj_api.delete_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)

🛠️ Refactor suggestion

Add error handling to delete_backing_image_manager.

The method should handle potential API exceptions and verify deletion success.

 def delete_backing_image_manager(self, name):
     logging(f"deleting backing image manager {name} ...")
-    self.obj_api.delete_namespaced_custom_object(
-        group="longhorn.io",
-        version="v1beta2",
-        namespace="longhorn-system",
-        plural="backingimagemanagers",
-        name=name
-    )
+    try:
+        self.obj_api.delete_namespaced_custom_object(
+            group="longhorn.io",
+            version="v1beta2",
+            namespace="longhorn-system",
+            plural="backingimagemanagers",
+            name=name
+        )
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to delete backing image manager {name}: {e}")
+        raise
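
The comment also asks to verify deletion success, which the diff above does not cover. A hypothetical companion helper (the name wait_for_backing_image_manager_deleted and the retry_count / retry_interval attributes are assumptions, not part of the PR) could poll the CR until the API returns 404, reusing the same obj_api client:

    def wait_for_backing_image_manager_deleted(self, name):
        # Illustrative sketch: poll until the backingimagemanager CR is gone.
        for _ in range(self.retry_count):
            time.sleep(self.retry_interval)
            try:
                self.obj_api.get_namespaced_custom_object(
                    group="longhorn.io",
                    version="v1beta2",
                    namespace="longhorn-system",
                    plural="backingimagemanagers",
                    name=name
                )
            except client.exceptions.ApiException as e:
                if e.status == 404:
                    return
                raise
        raise AssertionError(f"backing image manager {name} still exists after deletion")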

Comment on lines +175 to +195
def wait_for_namespace_pods_running(namespace):
    retry_count, retry_interval = get_retry_count_and_interval()

    for i in range(retry_count):
        time.sleep(retry_interval)
        pod_list = list_namespace_pods(namespace)
        all_running = True

        for pod in pod_list.items:
            pod_name = pod.metadata.name
            pod_status = pod.status.phase

            if pod_status != "Running":
                logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
                all_running = False

        if all_running:
            logging(f"All pods in namespace {namespace} are in Running state!")
            return

    assert False, f"wait all pod in namespace {namespace} running failed"

🛠️ Refactor suggestion

Add error handling and documentation to the function.

The function needs additional error handling and documentation:

  1. Add docstring describing purpose, parameters, and return value
  2. Handle empty pod list scenario
  3. Add safety checks for pod status
  4. Consider CrashLoopBackOff scenarios

Here's the suggested implementation:

 def wait_for_namespace_pods_running(namespace):    
+    """Wait for all pods in the specified namespace to reach Running state.
+    
+    Args:
+        namespace (str): The namespace to check pods in
+        
+    Raises:
+        AssertionError: If pods don't reach Running state within retry limit
+    """
     retry_count, retry_interval = get_retry_count_and_interval()
 
     for i in range(retry_count):        
         time.sleep(retry_interval)
         pod_list = list_namespace_pods(namespace)        
+        if not pod_list.items:
+            logging(f"No pods found in namespace {namespace}")
+            return
+            
         all_running = True
 
         for pod in pod_list.items:
             pod_name = pod.metadata.name
-            pod_status = pod.status.phase
+            # Safety check for pod status
+            if not pod.status or not pod.status.phase:
+                logging(f"Pod {pod_name} status not available")
+                all_running = False
+                continue
+                
+            pod_status = pod.status.phase
+            # Check for CrashLoopBackOff
+            if pod.status.container_statuses:
+                for container in pod.status.container_statuses:
+                    if container.state.waiting and \
+                       container.state.waiting.reason == 'CrashLoopBackOff':
+                        logging(f"Pod {pod_name} container {container.name} " \
+                               "is in CrashLoopBackOff state")
+                        all_running = False
 
             if pod_status != "Running":
                 logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
                 all_running = False
 
         if all_running:
             logging(f"All pods in namespace {namespace} are in Running state!")
             return
 
     raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state")
🧰 Tools
🪛 Ruff

178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)
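
As a usage sketch, the keyword layer only needs to call the helper with the Longhorn namespace; the import path below is assumed for illustration, not taken from the PR:

    # Hypothetical caller; adjust the import to wherever the helper lives in e2e/libs.
    from pod import wait_for_namespace_pods_running

    wait_for_namespace_pods_running("longhorn-system")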

Member
@yangchiu yangchiu left a comment

LGTM

@khushboo-rancher khushboo-rancher enabled auto-merge (rebase) November 11, 2024 21:43