
test(robot): add test case Test Longhorn components recovery #2143

Merged
merged 2 commits on Nov 11, 2024

Conversation

chriscchien
Contributor

@chriscchien chriscchien commented Oct 15, 2024

Which issue(s) this PR fixes:

Issue #9536

What this PR does / why we need it:

Automate the manual test case Test Longhorn components recovery into the sub test cases below:

  • Test Longhorn components recovery
  • Test Longhorn volume recovery
  • Test Longhorn backing image volume recovery
  • Test Longhorn dynamic provisioned RWX volume recovery
  • Test Longhorn dynamic provisioned RWO volume recovery

Special notes for your reviewer:

Tested on my local env

Additional documentation or context

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new keywords for managing backing images, Longhorn components, and share managers, enhancing operational capabilities.
    • Added a new test suite for validating the resilience of Longhorn components and volumes under failure conditions.
  • Improvements

    • Enhanced pod management functionality with new methods for waiting on pod statuses and retrieving pods from specific namespaces.
    • Improved logging and error handling in various components for better operational visibility.
  • Bug Fixes

    • Adjusted method signatures to allow for namespace specifications, increasing flexibility in managing resources.

@chriscchien chriscchien self-assigned this Oct 15, 2024
@chriscchien chriscchien requested a review from a team as a code owner October 15, 2024 09:02
@chriscchien chriscchien requested a review from a team October 22, 2024 06:46

coderabbitai bot commented Oct 30, 2024

Walkthrough

This pull request introduces multiple enhancements across various resource management functionalities in the Longhorn system. New keywords related to backing images, Longhorn components, and share managers have been added to facilitate operations such as deletion, waiting for operational status, and ensuring recovery. Additionally, modifications to existing methods improve their flexibility by allowing namespace specifications. A new test suite has been created to validate the resilience of Longhorn components and volumes under failure conditions, incorporating various test scenarios.

Changes

File Path Change Summary
e2e/keywords/backing_image.resource Added keywords: Delete backing image managers and wait for recreation, Wait backing image managers running.
e2e/keywords/longhorn.resource Added keywords: Delete instance-manager of volume, Delete instance-manager of deployment volume, Wait for Longhorn components all running.
e2e/keywords/sharemanager.resource Added keywords: Delete sharemanager of deployment and wait for recreation, Wait for sharemanager of deployment running.
e2e/keywords/workload.resource Updated Delete Longhorn pod on node to include conditional logic for label_selector; added Delete Longhorn pod.
e2e/libs/backing_image/backing_image.py Added methods: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager.
e2e/libs/backing_image/base.py Added abstract methods: wait_all_backing_image_managers_running, list_backing_image_manager, delete_backing_image_manager, wait_backing_image_manager_restart.
e2e/libs/backing_image/crd.py Introduced class CRD with methods for managing backing images, including create, get, delete, and state management methods.
e2e/libs/backing_image/rest.py Added methods: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager.
e2e/libs/k8s/k8s.py Added function: wait_for_namespace_pods_running(namespace); modified wait_all_pods_evicted to include logging.
e2e/libs/keywords/backing_image_keywords.py Added methods for backing image management: delete_backing_image_manager, wait_all_backing_image_managers_running, wait_backing_image_manager_restart, list_backing_image_manager, delete_all_backing_image_managers_and_wait_for_recreation.
e2e/libs/keywords/k8s_keywords.py Added method: wait_for_namespace_pods_running(namespace).
e2e/libs/keywords/sharemanager_keywords.py Added methods: delete_sharemanager, delete_sharemanager_and_wait_for_recreation, wait_for_share_manager_running.
e2e/libs/keywords/workload_keywords.py Updated several methods to include namespace and label_selector parameters for improved workload management.
e2e/libs/sharemanager/base.py Added abstract methods: get(name), delete(name), wait_for_running(name), wait_for_restart(name, last_creation_time).
e2e/libs/sharemanager/crd.py Updated constructor and added methods for managing share managers: get, delete, wait_for_running, wait_for_restart.
e2e/libs/sharemanager/rest.py Added methods: get, delete, wait_for_running, wait_for_restart.
e2e/libs/sharemanager/sharemanager.py Added methods that delegate functionality to the sharemanager instance: delete, wait_for_running, get, wait_for_restart.
e2e/libs/workload/workload.py Updated get_workload_pod_names and get_workload_pods to include namespace parameter for flexibility.
e2e/tests/negative/component_resilience.robot Introduced a new test suite for negative test cases on Longhorn resilience, including multiple keywords and test cases for component recovery.

Possibly related PRs

Suggested reviewers

  • yangchiu

🐰 In the meadow where changes bloom,
New keywords sprout, dispelling gloom.
Backing images, share managers too,
All wait and delete, just like new!
With tests to ensure they stand tall,
Longhorn's resilience will conquer all! 🌼




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 27

🧹 Outside diff range and nitpick comments (24)
e2e/libs/sharemanager/base.py (1)

21-23: Consider adding type hints and docstring for wait_for_restart.

The wait_for_restart method would benefit from type hints and documentation, especially for the last_creation_time parameter, as its type and format might not be immediately obvious to implementers.

     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str) -> bool:
+        """Wait for a share manager to restart after the specified creation time.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp to compare against
+
+        Returns:
+            bool: True if the share manager has restarted, False otherwise
+        """
         return NotImplemented
e2e/libs/sharemanager/rest.py (4)

14-15: Add type hints and docstring to get method

The get method lacks type hints and documentation. This information is crucial for maintainability and usage understanding.

-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Get share manager details by name.
+        
+        Args:
+            name: Name of the share manager
+            
+        Returns:
+            Share manager details
+            
+        Raises:
+            ApiException: If the API request fails
+        """

17-18: Add type hints and docstring to delete method

The delete method lacks type hints and documentation.

-    def delete(self, name):
+    def delete(self, name: str) -> None:
+        """Delete a share manager.
+        
+        Args:
+            name: Name of the share manager to delete
+            
+        Raises:
+            ApiException: If the deletion fails
+        """

20-21: Add type hints and docstring to wait_for_running method

The wait_for_running method lacks type hints and documentation.

-    def wait_for_running(self, name):
+    def wait_for_running(self, name: str, timeout: int = 300) -> None:
+        """Wait for share manager to reach running state.
+        
+        Args:
+            name: Name of the share manager
+            timeout: Maximum time to wait in seconds
+            
+        Raises:
+            TimeoutError: If the share manager doesn't reach running state
+            ApiException: If the status check fails
+        """

23-24: Add type hints and docstring to wait_for_restart method

The wait_for_restart method lacks type hints and documentation.

-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> None:
+        """Wait for share manager to restart.
+        
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp
+            timeout: Maximum time to wait in seconds
+            
+        Raises:
+            TimeoutError: If the share manager doesn't restart
+            ApiException: If the status check fails
+        """
e2e/libs/sharemanager/sharemanager.py (1)

20-31: Consider adding docstrings for better maintainability.

While the implementation is clean and follows the strategy pattern correctly, adding docstrings would improve maintainability and help other developers understand the purpose and expected behavior of each method.

Example improvement:

 def delete(self, name):
+    """Delete a share manager instance.
+    
+    Args:
+        name (str): Name of the share manager to delete
+        
+    Returns:
+        The result from the underlying implementation
+    """
     return self.sharemanager.delete(name)
e2e/keywords/backing_image.resource (1)

29-30: Consider adding documentation for the keyword.

The keyword is well-implemented, but adding documentation would help users understand its purpose and expected behavior.

-Wait backing image managers running
+Wait backing image managers running
+    [Documentation]    Waits until all backing image managers are in running state.
e2e/libs/backing_image/base.py (1)

38-40: Consider using plural form in method name for consistency.

The method list_backing_image_manager returns a collection of managers, so consider renaming it to list_backing_image_managers for consistency with typical naming conventions for methods returning collections.

-    def list_backing_image_manager(self):
+    def list_backing_image_managers(self):
         return NotImplemented
e2e/keywords/sharemanager.resource (3)

24-27: Add documentation for the new keyword.

The implementation looks good and aligns with the PR objectives. Consider adding documentation to describe the purpose, parameters, and expected behavior of this keyword.

 Delete sharemanager of deployment ${deployment_id} and wait for recreation
+    [Documentation]    Deletes the sharemanager associated with the given deployment ID and waits for it to be recreated.
+    ...                
+    ...                Arguments:
+    ...                - deployment_id: The ID of the deployment whose sharemanager should be deleted
     ${deployment_name} =   generate_name_with_suffix    deployment    ${deployment_id}
     ${volume_name} =    get_workload_volume_name    ${deployment_name}
     delete_sharemanager_and_wait_for_recreation    ${volume_name}

29-32: Add documentation and consider timeout parameter.

The implementation looks good but could benefit from some enhancements:

  1. Add documentation to describe the keyword's purpose and parameters
  2. Consider adding a timeout parameter to control how long to wait
 Wait for sharemanager of deployment ${deployment_id} running
+    [Documentation]    Waits for the sharemanager associated with the given deployment ID to be in running state.
+    ...                
+    ...                Arguments:
+    ...                - deployment_id: The ID of the deployment whose sharemanager should be monitored
+    [Arguments]    ${deployment_id}    ${timeout}=300
     ${deployment_name} =   generate_name_with_suffix    deployment    ${deployment_id}
     ${volume_name} =    get_workload_volume_name    ${deployment_name}
-    wait_for_share_manager_running    ${volume_name}
+    wait_for_share_manager_running    ${volume_name}    timeout=${timeout}

24-32: Implementation aligns well with PR objectives.

The new keywords provide essential functionality for testing sharemanager recovery scenarios:

  1. Delete sharemanager of deployment enables testing recovery by triggering sharemanager recreation
  2. Wait for sharemanager of deployment running allows verification of successful recovery

These additions will effectively support the automation of Longhorn components recovery test cases as outlined in the PR objectives.

Consider adding error handling keywords to handle cases where recovery fails or times out, which would make the test suite more robust.

e2e/libs/keywords/backing_image_keywords.py (1)

23-42: Consider implementing a retry mechanism for resilience testing.

Since this code is part of component recovery testing, consider implementing a retry mechanism with exponential backoff for the wait operations. This would make the tests more resilient and better simulate real-world recovery scenarios.

Key recommendations:

  1. Create a common retry decorator/utility for all wait operations
  2. Add configurable retry parameters (max attempts, backoff factor)
  3. Implement detailed logging of retry attempts for test debugging
  4. Consider adding assertions about the time taken for recovery

Would you like me to provide an example implementation of the retry mechanism?
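For reference, one possible shape for such a retry utility is sketched below. This is a minimal, generic sketch that assumes nothing beyond the Python standard library; the names retry_with_backoff, max_attempts, and backoff_factor are illustrative and do not exist in the test framework today.

# Illustrative sketch: a generic retry decorator with exponential backoff for wait operations.
import functools
import time

def retry_with_backoff(max_attempts=5, backoff_factor=2, initial_delay=1):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise
                    # Log each failed attempt so test debugging can see the retry history.
                    print(f"Attempt {attempt}/{max_attempts} failed: {exc}; retrying in {delay}s")
                    time.sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator

# Example usage: wrap any wait/poll helper so transient failures are retried.
@retry_with_backoff(max_attempts=3, backoff_factor=2)
def wait_until_ready():
    ...  # replace with a real readiness check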

e2e/libs/keywords/sharemanager_keywords.py (3)

51-52: Add documentation and error handling.

Consider adding a docstring and basic error handling to improve test maintainability and debugging:

 def delete_sharemanager(self, name):
+    """Delete a share manager instance by name.
+    
+    Args:
+        name (str): Name of the share manager to delete
+    
+    Returns:
+        The result of the deletion operation
+    
+    Raises:
+        Exception: If deletion fails
+    """
+    try:
         return self.sharemanager.delete(name)
+    except Exception as e:
+        logging(f"Failed to delete share manager {name}: {str(e)}")
+        raise

60-61: Add documentation and error handling for wait operation.

Consider adding a docstring and error handling to improve test reliability:

 def wait_for_share_manager_running(self, name):
+    """Wait for a share manager to reach running state.
+    
+    Args:
+        name (str): Name of the share manager
+    
+    Raises:
+        TimeoutError: If the share manager doesn't reach running state
+        ValueError: If name is empty
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+
+    try:
         return self.sharemanager.wait_for_running(name)
+    except Exception as e:
+        logging(f"Failed waiting for share manager {name} to run: {str(e)}")
+        raise

50-61: Consider architectural improvements for better maintainability.

The new methods are well-integrated, but consider these improvements:

  1. Extract common timeout and wait logic into a base method to avoid duplication
  2. Add integration tests to verify the recovery scenarios
  3. Consider using a configuration object for timeouts and retry settings

Example of a base wait method:

def _wait_with_timeout(self, operation, timeout=300, interval=2):
    """Base method for wait operations with timeout.
    
    Args:
        operation (callable): Function to execute
        timeout (int): Maximum wait time in seconds
        interval (int): Sleep interval between retries
    """
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            return operation()
        except Exception as e:
            if time.time() - start_time >= timeout:
                raise TimeoutError(f"Operation timed out: {str(e)}")
            time.sleep(interval)
e2e/libs/sharemanager/crd.py (1)

63-66: Consider extracting timestamp comparison logic

The datetime parsing and comparison logic could be moved to a utility function for reuse across other test cases, especially since this PR involves multiple recovery test scenarios.

Consider creating a utility function like:

def is_newer_timestamp(new_time: str, old_time: str, fmt: str = "%Y-%m-%dT%H:%M:%SZ") -> bool:
    return datetime.strptime(new_time, fmt) > datetime.strptime(old_time, fmt)
e2e/libs/keywords/k8s_keywords.py (1)

83-84: Consider adding a docstring for better maintainability.

The method implementation looks good and follows the class's pattern of wrapping k8s module functions. However, adding a docstring would improve maintainability by documenting the method's purpose and parameters.

Consider adding documentation like this:

 def wait_for_namespace_pods_running(self, namespace):
+    """Wait for all pods in the specified namespace to be in running state.
+
+    Args:
+        namespace (str): The namespace to check for running pods
+
+    Returns:
+        bool: True if all pods are running, False otherwise
+    """
     return wait_for_namespace_pods_running(namespace)
e2e/libs/backing_image/rest.py (1)

113-113: Fix whitespace consistency

There are extra blank lines around the new methods.

Apply this diff to maintain consistent spacing:

-    
     def delete_backing_image_manager(self, name):
e2e/libs/keywords/workload_keywords.py (2)

64-70: LGTM: Enhanced pod selection capabilities.

The addition of namespace and label_selector parameters improves the flexibility of pod selection and deletion.

Consider adding docstring to document the parameters, especially the format expected for label_selector:

def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
    """Delete workload pod on specific node.
    
    Args:
        workload_name (str): Name of the workload
        node_name (str): Name of the node
        namespace (str, optional): Kubernetes namespace. Defaults to "default"
        label_selector (str, optional): Kubernetes label selector (e.g. "app=nginx"). Defaults to ""
    """

49-51: LGTM: Consistent namespace parameter implementation.

The addition of namespace parameters across methods follows a consistent pattern and improves the test framework's flexibility while maintaining backward compatibility.

Consider creating a base class or configuration object to store common parameters like default namespace. This would make it easier to modify defaults across all methods and reduce parameter repetition.

Example:

class WorkloadConfig:
    DEFAULT_NAMESPACE = "default"

class workload_keywords:
    def __init__(self):
        self.config = WorkloadConfig()
        # ... rest of init ...

    def delete_pod(self, pod_name, namespace=None):
        namespace = namespace or self.config.DEFAULT_NAMESPACE
        # ... rest of method ...

Also applies to: 64-70, 71-72

e2e/keywords/workload.resource (3)

190-201: Consider improving maintainability and consistency.

A few suggestions to enhance the code:

  1. Remove unnecessary empty lines for consistency with the rest of the file.
  2. Consider using a mapping for label selectors to improve maintainability.

Apply this diff to implement the suggestions:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
-    
    ${node_name} =    get_node_by_index    ${node_id}
-    
    IF    '${workload_name}' == 'engine-image'
        ${label_selector} =    Set Variable    longhorn.io/component=engine-image       
    ELSE IF    '${workload_name}' == 'instance-manager'
        ${label_selector} =    Set Variable    longhorn.io/component=instance-manager
    ELSE
        ${label_selector} =    Set Variable    ${EMPTY}
    END
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system    ${label_selector}

Additionally, consider creating a variable at the top of the file to map workload names to their label selectors:

*** Variables ***
&{LONGHORN_COMPONENT_LABELS}    engine-image=longhorn.io/component=engine-image    instance-manager=longhorn.io/component=instance-manager

Then simplify the keyword:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
    ${node_name} =    get_node_by_index    ${node_id}
    ${label_selector} =    Get From Dictionary    ${LONGHORN_COMPONENT_LABELS}    ${workload_name}    ${EMPTY}
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system    ${label_selector}

202-205: Add documentation and error handling.

The keyword would benefit from:

  1. Documentation explaining its purpose and usage.
  2. Error handling for cases where the pod doesn't exist.
  3. Verification that the pod was successfully deleted.

Apply this diff to implement the suggestions:

Delete Longhorn ${workload_kind} ${workload_name} pod
+    [Documentation]    Deletes a Longhorn pod of specified workload kind and name from the longhorn-system namespace.
+    ...    Logs the pod name before deletion and verifies successful deletion.
+    ...    
+    ...    Arguments:
+    ...    - workload_kind: The kind of workload (e.g., deployment, statefulset)
+    ...    - workload_name: The name of the workload to delete
     ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
+    Should Not Be Empty    ${pod_name}    msg=No pod found for workload ${workload_name}
     Log    ${pod_name}
     delete_pod    ${pod_name}     longhorn-system
+    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system

190-205: Consider adding more verification steps for component recovery testing.

Given that these keywords are part of automating test cases for Longhorn components recovery, consider adding more verification steps:

  1. Verify that all associated resources are cleaned up after pod deletion.
  2. Add wait conditions to ensure the system is in a known state before proceeding with recovery tests.

Would you like me to provide examples of additional verification steps that could be added?

e2e/libs/workload/workload.py (1)

Line range hint 24-45: Consider standardizing namespace handling across all functions

While the core pod retrieval functions now support custom namespaces, several other functions in this file still hardcode the 'default' namespace (e.g., write_pod_random_data, write_pod_large_data). Consider:

  1. Adding namespace parameters consistently across all pod-related functions
  2. Creating a module-level default namespace configuration
  3. Updating all exec/stream operations to use the specified namespace

Example pattern to consider:

# At module level
DEFAULT_NAMESPACE = "default"

def write_pod_random_data(pod_name, size_in_mb, file_name,
                         data_directory="/data", namespace=DEFAULT_NAMESPACE):
    # ... use namespace parameter in api calls ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 942e57e and 7597939.

📒 Files selected for processing (19)
  • e2e/keywords/backing_image.resource (1 hunks)
  • e2e/keywords/longhorn.resource (1 hunks)
  • e2e/keywords/sharemanager.resource (1 hunks)
  • e2e/keywords/workload.resource (1 hunks)
  • e2e/libs/backing_image/backing_image.py (2 hunks)
  • e2e/libs/backing_image/base.py (1 hunks)
  • e2e/libs/backing_image/crd.py (1 hunks)
  • e2e/libs/backing_image/rest.py (1 hunks)
  • e2e/libs/k8s/k8s.py (3 hunks)
  • e2e/libs/keywords/backing_image_keywords.py (1 hunks)
  • e2e/libs/keywords/k8s_keywords.py (2 hunks)
  • e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
  • e2e/libs/keywords/workload_keywords.py (2 hunks)
  • e2e/libs/sharemanager/base.py (1 hunks)
  • e2e/libs/sharemanager/crd.py (2 hunks)
  • e2e/libs/sharemanager/rest.py (1 hunks)
  • e2e/libs/sharemanager/sharemanager.py (1 hunks)
  • e2e/libs/workload/workload.py (1 hunks)
  • e2e/tests/negative/component_resilience.robot (1 hunks)
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/k8s/k8s.py

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)


178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/sharemanager/crd.py

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🔇 Additional comments (20)
e2e/libs/sharemanager/base.py (1)

8-23: LGTM! Well-structured abstract interface for share manager operations.

The new abstract methods provide a clean and comprehensive interface for share manager lifecycle operations, which aligns well with the PR's objective of testing component recovery. The method signatures are clear and follow consistent patterns.

e2e/libs/sharemanager/sharemanager.py (3)

21-22: LGTM! Clean delegation to strategy implementation.

The delete method follows the strategy pattern correctly and maintains a clean interface.


24-25: LGTM! Consistent with recovery testing requirements.

The wait_for_running method aligns well with the PR's objective of testing component recovery.


27-28: LGTM! Simple and focused getter implementation.

The get method provides a clean interface to retrieve share manager instances.

e2e/keywords/backing_image.resource (1)

26-27: LGTM! The keyword follows Robot Framework conventions.

The keyword is well-named and properly maps to its underlying implementation for testing backing image manager recovery.

e2e/libs/backing_image/base.py (1)

33-48: Well-structured additions for recovery testing!

The new abstract methods form a comprehensive interface for managing backing image managers, which aligns well with the PR's objective of testing Longhorn components recovery. The methods provide the necessary operations for:

  • Monitoring manager status (wait_all_backing_image_managers_running)
  • Managing lifecycle (delete_backing_image_manager, wait_backing_image_manager_restart)
  • Retrieving state (list_backing_image_manager)

These additions will enable thorough testing of recovery scenarios.
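For readers without the diff open, the interface described above is roughly the following shape; this sketch is reconstructed from the method names listed in this review and may not match e2e/libs/backing_image/base.py exactly.

# Approximate shape of the abstract interface, reconstructed from this review.
from abc import ABC, abstractmethod

class Base(ABC):

    @abstractmethod
    def wait_all_backing_image_managers_running(self):
        return NotImplemented

    @abstractmethod
    def list_backing_image_manager(self):
        return NotImplemented

    @abstractmethod
    def delete_backing_image_manager(self, name):
        return NotImplemented

    @abstractmethod
    def wait_backing_image_manager_restart(self, name, last_creation_time):
        return NotImplemented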

e2e/keywords/sharemanager.resource (1)

23-23: LGTM! Good spacing.

The added empty line improves readability by properly separating keyword definitions.

e2e/libs/keywords/backing_image_keywords.py (1)

36-42: ⚠️ Potential issue

Improve robustness of the manager deletion and recreation process.

The current implementation has several potential issues:

  1. No error handling for API responses
  2. Possible race conditions during iteration
  3. No timeout for the complete operation
  4. Lack of logging for test debugging

Here's a suggested improvement:

     def delete_all_backing_image_managers_and_wait_for_recreation(self):
+        """Delete all backing image managers and ensure they are recreated.
+        
+        Raises:
+            TimeoutError: If operation exceeds timeout
+            RuntimeError: If API calls fail
+        """
+        try:
             backing_image_managers = self.backing_image.list_backing_image_manager()
+            if not backing_image_managers.get("items"):
+                return
+            
+            # Store all managers first to avoid race conditions
+            managers_to_delete = [
+                (m["metadata"]["name"], m["metadata"]["creationTimestamp"])
+                for m in backing_image_managers["items"]
+            ]
+            
-            for backing_image in backing_image_managers["items"]:
-                name = backing_image["metadata"]["name"]
-                last_creation_time = backing_image["metadata"]["creationTimestamp"]
+            for name, last_creation_time in managers_to_delete:
+                print(f"Deleting backing image manager: {name}")
                 self.backing_image.delete_backing_image_manager(name)
                 self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+        except Exception as e:
+            raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")

Let's verify if there are any existing error handling patterns in the codebase:

✅ Verification successful

Let me gather more information about the logging patterns and timeout handling in the codebase.


Based on the analysis of the codebase, I can now provide the final response:

Add error handling and logging to backing image manager operations

The codebase shows consistent patterns of error handling and logging across other components. Let's align with these patterns:

     def delete_all_backing_image_managers_and_wait_for_recreation(self):
+        """Delete all backing image managers and wait for their recreation.
+        
+        Raises:
+            RuntimeError: If API calls fail
+        """
+        logging(f'Cleaning up backing image managers')
+        try:
             backing_image_managers = self.backing_image.list_backing_image_manager()
+            if not backing_image_managers.get("items"):
+                return
+
             for backing_image in backing_image_managers["items"]:
                 name = backing_image["metadata"]["name"]
                 last_creation_time = backing_image["metadata"]["creationTimestamp"]
+                logging(f'Deleting backing image manager: {name}')
                 self.backing_image.delete_backing_image_manager(name)
                 self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+        except Exception as e:
+            raise RuntimeError(f"Failed to manage backing image managers: {str(e)}")

The codebase shows:

  1. Consistent error handling with try-except blocks raising RuntimeError
  2. Extensive use of logging for operations and their progress
  3. Null checks before processing lists
  4. No explicit timeout handling in similar wait operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for error handling patterns in similar test files
rg -A 3 "try:|except:|raise" "e2e/libs/keywords/"

Length of output: 1531


Script:

#!/bin/bash
# Search for logging patterns
rg -A 2 "logging\(" "e2e/libs/keywords/"

# Search for timeout patterns and wait operations
rg -A 2 "wait.*timeout|timeout.*wait" "e2e/libs/keywords/"

Length of output: 25649

e2e/libs/backing_image/backing_image.py (1)

3-3: LGTM: Import statement is correctly placed

The CRD import is appropriately grouped with other backing_image related imports.

e2e/libs/sharemanager/crd.py (1)

2-14: LGTM: Well-structured initialization with configurable retries

The addition of retry configuration through get_retry_count_and_interval() is a good practice for test automation, allowing for environment-specific tuning.

e2e/libs/keywords/k8s_keywords.py (1)

12-12: LGTM!

The import follows the established pattern and is properly placed with other k8s imports.

e2e/keywords/longhorn.resource (1)

70-84: Verify test coverage for recovery scenarios

The new keywords align well with the PR objectives for testing Longhorn components recovery. Let's verify the test coverage:

✅ Verification successful

Test coverage for recovery scenarios is comprehensive

The implementation provides thorough test coverage for Longhorn components recovery:

  • Test cases cover all key recovery scenarios:
    • Component recovery (CSI plugins, managers, UI, etc.)
    • Volume recovery for different volume types (RWO, RWX)
    • Backing image volume recovery
    • Recovery during replica rebuilding
    • Share manager recovery for RWX volumes

The new keywords are effectively utilized in well-structured test cases that verify both the failure injection and recovery verification, with data integrity checks.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are corresponding test files using these keywords
# and verify the test scenarios cover all recovery cases mentioned in PR objectives

echo "Checking for test files using the new keywords..."
rg -l "Delete instance-manager of (volume|deployment)" --type robot

echo "Verifying test scenarios for different volume types..."
rg -l "Test.*recovery.*(RWO|RWX|backing image)" --type robot

Length of output: 406


Script:

#!/bin/bash
# Let's examine the actual test scenarios in the component_resilience.robot file
echo "Checking test scenarios in component_resilience.robot..."
cat e2e/tests/negative/component_resilience.robot

# Also check for any other related test files that might be using these keywords
echo -e "\nChecking for other potential test files..."
rg -l "longhorn.resource" --type robot

Length of output: 11001

e2e/libs/backing_image/rest.py (1)

114-124: ⚠️ Potential issue

Implement backing image manager methods following existing patterns

The newly added methods are currently returning NotImplemented. Based on the existing patterns in the class and the PR objectives for testing Longhorn components recovery, these methods should be implemented.

Here's a suggested implementation following the class patterns:

def delete_backing_image_manager(self, name):
-    return NotImplemented
+    logging(f"Deleting backing image manager {name}")
+    manager = get_longhorn_client().by_id_backing_image_manager(name)
+    get_longhorn_client().delete(manager)

def wait_all_backing_image_managers_running(self):
-    return NotImplemented
+    for i in range(self.retry_count):
+        logging(f"Waiting for all backing image managers running... ({i})")
+        managers = get_longhorn_client().list_backing_image_manager()
+        all_running = True
+        for manager in managers:
+            if manager.status.state != "running":
+                all_running = False
+                break
+        if all_running:
+            return
+        time.sleep(self.retry_interval)
+    assert False, "Timeout waiting for all backing image managers running"

def wait_backing_image_manager_restart(self, name, last_creation_time):
-    return NotImplemented
+    for i in range(self.retry_count):
+        logging(f"Waiting for backing image manager {name} restart... ({i})")
+        manager = get_longhorn_client().by_id_backing_image_manager(name)
+        if manager.metadata.creationTimestamp > last_creation_time:
+            return
+        time.sleep(self.retry_interval)
+    assert False, f"Timeout waiting for backing image manager {name} restart"

def list_backing_image_manager(self):
-    return NotImplemented
+    return get_longhorn_client().list_backing_image_manager()

The implementation:

  1. Follows existing error handling and logging patterns
  2. Uses the retry mechanism consistently
  3. Maintains similar assertion patterns for timeouts
  4. Utilizes the Longhorn client methods for operations

Let's verify the Longhorn client API methods exist:

e2e/libs/keywords/workload_keywords.py (2)

49-51: LGTM: Namespace parameter addition is well-implemented.

The addition of the namespace parameter with a default value maintains backward compatibility while enabling better resource isolation for tests.


71-72: Consider handling multiple pods scenario.

While the namespace parameter addition is good, returning the first pod from the list might be problematic if there are multiple pods and the order matters.

Consider either:

  1. Documenting that this method should only be used with single-pod workloads, or
  2. Adding a parameter to specify which pod to return
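A minimal sketch of option 2 is shown below. get_workload_pods mirrors the helper discussed in this PR, but the pod_index parameter, the import path, and the error messages are illustrative assumptions rather than existing API.

# Sketch of option 2 (as a workload_keywords method): let callers choose which pod to return.
from workload.workload import get_workload_pods  # assumed import path

def get_workload_pod_name(self, workload_name, namespace="default", pod_index=0):
    pods = get_workload_pods(workload_name, namespace=namespace)
    if not pods:
        raise AssertionError(f"No pods found for workload {workload_name} in namespace {namespace}")
    if pod_index >= len(pods):
        raise IndexError(f"pod_index {pod_index} out of range; workload has {len(pods)} pod(s)")
    return pods[pod_index].metadata.name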

Let's verify if this method is used with multi-pod workloads:

e2e/libs/workload/workload.py (3)

17-18: LGTM: Clean namespace parameter addition

The namespace parameter addition with a default value maintains backward compatibility while enabling cross-namespace pod operations.


Line range hint 24-45: Verify RBAC permissions for cross-namespace operations

The addition of cross-namespace support requires appropriate RBAC permissions. Please ensure:

  1. The test service account has necessary permissions to access pods across namespaces
  2. The test framework validates namespace accessibility before operations
#!/bin/bash
# Description: Check for RBAC configurations and namespace-related permissions

# Search for RBAC definitions
rg "Role|ClusterRole|ServiceAccount" --type yaml -B 5 -A 5

# Look for namespace-related test configurations
rg "namespace.*test" --type py -B 2 -A 2

Line range hint 24-45: LGTM: Namespace support properly implemented

The namespace parameter is correctly integrated with proper error handling and logging.

Let's verify that all callers of this function handle the namespace parameter correctly:

✅ Verification successful

The previous search might have failed due to syntax. Let's try a different approach to find the callers and verify the namespace parameter usage.


Based on the shell script results, I can now provide a final response:

All callers properly handle the namespace parameter

The code analysis shows that all callers of get_workload_pods() properly handle the namespace parameter:

  1. get_workload_pod_names() correctly propagates the namespace parameter
  2. wait_for_workload_pods_running() properly passes the namespace parameter
  3. wait_for_workload_pods_stable() correctly uses the namespace parameter
  4. wait_for_workload_pod_kept_in_state() properly handles the namespace parameter
  5. is_workload_pods_has_annotations() correctly uses the namespace parameter

The function's default namespace value of "default" maintains backward compatibility while allowing explicit namespace specification when needed. All callers either use the default value or explicitly pass the namespace parameter.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any direct calls to get_workload_pods that might need namespace parameter updates

# Search for function calls
rg "get_workload_pods\(" --type py -B 2 -A 2

# Search for potential namespace-related issues in test files
rg "namespace.*=.*default" --type py -B 2 -A 2

Length of output: 6894


Script:

#!/bin/bash
# Let's try a more specific search for get_workload_pods calls
rg "get_workload_pods\(" --type py -l | xargs cat

# Also search for the function definition to understand its full context
rg "def get_workload_pods" --type py -B 5 -A 20

Length of output: 27502

e2e/tests/negative/component_resilience.robot (2)

121-123: Consistent use of replica node variable

Ensure that the variable replica node used in deleting volume replicas is consistently defined and accessible within the test scope.

Confirm that replica node refers to a valid and correctly assigned node in the test environment.

Also applies to: 146-148, 174-176, 202-204


81-91: Ensure node selection is appropriately defined

In the steps where Longhorn component pods are deleted (lines 81-91), the node is specified as node 1. Verify that node 1 accurately references an existing node in your cluster.

Run the following script to list all nodes in the cluster and confirm node 1 exists:

Comment on lines +17 to +19
@abstractmethod
def wait_for_running(self, name):
return NotImplemented

🛠️ Refactor suggestion

Consider adding timeout parameter to wait_for_running.

For robustness in test scenarios, consider adding an optional timeout parameter to control how long the method should wait before giving up.

     @abstractmethod
-    def wait_for_running(self, name):
+    def wait_for_running(self, name: str, timeout: int = 300) -> bool:
+        """Wait for a share manager to reach running state.
+
+        Args:
+            name: Name of the share manager
+            timeout: Maximum time to wait in seconds (default: 300)
+
+        Returns:
+            bool: True if running state is reached, False if timeout occurs
+        """
         return NotImplemented

Comment on lines +14 to +24
def get(self, name):
return NotImplemented

def delete(self, name):
return NotImplemented

def wait_for_running(self, name):
return NotImplemented

def wait_for_restart(self, name, last_creation_time):
return NotImplemented

⚠️ Potential issue

Implementation required: Methods return NotImplemented

All newly added methods currently return NotImplemented. This seems inconsistent with the PR objectives which state that the tests have been executed locally. These methods are crucial for testing Longhorn components recovery and should be properly implemented.

Please implement these methods using the longhorn_client. The implementation should:

  1. Handle API interactions with the share manager
  2. Include appropriate error handling
  3. Follow similar patterns as the existing list method

Example implementation structure for get:

def get(self, name: str) -> dict:
    """Get share manager details by name.
    
    Args:
        name: Name of the share manager
        
    Returns:
        Share manager details
        
    Raises:
        ApiException: If the API request fails
    """
    return self.longhorn_client.get_share_manager(name)

💡 Codebase verification

Error handling strategy needs to be implemented in Rest class

The review comment is valid. The code inspection reveals:

  • The base class defines abstract methods without error handling guidance
  • The Rest class inherits from Base and uses longhorn_client but has unimplemented methods returning NotImplemented
  • The CRD implementation shows error handling patterns with try-except blocks and logging
  • No error handling exists for API calls, timeouts, or resource states in the Rest class

Key areas needing error handling:

  • list() method's longhorn_client.list_share_manager() call
  • Unimplemented methods should handle API errors and resource states
  • Wait operations need timeout handling and state validation
  • Consistent error logging similar to CRD implementation
🔗 Analysis chain

Define error handling strategy

The methods need a clear error handling strategy for various scenarios:

  • API errors from longhorn_client
  • Timeouts during wait operations
  • Resource not found cases
  • Invalid state transitions

Let's verify the error handling in the base class:

Consider implementing a common error handling decorator or base method that can:

  1. Handle common API exceptions
  2. Implement retries for transient failures
  3. Provide consistent error messages
  4. Log relevant details for debugging
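One possible shape for such a decorator is sketched below, assuming the wrappers ultimately call the Kubernetes Python client (whose ApiException carries status and reason); the decorator name, retry defaults, and 404-handling policy are illustrative assumptions, not existing project code.

# Illustrative sketch: centralize API error handling for the share manager wrappers.
import functools
import time

from kubernetes.client.exceptions import ApiException

def handle_api_errors(retries=3, interval=5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except ApiException as e:
                    if e.status == 404:
                        # Resource not found yet (e.g. being recreated): treat as transient and retry.
                        print(f"{func.__name__}: resource not found, retry {attempt + 1}/{retries}")
                        time.sleep(interval)
                        continue
                    # Other API errors are surfaced with a consistent, debuggable message.
                    raise RuntimeError(f"{func.__name__} failed: {e.status} {e.reason}") from e
            raise TimeoutError(f"{func.__name__} still failing after {retries} retries")
        return wrapper
    return decorator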
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check error handling patterns in the base class
rg -A 5 "raise|except" "e2e/libs/sharemanager/base.py"

Length of output: 54


Script:

#!/bin/bash
# Let's first check if the file exists and see its content
fd "base.py" "e2e/libs/sharemanager" --exec cat {}

# Also check the rest.py implementation for context
fd "rest.py" "e2e/libs/sharemanager" --exec cat {}

# Look for any error handling patterns in related files
rg -A 5 "raise|except|error" "e2e/libs/sharemanager/"

Length of output: 1736

Comment on lines +30 to +31
def wait_for_restart(self, name, last_creation_time):
return self.sharemanager.wait_for_restart(name, last_creation_time)

💡 Codebase verification

Based on the results, I can now generate the final response since I have sufficient information about the restart detection implementations across different classes.

Critical: REST strategy has unimplemented restart detection

The verification revealed important implementation gaps:

  • The REST strategy (rest.py) returns NotImplemented for wait_for_restart
  • Only the CRD strategy (crd.py) has a concrete implementation with retry logic and error handling
  • The base class (base.py) defines the interface but returns NotImplemented

This could lead to runtime failures when using the REST strategy for recovery testing. The CRD implementation looks solid with proper retry and error handling, but the REST strategy needs to be implemented to match this behavior.
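A rough sketch of what the missing REST-side wait_for_restart could look like, following the retry pattern the CRD strategy already uses. The by_id_share_manager accessor and the creationTimestamp field are assumptions about the Longhorn client objects, and retry_count/retry_interval are assumed to come from get_retry_count_and_interval() as elsewhere in the framework.

# Rough sketch for e2e/libs/sharemanager/rest.py; client accessor names are assumptions.
import time

class Rest:

    def __init__(self, longhorn_client, retry_count=150, retry_interval=2):
        self.longhorn_client = longhorn_client
        self.retry_count = retry_count
        self.retry_interval = retry_interval

    def wait_for_restart(self, name, last_creation_time):
        for i in range(self.retry_count):
            try:
                sm = self.longhorn_client.by_id_share_manager(name)  # hypothetical accessor
                # ISO-8601 timestamps compare correctly as strings; a shared helper could also be used.
                if sm and sm.metadata.creationTimestamp > last_creation_time:
                    return
            except Exception as e:
                print(f"Waiting for share manager {name} to be recreated ({i}): {e}")
            time.sleep(self.retry_interval)
        raise AssertionError(f"Share manager {name} was not recreated in time")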

🔗 Analysis chain

Verify restart detection logic in strategy implementations.

The wait_for_restart method seems critical for recovery testing. Let's verify the underlying implementation handles edge cases properly.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the restart detection implementation in strategy classes

# Look for wait_for_restart implementations in CRD and Rest classes
echo "Checking CRD implementation:"
ast-grep --pattern 'def wait_for_restart($_, $_) { $$$ }' 

echo -e "\nChecking error handling patterns:"
rg -A 5 "wait_for_restart"

Length of output: 1552

Comment on lines +24 to +34
def delete_backing_image_manager(self, name):
self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
return self.backing_image.list_backing_image_manager()

🛠️ Refactor suggestion

Add docstrings and error handling to the new methods.

These test automation keywords would benefit from:

  1. Docstrings explaining the purpose, parameters, and expected behavior
  2. Timeout parameters for the wait methods to prevent hanging tests
  3. Basic input validation and error handling

Here's a suggested improvement for one of the methods as an example:

     def wait_backing_image_manager_restart(self, name, last_creation_time):
+        """Wait for a backing image manager to restart after deletion.
+        
+        Args:
+            name (str): Name of the backing image manager
+            last_creation_time (str): Previous creation timestamp
+            
+        Raises:
+            TimeoutError: If manager doesn't restart within timeout
+            ValueError: If name or timestamp is invalid
+        """
+        if not name or not last_creation_time:
+            raise ValueError("Name and last_creation_time must be provided")
         self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

Committable suggestion was skipped due to low confidence.

Comment on lines +34 to +48
def delete_backing_image_manager(self, name):
self.backing_image = CRD()
return self.backing_image.delete_backing_image_manager(name)

def wait_all_backing_image_managers_running(self):
self.backing_image = CRD()
return self.backing_image.wait_all_backing_image_managers_running()

def wait_backing_image_manager_restart(self, name, last_creation_time):
self.backing_image = CRD()
self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

def list_backing_image_manager(self):
self.backing_image = CRD()
return self.backing_image.list_backing_image_manager()

💡 Codebase verification

⚠️ Potential issue

The shell script results provide clear evidence about the strategy pattern implementation across the codebase. Let me generate the final response:

Strategy pattern implementation needs revision to match codebase standards

The current implementation in backing_image.py deviates from the established pattern used consistently across other modules. All other modules:

  1. Initialize the strategy object only once in __init__
  2. Handle both REST and CRD cases properly with if-else blocks
  3. Use the initialized strategy object throughout their methods

Here's the correct implementation matching the codebase standards:

def __init__(self):
    if self._strategy == LonghornOperationStrategy.CRD:
        self.backing_image = CRD()
    else:
        self.backing_image = Rest()

Remove the redundant CRD instantiation from:

  • delete_backing_image_manager
  • wait_all_backing_image_managers_running
  • wait_backing_image_manager_restart
  • list_backing_image_manager
🔗 Analysis chain

Strategy pattern implementation needs revision

The current implementation has several architectural concerns:

  1. The new methods bypass the strategy pattern by directly instantiating CRD, which violates the pattern's purpose.
  2. Creating a new CRD instance per method call is inefficient and could lead to resource management issues.
  3. Overwriting self.backing_image could affect other methods expecting the REST implementation.

Consider refactoring to:

  1. Respect the strategy pattern:
 class BackingImage(Base):
     _strategy = LonghornOperationStrategy.REST
 
     def __init__(self):
         if self._strategy == LonghornOperationStrategy.REST:
             self.backing_image = Rest()
+        else:
+            self.backing_image = CRD()
 
     def delete_backing_image_manager(self, name):
-        self.backing_image = CRD()
         return self.backing_image.delete_backing_image_manager(name)
  2. If both REST and CRD implementations are needed, consider using separate attributes:
def __init__(self):
    self.rest_client = Rest()
    self.crd_client = CRD()

The wait_backing_image_manager_restart method doesn't return the result from the delegate call:

 def wait_backing_image_manager_restart(self, name, last_creation_time):
     self.backing_image = CRD()
-    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+    return self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

Consider renaming list_backing_image_manager to list_backing_image_managers for consistency with other method names.

Let's verify the strategy pattern usage across the codebase:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if other classes follow the same pattern
rg -l "LonghornOperationStrategy" | xargs rg -A 5 "_strategy\s*="

Length of output: 5660

Comment on lines +75 to +84
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except Exception as e:
logging(f"Finding backing image manager {name} failed with error {e}")
continue

🛠️ Refactor suggestion

Handle exceptions more specifically and improve logging

Catching all exceptions with a bare except can obscure unexpected errors. It's better to catch specific exceptions. Additionally, consider logging the stack trace for better debugging.

Apply this diff to catch specific exceptions and log the stack trace:

-            except Exception as e:
-                logging(f"Finding backing image manager {name} failed with error {e}")
+            except client.exceptions.ApiException as e:
+                logging(f"Failed to find backing image manager {name}: {e}")
+            except Exception as e:
+                logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except Exception as e:
logging(f"Finding backing image manager {name} failed with error {e}")
continue
backing_image_manager = self.obj_api.get_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)
except client.exceptions.ApiException as e:
logging(f"Failed to find backing image manager {name}: {e}")
except Exception as e:
logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)
continue

Resource ../keywords/backing_image.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
Resource ../keywords/k8s.resource
@coderabbitai coderabbitai bot Oct 30, 2024

⚠️ Potential issue

Remove duplicate resource import

The resource file ../keywords/k8s.resource is imported twice on lines 11 and 17. This duplication is unnecessary and could lead to confusion or maintenance issues.

Apply this diff to remove the duplicate import:

- Resource    ../keywords/k8s.resource

Also applies to: 17-17

Member

@chriscchien This review is valid.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

And Wait for volume 0 healthy
And Check volume 0 data is intact
And Wait for deployment 0 pods stable
And And Check deployment 0 data in file data.txt is intact
@coderabbitai coderabbitai bot Oct 30, 2024

⚠️ Potential issue

Correct duplicate 'And' keywords in test steps

There are typographical errors on lines 96 and 100 where the keyword And is duplicated.

Apply this diff to fix the typos:

-    And And Check deployment 0 data in file data.txt is intact
+    And Check deployment 0 data in file data.txt is intact
-    And And Check deployment 1 data in file data.txt is intact
+    And Check deployment 1 data in file data.txt is intact

Also applies to: 100-100

Member

@chriscchien This review is valid.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +68 to +79
IF '${DATA_ENGINE}' == 'v1'
When Create backing image bi with url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2
And Create volume 1 with backingImage=bi dataEngine=${DATA_ENGINE}
And Attach volume 1
And Wait for volume 1 healthy
And Write data to volume 1

When Create storageclass longhorn-test-1 with dataEngine=${DATA_ENGINE}
And Create persistentvolumeclaim 1 using RWX volume with longhorn-test-1 storageclass
And Create deployment 1 with persistentvolumeclaim 1
And Write 100 MB data to file data.txt in deployment 1
END


🛠️ Refactor suggestion

Simplify conditional checks for ${DATA_ENGINE}

The variable ${DATA_ENGINE} is set to v1 at line 27. The conditional IF '${DATA_ENGINE}' == 'v1' will always evaluate to true throughout the test cases.

Consider removing the conditional checks or making ${DATA_ENGINE} configurable if different engine versions are intended to be tested. This will simplify the code and improve readability.

-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END
-    IF    '${DATA_ENGINE}' == 'v1'
...
-    END

Also applies to: 97-102, 138-153, 167-183

And Attach volume 0
And Wait for volume 0 healthy
And Write data to volume 0
Then Delete instance-manager of volume 0 and wait for recover

⚠️ Potential issue

Handle potential race conditions during instance manager deletion

Deleting the instance manager while a volume is rebuilding may lead to inconsistent states or test flakiness.

Consider adding a wait or verification step to ensure the rebuild process has properly initiated before deleting the instance manager. This can help in making the test more reliable.

+           And Wait for replica rebuilding progress

Also applies to: 123-123, 145-145, 148-148, 172-172, 176-176, 200-200, 204-204

Resource ../keywords/backing_image.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
Resource ../keywords/k8s.resource
Member

@chriscchien This review is valid.

And Wait for volume 0 healthy
And Check volume 0 data is intact
And Wait for deployment 0 pods stable
And And Check deployment 0 data in file data.txt is intact
Member

@chriscchien This review is valid.


mergify bot commented Nov 10, 2024

This pull request is now in conflict. Could you fix it @chriscchien? 🙏

@chriscchien
Contributor Author

Updated.

cc @yangchiu , @PhanLe1010 , @shuo-wu

@coderabbitai coderabbitai bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (12)
e2e/libs/sharemanager/base.py (2)

9-11: Add type hints and docstring to get method.

For better maintainability and implementation guidance, consider adding type hints and documentation.

     @abstractmethod
-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Retrieve share manager details.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            dict: Share manager details
+        """
         return NotImplemented

13-15: Enhance delete method signature with type hints and force option.

Consider adding return type and force delete option for test scenarios where cleanup is critical.

     @abstractmethod
-    def delete(self, name):
+    def delete(self, name: str, force: bool = False) -> bool:
+        """Delete a share manager.
+
+        Args:
+            name: Name of the share manager
+            force: Whether to force delete even if resources are in use
+
+        Returns:
+            bool: True if deletion was successful
+        """
         return NotImplemented
e2e/libs/sharemanager/crd.py (1)

Line range hint 10-68: Consider architectural improvements for better maintainability

To improve the maintainability and reusability of this class, consider these architectural improvements:

  1. Move common CRD configurations to class attributes:
class CRD(Base):
    GROUP = "longhorn.io"
    VERSION = "v1beta2"
    NAMESPACE = "longhorn-system"
    PLURAL = "sharemanagers"
  2. Create a decorator for common error handling:
def handle_k8s_errors(func):
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except client.rest.ApiException as e:
            logging(f"Operation failed: {e}")
            raise
    return wrapper

This would make the code more DRY and easier to maintain.
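
As a usage illustration only (assuming the decorator and the class attributes above live alongside the existing CRD class), a method body would then shrink to the API call itself:

class CRD(Base):
    @handle_k8s_errors
    def get(self, name):
        # Errors from the API server are logged and re-raised by the decorator.
        return self.obj_api.get_namespaced_custom_object(
            group=self.GROUP,
            version=self.VERSION,
            namespace=self.NAMESPACE,
            plural=self.PLURAL,
            name=name,
        )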

Would you like me to help implement these improvements?

🧰 Tools
🪛 Ruff

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/keywords/longhorn.resource (1)

70-84: Consider implementing a more comprehensive recovery testing pattern.

The current implementation provides basic building blocks for recovery testing. Consider enhancing it with:

  1. A structured recovery testing pattern that includes:

    • Pre-failure state validation
    • Controlled failure injection
    • Recovery monitoring
    • Post-recovery state validation
  2. Metrics collection during recovery:

    • Recovery time measurements
    • Component state transitions
    • Resource usage during recovery
  3. Recovery scenarios matrix:

    • Single component failures
    • Multiple component failures
    • Cascading failures
    • Network partition scenarios

This would provide more thorough validation of Longhorn's recovery capabilities.
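
As a sketch of the recovery-time measurement idea, a small helper (hypothetical, not part of the current keyword libraries) could wrap any existing blocking wait keyword:

import time

def measure_recovery_time(wait_fn, *args, **kwargs):
    # wait_fn is an existing wait, e.g. wait_all_backing_image_managers_running;
    # it is expected to raise if the component never recovers.
    start = time.time()
    wait_fn(*args, **kwargs)
    elapsed = time.time() - start
    print(f"component recovered after {elapsed:.1f}s")
    return elapsed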

e2e/libs/keywords/workload_keywords.py (2)

64-70: Consider adding error handling for pod retrieval.

While the implementation is correct, consider adding error handling for cases where no pods are found matching the criteria. This would improve the robustness of the test automation.

 def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
     pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector)
+    if not pods:
+        logging(f'No pods found for workload {workload_name} on node {node_name}')
+        return
     for pod in pods:
         if pod.spec.node_name == node_name:
             logging(f'Deleting pod {pod.metadata.name} on node {node_name}')
             delete_pod(pod.metadata.name, namespace=namespace)

Line range hint 1-190: Consider architectural improvements for better async/sync separation.

The class mixes synchronous and asynchronous methods, along with different concurrency approaches (multiprocessing and asyncio). Consider:

  1. Separating sync and async methods into different classes or modules
  2. Standardizing on either multiprocessing or asyncio for concurrent operations
  3. Grouping related methods (e.g., pod operations, volume operations) together

This would improve maintainability and make the testing framework's behavior more predictable.
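
One possible shape for that separation, sketched with hypothetical class names rather than the repository's actual layout:

import asyncio

class WorkloadKeywords:
    """Synchronous, one-shot operations such as pod deletion."""

    def delete_pod(self, name, namespace="default"):
        ...  # direct API call; returns once the delete request is accepted


class AsyncWorkloadKeywords:
    """Long-running waits grouped together and driven by asyncio."""

    async def wait_for_pods_stable(self, workload_name, namespace="default"):
        ...  # poll pod status without blocking other concurrent waits

    async def wait_for_all_stable(self, workload_names, namespace="default"):
        # Run the per-workload waits concurrently instead of serially.
        await asyncio.gather(
            *(self.wait_for_pods_stable(n, namespace) for n in workload_names)
        )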

e2e/keywords/workload.resource (3)

190-190: Remove unnecessary empty lines.

These empty lines don't serve any purpose and should be removed for better code organization.

Also applies to: 192-192


193-201: Add documentation for label selector logic.

The conditional logic for label selectors is well-implemented, but would benefit from documentation explaining:

  • Why different components need different label selectors
  • What each label selector accomplishes
  • When to use empty label selector

Add documentation above the IF block:

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}
    ${node_name} =    get_node_by_index    ${node_id}
+    # Select appropriate label for Longhorn components:
+    # - engine-image: Targets engine image pods
+    # - instance-manager: Targets instance manager pods
+    # - others: No specific label selection
    IF    '${workload_name}' == 'engine-image'

202-205: Add documentation and improve error handling.

The keyword implementation is clean but would benefit from:

  1. Documentation explaining its purpose and when to use it vs. the node-specific version
  2. Error handling for cases where the pod doesn't exist

Consider this enhancement:

+[Documentation]    Deletes a Longhorn workload pod from the longhorn-system namespace.
+...               Use this keyword when the node location is not relevant.
 Delete Longhorn ${workload_kind} ${workload_name} pod
     ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
+    Should Not Be Empty    ${pod_name}    msg=Pod ${workload_name} not found in longhorn-system namespace
     Log    ${pod_name}
     delete_pod    ${pod_name}     longhorn-system
e2e/tests/negative/component_resilience.robot (3)

118-118: Use 'When' instead of 'Then' for actions in test steps

In the test steps at lines 118, 123, 143, 175, 199, and 203, the keyword Then is used to perform actions (Delete instance-manager...), whereas When is more appropriate for action steps. Then is generally used for assertions or verifying outcomes.

Apply this diff to replace Then with When:

-Then Delete instance-manager of volume 0 and wait for recover
+When Delete instance-manager of volume 0 and wait for recover

Also applies to: 123-123, 143-143, 175-175, 199-199, 203-203


68-68: Avoid hardcoding URLs; consider parameterizing the backing image URL

The backing image URL is hardcoded at lines 68 and 138. Hardcoding URLs can lead to maintenance issues if the URL changes. Consider defining the URL as a variable to improve flexibility and maintainability.

Apply this diff to define the URL as a variable:

 *** Variables ***
 ${LOOP_COUNT}    1
 ${RETRY_COUNT}    300
 ${RETRY_INTERVAL}    1
 ${RWX_VOLUME_FAST_FAILOVER}    false
 ${DATA_ENGINE}    v1
+${BACKING_IMAGE_URL}    https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2

 *** Test Cases ***
 Test Longhorn components recovery
     ...
-    When Create backing image bi with    url=https://longhorn-backing-image.s3-us-west-1.amazonaws.com/parrot.qcow2
+    When Create backing image bi with    url=${BACKING_IMAGE_URL}

Also applies to: 138-138


43-101: Extract common test steps into reusable keywords to improve maintainability

The test cases from lines 43 to 203 contain several similar steps that are repeated across different tests. Extracting these common steps into reusable keywords will reduce code duplication, enhance readability, and make the tests easier to maintain.

Consider defining new keywords for recurring actions, such as:

  • Setup Test Environment: Creating volumes, storage classes, and deployments.
  • Write Data: Writing data to volumes or deployments.
  • Delete Instance Manager and Wait for Recovery: Deleting instance managers and verifying recovery.
  • Verify Data Integrity: Checking that data is intact after recovery.

Applying this structure will make your test cases cleaner and more modular.

Also applies to: 102-123, 124-152, 153-182, 183-203

🛑 Comments failed to post (7)
e2e/libs/sharemanager/base.py (1)

21-23: 🛠️ Refactor suggestion

Enhance wait_for_restart method signature.

For consistency with wait_for_running and better type safety:

  1. Add timeout and polling interval parameters
  2. Properly type the timestamp parameter
  3. Add comprehensive documentation
     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(
+        self,
+        name: str,
+        last_creation_time: datetime,
+        timeout: int = 300,
+        poll_interval: int = 5
+    ) -> bool:
+        """Wait for a share manager to restart after a specific time.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Timestamp to compare against for restart detection
+            timeout: Maximum time to wait in seconds (default: 300)
+            poll_interval: Time between checks in seconds (default: 5)
+
+        Returns:
+            bool: True if restart is detected, False if timeout occurs
+        """
         return NotImplemented

Don't forget to add the datetime import at the top of the file:

+from datetime import datetime
 from abc import ABC, abstractmethod
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    @abstractmethod
    def wait_for_restart(
        self,
        name: str,
        last_creation_time: datetime,
        timeout: int = 300,
        poll_interval: int = 5
    ) -> bool:
        """Wait for a share manager to restart after a specific time.

        Args:
            name: Name of the share manager
            last_creation_time: Timestamp to compare against for restart detection
            timeout: Maximum time to wait in seconds (default: 300)
            poll_interval: Time between checks in seconds (default: 5)

        Returns:
            bool: True if restart is detected, False if timeout occurs
        """
        return NotImplemented
e2e/libs/keywords/sharemanager_keywords.py (2)

51-52: 🛠️ Refactor suggestion

Add input validation and error handling.

The method should include basic validation and error handling to ensure robustness:

 def delete_sharemanager(self, name):
+    """Delete a share manager.
+    
+    Args:
+        name (str): Name of the share manager to delete
+        
+    Raises:
+        ValueError: If name is empty
+        RuntimeError: If deletion fails
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+    try:
         return self.sharemanager.delete(name)
+    except Exception as e:
+        raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def delete_sharemanager(self, name):
        """Delete a share manager.
        
        Args:
            name (str): Name of the share manager to delete
            
        Raises:
            ValueError: If name is empty
            RuntimeError: If deletion fails
        """
        if not name:
            raise ValueError("Share manager name cannot be empty")
        try:
            return self.sharemanager.delete(name)
        except Exception as e:
            raise RuntimeError(f"Failed to delete share manager {name}: {str(e)}")

60-61: 🛠️ Refactor suggestion

Add timeout and error handling for robustness.

The method should include timeout, error handling, and proper documentation:

-def wait_for_share_manager_running(self, name):
-    return self.sharemanager.wait_for_running(name)
+def wait_for_share_manager_running(self, name, timeout=300):
+    """Wait for a share manager to reach running state.
+    
+    Args:
+        name (str): Name of the share manager
+        timeout (int): Maximum time to wait in seconds
+        
+    Raises:
+        ValueError: If name is empty
+        TimeoutError: If share manager doesn't reach running state within timeout
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+    try:
+        return self.sharemanager.wait_for_running(name, timeout=timeout)
+    except Exception as e:
+        raise TimeoutError(f"Share manager {name} failed to reach running state: {str(e)}")

Committable suggestion skipped: line range outside the PR's diff.

e2e/libs/backing_image/crd.py (3)

37-54: 🛠️ Refactor suggestion

Add configuration options and error handling for manager operations.

Consider the following improvements:

  1. Make the namespace configurable instead of hardcoding "longhorn-system"
  2. Add error handling for API operations
  3. Add logging for successful deletion
 def list_backing_image_manager(self):
     label_selector = 'longhorn.io/component=backing-image-manager'
+    try:
         return self.obj_api.list_namespaced_custom_object(
             group="longhorn.io",
             version="v1beta2",
-            namespace="longhorn-system",
+            namespace=self.namespace,  # Add to __init__
             plural="backingimagemanagers",
             label_selector=label_selector)
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to list backing image managers: {e}")
+        raise

 def delete_backing_image_manager(self, name):
     logging(f"deleting backing image manager {name} ...")
+    try:
         self.obj_api.delete_namespaced_custom_object(
             group="longhorn.io",
             version="v1beta2",
-            namespace="longhorn-system",
+            namespace=self.namespace,  # Add to __init__
             plural="backingimagemanagers",
             name=name
         )
+        logging(f"Successfully deleted backing image manager {name}")
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to delete backing image manager {name}: {e}")
+        raise

Committable suggestion skipped: line range outside the PR's diff.


56-69: 🛠️ Refactor suggestion

Improve error handling and assertions in wait_all_backing_image_managers_running.

The method needs several improvements:

  1. Safe dictionary access (as noted in previous review)
  2. Better error handling
  3. Proper assertion usage
  4. Unused loop variable fix
-    def wait_all_backing_image_managers_running(self):
-        for i in range(self.retry_count):
+    def wait_all_backing_image_managers_running(self):
+        for _ in range(self.retry_count):
             all_running = True
-            backing_image_managers = self.list_backing_image_manager()            
+            try:
+                backing_image_managers = self.list_backing_image_manager()
+            except Exception as e:
+                logging(f"Failed to list backing image managers: {e}")
+                time.sleep(self.retry_interval)
+                continue
+
             for backing_image_manager in backing_image_managers["items"]:
-                current_state = backing_image_manager["status"]["currentState"]
                 name = backing_image_manager["metadata"]["name"]
+                current_state = backing_image_manager.get("status", {}).get("currentState", "unknown")
                 logging(f"backing image manager {name} currently in {current_state} state")
                 if current_state != "running":
                     all_running = False
             if all_running is True:
                 return
             time.sleep(self.retry_interval)
-        assert False, f"Waiting all backing image manager in running state timeout"
+        raise AssertionError("Timeout waiting for all backing image managers to reach running state")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def wait_all_backing_image_managers_running(self):
        for _ in range(self.retry_count):
            all_running = True
            try:
                backing_image_managers = self.list_backing_image_manager()
            except Exception as e:
                logging(f"Failed to list backing image managers: {e}")
                time.sleep(self.retry_interval)
                continue

            for backing_image_manager in backing_image_managers["items"]:
                name = backing_image_manager["metadata"]["name"]
                current_state = backing_image_manager.get("status", {}).get("currentState", "unknown")
                logging(f"backing image manager {name} currently in {current_state} state")
                if current_state != "running":
                    all_running = False
            if all_running is True:
                return
            time.sleep(self.retry_interval)
        raise AssertionError("Timeout waiting for all backing image managers to reach running state")
🧰 Tools
🪛 Ruff

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


71-91: 🛠️ Refactor suggestion

Improve error handling and datetime comparison in wait_backing_image_manager_restart.

The method needs several improvements:

  1. Better exception handling
  2. Proper assertion usage
  3. More robust datetime comparison
  4. Unused loop variable fix
-    def wait_backing_image_manager_restart(self, name, last_creation_time):
-        for i in range(self.retry_count):
+    def wait_backing_image_manager_restart(self, name, last_creation_time):
+        fmt = "%Y-%m-%dT%H:%M:%SZ"
+        try:
+            last_creation_dt = datetime.strptime(last_creation_time, fmt)
+        except ValueError as e:
+            raise ValueError(f"Invalid last_creation_time format: {e}")
+
+        for _ in range(self.retry_count):
             time.sleep(self.retry_interval)            
             try:
                 backing_image_manager = self.obj_api.get_namespaced_custom_object(
                     group="longhorn.io",
                     version="v1beta2",
                     namespace="longhorn-system",
                     plural="backingimagemanagers",
                     name=name
                     )
-            except Exception as e:
+            except client.exceptions.ApiException as e:
                 logging(f"Finding backing image manager {name} failed with error {e}")
                 continue
             
             creation_time = backing_image_manager["metadata"]["creationTimestamp"]
-            fmt = "%Y-%m-%dT%H:%M:%SZ"
-            if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
-                return
+            try:
+                current_creation_dt = datetime.strptime(creation_time, fmt)
+                if current_creation_dt > last_creation_dt:
+                    return
+            except ValueError as e:
+                logging(f"Invalid creation_time format from API: {e}")
+                continue

-        assert False, f"Wait backing image manager {name} restart failed ..."
+        raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def wait_backing_image_manager_restart(self, name, last_creation_time):
        fmt = "%Y-%m-%dT%H:%M:%SZ"
        try:
            last_creation_dt = datetime.strptime(last_creation_time, fmt)
        except ValueError as e:
            raise ValueError(f"Invalid last_creation_time format: {e}")

        for _ in range(self.retry_count):
            time.sleep(self.retry_interval)            
            try:
                backing_image_manager = self.obj_api.get_namespaced_custom_object(
                    group="longhorn.io",
                    version="v1beta2",
                    namespace="longhorn-system",
                    plural="backingimagemanagers",
                    name=name
                    )
            except client.exceptions.ApiException as e:
                logging(f"Finding backing image manager {name} failed with error {e}")
                continue
            
            creation_time = backing_image_manager["metadata"]["creationTimestamp"]
            try:
                current_creation_dt = datetime.strptime(creation_time, fmt)
                if current_creation_dt > last_creation_dt:
                    return
            except ValueError as e:
                logging(f"Invalid creation_time format from API: {e}")
                continue

        raise AssertionError(f"Timeout waiting for backing image manager {name} to restart")
🧰 Tools
🪛 Ruff

72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/keywords/workload_keywords.py (1)

71-72: ⚠️ Potential issue

Handle potential IndexError in get_workload_pod_name.

The method assumes at least one pod exists and may raise an IndexError if no pods are found. Consider adding error handling.

 def get_workload_pod_name(self, workload_name, namespace="default"):
-    return get_workload_pod_names(workload_name, namespace)[0]
+    pod_names = get_workload_pod_names(workload_name, namespace)
+    if not pod_names:
+        raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}")
+    return pod_names[0]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    def get_workload_pod_name(self, workload_name, namespace="default"):
        pod_names = get_workload_pod_names(workload_name, namespace)
        if not pod_names:
            raise ValueError(f"No pods found for workload {workload_name} in namespace {namespace}")
        return pod_names[0]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (15)
e2e/libs/sharemanager/base.py (2)

9-11: Add type hints and documentation for the get method.

Consider adding type hints and docstring to better define the interface contract.

     @abstractmethod
-    def get(self, name):
+    def get(self, name: str) -> dict:
+        """Retrieve share manager details.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            dict: Share manager details
+        """
         return NotImplemented

13-15: Add type hints and documentation for the delete method.

Consider adding type hints and docstring to better define the interface contract.

     @abstractmethod
-    def delete(self, name):
+    def delete(self, name: str) -> bool:
+        """Delete a share manager.
+
+        Args:
+            name: Name of the share manager
+
+        Returns:
+            bool: True if deletion was successful, False otherwise
+        """
         return NotImplemented
e2e/libs/sharemanager/crd.py (1)

43-52: Consider enhancing logging and reusability.

While the basic functionality is correct, consider these improvements:

  1. Add more detailed logging (e.g., retry attempt number, time elapsed)
  2. Extract the retry logic into a common utility function that can be reused by other wait methods

Example enhancement:

def wait_for_condition(self, name, condition_fn, error_msg):
    start_time = time.time()
    for attempt in range(self.retry_count):
        logging(f"Attempt {attempt + 1}/{self.retry_count} ({time.time() - start_time:.1f}s elapsed)")
        if condition_fn():
            return
        time.sleep(self.retry_interval)
    raise AssertionError(error_msg)
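
With a helper like that in place, wait_for_running could reduce to a single call (illustrative only; it assumes the share manager CR exposes a status.state field that reaches "running"):

def wait_for_running(self, name):
    # Re-evaluate the CR state on every retry attempt via the lambda.
    self.wait_for_condition(
        name,
        lambda: self.get(name)["status"]["state"] == "running",
        f"share manager {name} did not reach the running state",
    )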
🧰 Tools
🪛 Ruff

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/backing_image/crd.py (3)

14-36: Add docstrings to NotImplemented methods.

These methods lack documentation describing their intended functionality, parameters, and return values. This documentation would be valuable for future implementation and maintenance.

Example docstring format:

def create(self, bi_name, source_type, url, expected_checksum):
    """Create a new backing image resource.

    Args:
        bi_name (str): Name of the backing image
        source_type (str): Type of the source
        url (str): URL of the source
        expected_checksum (str): Expected checksum for validation

    Returns:
        NotImplemented: Currently not implemented
    """
    return NotImplemented

37-45: Make namespace and API version configurable.

The namespace "longhorn-system" and API version "v1beta2" are hardcoded. Consider making these configurable through class initialization or environment variables for better flexibility.

 def __init__(self):
     self.obj_api = client.CustomObjectsApi()
     self.retry_count, self.retry_interval = get_retry_count_and_interval()
+    self.namespace = os.getenv("LONGHORN_NAMESPACE", "longhorn-system")
+    self.api_version = os.getenv("LONGHORN_API_VERSION", "v1beta2")

86-89: Add buffer time to creation timestamp comparison.

The direct timestamp comparison might be unreliable if the timestamps are very close. Consider adding a small buffer time to ensure the restart is detected correctly.

-            if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
+            new_time = datetime.strptime(creation_time, fmt)
+            old_time = datetime.strptime(last_creation_time, fmt)
+            # Add a small buffer (e.g., 1 second) to avoid race conditions
+            if (new_time - old_time).total_seconds() > 1:
e2e/keywords/workload.resource (2)

200-211: Improve maintainability and documentation of the pod deletion keyword.

The implementation looks good but could benefit from some improvements:

  1. Consider using a dictionary for label selectors to make it more maintainable:
COMPONENT_LABELS = {
    'engine-image': 'longhorn.io/component=engine-image',
    'instance-manager': 'longhorn.io/component=instance-manager'
}
  2. Add documentation to explain the purpose and parameters:
[Documentation]    Deletes a Longhorn pod of specified workload type on a given node
...    ${workload_kind} - Type of the workload
...    ${workload_name} - Name of the workload (engine-image/instance-manager)
...    ${node_id} - Target node ID

212-215: Add error handling and documentation to the pod deletion keyword.

The implementation is good but could be enhanced with:

  1. Documentation to explain the purpose and parameters:
[Documentation]    Deletes a Longhorn pod of specified workload type
...    ${workload_kind} - Type of the workload
...    ${workload_name} - Name of the workload
  2. Error handling and deletion verification:
Delete Longhorn ${workload_kind} ${workload_name} pod
    ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
    Should Not Be Empty    ${pod_name}    msg=Pod name not found for workload ${workload_name}
    Log    Deleting pod: ${pod_name}
    delete_pod    ${pod_name}    longhorn-system
    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system
e2e/libs/workload/workload.py (5)

Line range hint 17-40: LGTM! Consider enhancing error message clarity.

The namespace parameter addition and pod filtering logic look good. The error handling is improved with namespace context.

Consider making the log message more explicit when no pods are found:

-        logging(f"No pods found for workload {workload_name} in namespace {namespace}")
+        logging(f"No running pods found for workload '{workload_name}' in namespace '{namespace}'. Either the workload doesn't exist or all pods were terminated by kubelet.")

Line range hint 54-67: Enhance error handling for PVC retrieval.

While the namespace support is correctly implemented, the error handling could be more robust.

Consider these improvements:

 def get_workload_persistent_volume_claim_names(workload_name, namespace="default"):
     claim_names = []
     api = client.CoreV1Api()
     label_selector = f"app={workload_name}"
-    claim = api.list_namespaced_persistent_volume_claim(
-        namespace=namespace,
-        label_selector=label_selector
-    )
+    try:
+        claim = api.list_namespaced_persistent_volume_claim(
+            namespace=namespace,
+            label_selector=label_selector
+        )
+    except client.exceptions.ApiException as e:
+        if e.status == 404:
+            raise ValueError(f"Namespace '{namespace}' not found")
+        raise RuntimeError(f"Failed to list PVCs for workload '{workload_name}' in namespace '{namespace}': {e}")

     for item in claim.items:
         claim_names.append(item.metadata.name)
     claim_names.sort()

-    assert len(claim_names) > 0, f"Failed to get PVC names for workload {workload_name}"
+    assert len(claim_names) > 0, f"No PVCs found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'"
     return claim_names

Line range hint 171-187: Improve logging for better debugging.

The namespace support and retry logic are well implemented, but the logging could be more informative.

Consider these improvements:

     for i in range(retry_count):
         pods = get_workload_pods(workload_name, namespace=namespace)
+        total_pods = len(pods)
+        running_pods = []
         if len(pods) > 0:
-            running_pods = []
             for pod in pods:
                 if pod.status.phase == "Running":
                     running_pods.append(pod.metadata.name)
-            if len(running_pods) == len(pods):
+            if len(running_pods) == total_pods:
                 return

-        logging(f"Waiting for {workload_name} pods {running_pods} running, retry ({i}) ...")
+        logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running ({len(running_pods)}/{total_pods} ready), retry {i+1}/{retry_count}")
         time.sleep(retry_interval)

-    assert False, f"Timeout waiting for {workload_name} pods running"
+    assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to be running after {retry_count} retries"

Line range hint 232-270: Improve code structure and readability.

While the namespace support and state tracking logic are correct, the code structure could be improved.

Consider these improvements:

+    def is_pod_in_expected_state(pod, expect_state):
+        if expect_state == "ContainerCreating":
+            return pod.status.phase == "Pending"
+        if expect_state == "Terminating":
+            return hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running"
+        if expect_state == "Running":
+            return pod.status.phase == "Running"
+        return False

     def count_pod_in_specifc_state_duration(count_pod_in_state_duration, pods, expect_state):
         for pod in pods:
             pod_name = pod.metadata.name
             if pod_name not in count_pod_in_state_duration:
                 count_pod_in_state_duration[pod_name] = 0
-            elif (expect_state == "ContainerCreating" and pod.status.phase == "Pending") or \
-                ((expect_state == "Terminating" and hasattr(pod.metadata, "deletion_timestamp") and pod.status.phase == "Running")) or \
-                (expect_state == "Running" and pod.status.phase == "Running"):
+            elif is_pod_in_expected_state(pod, expect_state):
                 count_pod_in_state_duration[pod_name] += 1
             else:
                 count_pod_in_state_duration[pod_name] = 0

Line range hint 277-282: Rename function and improve error handling.

While the namespace and label selector support are correctly implemented, there are some improvements needed.

Consider these improvements:

-def is_workload_pods_has_annotations(workload_name, annotation_key, namespace="default", label_selector=""):
+def do_workload_pods_have_annotation(workload_name, annotation_key, namespace="default", label_selector=""):
     pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector)
+    if not pods:
+        logging(f"No pods found for workload '{workload_name}' in namespace '{namespace}' with label selector '{label_selector}'")
+        return False
+
     for pod in pods:
         if not (pod.metadata.annotations and annotation_key in pod.metadata.annotations):
+            logging(f"Pod '{pod.metadata.name}' in namespace '{namespace}' is missing annotation '{annotation_key}'")
             return False
     return True
e2e/tests/negative/component_resilience.robot (2)

29-41: Refactor similar keywords to reduce duplication

The keywords Delete instance-manager of volume ${volume_id} and wait for recover and Delete instance-manager of deployment ${deployment_id} volume and wait for recover have similar steps. Refactoring common actions into a single, parameterized keyword will improve maintainability and reduce code duplication.


43-203: Extract repeated test steps into reusable keywords

The test cases include repeated sequences such as creating volumes, attaching them, writing data, and deleting instance managers. Consider extracting these sequences into reusable keywords to enhance readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa1885 and beebd8c.

📒 Files selected for processing (19)
  • e2e/keywords/backing_image.resource (1 hunks)
  • e2e/keywords/longhorn.resource (1 hunks)
  • e2e/keywords/sharemanager.resource (1 hunks)
  • e2e/keywords/workload.resource (1 hunks)
  • e2e/libs/backing_image/backing_image.py (2 hunks)
  • e2e/libs/backing_image/base.py (1 hunks)
  • e2e/libs/backing_image/crd.py (1 hunks)
  • e2e/libs/backing_image/rest.py (1 hunks)
  • e2e/libs/k8s/k8s.py (3 hunks)
  • e2e/libs/keywords/backing_image_keywords.py (1 hunks)
  • e2e/libs/keywords/k8s_keywords.py (2 hunks)
  • e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
  • e2e/libs/keywords/workload_keywords.py (2 hunks)
  • e2e/libs/sharemanager/base.py (1 hunks)
  • e2e/libs/sharemanager/crd.py (2 hunks)
  • e2e/libs/sharemanager/rest.py (1 hunks)
  • e2e/libs/sharemanager/sharemanager.py (1 hunks)
  • e2e/libs/workload/workload.py (1 hunks)
  • e2e/tests/negative/component_resilience.robot (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • e2e/keywords/backing_image.resource
  • e2e/keywords/sharemanager.resource
  • e2e/libs/backing_image/backing_image.py
  • e2e/libs/backing_image/base.py
  • e2e/libs/backing_image/rest.py
  • e2e/libs/keywords/backing_image_keywords.py
  • e2e/libs/keywords/k8s_keywords.py
  • e2e/libs/keywords/sharemanager_keywords.py
  • e2e/libs/keywords/workload_keywords.py
  • e2e/libs/sharemanager/rest.py
  • e2e/libs/sharemanager/sharemanager.py
🧰 Additional context used
🪛 Ruff
e2e/libs/backing_image/crd.py

57-57: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


69-69: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


69-69: f-string without any placeholders

Remove extraneous f prefix

(F541)


72-72: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


91-91: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/k8s/k8s.py

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)


178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

e2e/libs/sharemanager/crd.py

44-44: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


52-52: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)


55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🔇 Additional comments (12)
e2e/libs/sharemanager/base.py (1)

17-19: Add type hints and documentation for the wait_for_running method.

Consider adding type hints and docstring to better define the interface contract.

Note: A previous review already suggested adding a timeout parameter to this method.

e2e/libs/sharemanager/crd.py (3)

2-14: LGTM: Clean imports and initialization.

The new imports and retry parameter initialization are well-structured and necessary for the added functionality.


24-31: Duplicate of previous review comment.

A previous review already suggested adding error handling and making group, version, and namespace configurable.


33-41: Duplicate of previous review comment.

A previous review already suggested adding error handling for the delete operation.

e2e/keywords/longhorn.resource (3)

70-74: Add error handling and verification steps.

The previous review comment about error handling is still applicable. Additionally, consider verifying the volume state after pod deletion to ensure the operation's success.

#!/bin/bash
# Description: Check if there are any existing error handling patterns in the codebase
# that we should follow for consistency.

# Test: Look for similar pod deletion patterns with error handling
rg -A 5 "delete_pod.*longhorn-system"

# Test: Look for volume state verification patterns
ast-grep --pattern 'get_volume_state'

76-81: Reduce code duplication and enhance error handling.

The previous review comment about code duplication and error handling improvements is still applicable.


83-84: Enhance component recovery verification.

  1. The previous review comment about adding a configurable timeout parameter is still applicable.
  2. Ensure the keyword properly handles all Longhorn component types during recovery scenarios.
#!/bin/bash
# Description: Verify that all Longhorn component types are properly handled.

# Test: Check for any component types defined in the codebase that we should explicitly wait for
rg -A 5 "longhorn-system.*component.*type"

# Test: Look for any existing component recovery patterns
ast-grep --pattern 'wait_for.*component.*running'
e2e/libs/backing_image/crd.py (1)

1-13: LGTM! Class structure and imports are well-organized.

The class follows good practices with appropriate inheritance and initialization.

e2e/libs/k8s/k8s.py (1)

8-9: Skip comment - covered by past review.

🧰 Tools
🪛 Ruff

8-8: workload.pod.wait_for_pod_status imported but unused

Remove unused import: workload.pod.wait_for_pod_status

(F401)


9-9: workload.pod.get_pod imported but unused

Remove unused import: workload.pod.get_pod

(F401)

e2e/keywords/workload.resource (1)

200-215: Verify coverage of all Longhorn component types.

The implementation supports the PR's objective of testing Longhorn components recovery. Let's verify if all component types are covered.

✅ Verification successful

Implementation correctly covers all Longhorn component types

Based on the verification results, the implementation adequately handles all major Longhorn component types:

  • The code supports specific component deletion through labels for:
    • engine-image
    • instance-manager
    • Other components without specific labels (CSI components, UI, etc.)

This is validated by:

  1. Test cases in component_resilience.robot showing comprehensive coverage of components: CSI plugin, attacher, provisioner, resizer, snapshotter, manager, engine-image, instance-manager, UI, and driver-deployer
  2. Consistent label handling in the codebase for component identification (longhorn.io/component)
  3. Support for both node-specific and general pod deletion scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if all Longhorn component types are covered in the test files
# Expected: Find references to all major Longhorn component types in the tests

echo "Checking for Longhorn component types in test files..."
rg -l "longhorn.io/component=" --type yaml

echo "Checking for component deletion test cases..."
rg "Delete Longhorn .* pod" --type robot

echo "Checking for recovery test implementations..."
rg "Wait.*recovery" --type robot

Length of output: 1645


Script:

#!/bin/bash
# Let's check the component labels in the Longhorn YAML manifests
echo "Checking component labels in Longhorn manifests..."
rg "component: " --type yaml -B 2 -A 2

# Let's also check for any other potential component types in the codebase
echo -e "\nChecking for other component references..."
rg "longhorn.io/component" -B 2 -A 2

Length of output: 3465

e2e/libs/workload/workload.py (1)

Line range hint 190-229: LGTM! Consider improving error messages.

The async implementation and stability checking logic are solid. The namespace support is correctly implemented.

Let's verify the async implementation is used correctly throughout the codebase:

Consider improving the error messages:

-        logging(f"Waiting for {workload_name} pods {wait_for_stable_pod} stable, retry ({i}) ...")
+        logging(f"Waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Unstable pods: {wait_for_stable_pod}, retry {i+1}/{retry_count}")

-    assert False, f"Timeout waiting for {workload_name} pods {wait_for_stable_pod} stable)"
+    assert False, f"Timeout waiting for workload '{workload_name}' pods in namespace '{namespace}' to stabilize. Remaining unstable pods: {wait_for_stable_pod}"
✅ Verification successful

Async implementation is correctly used across the codebase

The verification shows that wait_for_workload_pods_stable is:

  • Properly defined as an async function
  • Correctly awaited in workload_keywords.py using await and asyncio.create_task()
  • Used in robot framework keywords (deployment.resource and statefulset.resource) which handle async/await internally
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the async function is properly awaited
# Expected: All calls to wait_for_workload_pods_stable should use await

rg -l "wait_for_workload_pods_stable" | xargs rg "wait_for_workload_pods_stable" -B 2 -A 2

Length of output: 2600

e2e/tests/negative/component_resilience.robot (1)

26-26: Consider making ${DATA_ENGINE} configurable or remove redundant conditionals

${DATA_ENGINE} is set to 'v1' at line 26. All conditionals checking if ${DATA_ENGINE} == 'v1' will always evaluate to True. If testing different engine versions is not intended, consider removing these conditionals to simplify the code. Alternatively, make ${DATA_ENGINE} configurable to allow testing with other engine versions.

Comment on lines +21 to +23
@abstractmethod
def wait_for_restart(self, name, last_creation_time):
return NotImplemented

🛠️ Refactor suggestion

Add type hints, documentation, and timeout parameter for consistency.

The method should follow similar patterns to wait_for_running, including a timeout parameter for consistency.

     @abstractmethod
-    def wait_for_restart(self, name, last_creation_time):
+    def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool:
+        """Wait for a share manager to restart.
+
+        Args:
+            name: Name of the share manager
+            last_creation_time: Previous creation timestamp to compare against
+            timeout: Maximum time to wait in seconds (default: 300)
+
+        Returns:
+            bool: True if restart is confirmed, False if timeout occurs
+        """
         return NotImplemented
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@abstractmethod
def wait_for_restart(self, name, last_creation_time):
return NotImplemented
@abstractmethod
def wait_for_restart(self, name: str, last_creation_time: str, timeout: int = 300) -> bool:
"""Wait for a share manager to restart.
Args:
name: Name of the share manager
last_creation_time: Previous creation timestamp to compare against
timeout: Maximum time to wait in seconds (default: 300)
Returns:
bool: True if restart is confirmed, False if timeout occurs
"""
return NotImplemented

Comment on lines +54 to +68
def wait_for_restart(self, name, last_creation_time):
for i in range(self.retry_count):
time.sleep(self.retry_interval)
try:
sharemanager = self.get(name)
except Exception as e:
logging(f"Finding sharemanager {name} failed with error {e}")
continue

creation_time = sharemanager["metadata"]["creationTimestamp"]
fmt = "%Y-%m-%dT%H:%M:%SZ"
if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
return

assert False, f"Wait share manager {name} restart failed ..."

🛠️ Refactor suggestion

Simplify timestamp comparison logic.

The datetime comparison logic could be simplified and made more robust:

  1. Parse timestamps once at the start
  2. Use UTC consistently
  3. Add validation for timestamp format

Here's a suggested improvement:

     def wait_for_restart(self, name, last_creation_time):
+        fmt = "%Y-%m-%dT%H:%M:%SZ"
+        try:
+            last_time = datetime.strptime(last_creation_time, fmt)
+        except ValueError as e:
+            raise ValueError(f"Invalid last_creation_time format: {e}")
+
         for i in range(self.retry_count):
             try:
                 sharemanager = self.get(name)                    
                 creation_time = sharemanager["metadata"]["creationTimestamp"]
-                fmt = "%Y-%m-%dT%H:%M:%SZ"
-                if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
+                current_time = datetime.strptime(creation_time, fmt)
+                if current_time > last_time:
                     return
             except Exception as e:
                 logging(f"Finding sharemanager {name} failed with error {e}")
                 continue

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff

55-55: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


68-68: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

Comment on lines +46 to +54
def delete_backing_image_manager(self, name):
logging(f"deleting backing image manager {name} ...")
self.obj_api.delete_namespaced_custom_object(
group="longhorn.io",
version="v1beta2",
namespace="longhorn-system",
plural="backingimagemanagers",
name=name
)

🛠️ Refactor suggestion

Add error handling to delete_backing_image_manager.

The method should handle potential API exceptions and verify deletion success.

 def delete_backing_image_manager(self, name):
     logging(f"deleting backing image manager {name} ...")
-    self.obj_api.delete_namespaced_custom_object(
-        group="longhorn.io",
-        version="v1beta2",
-        namespace="longhorn-system",
-        plural="backingimagemanagers",
-        name=name
-    )
+    try:
+        self.obj_api.delete_namespaced_custom_object(
+            group="longhorn.io",
+            version="v1beta2",
+            namespace="longhorn-system",
+            plural="backingimagemanagers",
+            name=name
+        )
+    except client.exceptions.ApiException as e:
+        logging(f"Failed to delete backing image manager {name}: {e}")
+        raise
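
The comment also asks to verify deletion success, which the diff above does not cover. A hypothetical companion helper (the name wait_for_backing_image_manager_deleted and the retry_count / retry_interval attributes are assumptions, not part of the PR) could poll the CR until the API returns 404, reusing the same obj_api client:

    def wait_for_backing_image_manager_deleted(self, name):
        # Illustrative sketch: poll until the backingimagemanager CR is gone.
        for _ in range(self.retry_count):
            time.sleep(self.retry_interval)
            try:
                self.obj_api.get_namespaced_custom_object(
                    group="longhorn.io",
                    version="v1beta2",
                    namespace="longhorn-system",
                    plural="backingimagemanagers",
                    name=name
                )
            except client.exceptions.ApiException as e:
                if e.status == 404:
                    return
                raise
        raise AssertionError(f"backing image manager {name} still exists after deletion")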

Comment on lines +175 to +195
def wait_for_namespace_pods_running(namespace):
    retry_count, retry_interval = get_retry_count_and_interval()

    for i in range(retry_count):
        time.sleep(retry_interval)
        pod_list = list_namespace_pods(namespace)
        all_running = True

        for pod in pod_list.items:
            pod_name = pod.metadata.name
            pod_status = pod.status.phase

            if pod_status != "Running":
                logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
                all_running = False

        if all_running:
            logging(f"All pods in namespace {namespace} are in Running state!")
            return

    assert False, f"wait all pod in namespace {namespace} running failed"

🛠️ Refactor suggestion

Add error handling and documentation to the function.

The function needs additional error handling and documentation:

  1. Add docstring describing purpose, parameters, and return value
  2. Handle empty pod list scenario
  3. Add safety checks for pod status
  4. Consider CrashLoopBackOff scenarios

Here's the suggested implementation:

 def wait_for_namespace_pods_running(namespace):    
+    """Wait for all pods in the specified namespace to reach Running state.
+    
+    Args:
+        namespace (str): The namespace to check pods in
+        
+    Raises:
+        AssertionError: If pods don't reach Running state within retry limit
+    """
     retry_count, retry_interval = get_retry_count_and_interval()
 
     for i in range(retry_count):        
         time.sleep(retry_interval)
         pod_list = list_namespace_pods(namespace)        
+        if not pod_list.items:
+            logging(f"No pods found in namespace {namespace}")
+            return
+            
         all_running = True
 
         for pod in pod_list.items:
             pod_name = pod.metadata.name
-            pod_status = pod.status.phase
+            # Safety check for pod status
+            if not pod.status or not pod.status.phase:
+                logging(f"Pod {pod_name} status not available")
+                all_running = False
+                continue
+                
+            pod_status = pod.status.phase
+            # Check for CrashLoopBackOff
+            if pod.status.container_statuses:
+                for container in pod.status.container_statuses:
+                    if container.state.waiting and \
+                       container.state.waiting.reason == 'CrashLoopBackOff':
+                        logging(f"Pod {pod_name} container {container.name} " \
+                               "is in CrashLoopBackOff state")
+                        all_running = False
 
             if pod_status != "Running":
                 logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
                 all_running = False
 
         if all_running:
             logging(f"All pods in namespace {namespace} are in Running state!")
             return
 
     raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state")
🧰 Tools
🪛 Ruff

178-178: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


195-195: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)
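
As a usage sketch, the keyword layer only needs to call the helper with the Longhorn namespace; the import path below is assumed for illustration, not taken from the PR:

    # Hypothetical caller; adjust the import to wherever the helper lives in e2e/libs.
    from pod import wait_for_namespace_pods_running

    wait_for_namespace_pods_running("longhorn-system")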

Member
@yangchiu yangchiu left a comment

LGTM

@khushboo-rancher khushboo-rancher enabled auto-merge (rebase) November 11, 2024 21:43