Add case test_metric_longhorn_backup #2145

Merged
merged 2 commits on Nov 11, 2024
Changes from 1 commit
167 changes: 162 additions & 5 deletions manager/integration/tests/test_metric.py
@@ -6,7 +6,7 @@
from kubernetes.stream import stream
from prometheus_client.parser import text_string_to_metric_families

-from common import client, core_api, pod, volume_name # NOQA
+from common import client, core_api, pod, volume_name, batch_v1_api # NOQA

from common import crash_engine_process_with_sigkill
from common import delete_replica_processes
@@ -35,6 +35,25 @@
from common import DEFAULT_DISK_PATH
from common import Gi

from backupstore import set_random_backupstore # NOQA
from common import create_recurring_jobs
from common import check_recurring_jobs
from common import wait_for_cron_job_count
from common import create_backup
from common import wait_for_backup_count
from common import delete_backup_volume

RECURRING_JOB_NAME = "recurring-test"
TASK = "task"
GROUPS = "groups"
CRON = "cron"
RETAIN = "retain"
BACKUP = "backup"
CONCURRENCY = "concurrency"
LABELS = "labels"
DEFAULT = "default"
SCHEDULE_1MIN = "* * * * *"

# The dictionaries use float type of value because the value obtained from
# prometheus_client is in float type.
# https://github.com/longhorn/longhorn-tests/pull/1531#issuecomment-1833349994
@@ -138,6 +157,21 @@ def examine_metric_value(found_metric, metric_labels, expected_value=None):
        assert found_metric.value >= 0.0


def wait_for_metric_sum_on_all_nodes(client, core_api, metric_name, metric_labels, expected_value): # NOQA
    for _ in range(RETRY_COUNTS):
        time.sleep(RETRY_INTERVAL)

        try:
            check_metric_sum_on_all_nodes(client, core_api, metric_name,
                                          metric_labels, expected_value)
            return
        except AssertionError:
            continue

    check_metric_sum_on_all_nodes(client, core_api, metric_name,
                                  metric_labels, expected_value)

Comment on lines +160 to +173

🛠️ Refactor suggestion

Consider enhancing the wait_for_metric_sum_on_all_nodes function

The function is well-implemented, but consider the following improvements:

  1. Add parameters for RETRY_COUNTS and RETRY_INTERVAL to make the function more flexible.
  2. Consider using exponential backoff instead of fixed intervals for more efficient retrying.
  3. Add logging to provide more visibility into the waiting process.
  4. Consider returning a boolean to indicate success or failure instead of relying on an exception.

Example implementation:

def wait_for_metric_sum_on_all_nodes(client, core_api, metric_name, metric_labels, expected_value, max_wait_time=300, initial_interval=1):
    start_time = time.time()
    interval = initial_interval
    while time.time() - start_time < max_wait_time:
        try:
            check_metric_sum_on_all_nodes(client, core_api, metric_name, metric_labels, expected_value)
            return True
        except AssertionError:
            time.sleep(interval)
            interval = min(interval * 2, 60)  # exponential backoff, max 60 seconds
    
    return False

This implementation provides more flexibility and better handles long-running waits.
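
If the boolean-returning variant were adopted, call sites in the test would assert on the result instead of relying on the final AssertionError. A hypothetical usage sketch, with the metric name and labels borrowed from the test below:

# Hypothetical call site for the boolean-returning variant sketched above.
assert wait_for_metric_sum_on_all_nodes(
    client, core_api, "longhorn_backup_state",
    {"volume": volume_name, "recurring_job": ""}, 3), \
    "timed out waiting for longhorn_backup_state to reach the expected sum"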


def check_metric_sum_on_all_nodes(client, core_api, metric_name, expected_labels, expected_value=None): # NOQA
    # Initialize total_metrics to store the sum of the metric values.
    total_metrics = {"labels": defaultdict(None), "value": 0.0}
@@ -440,12 +474,12 @@ def test_metric_longhorn_snapshot_actual_size_bytes(client, core_api, volume_nam
 
     When 1 snapshot is created by user
     And 1 snapshot is created by system
-    Then has a metric longhorn_snapshot_actual_size_bytes value equals to the
-        size of the user created snapshot,
+    Then has a metric longhorn_snapshot_actual_size_bytes value
+        equals to the size of the user created snapshot,
         and volume label is the volume name
         and user_created label is true
-    And has a metric longhorn_snapshot_actual_size_bytes value equals to the
-        size of the system created snapshot,
+    And has a metric longhorn_snapshot_actual_size_bytes value
+        equals to the size of the system created snapshot,
         and volume label is the volume name
         and user_created label is false
 
@@ -615,3 +649,126 @@ def test_node_metrics(client, core_api): # NOQA
    wait_for_node_update(client, lht_hostId, "allowScheduling", False)
    check_metric_with_condition(core_api, "longhorn_node_status",
                                metric_labels, 0.0)


def test_metric_longhorn_backup(set_random_backupstore, client, core_api, batch_v1_api, volume_name): # NOQA
    """
    Scenario: test metric longhorn_backup_actual_size_bytes and
              longhorn_backup_state

    Issue: https://github.com/longhorn/longhorn/issues/9429

    Given a volume

    When a backup is created by user
    Then has a metric longhorn_backup_actual_size_bytes value
         equals to the size of the backup,
         and volume label is the volume name
         and recurring_job label is empty
    And has a metric longhorn_backup_state value equals to 3 (Completed),
        and volume label is the volume name
        and recurring_job label is empty

    When a recurring backup job is created
    Then should have a metric longhorn_backup_actual_size_bytes value
         equals to the size of the backup,
         and volume label is the volume name
         and recurring_job label is the job name
    And should have a metric longhorn_backup_state
        value equals to 3 (Completed),
        and volume label is the volume name
        and recurring_job label is the job name
    """
    self_hostId = get_self_host_id()

    # create a volume and attach it to a node.
    volume_size = 50 * Mi
    client.create_volume(name=volume_name,
                         numberOfReplicas=1,
                         size=str(volume_size))
    volume = wait_for_volume_detached(client, volume_name)
    volume.attach(hostId=self_hostId)
    volume = wait_for_volume_healthy(client, volume_name)
Comment on lines +684 to +691

🛠️ Refactor suggestion

Consider parameterizing volume creation

The volume creation process is hardcoded. Consider parameterizing the volume size and number of replicas to make the test more flexible and reusable.

Example:

def create_test_volume(client, name, size=50*Mi, replicas=1):
    client.create_volume(name=name, numberOfReplicas=replicas, size=str(size))
    volume = wait_for_volume_detached(client, name)
    volume.attach(hostId=get_self_host_id())
    return wait_for_volume_healthy(client, name)

volume = create_test_volume(client, volume_name)


    # create the user backup.
    data_size = 10 * Mi
    backup_data = {'pos': 0,
                   'len': data_size,
                   'content': generate_random_data(data_size)}
    write_volume_data(volume, backup_data)
    create_backup(client, volume_name)
    bv = client.by_id_backupVolume(volume_name)
    wait_for_backup_count(bv, 1)

    # get the backup size.
    backup_size = 0
    backups = bv.backupList().data
    for backup in backups:
        if backup['snapshotName'] == "volume-head":
            continue

        backup_size = int(backup['size'])
    assert backup_size > 0
Comment on lines +706 to +711

⚠️ Potential issue

Potential issue with backup size assignment

In the loop iterating over backups, the variable backup_size is assigned the size of each backup without accumulating. If multiple backups exist, backup_size will hold the size of the last backup processed. Ensure that you are capturing the correct backup size intended for the test, possibly by identifying the specific backup needed or summing the sizes if appropriate.
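
One option is to accumulate the sizes rather than keep only the last one seen. A minimal sketch, using only the snapshotName and size fields the test already reads (the helper name is illustrative):

# Hypothetical helper: sum the sizes of all non-volume-head backups so the
# result does not depend on the iteration order of backupList().
def sum_backup_sizes(backup_volume):
    total = 0
    for backup in backup_volume.backupList().data:
        if backup['snapshotName'] == "volume-head":
            continue
        total += int(backup['size'])
    return total

backup_size = sum_backup_sizes(bv)
assert backup_size > 0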

Comment on lines +704 to +711

🛠️ Refactor suggestion

Refactor duplicate code for getting backup size

The code for obtaining the backup size is duplicated for user backups and recurring backups. Consider refactoring this into a helper function to improve maintainability and reduce redundancy.

Here's a suggested helper function:

def get_backup_size(backup_volume):
    backups = backup_volume.backupList().data
    for backup in backups:
        if backup['snapshotName'] == "volume-head":
            continue
        return int(backup['size'])
    return 0

# Then use it like this:
backup_size = get_backup_size(bv)
assert backup_size > 0, "Backup size should be greater than 0"

Also applies to: 751-758


Comment on lines +704 to +712

🛠️ Refactor suggestion

Refactor to eliminate code duplication when calculating backup sizes

The code for obtaining backup_size (lines 701-709) and recurring_backup_size (lines 749-757) is similar. Consider refactoring this duplicated code into a helper function to improve maintainability and reduce redundancy.

Here's an example of a helper function:

def get_backup_size(backup_volume):
    backups = backup_volume.backupList().data
    for backup in backups:
        if backup['snapshotName'] == "volume-head":
            continue
        return int(backup['size'])
    return 0

You can then use this function to obtain the backup sizes:

backup_size = get_backup_size(bv)

Also applies to: 749-757

    # assert the metric values for the user backup.
    user_backup_metric_labels = {
        "volume": volume_name,
        "recurring_job": "",
    }
    wait_for_metric_sum_on_all_nodes(client, core_api,
                                     "longhorn_backup_actual_size_bytes",
                                     user_backup_metric_labels,
                                     backup_size)

    wait_for_metric_sum_on_all_nodes(client, core_api,
                                     "longhorn_backup_state",
                                     user_backup_metric_labels,
                                     3)

    # delete the existing backup before creating a recurring backup job.
    delete_backup_volume(client, volume_name)
Comment on lines +728 to +729

🛠️ Refactor suggestion

Consider adding error handling for backup volume deletion

The delete_backup_volume call should be wrapped in a try-except block to handle potential errors during deletion. This will make the test more robust.

Example:

try:
    delete_backup_volume(client, volume_name)
except Exception as e:
    pytest.fail(f"Failed to delete backup volume: {str(e)}")

Contributor Author

Hi @c3y1huang

In my test case test_metric_longhorn_backup, I referenced wait_for_metric_count_all_nodes for similar usage. However, I have a question about the test design: although I delete the backups that were not created by the recurring job, I am concerned about potential data caching issues when reusing the same volume for the backup recurring job to check the Longhorn backup metric. Could you please help review this test case?

Thanks

Contributor Author

After longhorn/longhorn-manager#3216 was merged, my test case passed, and I did not observe any potential data caching issues when using the same volume to apply a backup recurring job to check the Longhorn backup metrics.

[Screenshot: Screenshot_20241024_154554]


    # create a recurring backup job.
    recurring_jobs = {
        RECURRING_JOB_NAME: {
            TASK: BACKUP,
            GROUPS: [DEFAULT],
            CRON: SCHEDULE_1MIN,
            RETAIN: 1,
            CONCURRENCY: 1,
            LABELS: {},
        },
    }
    create_recurring_jobs(client, recurring_jobs)
    check_recurring_jobs(client, recurring_jobs)
    wait_for_cron_job_count(batch_v1_api, 1)

    # wait for the recurring backup job to run.
    time.sleep(60)
    bv = client.by_id_backupVolume(volume_name)
    wait_for_backup_count(bv, 1)
Comment on lines +746 to +749

🛠️ Refactor suggestion

Consider replacing sleep with a more robust waiting mechanism

Using a fixed time.sleep(60) may introduce unnecessary delays or may not be sufficient if the backup takes longer. Consider using a polling mechanism or a wait function that checks for the specific condition you're waiting for.

Here's a suggested approach:

def wait_for_backup_completion(client, volume_name, timeout=300, interval=2):
    start_time = time.time()
    while time.time() - start_time < timeout:
        bv = client.by_id_backupVolume(volume_name)
        if len(bv.backupList().data) > 0:
            return True
        time.sleep(interval)
    raise TimeoutError(f"Backup for volume {volume_name} did not complete within {timeout} seconds")

# Replace the sleep and subsequent lines with:
wait_for_backup_completion(client, volume_name)
bv = client.by_id_backupVolume(volume_name)


    # get the recurring backup size.
    recurring_backup_size = 0
    backups = bv.backupList().data
    for backup in backups:
        if backup['snapshotName'] == "volume-head":
            continue

        recurring_backup_size = int(backup['size'])
    assert recurring_backup_size > 0
Comment on lines +754 to +759

⚠️ Potential issue

Potential issue with recurring backup size assignment

Similar to the user backup size, in the loop iterating over backups, recurring_backup_size is assigned the size of each backup without accumulating. If multiple backups exist, recurring_backup_size will hold the size of the last backup processed. Ensure that you are capturing the correct backup size for the recurring backup test, possibly by identifying the specific backup associated with the recurring job.
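
Because the job is created with RETAIN: 1 and the test already waits via wait_for_backup_count(bv, 1), another option is to assert that exactly one non-volume-head backup exists and take its size. A minimal sketch, reusing only fields already present in the diff:

# Hypothetical tightening: with RETAIN set to 1, exactly one non-volume-head
# backup is expected, so fail loudly if that assumption ever breaks.
recurring_backups = [b for b in bv.backupList().data
                     if b['snapshotName'] != "volume-head"]
assert len(recurring_backups) == 1
recurring_backup_size = int(recurring_backups[0]['size'])
assert recurring_backup_size > 0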


    # assert the metric values for the recurring backup.
    recurring_backup_metric_labels = {
        "volume": volume_name,
        "recurring_job": RECURRING_JOB_NAME,
    }
    wait_for_metric_sum_on_all_nodes(client, core_api,
                                     "longhorn_backup_actual_size_bytes",
                                     recurring_backup_metric_labels,
                                     recurring_backup_size)

    wait_for_metric_sum_on_all_nodes(client, core_api,
                                     "longhorn_backup_state",
                                     recurring_backup_metric_labels,
                                     3)