Prepare chunks in a worker process #8618
base: develop
Conversation
Walkthrough
The changes in this pull request encompass various enhancements across multiple files, primarily focusing on improving debugging configurations, cache management, error handling, and testing capabilities within the CVAT application. Key updates include new debugging configurations for REST API tests and significant modifications to the …
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (8)
supervisord/worker.chunks.conf (1)
1-9
: Consider security implications of socket location
The Unix socket is currently placed in /tmp/supervisord/. This location could pose security risks:
- /tmp is world-writable
- Socket files might persist between container restarts
- Permissions need to be carefully managed
Consider moving the socket to a dedicated directory (e.g., /var/run/supervisord/) with proper permissions and ownership.
tests/docker-compose.minio.yml (1)
11-11
: Consider adding resource constraints for production readiness.
While the current configuration is consistent with other services, consider defining resource limits (CPU, memory) for the worker service to ensure proper resource management in production environments.
Example addition:
cvat_worker_chunks:
<<: *allow-minio
deploy:
resources:
limits:
cpus: '1.0'
memory: 2G
reservations:
cpus: '0.5'
memory: 1G
helm-chart/templates/cvat_backend/worker_chunks/deployment.yml (2)
1-19
: Consider adding operational metadata annotations.
While the basic metadata is well-structured, consider adding these operational annotations for better manageability:
- app.kubernetes.io/version: track the CVAT version
- app.kubernetes.io/part-of: indicate it is part of the CVAT platform
- prometheus.io/scrape: enable metrics scraping if applicable
annotations:
+ app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
+ app.kubernetes.io/part-of: cvat
+ prometheus.io/scrape: "true"
🧰 Tools
🪛 yamllint
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
21-23
: Consider using RollingUpdate strategy with PodDisruptionBudget.
The current Recreate strategy will cause downtime during updates. For better availability:
- Switch to RollingUpdate
- Add a PodDisruptionBudget
strategy:
- type: Recreate
+ type: RollingUpdate
+ rollingUpdate:
+ maxSurge: 25%
+ maxUnavailable: 25%
Consider adding a PodDisruptionBudget in a separate file:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: {{ .Release.Name }}-backend-worker-chunks-pdb
spec:
minAvailable: 50%
selector:
matchLabels:
app: cvat-app
tier: backend
component: worker-chunks
docker-compose.yml (1)
227-241
: Consider future enhancements for operational robustness.
While the current implementation is consistent with other services, consider these potential enhancements for future iterations:
- Resource limits (memory/CPU) for better resource management
- Health checks for improved container orchestration
- Specific logging configuration for better observability
helm-chart/values.yaml (1)
120-129
: Document the chunks worker configuration.
Consider adding comments to document:
- The purpose of the chunks worker
- Guidelines for adjusting replica count
- Resource requirements and scaling considerations
Example documentation:
analyticsreports:
replicas: 1
labels: {}
annotations: {}
+ # Configuration for the chunks processing worker
+ # Handles parallel processing of data chunks
+ # Default replica count is set to 4 for optimal parallel processing
+ # Adjust based on your workload and available resources
chunks:
replicas: 4
labels: {}
cvat/apps/engine/cache.py (2)
82-83
: Add return type annotation to the _cache method
Adding a return type annotation enhances code readability and assists static type checkers in detecting potential issues.
Apply this diff to add the return type:
 @staticmethod
-def _cache():
+def _cache() -> Any:
     return caches[MediaCache._CACHE_NAME]
73-79
: Add documentation for class-level constants
Adding comments explaining the purpose of class-level constants improves code maintainability and readability.
Consider adding comments like:
_QUEUE_NAME = settings.CVAT_QUEUES.CHUNKS.value # Queue name for chunk processing
_QUEUE_JOB_PREFIX_TASK = "chunks:prepare-item-" # Prefix for job IDs
_SLEEP_TIMEOUT = 0.2 # Sleep duration between job status checks in seconds
_CHUNK_CREATE_TIMEOUT = 50 # Maximum time to wait for chunk creation in seconds
_CACHE_NAME = "media" # Name of the cache to use
_RQ_JOB_RESULT_TTL = 60 # Time-to-live for job results in seconds
_RQ_JOB_FAILURE_TTL = 3600 * 24 * 14 # Time-to-live for failed jobs (2 weeks)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (16)
- .vscode/launch.json (5 hunks)
- cvat/apps/engine/cache.py (19 hunks)
- cvat/apps/engine/frame_provider.py (3 hunks)
- cvat/apps/engine/rq_job_handler.py (1 hunks)
- cvat/apps/engine/views.py (4 hunks)
- cvat/settings/base.py (3 hunks)
- docker-compose.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_chunks/deployment.yml (1 hunks)
- helm-chart/values.yaml (1 hunks)
- supervisord/worker.chunks.conf (1 hunks)
- tests/docker-compose.file_share.yml (1 hunks)
- tests/docker-compose.minio.yml (1 hunks)
- tests/python/rest_api/test_jobs.py (1 hunks)
- tests/python/rest_api/test_tasks.py (3 hunks)
- tests/python/sdk/test_auto_annotation.py (1 hunks)
- tests/python/sdk/test_datasets.py (1 hunks)
🧰 Additional context used
🪛 yamllint
helm-chart/templates/cvat_backend/worker_chunks/deployment.yml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (20)
tests/docker-compose.file_share.yml (1)
8-10
: LGTM! Volume configuration follows established patterns.
The volume mount configuration for the new cvat_worker_chunks service is consistent with existing services, maintaining the same mount point and permissions necessary for testing.
Let's verify the consistency of this configuration across other compose files:
✅ Verification successful
Volume configuration is correctly aligned with other services
The volume mount configuration for cvat_worker_chunks is perfectly consistent with both cvat_server and cvat_worker_import services, using the identical mount path ./tests/mounted_file_share:/home/django/share:rw. This ensures proper file sharing functionality across all related services in the test environment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify cvat_worker_chunks configuration across compose files
# Expected: Should find consistent volume configurations in other compose files
# Check for cvat_worker_chunks service in all docker-compose files
rg -A 5 "cvat_worker_chunks:" "docker-compose*.yml" "tests/docker-compose*.yml"
# Check for the specific volume mount pattern
rg "/home/django/share" "docker-compose*.yml" "tests/docker-compose*.yml"
Length of output: 405
Script:
#!/bin/bash
# Let's try with fd to find docker-compose files first, then use them in our search
echo "=== Docker Compose files in the repository ==="
fd -e yml -e yaml docker-compose
echo -e "\n=== Checking for cvat_worker_chunks service ==="
fd -e yml -e yaml docker-compose -x rg -A 5 "cvat_worker_chunks:"
echo -e "\n=== Checking volume mount patterns ==="
fd -e yml -e yaml docker-compose -x rg "/home/django/share"
Length of output: 1355
Script:
#!/bin/bash
# Let's check the context of these volume mounts to see which services they belong to
echo "=== Checking services with shared volume mounts ==="
fd -e yml -e yaml docker-compose -x rg -B 5 "./tests/mounted_file_share:/home/django/share:rw"
Length of output: 486
supervisord/worker.chunks.conf (2)
28-29
: 🛠️ Refactor suggestion
Consider adding health check and logging configuration for smokescreen
While the localhost binding is secure, the configuration could be enhanced with:
- Health check configuration
- Log file configuration
- Error handling directives
- Process restart parameters
Let's check if these configurations exist in other smokescreen setups:
#!/bin/bash
# Search for other smokescreen configurations
rg "smokescreen" -A 5 -B 5
Consider adding:
[program:smokescreen]
command=smokescreen --listen-ip=127.0.0.1 %(ENV_SMOKESCREEN_OPTS)s
+stdout_logfile=%(ENV_HOME)s/logs/smokescreen.log
+stderr_logfile=%(ENV_HOME)s/logs/smokescreen.err
+autorestart=true
+startretries=3
+stopwaitsecs=10
19-27
: Verify dependencies and consider adding documentation
The configuration looks well-structured, but let's verify the dependencies:
Consider adding a comment block explaining:
- The purpose of this worker
- Expected NUMPROCS values
- The role of VECTOR_EVENT_HANDLER
✅ Verification successful
Dependencies verified successfully, documentation suggestion remains
The verification shows that:
- wait_for_deps.sh exists and properly checks for required services (PostgreSQL, Redis)
- The DefaultWorker class is properly defined in rqworker.py with debug support
- The worker configuration is correctly set up for chunk processing
Consider adding a comment block explaining:
- The purpose of this worker (chunk processing)
- Expected NUMPROCS values
- The role of VECTOR_EVENT_HANDLER
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if wait_for_deps.sh exists and is executable
fd "wait_for_deps.sh" --type f --exec test -x {} \; || echo "Script not executable"
# Verify the worker class exists
ast-grep --pattern 'class DefaultWorker'
Length of output: 5322
Script:
#!/bin/bash
# Check for wait_for_deps.sh
fd "wait_for_deps.sh" --type f
# Look for DefaultWorker class definition
rg "class DefaultWorker" -A 5
# Look for chunks worker related code/docs
rg "rqworker.*chunks" -A 3
Length of output: 605
Script:
#!/bin/bash
# Check content of wait_for_deps.sh
cat wait_for_deps.sh
# Look for DefaultWorker in cvat/rqworker.py or similar paths
fd "rqworker.py" --type f --exec cat {}
Length of output: 2815
tests/docker-compose.minio.yml (1)
11-11
: LGTM! The new service is properly configured.
The cvat_worker_chunks service is correctly integrated with MinIO using the allow-minio anchor, maintaining consistency with other worker services.
cvat/apps/engine/rq_job_handler.py (1)
31-32
: Consider security implications of exception metadata storage.
While adding exception type and args to job metadata improves debugging capabilities, ensure that sensitive information is properly sanitized before storage. Consider adding documentation about:
- The purpose and usage of these fields
- Any sanitization measures in place
- How these fields integrate with the new chunks worker
Let's verify the usage of these new fields:
✅ Verification successful
Exception metadata storage implementation is properly secured
The code review findings show that the exception metadata fields are used appropriately and securely:
- In views.py, the fields store actual exception information from the system (exc_type, exc_value.args) only for chunk-related jobs
- In cache.py, these fields are used to reconstruct and re-raise the original exception, providing accurate error reporting
- The implementation is properly scoped to RQ job metadata and doesn't expose sensitive information
The usage pattern shows these fields serve their intended purpose for error handling in the chunks worker while maintaining security best practices.
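For readers unfamiliar with this store-and-re-raise pattern, a minimal sketch follows. It is illustrative only: the function names and meta keys are assumptions for this sketch (the PR itself uses RQJobMetaField.EXCEPTION_TYPE and RQJobMetaField.EXCEPTION_ARGS), and it is not the exact CVAT code.
# Worker side: an RQ exception handler records the exception class and args
# in the job's meta so the web process can re-raise a faithful copy later.
def store_exception_in_meta(job, exc_type, exc_value, traceback):
    job.meta["exc_type"] = exc_type        # corresponds to RQJobMetaField.EXCEPTION_TYPE
    job.meta["exc_args"] = exc_value.args  # corresponds to RQJobMetaField.EXCEPTION_ARGS
    job.save_meta()
    return True  # allow the remaining handlers / default behavior to run

# Request side: rebuild and raise the original error from the stored metadata.
def reraise_from_meta(job):
    meta = job.get_meta()
    exc_type = meta.get("exc_type", Exception)
    exc_args = meta.get("exc_args", ("Cannot create chunk",))
    if not (isinstance(exc_type, type) and issubclass(exc_type, Exception)):
        exc_type = Exception  # defensive fallback, as also suggested later in this review
    raise exc_type(*exc_args)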
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of the new exception metadata fields
rg -l "EXCEPTION_TYPE|EXCEPTION_ARGS" --type python
Length of output: 84
Script:
#!/bin/bash
# Search for usage of the new exception metadata fields without file type restriction
rg -l "EXCEPTION_TYPE|EXCEPTION_ARGS"
# Also search for any exception handling patterns in the RQ job handler
rg -A 5 "except|raise" cvat/apps/engine/rq_job_handler.py
Length of output: 882
Script:
#!/bin/bash
# Let's examine how these fields are used in the identified files
rg -B 2 -A 5 "EXCEPTION_TYPE|EXCEPTION_ARGS" cvat/apps/engine/views.py cvat/apps/engine/rq_job_handler.py cvat/apps/engine/cache.py
# Also check for any exception metadata setting patterns
rg -B 2 -A 5 "set_job_meta|update_job_meta" cvat/apps/engine/views.py cvat/apps/engine/rq_job_handler.py cvat/apps/engine/cache.py
Length of output: 2081
tests/python/sdk/test_datasets.py (1)
27-27
: LGTM! Consider documenting the Redis restoration strategy.
The addition of restore_redis_inmem_per_function alongside restore_redis_ondisk_per_function provides comprehensive test isolation. This is a good practice to prevent test interference.
Let's verify the Redis restoration strategy across the test suite:
Consider adding a comment explaining:
- Why both ondisk and inmem Redis restoration are necessary
- The performance implications of per-function restoration
✅ Verification successful
Let me analyze the Redis restoration fixtures usage across the test suite to make a final determination.
From the results:
- The fixtures are defined in tests/python/shared/fixtures/init.py
- They are used consistently across both SDK and REST API tests
- Some tests use both fixtures while others use only one, showing intentional usage patterns:
- SDK tests often use both or just ondisk
- REST API tests predominantly use inmem
Let me check the fixture implementations to understand their specific purposes:
The Redis restoration strategy is well-implemented and necessary
The implementation reveals clear distinctions between the two Redis instances:
- inmem: selectively cleans keys while preserving specific ones (via _get_redis_inmem_keys_to_keep())
- ondisk: performs a complete flush of the Redis instance running on port 6666
The usage in test_datasets.py is correct as it needs both fixtures because:
- The inmem instance handles temporary data and caching
- The ondisk instance manages persistent data
Both are required for complete test isolation in dataset operations.
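For orientation, a rough sketch of what such per-function cleanup fixtures typically look like is shown below. The fixture names, the helper _get_redis_inmem_keys_to_keep(), and port 6666 follow the findings above; the hosts, the inmem port, and the key-scanning details are assumptions, not the project's actual fixtures.
import pytest
from redis import Redis

def _get_redis_inmem_keys_to_keep() -> set:
    # Keys that must survive between tests (contents assumed for illustration).
    return set()

@pytest.fixture
def restore_redis_inmem_per_function():
    yield  # run the test first, then clean up
    conn = Redis(host="localhost", port=6379)  # assumed connection details
    keep = _get_redis_inmem_keys_to_keep()
    for key in conn.keys("*"):
        if key not in keep:
            conn.delete(key)

@pytest.fixture
def restore_redis_ondisk_per_function():
    yield
    # The on-disk instance holds persistent data and is flushed completely.
    Redis(host="localhost", port=6666).flushall()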
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check Redis restoration patterns and documentation
# Test 1: Look for other test files using these fixtures
rg "restore_redis_(inmem|ondisk)_per_function"
# Test 2: Check for any Redis-related documentation
fd -e md -e rst | xargs rg -l "redis"
Length of output: 2996
Script:
#!/bin/bash
# Check the fixture implementations
cat tests/python/shared/fixtures/init.py | grep -A 20 "restore_redis_"
Length of output: 3060
docker-compose.yml (1)
227-241
: LGTM! Service definition follows established patterns.
The new cvat_worker_chunks service is well-integrated and follows the consistent patterns established by other worker services in terms of:
- Image and restart policy
- Backend dependencies
- Environment variables
- Volume mounts
- Network configuration
helm-chart/values.yaml (1)
120-129
: Verify the need for 4 replicas.
The configuration structure follows the established pattern and provides all necessary customization points. However, the default replica count (4) is higher than other workers (1-2). Please verify if this level of parallelization is required for the chunk processing workload.
.vscode/launch.json (4)
172-172
: LGTM! Good practice with trailing commas.
Adding trailing commas makes future changes cleaner and reduces merge conflicts.
Also applies to: 182-182
364-385
: LGTM! Well-structured RQ worker configuration.
The new chunks worker configuration follows the established pattern and includes all necessary settings consistent with other RQ workers.
579-580
: LGTM! Proper integration into debug compound.
The new chunks worker is correctly added to the compound debug configuration, maintaining consistent style and order.
373-374
: Verify RQ queue registration.
Ensure that the "chunks" queue is properly registered in the RQ configuration.
✅ Verification successful
Queue registration verified successfully
The "chunks" queue is properly registered in the RQ configuration. Found in cvat/settings/base.py:
- Defined as CHUNKS in the CVAT_QUEUES enum
- Configured in RQ_QUEUES with shared queue settings and a 1-minute default timeout
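A simplified sketch of what that registration looks like follows; the shared connection settings are condensed here and the exact values in cvat/settings/base.py may differ.
from enum import Enum

class CVAT_QUEUES(Enum):
    CHUNKS = "chunks"
    # ... other CVAT queues ...

# Shared Redis connection settings (values are illustrative).
shared_queue_settings = {
    "HOST": "localhost",
    "PORT": 6379,
    "DB": 0,
}

RQ_QUEUES = {
    CVAT_QUEUES.CHUNKS.value: {
        **shared_queue_settings,
        "DEFAULT_TIMEOUT": 60,  # the 1-minute default timeout noted above
    },
}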
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the chunks queue is registered in RQ configuration
# Expected: Find queue registration in Django settings or RQ configuration files
# Search for queue registration
rg -l 'chunks.*queue' --type py
Length of output: 33
Script:
#!/bin/bash
# Let's try different search patterns to find RQ queue configurations
# Search for RQ queue configurations
rg -l "QUEUES.*=.*\[.*\]" --type py
# Search for any mentions of RQ queues
rg -l "RQ.*QUEUES" --type py
# Search for chunks worker or queue definitions
rg -l "chunks.*worker|QUEUE.*chunks" --type py
# Look for Django settings files that might contain queue configs
fd "settings.*\.py$"
Length of output: 515
Script:
#!/bin/bash
# Let's examine the content of base settings and other relevant files
# Check base settings file for queue configurations
rg -A 5 -B 5 "QUEUES|RQ_" cvat/settings/base.py
# Check engine default settings
rg -A 5 -B 5 "QUEUES|RQ_" cvat/apps/engine/default_settings.py
# Look for any worker or queue related configurations
rg -A 5 -B 5 "WORKER_QUEUE|DEFAULT_QUEUE|QUEUES" cvat/apps/engine/default_settings.py cvat/settings/base.py
Length of output: 5256
tests/python/sdk/test_auto_annotation.py (1)
33-33
: Good addition for test isolation!
Adding restore_redis_inmem_per_function alongside restore_redis_ondisk_per_function ensures complete Redis state cleanup between test functions. This improves test isolation by managing both on-disk and in-memory Redis states, which helps prevent test interference and increases reliability.
Let's verify the usage of this fixture across the test suite:
✅ Verification successful
Redis fixture usage is consistent across the test suite
The fixture restore_redis_inmem_per_function is properly defined in tests/python/shared/fixtures/init.py and is consistently used across multiple test files, including SDK tests and REST API tests. The addition in test_auto_annotation.py follows the established pattern of using this fixture for Redis state cleanup, ensuring test isolation.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for other test files using this fixture to ensure consistent usage
rg "restore_redis_inmem_per_function" "tests/"
Length of output: 1958
cvat/settings/base.py (2)
552-558
: Verify Redis connection string format
The Redis connection string for the media cache appears correct, but let's ensure the URL encoding of special characters in the password is handled properly.
Let's verify the Redis connection string handling:
#!/bin/bash
# Search for Redis connection handling to ensure proper URL encoding
rg -g '!*.{json,lock}' "LOCATION.*redis://"
# Check if there are any existing Redis connection error handlers
rg -g '!*.{json,lock}' -A 5 "RedisCache.*error"
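Beyond the search above, the concern boils down to percent-encoding the password before embedding it in the LOCATION URL. A small hedged example (variable names and the password are illustrative, not taken from the settings file):
from urllib.parse import quote

redis_host = "localhost"
redis_port = 6379
redis_password = "p@ss:word/with#specials"  # example password with reserved characters

# Percent-encode the password so reserved characters cannot break the URL.
location = f"redis://:{quote(redis_password, safe='')}@{redis_host}:{redis_port}/1"
# -> redis://:p%40ss%3Aword%2Fwith%23specials@localhost:6379/1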
277-277
: LGTM: Queue configuration for chunks processing
The addition of the CHUNKS queue with a 1-minute timeout appears reasonable for chunk processing tasks. The configuration follows the established pattern and reuses the shared Redis connection settings.
Let's verify that the timeout aligns with typical chunk processing times:
Also applies to: 323-326
tests/python/rest_api/test_jobs.py (1)
694-694
: LGTM: Redis fixture addition enhances test isolation.
The addition of the restore_redis_inmem_per_class fixture to TestGetGtJobData ensures that Redis in-memory storage is properly restored between test classes, preventing potential test interference and improving test reliability.
cvat/apps/engine/views.py (4)
726-727
: LGTM: Proper timeout handling with retry mechanism
The implementation correctly handles TimeoutError by returning a 429 (Too Many Requests) response with a Retry-After header, allowing clients to implement proper backoff strategies.
811-827
: LGTM: Consistent timeout handling for job data retrieval
The implementation properly handles TimeoutError during chunk retrieval, maintaining consistency with the parent class implementation. The error response includes appropriate status code and retry information.
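As a reference for the pattern described in these two items, a minimal DRF-style sketch is shown below; the retry interval and function name are assumptions, not values taken from the PR.
from rest_framework import status
from rest_framework.response import Response

RETRY_AFTER_SECONDS = 30  # assumed value

def respond_with_chunk(data_getter):
    try:
        data = data_getter()
    except TimeoutError:
        # Ask the client to back off and retry instead of failing hard.
        return Response(
            status=status.HTTP_429_TOO_MANY_REQUESTS,
            headers={"Retry-After": str(RETRY_AFTER_SECONDS)},
        )
    return Response(data)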
2976-2977
: LGTM: Consistent timeout handling for cloud storage operations
The implementation follows the established pattern for timeout handling across the codebase, providing consistent error responses for cloud storage operations.
3264-3266
: LGTM: Enhanced exception handling for chunk processing
The implementation adds specialized exception metadata storage for chunk processing jobs, improving debugging capabilities while maintaining the existing exception handling behavior.
[supervisord]
nodaemon=true
logfile=%(ENV_HOME)s/logs/supervisord.log ; supervisord log file
logfile_maxbytes=50MB ; maximum size of logfile before rotation
logfile_backups=10 ; number of backed up logfiles
loglevel=debug ; info, debug, warn, trace
pidfile=/tmp/supervisord/supervisord.pid ; pidfile location
💡 Codebase verification
Debug logging and PID file location need attention across all supervisord configurations
The verification confirms that debug-level logging and the /tmp PID file location are consistently used across all supervisord configuration files:
- All 9 supervisord config files have loglevel=debug
- All configs use /tmp/supervisord/supervisord.pid for PID file storage
This is a systemic configuration that should be addressed across all files to:
- Set appropriate production logging level
- Use a more secure location for PID files
- Ensure consistent changes across all supervisord configurations
🔗 Analysis chain
Review logging configuration for production use
- Debug level logging is enabled, which could:
- Impact performance
- Fill disk space faster
- Potentially expose sensitive information
- PID file is stored in /tmp, which has similar security implications as the socket file
Let's check if debug logging is used in other worker configurations:
Consider:
- Setting loglevel to 'info' for production
- Moving PID file to a more appropriate location
- Adding log rotation configuration to other log handlers
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check logging levels in other supervisord configs
rg "loglevel=" -A 1 -B 1
Length of output: 5322
Script:
#!/bin/bash
# Check logging levels in supervisord configs using a more specific pattern
rg "loglevel\s*=" supervisord/
Length of output: 817
Script:
#!/bin/bash
# Check for PID file configurations in supervisord configs
rg "pidfile\s*=" supervisord/
Length of output: 861
Script:
#!/bin/bash
# List all supervisord config files to ensure we're not missing any
fd -e conf -e ini supervisord/
Length of output: 386
spec:
serviceAccountName: {{ include "cvat.backend.serviceAccountName" . }}
containers:
- name: cvat-backend
image: {{ .Values.cvat.backend.image }}:{{ .Values.cvat.backend.tag }}
imagePullPolicy: {{ .Values.cvat.backend.imagePullPolicy }}
{{- with merge $localValues.resources .Values.cvat.backend.resources }}
resources:
{{- toYaml . | nindent 12 }}
{{- end }}
args: ["run", "worker.chunks"]
env:
{{ include "cvat.sharedBackendEnv" . | indent 10 }}
{{- with concat .Values.cvat.backend.additionalEnv $localValues.additionalEnv }}
{{- toYaml . | nindent 10 }}
{{- end }}
{{- $probeArgs := list "chunks" -}}
{{- $probeConfig := dict "args" $probeArgs "livenessProbe" $.Values.cvat.backend.worker.livenessProbe -}}
{{ include "cvat.backend.worker.livenessProbe" $probeConfig | indent 10 }}
volumeMounts:
- mountPath: /home/django/data
name: cvat-backend-data
subPath: data
- mountPath: /home/django/logs
name: cvat-backend-data
subPath: logs
{{- with concat .Values.cvat.backend.additionalVolumeMounts $localValues.additionalVolumeMounts }}
{{- toYaml . | nindent 10 }}
{{- end }}
{{- with merge $localValues.affinity .Values.cvat.backend.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with concat .Values.cvat.backend.tolerations $localValues.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
volumes:
{{- if .Values.cvat.backend.defaultStorage.enabled }}
- name: cvat-backend-data
persistentVolumeClaim:
claimName: "{{ .Release.Name }}-backend-data"
{{- end }}
{{- with concat .Values.cvat.backend.additionalVolumes $localValues.additionalVolumes }}
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
Add security context and reliability configurations.
Several important security and reliability configurations are missing:
- Security context for running as non-root
- Readiness probe for proper load balancing
- Mandatory resource limits
- Node affinity for data locality
Apply these security and reliability improvements:
spec:
serviceAccountName: {{ include "cvat.backend.serviceAccountName" . }}
+ securityContext:
+ runAsNonRoot: true
+ runAsUser: 1000
+ fsGroup: 1000
containers:
- name: cvat-backend
image: {{ .Values.cvat.backend.image }}:{{ .Values.cvat.backend.tag }}
imagePullPolicy: {{ .Values.cvat.backend.imagePullPolicy }}
+ securityContext:
+ allowPrivilegeEscalation: false
+ capabilities:
+ drop: ["ALL"]
resources:
+ limits:
+ cpu: "2"
+ memory: "4Gi"
+ requests:
+ cpu: "500m"
+ memory: "1Gi"
# ... existing config ...
+ readinessProbe:
+ exec:
+ command: ["python3", "manage.py", "health_check", "chunks"]
+ initialDelaySeconds: 30
+ periodSeconds: 60
affinity:
+ nodeAffinity:
+ preferredDuringSchedulingIgnoredDuringExecution:
+ - weight: 100
+ preference:
+ matchExpressions:
+ - key: cvat.io/chunk-processing
+ operator: In
+ values:
+ - "true"
frame, frame_name, _ = segment_frame_provider._get_raw_frame(
    task_chunk_frames_with_rel_numbers[task_chunk_frame_id], quality=quality
)
🛠️ Refactor suggestion
Avoid accessing private methods from outside the class
The method _get_raw_frame is intended to be private, as indicated by the leading underscore. Accessing it from outside the SegmentFrameProvider class violates encapsulation principles. Consider adding a public method to expose the necessary functionality, or refactor the code to avoid the need to access this private method externally.
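One possible shape for such a public accessor is sketched below; the method name and signature are assumptions for illustration, not part of the PR.
class SegmentFrameProvider:
    def _get_raw_frame(self, frame_number, *, quality):
        ...  # existing private implementation

    def get_raw_frame(self, frame_number, *, quality):
        """Public wrapper so callers do not depend on the private helper."""
        return self._get_raw_frame(frame_number, quality=quality)

# The test could then call the public method instead:
# frame, frame_name, _ = segment_frame_provider.get_raw_frame(
#     task_chunk_frames_with_rel_numbers[task_chunk_frame_id], quality=quality
# )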
cvat/apps/engine/frame_provider.py
Outdated
@staticmethod
def _get_chunk_create_callback(db_segment: models.Segment | int, segment_chunk_frame_ids, chunk_number, quality) -> DataWithMime:
    # Create and return a joined / cleaned chunk
    if isinstance(db_segment, int):
        db_segment = models.Segment.objects.get(pk=db_segment)

    if db_segment.type == models.SegmentType.RANGE:
        return MediaCache.prepare_custom_range_segment_chunk(
            db_task=db_segment.task,
            frame_ids=segment_chunk_frame_ids,
            quality=quality,
        )
    elif db_segment.type == models.SegmentType.SPECIFIC_FRAMES:
        return MediaCache.prepare_custom_masked_range_segment_chunk(
            db_task=db_segment.task,
            frame_ids=segment_chunk_frame_ids,
            chunk_number=chunk_number,
            quality=quality,
            insert_placeholders=True,
        )
    else:
        assert False
🛠️ Refactor suggestion
Refactor to eliminate code duplication between _get_chunk_create_callback methods
The _get_chunk_create_callback methods in both TaskFrameProvider and JobFrameProvider share similar logic. This code duplication can make maintenance more challenging and increase the risk of inconsistencies. Consider refactoring the shared code into a common function or a helper method to promote reusability and improve maintainability.
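One possible shape for such a helper, based only on the snippet shown above, is sketched here; it is not the implementation from the PR, and the import paths are assumed.
from cvat.apps.engine import models            # assumed import path
from cvat.apps.engine.cache import MediaCache  # assumed import path

def prepare_segment_chunk(db_segment, frame_ids, chunk_number, quality):
    # Shared branching extracted from both _get_chunk_create_callback methods.
    if isinstance(db_segment, int):
        db_segment = models.Segment.objects.get(pk=db_segment)

    if db_segment.type == models.SegmentType.RANGE:
        return MediaCache.prepare_custom_range_segment_chunk(
            db_task=db_segment.task, frame_ids=frame_ids, quality=quality,
        )
    elif db_segment.type == models.SegmentType.SPECIFIC_FRAMES:
        return MediaCache.prepare_custom_masked_range_segment_chunk(
            db_task=db_segment.task, frame_ids=frame_ids,
            chunk_number=chunk_number, quality=quality,
            insert_placeholders=True,
        )
    else:
        raise NotImplementedError(f"Unsupported segment type: {db_segment.type}")

# Both TaskFrameProvider._get_chunk_create_callback and
# JobFrameProvider._get_chunk_create_callback could then delegate to this helper.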
else:
    assert False
Replace assert False with an explicit exception
Using assert False might be problematic because assertions can be disabled when Python is run with optimization flags (-O). It's safer to raise an explicit exception to handle unexpected cases. Consider raising an appropriate exception such as NotImplementedError to indicate unsupported segment types.
Suggested change:
else:
- assert False
+ raise NotImplementedError(f"Unsupported segment type: {db_segment.type}")
cvat/apps/engine/cache.py
Outdated
exc_type = job_meta.get(RQJobMetaField.EXCEPTION_TYPE, Exception)
exc_args = job_meta.get(RQJobMetaField.EXCEPTION_ARGS, ("Cannot create chunk",))
raise exc_type(*exc_args)
Validate exception type before raising
When raising an exception using exc_type(*exc_args), ensure that exc_type is indeed an exception class to prevent unexpected behavior. Add validation to confirm exc_type is a subclass of Exception.
Apply this diff to add validation:
job_meta = rq_job.get_meta()
exc_type = job_meta.get(RQJobMetaField.EXCEPTION_TYPE, Exception)
+ if not issubclass(exc_type, Exception):
+ exc_type = Exception
exc_args = job_meta.get(RQJobMetaField.EXCEPTION_ARGS, ("Cannot create chunk",))
raise exc_type(*exc_args)
cvat/apps/engine/cache.py
Outdated
    assert item is not None
else:
Replace assert statement with exception handling
Using assert statements in production code is not recommended as they can be skipped when Python is run with optimizations. Replace the assert with explicit exception handling to ensure None values are properly managed.
Apply this diff to fix the issue:
item = self._get_cache_item(key)
- assert item is not None
+ if item is None:
+ raise ValueError(f"Failed to retrieve cache item {key} after job completion")
cvat/apps/engine/cache.py
Outdated
slogger.glob.info(f"Starting to prepare chunk: key {key}")
if self._is_run_inside_rq():
item = self._create_and_set_cache_item(
key,
create_callback,
True,
*args,
**kwargs,
)
else:
# compare checksum
item_data = item[0].getbuffer() if isinstance(item[0], io.BytesIO) else item[0]
item_checksum = item[2] if len(item) == 3 else None
if item_checksum != self._get_checksum(item_data):
slogger.glob.info(f"Recreating cache item {key} due to checksum mismatch")
item = create_item()
queue = self._get_queue()
rq_id = self._make_queue_job_id_base(key)
rq_job = queue.fetch_job(rq_id)

if not rq_job:
rq_job = queue.enqueue(
self._create_and_set_cache_item,
key,
create_callback,
False,
*args,
**kwargs,
job_id=rq_id,
result_ttl=self._RQ_JOB_RESULT_TTL,
failure_ttl=self._RQ_JOB_FAILURE_TTL,
)

if self._wait_for_rq_job(rq_job):
item = self._get_cache_item(key)
assert item is not None
else:
raise TimeoutError(f"Chunk processing takes too long {key}")

slogger.glob.info(f"Ending to prepare chunk: key {key}")
🛠️ Refactor suggestion
Add exception handling for create_callback failures
In _create_cache_item, if create_callback raises an exception, it may not be properly handled, leading to unhandled exceptions in the job. Consider adding exception handling around create_callback to ensure errors are logged and the job fails gracefully.
Apply this diff to enhance error handling:
slogger.glob.info(f"Starting to prepare chunk: key {key}")
if self._is_run_inside_rq():
+ try:
item = self._create_and_set_cache_item(
key,
create_callback,
True,
*args,
**kwargs,
)
+ except Exception as e:
+ slogger.glob.error(f"Error creating cache item for key {key}: {e}", exc_info=True)
+ raise
else:
{{- toYaml . | nindent 8 }}
{{- end }}
spec:
serviceAccountName: {{ include "cvat.backend.serviceAccountName" . }}
containers:
- name: cvat-backend
Check warnings (Code scanning / SonarCloud):
- Service account permissions should be restricted (Medium)
- Storage limits should be enforced (Medium)
- Memory limits should be enforced (Medium)
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## develop #8618 +/- ##
===========================================
- Coverage 74.19% 74.06% -0.13%
===========================================
Files 401 401
Lines 43504 43658 +154
Branches 3950 3950
===========================================
+ Hits 32278 32337 +59
- Misses 11226 11321 +95
Could you also add queue handling and locking for this action: cvat/cvat/apps/engine/serializers.py Line 1149 in 5520212
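For illustration only, one way to combine queueing with a cross-process lock is sketched below; the queue name, lock name, and timeouts are assumptions, and the real change would live in the serializer code referenced above.
import django_rq

def enqueue_with_lock(lock_name, func, *args, **kwargs):
    queue = django_rq.get_queue("chunks")  # assumed queue choice
    # Reuse the queue's Redis connection for a cross-process lock so that
    # concurrent requests do not enqueue conflicting work for the same resource.
    with queue.connection.lock(lock_name, timeout=60, blocking_timeout=10):
        return queue.enqueue(func, *args, **kwargs)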
a5d115f to 1831cc4
Quality Gate failed
Failed conditions. See analysis details on SonarQube Cloud.
Added:
Motivation and context
How has this been tested?
Checklist
develop
branch(cvat-canvas,
cvat-core,
cvat-data and
cvat-ui)
License
Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests