When using OverlayBD in production, we will need to monitor the health of OverlayBD components using popular cloud-native instrumentation tooling.
A similar issue was raised in containerd/overlaybd#101. There are certain things users could try, but it would be great if this were supported by the DADI service so that it can be standardized and reused. I believe this is key to helping DADI adoption.
The following metrics are a rough idea of what we'd like to monitor:
Overlaybd:
Healthcheck ping for the Overlaybd daemon
Number of failed blob reads, grouped by HTTP status (500 for a registry error, 404 for a blob that does not exist, 403 for an auth failure, etc.)
Blob read latency for each block (e.g. 1M)
Other unexpected errors, such as failures to write to the local cache or online decompression failures
Virtual block device IO hang monitoring
Virtual block device IO latency
Overlaybd-snapshotter:
Healthcheck ping for the snapshotter daemon
Error count for all gRPC APIs (prepare, commit, etc.)
Latency for all gRPC APIs
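To make the request more concrete, here is a minimal sketch of how two of the metrics above (failed blob reads grouped by HTTP status, and blob read latency) might look in the Prometheus text exposition format. The metric names, the bucket boundaries, and the `Metrics` class are all hypothetical, made up for illustration; nothing here reflects existing overlaybd code.

```python
from collections import defaultdict

# Hypothetical metric names; not part of overlaybd or the snapshotter today.
FAILED_READS = "overlaybd_blob_read_failures_total"
READ_LATENCY = "overlaybd_blob_read_latency_seconds"
BUCKETS = (0.01, 0.05, 0.1, 0.5, 1.0)  # assumed upper bounds, in seconds


class Metrics:
    def __init__(self):
        self.failures = defaultdict(int)  # HTTP status -> failure count
        self.bucket_counts = [0] * len(BUCKETS)
        self.latency_sum = 0.0
        self.latency_count = 0

    def record_failure(self, http_status):
        self.failures[http_status] += 1

    def observe_latency(self, seconds):
        self.latency_sum += seconds
        self.latency_count += 1
        # Histogram buckets are cumulative: each bucket counts every
        # observation less than or equal to its upper bound.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[i] += 1

    def render(self):
        lines = [f"# TYPE {FAILED_READS} counter"]
        for status in sorted(self.failures):
            lines.append(f'{FAILED_READS}{{status="{status}"}} {self.failures[status]}')
        lines.append(f"# TYPE {READ_LATENCY} histogram")
        for bound, count in zip(BUCKETS, self.bucket_counts):
            lines.append(f'{READ_LATENCY}_bucket{{le="{bound}"}} {count}')
        lines.append(f'{READ_LATENCY}_bucket{{le="+Inf"}} {self.latency_count}')
        lines.append(f"{READ_LATENCY}_sum {self.latency_sum}")
        lines.append(f"{READ_LATENCY}_count {self.latency_count}")
        return "\n".join(lines) + "\n"
```

Calling `record_failure(404)` on each failed read and `observe_latency(elapsed)` after each block read would be enough to alert on error rates per status code and on latency percentiles. The same shape would apply to the snapshotter's gRPC error counts and latencies, with a `method` label in place of `status`.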
Ideally, the above metrics would be exposed in Prometheus format so that it is easy to monitor DADI in cloud-native environments.
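As a rough illustration of what "exposed in Prometheus" could mean for each daemon, the sketch below serves a health check and a metrics page over HTTP using only the Python standard library. The `/healthz` and `/metrics` paths, the metric name, and the helper functions are assumptions for the example, not an existing DADI interface.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric, shown with a fixed value for illustration only.
METRICS_BODY = (
    "# TYPE overlaybd_snapshotter_grpc_errors_total counter\n"
    'overlaybd_snapshotter_grpc_errors_total{method="prepare"} 0\n'
)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":  # assumed health check path
            self._reply(200, "ok\n")
        elif self.path == "/metrics":  # path Prometheus scrapes by default
            self._reply(200, METRICS_BODY)
        else:
            self._reply(404, "not found\n")

    def _reply(self, code, body):
        data = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_server(port=0):
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server, path):
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode("utf-8")
    finally:
        conn.close()
```

With something like this in place, a Prometheus scrape job could target `/metrics`, and a Kubernetes liveness probe could target the health check path.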
Some similar monitoring support:
Please let me know your thoughts. The metrics mentioned above are just some quick ideas, and I would be happy to discuss them.