Metrics

Real-time metrics provided by Manta form the basis of situational awareness, particularly during incident response. Understanding the metrics provided by the various components internal to Manta requires a basic understanding of these components and how they work together.

This section discusses a number of useful metrics exposed by Manta. When we say a metric is exposed by Manta, we usually mean that Triton (via CMON) and/or individual Manta components collect these metrics and serve them over HTTP in Prometheus format. Within a Manta deployment, it’s up to operators to set up systems for collecting, presenting, and alerting on these metrics. For more on this, see [_deployment_specific_details]. There are screenshots from existing deployments below, but the specific metrics available and the appearance of graphs may vary in your deployment.

Note
This section covers operational metrics used to understand Manta’s runtime behavior. These are not end-user-visible metrics.
Note
This section covers what information is available and how to interpret it. Unfortunately, there is no single place today that documents the list of metrics available.

Key service metrics

Key metrics for assessing the health of a service are driven by whatever the customer cares about. For Manta, that’s typically:

  • the error rate

  • throughput of data in and out of Manta

  • latency

These are closely related to the Four Golden Signals described in the Site Reliability Engineering book from Google. That section provides a good summary of these metrics and why they’re so important.

A Manta deployment typically seeks to achieve a particular level of performance, usually expressed in terms of throughput (objects per second read or written or bytes per second read or written) or latency (the time required to complete each request) while maintaining an acceptable error rate. Errors must be considered when looking at performance, since many types of error responses can be served very quickly — that doesn’t mean the service is working as expected!

Note
We use the term error rate to refer to the fraction of requests that failed with an error (e.g., "3% of responses were 500-level errors" would be a 3% error rate). The same term is sometimes used to refer to the count of errors per unit time (e.g., "3 errors per second"), but that’s usually less useful for describing the service’s health.

Since thousands of requests may be completing every second, we almost always work with some aggregated form of latency:

  • Average latency is useful for use-cases that primarily care about throughput because the average latency is closely related to throughput: throughput (in requests per second) equals the total number of clients divided by the average latency per request. When all you know is that throughput has gone down, it’s hard to tell whether there’s a problem because it could reflect a server issue or a lack of client activity. Average latency resolves this ambiguity: if average latency has gone up, there’s likely a server problem. If not, clients have likely stopped trying to do as much work.

  • Tail latency refers to the latency of the slowest requests (the tail of the latency distribution). Tail latency is often a better indicator of general service health than average latency because small or occasional problems often affect tail latency significantly even when they don’t affect the average that much. Tail latency is often expressed as a percentile: for example, the shorthand "p90" refers to the 90th percentile, which is the minimum latency of the slowest 10% of requests. Similarly, p99 is the minimum latency of the slowest 1% of requests; and so on. When we say that the p99 is 300ms, we mean that 99% of requests completed within 300ms. (A sketch of how these aggregates are typically computed follows this list.)
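To make these aggregates concrete, here is a rough sketch of how both are commonly computed from a Prometheus histogram such as webapi’s http_request_latency_ms (described in the summary table later in this section). The _sum/_count/_bucket suffixes and the 5-minute rate window are standard Prometheus conventions and assumptions on our part; the exact series names may differ depending on the metric client version in your deployment.

----
# Average latency (ms) over the last 5 minutes:
# total request time divided by the number of requests completed.
sum(rate(http_request_latency_ms_sum[5m]))
  /
sum(rate(http_request_latency_ms_count[5m]))

# Estimated p99 latency (ms): the 99th percentile of the latency
# distribution, interpolated from the histogram buckets.
histogram_quantile(0.99,
  sum(rate(http_request_latency_ms_bucket[5m])) by (le))
----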

Not only do we use these metrics (error rate, throughput, and latency) to describe Manta’s overall health, but we also use them to describe the health of the various components within Manta.

Top-level Manta metrics

To understand Manta’s overall health, we typically look at the error rate, throughput, and latency for Muskie (webapi) since Muskie most directly handles the HTTP requests being issued by end users.

Note
If you’re looking for help understanding a Manta problem, see the [_incident_response_decision_tree]. This section provides background on the metrics referenced in that section.

Errors at webapi: For background on errors in Manta, see [_investigating_an_increase_in_error_rate]. That section also discusses how to debug specific errors. At this level, we’re typically talking about explicit 500-level errors. When evaluating an error rate against expectations, we usually use its inverse — the success rate: the fraction of requests that completed successfully. This is often measured in "nines". Generally, if a deployment seeks a particular level of availability (e.g., 99.9%), an incident may be raised if the error rate exceeds a target percentage (e.g., 0.1%) for some period of time (e.g., 5 minutes); an incident may be considered resolved once the error rate is below this threshold. The specific thresholds vary by deployment, but are usually at least 99%. Generally, errors indicate a software failure, a significant server-side failure, or unexpected overload — all of which are not supposed to be common in production deployments.
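As an illustration, the webapi error rate described above can be computed from the http_requests_completed counter (see the summary table later in this section) roughly as follows. This is a sketch: the statusCode label name and the 5-minute window are assumptions, so check the labels actually exposed in your deployment.

----
# Fraction of webapi requests over the last 5 minutes that completed
# with a 500-level response (0.03 here would mean a 3% error rate).
sum(rate(http_requests_completed{statusCode=~"5.."}[5m]))
  /
sum(rate(http_requests_completed[5m]))

# The corresponding success rate (the value measured in "nines"):
1 - (
  sum(rate(http_requests_completed{statusCode=~"5.."}[5m]))
    /
  sum(rate(http_requests_completed[5m]))
)
----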

Throughput at webapi: Manta throughput can be measured in

  • objects created per second

  • objects fetched per second

  • bytes written per second

  • bytes read per second

Any of these might be significant to end users. The target throughput in any of these dimensions is deployment-specific. As mentioned above, it’s very difficult to use throughput directly to assess the server’s behavior because it’s significantly affected by client behavior. Average latency may be more useful.
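These throughput dimensions map directly onto the webapi metrics listed in the summary table later in this section. A rough sketch follows; the method label name used to distinguish reads from writes is an assumption.

----
# Bytes per second uploaded to and downloaded from Manta (including
# in-progress and failed transfers):
sum(rate(muskie_inbound_streamed_bytes[5m]))
sum(rate(muskie_outbound_streamed_bytes[5m]))

# Objects created and fetched per second, assuming a "method" label:
sum(rate(http_requests_completed{method="PUT"}[5m]))
sum(rate(http_requests_completed{method="GET"}[5m]))
----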

Latency at webapi: For webapi, we usually define latency as the time-to-first-byte for each request. For uploads, this is the time from when the request is received at the server to when the server tells the client to proceed with the upload. For downloads, this is the time from when the request is received to when the client receives the first bytes of data. For other requests (which do not involve object data transfer), we look at the whole latency of the request. As we’ve defined it, latency includes the time used by the server to parse the request, validate it, authenticate it, authorize it, load metadata, and initiate any data transfer, but it does not include the time to transfer data. This is useful, since transfer time depends on object size, and we usually want to factor that out.

For general system health, we typically monitor Muskie error rate and tail latency (p90, p99). When throughput is important, average latency is also useful for the reasons mentioned above.

Key metrics internal to Manta

When understanding problems with Manta, we use the same key metrics — error rate, throughput, and latency — measured by various other components.

Note
If you’re looking for help understanding a Manta problem, see the [_incident_response_decision_tree]. This section provides background on the metrics referenced in that section.

Metadata tier: We generally consider the metadata tier to include:

  • Electric-Moray, which is the gateway for nearly all requests to the metadata tier that come from the data path

  • Moray, each instance of which handles requests for a particular database shard

  • PostgreSQL (deployed under Manatee), which ultimately services requests to the metadata tier

Electric-Moray and Moray both operate in terms of RPCs. They both expose metrics that count the number of RPCs completed and failed, as well as histograms that can be used to calculate average latency and estimate tail latency.
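For example, per-method RPC throughput and tail latency at the metadata tier might be computed roughly as follows. This is a sketch: the rpcMethod label name and the standard _bucket histogram suffix are assumptions, so verify them against the metrics actually exposed in your deployment.

----
# RPCs completed per second at electric-moray, broken out by method:
sum(rate(fast_requests_completed[5m])) by (rpcMethod)

# Estimated p95 RPC latency (seconds), broken out by method:
histogram_quantile(0.95,
  sum(rate(fast_server_request_time_seconds_bucket[5m])) by (rpcMethod, le))
----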

PostgreSQL operates in terms of transactions. Manta exposes metrics collected from PostgreSQL about transactions completed and aborted, but not latency. We typically use Moray latency as a proxy for PostgreSQL latency.
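A sketch of per-shard transaction throughput using the pgstatsmon-exposed counters (see the summary table later in this section). The xact_rollback name follows the same naming convention as xact_commit, and the shard grouping label is an assumption; check the labels pgstatsmon attaches in your deployment.

----
# Transactions committed and rolled back per second, broken out by
# whatever label identifies the shard (assumed here to be "shard"):
sum(rate(pg_stat_database_xact_commit[5m])) by (shard)
sum(rate(pg_stat_database_xact_rollback[5m])) by (shard)
----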

Storage tier: We do not currently record request throughput, latency, or error rate from the storage tier. However, Triton (via CMON) collects network bytes transferred and ZFS bytes read and written, which are useful proxies for inbound and outbound data transfer.

Other operational metrics

Manta exposes a number of other useful metrics:

  • CPU utilization, broken out by zone (and filterable by type of component). For stateless services (i.e., most services within Manta), this is a useful way to determine whether individual instances (or a whole service) are overloaded. For example, webapi instances are typically deployed using 16 processes that are effectively single-threaded. If a webapi instance is using close to 1600% CPU (i.e., 16 CPUs’ worth), it’s likely overloaded, and end users are likely to experience elevated latency as a consequence. To interpret these values, you generally have to know how many CPUs a particular component can typically use. (A sketch of how CPU utilization is derived from the raw counters follows this list.)

  • Disk utilization, broken out by zone (and filterable by type of component). This is useful for understanding disk capacity at both the metadata tier and storage tier.

  • PostgreSQL active connections, broken out by shard. This roughly reflects how much concurrent work is happening on each PostgreSQL shard. This can be useful for identifying busy or slow shards (though it can be hard to tell if a shard is slow because it’s busy or if it’s busy because it’s slow).

  • PostgreSQL vacuum activity, described below.

  • TCP errors (collected via Triton’s CMON), including failed connection attempts, listen drops, and retransmitted packets. These can reflect various types of network issues.

  • OS anonymous allocation failures (collected via Triton’s CMON). This event indicates that a process attempted to allocate memory but failed because it had reached a memory cap. Many programs do not handle running out of memory well, so these allocation failures can sometimes result in cascading failures.
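As referenced in the CPU bullet above, here is a rough sketch of deriving CPU utilization from the CMON counters listed in the summary table below. The counters are in nanoseconds of CPU time, so dividing by 1e9 yields CPUs used and multiplying by 100 yields the percent-of-one-CPU form used above; the zonename grouping label is an assumption.

----
# CPU utilization per zone, in percent of one CPU (1600 would mean
# sixteen CPUs' worth of time), from the nanosecond counters:
100 * sum(
  rate(cpu_user_usage[5m]) + rate(cpu_sys_usage[5m])
) by (zonename) / 1e9
----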

Summary of metrics

Below is a rough summary of the metrics exposed by Manta and which components expose them. There are several caveats:

  • This information is subject to change without notice as the underlying software evolves!

  • This table does not describe which Prometheus instances collect, aggregate, and serve each metric. See [_deployment_specific_details].

  • Relatedly, in large deployments, Prometheus recording rules may be used to precalculate important metrics. These are not documented here.

  • Many metrics provide a number of breakdowns using Prometheus labels.

|===
|Component being measured |Where the metric is collected |Metric name |Notes

|Manta itself
|webapi
|muskie_inbound_streamed_bytes
|count of bytes uploaded to Muskie (all uploads, including in-progress and failures). This is a primary metric for end users.

|Manta itself
|webapi
|muskie_outbound_streamed_bytes
|count of bytes downloaded from Muskie (all downloads, including in-progress and failures). This is a primary metric for end users.

|Manta itself
|webapi
|http_requests_completed
|count of requests completed, with labels for individual HTTP response codes. This can also be used to calculate the error rate. This is the basis for several primary metrics for end users.

|Manta itself
|webapi
|http_request_latency_ms
|histogram of request latency, used to calculate average latency and to estimate percentiles. This is a primary metric for end users.

|Electric-Moray
|electric-moray
|fast_requests_completed
|count of RPCs completed, with a label for the RPC method name. This is useful for measuring overall throughput at the metadata tier.

|Electric-Moray
|electric-moray
|fast_server_request_time_seconds
|histogram of RPC latency, with a label for the RPC method name. This is useful for calculating average latency and estimating tail latency at the metadata tier.

|Moray
|moray
|fast_requests_completed
|Same as for Electric-Moray, but measured for a particular Moray instance (and so a particular database shard).

|Moray
|moray
|fast_server_request_time_seconds
|Same as for Electric-Moray, but measured for a particular Moray instance (and so a particular database shard).

|PostgreSQL
|pgstatsmon
|Various
|A number of stats exposed by PostgreSQL are collected and exposed by pgstatsmon. For the authoritative set, see the pgstatsmon source. These stats are named according to their PostgreSQL names; for example, the xact_commit stat in the pg_stat_database view is exposed as pg_stat_database_xact_commit. Labels identify the PostgreSQL instance, which can often be used to break out by shard.

|TCP stack
|CMON (in Triton)
|tcp_listen_drop_count, tcp_listen_drop_Qzero_count
|count of TCP connection attempts dropped on the server side, often due to overload. This is useful for identifying TCP server problems.

|TCP stack
|CMON (in Triton)
|tcp_failed_connection_attempt_count
|count of TCP connection attempts that failed on the client side. This is useful for identifying when clients are having issues, even if you can’t see corresponding server-side failures.

|TCP stack
|CMON (in Triton)
|tcp_retransmitted_segment_count
|count of TCP segments retransmitted. This can indicate a network problem or a software problem on either end of the connection, but interpreting this stat is difficult because there are many non-failure cases where segments are retransmitted.

|OS
|CMON (in Triton)
|mem_anon_alloc_fail
|count of times an operating system process attempted to allocate memory but failed because the container would exceed its cap. This often indicates a type of memory exhaustion.

|OS
|CMON (in Triton)
|cpu_user_usage, cpu_sys_usage
|count of nanoseconds of CPU time (user and system time, respectively) used by a given container, with labels for the container being measured. This is useful for understanding CPU usage, including problems of CPU saturation.

|ZFS (filesystem)
|CMON (in Triton)
|zfs_used, zfs_available
|gauge of the number of bytes used and available, with labels for the container being measured. This is useful for identifying containers that are low on disk space and for understanding overall system storage capacity.
|===
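As one example of combining the CMON-collected gauges above, disk usage can be expressed as a percentage per zone, and allocation failures can be watched as a rate. This is a sketch; the zonename grouping label is an assumption.

----
# Percentage of ZFS space used in each zone:
100 * zfs_used / (zfs_used + zfs_available)

# Anonymous allocation failures per second (normally zero):
sum(rate(mem_anon_alloc_fail[5m])) by (zonename)
----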

Predicting autovacuum activity

Background on vacuum in Manta

Autovacuum activity in PostgreSQL is a major source of degraded performance in large deployments, known to cause throughput degradation of as much as 70% on a per-shard basis. It’s helpful for operators to understand the basics of autovacuum; a deeper understanding requires digging fairly deep into PostgreSQL internals. The PostgreSQL documentation describes autovacuum, the reason for it, and the conditions that trigger it in detail.

Operators should understand at least the following:

  • "Vacuum" is a long-running activity that runs on a per-table basis. This is a maintenance operation that generally has to be run periodically on all tables in all PostgreSQL databases.

  • "Autovacuum" is the name for any vacuum that is scheduled and managed by PostgreSQL itself, as opposed to one that an operator kicks off explicitly.

  • Manta has two primary tables: "manta" and "manta_directory_counts". As mentioned above, each vacuum operation runs on one table at a time (though multiple vacuums can be running at the same time on different tables.)

  • There’s generally a significant degradation in both average query latency and tail latency while vacuum is running. In fixed-concurrency deployments (i.e., when there are a fixed number of clients), an increase in average latency corresponds directly to a decrease in throughput.

We classify vacuum operations into two types:

  • Normal vacuums clean up tuples (rows) in the table that have been invalidated since the last vacuum (usually by UPDATE and DELETE operations). PostgreSQL kicks these off whenever the fraction of dead tuples exceeds a configurable threshold of the table size, which is generally 20%.

  • "Anti-wraparound vacuums" (also sometimes called "wraparound vacuums", "freeze vacuums", or "aggressive vacuums") are responsible for freezing old tuples. PostgreSQL kicks these off whenever it’s been more than a fixed number of transactions since the last time this was done.

Note that each type of vacuum may do the work of the other. A normal vacuum may freeze some tuples, and a freeze vacuum will generally clean up dead tuples. This classification is about what caused PostgreSQL to start the vacuum, and it’s useful because we can monitor the underlying metrics in order to predict when PostgreSQL will kick off vacuum operations.

Again, there’s significantly more information about all of this in the above-linked PostgreSQL documentation.

Using metrics to predict normal autovacuums

As mentioned above, a normal vacuum is kicked off when the number of dead tuples has exceeded 20% of the total tuples in the table. We can see this in Manta metrics. Here’s a graph of live tuples, dead tuples, and the fraction of dead tuples for a made-up table called "test_table" (normally in Manta this would be the "manta" or "manta_directory_counts" table):

[image: metrics postgresql tuples autovac]
Figure 1. A graph of dead tuples, live tuples, and autovacuum activity in a test environment.

In this graph:

  • In the upper graph, the green line shows live tuples. This system is running a heavy INSERT workload, so the count of live tuples increases relatively constantly.

  • In the upper graph, the yellow line shows dead tuples. A fraction of this workload runs UPDATE queries, so there’s a steady increase in dead tuples as well.

  • In the upper graph, the blue line (which goes with the right-hand y-axis) shows the percentage of tuples in the table that are dead. This value also climbs, though not at a linear rate.

  • In the bottom graph, the green bars represent periods where a normal vacuum was running. (You can ignore the yellow bars in this graph.)

Critically, autovacuum starts running when the blue line reaches 20%, for the reasons described above. Further, when the vacuum finishes, the count (and fraction) of dead tuples drops suddenly, because the vacuum has cleaned up those dead tuples. As a result, the blue line can be used to predict when normal vacuums will kick off.
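Assuming pgstatsmon exposes the pg_stat_user_tables counters under the naming convention described in the summary table (so that n_live_tup and n_dead_tup become pg_stat_user_tables_n_live_tup and pg_stat_user_tables_n_dead_tup; verify this against the pgstatsmon source for your deployment), the blue line above corresponds roughly to:

----
# Fraction of dead tuples per table. A normal autovacuum is expected
# once this crosses roughly 0.20 (20%) for the "manta" or
# "manta_directory_counts" table.
pg_stat_user_tables_n_dead_tup
  / (pg_stat_user_tables_n_dead_tup + pg_stat_user_tables_n_live_tup)
----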

Using metrics to predict anti-wraparound autovacuums

As mentioned above, an anti-wraparound vacuum is kicked off on a table when the number of transactions in a database that have been executed since the last such vacuum exceeds some threshold. Manta exposes this metric as well.
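The value plotted in the graphs below is essentially the configured autovacuum_freeze_max_age minus the age of the table’s relfrozenxid. If your pgstatsmon deployment exposes that age as a gauge (the metric name pg_relfrozenxid_age below is hypothetical; check the pgstatsmon source for the actual name), the countdown can be sketched as:

----
# Transactions remaining until PostgreSQL starts an anti-wraparound
# autovacuum on a table. 200000000 is PostgreSQL's default
# autovacuum_freeze_max_age; substitute your deployment's setting.
# pg_relfrozenxid_age is a hypothetical metric name.
200000000 - pg_relfrozenxid_age
----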

Typically, as a workload runs, the transactions-until-wraparound-vacuum decreases at a rate determined by how many transactions are running in the database. For a single shard, we can plot this on a line graph (one graph for each major table):

[image: metrics postgresql wraparound leadup]
Figure 2. Graphs of the number of transactions left until the next wraparound autovacuum kicks off for the "manta" and "manta_directory_counts" tables.

For a large number of shards, we can plot this as a heat map, which helps us see the pattern across shards:

[image: metrics postgresql wraparound leadup heatmap]
Figure 3. Heat maps of the number of transactions left until the next wraparound autovacuum. Brighter blocks indicate a larger number of data points (shards).

In the right-hand heat map, the bright line above 400M indicates that most shards are over 400 million transactions away from the next wraparound autovacuum. The darker line around 100M shows that a smaller number are much closer to the threshold. The left-hand heat map shows much greater variance for the "manta" table, though there’s a cluster (a bright line) just under 100M transactions from the next wraparound autovacuum.

When any of these lines reaches zero, we’d expect PostgreSQL to kick off a wraparound autovacuum. The line will continue decreasing (to negative numbers) until the wraparound autovacuum completes, at which point it will jump back up. Here, we can see a whole wraparound autovacuum cycle:

[image: metrics postgresql wraparound duration]
Figure 4. Graphs of the number of transactions until the next wraparound autovacuum, plus the number of wraparound autovacuums running, for a particular shard over the course of one autovacuum operation.

We see here that we’d expect a wraparound autovacuum to kick off when the count of remaining transactions reaches 0. The line keeps falling until the vacuum completes, at which point it jumps back up. Another round will kick off when the line reaches zero again. (Note that the lower graph here is a prediction based directly on the graph above it. It’s possible, though not common in practice, that PostgreSQL won’t actually have kicked off the wraparound autovacuum at this time.)

Because of this behavior, the graph of transactions until wraparound autovacuum can be used to predict when wraparound autovacuums will kick off for each shard.