Incident Response Decision Tree

Choose a place to start:

Many of these sections refer to checking recent historical metrics. See [_deployment_specific_details] for more information.

Investigating a decrease in overall throughput

Start here if you have reason to believe that inbound or outbound throughput to Manta has recently been reduced.

First, can you confirm the decrease in throughput from Manta’s own metrics? If not, the problem may be external to Manta. Since problems often unfold over several minutes, you may have to look back over several hours to get a sense of recent throughput.

Figure 1. A sudden drop in inbound throughput after several hours of steady performance.
  • Has the error rate increased? Check recent historical metrics for an increase in 400-level or 500-level errors.

    • Yes, there’s an increase in the overall error rate. See [_investigating_an_increase_in_error_rate]. It may be that many requests that were previously accounting for throughput are now failing. (If the error rate is not high enough to account for the change in throughput, then it may be useful to investigate latency instead.)

      Figure 2. A transient spike in 503 errors. (Note the absolute number is extremely low here. To see this against a larger number of successful requests, you may need to click on "503" in the legend to show only the 503s.)
    • No, the error rate is unchanged. Has webapi latency increased? Check recent historical metrics for an increase in average latency, p90 latency, or p95 latency.

      • Yes, webapi latency has increased. See [_investigating_an_increase_in_latency].

        Figure 3. An increase in webapi tail latency.
      • No, webapi latency is unchanged or lower. If there are no errors and no increased latency, then Manta appears healthy. Check upstream network connections, client metrics, and client behavior. (e.g., has the client changed its workload?)

Background: By definition, the throughput of the system (either inbound or outbound) is the number of completed requests per second times average request size (in terms of inbound or outbound bytes, depending on which throughput you’re interested in). If throughput is lower, then one of the following is likely true:

  1. There are fewer requests per second because there’s a fixed concurrency and average request latency has increased. This is generally a server problem. For example, it’s common for there to be a fixed number of clients, each with a fixed maximum concurrency. In that situation, if Manta requests start taking twice as long, then the number of requests completed per second will naturally be cut in half, cutting overall throughput in half.

  2. Average request size has reduced because requests are failing (resulting in small requests). This is generally a server problem, though it could also be a client problem. For example, if the average upload is 10 MiB, but suddenly 10% of requests are failing before they reach the upload stage, overall throughput is likely to degrade by 10%, since 10% of requests are now uploading almost no data.

  3. There are fewer requests per second because clients are making fewer requests. This would be a client change (and may not be an issue). For example, this could happen when an operator turns off some clients.

  4. Average request size has reduced because of a deliberate change in the client workload. This would be a client change (and may not be an issue).
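As a concrete (hypothetical) illustration of case 1: at 1,000 completed requests per second and an average request size of 8 MiB, inbound throughput is roughly 8 GiB/s. If client concurrency is fixed and average latency doubles, the request rate drops to about 500 per second and throughput drops to roughly 4 GiB/s, even though no individual request has failed.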

Investigating an increase in latency

When there’s reason to suspect an overall increase in webapi latency, real-time metric dashboards can be used to identify the source. The first step is typically confirming whether latency is visible at webapi.

Note

A Manta deployment can easily execute tens of thousands of requests per second, each with an associated latency. It’s useful to summarize a large number of latencies with a single number so that we can identify trends, both for raising alarms and for general situational awareness during incident response.

Average latency is the sum of latencies divided by the request count. This is useful for understanding changes in throughput, since throughput changes that result from increased latency would manifest as increased average latency.

However, normal-looking averages can obscure high-latency requests that have an outsize impact on clients. As a result, we often focus on tail latency, or the latencies of the slowest requests. The notation p99 refers to the 99th-percentile latency: the value such that 99% of requests completed within that latency. If the p90 is 300ms, then 90% of requests completed within 300ms.

When making broad statements about latency, it’s helpful to be specific about whether you’re talking about average latency or tail latency. However, the questions below are intentionally ambiguous, as increases in either average or tail latency might be cause for further investigation.
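If you want to sanity-check what a dashboard reports, these summaries are easy to compute by hand from raw per-request latencies. The sketch below assumes a hypothetical file, latencies.txt, with one latency in milliseconds per line:

    # Average latency (ms).
    awk '{ sum += $1 } END { if (NR > 0) print sum / NR }' latencies.txt

    # p90 and p99 via a simple nearest-rank estimate over the sorted values.
    sort -n latencies.txt | awk '
        { a[NR] = $1 }
        END {
            i90 = int(NR * 0.90); if (i90 < 1) i90 = 1
            i99 = int(NR * 0.99); if (i99 < 1) i99 = 1
            print "p90:", a[i90]
            print "p99:", a[i99]
        }'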

Figure 4. A steady increase in webapi tail latency.
Figure 5. A transient spike in webapi tail latency.

Has latency at Muskie increased?

  • No, Muskie latency has not increased. There does not appear to be an increase in latency from Manta. If clients are reporting elevated latency, try to confirm that. One approach is to collect information from a client about a specific slow request and build a request timeline. This can help determine if latency is coming from within Manta or not.

  • Yes, Muskie latency has increased. At this point, you can find or generate a specific slow request and then understand its latency. But when you’re looking at a trend like this, it’s often helpful to look at latency at other parts of the system. The next question is: Has latency at Electric-Moray increased?

    • Yes, latency has increased at Electric-Moray, too. In this case, has latency at Moray increased as well?

      • Yes, latency has increased at Moray, too. See below.

      • No, there’s no corresponding increase in latency at Moray. This is unusual. There may be a problem with particular electric-moray instances, or the whole tier may be overloaded. See [_investigating_a_slow_tier_of_node_processes].

    • No, there’s no corresponding increase in latency at Electric-Moray. This is unusual. There may be a problem with particular webapi instances, or the whole tier may be overloaded. See [_investigating_a_slow_tier_of_node_processes].

Figure 6. A decrease in webapi latency that’s well correlated with a decrease in maximum Moray tail latency across all shards.

If you found above that latency at Moray has increased, take a look at the Moray latency metrics broken out by shard name (or zonename, and then map the zonenames to shard names). Remember to ignore shard 1. Have multiple Moray shards seen an increase in latency?

  • No, only one shard has elevated latency. You’ll need to dig into the behavior of this shard. See below.

  • Yes, multiple Moray shards have elevated latency. Moray shards are generally independent, so problems with multiple shards may need to be investigated separately. On the other hand, you may wind up looking at system-wide PostgreSQL metrics next, in which case you may be able to answer questions about several shards at once.

If you’ve established that a particular shard’s Moray latency has increased, the next step is to identify whether Moray or PostgreSQL is the source of the latency. This is a bit more difficult than the previous steps because we do not have p99 query latency metrics from PostgreSQL.

There are a couple of different questions to ask:

  • Are all Moray zones affected in the same way? Specifically, compare tail latency, average latency (if possible), and queue depth across the zones for this shard. Are all of them elevated, or do some look different from the others?

    • Yes, all Moray zones in this shard appear affected in the same way. This suggests a problem with PostgreSQL rather than Moray.

    • No, some Moray zones in this shard appear much more affected than others. This is more likely to reflect a problem with specific Moray instances rather than PostgreSQL. See the question below about database connections, and see also [_investigating_a_slow_tier_of_node_processes].

  • Is there a high rate of Moray queueing on this shard, relative to other shards? If there’s a high rate of queueing, the database might be the source of the latency. If not, it’s possible that Moray is the source of the problem.

  • If there is queueing at Moray, check the number of backend connections (or processes) reported on the PostgreSQL dashboard; a query for checking this directly is sketched after this list. Does this shard have the same number of database connections as other shards?

    • Yes, this shard has the same number of connections as other shards. Move on to other questions.

    • No, this shard has fewer connections than other shards. This may indicate a connection management problem at Moray. If Moray instances lose track of connections, they may be limited in how much work they can dispatch to PostgreSQL, resulting in increased latency as requests queue up.

    • No, this shard has far more connections than other shards. If this shard has over 500 connections, that may indicate a problem with Moray. There are generally supposed to be at most 64 connections per Moray zone in this shard, and we usually deploy 3-6 zones per shard.
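If you want to check connection counts directly rather than from the dashboard, you can count backends on the shard’s current PostgreSQL primary. This is a minimal sketch; it assumes you have a shell in the shard’s postgres zone and that connecting as the postgres user with local defaults works in your deployment:

    # Total backends on this shard's primary, for comparison against other shards.
    psql -U postgres -t -c 'SELECT count(*) FROM pg_stat_activity;'

    # The same count broken out by backend state (active, idle, etc.).
    psql -U postgres -c 'SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;'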

Based on these questions, if it seems like the problem is associated with PostgreSQL, see [_investigating_postgresql_latency]. Otherwise, the problem is likely with Moray. See [_investigating_a_slow_tier_of_node_processes].

Investigating an increase in error rate

There are several major kinds of error.

  • A 500-level response (that is, a well-formed HTTP response with a status code between 500 and 599) generally reflects a problem with Manta.

  • A 400-level response (that is, a well-formed HTTP response with a status code between 400 and 499) may happen under normal operation and may indicate no problem at all, or it may reflect a client issue. In rare cases, a 499 response can reflect a server issue. The details depend on the specific type of error and whether the client expects it or not.

  • If the client gives up before the server has sent a response, then the client will likely report a client timeout, while the server will likely report a connection that was abruptly closed by the client.

  • If there’s a networking issue that causes the client or server to abandon the connection, both sides will generally report an explicit socket error.

Errors that manifest as explicit HTTP responses (i.e., 400-level and 500-level responses) are visible to operators via logs and metrics provided by Manta.
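As a quick cross-check against the dashboards, you can tally recent Muskie responses by status code straight from the logs. This is a sketch, not a vetted recipe: it reuses the manta-oneach pattern shown later in this document and assumes the audit log entries record the status code in a res.statusCode field:

    # Count responses by HTTP status code in the current hour's Muskie logs,
    # across all webapi zones (run from a headnode).
    manta-oneach -s webapi \
        'svcs -L muskie | xargs cat | json -ga res.statusCode | sort | uniq -c | sort -n'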

There are less common failures as well:

  • A client may find no servers in DNS. This may or may not be visible to operators.

  • A client may time out attempting to resolve DNS. This is unlikely to be visible to Manta operators.

  • A client may time out attempting to establish a TCP connection to Manta. This is not likely to be visible to Manta operators.

The rest of this section discusses specific response codes for errors and what they mean.

Response code 507

Response code 503

A 503 Service Unavailable response generally indicates that Manta is refusing some requests because it is overloaded or some dependencies are not functioning. There are three major cases where this happens:

  • At least one Moray instance is at its maximum queue length and is refusing new requests.

  • There are not enough online storage nodes to handle the upload.

  • Muskie did not respond to the request quickly enough.

In all cases, you can follow [_investigating_a_specific_request_that_has_failed] to find the cause of the failure.

If you already have a particular 503 response, you can quickly determine which of these cases caused it.

If you have a large number of 503s, you can check for systemic causes:

  • Are there Moray shards with high queue lengths? Check recent historical metrics for Moray queue length. If any zone or shard has more than a few thousand items queued, it may be causing 503 responses.

Response code 502

Manta no longer issues this response code. If you see it, please file a bug. Historically, this was associated with slow Muskie requests. These are generally now reported as 503s.

Response code 500

This generally indicates a server-side bug. See [_finding_or_generating_a_failed_request] to learn why the request failed.

Response code 499

499 is an internal status code used to describe when a client appears to have abandoned a request. Specifically, this is recorded when a client closes its socket before finishing a request. In this case, there is no response, since the server may have no place to send it.

499s may be seen if:

  • the client gave up (timed out) before the server sent an initial response (e.g., a 100-continue)

  • the client crashed (closing its sockets)

  • a network issue disrupted the connection between client and server

If you already have the Muskie log entry for a specific 499 request, was the latency fairly high? (If you know the client’s timeout, was the request latency longer than that timeout?) Check the "latency" field in the Muskie log entry. Also compare the Date header in the request with the timestamp of the log entry. If these don’t match up, the request may have been queued somewhere before being processed by Muskie.
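If you’re doing this for more than one request, it can be quicker to pull those fields out of the log entry directly. A minimal sketch, run inside the webapi zone that handled the request; the bunyan timestamp field is "time", but the assumption that the request’s Date header is recorded under req.headers.date should be verified against a real entry:

    # Print the log entry's timestamp, the request's Date header, and the
    # recorded latency (ms) for a specific request id.
    svcs -L muskie | xargs cat | grep REQUEST_ID | json -ga time req.headers.date latency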

Response code 413

Other 400-level response codes

These are usually client issues, though it’s always possible there are server-side bugs that cause erroneous 400-level responses. (For an obscure example, see MANTA-2319.) The only way to be sure is to examine the request and response to see if the response appears correct.

Investigating a specific request that has failed

Start here if you want to understand why a specific request has failed. These steps will help you find corresponding log entries with more detail.

Ideally, you’ll want to have:

  • the x-server-name header from the response

  • the x-request-id header from the response

  • the approximate time of the response (which calendar hour it was sent)

  • the IP address that the client used to reach Manta

In many cases, you can get by with only some of this information. The more of it you have, the easier (and faster) it will be to find the request.

You might also try generating your own request to investigate.
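One way to do that is to issue a request yourself and capture the response headers, so that you have the x-request-id, x-server-name, and timestamp up front. A sketch, assuming the node-manta CLI tools are configured and that /myaccount/stor/some/object is a placeholder for an object you can read:

    # Generate a signed URL for the object and fetch it, printing only the
    # response headers (which include x-server-name and x-request-id).
    curl -sS -o /dev/null -D - "$(msign /myaccount/stor/some/object)"

Since you issued the request yourself, you also know exactly when it was handled.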

If you find the log entry, see [_understanding_a_muskie_log_entry] for details. If you find none, see [_if_there_is_no_muskie_log_entry].

If you have the x-request-id and x-server-name headers

The x-server-name header gives you the uuid for the "webapi" zone that processed this request.

  • Was the request completed after the top of the current hour?

    • Yes, the request was handled after the top of the hour. The log entry will be inside the Muskie zone. First, find the datacenter where the Muskie zone that handled the request is deployed. From the headnode of that datacenter, use manta-oneach to search the Muskie log file for the request id:

      manta-oneach -z WEBAPI_ZONE_UUID 'svcs -L muskie | xargs cat | grep REQUEST_ID' | bunyan

      filling in WEBAPI_ZONE_UUID from the x-server-name header and REQUEST_ID from the x-request-id header.

    • No, the request was handled earlier than that. The log entry will generally be in a historical log file inside Manta itself. Use mlogin or mget to fetch the path (a concrete example follows this list):

      /poseidon/stor/logs/muskie/YYYY/MM/DD/HH/UUID8.log

      where YYYY/MM/DD/HH represent the year, month, day, and hour when the request completed and UUID8 is the first 8 characters of the x-server-name header. If this object does not exist in Manta, and Manta has been having availability issues, then the historical log file may still be inside the corresponding "webapi" zone. Log into the "webapi" zone and use grep to search for the request ID in the files in /var/log/manta/upload.

    • I don’t know when the request was handled. In this case, you need to check all of the log files mentioned above. You may be able to use a Manta job to scan a large number of historical files at once. For example, you can search all of a day’s log files for one server using:

      mfind -t o -n UUID8.log /poseidon/stor/logs/muskie/YYYY/MM/DD |
          mjob create -o -m 'grep REQUEST_ID || true' -r bunyan

      As before, UUID8 is the first 8 characters of the x-server-name header.
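For the historical-log case above, a concrete sketch of fetching one hour’s log for the server in question and searching it (YYYY/MM/DD/HH, UUID8, and REQUEST_ID are placeholders as before):

    mget /poseidon/stor/logs/muskie/YYYY/MM/DD/HH/UUID8.log | grep REQUEST_ID | bunyan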

If you find the log entry, see [_understanding_a_muskie_log_entry] for details. If you find none, see [_if_there_is_no_muskie_log_entry].

If you have the x-request-id, but no x-server-name

In this case, you have to check the log files for all "webapi" zones to find the log entry.

  • Was the request completed since the top of the current hour?

    • Yes, the request was handled since the top of the hour. The log entry will be inside one of the Muskie zones. Separately for each datacenter in this Manta, use manta-oneach to search all the Muskie logs:

      manta-oneach -s webapi 'svcs -L muskie | xargs cat | grep REQUEST_ID' | bunyan
    • No, the request was handled earlier than that. Use a job to search historical logs with names:

      /poseidon/stor/logs/muskie/YYYY/MM/DD/HH/*.log

      where YYYY/MM/DD/HH represent the year, month, day, and hour when the request completed.

      For example, you can search all log files for a particular hour with:

      mfind -t o /poseidon/stor/logs/muskie/YYYY/MM/DD/HH |
          mjob create -o -m 'grep REQUEST_ID || true' -r bunyan
    • I don’t know when the request was handled. In this case, you need to check all of the log files mentioned above.

If you find the log entry, see [_understanding_a_muskie_log_entry] for details. If you find none, see [_if_there_is_no_muskie_log_entry].

If you don’t have the x-request-id

If you don’t have the request id, then you’ll need some other information about the request that you can use to filter it. Examples include:

  • the name of the account, if that account only made a few requests around the time in question

  • the path that was used, if that’s relatively unique among requests

  • a particular client header that’s somewhat uncommon

  • a very small time window in which the request may have happened

If you have this sort of information, your best bet is to use some combination of grep and json to scan all of the log entries for the appropriate time window.

Tip
When working out a grep or json pipeline, it’s helpful to use mlogin to get an interactive shell for a particular Muskie log file. There, you can refine your shell pipeline until it matches what you want, possibly using slightly different parameters (e.g., a different account name) than you’ll use for the real search, since you probably didn’t happen to pick a log file containing the precise entry you’re looking for. Then run that same shell pipeline in a Manta job over a much larger number of Muskie log files.
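For instance, a pipeline along these lines filters one log file by account and path and prints a few fields per matching request. It’s a sketch only: the field names other than "latency" (req.caller.login, req.url, res.statusCode) and the assumption that mlogin exposes the object via $MANTA_INPUT_FILE should be verified against a real entry first:

    # Inside an mlogin session on one Muskie log file:
    grep myaccount "$MANTA_INPUT_FILE" |
        json -ga -c 'this.req && this.req.url && this.req.caller && this.req.caller.login === "myaccount" && this.req.url.indexOf("/myaccount/stor/reports/") === 0' \
            req.method req.url res.statusCode latency

In a job’s map phase the object also arrives on standard input, so the same pipeline works there with the "$MANTA_INPUT_FILE" argument dropped from grep.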

If you find the log entry, see [_understanding_a_muskie_log_entry] for details. If you find none, see [_if_there_is_no_muskie_log_entry].

If there is no Muskie log entry

There’s a difference between there being no Muskie log entry and not being able to find the Muskie log entry for a request.

You may know that there’s no log entry for a request if:

  • you have the rough timestamp and x-server-name header, found a non-empty log for that server for that hour, and there’s no entry for the request in it, or

  • you know the rough timestamp of the request, found non-empty log files for all servers for that hour, and there’s no matching request

Otherwise, it’s possible that the log entry was lost (e.g., if a log file was lost or clobbered, due to a bug or extended availability loss).

  • Did the HTTP response contain an x-server-name or x-request-id header?

    • Yes, there was a response with these headers. In this case, a Muskie instance definitely handled the request. There should be a log entry.

    • There was a response, but it did not contain these headers. In this case, the response very likely came from the load balancer and not Muskie. See [_finding_a_load_balancer_log_entry] to find more information about the request. This typically happens for one of two reasons:

      • Muskie took too long (usually more than two minutes) to handle the request. Note that even though the load balancer may have reported a 500-level error, the request may have completed successfully (or failed for some other reason) inside Muskie. There are two sub-cases:

        • Muskie did process the request, but it took longer than the load balancer timeout. This is often a sign of high latency at the metadata tier.

        • Muskie stopped processing the request. This would be a bug in Muskie. It often leads to file descriptor leaks and memory leaks, so it’s very serious. Examples: MANTA-3338, MANTA-2916, MANTA-2907.

      • Muskie sent an invalid HTTP response. (This is very uncommon. Example: MANTA-3489.)

    • There was no response, or the client timed out before receiving a response. It would be very unusual for the system to produce no response at all within two minutes of receiving a request, but it’s not uncommon for a client to give up before receiving a response.

    • I don’t know if there was a response.

In all of these cases, you can get more information about what happened by finding the corresponding load balancer log entry; see [_finding_a_load_balancer_log_entry].

Investigating elevated Electric-Moray latency

Investigating elevated Moray latency

Investigating elevated PostgreSQL latency