Problem definition
During our latest L2 safe head stall incident in Alfajores, we spent a long time figuring out what was going on with
the L1 derivation pipeline.
We did not catch the EOF error in the op-batcher, and the sequencer's op-node logs did not actively indicate
that the L2 batch queue was stuck.
This is because only the trace log level would have given some indication of the problem:
optimism/op-node/rollup/derive/batches.go
Lines 71 to 74 in 3d7ab07
At the info log level, the main indication of a batch queue stall is the absence of "Found next batch" logs:
optimism/op-node/rollup/derive/batch_queue.go
Lines 295 to 298 in 3d7ab07
```go
nextBatch.Batch.LogContext(bq.log).Info("Found next batch")
return nextBatch.Batch, nil
}
```
Proposed solution
Add metrics for the different batch queues within the derivation pipeline, so that
we can observe the growth of the remaining queue vs. the BatchQueue.batches queue:
optimism/op-node/rollup/derive/batch_queue.go
Lines 270 to 272 in 3d7ab07
During the incident we would have seen linear growth of "remaining" batches and a constant length of BatchQueue.batches,
while the opposite is expected during normal operation.
This could potentially be done in upstream Optimism, since this is not a Celo-specific problem.
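To illustrate the kind of signal such metrics would make visible, here is a minimal sketch of a stall check over sampled queue lengths. It is not op-node code: `QueueSample` and `LooksStalled` are hypothetical names introduced for illustration, and the rule (the remaining queue grows on every sample while the length of BatchQueue.batches stays unchanged) is an assumption derived from the incident description above. In practice the two lengths would more likely be exported as gauges and the rule expressed as an alert.

```go
package main

import "fmt"

// QueueSample is a hypothetical snapshot of the two queue lengths,
// taken at one point in time. It is not an op-node type.
type QueueSample struct {
	Remaining int // length of the "remaining" queue
	Batches   int // length of BatchQueue.batches
}

// LooksStalled reports whether every consecutive pair of samples shows
// the remaining queue strictly growing while the batches queue length
// does not change. Fewer than two samples never count as a stall.
func LooksStalled(samples []QueueSample) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur.Remaining <= prev.Remaining {
			return false // remaining queue is not growing
		}
		if cur.Batches != prev.Batches {
			return false // batches queue is still changing
		}
	}
	return true
}

func main() {
	// Pattern described for the incident: remaining grows, batches is constant.
	stalled := []QueueSample{{1, 4}, {2, 4}, {3, 4}, {4, 4}}
	// Illustrative healthy pattern: remaining is flat, batches varies.
	healthy := []QueueSample{{0, 3}, {0, 5}, {0, 4}}

	fmt.Println("stalled window:", LooksStalled(stalled))
	fmt.Println("healthy window:", LooksStalled(healthy))
}
```

The check is deliberately strict: a single sample in which the remaining queue shrinks or the batches queue changes clears the stall signal. A real alert would probably tolerate some noise and use a longer window.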