Failure Conditions
Components of a chainweb-node should check during initialization for failure conditions that would prevent a node from performing its task. If a node detects such a condition it should,
- try to fix the issue, and if that isn't possible,
- emit an error log that describes the problem and, where possible, provides hints on how to resolve the issue, and
- throw an exception, which will cause the node to terminate.
The exception message will show up on stderr of the node. On systems with systemd the exception message will be recorded in the journal. On the testnet nodes, journalctl -b -u chainweb can be used to view the journal.
Also, any failure in the logging system and any log messages that are emitted before the logging system is initialized are logged to stderr and show up in the journal.
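The initialization steps above (try to fix, log an error with a hint, throw) can be sketched roughly as follows. This is a minimal illustration and not code from chainweb-node: the names StartupFailure and startupCheck are hypothetical, and writing to stderr stands in for the node's logging system.

```haskell
module Main (main) where

import Control.Exception (Exception, throwIO, try)
import Control.Monad (unless)
import System.IO (hPutStrLn, stderr)

-- | Hypothetical exception type for failures detected during startup.
newtype StartupFailure = StartupFailure String
    deriving (Show)

instance Exception StartupFailure

-- | Run a startup check. If the check fails, attempt a fix. If the fix
-- also fails, log an error with a hint and throw an exception.
startupCheck
    :: String   -- ^ description of the failure condition
    -> String   -- ^ hint for the operator
    -> IO Bool  -- ^ check: True if everything is fine
    -> IO Bool  -- ^ fix: True if the problem could be repaired
    -> IO ()
startupCheck descr hint check fix = do
    ok <- check
    unless ok $ do
        fixed <- fix
        unless fixed $ do
            -- Stand-in for the node's logger (assumption).
            hPutStrLn stderr ("[Error] " <> descr <> ". Hint: " <> hint)
            throwIO (StartupFailure descr)

main :: IO ()
main = do
    -- A condition that fails the check but can be fixed: no exception.
    startupCheck "log directory is missing" "create the directory"
        (return False) (return True)
    -- A condition that cannot be fixed. In a node the exception would
    -- terminate the process; here it is caught for demonstration only.
    r <- try $ startupCheck "database can't be opened" "check file permissions"
        (return False) (return False)
    case r of
        Left (StartupFailure msg) -> putStrLn ("startup aborted: " <> msg)
        Right () -> putStrLn "startup checks passed"
```

In a real node the exception would not be caught in main, so that the runtime prints it to stderr and the process exits.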
Once the node is initialized and API servers and the P2P clients are started, components should try hard to avoid failing. Components should
- catch all synchronous exceptions,
- emit an error log or warning log that describes the problem, including possible actions that must be taken to address the issue,
- restart the component, subject to backoff or throttling logic as needed.
Most components do this by being wrapped in runForever or runForeverThrottled from Chainweb.Utils.
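To illustrate the catch, log, and restart pattern, here is a minimal sketch of a supervising loop with exponential backoff. It is not the implementation of runForever or runForeverThrottled; the name superviseForever, the delay values, and the use of stderr for logging are assumptions made for this example.

```haskell
module Main (main) where

import Control.Concurrent (threadDelay)
import Control.Exception
    (SomeAsyncException(..), SomeException, fromException, throwIO, try)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)
import System.IO (hPutStrLn, stderr)
import System.Timeout (timeout)

-- | Hypothetical supervisor: run a component and restart it whenever it
-- returns or throws a synchronous exception. The delay between restarts
-- doubles up to a cap. Asynchronous exceptions are rethrown.
superviseForever :: String -> IO () -> IO a
superviseForever name action = go 1000
  where
    maxDelay = 1000000 :: Int  -- cap the backoff at one second
    logWarn msg = hPutStrLn stderr ("[Warn] " <> name <> ": " <> msg)
    go delay = do
        r <- try action
        case r of
            Right () -> logWarn "exited unexpectedly, restarting"
            Left e
                | Just (SomeAsyncException _) <- fromException e -> throwIO e
                | otherwise ->
                    logWarn ("failed: " <> show (e :: SomeException) <> ", restarting")
        threadDelay delay
        go (min maxDelay (2 * delay))

main :: IO ()
main = do
    attempts <- newIORef (0 :: Int)
    let component = do
            n <- atomicModifyIORef' attempts (\k -> (k + 1, k + 1))
            ioError (userError ("attempt " <> show n <> " failed"))
    -- The timeout is delivered as an asynchronous exception, so it
    -- escapes the supervisor and ends the demonstration.
    _ <- timeout 50000 (superviseForever "demo" component :: IO ())
    n <- readIORef attempts
    putStrLn ("component was started " <> show n <> " times")
```

The number of restarts printed at the end depends on scheduling, so no fixed output is given.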
Components must not catch asynchronous exceptions that don't originate from the component itself. The functions catchSynchronous and catchAllSynchronous (and their variants) from Chainweb.Utils can be used to catch synchronous exceptions while letting asynchronous exceptions propagate.
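The distinction can be sketched with plain base functions. The helper catchSyncOnly below is a hypothetical stand-in, not the definition of catchSynchronous or catchAllSynchronous from Chainweb.Utils. Note that it classifies exceptions by type (whether they are wrapped in SomeAsyncException), which is an assumption about how such a helper might work.

```haskell
module Main (main) where

import Control.Exception
    ( AsyncException(UserInterrupt), ErrorCall(..), SomeAsyncException(..)
    , SomeException, catch, fromException, throwIO, try )

-- | Hypothetical helper: run the handler for synchronous exceptions
-- only. Exceptions wrapped in SomeAsyncException are rethrown.
catchSyncOnly :: IO a -> (SomeException -> IO a) -> IO a
catchSyncOnly action handler = action `catch` \e ->
    case fromException e of
        Just (SomeAsyncException _) -> throwIO e
        Nothing -> handler e

main :: IO ()
main = do
    -- A synchronous exception is handled.
    r1 <- catchSyncOnly
        (throwIO (ErrorCall "boom") >> return "no error")
        (\e -> return ("handled: " <> show e))
    putStrLn r1
    -- An exception of an asynchronous type passes through the handler.
    r2 <- try (catchSyncOnly (throwIO UserInterrupt) (\_ -> return ()))
    case r2 of
        Left UserInterrupt -> putStrLn "UserInterrupt escaped the handler"
        Left other -> putStrLn ("unexpected: " <> show other)
        Right () -> putStrLn "handler swallowed the exception"
```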
There are a few fatal conditions that a node can't recover from by itself. In those cases an asynchronous exception should be thrown that terminates the node.
One example of such a condition is when the node receives a KILL signal from the environment. Another example is when a kill-switch triggers.
Here is an example of how an asynchronous exception can be defined:
newtype KillSwitch = KillSwitch T.Text

instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> T.unpack t

instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException
When such an exception is thrown it will escape from the exception handlers used in runForever and terminate the chainweb node. When this happens the exception value is printed by the runtime to stderr.
The code should also write a meaningful Error log message before throwing such an exception.
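A self-contained sketch of logging and then throwing such an exception might look like this. The KillSwitch definition is the one shown above with its imports added. The helpers triggerKillSwitch and isAsync, the reason text, and logging to stderr are assumptions made for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Exception
    ( Exception(..), SomeAsyncException(..), SomeException
    , asyncExceptionFromException, asyncExceptionToException
    , throwIO, try )
import qualified Data.Text as T
import System.IO (hPutStrLn, stderr)

newtype KillSwitch = KillSwitch T.Text

instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> T.unpack t

instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException

-- | Hypothetical helper: write an error message first, then throw.
-- Writing to stderr stands in for the node's logger (assumption).
triggerKillSwitch :: T.Text -> IO a
triggerKillSwitch reason = do
    hPutStrLn stderr ("[Error] kill switch triggered: " <> T.unpack reason)
    throwIO (KillSwitch reason)

-- | True if the exception is wrapped in SomeAsyncException, which is
-- what handlers for synchronous exceptions use to let it pass.
isAsync :: SomeException -> Bool
isAsync e = case fromException e of
    Just (SomeAsyncException _) -> True
    Nothing -> False

main :: IO ()
main = do
    -- Caught here only to inspect it; a node would let it propagate.
    r <- try (triggerKillSwitch "example reason")
    case r of
        Left e -> putStrLn (show e <> " (asynchronous: " <> show (isAsync e) <> ")")
        Right () -> putStrLn "not reached"
```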
- Configuration:
  - parsing of configuration fails
  - validation of configuration fails
- Logging system:
  - Elasticsearch index can't be created
  - Log files can't be opened
- Databases:
  - RocksDb database can't be opened
  - SQLite database can't be opened
  - not enough disk space available
- Networking:
  - Certificate generation fails
  - Certificate or Key can't be read
  - Certificate is invalid (e.g. expired)
- Chain Resources:
  - pruning of block header database files (detects inconsistencies)
- BlockHeaderDb / Consensus:
  - Hashes of genesis headers don't match the expected hashes for the given chainweb version
  - Missing dependencies in BlockHeaderDb (in part checked by db pruning)
- Pact Service:
  - Hashes of genesis payloads don't match the expected hashes for the given chainweb version
- Mempool:
- P2P Networking:
  - No bootstrap nodes configured (should this be a failure?)
  - Synchronization with all bootstrap nodes fails
  - No network link available
  - DNS lookup not available (is this a failure? most peers are known by IP)
  - All HTTP connections fail with 502
- HTTP Server:
  - port can't be allocated
- Miner:

Fatal conditions that terminate the node:
- Ctrl-C / kill
- KillSwitch
- ReorgLimitExceeded