If all nodes in a cluster crash, and persistence is enabled, the node with the most recent journal should be started first to ensure consistent state. But it is not easy for the user to determine which node this is.
This might be solved with a tool that identifies which node to start first, or perhaps a tool that restarts a failed cluster.
In practice, this problem does not frequently arise, since every node in the cluster must fail before it does.
This is not just about a total crash. If some nodes in a cluster crash and the remainder are shut down cleanly, then any of the nodes with clean journals can be the first member in the cluster.
The right solution is for the cluster to handle this automatically during startup: the user can start nodes in any order, and brokers with dirty journals wait for a broker with a clean journal to start.
A running broker marks its journal dirty, so the journal is left dirty if the broker dies unexpectedly. The journal is marked clean in two cases:
- the broker shuts down as part of an orderly cluster shutdown
- the broker becomes last-man-standing
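The marking rules above can be sketched as a small state machine. This is only an illustration: the `Journal` and `Broker` classes and their method names are hypothetical, the journal is an in-memory stand-in rather than a real persistent store, and re-marking the journal dirty when other members rejoin is an assumption not stated in this note.

```python
from enum import Enum


class JournalState(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"


class Journal:
    """Hypothetical in-memory stand-in for a persistent journal header."""

    def __init__(self):
        self.state = JournalState.CLEAN


class Broker:
    """Hypothetical broker that keeps its journal's clean/dirty flag up to date."""

    def __init__(self, journal):
        self.journal = journal

    def start(self):
        # Mark dirty while running, so an unexpected death leaves the flag dirty.
        self.journal.state = JournalState.DIRTY

    def orderly_cluster_shutdown(self):
        # Case 1: shutdown as part of an orderly cluster shutdown.
        self.journal.state = JournalState.CLEAN

    def on_membership_change(self, live_members):
        if live_members == 1:
            # Case 2: this broker is the last man standing.
            self.journal.state = JournalState.CLEAN
        else:
            # Assumption: with other members present the journal is dirty again.
            self.journal.state = JournalState.DIRTY
```

A crash is modelled simply by never calling `orderly_cluster_shutdown`: whatever flag was last written is what the next startup finds.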
In a partial crash plus shutdown, only the cleanly shut down nodes can be first-in-cluster. In a total crash where the cluster was reduced to one member that finally crashed, only that last member can be first-in-cluster.
In a crash where more than one member died without any of them becoming a clear last-man-standing, manual intervention is required.
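The three outcomes described here reduce to one rule: nodes with clean journals may go first, and if there are none, a human has to decide. A minimal sketch, assuming journal states are available as plain strings keyed by a node name (the function name and data layout are hypothetical):

```python
def first_in_cluster_candidates(journal_states):
    """Return the nodes allowed to start first.

    journal_states maps a node name to "clean" or "dirty".
    An empty result means every journal is dirty and manual
    intervention is required.
    """
    return sorted(node for node, state in journal_states.items() if state == "clean")


# Partial crash plus shutdown: only the cleanly shut down nodes qualify.
partial = first_in_cluster_candidates({"a": "dirty", "b": "clean", "c": "clean"})

# Total crash after shrinking to one member: only the last member is clean.
last_man = first_in_cluster_candidates({"a": "dirty", "b": "dirty", "c": "clean"})

# Several members died together: nobody is clean.
ambiguous = first_in_cluster_candidates({"a": "dirty", "b": "dirty"})
```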
To avoid the manual intervention case we could write a cluster sequence ID to the journal headers. On restart, if all members have a dirty store, the journal(s) with the highest sequence IDs are eligible to be first-in-cluster.
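As a rough illustration of the sequence ID idea, the selection among all-dirty journals could look like the following. The function name is hypothetical, and the sequence IDs are assumed to be integers already read from the journal headers.

```python
def eligible_by_sequence(dirty_sequence_ids):
    """Pick the dirty journals that may be first-in-cluster.

    dirty_sequence_ids maps a node name to the cluster sequence ID
    found in its journal header. Every node sharing the highest ID
    is eligible, since those journals are equally up to date.
    """
    if not dirty_sequence_ids:
        return []
    highest = max(dirty_sequence_ids.values())
    return sorted(node for node, seq in dirty_sequence_ids.items() if seq == highest)
```

For example, with sequence IDs `{"a": 41, "b": 42, "c": 42}` both `b` and `c` would be eligible, and `a` would have to wait for one of them.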
This requires the members to know what "all members" means. This could be:
- configure a list of members
- configure an expected count of members
- use a timeout (in case some members can't be started)
- have the user run a config tool to tell the cluster when all members are present
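Two of these options, an expected member count and a timeout, combine naturally. The sketch below is one possible shape for that check, not a description of any existing implementation; the function name and parameters are hypothetical, and times are taken to be seconds from a monotonic clock.

```python
import time


def all_members_present(seen, expected_count, started_at, timeout, now=None):
    """Decide whether to stop waiting for more members.

    seen           -- collection of member names observed so far
    expected_count -- configured number of members
    started_at     -- monotonic time at which waiting began
    timeout        -- seconds to wait before proceeding anyway,
                      in case some members cannot be started
    now            -- current monotonic time (defaults to time.monotonic())
    """
    if now is None:
        now = time.monotonic()
    if len(set(seen)) >= expected_count:
        return True
    return now - started_at >= timeout
```

Passing `now` explicitly keeps the decision deterministic, which makes the rule easy to exercise without real delays.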
On a clean startup, cluster members with clean journals should check that they all have the same journal sequence number for consistency and refuse to start if not.
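The clean-startup check amounts to requiring a single distinct sequence number across all clean journals. A minimal sketch, with a hypothetical function name and exception type:

```python
class InconsistentJournals(RuntimeError):
    """Raised when clean journals disagree on their sequence number."""


def check_clean_startup(clean_sequence_numbers):
    """Refuse to start if clean journals carry different sequence numbers.

    clean_sequence_numbers maps a node name to the journal sequence
    number of its clean journal. Returns the agreed number, or None
    if there are no clean journals to compare.
    """
    distinct = set(clean_sequence_numbers.values())
    if len(distinct) > 1:
        raise InconsistentJournals(
            "clean journals disagree: %r" % sorted(clean_sequence_numbers.items())
        )
    return distinct.pop() if distinct else None
```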
Addressed in commits up to 883999. See user description at
Remaining piece is manual recovery from a complete cluster failure: Bug 541426