Red Hat Bugzilla – Bug 860018
Possible message loss if a cluster is partitioned
Last modified: 2013-02-22 13:18:16 EST
Description of problem:
If the cluster is partitioned, it is theoretically possible for messages sent by a client of an inquorate broker to be lost because of a mismatch between the cman and corosync status.
Version-Release number of selected component (if applicable): 0.18
How reproducible: Has never been observed; this is a theoretical bug.
Steps to Reproduce: Unknown
Actual results: message loss
Expected results: no message loss
Qpidd monitors cman for quorum changes and shuts the broker down if it becomes inquorate. This is intended to prevent message loss by forcing clients to fail over to a healthy broker and replay their unacknowledged messages.
However, there is a possible race condition between when qpidd checks the quorum status with cman and when it multicasts messages via corosync (CPG).
Each time a broker joins or leaves the cluster, the cluster is considered to be in a new configuration. Each configuration is identified by a sequence number called the ring-id. Although cman and corosync are dealing with the same cluster, they update their cluster status independently. As qpidd is currently coded, it is possible for it to see a cman status from an older, quorate configuration but to send corosync messages into a newer, inquorate configuration.
In order to be sure not to send messages to an inquorate cluster, qpidd needs to check before each mcast that the cman and corosync ring-ids are the same AND that cman reports the cluster as quorate. If not, qpidd needs to wait until the sequence numbers converge before mcasting anything.
The fix should be reasonably straightforward, but testing will probably be very difficult. I'm not sure how the problem could be reproduced.
Created attachment 621562
A "weak" reproducer - using the script test_bz860018.sh, I was able to trigger message loss only very rarely (once per many hours of running) and/or message duplication (twice over the same period).
The reproducer simply runs qpid-send and qpid-receive (with message loss and duplication checks enabled) against a broker where a network failure is emulated.
The network failure is emulated following https://access.redhat.com/knowledge/solutions/79523: all traffic is dropped on the Ethernet interface used by corosync+cman (note that the test must run on a machine with 2 NICs so that AMQP traffic keeps flowing).
The reproducer has two flaws:
1) It takes ages to detect and recover from a split-brain. Usually a node reboot is required to un-fence; the script mimics this, just without actual reboots.
2) Message loss or duplication occurs quite rarely, so the test needs to run for a long time to verify a possible fix.