Bug 494393
Summary: First two nodes join 'simultaneously'; no node can reach the 'ready' state.

| Field | Value |
|---|---|
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | urgent |
| Version | 1.1 |
| Target Milestone | 1.3 |
| Hardware | All |
| OS | Linux |
| Reporter | Frantisek Reznicek <freznice> |
| Assignee | Alan Conway <aconway> |
| QA Contact | Frantisek Reznicek <freznice> |
| CC | aconway, esammons, gsim, iboverma |
| Doc Type | Bug Fix |
| Last Closed | 2010-10-14 15:58:42 UTC |
| Bug Depends On | 592999 |

Doc Text: Previously, it was possible for two brokers to join a cluster simultaneously. Consequently, none of the brokers was recognized as the first node, and both the qpidd service and clients stopped responding. With this update, one of the brokers always assumes the role of the first node, and both the qpidd service and clients now work as expected.
Description (Frantisek Reznicek, 2009-04-06 17:14:02 UTC)
This appears to be caused by multicast events being suppressed as all nodes are stuck in the JOINER state for some reason. Two of the four nodes appear to be treated as joining simultaneously, i.e. the first config change each of them receives contains both nodes:

```
2009-apr-06 13:25:27 debug 10.16.64.42:24961(INIT) config change: 10.16.64.42:24961 10.16.64.42:24963
2009-apr-06 13:25:27 debug 10.16.64.42:24963(INIT) config change: 10.16.64.42:24961 10.16.64.42:24963
```

This means that neither considers itself the first node: the cluster never reaches the READY state, all nodes remain stuck in the JOINER state, ignore update requests, and hold up all data events indefinitely. I have verified that a short delay between starting each node (e.g. 1 sec) avoids this problem, and consequently I am lowering the priority and targeting for 1.2.

*** Bug 509439 has been marked as a duplicate of this bug. ***

qpidd makes the incorrect assumption that the first CPG config change always contains a single member. It is possible for the first config change to contain multiple members if they join concurrently. If this happens, all members come up as "JOINER" with nobody taking on the role of first member, and the cluster hangs. We need an additional protocol on joining to handle the case of multiple members in the first config change, probably an extension of the existing request-update/offer-update protocol.

The reproducer described in bug 510504 is also quite good at reproducing this issue.

Created attachment 366783 [details]: bz494393 new reproducer

Retested (at the moment only briefly) and found that the issue is much less frequent; in fact, I was not able to get stuck (trigger the issue) on RHEL 5.4 with the latest qpidc and openais.
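The root-cause analysis above notes that every member sees the same first config change, yet none of them elects itself first. A minimal sketch of why a deterministic tie-break resolves the hang, assuming (purely for illustration; this is not qpid's actual r881423 protocol) that the member with the lowest ID takes the first-node role:

```shell
#!/bin/sh
# All nodes receive the same initial member list, so sorting it gives
# every node the same answer: the lowest ID becomes the first node.
# The IDs are taken from the debug log above; the lowest-ID election
# rule itself is an illustrative assumption, not qpid's real protocol.
members="10.16.64.42:24963
10.16.64.42:24961"
first=$(printf '%s\n' "$members" | sort | head -n 1)
echo "first node: $first"
```

With this rule both brokers in the log would agree that 10.16.64.42:24961 is the first node, so exactly one of them initializes the cluster even when both appear in the same config change.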
```
[root@mrg-qe-02 bz494393]# rpm -q openais
openais-0.80.6-8.el5
openais-0.80.6-8.el5
[root@mrg-qe-02 bz494393]# rpm -q qpidd
qpidd-0.5.752581-30.el5
```

There are two scripts:

- './run.sh' will launch one test run with 4 nodes
- './looper.sh' will keep calling ./run.sh until a stuck or failed run is found

I believe this might help in reproducing the issue.

This should be fixed by the changes in r881423, but since they are not in the latest qpid/openais it is hard to confirm.

Current testing shows that a persistent broker cluster has startup issues again; see the issue tracked as bug 592999.

592999 marked as blocker.

The issue has been fixed, verified in a long test run (a few hundred cluster restarts, qpidd min/max logging, various cluster widths) on RHEL 5.5 i386 / x86_64 using packages:

```
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
openais-devel-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-4.el5
ruby-qpid-0.7.946106-1.el5
```

-> VERIFIED

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, it was possible for two brokers to join a cluster simultaneously. Consequently, none of the brokers was recognized as the first node, and both the qpidd service and clients stopped responding.
With this update, one of the brokers always assumes the role of the first node, and both the qpidd service and clients now work as expected.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
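The reproducer attachment above describes './looper.sh' only in prose (it keeps calling ./run.sh until a stuck or failed run is found). A minimal sketch of such a looper, assuming, since the attachment itself is not reproduced here, that ./run.sh exits non-zero on a stuck or failed cluster:

```shell
#!/bin/sh
# Repeat the single-run reproducer until it reports a failure.
# Assumption: ./run.sh (from attachment 366783) exits non-zero when the
# cluster gets stuck; the loop also stops cleanly if run.sh is absent.
runs=0
while [ -x ./run.sh ] && ./run.sh; do
    runs=$((runs + 1))
    echo "run $runs passed"
done
echo "stopped after $runs successful runs"
```

Run from the directory containing the attachment's scripts; the counter shows how many clean cluster restarts happened before the hang was triggered.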