Bug 483807
| Summary: | resolve join state for store recover in cluster for joining nodes | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Carl Trieloff <cctrieloff> |
| Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Sarenik <jsarenik> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.1 | CC: | aconway, freznice, iboverma, jsarenik, lans.carstensen, lbrindle, tao |
| Target Milestone: | 1.2 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | see below | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-12-03 09:17:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 527551 | | |

Doc Text:

Messaging bug fix

C: When a node in a cluster failed, and was then brought back up, it was attempting to sync with both the store and the running cluster.
C: The node that was attempting to rejoin the running cluster failed.
F: Only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data and will synchronize with the master node in the cluster.
R: Rejoining a running cluster now operates as expected.

When a node in a cluster failed and was then brought back up, it attempted to restore using information from both the store and the running master node. This caused the rejoining node to fail. This has been corrected so that only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data and will synchronize with the master node in the cluster. Rejoining a running cluster now operates as expected.
Description (Carl Trieloff, 2009-02-03 18:04:13 UTC)
This can be worked around by identifying the node to start first, and removing the stores from the other nodes before restart.
In broker.cpp, the following block must not be called for joining nodes:

```cpp
if (store.get() != 0) {
    RecoveryManagerImpl recoverer(queues, exchanges, links, dtxManager,
                                  conf.stagingThreshold);
    store->recover(recoverer);
}
```
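A minimal sketch of how that guard might look, assuming a hypothetical `recoveryEnabled` flag that the cluster layer sets only on the first member; the names below are simplified stand-ins, not the actual qpid-cpp API:

```cpp
#include <memory>

// Illustrative sketch only: MessageStore, Broker and recoveryEnabled are
// simplified stand-ins, not the actual qpid-cpp types.
struct MessageStore {
    void recover(/* RecoveryManager& */) { /* replay journal into queues/exchanges */ }
};

struct Broker {
    std::unique_ptr<MessageStore> store;
    bool recoveryEnabled = true;   // the cluster layer would clear this for joining nodes

    void recoverStore() {
        // Only the first cluster member (or a stand-alone broker) recovers from
        // the local store; joining nodes skip recovery and sync from the cluster.
        if (store && recoveryEnabled)
            store->recover();
    }
};
```

This mirrors the change described below, where revision 740793 has the cluster set a recovery flag on the Broker for the first member only.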
In revision 740793, Cluster sets a recovery flag on Broker for the first member in the cluster. Recovery from the local store is disabled if the recovery flag is not set. Need store test case, tbd kim.

Changing priority to high; set target milestone to 1.1.2.

*** Bug 486991 has been marked as a duplicate of this bug. ***

The error described in Bug 486991 (marked as a dup of this one) is the result of BDB errors when trying to set up mandatory broker exchanges when they have already been restored. This happens on all cluster nodes which are not the first in the cluster and are restored from the persistence store. The work-around up until now has been to delete the store directory from all the nodes (or all the nodes except the first to be restarted) when there are messages to be recovered. A fix now modifies the startup sequence of the store so that when a node is not the first in a cluster to restart and has been restored, the restored data is discarded and the store files are "pushed down" into a bak folder (in case the order of cluster recovery is incorrect, and the store from other nodes can be restored); the node is then restarted without recovery.

QA: This bug is easy to reproduce:

1. Start a multi-node cluster.
2. Shut down any node in the cluster.
3. Restart that node. The broker start will fail with an "Exchange already exists: amq.direct (MessageStoreImpl.cpp:488)" message.
4. If all nodes are shut down, then all nodes after the first will fail with this error.

The built-in store python test test_Cluster_04_SingleClusterRemoveRestoreNodes tests this scenario. qpid r. 773004, store r. 3368.

Reproduced on RHEL5.3 i386. Related packages (mrg-devel repo): qpidd-cluster-0.5.752581-5.el5, qpidd-0.5.752581-5.el5, openais-0.80.3-22.el5_3.4. Waiting for new packages to verify.

Backported qpid r.773004 onto the git mrg_1.1.x branch: http://git.et.redhat.com/git/qpid.git/?p=qpid.git;a=commitdiff;h=441c88204cb0135564669d7b004d62a1bc03828a

Verified on qpidd-0.5.752581-28.el5, both i386 and x86_64.

Included in store backport for 1.2. I forgot to mention rhm-0.5.3206-14.el5.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Cluster joining nodes now recover correctly by preserving (instead of replicating) any stored data they already had prior to rejoining (483807)

Modified the release note to the following: Only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data (the store files will be pushed down into a bak directory) and will instead synchronize with the master node in the cluster. (483807)

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:

```diff
@@ -1 +1 @@
-Cluster joining nodes now recover correctly by preserving (instead of replicating) any stored data they already had prior to rejoining (483807)
+Only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data (the store files will be pushed down into a bak directory) and will instead synchronize with the master node in the cluster. (483807)
```

*** Bug 539287 has been marked as a duplicate of this bug. ***
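As a rough illustration of the "pushed down into a bak folder" step described in the fix comment above, a sketch along the following lines could move an existing store aside so that a joining node starts with an empty store while the old files stay recoverable if the cluster was restarted in the wrong order. The function name, use of std::filesystem, and directory layout are assumptions for illustration, not the actual store implementation:

```cpp
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Illustrative sketch: move the contents of an existing store directory into a
// "bak" subdirectory so a joining node starts with an empty store, while the old
// files remain recoverable if the cluster recovery order turns out to be wrong.
void pushDownStore(const fs::path& storeDir) {
    if (!fs::exists(storeDir)) return;                    // nothing to preserve
    const fs::path bakDir = storeDir / "bak";
    fs::create_directories(bakDir);

    // Snapshot the entries first, then move them, so the directory is not
    // modified while it is being iterated.
    std::vector<fs::path> entries;
    for (const auto& entry : fs::directory_iterator(storeDir))
        if (entry.path() != bakDir)
            entries.push_back(entry.path());

    for (const auto& p : entries)
        fs::rename(p, bakDir / p.filename());
}
```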
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:

```diff
@@ -1 +1,8 @@
-Only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data (the store files will be pushed down into a bak directory) and will instead synchronize with the master node in the cluster. (483807)
+Messaging bug fix
+
+C: When a node in a cluster failed, and was then brought back up, it was attempting to sync with both the store, and the running cluster
+C: The node that attempting to rejoin the running cluster failed
+F: Only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data and will synchronize with the master node in the cluster.
+R: Rejoining a running cluster now operates as expected.
+
+When a node in a cluster failed, and was then brought back up, it was attempting to restore using information from both the store, and the running master node. This resulted in the node that was attempting to rejoin failing. This has been corrected, so that only the first node started in a cluster will restore from the store. All subsequent nodes added to the cluster will discard the store data and will synchronize with the master node in the cluster. Rejoining a running cluster now operates as expected.
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html