Red Hat Bugzilla – Bug 487214
upgrade node to 5.3, openais dies after trying to join 5.2 cluster
Last modified: 2016-04-26 11:00:38 EDT
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.6) Gecko/2009020410 Fedora/3.0.6-1.fc10 Firefox/3.0.6
A cluster node was upgraded to 5.3 as part of a rolling upgrade. The node
was rebooted after the upgrade and then tried to join the 5.2 cluster.
The 5.3 node joins the cluster and then leaves it again right afterwards.
Running the join from the command line shows that an assertion fails:
$ cman_tool -d join
The last messages are:
[TOTEM] entering OPERATIONAL state.
[CMAN ] quorum regained, resuming activity
[CLM ] got nodejoin message 192.168.1.163
[CLM ] got nodejoin message 192.168.1.164
[CLM ] got nodejoin message 192.168.1.165
aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed.
Since openais has died, everything that depends on it fails as well, such as
fenced and groupd. They report failed-to-communicate or "cman_init" errors.
Steps to Reproduce:
1. Upgrade a node to 5.3
2. Reboot the node
3. Join the cluster of 5.2 nodes
Actual Results: openais dies and the node leaves the cluster.
Expected Results: The node should join the cluster and stay a member.
I have been able to recreate with a cluster of xen virtual machines.
If I run cman_tool -d join twice in a row, it sometimes succeeds.
$ cman_tool -d join
$ cman_tool -d join
I took a fresh install of 3 nodes. The lowest node ID was 5.3; the second and third highest node IDs were 5.2. I started cman on the 5.2 nodes with service cman start, then started cman on the 5.3 node, with no segfault or crash.
Could you be more specific in the steps you take to reproduce the issue?
The part about the platform of the lowest node id is important since it indicates who is responsible for synchronization.
Well I'm embarrassed to say this defect shouldn't have made it through engineering unit testing but it still doesn't reproduce on my hardware. Thanks for access to your test cluster. Anyway, I have a patch for the problem.
Immediately clone a 5.3.z bugzilla and we will release the fix in the upcoming 5.3.z release of openais. The 5.3.z release is pending, so there is some urgency here.
The root of the problem is that checkpoints of type GLOBALID are rejected in checkpoint sync, but not in section sync or refcount sync. The globalid checkpoint is virtual, meaning it doesn't really consume a checkpoint position in the system. At the beginning of the sync algorithm, the globalid message sets the global checkpoint ID, but the rest of the message is then ignored and no checkpoint is actually created. The same check must be added to refcount and section synchronization to be backward compatible.
The 5.3 code only sends this checkpoint sync at the start of synchronization, whereas the old sync code would actually create a checkpoint for the section. Then, when refcount sync went to update the refcount of the "virtual checkpoint", it couldn't find it and asserted.
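To illustrate the check described above, here is a minimal, self-contained sketch in C. It is not the openais patch: every name in it (GLOBALID_NAME, sync_checkpoint, sync_refcount, the fixed-size table) is a hypothetical stand-in, and the real handlers in ckpt.c work on totem messages rather than plain strings. It only shows the idea that the refcount handler has to skip the virtual globalid entry in the same way the checkpoint handler does, instead of looking it up and asserting on a NULL result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names below are hypothetical stand-ins, not the openais source. */
#define GLOBALID_NAME   "GLOBALID"  /* assumed marker for the virtual checkpoint */
#define MAX_CHECKPOINTS 8

enum sync_result { SYNC_APPLIED, SYNC_SKIPPED_VIRTUAL, SYNC_NOT_FOUND, SYNC_TABLE_FULL };

struct checkpoint {
    char name[32];
    unsigned int refcount;
    int in_use;
};

static struct checkpoint table[MAX_CHECKPOINTS];
static unsigned int global_ckpt_id;

unsigned int current_global_id(void) { return global_ckpt_id; }

int is_globalid(const char *name) { return strcmp(name, GLOBALID_NAME) == 0; }

struct checkpoint *find_checkpoint(const char *name)
{
    for (int i = 0; i < MAX_CHECKPOINTS; i++) {
        if (table[i].in_use && strcmp(table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}

/* Checkpoint sync: the globalid entry only sets the global ID and
 * never occupies a slot in the table. */
enum sync_result sync_checkpoint(const char *name, unsigned int ckpt_id)
{
    assert(name != NULL);
    if (is_globalid(name)) {
        global_ckpt_id = ckpt_id;
        return SYNC_SKIPPED_VIRTUAL;
    }
    if (find_checkpoint(name) != NULL)
        return SYNC_APPLIED;            /* already known, nothing to create */
    for (int i = 0; i < MAX_CHECKPOINTS; i++) {
        if (!table[i].in_use) {
            strncpy(table[i].name, name, sizeof table[i].name - 1);
            table[i].name[sizeof table[i].name - 1] = '\0';
            table[i].refcount = 0;
            table[i].in_use = 1;
            return SYNC_APPLIED;
        }
    }
    return SYNC_TABLE_FULL;
}

/* Refcount sync: the same globalid check comes first, so the virtual
 * entry is never looked up. Without it, the lookup returns NULL, which
 * is the condition the assertion in this report trips on. */
enum sync_result sync_refcount(const char *name, unsigned int refcount)
{
    assert(name != NULL);
    if (is_globalid(name))
        return SYNC_SKIPPED_VIRTUAL;
    struct checkpoint *c = find_checkpoint(name);
    if (c == NULL)
        return SYNC_NOT_FOUND;          /* this sketch reports it instead of asserting */
    c->refcount = refcount;
    return SYNC_APPLIED;
}
```

In this sketch, calling sync_checkpoint(GLOBALID_NAME, 42) followed by sync_refcount(GLOBALID_NAME, 3) returns SYNC_SKIPPED_VIRTUAL both times and leaves the table empty, while an ordinary checkpoint name goes through the normal create and update path. Section sync would need the same guard.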
Thanks for your help
*** Bug 489596 has been marked as a duplicate of this bug. ***
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
Verified rolling upgrades from 5.2 and 5.3 GA to 5.4 snapshot 4.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.