Bug 487214 - upgrade node to 5.3, openais dies after trying to join 5.2 cluster
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Hardware: All Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Keywords: ZStream
Duplicates: 489596
Depends On:
Blocks: 489445 490307 509894
Reported: 2009-02-24 15:20 EST by Shane Bradley
Modified: 2016-04-26 11:00 EDT (History)

See Also:
Fixed In Version: openais-0.80.5-4.el5_4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 489445
Last Closed: 2009-09-02 07:29:26 EDT

Description Shane Bradley 2009-02-24 15:20:37 EST
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2009020410 Fedora/3.0.6-1.fc10 Firefox/3.0.6

A cluster node was upgraded to 5.3 as part of a rolling upgrade. The
node was rebooted after the upgrade and then tried to join the 5.2 cluster.

The 5.3 node joins the cluster and then leaves it again.

Running the join from the command line triggers an assertion failure:
$ /sbin/ccsd
$ cman_tool -d join

The last messages printed are:
[TOTEM] entering OPERATIONAL state.
[CMAN ] quorum regained, resuming activity
[CLM  ] got nodejoin message
[CLM  ] got nodejoin message
[CLM  ] got nodejoin message
aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed.

Because openais has died, everything that depends on it, such as fenced
and groupd, fails as well with "failed to communicate" or "cman_init" errors.

Reproducible: Always

Steps to Reproduce:
1. Upgrade a Node to 5.3
2. Reboot
3. Join Cluster of 5.2 nodes
Actual Results:  
openais dies and the node leaves the cluster.

Expected Results:  
Node should join the cluster and stay a member.

I have been able to recreate with a cluster of xen virtual machines.

If I run cman_tool -d join twice in a row, it sometimes succeeds.
$ cman_tool -d join
$ cman_tool -d join
Comment 3 Steven Dake 2009-03-04 03:46:34 EST
I did a fresh install of 3 nodes. The lowest node ID ran 5.3; the second and third highest node IDs ran 5.2. I started cman on the 5.2 nodes with service cman start, then started cman on the 5.3 node, with no segfault or crash.

Could you be more specific in the steps you take to reproduce the issue?

The part about the platform of the lowest node id is important since it indicates who is responsible for synchronization.

Comment 5 Steven Dake 2009-03-10 02:19:21 EDT
Well, I'm embarrassed to say this defect shouldn't have made it through engineering unit testing, but it still doesn't reproduce on my hardware. Thanks for access to your test cluster. Anyway, I have a patch for the problem.

Please clone this bug into a 5.3.z bugzilla immediately and we will release the fix in the upcoming 5.3.z release of openais. The 5.3.z release is pending, so there is some urgency here.

The root of the problem is that checkpoints of type GLOBALID are rejected in checkpoint sync, but not in section sync or refcount sync. The GLOBALID checkpoint is virtual, meaning it doesn't really consume a checkpoint position in the system. At the beginning of the sync algorithm, the GLOBALID message sets the global checkpoint ID, but the rest of the message is then ignored and no checkpoint is actually created. The same check must be added to refcount and section synchronization to be backward compatible.

The 5.3 code only sends this checkpoint sync at the start of synchronization, whereas the old sync code would actually create a checkpoint for the section. Then, when refcount sync tried to update the refcount of the "virtual checkpoint", it couldn't find it and asserted.
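
For illustration only, here is a minimal C sketch of the kind of check described above. It is not the actual openais patch: every name in it (sync_msg, is_global_id, sync_checkpoint, sync_refcount, the fixed-size table) is hypothetical and only models the idea that the refcount handler must skip the virtual GLOBALID message the same way checkpoint sync does, instead of asserting when the lookup returns NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHECKPOINTS 8
#define NAME_LEN 32

/* Hypothetical sync message; is_global_id marks the "virtual" checkpoint. */
struct sync_msg {
    char name[NAME_LEN];
    int is_global_id;
    unsigned int global_id;
    unsigned int refcount;
};

/* Hypothetical checkpoint table entry. */
struct checkpoint {
    char name[NAME_LEN];
    unsigned int refcount;
    int in_use;
};

static struct checkpoint table[MAX_CHECKPOINTS];
static unsigned int global_ckpt_id;

/* Look up a real checkpoint by name; returns NULL if none exists. */
static struct checkpoint *checkpoint_find(const char *name)
{
    for (size_t i = 0; i < MAX_CHECKPOINTS; i++) {
        if (table[i].in_use && strncmp(table[i].name, name, NAME_LEN) == 0)
            return &table[i];
    }
    return NULL;
}

/* Checkpoint sync: the virtual message only sets the global ID.
 * Returns 0 if skipped, 1 if a checkpoint exists afterwards, -1 if full. */
static int sync_checkpoint(const struct sync_msg *msg)
{
    assert(msg != NULL);
    if (msg->is_global_id) {
        global_ckpt_id = msg->global_id;
        return 0;               /* no checkpoint slot is consumed */
    }
    if (checkpoint_find(msg->name) != NULL)
        return 1;
    for (size_t i = 0; i < MAX_CHECKPOINTS; i++) {
        if (!table[i].in_use) {
            strncpy(table[i].name, msg->name, NAME_LEN - 1);
            table[i].name[NAME_LEN - 1] = '\0';
            table[i].refcount = 0;
            table[i].in_use = 1;
            return 1;
        }
    }
    return -1;
}

/* Refcount sync: apply the same virtual-checkpoint check before the
 * lookup, and report a missing checkpoint instead of asserting on it.
 * Returns 0 if skipped, 1 if updated, -1 if the checkpoint is unknown. */
static int sync_refcount(const struct sync_msg *msg)
{
    assert(msg != NULL);
    if (msg->is_global_id)
        return 0;
    struct checkpoint *ckpt = checkpoint_find(msg->name);
    if (ckpt == NULL)
        return -1;
    ckpt->refcount = msg->refcount;
    return 1;
}
```

In this sketch a refcount message for the virtual checkpoint is simply ignored, so a node that receives one from an older peer keeps running rather than aborting on a failed lookup.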

Thanks for your help
Comment 6 Steven Dake 2009-03-10 22:17:20 EDT
*** Bug 489596 has been marked as a duplicate of this bug. ***
Comment 12 Chris Ward 2009-07-03 14:25:22 EDT
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.
Comment 14 Nate Straz 2009-07-29 14:41:46 EDT
Verified rolling upgrades from 5.2 and 5.3 GA to 5.4 snapshot 4.
Comment 16 errata-xmlrpc 2009-09-02 07:29:26 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

