Bug 487214

Summary: upgrade node to 5.3, openais dies after trying to join 5.2 cluster
Product: Red Hat Enterprise Linux 5
Reporter: Shane Bradley <sbradley>
Component: openais
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Docs Contact:
Priority: urgent
Version: 5.3
CC: cbuissar, cluster-maint, cward, davdunc, edamato, mgoulish, schlegel, sghosh, shota.a, slords, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: openais-0.80.5-4.el5_4
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 489445 (view as bug list) Environment:
Last Closed: 2009-09-02 11:29:26 UTC
Bug Depends On:
Bug Blocks: 489445, 490307, 509894

Description Shane Bradley 2009-02-24 20:20:37 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.6) Gecko/2009020410 Fedora/3.0.6-1.fc10 Firefox/3.0.6

A cluster node was upgraded to 5.3 as part of a rolling upgrade. The node
was rebooted after the upgrade and then tried to join the 5.2 cluster.

The 5.3 node joins the cluster and then leaves it again right after joining.

Running the join from the command line shows an assertion failure:
$ /sbin/ccsd
$ cman_tool -d join

The last messages are:
[TOTEM] entering OPERATIONAL state.
[CMAN ] quorum regained, resuming activity
[CLM  ] got nodejoin message 192.168.1.163
[CLM  ] got nodejoin message 192.168.1.164
[CLM  ] got nodejoin message 192.168.1.165
aisexec: ckpt.c:3961: message_handler_req_exec_ckpt_sync_checkpoint_refcount: Assertion `checkpoint != ((void *)0)' failed.

Since openais has died, everything that depends on it also fails, such as
fenced and groupd. They report failed-to-communicate or "cman_init" errors.



Reproducible: Always

Steps to Reproduce:
1. Upgrade a node to 5.3
2. Reboot
3. Join a cluster of 5.2 nodes

Actual Results:  
openais dies and the node leaves the cluster.

Expected Results:  
Node should join the cluster and stay a member.

I have been able to recreate with a cluster of xen virtual machines.

Note:
If I run cman_tool -d join twice in a row, it sometimes succeeds.
$ cman_tool -d join
$ cman_tool -d join

Comment 3 Steven Dake 2009-03-04 08:46:34 UTC
I took a fresh install of 3 nodes. The lowest node ID was 5.3; the second and third highest node IDs were 5.2. I started cman on the 5.2 nodes with service cman start. Then I started cman on the 5.3 node, with no segfault or crash.

Could you be more specific in the steps you take to reproduce the issue?

The part about the platform of the lowest node id is important since it indicates who is responsible for synchronization.

Regards
-steve

Comment 5 Steven Dake 2009-03-10 06:19:21 UTC
Well, I'm embarrassed to say this defect shouldn't have made it through engineering unit testing, but it still doesn't reproduce on my hardware. Thanks for access to your test cluster. Anyway, I have a patch for the problem.

Please clone a 5.3.z bugzilla immediately and we will release the fix in the upcoming 5.3.z release of openais. The 5.3.z release is pending, so there is some urgency here.

The root of the problem is that checkpoints of type GLOBALID are rejected in checkpoint sync, but not in section sync or refcount sync. The GLOBALID checkpoint is virtual, meaning it doesn't really consume a checkpoint position in the system. At the beginning of the sync algorithm, the GLOBALID message sets the global checkpoint ID, and the rest of the message is then ignored without actually creating a checkpoint. The same check must be added to refcount and section synchronization to be backward compatible.

The 5.3 code only sends this checkpoint sync at the start of synchronization, whereas the old sync code would actually create a checkpoint for the section. Then, when refcount sync tried to update the refcount of the "virtual checkpoint", it couldn't find it and asserted.

Thanks for your help
Regards
-steve
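
The check described in comment 5 can be sketched in C roughly as follows. This is a minimal illustration, not the actual openais patch: the types, the enum values, the table lookup, and the function name sync_refcount are all hypothetical stand-ins. It only shows the idea of skipping the virtual GLOBALID checkpoint in the refcount handler before the lookup that would otherwise hit the assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for illustration only;
 * these are not the real openais types or names. */
enum ckpt_kind { CKPT_NORMAL, CKPT_GLOBALID };
enum sync_result { SYNC_APPLIED, SYNC_SKIPPED };

struct checkpoint {
    unsigned int id;
    unsigned int refcount;
};

/* Linear lookup by ID; returns NULL when no such checkpoint exists. */
static struct checkpoint *find_checkpoint(struct checkpoint *table,
                                          size_t count, unsigned int id)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].id == id) {
            return &table[i];
        }
    }
    return NULL;
}

/* Refcount sync handler (sketch). The GLOBALID checkpoint is virtual and
 * was never created locally, so it is skipped before the lookup. Without
 * this guard the lookup returns NULL and the assertion below fires. */
enum sync_result sync_refcount(struct checkpoint *table, size_t count,
                               enum ckpt_kind kind, unsigned int id,
                               unsigned int refcount)
{
    struct checkpoint *checkpoint;

    if (kind == CKPT_GLOBALID) {
        return SYNC_SKIPPED;
    }

    checkpoint = find_checkpoint(table, count, id);
    assert(checkpoint != NULL);
    checkpoint->refcount = refcount;
    return SYNC_APPLIED;
}
```

In this sketch a GLOBALID message leaves the table untouched and returns SYNC_SKIPPED, while a message for an existing checkpoint updates its refcount. Per the comment above, the same guard would presumably also be needed in the section sync handler.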

Comment 6 Steven Dake 2009-03-11 02:17:20 UTC
*** Bug 489596 has been marked as a duplicate of this bug. ***

Comment 12 Chris Ward 2009-07-03 18:25:22 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 14 Nate Straz 2009-07-29 18:41:46 UTC
Verified rolling upgrades from 5.2 and 5.3 GA to 5.4 snapshot 4.

Comment 16 errata-xmlrpc 2009-09-02 11:29:26 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html