Description of problem:
I saw $summary on one node while I was running revolver. Four nodes out of a six node cluster were shot.

[TOTEM] entering GATHER state from 8.
[TOTEM] entering GATHER state from 11.
[TOTEM] Saving state aru 0 high seq received 0
[TOTEM] Storing new sequence id for ring e60
[TOTEM] entering COMMIT state.
[TOTEM] entering RECOVERY state.
[TOTEM] position [0] member 10.15.89.61:
[TOTEM] previous ring seq 3676 rep 10.15.89.61
[TOTEM] aru 331 high delivered 287 received flag 1
[TOTEM] position [1] member 10.15.89.63:
[TOTEM] previous ring seq 3676 rep 10.15.89.61
[TOTEM] aru 331 high delivered 287 received flag 1
[TOTEM] position [2] member 10.15.89.64:
[TOTEM] previous ring seq 3676 rep 10.15.89.61
[TOTEM] aru 331 high delivered 287 received flag 1
[TOTEM] position [3] member 10.15.89.91:
[TOTEM] previous ring seq 3660 rep 10.15.89.91
[TOTEM] aru 0 high delivered 0 received flag 1
[TOTEM] position [4] member 10.15.89.93:
[TOTEM] previous ring seq 3676 rep 10.15.89.61
[TOTEM] aru 331 high delivered 287 received flag 1
[TOTEM] position [5] member 10.15.89.94:
[TOTEM] previous ring seq 3676 rep 10.15.89.61
[TOTEM] aru 331 high delivered 287 received flag 1
[TOTEM] Did not need to originate any messages in recovery.
[CLM ] CLM CONFIGURATION CHANGE
[CLM ] New Configuration:
[CLM ] Members Left:
[CLM ] Members Joined:
[CLM ] CLM CONFIGURATION CHANGE
[CLM ] New Configuration:
[CLM ] r(0) ip(10.15.89.61)
[CLM ] r(0) ip(10.15.89.63)
[CLM ] r(0) ip(10.15.89.64)
[CLM ] r(0) ip(10.15.89.91)
[CLM ] r(0) ip(10.15.89.93)
[CLM ] r(0) ip(10.15.89.94)
[CLM ] Members Left:
[CLM ] Members Joined:
[CLM ] r(0) ip(10.15.89.61)
[CLM ] r(0) ip(10.15.89.63)
[CLM ] r(0) ip(10.15.89.64)
[CLM ] r(0) ip(10.15.89.91)
[CLM ] r(0) ip(10.15.89.93)
[CLM ] r(0) ip(10.15.89.94)
[SYNC ] This node is within the primary component and will provide service.
[TOTEM] entering OPERATIONAL state.
[CMAN ] quorum regained, resuming activity
[CMAN ] quorum lost, blocking activity
[TOTEM] Message continuation doesn't match previous frag e: 0 - a: 242
[TOTEM] Throwing away broken message: continuation 0, index 0

After this, aisexec was not running on the system. The cman init script failed trying to start cman.

Version-Release number of selected component (if applicable):
openais-0.80.3-21.el5
cman-2.0.97-1.el5

How reproducible:
Unknown
On other nodes I did see messages like this:

morph-03 openais[2707]: [CLM ] got nodejoin message 10.15.89.93
morph-03 openais[2707]: [CLM ] got nodejoin message 10.15.89.94
morph-03 openais[2707]: [CLM ] got nodejoin message 10.15.89.61
morph-03 openais[2707]: [CLM ] got nodejoin message 10.15.89.63
morph-03 openais[2707]: [CLM ] got nodejoin message 10.15.89.64
morph-03 openais[2707]: [EVT ] Can't find cluster node at r(0) ip(10.15.89.91)
morph-03 openais[2707]: [CPG ] got joinlist message from node 4
morph-03 openais[2707]: [CPG ] got joinlist message from node 6
morph-03 openais[2707]: [CPG ] got joinlist message from node 7
morph-03 openais[2707]: [CPG ] got joinlist message from node 2
Created attachment 324536
core dump from tank-01, gzipped

Here's the core dump from tank-01. It's an i386 core from aisexec from package openais-0.80.3-21.el5.
This is a dup of bug 261381.

*** This bug has been marked as a duplicate of bug 261381 ***