Bug 496985 - OpenAIS is the likely candidate for cluster mirror regression
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Keywords: Regression
Reported: 2009-04-21 18:25 EDT by Jonathan Earl Brassow
Modified: 2016-04-26 10:45 EDT (History)
Fixed In Version: openais-0.80.6-1.el5_4
Doc Type: Bug Fix
Last Closed: 2009-09-02 07:30:21 EDT
Description Jonathan Earl Brassow 2009-04-21 18:25:28 EDT
Cluster mirror code works fine under RHEL 5.3, but the same code fails on RHEL 5.4.  I get various OpenAIS library errors.  I will try to get better info soon, but it is easy enough to reproduce: simply run 'mirror_sanity -e segmented_pvmove' (one of QA's tests) and watch things fail.
Comment 1 Jonathan Earl Brassow 2009-04-21 18:26:09 EDT
I should have a cluster available for your use tomorrow.
Comment 2 Corey Marthaler 2009-05-07 10:40:54 EDT
I appear to be seeing a similar issue in 5.4 when running linear -> mirror -> linear convert loops that used to work in 5.3. I don't see any lib errors, but something is definitely wrong now when running this test case.


SCENARIO - [looping_mirror_to_linear_converts]
Create a mirror and then down and up convert it 20 times
grant-01: lvcreate -m 1 -n mirror_2_linear -L 1G --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
1: down convert to linear on grant-03; up convert on mirror grant-01          
2: down convert to linear on grant-02; up convert on mirror grant-01          
3: down convert to linear on grant-01; up convert on mirror grant-03          
  Error locking on node grant-02: Command timed out                           
  Error locking on node grant-01: Command timed out                           
  Problem reactivating mirror_2_linear                                        
up converting the mirror failed                                   

Another time:
============================================================
SCENARIO - [looping_mirror_to_linear_converts]
Create a mirror and then down and up convert it 20 times
grant-03: lvcreate -m 1 -n mirror_2_linear -L 1G --nosync mirror_sanity
  WARNING: New mirror won't be synchronised. Don't read what you didn't write!
1: down convert to linear on grant-02; up convert on mirror grant-03
2: down convert to linear on grant-03; up convert on mirror grant-03
3: down convert to linear on grant-01; up convert on mirror grant-03
4: down convert to linear on grant-03; up convert on mirror grant-02
5: down convert to linear on grant-01; up convert on mirror grant-03
6: down convert to linear on grant-01; up convert on mirror grant-01
7: down convert to linear on grant-02; up convert on mirror grant-01
8: down convert to linear on grant-03; up convert on mirror grant-03
9: down convert to linear on grant-03; up convert on mirror grant-03
10: down convert to linear on grant-03; up convert on mirror grant-02
11: down convert to linear on grant-01; up convert on mirror grant-02
12: down convert to linear on grant-03; up convert on mirror grant-03
13: down convert to linear on grant-03; up convert on mirror grant-01
Could not connect to grant-01:5008, 111: Connection refused
up converting the mirror failed

May  6 17:36:58 grant-01 qarshd[29864]: Running cmdline: lvconvert -m 0 /dev/mirror_sanity/mirror_2_linear
May  6 17:36:58 grant-01 lvm[11157]: No longer monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:36:58 grant-01 xinetd[9968]: EXIT: qarsh status=0 pid=29864 duration=0(sec)
May  6 17:36:59 grant-01 [11157]: Monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:37:29 grant-01 lvm[11157]: No longer monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:40:35 grant-01 syslogd 1.4.1: restart.


May  6 17:36:15 grant-02 qarshd[29991]: Running cmdline: lvconvert -m 1 /dev/mirror_sanity/mirror_2_linear
May  6 17:36:16 grant-02 [11335]: Monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:36:47 grant-02 lvm[11335]: No longer monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:36:57 grant-02 openais[10888]: [TOTEM] The token was lost in the OPERATIONAL state.
May  6 17:36:57 grant-02 openais[10888]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May  6 17:36:57 grant-02 openais[10888]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May  6 17:36:57 grant-02 openais[10888]: [TOTEM] entering GATHER state from 2.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] entering GATHER state from 11.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] Creating commit token because I am the rep.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] Saving state aru e0e93 high seq received e0e93
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] Storing new sequence id for ring 160
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] entering COMMIT state.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] entering RECOVERY state.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] position [0] member 10.15.89.152:
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] previous ring seq 348 rep 10.15.89.151
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] aru e0e93 high delivered e0e93 received flag 1
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] position [1] member 10.15.89.153:
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] previous ring seq 348 rep 10.15.89.151
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] aru e0e93 high delivered e0e93 received flag 1
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] Did not need to originate any messages in recovery.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] Sending initial ORF token
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] CLM CONFIGURATION CHANGE
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] New Configuration:
May  6 17:37:01 grant-02 kernel: dlm: closing connection to node 1
May  6 17:37:01 grant-02 openais[10888]: [CLM  ]        r(0) ip(10.15.89.152)
May  6 17:37:01 grant-02 openais[10888]: [CLM  ]        r(0) ip(10.15.89.153)
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] Members Left:
May  6 17:37:01 grant-02 openais[10888]: [CLM  ]        r(0) ip(10.15.89.151)
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] Members Joined:
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] CLM CONFIGURATION CHANGE
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] New Configuration:
May  6 17:37:01 grant-02 openais[10888]: [CLM  ]        r(0) ip(10.15.89.152)
May  6 17:37:01 grant-02 openais[10888]: [CLM  ]        r(0) ip(10.15.89.153)
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] Members Left:
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] Members Joined:
May  6 17:37:01 grant-02 openais[10888]: [SYNC ] This node is within the primary component and will provide service.
May  6 17:37:01 grant-02 openais[10888]: [TOTEM] entering OPERATIONAL state.
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] got nodejoin message 10.15.89.152
May  6 17:37:01 grant-02 openais[10888]: [CLM  ] got nodejoin message 10.15.89.153
May  6 17:37:01 grant-02 openais[10888]: [CPG  ] got joinlist message from node 3
May  6 17:37:01 grant-02 openais[10888]: [CPG  ] got joinlist message from node 2
May  6 17:37:02 grant-02 [11335]: Monitoring mirror device mirror_sanity-mirror_2_linear for events
May  6 17:37:02 grant-02 lvm[11335]: mirror_sanity-mirror_2_linear is now in-sync
May  6 17:37:31 grant-02 fenced[10909]: grant-01 not a cluster member after 30 sec post_fail_delay
May  6 17:37:31 grant-02 fenced[10909]: fencing node "grant-01"
May  6 17:37:49 grant-02 fenced[10909]: fence "grant-01" success
May  6 17:37:50 grant-02 xinetd[9999]: EXIT: qarsh status=0 pid=29991 duration=95(sec)
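
For anyone trying to reproduce this without the QA harness, the down/up convert loop in the scenario above can be roughly sketched as below. This is only a sketch under stated assumptions: the function name `plan_convert_loop` is hypothetical, random node selection is a guess at how the test picks nodes, and the two `lvconvert` command lines are the ones shown in the qarshd log entries above. It only builds the list of steps and does not run anything.

```python
import random

# Device path as it appears in the qarshd log lines above.
DEVICE = "/dev/mirror_sanity/mirror_2_linear"


def plan_convert_loop(nodes, iterations=20, seed=None):
    """Return a list of (iteration, node, argv) steps for the convert loop.

    Each iteration down-converts the LV to linear (-m 0) on one node and
    then up-converts it back to a mirror (-m 1) on another, possibly the
    same, node. Picking nodes at random is an assumption, not something
    the bug report states.
    """
    rng = random.Random(seed)
    steps = []
    for i in range(1, iterations + 1):
        down_node = rng.choice(nodes)
        up_node = rng.choice(nodes)
        steps.append((i, down_node, ["lvconvert", "-m", "0", DEVICE]))
        steps.append((i, up_node, ["lvconvert", "-m", "1", DEVICE]))
    return steps
```

Calling `plan_convert_loop(["grant-01", "grant-02", "grant-03"])` yields 40 steps (20 down converts, 20 up converts); how each step is actually dispatched to its node (qarsh, ssh, or something else) is left to the caller.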
Comment 3 Corey Marthaler 2009-05-07 11:05:10 EDT
Tried this again without device-mapper-multipath involved after a fresh reboot and started seeing errors right away.
 
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] CLM CONFIGURATION CHANGE 
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] New Configuration: 
May  7 10:01:59 grant-01 clogd[10634]: [5YdVXXNr]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.151)  
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.152)
May  7 10:01:59 grant-01 clogd[10634]: [5YdVXXNr]  Retry #2 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.153)  
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] Members Left: 
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] Members Joined:
May  7 10:01:59 grant-01 clogd[10634]: [5YdVXXNr]  Retry #3 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] CLM CONFIGURATION CHANGE 
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] New Configuration: 
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.151)
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.152)
May  7 10:01:59 grant-01 openais[10563]: [CLM  ]        r(0) ip(10.15.89.153)
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] Members Left:
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] Members Joined:
May  7 10:01:59 grant-01 openais[10563]: [SYNC ] This node is within the primary component and will provide service.
May  7 10:01:59 grant-01 openais[10563]: [TOTEM] entering OPERATIONAL state.
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] got nodejoin message 10.15.89.151
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] got nodejoin message 10.15.89.152
May  7 10:01:59 grant-01 openais[10563]: [CLM  ] got nodejoin message 10.15.89.153
May  7 10:01:59 grant-01 openais[10563]: [CPG  ] got joinlist message from node 1
May  7 10:01:59 grant-01 openais[10563]: [CPG  ] got joinlist message from node 2
May  7 10:01:59 grant-01 openais[10563]: [CPG  ] got joinlist message from node 3
May  7 10:01:59 grant-01 openais[10563]: [TOTEM] Retransmit List: cb
May  7 10:02:08 grant-01 openais[10563]: [TOTEM] FAILED TO RECEIVE
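
To compare runs across nodes, a small script along these lines could tally the clogd retries and the TOTEM warnings from /var/log/messages. It is a sketch, not part of any cluster tooling: the function name `summarize` is hypothetical, and the patterns are simply copied from the message text in the logs in this report, so they may miss other wordings.

```python
import re
from collections import Counter

# Patterns taken from the message text seen in the logs in this report.
RETRY_RE = re.compile(r"clogd\[\d+\]: .*Retry #(\d+) of (\w+): (\w+)")
TOTEM_FLAGS = (
    "FAILED TO RECEIVE",
    "Retransmit List",
    "The token was lost in the OPERATIONAL state",
)


def summarize(lines):
    """Count clogd retries per (call, error) and flagged TOTEM lines."""
    retries = Counter()
    totem = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            # Key on the library call and the error code it returned.
            retries[(match.group(2), match.group(3))] += 1
            continue
        if "[TOTEM]" in line:
            for flag in TOTEM_FLAGS:
                if flag in line:
                    totem[flag] += 1
    return retries, totem
```

Typical use would be `summarize(open("/var/log/messages"))` on each node, then comparing the counters between a 5.3 and a 5.4 run.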
Comment 6 errata-xmlrpc 2009-09-02 07:30:21 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html
Comment 7 Corey Marthaler 2009-09-24 17:34:04 EDT
Was this ever actually fixed, or just added to the 5.4 errata? I still appear to be seeing the same messages with the same test case.
