Bug 594480 - All nodes fail during recovery with stack protector sigabrt
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Keywords: ZStream
Depends On:
Blocks: 600043 601085 601086
Reported: 2010-05-20 15:33 EDT by Shane Bradley
Modified: 2016-04-26 10:38 EDT
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort.
Story Points: ---
Clone Of:
: 600043 (view as bug list)
Environment:
Last Closed: 2011-01-13 18:56:46 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
revision 2145 from whitetank branch (1.94 KB, patch)
2010-06-03 17:23 EDT, Steven Dake


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0100 normal SHIPPED_LIVE openais bug fix update 2011-01-12 12:21:13 EST

Description Shane Bradley 2010-05-20 15:33:28 EDT
Description of problem:

The cluster went down. All nodes in the cluster experienced a buffer
overflow in the aisexec process because the retransmit list grew too large.

May 19 20:06:43 node1 openais[11034]: [TOTEM] Retransmit List: 2ad307  
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering GATHER state from 12. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Creating commit token because I am the rep. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Saving state aru 2a97ec high seq received 2ad7b8 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Storing new sequence id for ring 2b88 
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering COMMIT state. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering RECOVERY state. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [0] member 192.168.0.33: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2a97ec high delivered 2ad7b8 received flag 0 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [1] member 192.168.0.34: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [2] member 192.168.0.35: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [3] member 192.168.0.36: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: *** buffer overflow detected ***: aisexec terminated 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [4] member 192.168.0.37: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [5] member 192.168.0.38: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [6] member 192.168.0.39: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [7] member 192.168.0.40: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] copying all old ring messages from 2a97ed-2ad7b8. 
May 19 20:07:16 node1 fenced[11054]: cluster is down, exiting
May 19 20:07:16 node1 dlm_controld[11060]: cluster is down, exiting
May 19 20:07:16 node1 gfs_controld[11066]: cluster is down, exiting
May 19 20:07:16 node1 kernel: dlm: connecting to 6
May 19 20:07:16 node1 kernel: dlm: closing connection to node 7
May 19 20:07:16 node1 kernel: dlm: closing connection to node 8
May 19 20:07:19 node1 kernel: dlm: closing connection to node 6
May 19 20:07:19 node1 kernel: dlm: closing connection to node 5
May 19 20:07:19 node1 kernel: dlm: closing connection to node 4
May 19 20:07:19 node1 kernel: dlm: closing connection to node 3
May 19 20:07:19 node1 kernel: dlm: closing connection to node 2
May 19 20:07:19 node1 kernel: dlm: closing connection to node 1
May 19 20:07:44 node1 ccsd[11023]: Unable to connect to cluster infrastructure after 30 seconds. 
May 19 20:08:14 node1 ccsd[11023]: Unable to connect to cluster infrastructure after 60 seconds. 

Version-Release number of selected component (if applicable):
cman-2.0.115-1.el5_4.9-x86_64
openais-0.80.6-8.el5_4.6-x86_64 

How reproducible:
Every time the threshold of unprocessed/not transmitted packets is reached.

Steps to Reproduce:
1. Generate high load on the NIC so packets are not processed
2. Eventually openais will hit a buffer overflow
  
Actual results:
All nodes in the cluster die because openais dies.

Expected results:
The cluster (all nodes) should not die.

Additional info:
Comment 5 Steven Dake 2010-06-03 17:00:33 EDT
Based upon analysis of the log files in the attached bugzilla, node 2 records a FAILED TO RECEIVE.  This indicates node 1 was unable to receive multicast traffic for some period of time (~33 seconds; speculation: the network driver failed on this platform).  After 33 seconds (the token timeout), the network driver is reopened and everything works fine.

The totem logic for FAILED TO RECV is incorrectly implemented.  In this condition, node 1 should have recorded the FAILED TO RECV condition, added itself to the failed list, and formed a singleton ring, while the remaining nodes formed a new configuration.  Node 1 should then be fenced.

I am uncertain how this patch will affect RHCS in this condition, but the current behaviour at the totem level doesn't match protocol specifications (which leads to segfault).  I could only replicate it by hard-coding a failure to receive into openais.
Comment 6 Steven Dake 2010-06-03 17:23:50 EDT
Created attachment 421038 [details]
revision 2145 from whitetank branch
Comment 15 Douglas Silas 2011-01-11 18:16:16 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort.
Comment 17 errata-xmlrpc 2011-01-13 18:56:46 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html
