Bug 594480

Summary: All nodes fail during recovery with stack protector sigabrt
Product: Red Hat Enterprise Linux 5
Reporter: Shane Bradley <sbradley>
Component: openais
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 5.5
CC: cluster-maint, dejohnso, edamato, jkortus, jwest, tao
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort.
Story Points: ---
Clone Of:
Clones: 600043 (view as bug list)
Environment:
Last Closed: 2011-01-13 23:56:46 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 600043, 601085, 601086    
Attachments:
Description Flags
revision 2145 from whitetank branch none

Description Shane Bradley 2010-05-20 19:33:28 UTC
Description of problem:

The cluster went down. All nodes in the cluster experienced a buffer
overflow in the aisexec process because the retransmit list grew too large.

May 19 20:06:43 node1 openais[11034]: [TOTEM] Retransmit List: 2ad307  
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering GATHER state from 12. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Creating commit token because I am the rep. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Saving state aru 2a97ec high seq received 2ad7b8 
May 19 20:07:15 node1 openais[11034]: [TOTEM] Storing new sequence id for ring 2b88 
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering COMMIT state. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] entering RECOVERY state. 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [0] member 192.168.0.33: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2a97ec high delivered 2ad7b8 received flag 0 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [1] member 192.168.0.34: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [2] member 192.168.0.35: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [3] member 192.168.0.36: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: *** buffer overflow detected ***: aisexec terminated 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [4] member 192.168.0.37: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [5] member 192.168.0.38: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [6] member 192.168.0.39: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] position [7] member 192.168.0.40: 
May 19 20:07:15 node1 openais[11034]: [TOTEM] previous ring seq 11140 rep 192.168.0.33 
May 19 20:07:15 node1 openais[11034]: [TOTEM] aru 2ad7b8 high delivered 2ad7b8 received flag 1 
May 19 20:07:15 node1 openais[11034]: [TOTEM] copying all old ring messages from 2a97ed-2ad7b8. 
May 19 20:07:16 node1 fenced[11054]: cluster is down, exiting
May 19 20:07:16 node1 dlm_controld[11060]: cluster is down, exiting
May 19 20:07:16 node1 gfs_controld[11066]: cluster is down, exiting
May 19 20:07:16 node1 kernel: dlm: connecting to 6
May 19 20:07:16 node1 kernel: dlm: closing connection to node 7
May 19 20:07:16 node1 kernel: dlm: closing connection to node 8
May 19 20:07:19 node1 kernel: dlm: closing connection to node 6
May 19 20:07:19 node1 kernel: dlm: closing connection to node 5
May 19 20:07:19 node1 kernel: dlm: closing connection to node 4
May 19 20:07:19 node1 kernel: dlm: closing connection to node 3
May 19 20:07:19 node1 kernel: dlm: closing connection to node 2
May 19 20:07:19 node1 kernel: dlm: closing connection to node 1
May 19 20:07:44 node1 ccsd[11023]: Unable to connect to cluster infrastructure after 30 seconds. 
May 19 20:08:14 node1 ccsd[11023]: Unable to connect to cluster infrastructure after 60 seconds. 

Version-Release number of selected component (if applicable):
cman-2.0.115-1.el5_4.9-x86_64
openais-0.80.6-8.el5_4.6-x86_64 

How reproducible:
Every time the threshold of unprocessed/untransmitted packets is reached.

Steps to Reproduce:
1. Put high load on the NIC so that packets are not processed
2. Eventually openais hits the buffer overflow and aborts
  
Actual results:
All nodes in the cluster die, since openais dies on each of them.

Expected results:
The cluster (all nodes) should not die.

Additional info:

Comment 5 Steven Dake 2010-06-03 21:00:33 UTC
Based upon analysis of the log files in the attached bugzilla, node 2 records a FAILED TO RECEIVE.  This indicates node 1 was unable to receive multicast traffic for some period of time (~33 seconds; speculation: the network driver failed on this platform).  After 33 seconds (the token timeout), the network driver is reopened and everything works fine.

The logic of totem in FAILED TO RECV is incorrectly implemented.  In this condition, node 1 should have recorded the FAILED TO RECV condition, added itself to the failed list, formed a singleton ring, while the remaining nodes formed a new configuration.  Then node 1 should be fenced.

I am uncertain how this patch will affect RHCS in this condition, but the current behaviour at the totem level doesn't match protocol specifications (which leads to segfault).  I could only replicate it by hard-coding a failure to receive into openais.

Comment 6 Steven Dake 2010-06-03 21:23:50 UTC
Created attachment 421038 [details]
revision 2145 from whitetank branch

Comment 15 Douglas Silas 2011-01-11 23:16:16 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In high loss networks, all nodes in a cluster experienced a buffer overflow and aborted when a threshold of unprocessed/not transmitted packets was reached. With this update, even when a significant number of packets is unprocessed/not transmitted, all nodes in a cluster work as expected and do not abort.

Comment 17 errata-xmlrpc 2011-01-13 23:56:46 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0100.html