Bug 142844

Summary: -ENOBUFS when sending message
Product: [Retired] Red Hat Cluster Suite Reporter: michael conrad tadpol tilstra <mtilstra>
Component: dlm Assignee: David Teigland <teigland>
Status: CLOSED DUPLICATE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 4 CC: ccaulfie, cluster-maint
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-02-21 19:07:41 UTC Type: ---
Bug Depends On:    
Bug Blocks: 133240, 144795    
Attachments:
Description Flags
lockdump from scrollback buffer. none
Full dlm assert dump none

Description michael conrad tadpol tilstra 2004-12-14 17:50:43 UTC
Created attachment 108541
lockdump from scrollback buffer.

I got as much as I could.

Comment 1 michael conrad tadpol tilstra 2004-12-14 17:52:16 UTC
while grabbing that, dlm spit out a few more lines that might be useful:

dlm: Joga: remote_stage error -105 10012
dlm: Joga: remote_stage error -105 103be
dlm: Joga: remote_stage error -105 603e6
dlm: Joga: remote_stage error -105 502e8
dlm: Joga: remote_stage error -105 500ca
dlm: Joga: remote_stage error -105 60081
dlm: Joga: remote_stage error -105 5017e
dlm: Joga: remote_stage error -105 50213
dlm: Joga: remote_stage error -105 803a9
dlm: Joga: remote_stage error -105 301b3
dlm: Joga: remote_stage error -105 70034
dlm: Joga: remote_stage error -105 500e7
dlm: Joga: remote_stage error -105 301a7
dlm: Joga: remote_stage error -105 c02d7
dlm: Joga: remote_stage error -105 900b1
dlm: Joga: remote_stage error -105 90084
dlm: Joga: remote_stage error -105 60087
dlm: Joga: remote_stage error -105 700f1
dlm: Joga: remote_stage error -105 a03b0
dlm: Joga: remote_stage error -105 d0059
dlm: Joga: remote_stage error -105 e03e0
dlm: Joga: remote_stage error -105 8023e
dlm: Joga: remote_stage error -105 c036f
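
For reference, -105 is the negated Linux errno ENOBUFS ("No buffer space available"), which matches the bug summary. A minimal sketch for pulling the fields out of lines like the ones above follows. The function name, the regex, and the reading of "Joga" as the lockspace name and of the trailing hex value as an identifier are assumptions made for illustration; they are not taken from the dlm source.

```python
import re

# Linux errno number seen in this report (asm-generic numbering).
ERRNO_NAMES = {105: "ENOBUFS"}

# Matches lines such as "dlm: Joga: remote_stage error -105 10012".
# Group names are illustrative; "Joga" is assumed to be the lockspace name.
LINE_RE = re.compile(
    r"^dlm: (?P<lockspace>\S+): remote_stage error -(?P<errno>\d+) (?P<ident>[0-9a-f]+)$"
)

def parse_remote_stage(line):
    """Return (lockspace, errno name, trailing hex value as int), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    code = int(m.group("errno"))
    # Fall back to the bare number for errno values not in the table.
    name = ERRNO_NAMES.get(code, str(code))
    return m.group("lockspace"), name, int(m.group("ident"), 16)

if __name__ == "__main__":
    sample = [
        "dlm: Joga: remote_stage error -105 10012",
        "dlm: Joga: remote_stage error -105 103be",
    ]
    for line in sample:
        print(parse_remote_stage(line))
```

Running this over a full console log would show whether every remote_stage failure carries the same errno or whether other codes are mixed in.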


Comment 2 michael conrad tadpol tilstra 2004-12-15 14:39:36 UTC
Have gotten this twice now.
Just did a cvs update to grab pjc's check-in for bug #142853. The dlm code is the same, though.

Comment 3 michael conrad tadpol tilstra 2004-12-15 18:14:41 UTC
Created attachment 108640
Full dlm assert dump

Turned on screen logging, got full output this time.  Also included output from
other nodes.  Clocks are synced across all nodes.

Comment 4 David Teigland 2005-01-04 08:02:41 UTC
The assertion is preceded by a bunch of ENOBUFS errors in
remote_stage, which are the first failure and the root cause.

I'm adding Patrick to this one in case he has any thoughts on whether
the machine has simply exhausted its memory or if there's some other
possible reason for getting this error when we send a message.

Comment 5 David Teigland 2005-01-04 08:05:42 UTC
*** Bug 142874 has been marked as a duplicate of this bug. ***

Comment 6 Christine Caulfield 2005-01-04 10:33:29 UTC
The only other reason ENOBUFS would be returned would be if lowcomms
has been told to shut down and has disabled outgoing connections -
that doesn't look likely from the logs.

It's possible that sending has stalled for some reason and that is
what is causing the memory shortage but it's very difficult to tell.

Comment 7 David Teigland 2005-01-17 03:32:43 UTC
Ran "mu_loop 8 8" on 4 va nodes for 7 hours before va04 failed
with an ENOBUFS assertion.  It was due to cman shutting down dlm:

<6>CMAN: node va04 has been removed from the cluster : No response to messages
<6>CMAN: killed by STARTTRANS or NOMINATE
<6>CMAN: we are leaving the cluster. 

0xc8ea2f90     4975        1  0    1   R  0xc8ea31d0  cman_comms
EBP        EIP        Function (args)
0xc8eade98 0xc033eccc schedule+0x2fc (0xc8eadeac, 0x18bda78, 0xa0003c,
0x100100, 0x200200)
0xc8eadee4 0xc033f46e schedule_timeout+0x6e (0xc55203a8)
0xc8eadef0 0xc0128e80 msleep+0x30 (0x3e8, 0x6b, 0x3, 0xc9282d34)
0xc8eadf08 0xd095307c [dlm]dlm_recoverd_stop+0x5c (0xc5520338,
0xcfd9b31c, 0x6b, 0xc9282d34, 0xc9282d34)
0xc8eadf28 0xd0947588 [dlm]release_lockspace+0x38 (0xc5520338, 0x3,
0xcf810754, 0x0)
0xc8eadf40 0xd094787c [dlm]dlm_emergency_shutdown+0x4c
0xc8eadf48 0xd0949f45 [dlm]cman_callback+0x15 (0x2, 0x0, 0xd093bb4c,
0x1, 0xd093bb8c)
0xc8eadf64 0xd091d57a [cman]notify_kernel_listeners+0x5a (0x2, 0x0,
0x100100, 0x200200, 0xd093bb8c)
0xc8eadf94 0xd092139e [cman]node_shutdown+0x5e (0xd0934410, 0x6b, 0x0,
0xc8eac000, 0xc8eac000)
0xc8eadfec 0xd091d47b [cman]cluster_kthread+0x2ab (0xd093baf4,
0xd093baf4, 0x0, 0x0, 0x0)
           0xc011b640 default_wake_function


Because of switching to 2.6.10 and compiling a different cluster src
tree before switching back to 2.6.9, I lost track of the src tree
used to build the 2.6.9 modules used in this test.  I can't tell 
from the kernel module itself which version of the source it came
from so I don't know if the latest cman changes were part of this test.


Comment 8 David Teigland 2005-01-25 04:26:44 UTC

*** This bug has been marked as a duplicate of 139738 ***

Comment 9 Red Hat Bugzilla 2006-02-21 19:07:41 UTC
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.