Bug 142844 - -ENOBUFS when sending message
Status: CLOSED DUPLICATE of bug 139738
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Duplicates: 142874
Depends On:
Blocks: 133240 144795
Reported: 2004-12-14 12:17 EST by michael conrad tadpol tilstra
Modified: 2009-04-16 16:29 EDT (History)
Doc Type: Bug Fix
Last Closed: 2006-02-21 14:07:41 EST

Attachments
lockdump from scrollback buffer. (54.66 KB, text/plain)
2004-12-14 12:50 EST, michael conrad tadpol tilstra
Full dlm assert dump (63.88 KB, application/x-bzip2)
2004-12-15 13:14 EST, michael conrad tadpol tilstra

Description michael conrad tadpol tilstra 2004-12-14 12:50:43 EST
Created attachment 108541
lockdump from scrollback buffer.

I got as much as I could.
Comment 1 michael conrad tadpol tilstra 2004-12-14 12:52:16 EST
While grabbing that, dlm spat out a few more lines that might be useful:

dlm: Joga: remote_stage error -105 10012
dlm: Joga: remote_stage error -105 103be
dlm: Joga: remote_stage error -105 603e6
dlm: Joga: remote_stage error -105 502e8
dlm: Joga: remote_stage error -105 500ca
dlm: Joga: remote_stage error -105 60081
dlm: Joga: remote_stage error -105 5017e
dlm: Joga: remote_stage error -105 50213
dlm: Joga: remote_stage error -105 803a9
dlm: Joga: remote_stage error -105 301b3
dlm: Joga: remote_stage error -105 70034
dlm: Joga: remote_stage error -105 500e7
dlm: Joga: remote_stage error -105 301a7
dlm: Joga: remote_stage error -105 c02d7
dlm: Joga: remote_stage error -105 900b1
dlm: Joga: remote_stage error -105 90084
dlm: Joga: remote_stage error -105 60087
dlm: Joga: remote_stage error -105 700f1
dlm: Joga: remote_stage error -105 a03b0
dlm: Joga: remote_stage error -105 d0059
dlm: Joga: remote_stage error -105 e03e0
dlm: Joga: remote_stage error -105 8023e
dlm: Joga: remote_stage error -105 c036f
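
For anyone reading these lines later: the -105 is a negated errno, and on Linux 105 is ENOBUFS, which is where the bug title comes from. Below is a minimal sketch for decoding lines of this shape. The helper name `decode` and the regex are made up for illustration and are not part of dlm. The trailing hex field is parsed as a plain number, since its meaning is not stated in this report. The errno name lookup depends on the platform the script runs on, so run it on Linux to get the Linux mapping.

```python
import errno
import os
import re

# Hypothetical pattern for the "remote_stage error" lines quoted above:
# "dlm: <lockspace>: remote_stage error <negated errno> <hex value>"
LINE_RE = re.compile(
    r"^dlm: (?P<ls>\S+): remote_stage error (?P<err>-?\d+) (?P<hexval>[0-9a-f]+)$"
)

def decode(line):
    """Return (lockspace, errno name, hex field as int), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    code = abs(int(m.group("err")))  # kernel code reports errors as -errno
    # errno.errorcode maps numbers to names for the platform running this script.
    name = errno.errorcode.get(code, "unknown")
    return m.group("ls"), name, int(m.group("hexval"), 16)

sample = "dlm: Joga: remote_stage error -105 10012"
print(decode(sample))
print(os.strerror(105))  # platform-dependent message text
```

Lines that do not match the pattern return None, so the helper can be run over a whole console capture without filtering first.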
Comment 2 michael conrad tadpol tilstra 2004-12-15 09:39:36 EST
Have gotten this twice now.
Just did a cvs update to grab pjc checkin for bug #142853. The dlm code is the same, though.
Comment 3 michael conrad tadpol tilstra 2004-12-15 13:14:41 EST
Created attachment 108640
Full dlm assert dump

Turned on screen logging, got full output this time.  Also included output from
other nodes.  Clocks are synced across all nodes.
Comment 4 David Teigland 2005-01-04 03:02:41 EST
The assertion is preceded by a bunch of ENOBUFS errors in
remote_stage, which are the first failure and the root cause.

I'm adding Patrick to this one in case he has any thoughts on whether
the machine has simply exhausted its memory or if there's some other
possible reason for getting this error when we send a message.
Comment 5 David Teigland 2005-01-04 03:05:42 EST
*** Bug 142874 has been marked as a duplicate of this bug. ***
Comment 6 Christine Caulfield 2005-01-04 05:33:29 EST
The only other reason ENOBUFS would be returned would be if lowcomms
has been told to shut down and has disabled outgoing connections -
that doesn't look likely from the logs.

It's possible that sending has stalled for some reason and that is
what is causing the memory shortage but it's very difficult to tell.
Comment 7 David Teigland 2005-01-16 22:32:43 EST
Had been running "mu_loop 8 8" on 4 va nodes for 7 hours when va04 failed
an ENOBUFS assertion.  It was due to cman shutting down dlm:

<6>CMAN: node va04 has been removed from the cluster : No response to messages
<6>CMAN: killed by STARTTRANS or NOMINATE
<6>CMAN: we are leaving the cluster.

0xc8ea2f90     4975        1  0    1   R  0xc8ea31d0  cman_comms
EBP        EIP        Function (args)
0xc8eade98 0xc033eccc schedule+0x2fc (0xc8eadeac, 0x18bda78, 0xa0003c, 0x100100, 0x200200)
0xc8eadee4 0xc033f46e schedule_timeout+0x6e (0xc55203a8)
0xc8eadef0 0xc0128e80 msleep+0x30 (0x3e8, 0x6b, 0x3, 0xc9282d34)
0xc8eadf08 0xd095307c [dlm]dlm_recoverd_stop+0x5c (0xc5520338, 0xcfd9b31c, 0x6b, 0xc9282d34, 0xc9282d34)
0xc8eadf28 0xd0947588 [dlm]release_lockspace+0x38 (0xc5520338, 0x3, 0xcf810754, 0x0)
0xc8eadf40 0xd094787c [dlm]dlm_emergency_shutdown+0x4c
0xc8eadf48 0xd0949f45 [dlm]cman_callback+0x15 (0x2, 0x0, 0xd093bb4c, 0x1, 0xd093bb8c)
0xc8eadf64 0xd091d57a [cman]notify_kernel_listeners+0x5a (0x2, 0x0, 0x100100, 0x200200, 0xd093bb8c)
0xc8eadf94 0xd092139e [cman]node_shutdown+0x5e (0xd0934410, 0x6b, 0x0, 0xc8eac000, 0xc8eac000)
0xc8eadfec 0xd091d47b [cman]cluster_kthread+0x2ab (0xd093baf4, 0xd093baf4, 0x0, 0x0, 0x0)
           0xc011b640 default_wake_function


Because of switching to 2.6.10 and compiling a different cluster src
tree before switching back to 2.6.9, I lost track of the src tree
used to build the 2.6.9 modules used in this test.  I can't tell 
from the kernel module itself which version of the source it came
from so I don't know if the latest cman changes were part of this test.
Comment 8 David Teigland 2005-01-24 23:26:44 EST

*** This bug has been marked as a duplicate of 139738 ***
Comment 9 Red Hat Bugzilla 2006-02-21 14:07:41 EST
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
