Bug 142844
| Summary: | -ENOBUFS when sending message | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | michael conrad tadpol tilstra <mtilstra> |
| Component: | dlm | Assignee: | David Teigland <teigland> |
| Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Priority: | medium |
| Version: | 4 | CC: | ccaulfie, cluster-maint |
| Hardware: | i386 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2006-02-21 19:07:41 UTC |
| Bug Blocks: | 133240, 144795 | | |

while grabbing that, dlm spit out a few more lines that might be useful:

dlm: Joga: remote_stage error -105 10012
dlm: Joga: remote_stage error -105 103be
dlm: Joga: remote_stage error -105 603e6
dlm: Joga: remote_stage error -105 502e8
dlm: Joga: remote_stage error -105 500ca
dlm: Joga: remote_stage error -105 60081
dlm: Joga: remote_stage error -105 5017e
dlm: Joga: remote_stage error -105 50213
dlm: Joga: remote_stage error -105 803a9
dlm: Joga: remote_stage error -105 301b3
dlm: Joga: remote_stage error -105 70034
dlm: Joga: remote_stage error -105 500e7
dlm: Joga: remote_stage error -105 301a7
dlm: Joga: remote_stage error -105 c02d7
dlm: Joga: remote_stage error -105 900b1
dlm: Joga: remote_stage error -105 90084
dlm: Joga: remote_stage error -105 60087
dlm: Joga: remote_stage error -105 700f1
dlm: Joga: remote_stage error -105 a03b0
dlm: Joga: remote_stage error -105 d0059
dlm: Joga: remote_stage error -105 e03e0
dlm: Joga: remote_stage error -105 8023e
dlm: Joga: remote_stage error -105 c036f

have gotten this twice now. just did a cvs update to grab pjc checkin for bug #142853. dlm code is same though.

Created attachment 108640 [details]
Full dlm assert dump
Turned on screen logging, got full output this time. Also included output from
other nodes. Clocks are synced across all nodes.
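
For anyone skimming the remote_stage lines quoted above: -105 is a negated errno, and on Linux errno 105 is ENOBUFS, which matches the bug summary. Below is a minimal sketch in Python for pulling the error code and the trailing hex field out of such lines. The regex, the function name `parse_remote_stage`, and the one-entry errno table are assumptions made for illustration and are not part of dlm. The field after "dlm:" is presumably the lockspace name, and the meaning of the trailing hex value is not stated in this report, so the code treats both as opaque strings.

```python
import re
from collections import Counter

# Linux value only (assumption: the logs come from a Linux kernel, as the
# bug metadata says). errno numbers differ between operating systems.
LINUX_ERRNO_NAMES = {105: "ENOBUFS"}

# Matches lines such as "dlm: Joga: remote_stage error -105 10012".
LINE_RE = re.compile(
    r"dlm: (?P<name>\S+): remote_stage error (?P<err>-?\d+) (?P<tag>[0-9a-f]+)"
)


def parse_remote_stage(text):
    """Return (name, errno name, hex field) for each matching log entry."""
    results = []
    for m in LINE_RE.finditer(text):
        code = abs(int(m.group("err")))
        errno_name = LINUX_ERRNO_NAMES.get(code, "errno %d" % code)
        results.append((m.group("name"), errno_name, m.group("tag")))
    return results


# Three entries copied from the log above; works whether the entries are
# separated by newlines or run together on one line.
sample = (
    "dlm: Joga: remote_stage error -105 10012 "
    "dlm: Joga: remote_stage error -105 103be "
    "dlm: Joga: remote_stage error -105 603e6"
)
parsed = parse_remote_stage(sample)
counts = Counter(errno_name for _, errno_name, _ in parsed)
print(parsed)
print(counts)
```

Counting by errno name makes it easy to confirm that every remote_stage failure in a dump is the same error rather than a mix.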
The assertion is preceded by a bunch of the ENOBUFS errors in remote_stage, which is the first and root cause. I'm adding Patrick to this one in case he has any thoughts on whether the machine has simply exhausted its memory or if there's some other possible reason for getting this error when we send a message.

*** Bug 142874 has been marked as a duplicate of this bug. ***

The only other reason ENOBUFS would be returned would be if lowcomms has been told to shut down and has disabled outgoing connections - that doesn't look likely from the logs. It's possible that sending has stalled for some reason and that is what is causing the memory shortage but it's very difficult to tell.

I had been running "mu_loop 8 8" on 4 va nodes for 7 hours when va04 failed an ENOBUFS assertion. It was due to cman shutting down dlm:

<6>CMAN: node va04 has been removed from the cluster : No response to messages
<6>CMAN: killed by STARTTRANS or NOMINATE
<6>CMAN: we are leaving the cluster.

0xc8ea2f90 4975 1 0 1 R 0xc8ea31d0 cman_comms
EBP EIP Function (args)
0xc8eade98 0xc033eccc schedule+0x2fc (0xc8eadeac, 0x18bda78, 0xa0003c, 0x100100, 0x200200)
0xc8eadee4 0xc033f46e schedule_timeout+0x6e (0xc55203a8)
0xc8eadef0 0xc0128e80 msleep+0x30 (0x3e8, 0x6b, 0x3, 0xc9282d34)
0xc8eadf08 0xd095307c [dlm]dlm_recoverd_stop+0x5c (0xc5520338, 0xcfd9b31c, 0x6b, 0xc9282d34, 0xc9282d34)
0xc8eadf28 0xd0947588 [dlm]release_lockspace+0x38 (0xc5520338, 0x3, 0xcf810754, 0x0)
0xc8eadf40 0xd094787c [dlm]dlm_emergency_shutdown+0x4c
0xc8eadf48 0xd0949f45 [dlm]cman_callback+0x15 (0x2, 0x0, 0xd093bb4c, 0x1, 0xd093bb8c)
0xc8eadf64 0xd091d57a [cman]notify_kernel_listeners+0x5a (0x2, 0x0, 0x100100, 0x200200, 0xd093bb8c)
0xc8eadf94 0xd092139e [cman]node_shutdown+0x5e (0xd0934410, 0x6b, 0x0, 0xc8eac000, 0xc8eac000)
0xc8eadfec 0xd091d47b [cman]cluster_kthread+0x2ab (0xd093baf4, 0xd093baf4, 0x0, 0x0, 0x0)
0xc011b640 default_wake_function

Because of switching to 2.6.10 and compiling a different cluster src tree before switching back to 2.6.9, I lost track of the src tree used to build the 2.6.9 modules used in this test. I can't tell from the kernel module itself which version of the source it came from, so I don't know if the latest cman changes were part of this test.

*** This bug has been marked as a duplicate of 139738 ***

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.

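
To make the "cman shutting down dlm" path in the backtrace above easier to read, here is a rough Python sketch that extracts the function names and prints them in call order. The backtrace lists the innermost frame first (schedule at the top, cluster_kthread at the bottom), so the list is reversed. The helper name `call_chain` and the regex are assumptions made for illustration. The sample frames are copied from the dump with their argument lists trimmed for brevity, and the module tag in front of each symbol is skipped, whatever delimiter the dump uses for it.

```python
import re

# A symbol is the last ASCII identifier directly before "+0x<offset>".
# Any module tag in front of it (for example "[dlm]") is not captured.
SYMBOL_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-f]+")


def call_chain(backtrace):
    """Return function names from outermost caller to innermost callee."""
    frames = []
    for line in backtrace.splitlines():
        m = SYMBOL_RE.search(line)
        if m:
            frames.append(m.group(1))
    # The dump prints the innermost frame first, so reverse for call order.
    return list(reversed(frames))


# Frames taken from the backtrace above, arguments shortened.
sample = """\
0xc8eadf08 0xd095307c [dlm]dlm_recoverd_stop+0x5c (0xc5520338)
0xc8eadf28 0xd0947588 [dlm]release_lockspace+0x38 (0xc5520338, 0x3)
0xc8eadf40 0xd094787c [dlm]dlm_emergency_shutdown+0x4c
0xc8eadf48 0xd0949f45 [dlm]cman_callback+0x15 (0x2, 0x0)
0xc8eadf64 0xd091d57a [cman]notify_kernel_listeners+0x5a (0x2, 0x0)
0xc8eadf94 0xd092139e [cman]node_shutdown+0x5e (0xd0934410, 0x6b)
"""
chain = call_chain(sample)
print(" -> ".join(chain))
```

Read left to right, the printed chain starts in cman's node_shutdown and ends in dlm_recoverd_stop, which is the sequence the comment describes.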
Created attachment 108541 [details] lockdump from scrollback buffer. I got as much as I could.