Bug 315711 - dlm: closing connection to node 0
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Keywords: TestBlocker, ZStream
Blocks: 328341
Reported: 2007-10-02 14:21 EDT by Lon Hohberger
Modified: 2009-04-16 18:37 EDT
3 users

Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:57:49 EDT


Attachments
Patch to remove qdisk from cman_get_nodes() (10.95 KB, patch)
2007-10-03 06:23 EDT, Christine Caulfield
Make dlm_controld always ignore node ID 0 (543 bytes, patch)
2007-10-04 11:47 EDT, Lon Hohberger

Description Lon Hohberger 2007-10-02 14:21:04 EDT
Description of problem:

This happens whenever qdisk is in use and qdiskd is shut down. The effects are unknown.

Version-Release number of selected component (if applicable):
Linux tng3-2 2.6.18-48.el5 #1 SMP Mon Sep 17 17:26:31 EDT 2007 i686 i686 i386
GNU/Linux

How reproducible: 100%

Steps to Reproduce: 
1. Start qdiskd on a node
2. Wait for it to become part of the quorate qdisk set
3. Stop qdiskd
  
Actual results: dlm: closing connection to node 0

Expected results: Node 0 doesn't exist; it's a quorum disk.

Additional info: The following trace may or may not be related to this problem
but it bears noting since rgmanager got stuck in lockspace open/creation
coincidentally after the kernel reported the above:

clurgmgrd     D 4D7C0F96  3244  2623      1                2908 (NOTLB)
       d2f27e90 00000086 c04876be 4d7c0f96 00000015 c047d3a8 00000001 df59baa0
       df79e550 4d7f5cba 00000015 00034d24 00000000 df59bbac c13f4ee0 00000001
       00000000 df59baa0 00000002 00000008 00000018 00000008 e03e24b8 e03e24b4
Call Trace:
 [<c04876be>] mntput_no_expire+0x11/0x6a
 [<c047d3a8>] link_path_walk+0xb3/0xbd
 [<c0604a50>] __mutex_lock_slowpath+0x45/0x74
 [<c0604a8e>] .text.lock.mutex+0xf/0x14
 [<e03d0773>] dlm_new_lockspace+0x1b/0x79b [dlm]
 [<c045467a>] find_get_page+0x18/0x38
 [<c04570b1>] filemap_nopage+0x192/0x315
 [<c045fd14>] __handle_mm_fault+0x353/0x87b
 [<e03d5d16>] device_write+0x30f/0x4b5 [dlm]
 [<e03d5a07>] device_write+0x0/0x4b5 [dlm]
 [<c0470217>] vfs_write+0xa1/0x143
 [<c0470809>] sys_write+0x3c/0x63
 [<c0404eff>] syscall_call+0x7/0xb
Comment 1 David Teigland 2007-10-02 14:26:49 EDT
Either dlm_controld needs to ignore the disk it gets from cman_get_nodes(),
or cman_get_nodes() shouldn't return the disk as a cluster member.
Comment 2 Christine Caulfield 2007-10-03 03:07:29 EDT
The options are:

a) Change cman_get_nodes() to not return the quorum disk, and add an API call to
get the qdisk name that cman_tool can call.

b) Make everybody ignore nodeid 0.

Hmm, put like that, option a) sounds like the best deal.
Comment 3 Christine Caulfield 2007-10-03 06:23:15 EDT
Created attachment 214471 [details]
Patch to remove qdisk from cman_get_nodes()

This patch removes the qdisk from cman_get_nodes() and moves it into its own
API call, along with attendant changes to cman_tool.

As a quick fix, getting dlm_controld to ignore nodeid 0 would be a much smaller
patch!
Comment 4 Lon Hohberger 2007-10-03 15:35:39 EDT
This patch will definitely affect:

rgmanager

This patch may have effects on other parts including:

ccsd
lvm2-cluster
Comment 5 Lon Hohberger 2007-10-03 15:40:30 EDT
rgmanager reports the quorum disk status along with node status.  Changes at
least to clustat would be required in order to preserve output / apps which use
clustat for information.

(standard output snippet)
  Member Name                        ID   Status
  ------ ----                        ---- ------
  tng3-1                                1 Online, Local, rgmanager
  tng3-2                                2 Online, rgmanager
  tng3-3                                3 Online, rgmanager
  tng3-5                                5 Online, rgmanager
  /dev/sdd1                             0 Online, Quorum Disk

(xml snippet)
  <nodes>
    <node name="tng3-1" state="1" local="1" estranged="0" rgmanager="1"
qdisk="0" nodeid="0x00000001"/>
    <node name="tng3-2" state="1" local="0" estranged="0" rgmanager="1"
qdisk="0" nodeid="0x00000002"/>
    <node name="tng3-3" state="1" local="0" estranged="0" rgmanager="1"
qdisk="0" nodeid="0x00000003"/>
    <node name="tng3-5" state="1" local="0" estranged="0" rgmanager="1"
qdisk="0" nodeid="0x00000005"/>
    <node name="/dev/sdd1" state="1" local="0" estranged="0" rgmanager="0"
qdisk="1" nodeid="0x00000000"/>
  </nodes>
Comment 6 Corey Marthaler 2007-10-03 19:06:43 EDT
According to Lon, I may be hitting this while doing recovery testing with qdisk.

I'll either see a node deadlock during start-up *or* see that node eventually
come up and then any lvm cmd will hang.
Comment 7 Christine Caulfield 2007-10-04 04:27:30 EDT
lvm2-cluster is fine as it is, or with the patch - it doesn't care about qdisk
and already ignores nodeid 0. I'm not sure about ccsd but I think the same applies.

Yes, I can believe this will cause problems with new nodes arriving in the
cluster. Connection "0" in the DLM is the listening socket. So any new
connections received after this message will be refused. 
Comment 8 Corey Marthaler 2007-10-04 11:11:07 EDT
It appears that I am definitely hitting this while running revolver in a qdisk
cluster. Every time, I eventually end up seeing the "dlm: closing connection to
node 0" message along with clvmd stuck in dlm:dlm_new_lockspace.

Oct  4 09:49:45 taft-02 kernel: clvmd         D ffffffff801405e0     0  5972   
  1          5997  5981 (NOTLB)
Oct  4 09:49:45 taft-02 kernel:  ffff81021198dd98 0000000000000086
00000000000000d0 00000000000000d0
Oct  4 09:49:45 taft-02 kernel:  000000000000000a ffff8101ffd52080
ffffffff802dcae0 00000025bed1ce44
Oct  4 09:49:45 taft-02 kernel:  00000000000010c7 ffff8101ffd52268
ffff810200000000 ffff81021b7bf820
Oct  4 09:49:45 taft-02 kernel: Call Trace:
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff800610e7>] wait_for_completion+0x79/0xa2
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff800884ac>] default_wake_function+0x0/0xe
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff8014129e>] kobject_register+0x33/0x3a
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff8845726b>]
:dlm:dlm_new_lockspace+0x734/0x88b
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff8845cdc7>] :dlm:device_write+0x414/0x5ca
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff800161c7>] vfs_write+0xce/0x174
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff80016a94>] sys_write+0x45/0x6e
Oct  4 09:49:45 taft-02 kernel:  [<ffffffff8005b28d>] tracesys+0xd5/0xe0
Comment 9 Lon Hohberger 2007-10-04 11:47:56 EDT
Created attachment 215951 [details]
Make dlm_controld always ignore node ID 0

Patch marks node ID 0 as dead.
Comment 10 Corey Marthaler 2007-10-04 18:04:29 EDT
Just a note: running with the fix that Lon built (in
cman-2.0.73-1.1.x86_64.rpm) appears to solve this issue.
Comment 17 Lon Hohberger 2008-03-25 17:28:40 EDT
Re-modified.
Comment 19 Lon Hohberger 2008-03-27 14:26:02 EDT
Mar 27 14:18:06 molly openais[1587]: [CMAN ] lost contact with quorum device 

However, there was no indication of "dlm: Closing connection to node 0".

This test was performed on 2.0.81.

I also tested using the test case for the bz directly:

[root@molly ~]# ./cman_port_bug 
process_cman_event - PORTOPENED 2
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = 0
  cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - PORTOPENED 1
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = -1  (errno = 107)
  cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = -1  (errno = 107)
  cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = 0
  cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = 0
  cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
  cman_is_listening(0x1f1d2010, 1, 192) = 0
  cman_is_listening(0x1f1d2010, 2, 192) = 1


At no point during the join/boot process did I receive the error message that
the node had the port open (but without a PORTOPENED message).

Comment 20 Lon Hohberger 2008-03-27 14:26:33 EDT
Marking verified.
Comment 22 errata-xmlrpc 2008-05-21 11:57:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html
