Bug 315711
| Field | Value |
|---|---|
| Summary | dlm: closing connection to node 0 |
| Product | Red Hat Enterprise Linux 5 |
| Component | cman |
| Version | 5.1 |
| Reporter | Lon Hohberger <lhh> |
| Assignee | Lon Hohberger <lhh> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, rkenna, teigland |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Target Milestone | rc |
| Keywords | TestBlocker, ZStream |
| Hardware | All |
| OS | Linux |
| Fixed In Version | RHBA-2008-0347 |
| Doc Type | Bug Fix |
| Bug Blocks | 328341 |
| Last Closed | 2008-05-21 15:57:49 UTC |
Description (Lon Hohberger, 2007-10-02 18:21:04 UTC)
Either dlm_controld needs to ignore the disk it gets from cman_get_nodes(), or cman_get_nodes() shouldn't return the disk as a cluster member. The options are:

a) Change cman_get_nodes() to not return the quorum disk, and add an API call to get the qdisk name that cman_tool can call.

b) Make everybody ignore node ID 0.

Hmm, put like that, option a) sounds like the best deal.

Created attachment 214471 [details]
Patch to remove qdisk from cman_get_nodes()

This is the patch to remove the qdisk from cman_get_nodes() and move it into its own API call, along with attendant changes to cman_tool.

As a quick fix, getting dlm_controld to ignore node ID 0 would be a much smaller patch!
This patch will definitely affect: rgmanager. This patch may have effects on other parts, including: ccsd, lvm2-cluster. rgmanager reports the quorum disk status along with node status, so changes at least to clustat would be required in order to preserve output for apps which use clustat for information.

(standard output snippet)

```
Member Name    ID   Status
------ ----    ---- ------
tng3-1         1    Online, Local, rgmanager
tng3-2         2    Online, rgmanager
tng3-3         3    Online, rgmanager
tng3-5         5    Online, rgmanager
/dev/sdd1      0    Online, Quorum Disk
```

(xml snippet)

```xml
<nodes>
  <node name="tng3-1" state="1" local="1" estranged="0" rgmanager="1" qdisk="0" nodeid="0x00000001"/>
  <node name="tng3-2" state="1" local="0" estranged="0" rgmanager="1" qdisk="0" nodeid="0x00000002"/>
  <node name="tng3-3" state="1" local="0" estranged="0" rgmanager="1" qdisk="0" nodeid="0x00000003"/>
  <node name="tng3-5" state="1" local="0" estranged="0" rgmanager="1" qdisk="0" nodeid="0x00000005"/>
  <node name="/dev/sdd1" state="1" local="0" estranged="0" rgmanager="0" qdisk="1" nodeid="0x00000000"/>
</nodes>
```

According to Lon, I may be hitting this while doing recovery testing with qdisk. I'll either see a node deadlock during start-up *or* see that node eventually come up, after which any lvm command will hang.

lvm2-cluster is fine as it is, or with the patch - it doesn't care about qdisk and already ignores node ID 0. I'm not sure about ccsd, but I think the same applies.

Yes, I can believe this will cause problems with new nodes arriving in the cluster. Connection "0" in the DLM is the listening socket, so any new connections received after this message will be refused.

It appears that I am definitely hitting this while running revolver in a qdisk cluster. Every time, I eventually end up seeing "dlm: closing connection to node 0" along with clvmd stuck in dlm:dlm_new_lockspace.
```
Oct 4 09:49:45 taft-02 kernel: clvmd         D ffffffff801405e0     0  5972      1          5997  5981 (NOTLB)
Oct 4 09:49:45 taft-02 kernel:  ffff81021198dd98 0000000000000086 00000000000000d0 00000000000000d0
Oct 4 09:49:45 taft-02 kernel:  000000000000000a ffff8101ffd52080 ffffffff802dcae0 00000025bed1ce44
Oct 4 09:49:45 taft-02 kernel:  00000000000010c7 ffff8101ffd52268 ffff810200000000 ffff81021b7bf820
Oct 4 09:49:45 taft-02 kernel: Call Trace:
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff800610e7>] wait_for_completion+0x79/0xa2
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff800884ac>] default_wake_function+0x0/0xe
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff8014129e>] kobject_register+0x33/0x3a
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff8845726b>] :dlm:dlm_new_lockspace+0x734/0x88b
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff8845cdc7>] :dlm:device_write+0x414/0x5ca
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff800161c7>] vfs_write+0xce/0x174
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff80016a94>] sys_write+0x45/0x6e
Oct 4 09:49:45 taft-02 kernel:  [<ffffffff8005b28d>] tracesys+0xd5/0xe0
```

Created attachment 215951 [details]
Make dlm_controld always ignore node ID 0

The patch marks node ID 0 as dead.
Just a note that running with the fix (in cman-2.0.73-1.1.x86_64.rpm) that Lon built appears to solve this issue.

Re-modified.

```
Mar 27 14:18:06 molly openais[1587]: [CMAN ] lost contact with quorum device
```

However, there was no indication of "dlm: closing connection to node 0". This test was performed on 2.0.81.

I also tested using the test case for the bz directly:

```
[root@molly ~]# ./cman_port_bug
process_cman_event - PORTOPENED 2
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = 0
cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - PORTOPENED 1
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = -1 (errno = 107)
cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = -1 (errno = 107)
cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = 0
cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = 0
cman_is_listening(0x1f1d2010, 2, 192) = 1
process_cman_event - STATECHANGE 0
cman_is_listening(0x1f1d2010, 1, 192) = 0
cman_is_listening(0x1f1d2010, 2, 192) = 1
```

At no point during the join/boot process did I receive the error message that the node had the port open (but without a PORTOPENED message). Marking verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html