Version-Release number of selected component (if applicable):
kernel 2.6.18-92.1.10.el5

How reproducible:
always

Steps to Reproduce:
1. set up the nodes to talk over IPv6
2. start cman
3. try to start rgmanager (with a configured service) on both nodes, or try to perform a gfs/gfs2 mount operation on both nodes (it does not need to be at the same time)

Actual results:
dlm_recoverd is in D state. The invoking process cannot be killed.

Additional info:
I have not been able to verify whether this problem is strictly related to the kernel side or the userland side of dlm.
Let's reduce the steps above to just the following:

0. setup ipv6
1. make sure <dlm log_debug="1"/> is in cluster.conf
2. ccsd
3. cman_tool join
4. cman_tool nodes -a
5. fenced
6. fence_tool join
7. dlm_controld -D (doesn't fork)
8. dlm_tool join test

Initially, I'm guessing that the problem is in dlm_controld, and we'll want to look at the function add_configfs_node() under "set the address".
I was able to reproduce it with this reduced test case.

0. done
1. done
1b. mounted configfs and modprobed the kernel modules from a clean boot
2. done
3. done
4a. cman_tool status to verify IPv6 connectivity
4b. cman_tool nodes -a: done
4c. groupd: done
5. done
6. done
7. done
8. executed first on rhel5-1 and then on rhel5-2

Output from 4a, 4b, fence_tool dump and 7 from both nodes is in the attachments. When running 8 on the second node, dlm_recoverd is in D state.
Created attachment 314688: debug info from rhel5-1 node
Created attachment 314689: debug info from rhel5-2 node
Those logs all look fine. Does ps ax -o pid,stat,cmd,wchan show dlm_controld blocked on anything specific? Does anything appear in /var/log/messages, especially dlm messages such as "connecting to"? I wonder if the call to bind() in Lon's recent patch could have anything to do with it?
ahh good catch: repeating the exact same steps as above, /var/log/messages shows:

From node1:
Aug 22 10:59:33 rhel5-1 kernel: dlm: Using TCP for communications
Aug 22 10:59:39 rhel5-1 kernel: dlm: connecting to 2
Aug 22 10:59:39 rhel5-1 kernel: dlm: connect from non cluster node

From node2:
Aug 22 11:02:31 rhel5-2 kernel: dlm: Using TCP for communications
Aug 22 11:02:31 rhel5-2 kernel: dlm: connecting to 1
Aug 22 11:02:31 rhel5-2 kernel: dlm: connect from non cluster node
patch for this seems to work, queueing it for upstream 2.6.28
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
posted to rhkernel

Date: Wed, 3 Sep 2008 09:02:21 -0500
From: David Teigland <teigland>
To: rhkernel-list
Subject: [RHEL5.3 PATCH] dlm: fix address compare
Message-ID: <20080903140221.GD22775>
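The patch itself is not shown in this thread. The "connect from non cluster node" messages, together with the patch subject "dlm: fix address compare", suggest that the kernel failed to match an incoming peer address against the addresses configured for the cluster nodes. The following is only a rough userspace sketch of that kind of failure, not the kernel code. It assumes the mismatch comes from comparing whole sockaddr_storage structures byte for byte, where a field such as sin6_scope_id can differ even though the IPv6 address is the same. The names addr_equal, make_v6 and self_check are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper, not the kernel's code: returns 1 when both
 * addresses have the same family, address and port, 0 otherwise. */
static int addr_equal(const struct sockaddr_storage *x,
                      const struct sockaddr_storage *y)
{
    if (x->ss_family != y->ss_family)
        return 0;

    switch (x->ss_family) {
    case AF_INET: {
        const struct sockaddr_in *a = (const struct sockaddr_in *)x;
        const struct sockaddr_in *b = (const struct sockaddr_in *)y;
        return a->sin_addr.s_addr == b->sin_addr.s_addr &&
               a->sin_port == b->sin_port;
    }
    case AF_INET6: {
        const struct sockaddr_in6 *a = (const struct sockaddr_in6 *)x;
        const struct sockaddr_in6 *b = (const struct sockaddr_in6 *)y;
        /* Deliberately ignores sin6_flowinfo and sin6_scope_id. */
        return memcmp(&a->sin6_addr, &b->sin6_addr,
                      sizeof(a->sin6_addr)) == 0 &&
               a->sin6_port == b->sin6_port;
    }
    default:
        return 0; /* unknown family: never treat as a match */
    }
}

/* Hypothetical helper: fill ss with an IPv6 address and scope id.
 * Returns 1 on success, 0 if text is not a valid IPv6 address. */
static int make_v6(struct sockaddr_storage *ss, const char *text,
                   unsigned int scope_id)
{
    struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)ss;

    memset(ss, 0, sizeof(*ss));
    s6->sin6_family = AF_INET6;
    s6->sin6_scope_id = scope_id;
    return inet_pton(AF_INET6, text, &s6->sin6_addr) == 1;
}

/* Same address with different scope ids: memcmp() over the whole
 * structure sees a difference, the field-wise comparison does not. */
void self_check(void)
{
    struct sockaddr_storage configured, incoming;
    int ok_configured = make_v6(&configured, "fe80::1", 0);
    int ok_incoming = make_v6(&incoming, "fe80::1", 2);

    assert(ok_configured && ok_incoming);
    assert(memcmp(&configured, &incoming, sizeof(configured)) != 0);
    assert(addr_equal(&configured, &incoming));
}
```

Calling self_check() from a small test program exercises both comparisons on the same pair of addresses. Whether the real patch compares exactly these fields is not stated here, so treat the field selection as an assumption.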
in kernel-2.6.18-111.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html