Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. set up the nodes to communicate over IPv6
2. start cman
3. try to start rgmanager (with a configured service) on both nodes or try to perform a gfs/gfs2 mount operation on both nodes (it does not need to be at the same time).
dlm_recoverd is in D state. The invoking process cannot be killed.
I haven't had a chance to verify whether this problem is strictly related to the kernel side or the userland side of dlm.
Let's reduce the steps above to just the following:
0. setup ipv6
1. make sure <dlm log_debug="1"/> is in cluster.conf
3. cman_tool join
4. cman_tool nodes -a
6. fence_tool join
7. dlm_controld -D (doesn't fork)
8. dlm_tool join test
Initially, I'm guessing that the problem is in dlm_controld, and we'll
want to look at the function add_configfs_node() under "set the address".
I was able to reproduce it with this reduced test case.
1b. mounted configfs and modprobed kernel modules from a clean boot.
4a. cman_tool status to verify ipv6 connectivity.
4b. cman_tool nodes -a done
4c. groupd: done
8. executed first on rhel5-1 and then on rhel5-2
Output of 4a, 4b, fence_tool dump and 7 from both nodes is attached.
When running step 8 on the second node, dlm_recoverd is in D state.
Created attachment 314688 [details]
debug info from rhel5-1 node
Created attachment 314689 [details]
debug info from rhel5-2 node
Those logs all look fine. Does ps ax -o pid,stat,cmd,wchan show
dlm_controld blocked on anything specific? Does anything appear in
/var/log/messages, especially dlm messages, esp "connecting to" messages?
I wonder if the call to bind() in Lon's recent patch could have anything
to do with it?
ahh good catch:
repeating the exact same steps as above, /var/log/messages:
Aug 22 10:59:33 rhel5-1 kernel: dlm: Using TCP for communications
Aug 22 10:59:39 rhel5-1 kernel: dlm: connecting to 2
Aug 22 10:59:39 rhel5-1 kernel: dlm: connect from non cluster node
Aug 22 11:02:31 rhel5-2 kernel: dlm: Using TCP for communications
Aug 22 11:02:31 rhel5-2 kernel: dlm: connecting to 1
Aug 22 11:02:31 rhel5-2 kernel: dlm: connect from non cluster node
patch for this seems to work, queueing it for upstream 2.6.28
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
posted to rhkernel
Date: Wed, 3 Sep 2008 09:02:21 -0500
From: David Teigland <email@example.com>
Subject: [RHEL5.3 PATCH] dlm: fix address compare
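The patch itself is not reproduced in this report. As a rough user-space illustration of the kind of problem the subject line points at, assume the node lookup compared whole socket address structures byte for byte. Under that assumption, two IPv6 addresses that differ only in fields such as sin6_scope_id or the port would be treated as different nodes, which would fit the "connect from non cluster node" messages above. The names addr_equal(), make_v6() and self_check() are hypothetical helpers written for this sketch and are not taken from the dlm source.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: return 1 when x and y have the same address family
 * and the same IP address. Port, flow info and scope ID are ignored. */
static int addr_equal(const struct sockaddr_storage *x,
                      const struct sockaddr_storage *y)
{
    if (x->ss_family != y->ss_family)
        return 0;

    switch (x->ss_family) {
    case AF_INET: {
        const struct sockaddr_in *a = (const struct sockaddr_in *)x;
        const struct sockaddr_in *b = (const struct sockaddr_in *)y;
        return a->sin_addr.s_addr == b->sin_addr.s_addr;
    }
    case AF_INET6: {
        const struct sockaddr_in6 *a = (const struct sockaddr_in6 *)x;
        const struct sockaddr_in6 *b = (const struct sockaddr_in6 *)y;
        return memcmp(&a->sin6_addr, &b->sin6_addr,
                      sizeof(a->sin6_addr)) == 0;
    }
    default:
        return 0; /* unknown family: never treat as a match */
    }
}

/* Hypothetical helper: fill ss with an IPv6 address parsed from text.
 * Returns 0 on success, -1 if the text is not a valid IPv6 address. */
static int make_v6(struct sockaddr_storage *ss, const char *text,
                   unsigned int scope_id, unsigned short port)
{
    struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)ss;

    memset(ss, 0, sizeof(*ss));
    sin6->sin6_family = AF_INET6;
    sin6->sin6_port = htons(port);
    sin6->sin6_scope_id = scope_id;
    return inet_pton(AF_INET6, text, &sin6->sin6_addr) == 1 ? 0 : -1;
}

/* Same address, different scope ID and port (arbitrary example values):
 * a raw byte compare reports a mismatch, the field-wise compare matches. */
static void self_check(void)
{
    struct sockaddr_storage configured, incoming;

    assert(make_v6(&configured, "fe80::1", 0, 0) == 0);
    assert(make_v6(&incoming, "fe80::1", 2, 1000) == 0);
    assert(memcmp(&configured, &incoming, sizeof(configured)) != 0);
    assert(addr_equal(&configured, &incoming));
}
```

Calling self_check() from a small main() exercises both comparisons. The real change lives in the kernel dlm code and may differ in detail from this sketch.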
You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.