Bug 1013062 - DLM on one node hangs during lockspace join
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2013-09-27 14:13 EDT by Tomas Herfert
Modified: 2014-02-27 08:20 EST
CC List: 7 users

Doc Type: Bug Fix
Last Closed: 2014-02-27 08:20:11 EST
Type: Bug

Attachments
results (432.32 KB, application/x-compressed-tar), 2013-09-27 14:13 EDT, Tomas Herfert
session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan (38.71 KB, text/plain), 2013-09-27 14:44 EDT, Anthony Gialluca
systemtap (996 bytes, text/plain), 2013-09-30 11:59 EDT, David Teigland
systemtap (996 bytes, text/plain), 2013-09-30 12:56 EDT, David Teigland
systemtap (810 bytes, text/plain), 2013-09-30 14:30 EDT, David Teigland

Description Tomas Herfert 2013-09-27 14:13:54 EDT
Created attachment 804083 [details]
results

Description of problem:
One particular node (node 15) of a 19-node cluster can't join a DLM lockspace - the command hangs.
Based on my previous discussion with David Teigland, it seems to be at the network level.

Version-Release number of selected component (if applicable):
kernel 2.6.32-358.18.1.el6.x86_64

How reproducible:
Currently it behaves the same after each reboot of node 15; otherwise it is hard to reproduce.

Steps to Reproduce:
Node1:
dlm_tool join foo2
- command finishes w/o problem

Node15:
dlm_tool join foo2
- command hangs
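
As a minimal sketch of how the difference shows up from each node, using only the dlm_tool commands already mentioned in this report (the exact dlm_tool ls output depends on the version):

Node1:
dlm_tool join foo2         # returns once the join completes
dlm_tool ls                # foo2 should appear as a joined lockspace

Node15:
dlm_tool join foo2 &       # backgrounded here only because the command never returns
dlm_tool ls                # check whether foo2 appears and whether the join ever completes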


Additional info:
Please find the attached tcpdump from node1 and node15, captured by:
Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

There are also results of the following commands attached:
dlm_tool ls
dlm_tool dump
dmesg
and log file /var/log/messages
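
For reference, a collection sketch along these lines would gather the same data on each node (the output paths are illustrative, not the ones used here):

dlm_tool ls    > /tmp/$(hostname)-dlm-ls
dlm_tool dump  > /tmp/$(hostname)-dlm-dump
dmesg          > /tmp/$(hostname)-dmesg
cp /var/log/messages /tmp/$(hostname)-messages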
Comment 2 Anthony Gialluca 2013-09-27 14:44:38 EDT
Created attachment 804098 [details]
session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan
Comment 3 Anthony Gialluca 2013-09-27 14:45:26 EDT
> I don't think I ever asked for a ps ax -o pid,stat,cmd,wchan from node1 while things were stuck, could you collect that?

session.log attached with requested information.
Comment 4 David Teigland 2013-09-27 15:45:33 EDT
I believe both nodes (1, 15) are in ping_members(), and have sent a status message to the other via dlm_rcom_status().

The tcpdump shows the status message from 15 to 1, but it shows nothing from 1 to 15.  Node 1 should also have replied to the status message it received from node 15, but we don't see any reply either.  We need to figure out where these messages are being dropped.  The dlm message debugging does not go any lower, so we need to either add that, or implicate a layer below it.
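
One way to double-check this from the captures already attached is to read them back and list only the traffic on the DLM port (21064); this is standard tcpdump usage, shown here as a sketch:

# Node 1 capture: the status message arriving from node 15 should be followed by a reply back to it
tcpdump -nn -r /tmp/node1-tcpdump 'tcp port 21064'
# Node 15 capture: compare with the above to see in which direction the messages go missing
tcpdump -nn -r /tmp/node15-tcpdump 'tcp port 21064'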
Comment 5 Tomas Herfert 2013-09-30 08:07:56 EDT
(In reply to David Teigland from comment #4)
> I believe both nodes (1, 15) are in ping_members(), and have sent a status
> message to the other via dlm_rcom_status().
> 
> The tcpdump shows the status message from 15 to 1, but it shows nothing from
> 1 to 15.  Node 1 should also have replied to the status message it received
> from node 15, but we don't see any reply either.  We need to figure out
> where these messages are being dropped.  The dlm message debugging does not
> go any lower, so we need to either add that, or implicate a layer below it.

Thanks David,
do you need us to do anything further or provide more information?

BTW, looking at the network interfaces, there are 0 dropped packets on both nodes.
Also, there are no iptables rules applied.
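
The checks described above amount to roughly the following, using eth0 as in the earlier tcpdump commands (a sketch only; the ethtool statistics depend on the NIC driver):

ip -s link show eth0               # per-interface RX/TX dropped counters
ethtool -S eth0 | grep -i drop     # NIC/driver drop counters, where exposed
iptables -L -n -v                  # confirm no filtering rules are applied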
Comment 7 David Teigland 2013-09-30 11:59:19 EDT
Created attachment 805321 [details]
systemtap

Here's a systemtap script to show any messages the dlm receives.
I'm still trying to get a system set up to test it, so I don't know if it works.
Comment 8 David Teigland 2013-09-30 12:56:44 EDT
Created attachment 805393 [details]
systemtap

Fixed some %d/%u prints in case systemtap cares.
Comment 9 David Teigland 2013-09-30 14:30:48 EDT
Created attachment 805450 [details]
systemtap

I debugged this one on rhel7, hopefully it works on rhel6.
Comment 10 David Teigland 2013-09-30 14:40:34 EDT
Adding the systemtap steps to the steps we've run before:

Node1: dlm_tool join fooN

Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

Node1:
stap dlm-recv.stp > /tmp/node1-stap
Node15:
stap dlm-recv.stp > /tmp/node15-stap

Node15: dlm_tool join fooN  (This doesn't complete.)

Stop tcpdump and stap and attach the captured data.
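
A sketch of one way to run and then stop the captures on node 1, following the file names above (node 15 is analogous; the tar file name is illustrative):

tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064 &
stap dlm-recv.stp > /tmp/node1-stap &
# ... reproduce: dlm_tool join fooN on node 15 ...
kill %1 %2                         # stop tcpdump and stap
tar czf /tmp/node1-data.tar.gz /tmp/node1-tcpdump /tmp/node1-stap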
Comment 11 Tomas Herfert 2013-10-02 13:33:07 EDT
Unfortunately, node1 has been fenced in the meantime, and after that the problem disappeared.
Comment 12 Fabio Massimo Di Nitto 2014-02-13 04:40:49 EST
Tomas,

has the customer experienced this issue again? Otherwise I'll need to close this one. Clearly we can re-open it again if necessary.
Comment 14 David Teigland 2014-02-13 10:02:24 EST
We would need to reproduce the problem while running tcpdump and systemtap.
That data would show whether the problem was in the dlm or outside the dlm.
Comment 16 Anthony Gialluca 2014-02-19 09:46:35 EST
(In reply to Fabio Massimo Di Nitto from comment #12)
> Tomas,
> 
> has the customer experienced this issue again? Otherwise I'll need to close
> this one. Clearly we can re-open it again if necessary.

Fabio/Tomas,

Since we have increased the token time and set secauth to off, the cluster
seems to be more stable. It has stayed up since early January with no nodes
being fenced off.
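
For reference, on a RHEL 6 cman/corosync 1.x node the effective values can be checked roughly like this (the totem.* key names are an assumption about corosync's object database layout):

corosync-objctl | grep -E '^totem\.(token|secauth)'    # running totem configuration
grep -iE 'token|secauth' /etc/cluster/cluster.conf     # configured values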

I think that this can be closed; if we encounter the issue again, a new
ticket can be initiated.

Thanks for your assistance with this bug.
-Tony
Comment 17 Fabio Massimo Di Nitto 2014-02-27 08:19:50 EST
Closing based on comment #16.

Please reopen if the problem arises again.
