Bug 1013062

Summary: DLM on one node hangs during lockspace join
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.4
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Reporter: Tomas Herfert <therfert>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agialluc, ccaulfie, cluster-maint, fdinitto, rpeterso, teigland, therfert
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2014-02-27 13:20:11 UTC
Attachments:
  results
  session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan
  systemtap
  systemtap
  systemtap

Description Tomas Herfert 2013-09-27 18:13:54 UTC
Created attachment 804083 [details]
results

Description of problem:
One particular node (node 15) of a 19-node cluster can't join a DLM lockspace - the command hangs.
Based on my previous discussion with David Teigland, the problem seems to be at the network level.

Version-Release number of selected component (if applicable):
kernel 2.6.32-358.18.1.el6.x86_64

How reproducible:
Currently it behaves the same way after each reboot of node 15; otherwise it is hard to reproduce.

Steps to Reproduce:
Node1:
dlm_tool join foo2
- command finishes without problems

Node15:
dlm_tool join foo2
- command hangs


Additional info:
Please find attached the tcpdump captures from node1 and node15, taken with:
Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

The results of the following commands are also attached:
dlm_tool ls
dlm_tool dump
dmesg
as well as the log file /var/log/messages

Comment 2 Anthony Gialluca 2013-09-27 18:44:38 UTC
Created attachment 804098 [details]
session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan

Comment 3 Anthony Gialluca 2013-09-27 18:45:26 UTC
> I don't think I ever asked for a ps ax -o pid,stat,cmd,wchan from node1 while things were stuck, could you collect that?

session.log attached with the requested information.

Comment 4 David Teigland 2013-09-27 19:45:33 UTC
I believe both nodes (1, 15) are in ping_members(), and have sent a status message to the other via dlm_rcom_status().

The tcpdump shows the status message from 15 to 1, but it shows nothing from 1 to 15.  Node 1 should also have replied to the status message it received from node 15, but we don't see any reply either.  We need to figure out where these messages are being dropped.  The dlm message debugging does not go any lower, so we need to either add that, or implicate a layer below it.
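
For reference, the saved captures can be read back offline with tcpdump (filenames as above; -nn just disables name and port resolution):

tcpdump -nn -r /tmp/node1-tcpdump
tcpdump -nn -r /tmp/node15-tcpdump

A status message that appears in the node15 capture but not in the node1 capture would have been lost somewhere between the two hosts, below the dlm.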

Comment 5 Tomas Herfert 2013-09-30 12:07:56 UTC
(In reply to David Teigland from comment #4)
> I believe both nodes (1, 15) are in ping_members(), and have sent a status
> message to the other via dlm_rcom_status().
> 
> The tcpdump shows the status message from 15 to 1, but it shows nothing from
> 1 to 15.  Node 1 should also have replied to the status message it received
> from node 15, but we don't see any reply either.  We need to figure out
> where these messages are being dropped.  The dlm message debugging does not
> go any lower, so we need to either add that, or implicate a layer below it.

Thanks David,
do you need us to do anything further or provide more information?

By the way, looking at the network interface statistics, there are 0 dropped packets on both nodes.
Also, there are no iptables rules applied.
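
For reference, checks of that kind are typically done with something like the following (the exact commands used aren't recorded in this report):

ip -s link show eth0    # per-interface RX/TX statistics, including dropped packets
iptables -L -n          # list any active iptables rules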

Comment 7 David Teigland 2013-09-30 15:59:19 UTC
Created attachment 805321 [details]
systemtap

Here's a systemtap script to show any messages the dlm receives.
I'm still trying to get a system set up to test it, so I don't know if it works.

Comment 8 David Teigland 2013-09-30 16:56:44 UTC
Created attachment 805393 [details]
systemtap

Fixed some %d/%u prints in case systemtap cares.

Comment 9 David Teigland 2013-09-30 18:30:48 UTC
Created attachment 805450 [details]
systemtap

I debugged this one on rhel7, hopefully it works on rhel6.

Comment 10 David Teigland 2013-09-30 18:40:34 UTC
Adding the systemtap steps to the procedure we've run before:

Node1: dlm_tool join fooN

Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

Node1:
stap dlm-recv.stp > /tmp/node1-stap
Node15:
stap dlm-recv.stp > /tmp/node15-stap

Node15: dlm_tool join fooN  (This doesn't complete.)

Stop tcpdump and stap and attach the captured data.

Comment 11 Tomas Herfert 2013-10-02 17:33:07 UTC
Unfortunately node1 has been fenced in the meantime, and after that the problem disappeared.

Comment 12 Fabio Massimo Di Nitto 2014-02-13 09:40:49 UTC
Tomas,

Has the customer experienced this issue again? Otherwise I'll need to close this one. Of course we can re-open it again if necessary.

Comment 14 David Teigland 2014-02-13 15:02:24 UTC
We would need to reproduce the problem while running tcpdump and systemtap.
That data would show whether the problem was in the dlm or outside the dlm.

Comment 16 Anthony Gialluca 2014-02-19 14:46:35 UTC
(In reply to Fabio Massimo Di Nitto from comment #12)
> Tomas,
> 
> Has the customer experienced this issue again? Otherwise I'll need to close
> this one. Of course we can re-open it again if necessary.

Fabio/Tomas,

Since we increased the totem token timeout and set secauth to off, the
cluster seems to be more stable. It has stayed up since early January
with no nodes being fenced off.
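
For reference, the totem parameters referred to here look roughly like this in a corosync.conf-style stanza; the 30000 ms token value is only a placeholder (the actual values used aren't recorded in this report), and on a cman-based RHEL 6 cluster the equivalent settings are made in /etc/cluster/cluster.conf:

totem {
        token: 30000
        secauth: off
}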

I think that this can be closed and if we encounter the issue
again a new ticket can be initiated.

Thanks for your assistance with this bug.
-Tony

Comment 17 Fabio Massimo Di Nitto 2014-02-27 13:19:50 UTC
Closing based on comment #16.

Please reopen if the problem arises again.