Bug 1013062

Summary: DLM on one node hangs during lockspace join
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.4
Status: CLOSED INSUFFICIENT_DATA
Severity: urgent
Priority: unspecified
Reporter: Tomas Herfert <therfert>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: agialluc, ccaulfie, cluster-maint, fdinitto, rpeterso, teigland, therfert
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Regression: ---
Last Closed: 2014-02-27 13:20:11 UTC
Attachments:
  results
  session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan
  systemtap
  systemtap
  systemtap

Description Tomas Herfert 2013-09-27 18:13:54 UTC
Created attachment 804083 [details]
results

Description of problem:
One particular node (node 15) of a 19-node cluster can't join a DLM lockspace - the command hangs.
Based on my previous discussion with David Teigland, the problem seems to be at the network level.

Version-Release number of selected component (if applicable):
kernel 2.6.32-358.18.1.el6.x86_64

How reproducible:
Currently it behaves the same way after each reboot of node 15; otherwise it is hard to reproduce.

Steps to Reproduce:
Node1:
dlm_tool join foo2
- command finishes without problems

Node15:
dlm_tool join foo2
- command hangs


Additional info:
Please find attached the tcpdump captures from node1 and node15, taken with:
Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

The results of the following commands are also attached:
dlm_tool ls
dlm_tool dump
dmesg
as well as the log file /var/log/messages

Comment 2 Anthony Gialluca 2013-09-27 18:44:38 UTC
Created attachment 804098 [details]
session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan

Comment 3 Anthony Gialluca 2013-09-27 18:45:26 UTC
> I don't think I ever asked for a ps ax -o pid,stat,cmd,wchan from node1 while things were stuck, could you collect that?

session.log attached with the requested information.

Comment 4 David Teigland 2013-09-27 19:45:33 UTC
I believe both nodes (1, 15) are in ping_members(), and have sent a status message to the other via dlm_rcom_status().

The tcpdump shows the status message from 15 to 1, but it shows nothing from 1 to 15.  Node 1 should also have replied to the status message it received from node 15, but we don't see any reply either.  We need to figure out where these messages are being dropped.  The dlm message debugging does not go any lower, so we need to either add that, or implicate a layer below it.
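
For reference, the saved captures can be read back offline with tcpdump (filenames as above; -nn just disables name and port resolution):

tcpdump -nn -r /tmp/node1-tcpdump
tcpdump -nn -r /tmp/node15-tcpdump

A status message that appears in the node15 capture but not in the node1 capture would have been lost somewhere between the two hosts, below the dlm.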

Comment 5 Tomas Herfert 2013-09-30 12:07:56 UTC
(In reply to David Teigland from comment #4)
> I believe both nodes (1, 15) are in ping_members(), and have sent a status
> message to the other via dlm_rcom_status().
> 
> The tcpdump shows the status message from 15 to 1, but it shows nothing from
> 1 to 15.  Node 1 should also have replied to the status message it received
> from node 15, but we don't see any reply either.  We need to figure out
> where these messages are being dropped.  The dlm message debugging does not
> go any lower, so we need to either add that, or implicate a layer below it.

Thanks David,
do you need us to do anything further or provide more information?

By the way, looking at the network interface statistics, there are 0 dropped packets on both nodes.
Also, there are no iptables rules applied.
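
For reference, checks of that kind are typically done with something like the following (the exact commands used aren't recorded in this report):

ip -s link show eth0    # per-interface RX/TX statistics, including dropped packets
iptables -L -n          # list any active iptables rules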

Comment 7 David Teigland 2013-09-30 15:59:19 UTC
Created attachment 805321 [details]
systemtap

Here's a systemtap script to show any messages the dlm receives.
I'm still trying to get a system set up to test it, so I don't know if it works.

Comment 8 David Teigland 2013-09-30 16:56:44 UTC
Created attachment 805393 [details]
systemtap

Fixed some %d/%u prints in case systemtap cares.

Comment 9 David Teigland 2013-09-30 18:30:48 UTC
Created attachment 805450 [details]
systemtap

I debugged this one on rhel7, hopefully it works on rhel6.

Comment 10 David Teigland 2013-09-30 18:40:34 UTC
Adding the systemtap steps to the procedure we've run before:

Node1: dlm_tool join fooN

Node1:
tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15:
tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

Node1:
stap dlm-recv.stp > /tmp/node1-stap
Node15:
stap dlm-recv.stp > /tmp/node15-stap

Node15: dlm_tool join fooN  (This doesn't complete.)

Stop tcpdump and stap and attach the captured data.

Comment 11 Tomas Herfert 2013-10-02 17:33:07 UTC
Unfortunately node1 has been fenced in the meantime, and after that the problem disappeared.

Comment 12 Fabio Massimo Di Nitto 2014-02-13 09:40:49 UTC
Tomas,

Has the customer experienced this issue again? Otherwise I'll need to close this one. Of course we can re-open it again if necessary.

Comment 14 David Teigland 2014-02-13 15:02:24 UTC
We would need to reproduce the problem while running tcpdump and systemtap.
That data would show whether the problem was in the dlm or outside the dlm.

Comment 16 Anthony Gialluca 2014-02-19 14:46:35 UTC
(In reply to Fabio Massimo Di Nitto from comment #12)
> Tomas,
> 
> Has the customer experienced this issue again? Otherwise I'll need to close
> this one. Of course we can re-open it again if necessary.

Fabio/Tomas,

Since we increased the totem token timeout and set secauth to off, the
cluster seems to be more stable. It has stayed up since early January
with no nodes being fenced off.
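
For reference, the totem parameters referred to here look roughly like this in a corosync.conf-style stanza; the 30000 ms token value is only a placeholder (the actual values used aren't recorded in this report), and on a cman-based RHEL 6 cluster the equivalent settings are made in /etc/cluster/cluster.conf:

totem {
        token: 30000
        secauth: off
}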

I think that this can be closed and if we encounter the issue
again a new ticket can be initiated.

Thanks for your assistance with this bug.
-Tony

Comment 17 Fabio Massimo Di Nitto 2014-02-27 13:19:50 UTC
Closing based on comment #16.

Please reopen if the problem arises again.