Bug 1013062
Summary:          DLM on one node hangs during lockspace join
Product:          Red Hat Enterprise Linux 6
Component:        cluster
Version:          6.4
Status:           CLOSED INSUFFICIENT_DATA
Severity:         urgent
Priority:         unspecified
Reporter:         Tomas Herfert <therfert>
Assignee:         Christine Caulfield <ccaulfie>
QA Contact:       Cluster QE <mspqa-list>
CC:               agialluc, ccaulfie, cluster-maint, fdinitto, rpeterso, teigland, therfert
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2014-02-27 13:20:11 UTC
Created attachment 804098 [details]
session.log with output from Node 1: ps ax -o pid,stat,cmd,wchan

> I don't think I ever asked for a ps ax -o pid,stat,cmd,wchan from node1
> while things were stuck, could you collect that?

session.log attached with the requested information.

I believe both nodes (1, 15) are in ping_members(), and have sent a status
message to the other via dlm_rcom_status().

The tcpdump shows the status message from 15 to 1, but it shows nothing from
1 to 15. Node 1 should also have replied to the status message it received
from node 15, but we don't see any reply either. We need to figure out where
these messages are being dropped. The dlm message debugging does not go any
lower, so we need to either add that, or implicate a layer below it.

(In reply to David Teigland from comment #4)
> I believe both nodes (1, 15) are in ping_members(), and have sent a status
> message to the other via dlm_rcom_status().
>
> The tcpdump shows the status message from 15 to 1, but it shows nothing from
> 1 to 15. Node 1 should also have replied to the status message it received
> from node 15, but we don't see any reply either. We need to figure out
> where these messages are being dropped. The dlm message debugging does not
> go any lower, so we need to either add that, or implicate a layer below it.

Thanks David, do you need us to do anything further or provide more
information? By the way, looking at the network interface there are 0 dropped
packets on both nodes. Also, there are no iptables rules applied.

Created attachment 805321 [details]
systemtap

Here's a systemtap script to show any messages the dlm receives. I'm still
trying to get a system set up to test it, so I don't know if it works.

Created attachment 805393 [details]
systemtap

Fixed some %d/%u prints in case systemtap cares.

Created attachment 805450 [details]
systemtap

I debugged this one on RHEL 7; hopefully it works on RHEL 6.

Adding the systemtap steps to the steps we've run before:

Node1:  dlm_tool join fooN
Node1:  tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15: tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064
Node1:  stap dlm-recv.stp > /tmp/node1-stap
Node15: stap dlm-recv.stp > /tmp/node15-stap
Node15: dlm_tool join fooN   (this does not complete)

Stop tcpdump and stap and attach the captured data.

Unfortunately node1 has been fenced in the meantime, and after that the
problem disappeared.

Tomas, has the customer experienced this issue again? Otherwise I'll need to
close this one. Clearly we can re-open it again if necessary.

We would need to reproduce the problem while running tcpdump and systemtap.
That data would show whether the problem was in the dlm or outside the dlm.

(In reply to Fabio Massimo Di Nitto from comment #12)
> Tomas, has the customer experienced this issue again? Otherwise I'll need
> to close this one. Clearly we can re-open it again if necessary.

Fabio/Tomas,

Since we have increased the token time and set secauth to off, the cluster
seems to be more stable. It has stayed up since early January with no nodes
being fenced off. I think that this can be closed, and if we encounter the
issue again a new ticket can be initiated. Thanks for your assistance with
this bug.

-Tony

Closing based on comment #16. Please reopen if the problem arises again.
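The per-node capture steps above can be wrapped in a small helper so that tcpdump and stap are started and stopped together. This is only a sketch of the procedure from this bug, not a shipped tool: the function names (`dlm_filter`, `start_capture`, `stop_capture`) are hypothetical, and it assumes eth0 is the cluster interface, 21064 is the DLM TCP port, and dlm-recv.stp is the systemtap script from attachment 805450.

```shell
# dlm_filter <peer-ip>: tcpdump filter matching DLM traffic with one peer
# (21064 is the default DLM TCP port, as used in the captures above).
dlm_filter() {
    echo "host $1 and port 21064"
}

# start_capture <node-name> <peer-ip>: begin tcpdump and systemtap traces,
# writing output and pid files under /tmp as in the steps above.
start_capture() {
    node="$1"; peer="$2"
    tcpdump -w "/tmp/${node}-tcpdump" -i eth0 "$(dlm_filter "$peer")" &
    echo $! > "/tmp/${node}-tcpdump.pid"
    stap dlm-recv.stp > "/tmp/${node}-stap" &
    echo $! > "/tmp/${node}-stap.pid"
}

# stop_capture <node-name>: stop both tracers so the data can be attached.
stop_capture() {
    node="$1"
    kill "$(cat "/tmp/${node}-tcpdump.pid")" "$(cat "/tmp/${node}-stap.pid")"
}

# Usage on node1 (node15 analogous with its own name and peer address):
#   start_capture node1 10.16.100.34
#   dlm_tool join fooN      # run the hanging join while capturing
#   stop_capture node1
```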
Created attachment 804083 [details]
results

Description of problem:
One particular node (node 15) of a 19-node cluster can't join a DLM
lockspace - the command hangs. Based on my previous discussion with David
Teigland, it seems to be at the network level.

Version-Release number of selected component (if applicable):
kernel 2.6.32-358.18.1.el6.x86_64

How reproducible:
Currently it behaves the same after each reboot of node 15... otherwise hardly.

Steps to Reproduce:
Node1:  dlm_tool join foo2  - command finishes w/o problem
Node15: dlm_tool join foo2  - command hangs

Additional info:
Please find attached the tcpdump from node1 and node15, captured by:
Node1:  tcpdump -w /tmp/node1-tcpdump -i eth0 host 10.16.100.34 and port 21064
Node15: tcpdump -w /tmp/node15-tcpdump -i eth0 host 10.16.100.6 and port 21064

There are also results of the following commands attached:
dlm_tool ls
dlm_tool dump
dmesg

and log file /var/log/messages
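Since the analysis in the comments pointed at status messages being dropped somewhere below the DLM, it is worth ruling out the obvious network-level causes on each node before re-running the captures. A minimal sketch, assuming eth0 is the cluster interface and 21064 the DLM TCP port (the function names are hypothetical, and these are the same checks informally reported in the comments: zero interface drops, no iptables rules):

```shell
# Quick network-level sanity checks to run on each node before deeper
# DLM debugging. All three are read-only.

# Show RX/TX statistics for the interface; nonzero "dropped" counters
# suggest packets are being lost before they ever reach the DLM.
check_drops() {
    ip -s link show "${1:-eth0}"
}

# Show sockets on the DLM port, to confirm both nodes have a connection.
check_dlm_port() {
    ss -tan | grep 21064
}

# List firewall rules that could be silently dropping DLM traffic.
check_firewall() {
    iptables -L -n
}
```

These checks only rule out the simplest explanations; if they all come back clean, the tcpdump plus dlm-recv.stp captures described above are still needed to tell whether the messages are lost inside or outside the dlm.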