Bug 201348 - service clvmd start hangs on x86_64 nodes in cluster
Summary: service clvmd start hangs on x86_64 nodes in cluster
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2006-08-04 15:25 UTC by Robert Peterson
Modified: 2009-09-03 16:51 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-04-17 19:40:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Sysrq output showing the hang in vgscan (120.84 KB, text/plain)
2006-08-04 15:25 UTC, Robert Peterson
Sysrq output from yesterday's hang doing the same thing (120.03 KB, text/plain)
2006-08-04 15:31 UTC, Robert Peterson

Description Robert Peterson 2006-08-04 15:25:16 UTC
Description of problem:
The clvmd init script hangs permanently (it can't be interrupted)
on a 64-bit (x86_64) node when the same command is run
simultaneously on three 32-bit nodes in the same cluster.
The 32-bit nodes work fine, but the 64-bit node hangs.

Version-Release number of selected component (if applicable):
FC6test1

How reproducible:
I've hit this hang several times, and am confident I can reproduce it again.

Steps to Reproduce:
1. Hard power-cycle all four nodes in a cluster of three i686's
   and one x86_64.
2. Using cssh, do "service cman start" simultaneously.
3. Using cssh, use group_tool -v to make sure all nodes are
   talking properly in the cluster.
4. Using cssh, do "service clvmd start" simultaneously.
  
Actual results:
Starting clvmd:                                            [  OK  ]
(followed by a hang that's uninterruptible by <ctrl-c>)
The clvmd service hangs on the 64-bit node, but not on the others.

Expected results:
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
  3 logical volume(s) in volume group "Smoke_Cluster" now active
                                                           [  OK  ]
Additional info:
I was using the "smoke" cluster in the Minneapolis lab:
camel, merit, winston and kool.  Kool is the x86_64 node that
hangs.  All machines have LVM2 built with clustering enabled, and all
share the same SAN.

I used the sysrq trigger to dump the state of all tasks on the system,
and I'm adding that as an attachment.

Comment 1 Robert Peterson 2006-08-04 15:25:16 UTC
Created attachment 133638 [details]
Sysrq output showing the hang in vgscan.

Comment 2 Robert Peterson 2006-08-04 15:31:31 UTC
Created attachment 133640 [details]
Sysrq output from yesterday's hang doing the same thing

This output is from running the same sequence of steps yesterday.
This time, the hang appears to be in vgchange rather than vgscan,
but the result is the same: a hung "service clvmd start" command
that can't be interrupted.

Comment 3 Robert Peterson 2006-08-04 20:58:22 UTC
I tried adding a second x86_64 node to my cluster.
Doing "service clvmd start" there produced these messages:

Starting clvmd:                                            [  OK  ]
Activating VGs:   3 logical volume(s) in volume group "Smoke_Cluster" now active
  cluster request failed: Unknown error 65539
  2 logical volume(s) in volume group "VolGroup00" now active
  cluster request failed: Unknown error 65539
                                                           [  OK  ]

These were the only messages in syslog from that timeframe:

Aug  4 16:07:32 salem kernel: SCTP: Hash tables configured (established 65536 bind 65536)
Aug  4 16:07:32 salem kernel: Module sctp cannot be unloaded due to unsafe usage in net/sctp/protocol.c:1189
Aug  4 16:07:32 salem kernel: dlm: clvmd: recover 1
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 3
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 2
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 6
Aug  4 16:07:32 salem kernel: dlm: Initiating association with node 2
Aug  4 16:07:32 salem kernel: dlm: got new/restarted association 1 nodeid 3
Aug  4 16:07:32 salem kernel: dlm: COMM_UP for invalid assoc ID 0
Aug  4 16:07:32 salem kernel: dlm: got new/restarted association 3 nodeid 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: total members 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: dlm_recover_directory
Aug  4 16:07:32 salem kernel: dlm: clvmd: dlm_recover_directory 1 entries
Aug  4 16:07:32 salem kernel: dlm: clvmd: recover 1 done: 152 ms
Aug  4 16:07:41 salem clvmd: Cluster LVM daemon started - connected to CMAN

This appears to be reproducible at will.
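
As an aside on the "Unknown error 65539" text above: that wording is just
glibc's strerror() fallback for a value outside its errno table, and 65539
(0x10003) is far larger than any real errno, so clvmd is presumably handing
a cluster-internal or packed status code straight to strerror(). A minimal
demo of the libc behavior (illustrative only, not clvmd code):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* 65539 == 0x10003; there is no such errno, so glibc has no
	 * message for it and falls back to "Unknown error 65539". */
	printf("%s\n", strerror(65539));
	return 0;
}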


Comment 4 Christine Caulfield 2006-08-07 16:18:24 UTC
This is a DLM bug, introduced with the new userland code. I have a fix, and I'll
test it first thing tomorrow if you can leave me the smoke cluster.

Thanks.

Comment 5 Christine Caulfield 2006-08-08 09:06:40 UTC
Here's the patch to fix it. I shall forward it to Steve forthwith.

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 7d38f91..bb2e351 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3699,6 +3699,7 @@ int dlm_user_unlock(struct dlm_ls *ls, s
 	if (lvb_in && ua->lksb.sb_lvbptr)
 		memcpy(ua->lksb.sb_lvbptr, lvb_in, DLM_USER_LVB_LEN);
 	ua->castparam = ua_tmp->castparam;
+	ua->user_lksb = ua_tmp->user_lksb;
 
 	error = set_unlock_args(flags, ua, &args);
 	if (error)

Comment 6 Lenny Maiorani 2006-11-09 21:26:59 UTC
I am seeing similar behavior with RHEL4U3. This patch doesn't seem to apply
there, however. Any ideas on how to patch RHEL4U3 (or RHEL4U4, for that matter)?

Comment 7 Christine Caulfield 2006-11-10 08:44:02 UTC
Lenny, RHEL4 has a completely different DLM, so that patch is not relevant. The
cause will be something different.

Could you please open up a new bug against RHEL4 and post as much evidence as
you can find: syslogs, /proc/cluster/dlm_debug & /proc/cluster/services.

Thanks.

Comment 8 Lenny Maiorani 2006-11-10 20:14:40 UTC
Looks like it may have been a misconfiguration. I will keep testing and open a
new bug if I encounter it again.

Comment 9 Robert Peterson 2007-02-05 14:35:34 UTC
I haven't seen this in a while in RHEL5, so I'm calling it verified.


Comment 10 Nate Straz 2007-12-13 17:40:51 UTC
Moving all RHCS version 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.

Comment 11 Robert Peterson 2009-04-17 19:40:47 UTC
Closing as CURRENTRELEASE.

