Bug 201348 - service clvmd start hangs on x86_64 nodes in cluster
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2006-08-04 11:25 EDT by Robert Peterson
Modified: 2009-09-03 12:51 EDT (History)
4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-04-17 15:40:47 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Sysrq output showing the hang in vgscan. (120.84 KB, text/plain)
2006-08-04 11:25 EDT, Robert Peterson
Sysrq output from yesterday's hang doing the same thing (120.03 KB, text/plain)
2006-08-04 11:31 EDT, Robert Peterson

Description Robert Peterson 2006-08-04 11:25:16 EDT
Description of problem:
The clvmd init script permanently hangs (it can't be interrupted)
on a 64-bit (x86_64) node when the same command is run
simultaneously by three 32-bit nodes in the same cluster.
The 32-bit nodes work fine, but the 64-bit node hangs.

Version-Release number of selected component (if applicable):
FC6test1

How reproducible:
I've gotten this hang several times, and am sure I can do it again.

Steps to Reproduce:
1. Hard power-cycle all four nodes in a cluster of three i686's
   and one x86_64.
2. Using cssh, do "service cman start" simultaneously.
3. Using cssh, use group_tool -v to make sure all nodes are
   talking properly in the cluster.
4. Using cssh, do "service clvmd start" simultaneously.
  
Actual results:
Starting clvmd:                                            [  OK  ]
(followed by a hang that's uninterruptible by <ctrl-c>)
The clvmd service hangs on the 64-bit node, but not on the others.

Expected results:
Starting clvmd:                                            [  OK  ]
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
  3 logical volume(s) in volume group "Smoke_Cluster" now active
                                                           [  OK  ]
Additional info:
I was using the "smoke" cluster in the Minneapolis lab:
camel, merit, winston and kool.  Kool is the x86_64 node that
hangs.  All machines have LVM2 built with clustering support, and all
share the same SAN.

I used the sysrq trigger to dump the state of all tasks on the system,
and I'm adding that as an attachment.
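
The attached backtraces came from the sysrq facility; assuming the 't'
(show tasks) trigger was used, the effect is the same as running
"echo t > /proc/sysrq-trigger" as root, which can also be driven from a
small C helper like this sketch (the dump lands in the kernel log):

/* Sketch: trigger a task-state dump (SysRq 't') via /proc/sysrq-trigger. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/sysrq-trigger");
        return 1;
    }
    if (write(fd, "t", 1) != 1)
        perror("write");
    close(fd);
    return 0;
}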
Comment 1 Robert Peterson 2006-08-04 11:25:16 EDT
Created attachment 133638 [details]
Sysrq output showing the hang in vgscan.
Comment 2 Robert Peterson 2006-08-04 11:31:31 EDT
Created attachment 133640 [details]
Sysrq output from yesterday's hang doing the same thing

This output is from the same sequence of events I did yesterday.
This time, the hang appears to be in vgchange rather than vgscan,
but the result is the same: a hung "service clvmd start" command
that can't be interrupted.
Comment 3 Robert Peterson 2006-08-04 16:58:22 EDT
I tried adding a second x86_64 node to my cluster.
Doing service clvmd start there produced these messages:

Starting clvmd:                                            [  OK  ]
Activating VGs:   3 logical volume(s) in volume group "Smoke_Cluster" now active
  cluster request failed: Unknown error 65539
  2 logical volume(s) in volume group "VolGroup00" now active
  cluster request failed: Unknown error 65539
                                                           [  OK  ]
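
For what it's worth, the "Unknown error 65539" text looks like ordinary
strerror() output for a code outside the normal errno range (65539 =
0x10003), which would suggest the cluster layer is passing its own
return code straight through to the error formatter. A minimal sketch of
that behaviour (my assumption, not taken from the clvmd source):

/* glibc's strerror() prints "Unknown error N" for values it cannot map
 * to a known errno, which matches the message above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    printf("%s\n", strerror(2));      /* "No such file or directory" */
    printf("%s\n", strerror(65539));  /* "Unknown error 65539" */
    return 0;
}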

These were the only messages in syslog from that timeframe:

Aug  4 16:07:32 salem kernel: SCTP: Hash tables configured (established 65536
bind 65536)
Aug  4 16:07:32 salem kernel: Module sctp cannot be unloaded due to unsafe usage
in net/sctp/protocol.c:1189
Aug  4 16:07:32 salem kernel: dlm: clvmd: recover 1
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 3
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 2
Aug  4 16:07:32 salem kernel: dlm: clvmd: add member 6
Aug  4 16:07:32 salem kernel: dlm: Initiating association with node 2
Aug  4 16:07:32 salem kernel: dlm: got new/restarted association 1 nodeid 3
Aug  4 16:07:32 salem kernel: dlm: COMM_UP for invalid assoc ID 0
Aug  4 16:07:32 salem kernel: dlm: got new/restarted association 3 nodeid 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: total members 4
Aug  4 16:07:32 salem kernel: dlm: clvmd: dlm_recover_directory
Aug  4 16:07:32 salem kernel: dlm: clvmd: dlm_recover_directory 1 entries
Aug  4 16:07:32 salem kernel: dlm: clvmd: recover 1 done: 152 ms
Aug  4 16:07:41 salem clvmd: Cluster LVM daemon started - connected to CMAN

This seems to be reproducible at will.
Comment 4 Christine Caulfield 2006-08-07 12:18:24 EDT
This is a DLM bug, introduced with the new userland code. I have a fix and I'll
test it first thing tomorrow if you can leave me the smoke cluster.

Thanks.
Comment 5 Christine Caulfield 2006-08-08 05:06:40 EDT
Here's the patch to fix it. I shall forward it to Steve forthwith.

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 7d38f91..bb2e351 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3699,6 +3699,7 @@ int dlm_user_unlock(struct dlm_ls *ls, s
 	if (lvb_in && ua->lksb.sb_lvbptr)
 		memcpy(ua->lksb.sb_lvbptr, lvb_in, DLM_USER_LVB_LEN);
 	ua->castparam = ua_tmp->castparam;
+	ua->user_lksb = ua_tmp->user_lksb;
 
 	error = set_unlock_args(flags, ua, &args);
 	if (error)
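
As the diff shows, dlm_user_unlock refreshed the caller's castparam but
(before the patch) not the caller's lksb address, so presumably a later
completion could be matched against a stale user-space pointer. The
following toy program is an illustration only (hypothetical struct and
field names, not the DLM code) of that class of bug and the shape of the
one-line fix:

/* Illustrative sketch of the hazard: when a request is rebuilt from a
 * fresh copy of the user's arguments, every field the completion path
 * later consumes must be refreshed, including pointers back into
 * user space. */
#include <stdio.h>
#include <stdint.h>

struct user_args {
    void     *castparam;   /* completion callback cookie */
    uint64_t  user_lksb;   /* user-space lock status block address */
};

static void build_unlock_buggy(struct user_args *ua, const struct user_args *ua_tmp)
{
    /* only the callback cookie is refreshed; user_lksb keeps its old value */
    ua->castparam = ua_tmp->castparam;
}

static void build_unlock_fixed(struct user_args *ua, const struct user_args *ua_tmp)
{
    ua->castparam = ua_tmp->castparam;
    ua->user_lksb = ua_tmp->user_lksb;   /* the line the patch adds */
}

int main(void)
{
    struct user_args kernel_side = { .castparam = 0, .user_lksb = 0x1000 };
    struct user_args from_user   = { .castparam = (void *)0x1, .user_lksb = 0x2000 };

    build_unlock_buggy(&kernel_side, &from_user);
    printf("buggy: completion would target lksb at %#llx (stale)\n",
           (unsigned long long)kernel_side.user_lksb);

    build_unlock_fixed(&kernel_side, &from_user);
    printf("fixed: completion would target lksb at %#llx\n",
           (unsigned long long)kernel_side.user_lksb);
    return 0;
}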
Comment 6 Lenny Maiorani 2006-11-09 16:26:59 EST
I am seeing similar behavior with RHEL4U3. This patch doesn't seem to apply
however. Any ideas on how to patch RHEL4U3? (or RHEL4U4 for that matter)
Comment 7 Christine Caulfield 2006-11-10 03:44:02 EST
Lenny, RHEL4 has a completely different DLM, so that patch is not relevant. The
cause will be something different.

Could you please open up a new bug against RHEL4 and post as much evidence as
you can find: syslogs, /proc/cluster/dlm_debug & /proc/cluster/services.

Thanks.
Comment 8 Lenny Maiorani 2006-11-10 15:14:40 EST
Looks like it may have been a misconfiguration. I will keep testing and open a
new bug if I encounter it again.
Comment 9 Robert Peterson 2007-02-05 09:35:34 EST
I haven't seen this in a while in RHEL5, so I'm calling it verified.
Comment 10 Nate Straz 2007-12-13 12:40:51 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
Comment 11 Robert Peterson 2009-04-17 15:40:47 EDT
Closing CURRENT_RELEASE
