Bug 318851 - deadlock attempting to start clvmd with latest kernel and cluster build
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm-kernel
Version: 4
Platform: All Linux
Priority: high
Severity: high
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Keywords: TestBlocker
Depends On:
Blocks:
Reported: 2007-10-04 13:45 EDT by Corey Marthaler
Modified: 2009-04-16 15:49 EDT (History)
7 users

See Also:
Fixed In Version: RHBA-2007-0995
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-11-21 16:55:58 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Patch to fix (1.21 KB, patch)
2007-10-05 05:51 EDT, Christine Caulfield

Description Corey Marthaler 2007-10-04 13:45:21 EDT
Description of problem:
I can't get clvmd to start with this latest rhel4u6 build.

[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-60.ELsmp #1 SMP Tue Sep 25 22:55:08 EDT 2007 x86_64 x86_64
x86_64 GNU/Linux
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.27-2.el4
[root@link-08 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-38.1
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.27-2.el4
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.21-1.el4


[root@link-08 ~]# clvmd -d
CLVMD[9566b800]: Oct  4 11:56:37 CLVMD started
CLVMD[9566b800]: Oct  4 11:56:37 Connected to CMAN
CLVMD[9566b800]: Oct  4 11:56:37 CMAN initialisation complete
CLVMD[9566b800]: Oct  4 11:56:41 DLM initialisation complete
CLVMD[9566b800]: Oct  4 11:56:41 Cluster ready, doing some more initialisation
CLVMD[9566b800]: Oct  4 11:56:41 starting LVM thread
CLVMD[9566b800]: Oct  4 11:56:41 clvmd ready for work
CLVMD[9566b800]: Oct  4 11:56:41 Using timeout of 60 seconds
CLVMD[9566b800]: Oct  4 11:56:41 add_to_lvmqueue: cmd=0x567470. client=0x561a00,
msg=0x7fbfffeaf0, len=30, csid=0x7fbfffead4, xid=0
CLVMD[41401960]: Oct  4 11:56:41 LVM thread function started
File descriptor 4 left open
  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
CLVMD[41401960]: Oct  4 11:56:41 LVM thread waiting for work
CLVMD[41401960]: Oct  4 11:56:41 process_work_item: remote
CLVMD[41401960]: Oct  4 11:56:41 process_remote_command 2 for clientid 0x0 XID
49151 on node link-07
CLVMD[41401960]: Oct  4 11:56:41 Remote node link-07 is version 0.2.1
CLVMD[41401960]: Oct  4 11:56:41 Added new node 1 to updown list
CLVMD[41401960]: Oct  4 11:56:41 LVM thread waiting for work
CLVMD[9566b800]: Oct  4 11:56:46 add_to_lvmqueue: cmd=0x567470. client=0x561a00,
msg=0x7fbfffeaf0, len=30, csid=0x7fbfffead4, xid=0
CLVMD[41401960]: Oct  4 11:56:46 process_work_item: remote
CLVMD[41401960]: Oct  4 11:56:46 process_remote_command 2 for clientid 0x0 XID
49151 on node link-02
CLVMD[41401960]: Oct  4 11:56:46 Remote node link-02 is version 0.2.1
CLVMD[41401960]: Oct  4 11:56:46 Added new node 3 to updown list
CLVMD[41401960]: Oct  4 11:56:46 LVM thread waiting for work
CLVMD[9566b800]: Oct  4 11:57:04 Got new connection on fd 8
CLVMD[9566b800]: Oct  4 11:57:04 Read on local socket 8, len = 33
CLVMD[9566b800]: Oct  4 11:57:04 creating pipe, [9, 10]
CLVMD[9566b800]: Oct  4 11:57:04 Creating pre&post thread
CLVMD[9566b800]: Oct  4 11:57:04 Created pre&post thread, state = 0
CLVMD[41e02960]: Oct  4 11:57:04 in sub thread: client = 0x581410
CLVMD[41e02960]: Oct  4 11:57:04 Sub thread ready for work.
CLVMD[41e02960]: Oct  4 11:57:04 doing PRE command LOCK_VG 'V_VolGroup00' at 1
(client=0x581410)
CLVMD[41e02960]: Oct  4 11:57:04 sync_lock: 'V_VolGroup00' mode:3 flags=0
CLVMD[41e02960]: Oct  4 11:57:04 sync_lock: returning lkid 1010b
CLVMD[41e02960]: Oct  4 11:57:04 Writing status 0 down pipe 10
CLVMD[9566b800]: Oct  4 11:57:04 read on PIPE 9: 4 bytes: status: 0
CLVMD[9566b800]: Oct  4 11:57:04 background routine status was 0,
sock_client=0x581410
CLVMD[9566b800]: Oct  4 11:57:04 distribute command: XID = 0
CLVMD[9566b800]: Oct  4 11:57:04 add_to_lvmqueue: cmd=0x5817e0. client=0x581410,
msg=0x569d50, len=33, csid=(nil), xid=0
CLVMD[41401960]: Oct  4 11:57:04 process_work_item: local
CLVMD[41401960]: Oct  4 11:57:04 process_local_command: msg=0x581830, msglen
=33, client=0x581410
CLVMD[41401960]: Oct  4 11:57:04 Reply from node link-08: 0 bytes
CLVMD[41401960]: Oct  4 11:57:04 Got 1 replies, expecting: 1
CLVMD[41e02960]: Oct  4 11:57:04 Waiting to do post command - state = 0
CLVMD[41401960]: Oct  4 11:57:04 LVM thread waiting for work
CLVMD[41e02960]: Oct  4 11:57:04 Got post command condition...
CLVMD[41e02960]: Oct  4 11:57:04 Waiting for next pre command
CLVMD[9566b800]: Oct  4 11:57:04 read on PIPE 9: 4 bytes: status: 0
CLVMD[9566b800]: Oct  4 11:57:04 background routine status was 0,
sock_client=0x581410
CLVMD[9566b800]: Oct  4 11:57:04 Send local reply
CLVMD[9566b800]: Oct  4 11:57:04 Read on local socket 8, len = 33
CLVMD[41e02960]: Oct  4 11:57:04 Got pre command condition...
CLVMD[41e02960]: Oct  4 11:57:04 doing PRE command LOCK_VG 'V_VolGroup00' at 6
(client=0x581410)
CLVMD[41e02960]: Oct  4 11:57:04 sync_unlock: 'V_VolGroup00' lkid:1010b

[root@link-08 ~]# strace pvs
[...]
lseek(4, 4096, SEEK_SET)                = 4096
read(4, "O\350\304\332 LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
close(4)                                = 0
stat("/dev/hda2", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 2), ...}) = 0
stat("/dev/hda2", {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 2), ...}) = 0
open("/dev/hda2", O_RDONLY|O_DIRECT|O_NOATIME) = 4
fstat(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(3, 2), ...}) = 0
ioctl(4, BLKBSZGET, 0x595e10)           = 0
lseek(4, 4096, SEEK_SET)                = 4096
read(4, "O\350\304\332 LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
lseek(4, 0, SEEK_SET)                   = 0
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
2048) = 2048
lseek(4, 4096, SEEK_SET)                = 4096
read(4, "O\350\304\332 LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
lseek(4, 6144, SEEK_SET)                = 6144
read(4, "ype = \"striped\"\nstripe_count = 1"..., 2048) = 2048
close(4)                                = 0
stat("/proc/lvm/VGs/VolGroup00", 0x7fbfffd630) = -1 ENOENT (No such file or
directory)
write(3, "3\1\377\277\0\0\0\0\0\0\0\0\17\0\0\0\0\6\4V_VolGroup00\0"..., 33) = 33
read(3, 
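The strace ends with pvs blocked in read(3, ...) immediately after writing its LOCK_VG request to the clvmd local socket. A minimal sketch of that request/reply pattern (plain Python sockets, not the real clvmd wire protocol; the message bytes are illustrative) shows how a reply that never arrives leaves the client stuck in read():

```python
import socket
import select

# Hypothetical illustration: pvs writes a LOCK_VG request to the clvmd
# socket (fd 3 in the strace above), then issues a blocking read() for
# the reply. If clvmd's DLM lock never completes, no reply is ever
# written back and the client hangs forever.
client, server = socket.socketpair()

client.sendall(b"LOCK_VG V_VolGroup00")  # the write(3, ...) = 33 step
request = server.recv(64)                # "clvmd" receives the request
# ...the DLM completion never fires, so no reply is sent back...

# The client's read() would now block indefinitely; a select() with a
# timeout makes the stall observable instead of hanging this demo.
readable, _, _ = select.select([client], [], [], 0.5)
print("reply ready" if readable else "client blocked in read()")
```

This is why every LVM command that takes a VG lock hangs: the request is delivered, but the reply path is stalled inside the kernel DLM.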

How reproducible:
every time
Comment 1 Corey Marthaler 2007-10-04 15:24:26 EDT
I attempted this with all my storage disconnected and still ended up seeing this
while attempting to activate VGs. The only VG on the system is the root vol.

This appears to be a major regression. All I did was upgrade the kernel from .59 to .60 (build RHEL4-U6-re20070927.0) along with the cluster bits from 2007-10-03-1629-RHEL4U6.

What could possibly be misconfigured here?
Comment 2 Corey Marthaler 2007-10-04 15:40:41 EDT
I booted up the .59 kernel again, and everything worked fine.
Comment 3 Corey Marthaler 2007-10-04 16:10:56 EDT
If I run the .59 modules on the .60 kernel, all seems to work.

59 kern:
CMAN 2.6.9-51.6 (built Sep 13 2007 18:40:54) installed
DLM 2.6.9-48.1 (built Sep 13 2007 19:12:44) installed

60 kern:
CMAN 2.6.9-52.2 (built Oct  1 2007 17:50:24) installed
DLM 2.6.9-49.1 (built Oct  1 2007 18:15:47) installed
Comment 4 Christine Caulfield 2007-10-05 05:51:01 EDT
Created attachment 217191 [details]
Patch to fix

I didn't really think that patch would get into 4.6 to be quite honest!

Two options here I suppose.
1) revert that patch
2) fix it using the attached patch (which also makes it more consistent with
RHEL5)

I've tested it with both clvmd and the PF test program and it passes both.
Comment 5 Christine Caulfield 2007-10-08 04:57:20 EDT
Just to be clear, by "that patch" I mean the one to fix bz#306391
Comment 7 Christine Caulfield 2007-10-09 03:24:28 EDT
Checking in device.c;
/cvs/cluster/cluster/dlm-kernel/src/Attic/device.c,v  <--  device.c
new revision: 1.24.2.11.2.1; previous revision: 1.24.2.11
done
Comment 10 errata-xmlrpc 2007-11-21 16:55:58 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0995.html
