Description of problem:
I am seeing this repeatedly on my nodes when attempting to create PVs. The 'pvcreate /dev/sda' cmd just appears to hang. This is very similar to bz134353.

Here is where the strace shows it's at:

open("/lib/liblvm2clusterlock.so", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0L\7\0\000"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0555, st_size=9284, ...}) = 0
old_mmap(NULL, 9716, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40218000
old_mmap(0x4021a000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x1000) = 0x4021a000
close(3) = 0
socket(PF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX, path=@clvmd}, 110) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN], [], 8) = 0
write(3, "3\1\0\277\0\0\0\0\0\0\0\0\f\0\0\0\0\4\0P_orphans\0+", 30) = 30
read(3,

How reproducible:
Always
This is almost certainly the same thing as 134353, and the same comment applies. Can you provide some clvmd debugging, please, as I can't make it fail here?

Reading as much as I can into the strace, it looks like a simple VG lock which is waiting (in clvmd). There are three possible scenarios for that:

1. Locking is suspended because either the cluster isn't quorate or the lockspace is in recovery.
   cat /proc/cluster/services
   is the diagnostic here, obviously.

2. Some other process/machine already has the P_orphans lock.
   echo "clvmd" > /proc/cluster/dlm_locks
   cat /proc/cluster/dlm_locks
   will help to check for this.

3. There's a bug somewhere in the chain, most likely in clvmd but possibly in the dlm userland interface.
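The check for scenario 1 can be scripted. The following is only a rough sketch: it assumes that /proc/cluster/services prints one row per service with a quoted name followed by GID, LID, State and Code columns, and it treats any state other than "run" as suspicious. The sample text, the regular expression and the function names are all made up for illustration and are not part of cman or clvmd.

```python
import re

# Hypothetical sample in the layout /proc/cluster/services is assumed to use.
SAMPLE = '''\
Service Name GID LID State Code
Fence Domain: "default" 1 2 run - [1 3 4 5 2]
DLM Lock Space: "clvmd" 2 3 run - [1 2 3 5 4]
'''

# Assumed row layout: <kind>: "<name>" <gid> <lid> <state> <code> ...
ROW = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)\s+(?P<code>\S+)'
)

def parse_services(text):
    """Return one dict per service row; the header and other lines are skipped."""
    services = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            services.append(match.groupdict())
    return services

def not_running(services):
    """Return the services whose State column is anything other than 'run'."""
    return [s for s in services if s["state"] != "run"]

if __name__ == "__main__":
    for svc in not_running(parse_services(SAMPLE)):
        print("%s %s is in state %s" % (svc["kind"], svc["name"], svc["state"]))
```

On a live node the text would come from reading /proc/cluster/services instead of SAMPLE. An empty result only rules out scenario 1; it says nothing about scenarios 2 and 3.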
At the time of the hang all nodes show that all services are in the run state:

[root@morph-01 clvmd]# cat /proc/cluster/services
Service          Name       GID  LID  State  Code
Fence Domain:    "default"    1    2  run    -     [1 3 4 5 2]
DLM Lock Space:  "clvmd"      2    3  run    -     [1 2 3 5 4]

[root@morph-01 clvmd]# ps -ef | grep pvcreate
root      4490  4488  0 10:02 ?        00:00:00 pvcreate /dev/sda
root      4501  2153  0 10:12 pts/0    00:00:00 grep pvcreate

[root@morph-01 clvmd]# echo "clvmd" > /proc/cluster/dlm_locks
[root@morph-01 clvmd]# cat /proc/cluster/dlm_locks
DLM lockspace 'clvmd'

Resource f4d60934 (parent 00000000). Name (len=9) "P_orphans"
Master Copy
Granted Queue
0001002c PW
Conversion Queue
Waiting Queue
[root@morph-01 clvmd]#
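To read a dump like the one above mechanically, a small parser can group the lock entries under their queue headings. This is a minimal sketch that assumes the layout shown in this report (a "Resource" line followed by "Granted Queue", "Conversion Queue" and "Waiting Queue" headings, each with zero or more "<lock id> <mode>" lines); the function name and data layout are illustrative only.

```python
import re

# Dump copied from the dlm_locks output in this report.
SAMPLE = """\
DLM lockspace 'clvmd'

Resource f4d60934 (parent 00000000). Name (len=9) "P_orphans"
Master Copy
Granted Queue
0001002c PW
Conversion Queue
Waiting Queue
"""

RESOURCE = re.compile(r'^Resource\s+(\S+).*Name \(len=\d+\)\s+"([^"]*)"')
QUEUES = ("Granted Queue", "Conversion Queue", "Waiting Queue")

def parse_dlm_locks(text):
    """Return a list of resources, each with its locks grouped by queue."""
    resources = []
    current = None   # resource currently being filled in
    queue = None     # queue heading most recently seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = RESOURCE.match(line)
        if match:
            current = {
                "id": match.group(1),
                "name": match.group(2),
                "queues": {q: [] for q in QUEUES},
            }
            resources.append(current)
            queue = None
        elif line in QUEUES:
            queue = line
        elif current is not None and queue is not None:
            # Assumed entry layout: "<lock id> <mode>", e.g. "0001002c PW".
            parts = line.split()
            mode = parts[1] if len(parts) > 1 else ""
            current["queues"][queue].append((parts[0], mode))
    return resources

if __name__ == "__main__":
    for res in parse_dlm_locks(SAMPLE):
        for name, locks in res["queues"].items():
            print(res["name"], name, locks)
```

For the dump above this yields a single PW entry on the granted queue of P_orphans and nothing on the conversion or waiting queues, so no request is queued behind that lock.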
Ok, so we know it's getting the lock OK. That means it's probably clvmd not replying for some reason. I copied your clvmd onto my cluster and it still works fine, so I'm going to need some help. Can you build a debug clvmd and get some output from it? Or leave me your cluster to play with one morning?
Here's a crazy twist: I can only see this bug if clvmd is run as a daemon. If I run it in the foreground (no daemonization), I don't see the bug and everything runs just fine.
That was the clue I needed. See 134353 for more details.
Fix verified.
Updating version to the right level in the defects. Sorry for the storm.