Bug 232878
Description
Fabio Massimo Di Nitto
2007-03-19 06:48:38 UTC
Created attachment 150351 [details]
cluster configuration
Created attachment 150352 [details]
strace -a 80 -s 80 -f -o /home/fabbione/rgmanager.stracer clurgmgrd
the strace shows that: 13035 open("/dev/misc/dlm_rgmanager", O_RDWR) = -1 ENODEV (No such device) but: root@diapolon:/dev/misc# ls -als total 0 0 drwxr-xr-x 2 root root 120 2007-03-19 07:45 . 0 drwxr-xr-x 15 root root 13620 2007-03-19 06:42 .. 0 crw-rw---- 1 root root 10, 60 2007-03-19 06:42 dlm_clvmd 0 crw-rw---- 1 root root 10, 62 2007-03-19 06:42 dlm-control 0 crw-r--r-- 1 root root 10, 0 2007-03-19 07:45 dlm_rgmanager 0 crw------- 1 root root 10, 61 2007-03-19 06:42 lock_dlm_plock and at last: golion: cman_tool services type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00020001 none [1 2] diapolon (where rgmanager fails to start and strace taken): cman_tool services type level name id state fence 0 default 00010001 none [1 2] dlm 1 clvmd 00020001 none [1 2] dlm 1 rgmanager 00030001 none [1] hope this is enough info. Anyway i can reproduce this everytime in matter of minutes, so if you need more info please don't esitate to ask. It's possible for me to grant access to the cluster since it's just a sandbox. Fabio There's something odd going on with udev that we've seen in other places. It seems (sometimes!) to create the dlm lockspace devices in /dev rather than /dev/misc. This can cause a 10 second delay in libdlm as it waits fruitlessly for the device to appear before giving up and creating it itself. There could be a race condition here then maybe ? udev seems to behave correctly. checked both rules and tested for existance of /dev/dlm* while operating. In attachement dlm_controld -D output from both nodes while reproducing the problem. (i added a bunch of --- status --- to make it easier to understand what's happening when). Fabio Created attachment 150357 [details]
dlm_controld debug output from node diapolon
Created attachment 150358 [details]
dlm_controld debug output from node golion
13035 open("/dev/misc/dlm_rgmanager", O_RDWR) = -1 ENODEV (No such device) Eeeeeeek. Ok, ENODEV fails with the open() call; ostensibly in libdlm. Patrick - if there's a race where it sometimes creates in the wrong location - would it make sense to have libdlm check the alternate location? The problem is that the device reports that it's there. It's almost as if the kernel->device linkage-magic is gone. ...and by "device reports" I mean "file system reports" Adding Ryan - I am pretty sure he's seen this on his cluster. Created attachment 150369 [details]
/proc/misc node diapolon before crash
Created attachment 150370 [details]
/proc/misc node diapolon after crash
Created attachment 150371 [details]
/proc/misc node golion before crash
Created attachment 150373 [details]
/proc/misc node golion after crash
This looks like it might be a race condition in lockspace teardown. What seems to be happening is that rgmanager is removing the lockspace when it shuts down, but this takes some time to happen. When rgmanager starts up again, it tries to recreate the lockspace. But it's already shutting down so it returns -EEXIST because it's still there. By the time rgmanager (in libdlm) gets to the point where it needs to check the device minor then lockspace has gone (no record of it in /proc/misc) so it creates a stupid device with a minor of 0. Access this always return -ENODEV. The easiest way to fix this (I think) is to patch libdlm so that if the lockspace does not exist in /proc/misc it loops back to attempt to create it again. Created attachment 150378 [details]
Very basic patch to test theory
The patch helps only partially. The service still does not failover as it should, but i can clurgmgr still running (it doesn't crash and it can start again) but it doesn't take any action on what needs to be done. cman_tools services still show a discrepancy on dlm services as posted before. Fabio After another run I noticed that the rgmanager process that before used to crash, is now spinning 100% CPU on this: write(7, "\5\0\0\0\0\0\0\0\0\0\0\0\4\0\360\267\0\0\0\0\1\0\0\0rg"..., 89) = -1 EEXIST (File exists) open("/proc/misc", O_RDONLY) = 8 fstat64(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eed000 read(8, " 58 acpi\n 60 dlm_clvmd\n 61 lock_"..., 1024) = 155 read(8, "", 1024) = 0 close(8) = 0 munmap(0xb7eed000, 4096) = 0 write(7, "\5\0\0\0\0\0\0\0\0\0\0\0\4\0\360\267\0\0\0\0\1\0\0\0rg"..., 89) = -1 EEXIST (File exists) open("/proc/misc", O_RDONLY) = 8 fstat64(8, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eed000 read(8, " 58 acpi\n 60 dlm_clvmd\n 61 lock_"..., 1024) = 155 read(8, "", 1024) = 0 close(8) = 0 munmap(0xb7eed000, 4096) = 0 write(7, "\5\0\0\0\0\0\0\0\0\0\0\0\4\0\360\267\0\0\0\0\1\0\0\0rg"..., 89) = -1 EEXIST (File exists) root@diapolon:~# cat /proc/misc 58 acpi 60 dlm_clvmd 61 lock_dlm_plock 62 dlm-control 175 agpgart 130 watchdog 183 hw_random 63 device-mapper 1 psaux 228 hpet 135 rtc 231 snapshot so i guess this is the libdlm patch from this bug that's trying to wait but never exits. Fabio So that looks like it might be a DLM kernel bug then. It's returning -EEXIST for a lockspace that exists but doesn't have a misc device associated with it. Created attachment 150470 [details]
strace of rgmanager that dies on node diapolon.
After rgmanager dies we can't restart it anylonger
Created attachment 150472 [details]
rgmanager strace on node diapolon at the first start and first stop
I see what's happening. The DLM removes the misc device, then attempts to remove the lockspace. But if the lockspace removal fails (as it can if the LS is mastering locks) then we are left with a lockspace and no associated misc device. A temporary solution is to force release the lockspace in rgmanager. This is OK because rgmanager is the only application using that lockspace on this node, and it has the benficial side-effect of remastering any locks away from this node. I'm working on a DLM patch that will recreate the misc device if the release lockspace fails and will post it shortly. Removing the lockspace BEFORE the misc device is a bad idea because it leaves a situation where an application could open the misc device that is not associated with a lockspace; with, er, interesting results I'm sure. This way leaves a small space where the misc device does not exist but then we are trying to destroy a lockspace so it's not really bad. Hi, I applied Patrick's fix to rgmanager in the RHEL5/50/HEAD trees. And the kernel fix is now in Steve's git tree Fixing Product Name. Cluster Suite was merged into Red Hat Enterpise Linux for 5.0. In addition dlm, fence and ccs were merged into the cman package, so bugzilla should reflect package name where those utilities are located. This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release. Closing as this was fixed in kernel-2.6.18-8.el5. |