Description of problem:
Recovery is hung on my four-node x86_64 taft cluster. taft-03 was shot and then fenced by taft-01, but its join attempt is stuck because recovery is hung on the remaining nodes. I've left the machines in this state for 23 hours now. This appears very similar to bz145683, which was eventually closed/blamed on cheap storage. The taft cluster, however, is using the "good" winchester storage.

[root@taft-03 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   taft-03
   2    1    4   M   taft-01
   3    1    4   M   taft-04
   4    1    4   M   taft-02

[root@taft-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       U-1,10,1
[3 2 4]

DLM Lock Space:  "clvmd"                             2   3 recover 4 -
[3 2 4]

DLM Lock Space:  "taft0"                             3   4 recover 4 -
[3 2 4]

DLM Lock Space:  "taft1"                             5   6 recover 4 -
[3 2 4]

GFS Mount Group: "taft0"                             4   5 recover 0 -
[3 2 4]

GFS Mount Group: "taft1"                             6   7 recover 0 -
[3 2 4]

[root@taft-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       U-1,10,1
[2 3 4]

DLM Lock Space:  "clvmd"                             2   3 recover 2 -
[2 3 4]

DLM Lock Space:  "taft0"                             3   4 recover 4 -
[2 3 4]

DLM Lock Space:  "taft1"                             5   6 recover 4 -
[2 3 4]

GFS Mount Group: "taft0"                             4   5 recover 0 -
[2 3 4]

GFS Mount Group: "taft1"                             6   7 recover 0 -
[2 3 4]

[root@taft-03 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 join      S-4,4,1
[3 2 4 1]

[root@taft-04 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       U-1,10,1
[3 2 4]

DLM Lock Space:  "clvmd"                             2   3 recover 4 -
[3 2 4]

DLM Lock Space:  "taft0"                             3   4 recover 2 -
[3 2 4]

DLM Lock Space:  "taft1"                             5   6 recover 2 -
[3 2 4]

GFS Mount Group: "taft0"                             4   5 recover 0 -
[3 2 4]

GFS Mount Group: "taft1"                             6   7 recover 0 -
[3 2 4]

LOGS:
Nov 28 15:09:42 taft-01 kernel: CMAN: node taft-03 has been removed from the cluster : Missed too many heartbeats
Nov 28 15:10:13 taft-01 fenced[3759]: taft-03 not a cluster member after 30 sec post_fail_delay
Nov 28 15:10:13 taft-01 fenced[3759]: fencing node "taft-03"
Nov 28 15:10:13 taft-01 fenced[3759]: fence "taft-03" success
Nov 28 15:12:02 taft-01 kernel: CMAN: node taft-03 rejoining

Nov 28 15:10:58 taft-02 kernel: CMAN: node taft-03 has been removed from the cluster : Missed too many heartbeats
Nov 28 15:10:59 taft-02 fenced[3997]: fencing deferred to taft-01
Nov 28 15:13:18 taft-02 kernel: CMAN: node taft-03 rejoining

Nov 28 15:09:44 taft-04 kernel: CMAN: removing node taft-03 from the cluster : Missed too many heartbeats
Nov 28 15:09:45 taft-04 fenced[3508]: fencing deferred to taft-01
Nov 28 15:12:04 taft-04 kernel: CMAN: node taft-03 rejoining

The only suspicious messages possibly relating to I/O are these:

Nov 28 15:10:33 taft-02 kernel: qla2300 0000:0a:02.0: scsi(1:0:0:1): Queue depth adjusted to -1.

but the queue depth is still at 64 according to proc:

[root@taft-02 ~]# cat /sys/block/sda/device/queue_depth
64

[root@taft-01 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      18905030
Unlock operations:    18764410
Convert operations:   44704613
Completion ASTs:      82376637
Blocking ASTs:           21806

Lockqueue        num   waittime  ave
WAIT_RSB    13833766  260064409   18
WAIT_CONV      35083      58832    1
WAIT_GRANT    148038    1357422    9
WAIT_UNLOCK    43212     584744   13
Total       14060099  262065407   18

[root@taft-02 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      26342
Unlock operations:     5199
Convert operations:   20175
Completion ASTs:      51709
Blocking ASTs:            5

Lockqueue      num  waittime  ave
WAIT_RSB      7186     90434   12
WAIT_CONV        2         0    0
WAIT_GRANT   17858      8659    0
WAIT_UNLOCK     36       406   11
Total        25082     99499    3

[root@taft-03 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      0
Unlock operations:    0
Convert operations:   0
Completion ASTs:      0
Blocking ASTs:        0

Lockqueue        num   waittime  ave

[root@taft-04 ~]# cat /proc/cluster/dlm_stats
DLM stats (HZ=1000)

Lock operations:      50511724
Unlock operations:     8699484
Convert operations:   21196276
Completion ASTs:      80408627
Blocking ASTs:           24687

Lockqueue        num   waittime  ave
WAIT_RSB     6318725   99897200   15
WAIT_CONV      10763      44975    4
WAIT_GRANT  41821869   21734846    0
WAIT_UNLOCK    37358     232860    6
Total       48188715  121909881    2

Version-Release number of selected component (if applicable):
DLM 2.6.9-39.0 (built Nov 14 2005 17:38:14) installed
Lock_Harness 2.6.9-44.0 (built Nov 17 2005 15:43:18) installed
GFS 2.6.9-44.0 (built Nov 17 2005 15:43:35) installed
Lock_Nolock 2.6.9-44.0 (built Nov 17 2005 15:43:19)

How reproducible:
Only once so far
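To spot which groups are wedged without eyeballing each node's dump, a small filter like this can be run over /proc/cluster/services output. The `stuck_services` name and the ssh loop are hypothetical helpers of mine, not part of the cluster suite; the field layout is assumed from the dumps above.

```shell
# stuck_services: filter /proc/cluster/services-style output (stdin) down to
# service groups whose State column is recover/join/update instead of "run".
# Assumed line layout:  <type>: "<name>" <GID> <LID> <state> <code>
stuck_services() {
    grep -E '"[^"]+" +[0-9]+ +[0-9]+ +(recover|join|update)'
}

# Example (node names from this report):
# for n in taft-01 taft-02 taft-03 taft-04; do
#     echo "== $n =="
#     ssh "$n" cat /proc/cluster/services | stuck_services
# done
```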
more debug info: [root@taft-01 ~]# cat /proc/cluster/lock_dlm/debug plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 45136ee8-4691641f lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 46916420-46ab8700 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 46ab8701-46bcc087 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 46e9a7c7-46f8bfab lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 46f8bfac-46fe4fa8 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 46fe4fa8-47076a99 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 47076a9a-4718b65a lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4718b65a-47223967 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 47223967-4722a9a7 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4722a9a7-4722ef59 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4724e129-4724f49e lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4724f49f-4725101e lkf 2000 
wait 1 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 0-a713ce2 lkf 2000 wait 1 6589 ex plock 0 6585 en plock 7,38 6585 req 7,38 ex 41b48b0a-49e24f0e lkf 2000 wait 1 6585 ex plock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6585 en plock 7,38 6585 req 7,38 ex 4a32afea-4b9e61c1 lkf 2000 wait 1 6585 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6585 en plock 7,38 6585 req 7,38 ex 4b9e61c1-4bcfec8e lkf 2000 wait 1 6585 ex plock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6585 en plock 7,38 6585 req 7,38 ex 4bcfec8f-4bd124c3 lkf 2000 wait 1 6585 ex plock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6585 en plock 7,38 6585 req 7,38 ex 4bdf76e6-4be17591 lkf 2000 wait 1 6585 ex plock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 42101a3b-4435a61c lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4435a61d-472389d0 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4724b357-472500e8 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 472500e9-4725062e lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 4725062f-472511d0 lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6589 en plock 7,1131326 6589 req 7,1131326 ex 472511d0-4725170a lkf 2000 wait 1 6589 ex plock 0 6589 en punlock 7,1131326 6589 remove 7,1131326 6589 ex punlock 0 6585 en plock 7,38 
6585 req 7,38 ex 43e0c2c5-46552550 lkf 2000 wait 1 6585 ex plock 0 6585 en punlock 7,38 6585 remove 7,38 6585 ex punlock 0 6585 en plock 7,38 6589 en plock 7,1131326 [root@taft-02 2.6.9-22.EL-x86_64]# cat /proc/cluster/lock_dlm/debug 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 45396710-470a291e lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 470a291f-470b27e6 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 470ff90f-471af2a3 lkf 2000 wait 1 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 471af2a3-471cb7a1 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 471cb7a1-4724f822 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 4724f822-47250bcd lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 47250bce-4725103e lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 43d9b2f9-4842b7a6 lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4842b7a6-48f07621 lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4a7c5d85-4b8cd2ce lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4be27edc-4be64dee lkf 2000 wait 
1 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 23175dce-3857436e lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4be7f58e-4be8cb32 lkf 2000 wait 1 6249 ex plock 0 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4be90af6-4be939ed lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 4be939ed-4be94bfb lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 0-2903368e lkf 2000 wait 1 6203 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 42efe459-46e60834 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 46e60835-47084344 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 47084344-47088246 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 4721da5a-47236c0a lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 47236c0a-4723cb6c lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 4724ee07-4724f5c4 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 ex 4724f5c4-4725024a lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6249 en plock 7,10fd4d0 6249 req 7,10fd4d0 
ex 47250b79-47251519 lkf 2000 wait 1 6249 ex plock 0 6249 en punlock 7,10fd4d0 6249 remove 7,10fd4d0 6249 ex punlock 0 6203 en punlock 7,10fd4d9 6203 remove 7,10fd4d9 6203 ex punlock 0 6203 en plock 7,10fd4d9 6203 req 7,10fd4d9 ex 347e0420-3dbcdc91 lkf 2000 wait 1 6203 ex plock 0 6203 en punlock 7,10fd4d9 6249 en plock 7,10fd4d0 [root@taft-04 ~]# cat /proc/cluster/lock_dlm/debug 07 en plock 7,330fcd2 9308 req 7,330fcd2 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 ex plock 0 9292 ex plock 0 9310 ex plock 0 9285 ex plock 0 9308 en punlock 7,330fcd2 9308 remove 7,330fcd2 9308 ex punlock 0 9308 en plock 7,330fcd1 9307 req 7,330fcd2 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 req 7,330fcd1 ex 0-7fffffffffffffff lkf 2000 wait 1 9307 ex plock 0 9308 ex plock 0 9310 en punlock 7,330fcd3 9310 remove 7,330fcd3 9310 ex punlock 0 9310 en plock 7,330fcd2 9306 req 7,330fcd3 ex 0-7fffffffffffffff lkf 2000 wait 1 9306 ex plock 0 9307 en punlock 7,330fcd2 9307 remove 7,330fcd2 9307 ex punlock 0 9307 en plock 7,330fcd1 9310 req 7,330fcd2 ex 0-7fffffffffffffff lkf 2000 wait 1 9310 ex plock 0 9308 en punlock 7,330fcd1 9308 remove 7,330fcd1 9308 ex punlock 0 9308 en plock 7,330fcd2 9307 req 7,330fcd1 ex 0-7fffffffffffffff lkf 2000 wait 1 9306 en punlock 7,330fcd3 9306 remove 7,330fcd3 9306 ex punlock 0 9306 en plock 7,330fcd2 9310 en punlock 7,330fcd2 9310 remove 7,330fcd2 9310 ex punlock 0 9310 en plock 7,330fcd5 9308 req 7,330fcd2 ex 0-7fffffffffffffff lkf 2000 wait 1 9310 req 7,330fcd5 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 ex plock 0 9310 ex plock 0 9308 en punlock 7,330fcd2 9308 remove 7,330fcd2 9308 ex punlock 0 9308 en plock 7,330fcd2 9306 req 7,330fcd2 ex 0-7fffffffffffffff lkf 2000 wait 1 9306 ex plock 0 9310 en punlock 7,330fcd5 9310 remove 7,330fcd5 9310 ex punlock 0 9310 en plock 7,330fcd3 9310 req 7,330fcd3 ex 0-7fffffffffffffff lkf 2000 wait 1 9307 ex plock 0 9306 en punlock 7,330fcd2 9306 remove 7,330fcd2 9306 ex punlock 0 9306 en plock 7,330fcd3 9308 req 7,330fcd2 ex 
0-7fffffffffffffff lkf 2000 wait 1 9308 ex plock 0 9310 ex plock 0 9307 en punlock 7,330fcd1 9307 remove 7,330fcd1 9307 ex punlock 0 9307 en plock 7,330fcd1 9307 req 7,330fcd1 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 en punlock 7,330fcd2 9308 remove 7,330fcd2 9308 ex punlock 0 9308 en plock 7,330fcd5 9308 req 7,330fcd5 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 ex plock 0 9310 en punlock 7,330fcd3 9310 remove 7,330fcd3 9310 ex punlock 0 9310 en plock 7,330fcd4 9306 req 7,330fcd3 ex 0-7fffffffffffffff lkf 2000 wait 1 9310 req 7,330fcd4 ex 0-7fffffffffffffff lkf 2000 wait 1 9308 en punlock 7,330fcd5 9308 remove 7,330fcd5 9308 ex punlock 0 9308 en plock 7,330fcd3 9276 en punlock 7,34ef7e3 9276 remove 7,34ef7e3 9276 ex punlock 0 9276 en plock 7,34ef7ed 9276 req 7,34ef7ed ex 0-7fffffffffffffff lkf 2000 wait 1 9276 ex plock 0 9307 ex plock 0 9285 en punlock 7,34ef7ee 9292 en punlock 7,34ef7e4 9292 remove 7,34ef7e4 9276 en punlock 7,34ef7ed 9285 remove 7,34ef7ee 9292 ex punlock 0 9276 remove 7,34ef7ed 9292 en plock 7,34ef7e3 9292 req 7,34ef7e3 ex 0-7fffffffffffffff lkf 2000 wait 1 9285 ex punlock 0 9285 en plock 7,34ef7ee 9276 ex punlock 0 9276 en plock 7,34ef7ee 9285 req 7,34ef7ee ex 0-7fffffffffffffff lkf 2000 wait 1 9307 en punlock 7,330fcd1 9307 remove 7,330fcd1 9307 ex punlock 0 9307 en plock 7,330fcd4 9310 ex plock 0 9292 ex plock 0 9306 ex plock 0 9285 ex plock 0 9310 en punlock 7,330fcd4 9310 remove 7,330fcd4 9310 ex punlock 0 9310 en plock 7,330fcd3 9307 req 7,330fcd4 ex 0-7fffffffffffffff lkf 2000 wait 1 9292 en punlock 7,34ef7e3 9292 remove 7,34ef7e3 9292 ex punlock 0 9292 en plock 7,34ef7ee 9285 en punlock 7,34ef7ee 9285 remove 7,34ef7ee 9285 ex punlock 0 9285 en plock 7,34ef7e3 9285 req 7,34ef7e3 ex 0-7fffffffffffffff lkf 2000 wait 1 9276 req 7,34ef7ee ex 0-7fffffffffffffff lkf 2000 wait 1 9306 en punlock 7,330fcd3 9306 remove 7,330fcd3 9306 ex punlock 0 9310 req 7,330fcd3 ex 0-7fffffffffffffff lkf 2000 wait 1 9306 en plock 7,330fcd4 9276 ex plock 0 9285 ex 
plock 0 9307 ex plock 0 9310 ex plock 0 9276 en punlock 7,34ef7ee 9276 remove 7,34ef7ee 9276 ex punlock 0 9276 en plock 7,34ef7ed 9276 req 7,34ef7ed ex 0-7fffffffffffffff lkf 2000 wait 1 9276 ex plock 0 9281 req 7,34ef7ee ex 0-7fffffffffffffff lkf 2000 wait 1 9285 en punlock 7,34ef7e3 9285 remove 7,34ef7e3 9285 ex punlock 0 9285 en plock 7,34ef7ee 9276 en punlock 7,34ef7ed 9307 en punlock 7,330fcd4 9310 en punlock 7,330fcd3 9293 en punlock 7,3314504 9317 en punlock 7,3314195
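The traces above appear to bracket each plock/punlock call with `en` (enter) and `ex` (exit) events, and each dump ends with pids that have entered but never exited. Assuming one event per line as in the live debug file, and with the field meanings being my reading of the trace (the `plock_balance` name is hypothetical), an awk pass can surface those stuck pids:

```shell
# plock_balance: count "en" (enter) vs "ex" (exit) trace events per pid in
# lock_dlm debug output (stdin, one event per line); a pid with more enters
# than exits is presumably still blocked inside a plock/punlock call.
plock_balance() {
    awk '$2 == "en" { en[$1]++ }
         $2 == "ex" { ex[$1]++ }
         END { for (p in en) if (en[p] > ex[p]) print p, "stuck in", en[p] - ex[p], "call(s)" }'
}
```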
more: [root@taft-01 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [5] 2 migration_thread [migration/0] 3 ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 migration_thread [migration/2] 7 ksoftirqd [ksoftirqd/2] 8 migration_thread [migration/3] 9 ksoftirqd [ksoftirqd/3] 10 worker_thread [events/0] 11 worker_thread [events/1] 12 worker_thread [events/2] 13 worker_thread [events/3] 14 worker_thread [khelper] 15 worker_thread [kacpid] 51 worker_thread [kblockd/0] 52 worker_thread [kblockd/1] 53 worker_thread [kblockd/2] 54 worker_thread [kblockd/3] 72 worker_thread [aio/0] 73 worker_thread [aio/1] 74 worker_thread [aio/2] 75 worker_thread [aio/3] 55 hub_thread [khubd] 71 kswapd [kswapd0] 148 serio_thread [kseriod] 212 16045567552327254 [scsi_eh_0] 228 worker_thread [kmirrord] 229 worker_thread [kmir_mon] 237 kjournald [kjournald] 1017 16045567552327254 [scsi_eh_1] 1018 16045567552327254 [qla2300_1_dpc] 1030 - udevd 1307 - [shpchpd_event] 1705 kauditd_thread [kauditd] 1888 kjournald [kjournald] 2353 - syslogd -m 0 2357 syslog klogd -x 2368 - irqbalance 2379 - portmap 2399 - rpc.statd 2427 - rpc.idmapd 2538 - /usr/sbin/acpid 2589 - /usr/sbin/sshd 2620 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2640 - sendmail: accepting connections 2648 pause sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 2669 - gpm -m /dev/input/mice -t imps2 2729 - crond 2755 - xfs -droppriv -daemon 2774 - /usr/sbin/atd 2799 - dbus-daemon-1 --system 2812 - cups-config-daemon 2823 - hald 2860 wait login -- root 2861 - /sbin/mingetty tty1 2862 - /sbin/mingetty tty2 2863 - /sbin/mingetty tty3 2867 - /sbin/mingetty tty4 2868 - /sbin/mingetty tty5 2870 - /sbin/mingetty tty6 2944 - /usr/bin/gdm-binary -nodaemon 3410 pipe_wait /usr/bin/gdm-binary -nodaemon 3411 - /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp 3663 - /usr/bin/gdmgreeter 3664 - sshd: root@pts/0 3666 wait -bash 3705 - ccsd 3730 
cluster_kthread [cman_comms] 3732 serviced [cman_serviced] 3731 membership_kthrea [cman_memb] 3733 hello_kthread [cman_hbeat] 3759 rt_sigsuspend fenced -w 3772 - clvmd 3773 dlm_astd [dlm_astd] 3774 dlm_recvd [dlm_recvd] 3775 dlm_sendd [dlm_sendd] 3776 dlm_recoverd [dlm_recoverd] 3828 dlm_recoverd [dlm_recoverd] 3829 dlm_async [lock_dlm1] 3830 dlm_async [lock_dlm2] 3831 - [gfs_scand] 3832 - [gfs_glockd] 3833 - [gfs_recoverd] 3834 - [gfs_logd] 3835 glock_wait_intern [gfs_quotad] 3836 glock_wait_intern [gfs_inoded] 3840 dlm_recoverd [dlm_recoverd] 3841 dlm_async [lock_dlm1] 3842 dlm_async [lock_dlm2] 3843 - [gfs_scand] 3844 - [gfs_glockd] 3845 - [gfs_recoverd] 3846 - [gfs_logd] 3847 - [gfs_quotad] 3848 glock_wait_intern [gfs_inoded] 3672 - cupsd 6524 - qarshd 6525 - qarshd 6526 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 6527 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 6529 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 12437316 -r /usr/tests/sts -L HEAVY 6531 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 11656384 -r /usr/tests/sts -L HEAVY 6538 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.6529 -a /tmp/revolver_l 6539 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_l 6540 - pan2 -x 5 -f /tmp/revolver_load_gen.6529 -a /tmp/revolver_load_gen.6529.active -d /tmp/revolver_ 6541 - pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_load_gen.6531.active -d /tmp/revolver_ 6544 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6529 -a /tmp/revolver_load_gen.6529.active -d /tmp/revolver_ 6545 wait genesis -S 12723 -n 500 -d 150 -p 4 6546 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_load_gen.6531.active -d /tmp/revolver_ 6547 glock_wait_intern growfiles -S 12723 -i 0 -N 500 -n 4 -b 6548 pipe_wait pan2 -x 5 
-f /tmp/revolver_load_gen.6529 -a /tmp/revolver_load_gen.6529.active -d /tmp/revolver_ 6549 wait accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6550 dlm_lock_sync genesis -S 12723 -n 500 -d 150 -p 4 6551 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_load_gen.6531.active -d /tmp/revolver_ 6553 wait_async genesis -S 12723 -n 500 -d 150 -p 4 6552 wait accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6554 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6529 -a /tmp/revolver_load_gen.6529.active -d /tmp/revolver_ 6557 glock_wait_intern genesis -S 12723 -n 500 -d 150 -p 4 6558 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6529 -a /tmp/revolver_load_gen.6529.active -d /tmp/revolver_ 6559 - genesis -S 12723 -n 500 -d 150 -p 4 6560 lm_dlm_plock accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6561 glock_wait_intern growfiles -S 12723 -i 0 -N 500 -n 4 -b 6562 - growfiles -S 12723 -i 0 -N 500 -n 4 -b 6563 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_load_gen.6531.active -d /tmp/revolver_ 6564 wait genesis -S 12723 -n 500 -d 150 -p 4 6556 wait sh -c iogen -S 12723 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2487463b 2487 6565 - growfiles -S 12723 -i 0 -N 500 -n 4 -b 6566 - growfiles -S 12723 -i 0 -N 500 -n 4 -b 6568 lm_dlm_plock accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6569 glock_wait_intern growfiles -S 12723 -i 0 -N 500 -n 4 -b 6570 lm_dlm_plock accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6571 - growfiles -S 12723 -i 0 -N 500 -n 4 -b 6572 wait_async accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6573 pipe_wait iogen -S 12723 -f buffered -m sequential -s read write readv writev -t 1b -T 2487463b 2487463b:r 6576 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.6531 -a /tmp/revolver_load_gen.6531.active -d /tmp/revolver_ 6577 
wait sh -c iogen -S 12723 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2331276b 2331 6578 glock_wait_intern growfiles -S 12723 -i 0 -N 500 -n 4 -b 6580 dlm_lock_sync genesis -S 12723 -n 500 -d 150 -p 4 6582 wait_async genesis -S 12723 -n 500 -d 150 -p 4 6583 wait_async genesis -S 12723 -n 500 -d 150 -p 4 6584 lm_dlm_plock genesis -S 12723 -n 500 -d 150 -p 4 6585 dlm_lock_sync doio -avk 6586 wait_async accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6588 wait_async accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6587 pipe_wait iogen -S 12723 -f buffered -m sequential -s read write readv writev -t 1b -T 2331276b 2331276b:r 6589 dlm_lock_sync doio -avk 6590 lm_dlm_plock accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 6591 lm_dlm_plock accordion -S 12723 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 8376 pdflush [pdflush] 8384 pdflush [pdflush] 11174 wait -bash 11207 - tail -f /var/log/messages 26921 pipe_wait crond 26922 wait /bin/bash /usr/bin/run-parts /etc/cron.daily 26923 pipe_wait /usr/bin/perl -w /etc/cron.daily/00-logwatch 26924 pipe_wait awk -v progname=/etc/cron.daily/00-logwatch progname {????? print progname ":\n"????? progna 27158 pipe_wait /bin/mail -s LogWatch for taft-01 root 27277 wait /bin/bash /etc/log.d/scripts/services/zz-disk_space 27279 - df -hP 27280 pipe_wait grep ^/dev/ 27281 pipe_wait sed s/\([^[:space:]]\{20,\}\)\(.*\)/\1\n \2/ 8769 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd
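A listing like the one above is easier to scan when tallied by wait channel. This sketch assumes the same `ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd` column layout used here; the `wchan_summary` name is mine.

```shell
# wchan_summary: given "PID WCHAN CMD" ps output on stdin, count how many
# processes are sleeping in each wait channel ("-" means not waiting).
wchan_summary() {
    awk 'NR > 1 { n[$2]++ } END { for (w in n) print n[w], w }' | sort -rn
}
```

Channels like dlm_lock_sync, lm_dlm_plock, and glock_wait_intern rising to the top would point at the same stuck DLM/GFS paths already visible above.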
even more: [root@taft-01 ~]# cat /proc/cluster/dlm_debug d rebuilt 0 locks clvmd recover event 28 done taft1 mark waiting requests taft1 mark 26dd02dd lq 1 nodeid -1 taft1 mark 26780399 lq 1 nodeid -1 taft1 mark 26a30197 lq 1 nodeid -1 taft1 mark 263500eb lq 1 nodeid -1 taft1 mark 263503ee lq 1 nodeid -1 taft1 mark 267b023a lq 1 nodeid -1 taft1 marked 6 requests taft1 purge locks of departed nodes taft0 mark waiting requests taft0 mark 21a60124 lq 3 nodeid 1 taft0 mark 21c203cf lq 1 nodeid -1 taft0 mark 216c01bc lq 1 nodeid -1 taft0 mark 21fc00d5 lq 1 nodeid -1 taft0 mark 219b0067 lq 1 nodeid -1 taft0 mark 2170032b lq 1 nodeid -1 taft0 mark 22210220 lq 3 nodeid 1 taft0 mark 2130031b lq 1 nodeid -1 taft0 marked 8 requests taft0 purge locks of departed nodes taft0 purged 1 locks taft0 update remastered resources taft1 purged 1 locks taft1 update remastered resources taft0 updated 1102 resources taft0 rebuild locks taft1 updated 1100 resources taft1 rebuild locks taft0 rebuilt 2051 locks taft0 recover event 28 done taft1 rebuilt 1997 locks taft1 recover event 28 done [root@taft-02 ~]# cat /proc/cluster/dlm_debug d resource directory taft0 total nodes 3 taft0 rebuild resource directory clvmd total nodes 3 clvmd rebuild resource directory clvmd rebuilt 0 resources clvmd purge requests clvmd purged 0 requests taft0 rebuilt 9885 resources taft0 purge requests taft0 purged 0 requests taft1 rebuilt 11948 resources taft1 purge requests taft1 purged 0 requests taft0 mark waiting requests taft1 mark waiting requests taft0 mark 1002d6 lq 1 nodeid -1 taft0 mark d00b2 lq 1 nodeid -1 taft0 mark d0360 lq 1 nodeid -1 taft0 mark 800d3 lq 1 nodeid -1 taft0 marked 4 requests taft0 purge locks of departed nodes taft1 mark 11030e lq 1 nodeid -1 taft1 mark c039b lq 1 nodeid -1 taft1 marked 2 requests taft1 purge locks of departed nodes taft0 purged 1 locks taft0 update remastered resources taft1 purged 1 locks taft1 update remastered resources taft0 updated 1103 resources taft0 rebuild 
locks taft0 rebuilt 2 locks taft0 recover event 7 done taft1 updated 1100 resources taft1 rebuild locks taft1 rebuilt 37 locks taft1 recover event 7 done [root@taft-04 ~]# cat /proc/cluster/dlm_debug quests taft1 purged 0 requests clvmd mark waiting requests clvmd marked 0 requests clvmd purge locks of departed nodes clvmd purged 0 locks clvmd update remastered resources clvmd updated 2 resources clvmd rebuild locks taft1 mark waiting requests taft0 mark waiting requests taft1 mark 828001c5 lq 1 nodeid -1 taft1 mark 80b7025c lq 1 nodeid -1 taft1 mark 827a03d3 lq 1 nodeid -1 taft1 mark 81ce0335 lq 1 nodeid -1 taft1 mark 83ca0207 lq 1 nodeid -1 taft1 mark 82300159 lq 1 nodeid -1 taft1 marked 6 requests taft1 purge locks of departed nodes taft0 mark 3e04021d lq 1 nodeid -1 taft0 mark 3e75033a lq 3 nodeid 1 taft0 mark 3e400208 lq 1 nodeid -1 taft0 mark 3df000b1 lq 1 nodeid -1 taft0 mark 3e4d03cd lq 1 nodeid -1 taft0 mark 3f0902c7 lq 1 nodeid -1 taft0 marked 6 requests taft0 purge locks of departed nodes clvmd rebuilt 2 locks clvmd recover event 46 done taft0 purged 2 locks taft0 update remastered resources taft1 purged 4 locks taft1 update remastered resources taft0 updated 1102 resources taft0 rebuild locks
one more :) [root@taft-01 ~]# cat /proc/meminfo MemTotal: 8169156 kB MemFree: 1798576 kB Buffers: 44552 kB Cached: 3164572 kB SwapCached: 0 kB Active: 5843884 kB Inactive: 230056 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 8169156 kB LowFree: 1798576 kB SwapTotal: 2031608 kB SwapFree: 2031608 kB Dirty: 60 kB Writeback: 0 kB Mapped: 2929164 kB Slab: 268904 kB Committed_AS: 3696092 kB PageTables: 9656 kB VmallocTotal: 536870911 kB VmallocUsed: 264088 kB VmallocChunk: 536606639 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB [root@taft-02 ~]# cat /proc/meminfo MemTotal: 8169156 kB MemFree: 2760112 kB Buffers: 41328 kB Cached: 2727564 kB SwapCached: 0 kB Active: 5115020 kB Inactive: 155240 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 8169156 kB LowFree: 2760112 kB SwapTotal: 2031608 kB SwapFree: 2031608 kB Dirty: 48 kB Writeback: 0 kB Mapped: 2568480 kB Slab: 112744 kB Committed_AS: 2669484 kB PageTables: 8564 kB VmallocTotal: 536870911 kB VmallocUsed: 264088 kB VmallocChunk: 536606639 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB [root@taft-03 ~]# cat /proc/meminfo MemTotal: 8169152 kB MemFree: 7832668 kB Buffers: 42944 kB Cached: 215728 kB SwapCached: 0 kB Active: 136596 kB Inactive: 142428 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 8169152 kB LowFree: 7832668 kB SwapTotal: 2031608 kB SwapFree: 2031608 kB Dirty: 88 kB Writeback: 0 kB Mapped: 36804 kB Slab: 40608 kB Committed_AS: 527004 kB PageTables: 1792 kB VmallocTotal: 536870911 kB VmallocUsed: 263608 kB VmallocChunk: 536606871 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB [root@taft-04 ~]# cat /proc/meminfo MemTotal: 8169152 kB MemFree: 2232432 kB Buffers: 43264 kB Cached: 2752352 kB SwapCached: 0 kB Active: 5563164 kB Inactive: 124612 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 8169152 kB LowFree: 2232432 kB SwapTotal: 2031608 kB SwapFree: 2031608 kB Dirty: 48 kB Writeback: 0 kB Mapped: 2960076 kB Slab: 221996 kB Committed_AS: 3701572 kB PageTables: 9416 kB 
VmallocTotal: 536870911 kB VmallocUsed: 264088 kB VmallocChunk: 536606639 kB HugePages_Total: 0 HugePages_Free: 0 Hugepagesize: 2048 kB [root@taft-01 ~]# cat /proc/slabinfo slabinfo - version: 2.0 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <batchcount> <limit> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> gfs_meta_header_cache 16220 16856 80 49 1 : tunables 120 60 8 : slabdata 344 344 0 gfs_bufdata 5524 8000 160 25 1 : tunables 120 60 8 : slabdata 320 320 0 gfs_inode 31541 31563 536 7 1 : tunables 54 27 8 : slabdata 4509 4509 0 gfs_glock 67064 67248 312 12 1 : tunables 54 27 8 : slabdata 5604 5604 0 gfs_bio_wrapper 1024 1125 16 225 1 : tunables 120 60 8 : slabdata 5 5 0 dlm_conn 6 40 192 20 1 : tunables 120 60 8 : slabdata 2 2 0 dlm_lvb/range 82 476 32 119 1 : tunables 120 60 8 : slabdata 4 4 0 dlm_resdir(s) 22942 34638 56 69 1 : tunables 120 60 8 : slabdata 502 502 0 dlm_resdir(l) 0 0 88 45 1 : tunables 120 60 8 : slabdata 0 0 0 dlm_lkb 39794 40273 232 17 1 : tunables 120 60 8 : slabdata 2369 2369 0 dlm_rsb(large) 2 26 304 13 1 : tunables 54 27 8 : slabdata 2 2 0 dlm_rsb(small) 37849 38346 272 14 1 : tunables 54 27 8 : slabdata 2739 2739 0 cluster_sock 5 15 768 5 1 : tunables 54 27 8 : slabdata 3 3 0 fib6_nodes 7 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0 ip6_dst_cache 5 10 384 10 1 : tunables 54 27 8 : slabdata 1 1 0 ndisc_cache 2 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0 rawv6_sock 6 8 1024 4 1 : tunables 54 27 8 : slabdata 2 2 0 udpv6_sock 1 4 1024 4 1 : tunables 54 27 8 : slabdata 1 1 0 tcpv6_sock 4 6 1792 2 1 : tunables 24 12 8 : slabdata 3 3 0 rpc_buffers 8 8 2048 2 1 : tunables 24 12 8 : slabdata 4 4 0 rpc_tasks 8 10 384 10 1 : tunables 54 27 8 : slabdata 1 1 0 rpc_inode_cache 6 8 896 4 1 : tunables 54 27 8 : slabdata 2 2 0 ip_fib_alias 10 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0 ip_fib_hash 10 61 64 61 1 : tunables 120 60 8 : slabdata 1 1 0 uhci_urb_priv 0 0 88 45 1 : tunables 
120 60 8 : slabdata 0 0 0 qla2xxx_srbs 128 135 256 15 1 : tunables 120 60 8 : slabdata 9 9 0 dm-snapshot-in 128 164 96 41 1 : tunables 120 60 8 : slabdata 4 4 0 dm-snapshot-ex 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0 ext3_inode_cache 14638 14652 840 4 1 : tunables 54 27 8 : slabdata 3663 3663 0 ext3_xattr 0 0 88 45 1 : tunables 120 60 8 : slabdata 0 0 0 journal_handle 15 81 48 81 1 : tunables 120 60 8 : slabdata 1 1 0 journal_head 37 90 88 45 1 : tunables 120 60 8 : slabdata 2 2 0 revoke_table 4 225 16 225 1 : tunables 120 60 8 : slabdata 1 1 0 revoke_record 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0 dm_tio 1064 1248 24 156 1 : tunables 120 60 8 : slabdata 8 8 0 dm_io 1064 1190 32 119 1 : tunables 120 60 8 : slabdata 10 10 0 scsi_cmd_cache 6 14 512 7 1 : tunables 54 27 8 : slabdata 2 2 0 sgpool-128 32 32 4096 1 1 : tunables 24 12 8 : slabdata 32 32 0 sgpool-64 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0 sgpool-32 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0 sgpool-16 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0 sgpool-8 37 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0 unix_sock 43 85 768 5 1 : tunables 54 27 8 : slabdata 17 17 0 ip_mrt_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 tcp_tw_bucket 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 tcp_bind_bucket 18 119 32 119 1 : tunables 120 60 8 : slabdata 1 1 0 tcp_open_request 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 inet_peer_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 secpath_cache 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 xfrm_dst_cache 0 0 384 10 1 : tunables 54 27 8 : slabdata 0 0 0 ip_dst_cache 42 80 384 10 1 : tunables 54 27 8 : slabdata 8 8 0 arp_cache 5 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0 raw_sock 5 8 896 4 1 : tunables 54 27 8 : slabdata 2 2 0 udp_sock 7 12 896 4 1 : tunables 54 27 8 : slabdata 3 3 0 tcp_sock 25 40 1536 5 2 : tunables 24 12 8 : slabdata 8 8 0 flow_cache 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 
0 0 mqueue_inode_cache 1 4 896 4 1 : tunables 54 27 8 : slabdata 1 1 0 relayfs_inode_cache 0 0 576 7 1 : tunables 54 27 8 : slabdata 0 0 0 isofs_inode_cache 0 0 616 6 1 : tunables 54 27 8 : slabdata 0 0 0 hugetlbfs_inode_cache 1 6 608 6 1 : tunables 54 27 8 : slabdata 1 1 0 ext2_inode_cache 0 0 736 5 1 : tunables 54 27 8 : slabdata 0 0 0 ext2_xattr 0 0 88 45 1 : tunables 120 60 8 : slabdata 0 0 0 dquot 0 0 224 17 1 : tunables 120 60 8 : slabdata 0 0 0 eventpoll_pwq 1 54 72 54 1 : tunables 120 60 8 : slabdata 1 1 0 eventpoll_epi 1 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0 kioctx 0 0 384 10 1 : tunables 54 27 8 : slabdata 0 0 0 kiocb 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 dnotify_cache 2 96 40 96 1 : tunables 120 60 8 : slabdata 1 1 0 fasync_cache 1 156 24 156 1 : tunables 120 60 8 : slabdata 1 1 0 shmem_inode_cache 278 280 800 5 1 : tunables 54 27 8 : slabdata 56 56 0 posix_timers_cache 0 0 184 21 1 : tunables 120 60 8 : slabdata 0 0 0 uid_cache 6 31 128 31 1 : tunables 120 60 8 : slabdata 1 1 0 cfq_pool 71 207 56 69 1 : tunables 120 60 8 : slabdata 3 3 0 crq_pool 30 216 72 54 1 : tunables 120 60 8 : slabdata 4 4 0 deadline_drq 0 0 96 41 1 : tunables 120 60 8 : slabdata 0 0 0 as_arq 0 0 112 35 1 : tunables 120 60 8 : slabdata 0 0 0 blkdev_ioc 63 238 32 119 1 : tunables 120 60 8 : slabdata 2 2 0 blkdev_queue 29 45 848 9 2 : tunables 54 27 8 : slabdata 5 5 0 blkdev_requests 24 105 264 15 1 : tunables 54 27 8 : slabdata 7 7 0 biovec-(256) 256 256 4096 1 1 : tunables 24 12 8 : slabdata 256 256 0 biovec-128 256 256 2048 2 1 : tunables 24 12 8 : slabdata 128 128 0 biovec-64 256 256 1024 4 1 : tunables 54 27 8 : slabdata 64 64 0 biovec-16 256 270 256 15 1 : tunables 120 60 8 : slabdata 18 18 0 biovec-4 256 305 64 61 1 : tunables 120 60 8 : slabdata 5 5 0 biovec-1 278 450 16 225 1 : tunables 120 60 8 : slabdata 2 2 0 bio 278 310 128 31 1 : tunables 120 60 8 : slabdata 10 10 0 file_lock_cache 29 250 160 25 1 : tunables 120 60 8 : slabdata 10 10 0 
sock_inode_cache 98 125 768 5 1 : tunables 54 27 8 : slabdata 25 25 0 skbuff_head_cache 374 420 384 10 1 : tunables 54 27 8 : slabdata 42 42 0 sock 6 12 640 6 1 : tunables 54 27 8 : slabdata 2 2 0 proc_inode_cache 1068 1068 600 6 1 : tunables 54 27 8 : slabdata 178 178 0 sigqueue 4 23 168 23 1 : tunables 120 60 8 : slabdata 1 1 0 radix_tree_node 53598 62321 536 7 1 : tunables 54 27 8 : slabdata 8903 8903 0 bdev_cache 31 45 768 5 1 : tunables 54 27 8 : slabdata 9 9 0 mnt_cache 29 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0 audit_watch_cache 0 0 88 45 1 : tunables 120 60 8 : slabdata 0 0 0 inode_cache 68758 68894 568 7 1 : tunables 54 27 8 : slabdata 9842 9842 0 dentry_cache 59472 59472 240 16 1 : tunables 120 60 8 : slabdata 3717 3717 0 filp 1005 1185 256 15 1 : tunables 120 60 8 : slabdata 79 79 0 names_cache 26 26 4096 1 1 : tunables 24 12 8 : slabdata 26 26 0 avc_node 12 486 72 54 1 : tunables 120 60 8 : slabdata 9 9 0 key_jar 12 20 192 20 1 : tunables 120 60 8 : slabdata 1 1 0 idr_layer_cache 90 98 528 7 1 : tunables 54 27 8 : slabdata 14 14 0 buffer_head 747006 752310 88 45 1 : tunables 120 60 8 : slabdata 16718 16718 0 mm_struct 119 140 1152 7 2 : tunables 24 12 8 : slabdata 20 20 0 vm_area_struct 3127 3366 176 22 1 : tunables 120 60 8 : slabdata 153 153 0 fs_cache 125 305 64 61 1 : tunables 120 60 8 : slabdata 5 5 0 files_cache 119 153 832 9 2 : tunables 54 27 8 : slabdata 17 17 0 signal_cache 189 255 256 15 1 : tunables 120 60 8 : slabdata 17 17 0 sighand_cache 174 174 2112 3 2 : tunables 24 12 8 : slabdata 58 58 0 task_struct 180 180 1984 2 1 : tunables 24 12 8 : slabdata 90 90 0 anon_vma 1123 1560 24 156 1 : tunables 120 60 8 : slabdata 10 10 0 shared_policy_node 0 0 56 69 1 : tunables 120 60 8 : slabdata 0 0 0 numa_policy 45 225 16 225 1 : tunables 120 60 8 : slabdata 1 1 0 size-131072(DMA) 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0 size-131072 1 1 131072 1 32 : tunables 8 4 0 : slabdata 1 1 0 size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : 
slabdata 0 0 0 size-65536 1 1 65536 1 16 : tunables 8 4 0 : slabdata 1 1 0 size-32768(DMA) 0 0 32768 1 8 : tunables 8 4 0 : slabdata 0 0 0 size-32768 6 6 32768 1 8 : tunables 8 4 0 : slabdata 6 6 0 size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0 size-16384 4 4 16384 1 4 : tunables 8 4 0 : slabdata 4 4 0 size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0 size-8192 22 22 8192 1 2 : tunables 8 4 0 : slabdata 22 22 0 size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0 size-4096 394 394 4096 1 1 : tunables 24 12 8 : slabdata 394 394 0 size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0 size-2048 82 82 2048 2 1 : tunables 24 12 8 : slabdata 41 41 0 size-1620(DMA) 0 0 1664 4 2 : tunables 24 12 8 : slabdata 0 0 0 size-1620 61 72 1664 4 2 : tunables 24 12 8 : slabdata 18 18 0 size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0 size-1024 503 1224 1024 4 1 : tunables 54 27 8 : slabdata 306 306 0 size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0 size-512 2603 2624 512 8 1 : tunables 54 27 8 : slabdata 328 328 0 size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0 size-256 70936 71310 256 15 1 : tunables 120 60 8 : slabdata 4754 4754 0 size-128(DMA) 0 0 128 31 1 : tunables 120 60 8 : slabdata 0 0 0 size-128 2910 3596 128 31 1 : tunables 120 60 8 : slabdata 116 116 0 size-64(DMA) 0 0 64 61 1 : tunables 120 60 8 : slabdata 0 0 0 size-64 10553 12383 64 61 1 : tunables 120 60 8 : slabdata 203 203 0 size-32(DMA) 0 0 32 119 1 : tunables 120 60 8 : slabdata 0 0 0 size-32 3180 3570 32 119 1 : tunables 120 60 8 : slabdata 30 30 0 kmem_cache 180 180 256 15 1 : tunables 120 60 8 : slabdata 12 12 0 [root@taft-01 ~]# cat /proc/cluster/sm_debug 01000003 recover state 2 01000002 recover state 3 01000003 cb recover state 2 01000005 recover state 2 01000003 recover state 3 01000002 recover state 4 01000005 cb recover state 2 01000005 recover state 3 01000003 recover state 4 01000002 recover state 4 
[root@taft-02 ~]# cat /proc/cluster/sm_debug 01000003 recover state 2 01000002 recover state 1 01000003 cb recover state 2 01000005 recover state 2 01000003 recover state 3 01000002 recover state 2 01000005 cb recover state 2 01000005 recover state 3 01000003 recover state 4 01000002 recover state 2 [root@taft-03 ~]# cat /proc/cluster/sm_debug 00000000 sevent state 1 00000001 sevent state 3 [root@taft-04 ~]# cat /proc/cluster/sm_debug 4 01000005 recover state 2 01000003 recover state 1 01000002 recover state 4 01000005 recover state 2 01000003 recover state 2 01000002 recover state 1 01000002 cb recover state 2 01000005 recover state 2 01000003 recover state 2 01000002 recover state 3
One peculiar thing: due to the revolver I/O load, there were many messages like the following: Nov 28 15:10:33 taft-02 kernel: qla2300 0000:0a:02.0: scsi(1:0:0:0): Queue depth adjusted to -1. Tom Coughlan had this to say about them: <coughlan> corey: You got a Queue Full error from the storage device. This can happen when there are multiple initiators issuing commands at the same time. The message indicates that the driver responded by turning off queueing and going to one-at-a-time command issuing. This will kill performance, and is probably not what it should have done, unless there were many Queue Full errors. <corey> coughlan: I had a pretty "heavy" I/O load going at the time. My concern, however, is that recovery hung during this test, and before I file a bug I wanted to know whether this queueing could somehow have caused the hang by delaying or dropping the needed recovery I/O. <coughlan> corey: anything is possible, but at least in theory, the queue full error should have been retried, and the reduction in subsequent queue depth should only impact performance. But any error path is certainly suspect, due to lack of testing. Hard to say without a lot more details.
Storage errors in the past were blamed for I/O hangs during GFS journal recovery. This hang is during DLM recovery, so the qla messages probably aren't relevant. DLM recovery is hung in all three lockspaces. In each lockspace, one node is hung at some stage of DLM recovery while the other two have finished. For lockspace "clvmd", taft-02 is stuck after "purged 0 requests"; taft-01 and taft-04 have finished. For lockspace "taft0", taft-04 is stuck after "rebuild locks"; taft-01 and taft-02 have finished. For lockspace "taft1", taft-04 is stuck after "update remastered resources"; taft-01 and taft-02 have finished. I haven't found any clues yet as to why DLM recovery is stuck in each case. A 'ps' on taft-02 and taft-04 might reveal a little more about what dlm_recoverd is stuck on or waiting for.
Appear to have hit this again (taft-04 was shot) here's the ps output: [root@taft-01 ~]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run U-1,10,2 [3 4 1] DLM Lock Space: "clvmd" 2 3 run - [3 4 1] DLM Lock Space: "taft0" 4 5 recover 4 - [3 4 1] DLM Lock Space: "taft1" 6 7 run - [3 4 1] GFS Mount Group: "taft0" 5 6 recover 0 - [3 4 1] GFS Mount Group: "taft1" 7 8 recover 0 - [3 4 1] User: "usrm::manager" 3 4 recover 0 - [3 4 1] [root@taft-02 ~]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run U-1,10,2 [3 4 1] DLM Lock Space: "clvmd" 2 3 run - [3 4 1] DLM Lock Space: "taft0" 4 5 recover 2 - [3 4 1] DLM Lock Space: "taft1" 6 7 run - [3 4 1] GFS Mount Group: "taft0" 5 6 recover 0 - [3 4 1] GFS Mount Group: "taft1" 7 8 recover 0 - [3 4 1] User: "usrm::manager" 3 4 recover 0 - [3 4 1] [root@taft-03 ~]# cat /proc/cluster/services Service Name GID LID State Code Fence Domain: "default" 1 2 run U-1,10,2 [4 3 1] DLM Lock Space: "clvmd" 2 3 run - [4 3 1] DLM Lock Space: "taft0" 4 5 recover 4 - [4 3 1] DLM Lock Space: "taft1" 6 7 run - [4 3 1] GFS Mount Group: "taft0" 5 6 recover 0 - [4 3 1] GFS Mount Group: "taft1" 7 8 recover 0 - [4 3 1] User: "usrm::manager" 3 4 recover 0 - [4 3 1] [root@taft-01 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [5] 2 migration_thread [migration/0] 3 ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 migration_thread [migration/2] 7 ksoftirqd [ksoftirqd/2] 8 migration_thread [migration/3] 9 ksoftirqd [ksoftirqd/3] 10 worker_thread [events/0] 11 worker_thread [events/1] 12 worker_thread [events/2] 13 worker_thread [events/3] 14 worker_thread [khelper] 15 worker_thread [kacpid] 51 worker_thread [kblockd/0] 52 worker_thread [kblockd/1] 53 worker_thread [kblockd/2] 54 worker_thread [kblockd/3] 72 worker_thread [aio/0] 73 worker_thread [aio/1] 74 worker_thread [aio/2] 75 worker_thread [aio/3] 55 
hub_thread [khubd] 71 kswapd [kswapd0] 148 serio_thread [kseriod] 212 16045567552327254 [scsi_eh_0] 228 worker_thread [kmirrord] 229 worker_thread [kmir_mon] 237 kjournald [kjournald] 1023 16045567552327254 [scsi_eh_1] 1028 - udevd 1045 16045567552327254 [qla2300_1_dpc] 1326 - [shpchpd_event] 1680 kauditd_thread [kauditd] 1892 kjournald [kjournald] 2357 - syslogd -m 0 2361 syslog klogd -x 2372 - irqbalance 2383 - portmap 2403 - rpc.statd 2431 - rpc.idmapd 2475 - ccsd 2527 cluster_kthread [cman_comms] 2529 serviced [cman_serviced] 2528 membership_kthrea [cman_memb] 2530 hello_kthread [cman_hbeat] 2565 rt_sigsuspend fenced -w 2577 - clvmd 2578 dlm_astd [dlm_astd] 2579 dlm_recvd [dlm_recvd] 2580 dlm_sendd [dlm_sendd] 2581 dlm_recoverd [dlm_recoverd] 2707 - /usr/sbin/acpid 2719 - cupsd 2744 - /usr/sbin/sshd 2759 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2779 - sendmail: accepting connections 2787 pause sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 2803 - gpm -m /dev/input/mice -t imps2 2848 - crond 2874 - xfs -droppriv -daemon 2884 rt_sigsuspend anacron -s 2893 - /usr/sbin/atd 2923 - dbus-daemon-1 --system 2937 - cups-config-daemon 2948 - hald 2965 - clurgmgrd 2970 - /sbin/agetty ttyS0 115200 vt100-nav 2971 - /sbin/mingetty tty1 2972 - /sbin/mingetty tty2 2973 - /sbin/mingetty tty3 2975 - /sbin/mingetty tty4 3002 - /sbin/mingetty tty5 3006 - /sbin/mingetty tty6 3009 - /usr/bin/gdm-binary -nodaemon 3543 pipe_wait /usr/bin/gdm-binary -nodaemon 3544 - /usr/X11R6/bin/X :0 -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp 3787 dlm_recoverd [dlm_recoverd] 3788 dlm_async [lock_dlm1] 3789 dlm_async [lock_dlm2] 3816 - [gfs_scand] 3817 - [gfs_glockd] 3822 - /usr/bin/gdmgreeter 3836 - [gfs_recoverd] 3837 - [gfs_logd] 3838 glock_wait_intern [gfs_quotad] 3839 glock_wait_intern [gfs_inoded] 3848 dlm_recoverd [dlm_recoverd] 3849 dlm_async [lock_dlm1] 3850 dlm_async [lock_dlm2] 3851 - [gfs_scand] 3852 gfs_glockd [gfs_glockd] 3863 - [gfs_recoverd] 3865 - [gfs_logd] 
3866 glock_wait_intern [gfs_quotad] 3867 glock_wait_intern [gfs_inoded] 3875 - qarshd 3876 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3878 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 12441653 -r /usr/tests/sts -L HEAVY 3882 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_l 3883 - pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_load_gen.3878.active -d /tmp/revolver_ 3884 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_load_gen.3878.active -d /tmp/revolver_ 3885 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3888 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_load_gen.3878.active -d /tmp/revolver_ 3889 wait genesis -S 11654 -n 500 -d 150 -p 4 3890 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_load_gen.3878.active -d /tmp/revolver_ 3891 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2488330b 2488 3893 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3878 -a /tmp/revolver_load_gen.3878.active -d /tmp/revolver_ 3894 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3896 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2488330b 2488330b:r 3897 dlm_lock_sync doio -avk 3898 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3899 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3900 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3901 wait_async accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3902 wait_async genesis -S 11654 -n 500 -d 150 -p 4 3903 wait_async genesis -S 11654 -n 500 -d 150 -p 4 3904 - genesis -S 11654 -n 500 -d 150 -p 4 3905 - genesis -S 11654 -n 500 -d 150 -p 4 3906 - growfiles -S 11654 -i 0 
-N 500 -n 4 -b 3907 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3908 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3909 - qarshd 3910 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3912 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 11661784 -r /usr/tests/sts -L HEAVY 3916 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_l 3917 - pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_load_gen.3912.active -d /tmp/revolver_ 3920 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_load_gen.3912.active -d /tmp/revolver_ 3921 wait genesis -S 11654 -n 500 -d 150 -p 4 3922 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_load_gen.3912.active -d /tmp/revolver_ 3923 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2332356b 2332 3924 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_load_gen.3912.active -d /tmp/revolver_ 3925 wait_async genesis -S 11654 -n 500 -d 150 -p 4 3926 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3927 dlm_lock_sync genesis -S 11654 -n 500 -d 150 -p 4 3929 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3912 -a /tmp/revolver_load_gen.3912.active -d /tmp/revolver_ 3930 dlm_lock_sync genesis -S 11654 -n 500 -d 150 -p 4 3931 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3933 wait_async genesis -S 11654 -n 500 -d 150 -p 4 3934 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3935 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3936 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3937 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3938 wait_async accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3939 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2332356b 2332356b:r 3941 wait_async 
accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3940 dlm_lock_sync doio -avk 3942 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 4312 pdflush [pdflush] 4322 pdflush [pdflush] 5159 - sshd: root@pts/0 5161 wait -bash 5564 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd [root@taft-02 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [5] 2 migration_thread [migration/0] 3 ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 migration_thread [migration/2] 7 ksoftirqd [ksoftirqd/2] 8 migration_thread [migration/3] 9 ksoftirqd [ksoftirqd/3] 10 worker_thread [events/0] 11 worker_thread [events/1] 12 worker_thread [events/2] 13 worker_thread [events/3] 14 worker_thread [khelper] 15 worker_thread [kacpid] 51 worker_thread [kblockd/0] 52 worker_thread [kblockd/1] 53 worker_thread [kblockd/2] 54 worker_thread [kblockd/3] 55 hub_thread [khubd] 72 worker_thread [aio/0] 71 kswapd [kswapd0] 73 worker_thread [aio/1] 74 worker_thread [aio/2] 75 worker_thread [aio/3] 148 serio_thread [kseriod] 212 16045567552327254 [scsi_eh_0] 228 worker_thread [kmirrord] 229 worker_thread [kmir_mon] 237 kjournald [kjournald] 1017 16045567552327254 [scsi_eh_1] 1018 16045567552327254 [qla2300_1_dpc] 1030 - udevd 1699 kauditd_thread [kauditd] 1890 kjournald [kjournald] 2342 - syslogd -m 0 2346 syslog klogd -x 2356 - irqbalance 2366 - portmap 2385 - rpc.statd 2411 - rpc.idmapd 2423 - ccsd 2474 cluster_kthread [cman_comms] 2476 serviced [cman_serviced] 2475 membership_kthrea [cman_memb] 2477 hello_kthread [cman_hbeat] 2510 rt_sigsuspend fenced -w 2521 - clvmd 2522 dlm_astd [dlm_astd] 2523 dlm_recvd [dlm_recvd] 2524 dlm_sendd [dlm_sendd] 2525 dlm_recoverd [dlm_recoverd] 2644 - /usr/sbin/acpid 2653 - cupsd 2676 - /usr/sbin/sshd 2689 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2708 - sendmail: accepting connections 2716 pause sendmail: Queue runner@01:00:00 for 
/var/spool/clientmqueue 2732 - gpm -m /dev/input/mice -t imps2 2775 - crond 2800 - xfs -droppriv -daemon 2817 - /usr/sbin/atd 2834 - dbus-daemon-1 --system 2845 - hald 2858 - clurgmgrd 2862 - /sbin/agetty ttyS0 115200 vt100-nav 2863 - /sbin/mingetty tty1 2864 - /sbin/mingetty tty2 2865 - /sbin/mingetty tty3 2867 - /sbin/mingetty tty4 2869 - /sbin/mingetty tty5 2873 - /sbin/mingetty tty6 2874 - kdm -nodaemon 3403 - /usr/X11R6/bin/X -auth /var/run/xauth/A:0-OzJ2mc vt1 3404 pipe_wait -:0 3485 dlm_wait_function [dlm_recoverd] 3528 dlm_async [lock_dlm1] 3529 dlm_async [lock_dlm2] 3556 - [gfs_scand] 3557 - [gfs_glockd] 3561 - /usr/bin/kdm_greet 3562 - [gfs_recoverd] 3564 - [gfs_logd] 3565 glock_wait_intern [gfs_quotad] 3566 - [gfs_inoded] 3575 dlm_recoverd [dlm_recoverd] 3588 dlm_async [lock_dlm1] 3589 dlm_async [lock_dlm2] 3596 - [gfs_scand] 3597 gfs_glockd [gfs_glockd] 3598 - [gfs_recoverd] 3599 - [gfs_logd] 3600 glock_wait_intern [gfs_quotad] 3601 - [gfs_inoded] 3618 - qarshd 3619 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3621 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 12441653 -r /usr/tests/sts -L HEAVY 3625 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_l 3626 - pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_load_gen.3621.active -d /tmp/revolver_ 3627 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_load_gen.3621.active -d /tmp/revolver_ 3628 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3629 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_load_gen.3621.active -d /tmp/revolver_ 3630 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3631 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_load_gen.3621.active -d /tmp/revolver_ 3632 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 
2488330b 2488 3635 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3621 -a /tmp/revolver_load_gen.3621.active -d /tmp/revolver_ 3636 wait genesis -S 11654 -n 500 -d 150 -p 4 3637 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2488330b 2488330b:r 3638 dlm_lock_sync doio -avk 3641 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3642 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3643 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3644 dlm_lock_sync accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3645 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3646 - genesis -S 11654 -n 500 -d 150 -p 4 3647 - genesis -S 11654 -n 500 -d 150 -p 4 3648 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3649 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3650 - genesis -S 11654 -n 500 -d 150 -p 4 3651 - genesis -S 11654 -n 500 -d 150 -p 4 3652 - qarshd 3653 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3655 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 11661784 -r /usr/tests/sts -L HEAVY 3659 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3655 -a /tmp/revolver_l 3660 - pan2 -x 5 -f /tmp/revolver_load_gen.3655 -a /tmp/revolver_load_gen.3655.active -d /tmp/revolver_ 3663 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3655 -a /tmp/revolver_load_gen.3655.active -d /tmp/revolver_ 3664 wait genesis -S 11654 -n 500 -d 150 -p 4 3665 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3655 -a /tmp/revolver_load_gen.3655.active -d /tmp/revolver_ 3666 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2332356b 2332 3667 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3669 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3670 pipe_wait pan2 -x 5 -f 
/tmp/revolver_load_gen.3655 -a /tmp/revolver_load_gen.3655.active -d /tmp/revolver_ 3672 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3673 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3674 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3675 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3655 -a /tmp/revolver_load_gen.3655.active -d /tmp/revolver_ 3676 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3677 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3678 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3679 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3680 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3681 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3683 wait_async accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3682 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2332356b 2332356b:r 3684 dlm_lock_sync doio -avk 3685 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3910 pdflush [pdflush] 3925 pdflush [pdflush] 4766 - sshd: root@pts/0 4768 wait -bash 5171 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd [root@taft-03 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd PID WIDE-WCHAN-COLUMN CMD 1 - init [5] 2 migration_thread [migration/0] 3 ksoftirqd [ksoftirqd/0] 4 migration_thread [migration/1] 5 ksoftirqd [ksoftirqd/1] 6 migration_thread [migration/2] 7 ksoftirqd [ksoftirqd/2] 8 migration_thread [migration/3] 9 ksoftirqd [ksoftirqd/3] 10 worker_thread [events/0] 11 worker_thread [events/1] 12 worker_thread [events/2] 13 worker_thread [events/3] 14 worker_thread [khelper] 15 worker_thread [kacpid] 51 worker_thread [kblockd/0] 52 worker_thread [kblockd/1] 53 worker_thread [kblockd/2] 54 worker_thread [kblockd/3] 55 hub_thread [khubd] 72 worker_thread [aio/0] 71 kswapd [kswapd0] 
73 worker_thread [aio/1] 74 worker_thread [aio/2] 75 worker_thread [aio/3] 148 serio_thread [kseriod] 212 16045567552327254 [scsi_eh_0] 228 worker_thread [kmirrord] 229 worker_thread [kmir_mon] 237 kjournald [kjournald] 1017 16045567552327254 [scsi_eh_1] 1018 16045567552327254 [qla2300_1_dpc] 1027 - udevd 1683 kauditd_thread [kauditd] 1890 kjournald [kjournald] 2342 - syslogd -m 0 2346 syslog klogd -x 2356 - irqbalance 2366 - portmap 2385 - rpc.statd 2411 - rpc.idmapd 2423 - ccsd 2474 cluster_kthread [cman_comms] 2476 serviced [cman_serviced] 2475 membership_kthrea [cman_memb] 2477 hello_kthread [cman_hbeat] 2510 rt_sigsuspend fenced -w 2521 - clvmd 2522 dlm_astd [dlm_astd] 2523 - [dlm_recvd] 2524 dlm_sendd [dlm_sendd] 2525 dlm_recoverd [dlm_recoverd] 2644 - /usr/sbin/acpid 2653 - cupsd 2676 - /usr/sbin/sshd 2689 - xinetd -stayalive -pidfile /var/run/xinetd.pid 2708 - sendmail: accepting connections 2716 pause sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue 2731 - gpm -m /dev/input/mice -t imps2 2775 - crond 2800 - xfs -droppriv -daemon 2817 - /usr/sbin/atd 2835 - dbus-daemon-1 --system 2846 - hald 2860 - clurgmgrd 2864 - /sbin/agetty ttyS0 115200 vt100-nav 2865 - /sbin/mingetty tty1 2866 - /sbin/mingetty tty2 2867 - /sbin/mingetty tty3 2869 - /sbin/mingetty tty4 2910 - /sbin/mingetty tty5 2990 - /sbin/mingetty tty6 2997 - kdm -nodaemon 3405 - /usr/X11R6/bin/X -auth /var/run/xauth/A:0-i8kJuX vt1 3406 pipe_wait -:0 3505 dlm_recoverd [dlm_recoverd] 3524 dlm_async [lock_dlm1] 3525 dlm_async [lock_dlm2] 3552 - [gfs_scand] 3553 - [gfs_glockd] 3557 - /usr/bin/kdm_greet 3558 - [gfs_recoverd] 3578 - [gfs_logd] 3579 glock_wait_intern [gfs_quotad] 3580 - [gfs_inoded] 3589 dlm_recoverd [dlm_recoverd] 3590 dlm_async [lock_dlm1] 3591 dlm_async [lock_dlm2] 3592 - [gfs_scand] 3593 gfs_glockd [gfs_glockd] 3604 - [gfs_recoverd] 3605 - [gfs_logd] 3606 - [gfs_quotad] 3607 - [gfs_inoded] 3615 - qarshd 3616 wait sh -c 
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3618 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 12441653 -r /usr/tests/sts -L HEAVY 3622 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_l 3623 - pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_load_gen.3618.active -d /tmp/revolver_ 3624 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_load_gen.3618.active -d /tmp/revolver_ 3625 glock_wait_intern growfiles -S 11654 -i 0 -N 500 -n 4 -b 3626 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_load_gen.3618.active -d /tmp/revolver_ 3627 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2488330b 2488 3628 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_load_gen.3618.active -d /tmp/revolver_ 3629 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3632 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3618 -a /tmp/revolver_load_gen.3618.active -d /tmp/revolver_ 3633 wait genesis -S 11654 -n 500 -d 150 -p 4 3634 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2488330b 2488330b:r 3635 dlm_lock_sync doio -avk 3638 wait_async genesis -S 11654 -n 500 -d 150 -p 4 3639 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3640 - genesis -S 11654 -n 500 -d 150 -p 4 3641 - genesis -S 11654 -n 500 -d 150 -p 4 3642 wait_async accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3643 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3644 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3645 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3646 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3647 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3648 - growfiles 
-S 11654 -i 0 -N 500 -n 4 -b 3649 - qarshd 3650 wait sh -c PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/cmarthal/bin: 3652 pipe_wait /usr/bin/perl -w /usr/tests/sts/gfs/bin/revolver_load_gen -S 11661784 -r /usr/tests/sts -L HEAVY 3656 wait sh -c PATH=$PATH:/usr/tests/sts/bin; pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_l 3657 - pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_load_gen.3652.active -d /tmp/revolver_ 3658 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_load_gen.3652.active -d /tmp/revolver_ 3659 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3662 dlm_lock_sync growfiles -S 11654 -i 0 -N 500 -n 4 -b 3663 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_load_gen.3652.active -d /tmp/revolver_ 3664 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3665 wait accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3666 - growfiles -S 11654 -i 0 -N 500 -n 4 -b 3667 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_load_gen.3652.active -d /tmp/revolver_ 3668 wait genesis -S 11654 -n 500 -d 150 -p 4 3669 pipe_wait pan2 -x 5 -f /tmp/revolver_load_gen.3652 -a /tmp/revolver_load_gen.3652.active -d /tmp/revolver_ 3670 wait sh -c iogen -S 11654 -f buffered -m sequential -s read,write,readv,writev -t 1b -T 2332356b 2332 3671 - genesis -S 11654 -n 500 -d 150 -p 4 3673 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3674 - genesis -S 11654 -n 500 -d 150 -p 4 3676 glock_wait_intern genesis -S 11654 -n 500 -d 150 -p 4 3677 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3678 wait_async accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3679 pipe_wait iogen -S 11654 -f buffered -m sequential -s read write readv writev -t 1b -T 2332356b 2332356b:r 3680 lm_dlm_plock accordion -S 11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3681 lm_dlm_plock accordion -S 
11654 -p 4 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5 3682 wait_async doio -avk 3722 pdflush [pdflush] 3749 pdflush [pdflush] 4592 - sshd: root@pts/0 4594 wait -bash 4997 - ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd
I wonder if this is related to a very similar problem that HP India are seeing. I have some detailed tcpdumps from them and it appears that the DLM connections are being closed during recovery. Why, I'm not sure. I've sent Ravi a patch with some printks in it so we can see if it's the DLM doing the closing (and if so, why).
It's the combination of /proc/cluster/dlm_debug and ps output that usually pinpoints where things are hung. This second case didn't include the dlm_debug output, but that probably doesn't matter much: in both cases it looks like one node is waiting for a reply from another that never arrives. A while back we saw the very same symptoms in another bug that we eventually fixed; that one was also related to connections being closed, much like what Patrick mentioned in comment 8. We should probably see if we can get a tcpdump of the hang. Are you using fixed nodeids (specified in cluster.conf)?
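For future occurrences it may help to snapshot all three /proc views on every node the moment the hang is noticed, so the dlm_debug output isn't missing next time. A minimal collection helper (a sketch, not part of the cluster tools; paths are exactly the files quoted in this report, and nodes without the cluster stack are simply reported as missing) could look like:

```python
# Snapshot the per-node cluster state files referenced in this bug.
FILES = (
    "/proc/cluster/services",
    "/proc/cluster/dlm_debug",
    "/proc/cluster/sm_debug",
)

def snapshot(paths=FILES):
    """Read each file, mapping absent files to None instead of failing."""
    out = {}
    for path in paths:
        try:
            with open(path) as f:
                out[path] = f.read()
        except OSError:
            out[path] = None  # file absent: no cluster stack on this node
    return out

if __name__ == "__main__":
    for path, data in snapshot().items():
        print(path, "->", "missing" if data is None else "%d bytes" % len(data))
```

Run via qarsh or ssh on each node and the evidence for all lockspaces lands in one place.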
I still have this cluster in this state from yesterday, in case there's any other info you'd like to see.

[root@taft-01 ~]# cat /proc/cluster/dlm_debug
s
taft1 purge locks of departed nodes
taft0 purged 1 locks
taft0 update remastered resources
taft1 purged 1 locks
taft1 update remastered resources
taft0 updated 1102 resources
taft0 rebuild locks
taft1 updated 1102 resources
taft1 rebuild locks
taft0 rebuilt 1147 locks
taft0 recover event 24 done
clvmd move flags 0,0,1 ids 18,24,24
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 24 finished
taft1 rebuilt 1162 locks
taft1 recover event 24 done
taft1 move flags 0,0,1 ids 22,24,24
taft1 process held requests
taft1 processed 0 requests
taft1 resend marked requests
taft1 resend 1100fe lq 1 flg 200000 node -1/-1 " 7
taft1 resend 15002a lq 1 flg 200000 node -1/-1 " 2
taft1 resend 1000a8 lq 1 flg 200000 node -1/-1 " 7
taft1 resend 160223 lq 1 flg 200000 node -1/-1 " 7
taft1 resend 120298 lq 1 flg 200000 node -1/-1 " 7
taft1 resent 5 requests
taft1 recover event 24 finished

[root@taft-02 ~]# cat /proc/cluster/dlm_debug
s of departed nodes
taft1 purged 2 locks
taft1 update remastered resources
taft1 updated 1102 resources
taft1 rebuild locks
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
clvmd updated 2 resources
clvmd rebuild locks
clvmd rebuilt 2 locks
clvmd recover event 16 done
clvmd move flags 0,0,1 ids 10,16,16
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 16 finished
taft1 rebuilt 1151 locks
taft0 mark waiting requests
taft0 marked 0 requests
taft0 purge locks of departed nodes
taft1 recover event 16 done
taft0 purged 3 locks
taft0 update remastered resources
taft1 move flags 0,0,1 ids 14,16,16
taft1 process held requests
taft1 processed 0 requests
taft1 resend marked requests
taft1 resend 70369 lq 1 flg 200000 node -1/-1 " 7
taft1 resent 1 requests
taft1 recover event 16 finished
taft0 updated 1102 resources
taft0 rebuild locks

[root@taft-03 ~]# cat /proc/cluster/dlm_debug
rk 302b4 lq 3 nodeid 2
taft1 mark 403eb lq 1 nodeid -1
taft1 marked 3 requests
taft1 purge locks of departed nodes
taft0 purged 1 locks
taft0 update remastered resources
taft1 purged 1 locks
taft1 update remastered resources
taft0 updated 1102 resources
taft0 rebuild locks
taft0 rebuilt 0 locks
taft0 recover event 8 done
taft1 updated 1101 resources
taft1 rebuild locks
taft1 rebuilt 0 locks
taft1 recover event 8 done
clvmd move flags 0,0,1 ids 2,8,8
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 8 finished
taft1 add_to_requestq cmd 9 fr 3
taft1 move flags 0,0,1 ids 6,8,8
taft1 process held requests
taft1 process_requestq cmd 9 fr 3
taft1 processed 1 requests
taft1 resend marked requests
taft1 resend 50022 lq 1 flg 200000 node -1/-1 " 7
taft1 resend 302b4 lq 3 flg 1200008 node 0/0 " 8
taft1 resend 403eb lq 1 flg 200000 node -1/-1 " 7
taft1 resent 3 requests
taft1 recover event 8 finished
taft-02 is stuck recovering lockspace taft0. It has called dlm_wait_function() at the end of rebuild_rsbs_send(). It is not hung on a synchronous reply from another node, as I suggested in comment 9; instead, it is waiting for its recover_list to become empty. So we need to find out why there are rsbs still sitting in the recover_list. Bugs in a variety of places could be responsible for that. To start, a patch (or better, systemtap?) should let us see exactly what's left in the recover_list. That might tell us whether bad accounting or logic has left an rsb there incorrectly, or whether a NEWLOCKIDS message from another node was never received.
Just a note that this was seen last night.
Hmm - could this one be related to the recent mount/umount issues we have been chasing? Devel ACK for 4.5.
This looks like a one-off case of dlm recovery not working properly, some odd condition or combination of events that's not being handled right. Fixing it will require a reproducible test I can run while adding progressively refined debugging.
This has not been seen in almost 7 months, I'll reopen if reproduced again.