Description of problem:
Running umount on a GFS filesystem hung after my cluster sat mostly idle for a few days. This is a three-node cluster running RHEL4-U3 code. The cluster consists of trin-12, trin-13 and trin-14, all 32-bit i686 boxes.

The sequence of events was like this:

1. I rebooted all nodes in my cluster, and everything came up normally, with the nfs service running on trin-13.
2. I used ls -l on my GFS mount point to make sure everything was up normally.
3. I used the cluster management GUI from trin-12 to disable my nfs service on trin-13.
4. I let all systems sit mostly idle for two days while I worked on other systems. No suspicious dmesgs were logged during that time. No test cases were running, and no cluster-suite-related commands were issued that I can recall.
5. On trin-12, after a couple of days of idle time, I did a umount of my GFS filesystem.

On the failing (hung) system, trin-12, I dumped the task status, and the umount task looked like this:

umount        D 00008EE2  2228  5399   3729             (NOTLB)
dda46e80 00000082 54feae58 00008ee2 df194600 0000041f 54ff81f4 00008ee2
d7bc4c50 d7bc4ddc df5c51a8 df5c5180 dda46e9c dda46ed0 c030f021 00000000
d7bc4c50 c011d0ce 00000000 00000000 54feadb7 dda46eb8 c011c50c 00000001
Call Trace:
 [<c030f021>] wait_for_completion+0x11f/0x206
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c011c50c>] activate_task+0x53/0x5f
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e153e021>] kcl_leave_service+0xe1/0x142 [cman]
 [<e01e0f56>] release_mountgroup+0x137/0x152 [lock_dlm]
 [<e01e357d>] lm_dlm_unmount+0x21/0x45 [lock_dlm]
 [<e018635e>] lm_unmount+0x39/0x6d [lock_harness]
 [<e197a92c>] gfs_lm_unmount+0x1e/0x26 [gfs]
 [<e198b07a>] gfs_put_super+0x302/0x331 [gfs]
 [<c016f471>] generic_shutdown_super+0x119/0x2eb
 [<e1988a80>] gfs_kill_sb+0x1f/0x43 [gfs]
 [<c016f145>] deactivate_super+0xc5/0xda
 [<c018a388>] sys_umount+0x65/0x6c
 [<c015b195>] unmap_vma_list+0xe/0x17
 [<c015b544>] do_munmap+0x1c8/0x1d2
 [<c018a39a>] sys_oldumount+0xb/0xe
 [<c03115af>] syscall_call+0x7/0xb

So cman is hanging, waiting for a reply or something.

On the failing system, I ran clustat and got:

[root@trin-12 src]# clustat
Member Status: Quorate

  Member Name          Status
  ------ ----          ------
  trin-12              Online, Local, rgmanager
  trin-13              Online, rgmanager
  trin-14              Online, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  nfssvc               (trin-13)            disabled

On the failing system, I did cat /proc/cluster/services and saw:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]
DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[1 2 3]
DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]
GFS Mount Group: "bobs_gfs"                          4   5 run       S-11,208,1
[1 2 3]
User:            "usrm::manager"                     5   6 run       -
[1 2 3]

So GFS is trying to leave the mount group "bobs_gfs". According to Dave Teigland, it looks like cman/sm has started leaving the group and is waiting for some acks from the other nodes.
S-11 == SEST_LEAVE_ACKWAIT

Doing cat /proc/cluster/services on the other nodes produced:

[root@trin-13 cluster-i686-2005-12-21-1504]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]
DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[1 2 3]
DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]
GFS Mount Group: "bobs_gfs"                          4   5 run       -
[1 2 3]
User:            "usrm::manager"                     5   6 run       -
[1 2 3]

And:

[root@trin-14 src]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]
DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[2 1 3]
DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]
GFS Mount Group: "bobs_gfs"                          4   5 run       -
[2 1 3]
User:            "usrm::manager"                     5   6 run       -
[1 2 3]

Which seemed to indicate the other nodes maybe didn't see the message they were supposed to ack.

Netstat -a on trin-12 produced this:

[root@trin-12 ~]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 *:867                       *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:32778                     *:*                         LISTEN
tcp        0      0 *:32779                     *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:883                       *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redha:32775 trin-14.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-12.lab.msp.redha:32774 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redhat.:ssh technetium.msp.redhat:52981 ESTABLISHED
tcp        0      0 trin-12.lab.msp.redhat.:ssh technetium.msp.redhat:53897 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32771                     *:*
udp        0      0 *:32772                     *:*
udp        0      0 trin-12.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:807                       *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:864                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:880                       *:*
udp        0      0 *:50007                     *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     6551   /var/run/cluster/ccsd.sock
unix  2      [ ACC ]     STREAM     LISTENING     7577   /tmp/.font-unix/fs7100
unix  2      [ ]         DGRAM                    7686   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    4387   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     6824   @clvmd
unix  14     [ ]         DGRAM                    6339   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     7090   /var/run/acpid.socket
unix  2      [ ACC ]     STREAM     LISTENING     7619   /var/run/dbus/system_bus_socket
unix  2      [ ]         DGRAM                    110454
unix  2      [ ]         DGRAM                    83382
unix  2      [ ]         DGRAM                    75460
unix  2      [ ]         DGRAM                    50446
unix  2      [ ]         DGRAM                    9926
unix  2      [ ]         DGRAM                    7767
unix  3      [ ]         STREAM     CONNECTED     7685   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7684
unix  2      [ ]         DGRAM                    7638
unix  3      [ ]         STREAM     CONNECTED     7622
unix  3      [ ]         STREAM     CONNECTED     7621
unix  2      [ ]         DGRAM                    7612
unix  2      [ ]         DGRAM                    7525
unix  2      [ ]         DGRAM                    7175
unix  2      [ ]         DGRAM                    6837
unix  2      [ ]         DGRAM                    6747
unix  2      [ ]         DGRAM                    6533
unix  3      [ ]         STREAM     CONNECTED     6514
unix  3      [ ]         STREAM     CONNECTED     6513
unix  2      [ ]         DGRAM                    6347

Netstat -a on trin-13 produced this:

[root@trin-13 cluster-i686-2005-12-21-1504]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 *:776                       *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:5801                      *:*                         LISTEN
tcp        0      0 *:5901                      *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:6001                      *:*                         LISTEN
tcp        0      0 *:32789                     *:*                         LISTEN
tcp        0      0 *:32790                     *:*                         LISTEN
tcp        0      0 *:792                       *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redha:21064 trin-13.lab.msp.redha:32784 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:32774 trin-14.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:32784 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:21064 trin-12.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:21064 trin-14.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redhat:5901 technetium.msp.redhat:51801 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:6001                      *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redhat.:ssh technetium.msp.redhat:58444 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32771                     *:*
udp        0      0 *:32772                     *:*
udp        0      0 *:773                       *:*
udp        0      0 *:789                       *:*
udp        0      0 trin-13.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:716                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:50007                     *:*
raw        0      0 *:icmp                      *:*                         7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     7186   /tmp/.font-unix/fs7100
unix  2      [ ACC ]     STREAM     LISTENING     7248   /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     132158 @/tmp/fam-root-
unix  2      [ ACC ]     STREAM     LISTENING     131882 /tmp/.X11-unix/X1
unix  2      [ ACC ]     STREAM     LISTENING     131968 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  2      [ ACC ]     STREAM     LISTENING     131975 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  2      [ ACC ]     STREAM     LISTENING     132098 /tmp/.ICE-unix/11992
unix  2      [ ACC ]     STREAM     LISTENING     132456 /tmp/mapping-root
unix  2      [ ACC ]     STREAM     LISTENING     132107 /tmp/keyring-UBDS9K/socket
unix  2      [ ACC ]     STREAM     LISTENING     132123 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  2      [ ACC ]     STREAM     LISTENING     132143 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  2      [ ACC ]     STREAM     LISTENING     132267 /tmp/orbit-root/linc-2f25-0-4adc9d44cf36
unix  2      [ ACC ]     STREAM     LISTENING     132300 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  2      [ ACC ]     STREAM     LISTENING     132322 /tmp/orbit-root/linc-2f2d-0-2f96663b273b
unix  2      [ ACC ]     STREAM     LISTENING     132343 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  2      [ ACC ]     STREAM     LISTENING     132414 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  2      [ ACC ]     STREAM     LISTENING     132443 /tmp/orbit-root/linc-2f33-0-765a6e5a7056f
unix  2      [ ACC ]     STREAM     LISTENING     132500 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  2      [ ACC ]     STREAM     LISTENING     132530 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  2      [ ACC ]     STREAM     LISTENING     132567 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  2      [ ACC ]     STREAM     LISTENING     132597 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  2      [ ACC ]     STREAM     LISTENING     132704 /tmp/orbit-root/linc-2f4a-0-745b5083a41a3
unix  2      [ ACC ]     STREAM     LISTENING     7006   /dev/gpmctl
unix  16     [ ]         DGRAM                    6000   /dev/log
unix  2      [ ]         DGRAM                    7429   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    4301   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     6534   @clvmd
unix  2      [ ACC ]     STREAM     LISTENING     6222   /var/run/cluster/ccsd.sock
unix  2      [ ACC ]     STREAM     LISTENING     6827   /var/run/acpid.socket
unix  2      [ ]         DGRAM                    149404
unix  3      [ ]         STREAM     CONNECTED     132707 /tmp/orbit-root/linc-2f4a-0-745b5083a41a3
unix  3      [ ]         STREAM     CONNECTED     132706
unix  3      [ ]         STREAM     CONNECTED     132703 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132702
unix  3      [ ]         STREAM     CONNECTED     132648 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132647
unix  2      [ ]         DGRAM                    132642
unix  3      [ ]         STREAM     CONNECTED     132630 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132629
unix  3      [ ]         STREAM     CONNECTED     132612 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132611
unix  3      [ ]         STREAM     CONNECTED     132610 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132609
unix  3      [ ]         STREAM     CONNECTED     132604 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132603
unix  3      [ ]         STREAM     CONNECTED     132602 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132601
unix  3      [ ]         STREAM     CONNECTED     132600 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132599
unix  3      [ ]         STREAM     CONNECTED     132596 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132595
unix  3      [ ]         STREAM     CONNECTED     132590 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132589
unix  3      [ ]         STREAM     CONNECTED     132582 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132581
unix  3      [ ]         STREAM     CONNECTED     132580 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132579
unix  3      [ ]         STREAM     CONNECTED     132574 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132573
unix  3      [ ]         STREAM     CONNECTED     132572 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132571
unix  3      [ ]         STREAM     CONNECTED     132570 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132569
unix  3      [ ]         STREAM     CONNECTED     132566 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132565
unix  3      [ ]         STREAM     CONNECTED     132560 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132559
unix  3      [ ]         STREAM     CONNECTED     132548 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132547
unix  3      [ ]         STREAM     CONNECTED     132546 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132545
unix  3      [ ]         STREAM     CONNECTED     132537 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132536
unix  3      [ ]         STREAM     CONNECTED     132535 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132534
unix  3      [ ]         STREAM     CONNECTED     132533 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132532
unix  3      [ ]         STREAM     CONNECTED     132529 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132528
unix  3      [ ]         STREAM     CONNECTED     132523 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132522
unix  3      [ ]         STREAM     CONNECTED     132515 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132514
unix  3      [ ]         STREAM     CONNECTED     132513 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132512
unix  3      [ ]         STREAM     CONNECTED     132507 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132506
unix  3      [ ]         STREAM     CONNECTED     132505 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132504
unix  3      [ ]         STREAM     CONNECTED     132503 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132502
unix  3      [ ]         STREAM     CONNECTED     132499 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132498
unix  3      [ ]         STREAM     CONNECTED     132493 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132492
unix  3      [ ]         STREAM     CONNECTED     132460 /tmp/mapping-root
unix  3      [ ]         STREAM     CONNECTED     132452
unix  3      [ ]         STREAM     CONNECTED     132446 /tmp/orbit-root/linc-2f33-0-765a6e5a7056f
unix  3      [ ]         STREAM     CONNECTED     132445
unix  3      [ ]         STREAM     CONNECTED     132442 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132441
unix  3      [ ]         STREAM     CONNECTED     132438 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132437
unix  3      [ ]         STREAM     CONNECTED     132431 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132430
unix  3      [ ]         STREAM     CONNECTED     132429 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132428
unix  3      [ ]         STREAM     CONNECTED     132427 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132426
unix  3      [ ]         STREAM     CONNECTED     132425 /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     132424
unix  3      [ ]         STREAM     CONNECTED     132423 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132422
unix  3      [ ]         STREAM     CONNECTED     132421 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132420
unix  3      [ ]         STREAM     CONNECTED     132417 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132416
unix  3      [ ]         STREAM     CONNECTED     132413 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132412
unix  3      [ ]         STREAM     CONNECTED     132404 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132403
unix  3      [ ]         STREAM     CONNECTED     132402 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132401
unix  3      [ ]         STREAM     CONNECTED     132398 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132397
unix  3      [ ]         STREAM     CONNECTED     132396 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132395
unix  3      [ ]         STREAM     CONNECTED     132394 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132393
unix  3      [ ]         STREAM     CONNECTED     132374 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132371
unix  3      [ ]         STREAM     CONNECTED     132370 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132369
unix  3      [ ]         STREAM     CONNECTED     132346 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132345
unix  3      [ ]         STREAM     CONNECTED     132342 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132341
unix  3      [ ]         STREAM     CONNECTED     132338 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132337
unix  3      [ ]         STREAM     CONNECTED     132332 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132331
unix  3      [ ]         STREAM     CONNECTED     132327 /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     132326
unix  3      [ ]         STREAM     CONNECTED     132325 /tmp/orbit-root/linc-2f2d-0-2f96663b273b
unix  3      [ ]         STREAM     CONNECTED     132324
unix  3      [ ]         STREAM     CONNECTED     132321 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132320
unix  3      [ ]         STREAM     CONNECTED     132317 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132316
unix  3      [ ]         STREAM     CONNECTED     132311 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132310
unix  3      [ ]         STREAM     CONNECTED     132309 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132308
unix  3      [ ]         STREAM     CONNECTED     132307 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132306
unix  3      [ ]         STREAM     CONNECTED     132303 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132302
unix  3      [ ]         STREAM     CONNECTED     132299 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132298
unix  3      [ ]         STREAM     CONNECTED     132294 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132293
unix  3      [ ]         STREAM     CONNECTED     132286 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132285
unix  3      [ ]         STREAM     CONNECTED     132274 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132273
unix  3      [ ]         STREAM     CONNECTED     132272 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132271
unix  3      [ ]         STREAM     CONNECTED     132270 /tmp/orbit-root/linc-2f25-0-4adc9d44cf36
unix  3      [ ]         STREAM     CONNECTED     132269
unix  3      [ ]         STREAM     CONNECTED     132266 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132265
unix  3      [ ]         STREAM     CONNECTED     132209 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132203
unix  3      [ ]         STREAM     CONNECTED     132189 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132188
unix  3      [ ]         STREAM     CONNECTED     132187 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132186
unix  3      [ ]         STREAM     CONNECTED     132160 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132159
unix  3      [ ]         STREAM     CONNECTED     132146 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132145
unix  3      [ ]         STREAM     CONNECTED     132142 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132141
unix  3      [ ]         STREAM     CONNECTED     132136 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132135
unix  3      [ ]         STREAM     CONNECTED     132129 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  3      [ ]         STREAM     CONNECTED     132128
unix  3      [ ]         STREAM     CONNECTED     132127 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132126
unix  3      [ ]         STREAM     CONNECTED     132097 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  3      [ ]         STREAM     CONNECTED     132096
unix  3      [ ]         STREAM     CONNECTED     132095 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     131974
unix  2      [ ]         DGRAM                    131964
unix  3      [ ]         STREAM     CONNECTED     131901 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131900
unix  3      [ ]         STREAM     CONNECTED     131893 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131892
unix  3      [ ]         STREAM     CONNECTED     131891 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131890
unix  2      [ ]         DGRAM                    96275
unix  2      [ ]         DGRAM                    26684
unix  2      [ ]         DGRAM                    7525
unix  3      [ ]         STREAM     CONNECTED     7428   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7427
unix  3      [ ]         STREAM     CONNECTED     7410   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7409
unix  2      [ ]         DGRAM                    7267
unix  3      [ ]         STREAM     CONNECTED     7251
unix  3      [ ]         STREAM     CONNECTED     7250
unix  2      [ ]         DGRAM                    7231
unix  2      [ ]         DGRAM                    7134
unix  2      [ ]         DGRAM                    6933
unix  2      [ ]         DGRAM                    6891
unix  2      [ ]         DGRAM                    6545
unix  2      [ ]         DGRAM                    6442
unix  2      [ ]         DGRAM                    6418
unix  2      [ ]         DGRAM                    6204
unix  3      [ ]         STREAM     CONNECTED     6185
unix  3      [ ]         STREAM     CONNECTED     6184
unix  2      [ ]         DGRAM                    6008

Netstat -a on trin-14 produced this:

[root@trin-14 src]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 trin-14.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:32777                     *:*                         LISTEN
tcp        0      0 *:32778                     *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:622                       *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 *:1021                      *:*                         LISTEN
tcp        0      0 trin-14.lab.msp.redha:32774 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-14.lab.msp.redha:21064 trin-12.lab.msp.redha:32775 ESTABLISHED
tcp        0      0 trin-14.lab.msp.redha:21064 trin-13.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0   1104 trin-14.lab.msp.redhat.:ssh technetium.msp.redhat:56306 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32769                     *:*
udp        0      0 *:32770                     *:*
udp        0      0 trin-14.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:925                       *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:619                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:1018                      *:*
udp        0      0 *:50007                     *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     6222   /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     5958   /dev/gpmctl
unix  2      [ ACC ]     STREAM     LISTENING     5191   /var/run/cluster/ccsd.sock
unix  14     [ ]         DGRAM                    4969   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     6156   /tmp/.font-unix/fs7100
unix  2      [ ]         DGRAM                    6406   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    3397   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     5476   @clvmd
unix  2      [ ACC ]     STREAM     LISTENING     5778   /var/run/acpid.socket
unix  2      [ ]         DGRAM                    112847
unix  2      [ ]         DGRAM                    83455
unix  2      [ ]         DGRAM                    10957
unix  2      [ ]         DGRAM                    8355
unix  2      [ ]         DGRAM                    6512
unix  3      [ ]         STREAM     CONNECTED     6405   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     6404
unix  3      [ ]         STREAM     CONNECTED     6387   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     6386
unix  2      [ ]         DGRAM                    6246
unix  3      [ ]         STREAM     CONNECTED     6230
unix  3      [ ]         STREAM     CONNECTED     6229
unix  2      [ ]         DGRAM                    6197
unix  2      [ ]         DGRAM                    6087
unix  2      [ ]         DGRAM                    5885
unix  2      [ ]         DGRAM                    5844
unix  2      [ ]         DGRAM                    5490
unix  2      [ ]         DGRAM                    5395
unix  2      [ ]         DGRAM                    5175
unix  3      [ ]         STREAM     CONNECTED     5156
unix  3      [ ]         STREAM     CONNECTED     5155
unix  2      [ ]         DGRAM                    4977

Again, no dmesgs were reported on the nodes in my cluster. Cluster Suite tasks were in this state:

ccsd          S 00000000  2548  2338     1          2339  2324 (NOTLB)
ddaeaeb0 00000082 c014c301 00000000 00000001 0000048c 5bfca29b 00008f4e
ddcc51a0 ddcc532c de00c280 7fffffff 00000000 ddaeaf74 c030fef1 db52d518
c017d342 00000246 db57f100 ddaeaf58 db52d518 00000246 00000246 ddee1300
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c017d342>] __pollwait+0x2d/0x94
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c01856cc>] destroy_inode+0x3d/0x4c
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
ccsd          S 00000000  2252  2339     1          2389  2338 (NOTLB)
dd3a6eb0 00000082 c014c301 00000000 ddcc4bd0 00037232 b3f31959 00000022
df130230 df1303bc de3bbb80 7fffffff 00000000 dd3a6f74 c030fef1 db52d918
c017d342 00000246 dbf52580 dd3a6f58 db52d718 00000246 dd01a680 dd3a6f58
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c017d342>] __pollwait+0x2d/0x94
 [<c02b2c44>] datagram_poll+0x25/0xd1
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
cman_comms    S 0000001C  1796  2389     1          2390  2339 (L-TLB)
ddba1f90 00000046 ddba1f44 0000001c ddba1f3c 00000102 c7275423 00008f4f
df131970 df131afc ddba1000 e154b740 e154b760 db46f700 e152e35d 0000001f
00000000 1d244b3c 00000000 0000000a e153f85c 00000000 00000000 df126eac
Call Trace:
 [<e152e35d>] cluster_kthread+0x10c/0x594 [cman]
 [<c03114ce>] ret_from_fork+0x6/0x14
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e152e251>] cluster_kthread+0x0/0x594 [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_serviced S 0000005B  2672  2391     3          2440  1685 (L-TLB)
ddb85fc8 00000046 df194600 0000005b df131970 00000039 5501bfb3 00008ee2
df194600 df19478c ddb85000 df126e6c 00000000 e1539617 e153972e c0139ad9
fffffffc ffffffff ffffffff c0139a70 00000000 00000000 00000000 c01041dd
Call Trace:
 [<e1539617>] serviced+0x0/0x140 [cman]
 [<e153972e>] serviced+0x117/0x140 [cman]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_memb     S 00000008  1868  2390     1          2392  2389 (L-TLB)
ddb88fb4 00000046 ddb88f80 00000008 ddb88f78 0000012c 9073e7bc 00008f4f
df1313a0 df13152c df1313a0 00000000 ddb88fdc 00000000 e1534a12 0000001f
00000000 e15347e3 00000000 df1313a0 c011d0ce db4e5db0 db4e5db0 00000000
Call Trace:
 [<e1534a12>] membership_kthread+0x22f/0x520 [cman]
 [<e15347e3>] membership_kthread+0x0/0x520 [cman]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e15347e3>] membership_kthread+0x0/0x520 [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_hbeat    S 00008F4F  2636  2392     1          2427  2390 (L-TLB)
ddbecfd4 00000046 c692120b 00008f4f df131970 00000937 c692120b 00008f4f
df194bd0 df194d5c ddbec000 df194bd0 e154bb80 00000000 e1534559 0000001f
00000000 e153449a 00000000 00000000 c01041dd 00000000 00000000 00000000
Call Trace:
 [<e1534559>] hello_kthread+0xbf/0x13c [cman]
 [<e153449a>] hello_kthread+0x0/0x13c [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
fenced        S DDA13FA8  2724  2427     1          2439  2392 (NOTLB)
dda13f9c 00000086 0000000b dda13fa8 00000000 0000a5d9 05f8d371 00000023
df130dd0 df130f5c dda13000 dda13fac bff2d380 dda13000 c01054ae 00004200
00000000 00000000 00000000 bff2d380 00000000 bff2d49c c03115af bff2d380
Call Trace:
 [<c01054ae>] sys_rt_sigsuspend+0x209/0x224
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clvmd         S C0360CF4  2212  2439     1          2446  2427 (NOTLB)
dd445eb0 00000086 000000d0 c0360cf4 dd445f58 000007b7 c1d0ff55 00008f4a
ddc213a0 ddc2152c 096484d8 096484d8 00000000 dd445f74 c030ff92 c035f430
ddbc7eb8 096484d8 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c02acb83>] sock_ioctl+0x2dd/0x38b
 [<c012614b>] sys_time+0xf/0x58
 [<c03115af>] syscall_call+0x7/0xb
clvmd         S 00000024  3436  2446     1          2447  2439 (NOTLB)
dd963f14 00000086 a6a1e790 00000024 de2f8700 00001971 a6a1ed5e 00000024
de19d8f0 de19da7c db5cbd80 db4fb300 dd963f4c dd963fac e154ffb3 dda7ef5c
00000003 dda7eef8 00000246 c040d544 00000038 b7f703d0 00000000 de19d8f0
Call Trace:
 [<e154ffb3>] dlm_read+0x187/0x625 [dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c013a661>] wake_futex+0x3a/0x44
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c0168bb2>] vfs_read+0xb6/0xe2
 [<c0168dc5>] sys_read+0x3c/0x62
 [<c03115af>] syscall_call+0x7/0xb
clvmd         S 00000024  2472  2447     1          2570  2446 (NOTLB)
de695e94 00000086 a6a2a647 00000024 de2f8700 00001c6d a6c2f991 00000024
de19c780 de19c90c 00000000 7fffffff de695000 de695ef0 c030fef1 00000001
de695ef8 de695ef8 de695ef8 de0ae530 de695ef8 c013abf8 1d244b3c 00000000
Call Trace:
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c013abf8>] queue_me+0x59/0x121
 [<c013af34>] futex_wait+0x133/0x196
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c013b1d9>] do_futex+0x29/0x5a
 [<c013b30b>] sys_futex+0x101/0x10c
 [<c03115af>] syscall_call+0x7/0xb
dlm_astd      S DDB1CF84  3680  2440     3          2441  2391 (L-TLB)
ddb1cfa4 00000046 e156c0a0 ddb1cf84 d7ae1770 0000002b 5dcd33dc 00008f4e
ddc20800 ddc2098c ddb1c000 dd445e60 00000000 e154eb46 e154ecab 1d244b3c
00000000 0000000a e1562aa4 00000000 00000000 dd445e60 00000000 ddb1c000
Call Trace:
 [<e154eb46>] dlm_astd+0x0/0x1ff [dlm]
 [<e154ecab>] dlm_astd+0x165/0x1ff [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recvd     S 00000000  2548  2441     3          2442  2440 (L-TLB)
ddb1bfa4 00000046 00000000 00000000 00000000 0000016e 98dccbad 00008f4e
ddc21970 ddc21afc ddb1b000 dd445e54 00000000 e1559620 e15596aa 1d244b3c
00000000 0000000a e1564010 00000000 00000000 dd445e54 00000000 ddb1b000
Call Trace:
 [<e1559620>] dlm_recvd+0x0/0xa7 [dlm]
 [<e15596aa>] dlm_recvd+0x8a/0xa7 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_sendd     S 00000048  3104  2442     3          2443  2441 (L-TLB)
ddb17fa4 00000046 ccfc8de0 00000048 00000048 000002de 980093a0
00008f4e
df5e2cd0 df5e2e5c ddb17000 dd445e54 00000000 e155989c e1559926 1d244b3c
00000000 0000000a e1564010 00000000 00000000 dd445e54 00000000 ddb17000
Call Trace:
 [<e155989c>] dlm_sendd+0x0/0xac [dlm]
 [<e1559926>] dlm_sendd+0x8a/0xac [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recoverd  S DF7D5600  3512  2443     3          2514  2442 (L-TLB)
ddb10fc0 00000046 00000003 df7d5600 df131970 00003d56 97d9e54a 00000023
de2f8cd0 de2f8e5c ddb10000 df7d5600 df7d5600 e1561482 e15614a8 ddb10000
dd445e3c c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recoverd  S DF7C9C00  3500  2514     3          2515  2443 (L-TLB)
ddf96fc0 00000046 00000003 df7c9c00 df131970 00004464 26e0767a 00000025
de19c1b0 de19c33c ddf96000 df7c9c00 df7c9c00 e1561482 e15614a8 ddf96000
ddb95bfc c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
lock_dlm1     S 00008EE2  2900  2515     3          2516  2514 (L-TLB)
dda32f6c 00000046 54fc79fd 00008ee2 ddcc4030 0000005e 54fc9d5d 00008ee2
de19d320 de19d4ac df7c6400 df59a780 dda32f94 00000000 e01e6c04 0000008c
01000000 84c4eac8 00000000 de19d320 c011d0ce 00000000 00000000 c030ed50
Call Trace:
 [<e01e6c04>] dlm_async+0x11c/0x416 [lock_dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c030ed50>] schedule+0x438/0x5ea
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e01e6ae8>] dlm_async+0x0/0x416 [lock_dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
lock_dlm2     S DF7C6400  2656  2516     3          3713  2515 (L-TLB)
ddaf4f6c 00000046 df59aa80 df7c6400 d7bc4c50 00000035 54fc9f3a 00008ee2
ddcc4030 ddcc41bc df7c6400 df59aa80 ddaf4f94 d89ab520 e01e6c04 0000008c
01000000 866bb168 00000000 ddcc4030 c011d0ce 00000000 00000000 c030ed50
Call Trace:
 [<e01e6c04>] dlm_async+0x11c/0x416 [lock_dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c030ed50>] schedule+0x438/0x5ea
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e01e6ae8>] dlm_async+0x0/0x416 [lock_dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
hald          S 00000001  2460  2749     1          2764  2739 (NOTLB)
ddbd9f1c 00000086 00000000 00000001 00000000 00000bb1 9bc284c8 00008f4f
d7c7d3a0 d7c7d52c 0963f3d4 0963f3d4 ddbd9fa0 c7744240 c030ff92 c035f338
c035f338 0963f3d4 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017dd21>] do_poll+0x8d/0xab
 [<c017dede>] sys_poll+0x19f/0x24d
 [<c017d315>] __pollwait+0x0/0x94
 [<c0126225>] sys_gettimeofday+0x53/0xac
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clurgmgrd     S D7EA6F58  2272  2764     1          2769  2749 (NOTLB)
d7ea6eb0 00000086 c0360cf4 d7ea6f58 000000d0 00000562 5e37c94b 00008f4e
d7ae1770 d7ae18fc 0963fe3f 0963fe3f 00000000 d7ea6f74 c030ff92 ddc370c4
db8bccf4 0963fe3f 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clurgmgrd     S 00000000  2948  3711     1          3672        (NOTLB)
ddbf7eb0 00000086 c014c301 00000000 00000001 000007aa c72a3afe 00008f4f
d7bc40b0 d7bc423c 0963f2cc 0963f2cc 00000000 ddbf7f74 c030ff92 c035f330
c035f330 0963f2cc 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c030ed50>] schedule+0x438/0x5ea
 [<c03115af>] syscall_call+0x7/0xb
dlm_recoverd  S DF7D3C00  3500  3713     3          2516        (L-TLB)
d761efc0 00000046 00000003 df7d3c00 df131970 000057ff debe506d 00000028
ddc20230 ddc203bc d761e000 df7d3c00 df7d3c00 e1561482 e15614a8 d761e000
d7ea6e3c c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb

Version-Release number of selected component (if applicable):
cman-1.0.4-0

How reproducible:
I haven't tried to reproduce it.

Steps to Reproduce:
1. Reboot all systems in the cluster.
2. Use the cluster management GUI to stop the nfs service.
3. Wait a couple of days (this step may be unnecessary).
4. Do a umount of your GFS filesystem.

Actual results:
The umount command hangs.

Expected results:
The umount should complete normally.

Additional info:
I would suggest that it's pretty unlikely that a message has simply failed to get through to another node. Bugzilla history indicates that multiple resends (and node-down events when they don't arrive) are much more common in the event of networking glitches.
I appear to have hit this exact issue as well on link-08, just by running mount_stress on my three-node cluster (link-01, link-02, link-08).

################ itr=488 ################
unmounting on link-02.../mnt/link2.../mnt/link0.../mnt/link3...
unmounting on link-08.../mnt/link1... [HANG]

Dump of the umount task:

Jan 11 05:16:28 link-08 kernel: umount        D 000001001f27c248     0  5067    851 (NOTLB)
Jan 11 05:16:28 link-08 kernel: 000001003b89dd18 0000000000000002 000000000060f9c0 ffffffff00000073
Jan 11 05:16:28 link-08 kernel: 0000010020a6d7f0 0000000000000073 0000010001709fc0 0000000067616c66
Jan 11 05:16:28 link-08 kernel: 000001001a5bc7f0 00000000001041e3
Jan 11 05:16:28 link-08 kernel: Call Trace:<ffffffff80304c65>{wait_for_completion+167} <ffffffff80133401>{default_wake_function+0}
Jan 11 05:16:28 link-08 kernel: <ffffffff80133401>{default_wake_function+0} <ffffffffa021c021>{:cman:kcl_leave_service+249}
Jan 11 05:16:28 link-08 kernel: <ffffffffa02b8dc3>{:lock_dlm:release_mountgroup+189}
Jan 11 05:16:28 link-08 kernel: <ffffffffa02baae5>{:lock_dlm:lm_dlm_unmount+38} <ffffffffa02503ae>{:lock_harness:lm_unmount+62}
Jan 11 05:16:28 link-08 kernel: <ffffffffa026b73f>{:gfs:gfs_lm_unmount+33} <ffffffffa027a5a4>{:gfs:gfs_put_super+830}
Jan 11 05:16:28 link-08 kernel: <ffffffff8017d0b9>{generic_shutdown_super+202} <ffffffffa0277fdd>{:gfs:gfs_kill_sb+42}
Jan 11 05:16:28 link-08 kernel: <ffffffff801ccc20>{dummy_inode_permission+0} <ffffffff8017cfd6>{deactivate_super+95}
Jan 11 05:16:28 link-08 kernel: <ffffffff8019253b>{sys_umount+925} <ffffffff801e865d>{__up_write+20}
Jan 11 05:16:28 link-08 kernel: <ffffffff8016b7d0>{sys_munmap+94} <ffffffff801101c6>{system_call+126}
Jan 11 05:16:28 link-08 kernel:

[root@link-08 ~]# ps -C umount
  PID TTY          TIME CMD
 5067 ?        00:00:00 umount

[root@link-08 ~]# lsof -p 5067
COMMAND  PID USER   FD   TYPE DEVICE     SIZE    NODE NAME
umount  5067 root  cwd    DIR  253,0     4096       2 /
umount  5067 root  rtd    DIR  253,0     4096       2 /
umount  5067 root  txt    REG  253,0    57264 6340698 /bin/umount
umount  5067 root  mem    REG  253,0 39554608 4165972 /usr/lib/locale/locale-archive
umount  5067 root  mem    REG  253,0   105202 1409081 /lib64/ld-2.3.4.so
umount  5067 root  mem    REG  253,0  1489988 1409262 /lib64/tls/libc-2.3.4.so
umount  5067 root    0u  IPv4 242736              TCP link-08.lab.msp.redhat.com:52364->joynter.lab.msp.redhat.com:5010 (ESTABLISHED)
umount  5067 root    1w   CHR    1,3             1861 /dev/null
umount  5067 root    2w   CHR    1,3             1861 /dev/null
umount  5067 root    4u  IPv4 242736              TCP link-08.lab.msp.redhat.com:52364->joynter.lab.msp.redhat.com:5010 (ESTABLISHED)
umount  5067 root    5u  IPv4 242737              TCP link-08.lab.msp.redhat.com:52365->joynter.lab.msp.redhat.com:5011 (ESTABLISHED)
umount  5067 root    6u  IPv4 242738              TCP link-08.lab.msp.redhat.com:52366->joynter.lab.msp.redhat.com:5012 (ESTABLISHED)

[root@link-01 ~]# cat /proc/cluster/services
Service          Name                              GID  LID  State     Code
Fence Domain:    "default"                           1    2  run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             3    4  run       -
[1 2 3]
DLM Lock Space:  "link1"                          3730 2625  run       -
[1 3]
DLM Lock Space:  "link3"                          3728 2627  run       -
[1]
GFS Mount Group: "link1"                          3731 2626  run       -
[1 3]
GFS Mount Group: "link3"                          3729 2628  run       -
[1]

[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID  LID  State     Code
Fence Domain:    "default"                           1    2  run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             3    4  run       -
[1 2 3]

[root@link-08 ~]# cat /proc/cluster/services
Service          Name                              GID  LID  State     Code
Fence Domain:    "default"                           1    2  run       -
[1 2 3]
DLM Lock Space:  "clvmd"                             3    4  run       -
[1 2 3]
DLM Lock Space:  "link1"                          3730 2541  run       -
[1 3]
DLM Lock Space:  "link2"                          3724 2543  run       -
[3]
GFS Mount Group: "link1"                          3731 2542  run       S-11,208,1
[1 3]
GFS Mount Group: "link2"                          3725 2544  run       -
[3]
I suspect this may be a regression from RHEL4U2 since it looks repeatable and hasn't been reported outside of RHEL4U3 testing.
I get the same result using the STABLE branch. I had three nodes sitting idle for a couple days and the umount hung just as above. I'll be looking first for counters that might be rolling over given enough time.
bumping the severity to reflect comment #3...
It seems to be the "fix" for bz#173621 causing this, but I haven't been able to work out why, or why the symptoms are as bizarre as they are.

To reproduce: run the attached program (below) on two nodes. Eventually it will stall even though the cluster remains up. Now try to bring another node into the cluster and it will die with all sorts of weird messages about inconsistent cluster views and transition timeouts.

Back out the change mentioned above:

cvs update -r1.42.2.17 cman-kernel/src/cnxman.c

and it will run seemingly forever.

This also very likely explains bz#176872.
Created attachment 123162 [details]
libcman program to demonstrate hang
Created attachment 123170 [details]
Proposed patch for testing

This is what I think the patch should look like (with an extra printk). Please test it if you can and see how many open cman/dlm bugs it actually fixes :)
First of all, I verified that Patrick is indeed right about how and why the code is failing. With my /proc/cluster/msg_history patch, I was able to see the cluster run normally until the 16-bit cman sequence number wrapped from 0xffff to 0x0000. That's when the systems in the cluster started ignoring each other's messages. When I let my third node back into the cluster, it caused my second node to reproduce the symptoms of bz#176872 (kernel panic). This probably DOES explain bz#177163, because part of that symptom was "wait 2 days". Based on very rough "hello message" estimates on an idle 3-node cluster, I predicted that the time for the sequence number to wrap is about 18 hours. With Patrick's test program, the time to recreate is reduced to about 13 minutes.

I have done a fair amount of testing on Patrick's test patch (minus the extra printk). The patch appears to work fine, and I ran the test program until the cman sequence number wrapped several times. (I was able to see this by using my msg_history patch.) The same test failed without Patrick's test patch.

My concern is that we can still lose/ignore cman messages when we wrap from positive numbers to negative numbers, as illustrated by the following simple program and its accompanying output:

#include <stdio.h>

int main()
{
	unsigned short a, b, i;

	for (a = 32765, b = 32763, i = 0; i < 8; i++, a++, b++) {
		printf("a=%04X b=%04X: a-b<=0? ", a, b);
		if ((short)(((short)(a) - (short)b) <= 0))
			printf("yes.\n");
		else
			printf("no.\n");
	}
}

Yields this output:

a=7FFD b=7FFB: a-b<=0? no.
a=7FFE b=7FFC: a-b<=0? no.
a=7FFF b=7FFD: a-b<=0? no.
a=8000 b=7FFE: a-b<=0? yes.
a=8001 b=7FFF: a-b<=0? yes.
a=8002 b=8000: a-b<=0? no.
a=8003 b=8001: a-b<=0? no.
a=8004 b=8002: a-b<=0? no.

Wrapping from 0xffff to 0x0000 works properly with the patch.

I made my own version of a test patch to solve this problem using 32-bit unsigned numbers rather than 16-bit. That patch also worked perfectly. With 32-bit numbers, I calculate it would take roughly 134 years for the sequence number to wrap with normal "hello" messages. With Patrick's test program, I calculate it would wrap in approximately 455 days.

However, the 32-bit patch has a nasty side effect: since the cman message structure is changed to accommodate the 32-bit numbers, any nodes in the cluster that aren't running the 32-bit version will hang up. That means the fix would have to be applied throughout the entire cluster and then the whole cluster restarted. If any node out there has the old 16-bit code, it will get very confused and have problems coming up. I tried this and ended up having to boot the node with the old version into single-user mode, doing ifup eth0, service sshd start, etc., then copying on and applying the patch. Once the 32-bit patch is on all nodes, there is no problem. I don't necessarily want to subject customers to that restriction or those potential headaches, but it's one solution.

I had a philosophical discussion about this with Jon Brassow, and he recommended we try a solution where we keep the sequence number as 16 bits (and thus don't change the internal message structure), but use the MSB to signify a wrap condition. A cman message would only be considered a duplicate (and ignored) if the sequence number is the same (or less) AND the MSB/wrap-bit is the same. This is very close to Patrick's existing patch, but there wouldn't be a boundary problem at 0x7fff - 0x8000. Of course, there is still the possibility of hitting a cman message that will be ignored, but it's not likely to occur.
The failure/wrap-around time under idle conditions would be reduced from 18 hours to 9 hours due to using 15 bits rather than 16. Messages would only be ignored if two wraps occurred in between messages (i.e. the MSB goes from 0 to 1 and then back to 0, making it look as it did 18 hours ago). So the system's messages would have to disappear and reappear 18 hours later, after the 9-hour wrap occurs twice. Even with Patrick's test program, a message would have to reappear 13 minutes later, which is not likely. If messages are received with the wrong wrap-bit, no problem. If the node goes down and comes back, the sequence number for the new incarnation is restarted, so again, no problem.

I'm running this hybrid 16-bit patch on the smoke cluster, and running a smoke test over the weekend. I'm planning to try to implement this kind of solution on Monday unless Patrick beats me to it.

Also, I'm attaching my /proc/cluster/msg_history patch.
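To make the wrap-bit idea concrete, here is a minimal sketch of the duplicate test described above. It is illustrative only, not code from cnxman.c; the names (SEQ_MASK, WRAP_BIT, is_duplicate) are invented for the example:

#include <stdio.h>

/*
 * Hypothetical sketch of the MSB/wrap-bit scheme: the low 15 bits
 * carry the sequence number and the top bit flips on every wrap.
 * A message is treated as a duplicate (and ignored) only if its
 * sequence number is the same or less than the last one seen AND
 * its wrap bit matches.
 */
#define SEQ_MASK 0x7fffU
#define WRAP_BIT 0x8000U

static int is_duplicate(unsigned short last_seen, unsigned short incoming)
{
	int same_wrap = ((last_seen ^ incoming) & WRAP_BIT) == 0;

	return same_wrap && (incoming & SEQ_MASK) <= (last_seen & SEQ_MASK);
}

int main(void)
{
	/* 0x7fff -> 0x8000: the wrap bit differs, so the message that
	 * crosses the old sign boundary is not mistaken for a duplicate. */
	printf("0x8000 after 0x7fff: %s\n",
	       is_duplicate(0x7fff, 0x8000) ? "ignored" : "accepted");

	/* Same wrap bit, older sequence number: correctly ignored. */
	printf("0x8003 after 0x8005: %s\n",
	       is_duplicate(0x8005, 0x8003) ? "ignored" : "accepted");
	return 0;
}

The appeal of this shape is that the field stays 16 bits wide, so the on-wire message structure is unchanged and patched nodes can still interoperate with unpatched ones, unlike the 32-bit variant.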
Created attachment 123185 [details]
Test version w/Patrick's 16-bit fix and Bob's /proc/cluster/msg_history
My patch is correct (see also /usr/src/linux/include/linux/jiffies.h, which is where it was lifted from). Your test program has a bug in it. The comparison line should read:

if ((short)((short)(a) - (short)b) <= 0)

You were casting the whole of the comparison to (short) rather than just the subtraction :)
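For reference, this is the test loop from the earlier comment with the cast corrected as described. With only the subtraction truncated to a signed short (the jiffies.h-style comparison), it prints "no." on every iteration, including across the 0x7FFF/0x8000 sign boundary:

#include <stdio.h>

/*
 * Same loop as the test program in the earlier comment, but with the
 * cast applied to the subtraction only, so a two-ahead sequence
 * number always compares as newer regardless of where the values sit
 * in the 16-bit range.
 */
int main(void)
{
	unsigned short a, b, i;

	for (a = 32765, b = 32763, i = 0; i < 8; i++, a++, b++) {
		printf("a=%04X b=%04X: a-b<=0? ", a, b);
		if ((short)((short)(a) - (short)b) <= 0)
			printf("yes.\n");
		else
			printf("no.\n");
	}
	return 0;
}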
Created attachment 123245 [details]
Additional patch

Looks like you might need this patch on top of the first one. If there are no ACKed messages for ages (eg lots of HELLOs but no umounts), then when a message arrives with a different sign to the last ACK, it will get ignored.
Created attachment 123292 [details]
New patch for testing

This patch supersedes the previous two. I realised that there is a small problem with them too, in that messages that get lost on the wire may not get resent if a HELLO message follows them too closely. This is an amalgamation of those two patches plus a small amount of extra code to distinguish between ACKable messages and non-ACKable ones.
Created attachment 123436 [details]
Yet another patch

Just in case you think I was wasting my time in a hospital bed: I realised why that last patch still doesn't work. This one just might :)
This latest patch seems pretty stable. I still haven't tried it in my two-NIC situation that was killing the previous patch, but I'll do that tomorrow.

FYI: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178367 describes a minor memory leak in the use of /proc/cluster/services and /proc/cluster/nodes. The patch for that memory leak is attached to that bz, and it contains (1) Patrick's patch from 19 January 2006, (2) my patch for the leak, and (3) a newer version of my /proc/cluster/smsg_history and /proc/cluster/msg_history patch. I'm only posting this here because that patch supersedes my patch/attachment here from 13 January 2006.
Fix checked in to U3:

Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.2.10.1; previous revision: 1.12.2.2
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.18.2.1; previous revision: 1.42.2.18
done

Fix checked in to RHEL4:

Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.4; previous revision: 1.12.2.3
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.21; previous revision: 1.42.2.20
done
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0236.html