Description of problem:
Hit this while running mount_stress with 10 filesystems during 4.6 regression testing. The mount tests actually finished, but one of the final "clean up" umounts ended up hanging.

[root@grant-01 ~]# ps -ef | grep mount
root     21400 21399  0 Nov13 ?        00:00:00 umount -f /mnt/LINK_1286
root     22154  4716  0 08:56 pts/0    00:00:00 grep mount

umount        D 0000090b435b5e79     0 21400  21399 (NOTLB)
00000101f1955d08 0000000000000006 ffffffff804c6b20 000000698035fc0c
00000101ed851190 000000000000469b 00000101ebe133d0 00000101ed851478
00000101fa0c41b0 00000101ed851478
Call Trace:
<ffffffff8035ffec>{wait_for_completion+312}
<ffffffff801355dd>{default_wake_function+0}
<ffffffff801355dd>{default_wake_function+0}
<ffffffffa027fc7b>{:cman:kcl_leave_service+243}
<ffffffffa029d1e0>{:dlm:release_lockspace+157}
<ffffffffa034a325>{:lock_dlm:release_gdlm+15}
<ffffffffa034ab3f>{:lock_dlm:lm_dlm_unmount+54}
<ffffffffa02bb383>{:lock_harness:lm_unmount+61}
<ffffffffa02dd7c2>{:gfs:gfs_lm_unmount+32}
<ffffffffa02ee543>{:gfs:gfs_put_super+787}
<ffffffff801965c7>{generic_shutdown_super+334}
<ffffffffa02eba74>{:gfs:gfs_kill_sb+41}
<ffffffff80196460>{deactivate_super+220}
<ffffffff801b6c09>{sys_umount+1822}
<ffffffff8019b9b7>{sys_newstat+17}
<ffffffff80111415>{error_exit+0}
<ffffffff80110a92>{system_call+126}

Version-Release number of selected component (if applicable):
This was on a UP kernel:
2.6.9-67.EL
GFS-kernel-2.6.9-75.9
I wonder if this is somehow related to bz 290971?
Created attachment 258181 [details] stack traces from grant-01
[root@grant-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    6   M   link-02
   2    1    6   M   grant-03
   3    1    6   M   grant-01
   4    1    6   M   grant-02
   5    1    6   M   link-07
   6    1    6   M   link-08

[root@grant-01 ~]# cman_tool services
Service          Name        GID LID State Code
Fence Domain:    "default"     2   2 run   -          [1 3 5 6 4 2]
DLM Lock Space:  "clvmd"       3   3 run   -          [1 3 5 6 4 2]
DLM Lock Space:  "LINK_1286" 385 150 run   S-15,200,2 [3 2]
DLM Lock Space:  "LINK_1288" 377 152 run   -          [3 2]
DLM Lock Space:  "LINK_1283" 422 154 run   -          [3 2]
DLM Lock Space:  "LINK_1284" 369 156 run   -          [3]
DLM Lock Space:  "LINK_1287" 412 158 run   -          [3 4 2]
DLM Lock Space:  "LINK_1289" 361 160 run   -          [3 2]
DLM Lock Space:  "LINK_1282" 353 162 run   -          [3 4 2]
DLM Lock Space:  "LINK_1285" 432 164 run   -          [3 2]
GFS Mount Group: "LINK_1288" 381 153 run   -          [3 2]
GFS Mount Group: "LINK_1283" 427 155 run   -          [3 2]
GFS Mount Group: "LINK_1284" 373 157 run   -          [3]
GFS Mount Group: "LINK_1287" 417 159 run   -          [3 4 2]
GFS Mount Group: "LINK_1289" 365 161 run   -          [3 2]
GFS Mount Group: "LINK_1282" 357 163 run   -          [3 4 2]
GFS Mount Group: "LINK_1285" 437 165 run   -          [3 2]
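For anyone comparing these listings by hand, here is a rough Python sketch that picks out the services whose Code column is something other than "-" in pasted cman_tool services output. The record layout in the regex is inferred from the listing above rather than from any cman_tool documentation, and treating "-" as "nothing pending" is an assumption. The names parse_services and pending_services are made up for illustration.

```python
import re

# Assumed record shape, inferred from the pasted listing (not from cman_tool docs):
#   <service type>: "<name>" <GID> <LID> <state> <code> [<member node IDs>]
# \s+ also matches newlines, so the member list may sit on its own line.
SERVICE_RE = re.compile(
    r'(Fence Domain|DLM Lock Space|GFS Mount Group):\s+"([^"]+)"\s+'
    r'(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+\[([\d ]*)\]'
)


def parse_services(text):
    """Return one dict per service record found in the pasted output."""
    services = []
    for kind, name, gid, lid, state, code, members in SERVICE_RE.findall(text):
        services.append({
            "type": kind,
            "name": name,
            "gid": int(gid),
            "lid": int(lid),
            "state": state,
            "code": code,
            "members": [int(n) for n in members.split()],
        })
    return services


def pending_services(services):
    """Keep only records whose Code column is something other than '-'."""
    return [s for s in services if s["code"] != "-"]


# Three records copied from the grant-01 listing.
SAMPLE = '''
DLM Lock Space: "clvmd" 3 3 run - [1 3 5 6 4 2]
DLM Lock Space: "LINK_1286" 385 150 run S-15,200,2 [3 2]
GFS Mount Group: "LINK_1288" 381 153 run - [3 2]
'''

if __name__ == "__main__":
    for svc in pending_services(parse_services(SAMPLE)):
        print(svc["type"], svc["name"], svc["code"], svc["members"])
```

Run against the full grant-01 output, this should single out only the LINK_1286 lock space, which is the one the hung umount is trying to leave.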
It might be related to bz 373671, I suppose. The sooner we get that one acked and included, the happier I'll be about some of these odd hangs.
grant-03 would be just as interesting to inspect; can we still get data from that node? In addition to the bug Patrick mentioned, there are a number of other bugs that we found and fixed while doing mount/unmount stress tests for Nokia.
[root@grant-03 ~]# cman_tool services
Service          Name        GID LID State Code
Fence Domain:    "default"     2   2 run   -          [1 2 3 4 5 6]
DLM Lock Space:  "clvmd"       3   3 run   -          [1 2 3 4 5 6]
DLM Lock Space:  "LINK_1289" 361  68 run   -          [2 3]
DLM Lock Space:  "LINK_1281" 402  70 run   -          [2]
DLM Lock Space:  "LINK_1282" 353  72 run   -          [2 3 4]
DLM Lock Space:  "LINK_1288" 377  74 run   -          [2 3]
DLM Lock Space:  "LINK_1283" 422  76 run   -          [2 3]
DLM Lock Space:  "LINK_1285" 432  78 run   -          [2 3]
DLM Lock Space:  "LINK_1280" 439  80 run   -          [2]
DLM Lock Space:  "LINK_1287" 412  82 run   -          [2 3 4]
DLM Lock Space:  "LINK_1286" 385  84 run   -          [2]
GFS Mount Group: "LINK_1289" 365  69 run   -          [2 3]
GFS Mount Group: "LINK_1281" 407  71 run   -          [2]
GFS Mount Group: "LINK_1282" 357  73 run   -          [2 3 4]
GFS Mount Group: "LINK_1288" 381  75 run   -          [2 3]
GFS Mount Group: "LINK_1283" 427  77 run   -          [2 3]
GFS Mount Group: "LINK_1285" 437  79 run   -          [2 3]
GFS Mount Group: "LINK_1280" 441  81 run   -          [2]
GFS Mount Group: "LINK_1287" 417  83 run   -          [2 3 4]
GFS Mount Group: "LINK_1286" 389  85 run   -          [2]
Nov 13 19:53:04 grant-03 NET: /sbin/dhclient-script : updated /etc/resolv.conf
Nov 13 19:53:04 grant-03 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Nov 13 19:53:04 grant-03 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
Nov 13 19:53:04 grant-03 dhclient: receive_packet failed on eth0: Network is down
Nov 13 19:53:06 grant-03 kernel: CMAN: sendmsg failed: -22
Nov 13 19:53:06 grant-03 kernel: SM: send_nodeid_message error -22 to 3
Nov 13 19:53:07 grant-03 kernel: CMAN: resend failed: -22
Nov 13 19:53:07 grant-03 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Nov 13 19:53:07 grant-03 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Nov 13 19:53:07 grant-03 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Looks like the net was down during that umount attempt.