382621 – gfs umount deadlock cman:kcl_leave_service

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-11-14 15:19 UTC by Corey Marthaler
Modified:	2010-01-12 03:16 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-14 17:38:24 UTC
Embargoed:

Attachments
stack traces from grant-01 (99.16 KB, text/plain) 2007-11-14 15:31 UTC, Corey Marthaler

Description Corey Marthaler 2007-11-14 15:19:57 UTC

Description of problem:
Hit this while running mount_stress with 10 filesystems during 4.6 regression
testing. The mount tests actually finished, but one of the final "clean up"
umounts ended up hanging. 

[root@grant-01 ~]# ps -ef | grep mount
root     21400 21399  0 Nov13 ?        00:00:00 umount -f /mnt/LINK_1286
root     22154  4716  0 08:56 pts/0    00:00:00 grep mount

umount   D 0000090b435b5e79     0 21400  21399                     (NOTLB)
           00000101f1955d08 0000000000000006 ffffffff804c6b20 000000698035fc0c
           00000101ed851190 000000000000469b 00000101ebe133d0 00000101ed851478
           00000101fa0c41b0 00000101ed851478
Call Trace:
<ffffffff8035ffec>{wait_for_completion+312}
<ffffffff801355dd>{default_wake_function+0}
<ffffffff801355dd>{default_wake_function+0}
<ffffffffa027fc7b>{:cman:kcl_leave_service+243}
<ffffffffa029d1e0>{:dlm:release_lockspace+157}
<ffffffffa034a325>{:lock_dlm:release_gdlm+15}
<ffffffffa034ab3f>{:lock_dlm:lm_dlm_unmount+54}
<ffffffffa02bb383>{:lock_harness:lm_unmount+61}
<ffffffffa02dd7c2>{:gfs:gfs_lm_unmount+32}
<ffffffffa02ee543>{:gfs:gfs_put_super+787}
<ffffffff801965c7>{generic_shutdown_super+334}
<ffffffffa02eba74>{:gfs:gfs_kill_sb+41}
<ffffffff80196460>{deactivate_super+220}
<ffffffff801b6c09>{sys_umount+1822}
<ffffffff8019b9b7>{sys_newstat+17}
<ffffffff80111415>{error_exit+0}
<ffffffff80110a92>{system_call+126}

Version-Release number of selected component (if applicable):
This was on a UP kernel:
2.6.9-67.EL
GFS-kernel-2.6.9-75.9

Comment 1 Corey Marthaler 2007-11-14 15:27:18 UTC

I wonder if this is somehow related to bz 290971?

Comment 2 Corey Marthaler 2007-11-14 15:31:10 UTC

Created attachment 258181
stack traces from grant-01

Comment 3 Corey Marthaler 2007-11-14 15:32:02 UTC

[root@grant-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    6   M   link-02
   2    1    6   M   grant-03
   3    1    6   M   grant-01
   4    1    6   M   grant-02
   5    1    6   M   link-07
   6    1    6   M   link-08
[root@grant-01 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 3 5 6 4 2]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 3 5 6 4 2]

DLM Lock Space:  "LINK_1286"                       385 150 run       S-15,200,2
[3 2]

DLM Lock Space:  "LINK_1288"                       377 152 run       -
[3 2]

DLM Lock Space:  "LINK_1283"                       422 154 run       -
[3 2]

DLM Lock Space:  "LINK_1284"                       369 156 run       -
[3]

DLM Lock Space:  "LINK_1287"                       412 158 run       -
[3 4 2]

DLM Lock Space:  "LINK_1289"                       361 160 run       -
[3 2]

DLM Lock Space:  "LINK_1282"                       353 162 run       -
[3 4 2]

DLM Lock Space:  "LINK_1285"                       432 164 run       -
[3 2]

GFS Mount Group: "LINK_1288"                       381 153 run       -
[3 2]

GFS Mount Group: "LINK_1283"                       427 155 run       -
[3 2]

GFS Mount Group: "LINK_1284"                       373 157 run       -
[3]

GFS Mount Group: "LINK_1287"                       417 159 run       -
[3 4 2]

GFS Mount Group: "LINK_1289"                       365 161 run       -
[3 2]

GFS Mount Group: "LINK_1282"                       357 163 run       -
[3 4 2]

GFS Mount Group: "LINK_1285"                       437 165 run       -
[3 2]

Comment 4 Christine Caulfield 2007-11-14 15:36:24 UTC

There's a possibility it might be related to 373671 I suppose. The sooner we get
that one acked & included the happier I'll be about some of these odd hangs.

Comment 5 David Teigland 2007-11-14 15:52:51 UTC

grant-03 would be just as interesting to inspect; can we still get data from
that node?  In addition to the bug Patrick mentioned, there are a number of
other bugs that we found and fixed while doing mount/unmount stress tests for Nokia.

Comment 6 David Teigland 2007-11-14 16:27:10 UTC

[root@grant-03 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           2   2 run       -
[1 2 3 4 5 6]

DLM Lock Space:  "clvmd"                             3   3 run       -
[1 2 3 4 5 6]

DLM Lock Space:  "LINK_1289"                       361  68 run       -
[2 3]

DLM Lock Space:  "LINK_1281"                       402  70 run       -
[2]

DLM Lock Space:  "LINK_1282"                       353  72 run       -
[2 3 4]

DLM Lock Space:  "LINK_1288"                       377  74 run       -
[2 3]

DLM Lock Space:  "LINK_1283"                       422  76 run       -
[2 3]

DLM Lock Space:  "LINK_1285"                       432  78 run       -
[2 3]

DLM Lock Space:  "LINK_1280"                       439  80 run       -
[2]

DLM Lock Space:  "LINK_1287"                       412  82 run       -
[2 3 4]

DLM Lock Space:  "LINK_1286"                       385  84 run       -
[2]

GFS Mount Group: "LINK_1289"                       365  69 run       -
[2 3]

GFS Mount Group: "LINK_1281"                       407  71 run       -
[2]

GFS Mount Group: "LINK_1282"                       357  73 run       -
[2 3 4]

GFS Mount Group: "LINK_1288"                       381  75 run       -
[2 3]

GFS Mount Group: "LINK_1283"                       427  77 run       -
[2 3]

GFS Mount Group: "LINK_1285"                       437  79 run       -
[2 3]

GFS Mount Group: "LINK_1280"                       441  81 run       -
[2]

GFS Mount Group: "LINK_1287"                       417  83 run       -
[2 3 4]

GFS Mount Group: "LINK_1286"                       389  85 run       -
[2]
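
Comparing the two `cman_tool services` listings by eye is error-prone, so here is a minimal sketch of how the comparison could be scripted. It assumes the output layout shown in this report (a quoted service name followed by GID, LID, State and Code, then a bracketed member list on the next line). The names `parse_services` and `membership_mismatches` and the two excerpt strings are hypothetical helpers for illustration, not part of cman.

```python
import re

# Service line, e.g.:
#   DLM Lock Space:  "LINK_1286"   385 150 run   S-15,200,2
HEADER = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)\s+(?P<code>\S+)'
)
# Member line, e.g.: [3 2]
MEMBERS = re.compile(r'^\[(?P<ids>[\d\s]+)\]$')


def parse_services(text):
    """Map (kind, name) -> {"code": str, "members": set of node ids}."""
    services = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            current = (m.group("kind"), m.group("name"))
            services[current] = {"code": m.group("code"), "members": set()}
            continue
        m = MEMBERS.match(line)
        if m and current is not None:
            services[current]["members"] = {int(x) for x in m.group("ids").split()}
            current = None
    return services


def membership_mismatches(a, b):
    """Services present in both listings whose member sets differ."""
    return {
        key: (a[key]["members"], b[key]["members"])
        for key in a.keys() & b.keys()
        if a[key]["members"] != b[key]["members"]
    }


# Excerpts copied from the grant-01 and grant-03 listings in this report.
GRANT_01 = '''
DLM Lock Space:  "LINK_1286"                       385 150 run       S-15,200,2
[3 2]

DLM Lock Space:  "LINK_1288"                       377 152 run       -
[3 2]
'''

GRANT_03 = '''
DLM Lock Space:  "LINK_1288"                       377  74 run       -
[2 3]

DLM Lock Space:  "LINK_1286"                       385  84 run       -
[2]
'''

if __name__ == "__main__":
    diff = membership_mismatches(parse_services(GRANT_01), parse_services(GRANT_03))
    for (kind, name), (on_01, on_03) in diff.items():
        print(kind, name, sorted(on_01), sorted(on_03))
```

On these excerpts the only service flagged is the LINK_1286 lock space: grant-01 lists members [3 2] with a non-idle code, while grant-03 lists only [2]. The sketch reports the disagreement and does not try to explain it.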

Comment 7 David Teigland 2007-11-14 16:44:54 UTC

Nov 13 19:53:04 grant-03 NET: /sbin/dhclient-script : updated /etc/resolv.conf
Nov 13 19:53:04 grant-03 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Nov 13 19:53:04 grant-03 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 4
Nov 13 19:53:04 grant-03 dhclient: receive_packet failed on eth0: Network is down
Nov 13 19:53:06 grant-03 kernel: CMAN: sendmsg failed: -22
Nov 13 19:53:06 grant-03 kernel: SM: send_nodeid_message error -22 to 3
Nov 13 19:53:07 grant-03 kernel: CMAN: resend failed: -22
Nov 13 19:53:07 grant-03 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Nov 13 19:53:07 grant-03 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Nov 13 19:53:07 grant-03 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
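
The `-22` in the CMAN and SM lines is a negated errno value, as kernel code conventionally returns; 22 is EINVAL. A small sketch of pulling those lines out of a syslog excerpt and decoding them follows. The helper name `decode_errors` and the regular expression are assumptions made for illustration and only cover the message shapes seen in this log.

```python
import errno
import re

# Matches the CMAN/SM kernel messages seen above, e.g.
#   kernel: CMAN: sendmsg failed: -22
#   kernel: SM: send_nodeid_message error -22 to 3
ERR = re.compile(
    r'kernel: (?P<src>CMAN|SM): (?P<msg>.*?)(?:failed:|error) (?P<code>-\d+)'
)


def decode_errors(log_text):
    """Return (source, operation, errno name) for each matching log line."""
    found = []
    for line in log_text.splitlines():
        m = ERR.search(line)
        if m:
            code = int(m.group("code"))
            name = errno.errorcode.get(-code, "unknown")
            found.append((m.group("src"), m.group("msg").strip(), name))
    return found


# Lines copied from the grant-03 log in this report.
LOG = '''
Nov 13 19:53:04 grant-03 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Nov 13 19:53:06 grant-03 kernel: CMAN: sendmsg failed: -22
Nov 13 19:53:06 grant-03 kernel: SM: send_nodeid_message error -22 to 3
Nov 13 19:53:07 grant-03 kernel: CMAN: resend failed: -22
Nov 13 19:53:07 grant-03 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
'''

if __name__ == "__main__":
    for src, op, name in decode_errors(LOG):
        print(src, op, name)
```

All three failures fall between the "link is not ready" and "Link is up" messages for eth0. The log does not show why the send path returned EINVAL rather than some other error, so the decoding here is only a reading aid.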

Comment 8 Corey Marthaler 2007-11-14 17:38:24 UTC

Looks like the network was down on grant-03 during that umount attempt.
