Bug 682951

Summary: GFS2: umount stuck on gfs2_gl_hash_clear
Product: Red Hat Enterprise Linux 6
Reporter: Nate Straz <nstraz>
Component: kernel
Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: high
Version: 6.1
CC: adas, bmarzins, rpeterso, rwheeler
Target Milestone: rc
Keywords: Regression, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-125.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 803384 (view as bug list)
Environment:
Last Closed: 2011-05-23 20:43:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 635041
Attachments:
  trace logs from all buzz nodes during umount hang (flags: none)
  Proposed fix (upstream) (flags: none)
  Proposed fix (RHEL6) (flags: none)

Description Nate Straz 2011-03-08 05:29:26 UTC
Description of problem:

When trying to umount after running a brawl scenario (a combination of local and shared workloads), the umount hangs and the hang checker prints out this backtrace every two minutes:


INFO: task umount:11506 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount        D 000000000000000d     0 11506  11505 0x00000080
 ffff8805ec0e9d98 0000000000000082 0000000000000000 ffff8803221a77c0
 ffff8805ec0e9d08 ffffffff814d998d ffff8805ec0e9d68 0000000100480b16
 ffff88061ddfc6b8 ffff8805ec0e9fd8 000000000000f558 ffff88061ddfc6b8
Call Trace:
 [<ffffffff814d998d>] ? wait_for_completion+0x1d/0x20
 [<ffffffffa033cd6d>] gfs2_gl_hash_clear+0x7d/0xc0 [gfs2]
 [<ffffffff8108dce0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0359dfb>] gfs2_put_super+0x17b/0x220 [gfs2]
 [<ffffffff81173016>] generic_shutdown_super+0x56/0xe0
 [<ffffffff811730d1>] kill_block_super+0x31/0x50
 [<ffffffffa034b0e1>] gfs2_kill_sb+0x61/0x90 [gfs2]
 [<ffffffff81174180>] deactivate_super+0x70/0x90
 [<ffffffff8118f63f>] mntput_no_expire+0xbf/0x110
 [<ffffffff8118fa6b>] sys_umount+0x7b/0x3a0
 [<ffffffff810d1652>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

I didn't see any other information in the logs.

Version-Release number of selected component (if applicable):
2.6.32-119.el6bz635041b.x86_64

How reproducible:
I've hit this a few times on two different clusters.

Steps to Reproduce:
Hit during a brawl run in the umount step
  
Actual results:


Expected results:


Additional info:

Comment 2 Steve Whitehouse 2011-03-08 12:27:16 UTC
Does this only occur with the one single kernel version mentioned above, or with other kernel versions too?

Comment 3 Steve Whitehouse 2011-03-08 12:33:40 UTC
Also, a glock dump would be helpful. The glock debugfs file should still be accessible at that point in time. The fs is basically waiting for all replies to arrive from the dlm at that stage, and it will sit there until they have all arrived.

If it is reproducible, then a trace from the glock tracepoints would be very helpful in tracking down the issue.

Comment 4 Nate Straz 2011-03-08 14:12:44 UTC
The glock debugfs file on buzz-01 (where the umount is stuck) is empty.  No waiters were found on any of the other nodes.  Tracing was not enabled before the test case.  I'll make sure I enable that as I retry on other kernels.

Comment 5 Nate Straz 2011-03-08 14:20:48 UTC
Just a data point, I'm doing a git bisect between -114.el6 and -119.el6 for an unrelated issue on the dash nodes.  I was able to reproduce this umount hang there with a -117.el6 based kernel.

Comment 6 Steve Whitehouse 2011-03-08 14:23:33 UTC
If the debugfs file is empty, that is a good indication that gfs2 has sent unlock requests for all of the glocks. Unfortunately, there is no easy way to get at the counter that is used to check that all of the glocks have been freed. The waiting should end when that counter hits zero.
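
Roughly, umount is waiting in something like this (a sketch only, not the exact RHEL6 source; the helper name here is made up, while sd_glock_disposal and sd_glock_wait are the counter and wait queue in the GFS2 superblock):

/* Sketch: by this point every cached glock has been queued for disposal,
 * and each queued glock has bumped sd_glock_disposal. The final free of
 * each glock decrements the counter and wakes sd_glock_wait, so umount
 * stays blocked here until the counter drops back to zero. */
static void wait_for_glock_disposal(struct gfs2_sbd *sdp)
{
	wait_event(sdp->sd_glock_wait,
		   atomic_read(&sdp->sd_glock_disposal) == 0);
}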

Comment 7 Steve Whitehouse 2011-03-08 14:28:41 UTC
Actually, I have a thought... from glock.c:

void gfs2_glock_free(struct rcu_head *rcu)
{
	struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
	struct gfs2_sbd *sdp = gl->gl_sbd;

	/* Return the glock's memory to the appropriate slab cache */
	if (gl->gl_ops->go_flags & GLOF_ASPACE)
		kmem_cache_free(gfs2_glock_aspace_cachep, gl);
	else
		kmem_cache_free(gfs2_glock_cachep, gl);

	/* Last outstanding glock freed: wake the waiter in gfs2_gl_hash_clear() */
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
		wake_up(&sdp->sd_glock_wait);
}

I wonder whether the problem is that we need to ensure that RCU flushes out its list of glocks. It might be stuck waiting for that to happen. If so, then we can either move the code which does the wake-up to before the call_rcu(), or try to add a synchronize_rcu() into the code somewhere....
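
As a sketch of the first option (based only on the snippet above, not the actual patch attached to this bug; the gfs2_glock_dealloc name and the changed gfs2_glock_free() signature are assumptions for illustration), the slab frees stay in the RCU callback, while the counter decrement and wake-up happen as soon as the glock is handed to call_rcu(), so the umount wait no longer depends on an RCU grace period:

/* Deferred part: only freeing the glock's memory needs to wait for the
 * RCU grace period. (Callback name assumed for illustration.) */
static void gfs2_glock_dealloc(struct rcu_head *rcu)
{
	struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);

	if (gl->gl_ops->go_flags & GLOF_ASPACE)
		kmem_cache_free(gfs2_glock_aspace_cachep, gl);
	else
		kmem_cache_free(gfs2_glock_cachep, gl);
}

/* Immediate part: callers pass the glock directly; the disposal counter
 * is dropped and the umount waiter woken right away, rather than from
 * inside the RCU callback. */
void gfs2_glock_free(struct gfs2_glock *gl)
{
	struct gfs2_sbd *sdp = gl->gl_sbd;

	call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
		wake_up(&sdp->sd_glock_wait);
}

One consequence of waking the waiter before the memory is actually freed is that glocks can still be sitting in RCU's queue at module unload, so something like an rcu_barrier() would probably be needed before the glock slab caches are destroyed.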

Comment 8 Nate Straz 2011-03-08 16:02:55 UTC
Created attachment 482938 [details]
trace logs from all buzz nodes during umount hang

Comment 9 Steve Whitehouse 2011-03-09 10:22:02 UTC
Created attachment 483148 [details]
Proposed fix (upstream)

Comment 10 Steve Whitehouse 2011-03-09 10:22:32 UTC
Created attachment 483149 [details]
Proposed fix (RHEL6)

Comment 11 RHEL Program Management 2011-03-09 11:40:00 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 12 Nate Straz 2011-03-14 14:35:24 UTC
I built and tested a -119.el6 based kernel with patches from 635041 and 682951.  I ran on two clusters for about 4 days and was not able to hit either issue.  Both patches have passed QE testing.

Comment 13 Steve Whitehouse 2011-03-14 14:44:43 UTC
Thanks for the confirmation that this fixed the problem.

Comment 14 Aristeu Rozanski 2011-03-22 14:50:57 UTC
Patch(es) available on kernel-2.6.32-125.el6

Comment 17 Nate Straz 2011-03-25 22:06:26 UTC
Made it through a brawl run w/ -125.el6 kernel without hitting this on umount.

Comment 18 errata-xmlrpc 2011-05-23 20:43:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html