236069 – GFS2: deadlock running d_rwdirectlarge

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Don Zickus
QA Contact:	Dean Jansa
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-04-11 19:40 UTC by Nate Straz
Modified:	2007-11-30 22:07 UTC
CC List:	5 users
Fixed In Version:	RHBA-2007-0959
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 19:46:32 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch that was applied to gfs2 when we ran this test. (1.90 KB, patch) 2007-04-11 21:39 UTC, Abhijith Das
A new patch to fix the bug in prepare_write's locking. (1.39 KB, patch) 2007-04-18 15:14 UTC, Steve Whitehouse
Updated patch to fix this bug (1.32 KB, patch) 2007-05-10 09:52 UTC, Steve Whitehouse

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0959	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5 Update 1	2007-11-08 00:47:37 UTC

Description Nate Straz 2007-04-11 19:40:29 UTC

Description of problem:

While running dd_io on upstream GFS2, we hit a deadlock.  Triage showed a
lock-order inversion between a page lock and a glock.  The traces of the two
blocked tasks follow.

d_doio        D E010703A     0  8244   8242          8245       (NOTLB)
       d430fc48 00000082 00000000 e010703a 00000018 00000000 e0071118 d11a6580 
       00000007 dd1335b0 075fb69c 00004ded 00000a35 dd1336bc c13f4c60 00000000 
       d139bc80 d430fc8c 00000082 de22e6f8 00000805 00000018 00000246 d430fc74 
Call Trace:
 [<e010703a>] dlm_lock+0x104/0x110 [dlm]
 [<e0071118>] gdlm_ast+0x0/0x2 [lock_dlm]
 [<e052ae03>] holder_wait+0x5/0x8 [gfs2]
 [<c040afb0>] __wait_on_bit+0x33/0x58
 [<e052adfe>] holder_wait+0x0/0x8 [gfs2]
 [<e052adfe>] holder_wait+0x0/0x8 [gfs2]
 [<c040b038>] out_of_line_wait_on_bit+0x63/0x6b
 [<c01297da>] wake_bit_function+0x0/0x3c
 [<e052adfa>] wait_on_holder+0x2f/0x33 [gfs2]
 [<e052bc08>] glock_wait_internal+0xd5/0x1ec [gfs2]
 [<e052b93c>] run_queue+0x28f/0x335 [gfs2]
 [<e052be8e>] gfs2_glock_nq+0x16f/0x1a6 [gfs2]
 [<e052cd74>] gfs2_glock_nq_atime+0xdb/0x2cf [gfs2]
 [<e0533cc8>] gfs2_prepare_write+0x50/0x237 [gfs2]
 [<c0137335>] add_to_page_cache+0x60/0x70
 [<e0533c78>] gfs2_prepare_write+0x0/0x237 [gfs2]
 [<c01384a2>] generic_file_buffered_write+0x25b/0x60f
 [<c013816c>] generic_file_direct_write+0x5c/0x137
 [<c0138c8e>] __generic_file_aio_write_nolock+0x438/0x55a
 [<c0138e05>] generic_file_aio_write+0x55/0xb3
 [<e052c0de>] gfs2_holder_uninit+0xb/0x1b [gfs2]
 [<e05350e4>] gfs2_open+0xef/0x119 [gfs2]
 [<c0151551>] do_sync_write+0xc7/0x10a
 [<c015047e>] nameidata_to_filp+0x24/0x33
 [<c01297a5>] autoremove_wake_function+0x0/0x35
 [<c015148a>] do_sync_write+0x0/0x10a
 [<c0151cb4>] vfs_write+0x8a/0x10c
 [<c0152223>] sys_write+0x41/0x67

lock_dlm1     D 00000086     0  7049      7          7050  3154 (L-TLB)
       d48d1e4c 00000046 d4ffd5b0 00000086 cf88ffa8 c10bb840 ffffffff 40000004 
       00000009 d4ffd5b0 07625434 00004ded 00001ed8 d4ffd6bc c13f4c60 00000000 
       dd4a9580 c10bb840 df27aba4 c13f5120 df27aba4 c01fe5e5 dd4ed500 c13f4c60 
Call Trace:
 [<c01fe5e5>] generic_unplug_device+0x15/0x21
 [<c040ad29>] io_schedule+0x22/0x2c
 [<c013715e>] sync_page+0x0/0x3b
 [<c0137196>] sync_page+0x38/0x3b
 [<c040aeea>] __wait_on_bit_lock+0x2a/0x52
 [<c0137150>] __lock_page+0x58/0x5e
 [<c01297da>] wake_bit_function+0x0/0x3c
 [<c013d839>] truncate_inode_pages_range+0x203/0x258
 [<c040a173>] __sched_text_start+0x14b/0x7c4
 [<c013d8a5>] truncate_inode_pages+0x17/0x1a
 [<e052c779>] inode_go_inval+0x40/0x4b [gfs2]
 [<e052b3dc>] xmote_bh+0xdb/0x250 [gfs2]
 [<e052bac5>] gfs2_glock_cb+0xae/0x11c [gfs2]
 [<c01298fb>] remove_wait_queue+0xc/0x34
 [<e007286f>] gdlm_thread+0x5af/0x5fc [lock_dlm]
 [<c0115c28>] default_wake_function+0x0/0xc
 [<e00722c0>] gdlm_thread+0x0/0x5fc [lock_dlm]
 [<c01296dc>] kthread+0xb0/0xd8
 [<c012962c>] kthread+0x0/0xd8
 [<c0103d27>] kernel_thread_helper+0x7/0x10
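
Read together, the two traces suggest a classic lock-order inversion: d_doio reaches gfs2_prepare_write with a page locked and then blocks in gfs2_glock_nq, while lock_dlm1 is partway through a glock state change (xmote_bh, inode_go_inval) and blocks in __lock_page under truncate_inode_pages. The following is a minimal user-space sketch of that wait-for relationship, not kernel code. The `holds` and `waits` tables are an assumed simplification of the traces (in particular, lock_dlm1 is modelled as holding the glock, and both tasks are assumed to contend for the same page), and `find_deadlock` is a hypothetical helper.

```python
# Assumed model of the two blocked tasks, derived by reading the traces above.
holds = {"d_doio": "page lock", "lock_dlm1": "glock"}
waits = {"d_doio": "glock", "lock_dlm1": "page lock"}


def find_deadlock(holds, waits):
    """Return the tasks forming a cycle in the wait-for graph, or [] if none."""
    owner = {lock: task for task, lock in holds.items()}
    for start in waits:
        seen = []
        task = start
        # Follow "task waits for a lock owned by another task" edges.
        while task is not None and task not in seen:
            seen.append(task)
            task = owner.get(waits.get(task))
        if task == start:
            return seen
    return []


cycle = find_deadlock(holds, waits)
print(" -> ".join(cycle + cycle[:1]))  # d_doio -> lock_dlm1 -> d_doio
```

Each task waits for the lock the other one holds, so neither can make progress.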


Version-Release number of selected component (if applicable):
2.6.21-rc6

How reproducible:
likely

Steps to Reproduce:
1. Run dd_io -S REG

Comment 1 Abhijith Das 2007-04-11 21:39:06 UTC

Created attachment 152324
Patch that was applied to gfs2 when we ran this test.

Comment 2 RHEL Program Management 2007-04-13 18:25:20 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Steve Whitehouse 2007-04-18 15:14:52 UTC

Created attachment 152913
A new patch to fix the bug in prepare_write's locking.

N.B. This is for RHEL 5.1 only. There is a different fix for upstream, which
we are unable to use in RHEL since it's too invasive of the VFS layer. This is
a cleanup of the previous patch, taking into account all the comments in last
week's meeting. Please shout if you think I've missed something.
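
The attached patch is not reproduced in this report, so the following is only an illustration of one common way to break a page lock / glock inversion, and it should not be read as what the patch does. The idea is to attempt the second lock without blocking and, on failure, drop the first lock and retry. This is a user-space sketch using Python threading locks as stand-ins. `write_with_backoff` and `invalidate` are hypothetical names, and the retry limit and sleep interval are arbitrary.

```python
import threading
import time

page_lock = threading.Lock()  # stand-in for the page lock
glock = threading.Lock()      # stand-in for the inode glock


def write_with_backoff(max_attempts=1000):
    """Hold the page lock but only try the glock; on failure, back off and retry."""
    for attempt in range(1, max_attempts + 1):
        page_lock.acquire()
        if glock.acquire(blocking=False):
            try:
                return attempt  # both locks held: the write would happen here
            finally:
                glock.release()
                page_lock.release()
        # Could not get the glock: release the page lock so the other side
        # can take it, then try again.
        page_lock.release()
        time.sleep(0.001)
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def invalidate():
    """Opposite order, as in the lock_dlm1 trace: glock first, then page lock."""
    with glock:
        with page_lock:
            pass  # page invalidation would happen here


if __name__ == "__main__":
    t = threading.Thread(target=invalidate)
    t.start()
    attempts = write_with_backoff()
    t.join()
    print("write completed after", attempts, "attempt(s)")
```

Because the writer never blocks on the glock while holding the page lock, the invalidating side can always finish, at the cost of occasional retries on the write path.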

Comment 4 Wendy Cheng 2007-04-18 15:58:34 UTC

OK, this looks much better than last week's hack. I'll ack.

Comment 5 Steve Whitehouse 2007-05-10 09:52:59 UTC

Created attachment 154455
Updated patch to fix this bug

This is slightly updated so that it applies against current RHEL 5.1.

Since Don is going on holiday, I intend to send this today, as we only have
today and tomorrow left to get it into RHEL 5.1 before he leaves. If it turns
out that this doesn't entirely fix the problem (very unlikely, I think, since
we tested this fix at the meeting and the patch is identical modulo the
cleanup), please open another bz rather than pull this one back from POST.

Comment 6 Don Zickus 2007-05-11 22:06:26 UTC

Included in 2.6.18-19.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 9 errata-xmlrpc 2007-11-07 19:46:32 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
