Red Hat Bugzilla – Bug 253768
GFS2: deadlock on distributed mmap test case
Last modified: 2007-11-30 17:12:13 EST
Description of problem:
This is the upstream bug that I see when I try to test for bz 248480.
When I run the QA tests, the gfs2 filesystem instantly locks up. Unlike 248480,
where the nodes were livelocked by a lock ping-ponging back and forth, this is a
hard deadlock. In my tests there was one glock that everyone was waiting on, and
the process that held the glock appeared to be stuck in io_schedule().
This is the process that is holding the glock:
id_doio D f377dc88 2596 2914 2910
f377dc9c 00000086 00000002 f377dc88 f377dc80 00000000 f377d000 00000001
00000000 f3106cd0 f3106e7c c2019080 00000001 f322b200 f7fc706c c04d5cb6
f7fc706c c04d6ae8 0002d314 c04d75d8 c043b3e0 ffffffff 00000000 00000000
[<f8c5b837>] gfs2_writepages+0x0/0x38 [gfs2]
[<f8c54cd7>] inode_go_sync+0x44/0xbe [gfs2]
[<f8c53948>] gfs2_glock_xmote_th+0x2a/0x15c [gfs2]
[<f8c54589>] gfs2_glmutex_lock+0x9c/0xa3 [gfs2]
[<f8c53b49>] run_queue+0xcf/0x249 [gfs2]
[<f8c54601>] gfs2_glock_dq+0x71/0x7b [gfs2]
[<f8c54715>] gfs2_glock_dq_uninit+0x8/0x10 [gfs2]
[<f8c60ae6>] gfs2_sharewrite_fault+0x29a/0x2a6 [gfs2]
[<f8c60880>] gfs2_sharewrite_fault+0x34/0x2a6 [gfs2]
Trying to do IO directly to the block device that GFS2 is running on also hangs
on the node with the process stuck in io_schedule(). IO to the block device works
fine from the other nodes in the cluster, which are simply waiting on the glock.
Version-Release number of selected component (if applicable):
The latest code from the gfs2-2.6-nmw tree, as of 2007-08-21 12:00 CDT
Steps to Reproduce:
1. Set up a cluster on three machines with one GFS2 filesystem
2. Create the following dd_io test file:
[root@cypher-07 ~]# cat /usr/tests/sts-rhel5.1/gfs/lib/dd_io/248480.h2.m4
dnl --- Scenario Metadata ---
dnl DESC=Test for 248480
<cmd>d_iogen -b -S RANDSEED -I SESSION_ID -R RESOURCE_FILE -i RUN_TIME
-m sequential -s mmread,mmwrite,readv,writev,read,write,pread,pwrite -t MINTRANS
-T MAXTRANS -F FILESIZE:mmap1 </cmd>
3. Run the QA test. Here is what I run on my setup:
# /usr/tests/sts-rhel5.1/gfs/bin/dd_io -m /mnt/test1 -R /root/hedge-123.xml -S
248480 -l /usr/tests/sts-rhel5.1/ -r /usr/tests/sts-rhel5.1
Actual results: all the test processes lock up.
Expected results: the test runs to completion.
Created attachment 162049 [details]
Attempt to solve the bug
The stack trace paints what I think is a pretty clear picture of what's going on.
run_queue() has tried to demote the lock and push out the pages, but since it's a
writable mapping and a write has occurred, it has to write out the page, so it
tried to lock it; but since we are in a page fault, the page is already locked by
the higher layers.
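The inversion can be sketched in miniature with an ordinary mutex standing in for the page lock (the names and structure here are illustrative, not the real GFS2 code): the fault handler enters with the page already locked, so an inline glock demote that needs the same lock can never take it.

```c
#include <pthread.h>

/* Toy model of the inversion: one mutex stands in for the page lock. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

/* The demote path must write the dirty page out, which requires the page
 * lock. trylock is used so the sketch reports the deadlock instead of
 * hanging. Returns 0 on success, -1 if the page lock is already held. */
static int demote_glock_sync_page(void)
{
    if (pthread_mutex_trylock(&page_lock) != 0)
        return -1; /* would block forever: caller already owns page_lock */
    /* ... write the page, release the glock ... */
    pthread_mutex_unlock(&page_lock);
    return 0;
}

/* The fault path: the VM has already locked the page before calling into
 * the filesystem. If the glock demote then runs inline (as run_queue()
 * does here), it needs page_lock a second time. */
static int fault_path(void)
{
    pthread_mutex_lock(&page_lock);    /* taken by the VM before the fault */
    int rc = demote_glock_sync_page(); /* inline demote -> needs page lock */
    pthread_mutex_unlock(&page_lock);
    return rc; /* -1 means the real (blocking) code would deadlock */
}
```

With a real, blocking lock the second acquisition never returns, which matches the process parked in io_schedule() above.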
My solution to this is to move the run_queue() call in gfs2_glock_dq() onto a
workqueue. In fact, my eventual aim is to move _all_ run_queue() calls to the
workqueue to avoid issues just like this. We have to be a bit careful with the
delay that we choose in order not to upset the very careful balance we've
previously established to fix the original bug, but again, I think this will
work well in that case.
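The shape of the proposed fix can be sketched with a plain thread standing in for the glock workqueue (again illustrative, not the actual patch): because the demote is deferred, it takes the page lock only after the fault path has dropped it.

```c
#include <pthread.h>

static pthread_mutex_t page_lock2 = PTHREAD_MUTEX_INITIALIZER;
static int demote_done;

/* The deferred demote: by the time this runs, the fault has completed
 * and released the page lock, so the acquisition is uncontended. */
static void *glock_work(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&page_lock2);
    demote_done = 1; /* ... sync pages, drop the glock ... */
    pthread_mutex_unlock(&page_lock2);
    return NULL;
}

static int fault_path_deferred(void)
{
    pthread_t worker;

    pthread_mutex_lock(&page_lock2);  /* VM holds the page lock */
    /* Instead of running the demote inline, queue it (a thread here
     * stands in for the glock workqueue). The worker blocks on
     * page_lock2 until the fault path releases it below. */
    pthread_create(&worker, NULL, glock_work, NULL);
    pthread_mutex_unlock(&page_lock2); /* fault completes */
    pthread_join(worker, NULL);        /* deferred demote now runs */
    return demote_done;
}
```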
If I'm right about the cause, then it's something that will affect RHEL 5.1 as
well, so I think we ought to try and get it fixed now.
Created attachment 164161 [details]
Revised patch, that fixes some bugs in the previous version.
When the glock workqueue finishes its work on the glock, it drops the reference
count. However, gfs2_glock_dq() never grabbed a reference to the glock before it
scheduled the work. This caused the glock's reference count to reach zero while
the glock was still in use, which caused panics on mount. This version of the
patch grabs a reference before it queues the work in gfs2_glock_dq().
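The refcounting bug and its fix can be modelled with a toy reference count (the names are hypothetical, not the real glock structures): the queued work always drops one reference, so unless the queuer takes one first, the caller's last reference dies underneath it.

```c
#include <stdatomic.h>

/* Toy glock with a reference count. */
struct toy_glock { atomic_int refcount; };

static int freed; /* set when the last reference is dropped */

static void glock_put(struct toy_glock *gl)
{
    if (atomic_fetch_sub(&gl->refcount, 1) == 1)
        freed = 1; /* stands in for freeing the glock */
}

/* The workqueue handler always drops a reference when it finishes. */
static void glock_work_handler(struct toy_glock *gl)
{
    /* ... run_queue(gl) ... */
    glock_put(gl);
}

/* Buggy version: queue the work without taking a reference first. */
static int dq_without_ref(void)
{
    struct toy_glock gl = { 1 }; /* only the caller's reference */
    freed = 0;
    glock_work_handler(&gl);     /* work drops the caller's only ref */
    return freed;                /* 1: freed while still in use */
}

/* Fixed version: grab a reference before queueing the work. */
static int dq_with_ref(void)
{
    struct toy_glock gl = { 1 };
    int premature;

    freed = 0;
    atomic_fetch_add(&gl.refcount, 1); /* extra ref for the queued work */
    glock_work_handler(&gl);           /* work drops its own reference */
    premature = freed;                 /* still 0: glock alive here */
    glock_put(&gl);                    /* caller drops the last reference */
    return premature;
}
```

In the buggy version the refcount hits zero while the caller still holds the glock, which is what produced the panics on mount.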
The bug still exists with the patch. It looks like the same run_queue() issue,
but this one is in gfs2_glock_nq(). Here is the call trace of the process that
is holding the glock:
d_doio D f7d52800 2076 2906 2903
f52e7b14 00000082 00000000 f7d52800 00000000 f7d52800 f52e7000 ea7e195a
0000003f f5c787c0 f5c7896c c2010080 00000000 f5c6d040 06000000 c04d5d03
c23d406c c04d6b2c f52e7b48 0001ea25 00000000 c20fdc3c 0006101a c20fdc3c
[<f8c8c923>] gfs2_writepages+0x0/0x38 [gfs2]
[<f8c85dfb>] inode_go_sync+0x44/0xbe [gfs2]
[<f8c849ba>] gfs2_glock_drop_th+0x1c/0x111 [gfs2]
[<f8c84f4a>] run_queue+0xbf/0x249 [gfs2]
[<f8c8541f>] gfs2_glock_nq+0x154/0x19a [gfs2]
[<f8c865b1>] gfs2_glock_nq_atime+0x106/0x2ec [gfs2]
[<f8c8c9ab>] gfs2_prepare_write+0x50/0x23b [gfs2]
This is actually a different bug, although it looks similar. It can only happen
in the upstream code, as it's the page lock/glock bug which we fixed ages ago in
RHEL, but for which the upstream fix is in Nick Piggin's patch set. That patch
set should have been merged by Linus at the last merge window, but it's still
pending since Nick decided not to push it, due to there being lots of other VM
changes at the time.
So I think we are probably safe to push the patch in its current form to
upstream now, as well as RHEL.
I guess we can close this, or mark as a dup of the other bz?
*** This bug has been marked as a duplicate of 248480 ***