Red Hat Bugzilla – Bug 253768
GFS2: deadlock on distributed mmap test case
Last modified: 2007-11-30 17:12:13 EST
Description of problem:
This is the upstream bug that I see when I try to test for bz 248480.
When I run the QA tests, the gfs2 filesystem instantly locks up. Unlike 248480,
where the nodes were livelocked by a lock ping-ponging back and forth, this is a
hard deadlock. In my tests there was one glock that everyone was waiting on, and
the process that held the glock appeared to be stuck in io_schedule().
This is the process that is holding the glock:
id_doio D f377dc88 2596 2914 2910
f377dc9c 00000086 00000002 f377dc88 f377dc80 00000000 f377d000 00000001
00000000 f3106cd0 f3106e7c c2019080 00000001 f322b200 f7fc706c c04d5cb6
f7fc706c c04d6ae8 0002d314 c04d75d8 c043b3e0 ffffffff 00000000 00000000
[<f8c5b837>] gfs2_writepages+0x0/0x38 [gfs2]
[<f8c54cd7>] inode_go_sync+0x44/0xbe [gfs2]
[<f8c53948>] gfs2_glock_xmote_th+0x2a/0x15c [gfs2]
[<f8c54589>] gfs2_glmutex_lock+0x9c/0xa3 [gfs2]
[<f8c53b49>] run_queue+0xcf/0x249 [gfs2]
[<f8c54601>] gfs2_glock_dq+0x71/0x7b [gfs2]
[<f8c54715>] gfs2_glock_dq_uninit+0x8/0x10 [gfs2]
[<f8c60ae6>] gfs2_sharewrite_fault+0x29a/0x2a6 [gfs2]
[<f8c60880>] gfs2_sharewrite_fault+0x34/0x2a6 [gfs2]
Trying to do IO directly to the block device that GFS2 is running on also hangs
on the node with the process stuck in io_schedule(). IO to the block device works
fine from the other nodes in the cluster, which are simply waiting on the glock.
Version-Release number of selected component (if applicable):
The latest code from the gfs2-2.6-nmw tree, as of 2007-08-21 12:00 CDT
Steps to Reproduce:
1. Set up a cluster on three machines with one GFS2 filesystem
2. Create the following dd_io test file:
[root@cypher-07 ~]# cat /usr/tests/sts-rhel5.1/gfs/lib/dd_io/248480.h2.m4
dnl --- Scenario Metadata ---
dnl DESC=Test for 248480
<cmd>d_iogen -b -S RANDSEED -I SESSION_ID -R RESOURCE_FILE -i RUN_TIME
-m sequential -s mmread,mmwrite,readv,writev,read,write,pread,pwrite -t MINTRANS
-T MAXTRANS -F FILESIZE:mmap1 </cmd>
3. Run the QA test. Here is what I run on my setup:
# /usr/tests/sts-rhel5.1/gfs/bin/dd_io -m /mnt/test1 -R /root/hedge-123.xml -S
248480 -l /usr/tests/sts-rhel5.1/ -r /usr/tests/sts-rhel5.1
Actual results: all the test processes lock up.
Expected results: the test runs to completion.
Created attachment 162049 [details]
Attempt to solve the bug
The stack trace paints what I think is a pretty clear picture of what's going on.
run_queue() has tried to demote the lock and push out the pages, but since it's a
writable mapping and a write has occurred, it has to write out the page, so it
tried to lock it; but since we are in a page fault, the page is already locked by
the higher layers.
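The inversion can be sketched in miniature with an ordinary mutex standing in for the page lock (the names and structure here are illustrative, not the real GFS2 code): the fault handler enters with the page already locked, so an inline glock demote that needs the same lock can never take it.

```c
#include <pthread.h>

/* Toy model of the inversion: one mutex stands in for the page lock. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

/* The demote path must write the dirty page out, which requires the page
 * lock. trylock is used so the sketch reports the deadlock instead of
 * hanging. Returns 0 on success, -1 if the page lock is already held. */
static int demote_glock_sync_page(void)
{
    if (pthread_mutex_trylock(&page_lock) != 0)
        return -1; /* would block forever: caller already owns page_lock */
    /* ... write the page, release the glock ... */
    pthread_mutex_unlock(&page_lock);
    return 0;
}

/* The fault path: the VM has already locked the page before calling into
 * the filesystem. If the glock demote then runs inline (as run_queue()
 * does here), it needs page_lock a second time. */
static int fault_path(void)
{
    pthread_mutex_lock(&page_lock);    /* taken by the VM before the fault */
    int rc = demote_glock_sync_page(); /* inline demote -> needs page lock */
    pthread_mutex_unlock(&page_lock);
    return rc; /* -1 means the real (blocking) code would deadlock */
}
```

With a real, blocking lock the second acquisition never returns, which matches the process parked in io_schedule() above.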
My solution to this is to move the run_queue() call in gfs2_glock_dq() onto a
workqueue. In fact, my eventual aim is to move _all_ run_queue() calls to the
workqueue to avoid issues just like this. We have to be a bit careful with the
delay that we choose in order not to upset the very careful balance we've
previously established to fix the original bug, but again, I think this will
work well in that case.
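The shape of the proposed fix can be sketched with a plain thread standing in for the glock workqueue (again illustrative, not the actual patch): because the demote is deferred, it takes the page lock only after the fault path has dropped it.

```c
#include <pthread.h>

static pthread_mutex_t page_lock2 = PTHREAD_MUTEX_INITIALIZER;
static int demote_done;

/* The deferred demote: by the time this runs, the fault has completed
 * and released the page lock, so the acquisition is uncontended. */
static void *glock_work(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&page_lock2);
    demote_done = 1; /* ... sync pages, drop the glock ... */
    pthread_mutex_unlock(&page_lock2);
    return NULL;
}

static int fault_path_deferred(void)
{
    pthread_t worker;

    pthread_mutex_lock(&page_lock2);  /* VM holds the page lock */
    /* Instead of running the demote inline, queue it (a thread here
     * stands in for the glock workqueue). The worker blocks on
     * page_lock2 until the fault path releases it below. */
    pthread_create(&worker, NULL, glock_work, NULL);
    pthread_mutex_unlock(&page_lock2); /* fault completes */
    pthread_join(worker, NULL);        /* deferred demote now runs */
    return demote_done;
}
```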
If I'm right about the cause, then it's something that will affect RHEL 5.1 as
well, so I think we ought to try and get it fixed now.
Created attachment 164161 [details]
Revised patch, that fixes some bugs in the previous version.
When the glock workqueue finishes its work on the glock, it drops the reference
count. However, gfs2_glock_dq() never grabbed a reference to the glock before it
scheduled the work. This caused the glock's reference count to reach zero while
the glock was still in use, which caused panics on mount. This version of the
patch grabs a reference before it queues the work in gfs2_glock_dq().
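The refcounting bug and its fix can be modelled with a toy reference count (the names are hypothetical, not the real glock structures): the queued work always drops one reference, so unless the queuer takes one first, the caller's last reference dies underneath it.

```c
#include <stdatomic.h>

/* Toy glock with a reference count. */
struct toy_glock { atomic_int refcount; };

static int freed; /* set when the last reference is dropped */

static void glock_put(struct toy_glock *gl)
{
    if (atomic_fetch_sub(&gl->refcount, 1) == 1)
        freed = 1; /* stands in for freeing the glock */
}

/* The workqueue handler always drops a reference when it finishes. */
static void glock_work_handler(struct toy_glock *gl)
{
    /* ... run_queue(gl) ... */
    glock_put(gl);
}

/* Buggy version: queue the work without taking a reference first. */
static int dq_without_ref(void)
{
    struct toy_glock gl = { 1 }; /* only the caller's reference */
    freed = 0;
    glock_work_handler(&gl);     /* work drops the caller's only ref */
    return freed;                /* 1: freed while still in use */
}

/* Fixed version: grab a reference before queueing the work. */
static int dq_with_ref(void)
{
    struct toy_glock gl = { 1 };
    int premature;

    freed = 0;
    atomic_fetch_add(&gl.refcount, 1); /* extra ref for the queued work */
    glock_work_handler(&gl);           /* work drops its own reference */
    premature = freed;                 /* still 0: glock alive here */
    glock_put(&gl);                    /* caller drops the last reference */
    return premature;
}
```

In the buggy version the refcount hits zero while the caller still holds the glock, which is what produced the panics on mount.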
The bug still exists with the patch. It looks like the same run_queue() issue,
but this one is in gfs2_glock_nq(). Here is the call trace of the process that
is holding the glock:
d_doio D f7d52800 2076 2906 2903
f52e7b14 00000082 00000000 f7d52800 00000000 f7d52800 f52e7000 ea7e195a
0000003f f5c787c0 f5c7896c c2010080 00000000 f5c6d040 06000000 c04d5d03
c23d406c c04d6b2c f52e7b48 0001ea25 00000000 c20fdc3c 0006101a c20fdc3c
[<f8c8c923>] gfs2_writepages+0x0/0x38 [gfs2]
[<f8c85dfb>] inode_go_sync+0x44/0xbe [gfs2]
[<f8c849ba>] gfs2_glock_drop_th+0x1c/0x111 [gfs2]
[<f8c84f4a>] run_queue+0xbf/0x249 [gfs2]
[<f8c8541f>] gfs2_glock_nq+0x154/0x19a [gfs2]
[<f8c865b1>] gfs2_glock_nq_atime+0x106/0x2ec [gfs2]
[<f8c8c9ab>] gfs2_prepare_write+0x50/0x23b [gfs2]
This is actually a different bug, although it looks similar. It can only happen
in the upstream code, as it's the page lock/glock bug which we fixed ages ago in
RHEL, but for which the upstream fix is in Nick Piggin's patch set. That patch
set should have been merged by Linus at the last merge window, but it's still
pending since Nick decided not to push it, due to there being lots of other VM
changes at the time.
So I think we are probably safe to push the patch in its current form to
upstream now, as well as RHEL.
I guess we can close this, or mark as a dup of the other bz?
*** This bug has been marked as a duplicate of 248480 ***