Bug 231380
| Summary: | GFS2 will hang if you run iozone on one node and do a du -h on another node | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Josef Bacik <jbacik> |
| Component: | kernel | Assignee: | Don Zickus <dzickus> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.1 | CC: | kanderso, lwang, rkenna, swhiteho |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2007-0959 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-11-07 19:43:09 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | attachment 149867: patch that resolves the problem | | |
On the node running iozone, it looks like it's doing the "right thing", in that it has obviously received a callback for the lock it is using and is trying to write out the dirty data so that it can then release the lock. It appears to be stuck obtaining the page lock for the page, which is a bit strange since I can't see what else might be holding that lock. On the node running du, the situation is far from clear. That backtrace looks very odd to me, as it seems to contain a jumble of information from different syscalls. It does look as though it is trying to read xattrs, which I presume is what is causing the request for the lock from the other node, since the node running du does appear to be awaiting the release of a glock. So the problem is most likely on the node running iozone, since that's the one which appears not to be releasing the lock in question. We need to look at what else on that node might be holding the page lock.

Well, I thought that we might not be unlocking the page in gfs2_writepage (I don't know why, because I couldn't really see a situation in block_write_full_page where the page wouldn't be unlocked), but my printk didn't get tripped, so we are unlocking the page in gfs2_writepage; the problem has to be happening somewhere after that.

Running a really stupid debug patch, it looks more like we are looping, much as bonnie++ did with gfs2_prepare_write; the only reason I'm not getting soft lockups is that the box that is looping is a multiprocessor box. I'm going to look into why this might be happening. Hmm, or not; I'm going to rework the way I'm thinking about this problem.

Hmm, so I think I've figured out the problem, but I'm kind of unsure how to fix it properly. We come in through generic_file_buffered_write and get our page through __grab_cache_page, which comes back with the page locked. We then come into gfs2_commit_write, which does its thing and then calls gfs2_glock_dq_m. Since we have somebody waiting for that lock, we go on to flush out all dirty pages, and then hang on lock_page(), because the page is still locked from generic_file_buffered_write. So this is what I'm thinking: put an unlock_page()/lock_page() pair around the gfs2_glock_dq_m call to keep this problem from happening. I'm going to build a test kernel and see if that resolves the problem, but I would like to know whether that's the best way to go about it.

Created attachment 149867 [details]
patch that resolves the problem.
OK, this withstands my horrible iozone/du testing and fixes the problem nicely. I'm going to post to cluster-devel to see if anybody has a better idea on how to fix this.
Hmm, I'm thinking this may be kind of a performance killer in the case where we don't actually have anybody waiting on the glock. Instead of just blindly unlocking/locking the page, check whether we have waiters; if we do, unlock/lock the page, otherwise just do what we normally do. Of course we could race and possibly get a waiter in the middle of our dequeue operation.

I've just been looking at the OCFS2 code and I notice that they do not appear to have had this problem, as their code has no obvious solution for it. I agree that the unlock and relock of the page is nasty; on the other hand, I cannot see any reason why it shouldn't be correct. You can test for a waiter with test_bit(GLF_DEMOTE, &gl->gl_flags) (this assumes that you have the GLF_DEMOTE patch), but there will be a race in doing that, since the bit can be set at any time the gl->gl_spin lock isn't held, so I suspect that unlocking and relocking is the only option for now. Another approach would be to make the unlock operation ignore any pending waiting lock/demote requests, but there is then the question of when those requests should run, so that seems to just introduce another problem. Eventually we will need to tackle that problem too, when we come to look at preventing lock bounce, but I was hoping not to need to deal with it at this stage. Also, although you unlock the page, when the glops functions run due to the demote request we need to check that the page will be marked "not uptodate" and not just ignored, since it will have an elevated refcount due to the VFS still holding a reference at this stage. This may potentially have some bearing on the other bug, bz #231910.

How about this: we make a sort of "fast unlock" for these kinds of situations, where we know we will likely be coming back and reclaiming the lock again anyway.
In the "fast unlock" we just don't do anything for anybody who is waiting on us; instead, a kernel thread runs every (tunable) number of seconds through the list of glocks that have waiters, does the normal operations you do on a glock that has waiters, and then lets them have the glock. This way we don't have to worry about this problem: even if we get to the window between taking the lock and gfs2_commit_write, the page shouldn't be dirty, because it hasn't yet been committed, so there's no chance of a race (I think). Of course, this has the downside of making other nodes wait longer for a lock when an exclusive lock is held, but I'm thinking this is the only case where that would be a problem. Then again, if this is the only place where this is a problem, it would be kind of crappy to add a kernel thread just to make sure this section doesn't lock up.

IIRC the GLF_DEMOTE flag is cleared as soon as you do the rq_demote() stuff when you are dequeuing the glock. We could instead move that into the enqueue code: grab the lock, then, if GLF_DEMOTE is set, make sure all dirty pages associated with that inode are flushed, and then clear GLF_DEMOTE. But if the glock is taken from another node which has dirty pages for that inode, I'm not really sure how you go about making that happen. So all in all I have no idea how to fix this :)

The "fast unlock", as you call it, is something we'll need eventually anyway for a variety of reasons, but we still need to be able to release the lock from here. I'm hoping that the OCFS2 plan to have a single call in which to wrap both prepare_write and commit_write will eventually solve this for us by allowing us to get the locks in the right order, so we only really need a temporary fix right now. So far I've not been able to work out anything better than your original suggestion, provided we can be sure of all the pages being invalidated, which is something I'm looking at now.
The "OCFS2 plan" I referred to in comment #10 is progressing in the form of patches which have just been posted for review by Nick Piggin on lkml; the subject lines are:

- Re: [patch 2/3] fs: introduce perform_write aop
- [patch 1/5] fs: add an iovec iterator
- [patch 2/5] fs: introduce new aops and infrastructure
- [patch 3/5] fs: convert some simple filesystems
- [patch 4/5] ext2: convert to new aops
- [patch 5/5] ext3: convert to new aops
- Re: [patch 1/5] fs: add an iovec iterator

This won't be a possible solution in RHEL5, since it's too invasive, but upstream should eventually benefit from it.

Rob K, please add some ACKs to this one.

This request was evaluated by Red Hat Kernel Team for inclusion in a Red Hat Enterprise Linux maintenance release, and has moved to bugzilla status POST.

in 2.6.18-15.el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
I ran iozone on one node on a gfs2 filesystem and then did a du -h on another node on the same filesystem, and both iozone and du hang. Here is the sysrq output:

```
SysRq : Show Blocked State

                                               free  sibling
  task             PC    stack   pid father child younger older
pdflush       D F7D19DF0  2340   169      7   170   168 (L-TLB)
 f7d19e04 00000046 00000002 f7d19df0 f7d19dec 00000000 ea6a25f0 01000000
 00000000 f3ebd458 0000000a f7d17030 c06b1440 b27a79be 00000200 000837b6
 f7d17154 c9806e60 00000000 eee71900 c042ded4 c98f5e7c ffffffff 00000000
Call Trace:
 [<c042ded4>] getnstimeofday+0x30/0xb6
 [<c060f5dd>] io_schedule+0x3a/0x5c
 [<c0453696>] sync_page+0x0/0x3b
 [<c04536ce>] sync_page+0x38/0x3b
 [<c060f6e9>] __wait_on_bit_lock+0x2a/0x52
 [<c0453688>] __lock_page+0x58/0x5e
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<c045806c>] generic_writepages+0x11a/0x2ae
 [<f8b0fc6d>] gfs2_writepage+0x0/0x172 [gfs2]
 [<f8b0fddf>] gfs2_writepages+0x0/0x38 [gfs2]
 [<c0458220>] do_writepages+0x20/0x30
 [<c04860a2>] __writeback_single_inode+0x194/0x2c9
 [<c04864b4>] sync_sb_inodes+0x169/0x209
 [<c0486875>] writeback_inodes+0x6a/0xb1
 [<c045866f>] wb_kupdate+0x7b/0xde
 [<c0458926>] pdflush+0x0/0x19f
 [<c0458a2f>] pdflush+0x109/0x19f
 [<c04585f4>] wb_kupdate+0x0/0xde
 [<c043760c>] kthread+0xb0/0xd8
 [<c043755c>] kthread+0x0/0xd8
 [<c0405a27>] kernel_thread_helper+0x7/0x10
 =======================
iozone        D E84B1B80  1964  4663           4318       (NOTLB)
 e84b1b94 00000082 00000002 e84b1b80 e84b1b7c 00000000 de8d8408 ffffffff
 00000282 f3ebd458 00000007 ec404a90 c06b1440 9f38060f 000001f9 00e019ad
 ec404bb4 c9806e60 00000000 eced8580 c042ded4 c98f5bdc ffffffff 00000000
Call Trace:
 [<c042ded4>] getnstimeofday+0x30/0xb6
 [<c060f5dd>] io_schedule+0x3a/0x5c
 [<c0453696>] sync_page+0x0/0x3b
 [<c04536ce>] sync_page+0x38/0x3b
 [<c060f6e9>] __wait_on_bit_lock+0x2a/0x52
 [<c0453688>] __lock_page+0x58/0x5e
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<c045806c>] generic_writepages+0x11a/0x2ae
 [<f8b0fc6d>] gfs2_writepage+0x0/0x172 [gfs2]
 [<f8b0fddf>] gfs2_writepages+0x0/0x38 [gfs2]
 [<c0458220>] do_writepages+0x20/0x30
 [<c04542d7>] __filemap_fdatawrite_range+0x65/0x70
 [<c0454505>] filemap_fdatawrite+0x23/0x27
 [<f8b08bab>] inode_go_sync+0x44/0x8c [gfs2]
 [<f8b0844f>] gfs2_glock_xmote_th+0x23/0x14b [gfs2]
 [<f8b073aa>] gfs2_glmutex_lock+0x96/0x9d [gfs2]
 [<f8b076c2>] run_queue+0x21b/0x3a8 [gfs2]
 [<f8b07a10>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<f8b07a2d>] gfs2_glock_dq_m+0x13/0x1e [gfs2]
 [<f8b102c4>] gfs2_commit_write+0x276/0x2de [gfs2]
 [<f8b1004e>] gfs2_commit_write+0x0/0x2de [gfs2]
 [<c0454b67>] generic_file_buffered_write+0x3ff/0x60f
 [<c042aca8>] current_fs_time+0x4f/0x58
 [<c0455258>] __generic_file_aio_write_nolock+0x4e1/0x55a
 [<c0455326>] generic_file_aio_write+0x55/0xb3
 [<c04581b8>] generic_writepages+0x266/0x2ae
 [<c046e68d>] do_sync_write+0xc7/0x10a
 [<c04591b0>] pagevec_lookup_tag+0x24/0x2b
 [<c04376d5>] autoremove_wake_function+0x0/0x35
 [<c046e5c6>] do_sync_write+0x0/0x10a
 [<c046ee71>] vfs_write+0xa8/0x12a
 [<c046f3fe>] sys_write+0x41/0x67
 [<c0404e4c>] syscall_call+0x7/0xb
 =======================
```

and then from the other node:

```
                                               free  sibling
  task             PC    stack   pid father child younger older
du            D CA8F6C2C  2160  4281           4180       (NOTLB)
 ca8f6c40 00000082 00000002 ca8f6c2c ca8f6c28 00000000 c0f1e3c0 d0ba88a0
 ca8f6c00 cf4e9c60 00000008 c0c94030 c06b1440 85d3986e 000000bc 00000551
 c0c94154 c1204460 00000000 c0d53580 00010000 cfe09a00 ffffffff 00000000
Call Trace:
 [<d0ba88a0>] gdlm_bast+0x0/0x8c [lock_dlm]
 [<d0a9adeb>] holder_wait+0x5/0x8 [gfs2]
 [<c060f7af>] __wait_on_bit+0x33/0x58
 [<d0a9ade6>] holder_wait+0x0/0x8 [gfs2]
 [<d0a9ade6>] holder_wait+0x0/0x8 [gfs2]
 [<c060f837>] out_of_line_wait_on_bit+0x63/0x6b
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<d0a9ade2>] wait_on_holder+0x2f/0x33 [gfs2]
 [<d0a9bb46>] glock_wait_internal+0xdc/0x1f7 [gfs2]
 [<d0a9bdd3>] gfs2_glock_nq+0x172/0x1a6 [gfs2]
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<d0a9a207>] gfs2_ea_get+0x58/0x8a [gfs2]
 [<d0a9a200>] gfs2_ea_get+0x51/0x8a [gfs2]
 [<d0aa6d15>] gfs2_getxattr+0x64/0x70 [gfs2]
 [<c04c0c4a>] inode_doinit_with_dentry+0x15d/0x547
 [<d0a9b308>] gfs2_holder_uninit+0xb/0x1b [gfs2]
 [<d0a9da7b>] gfs2_lookupi+0x14e/0x166 [gfs2]
 [<c0490fff>] inotify_d_instantiate+0x41/0x67
 [<c047d16a>] d_instantiate+0x5c/0x60
 [<c047e15a>] d_splice_alias+0xd4/0xe3
 [<c04748a4>] do_lookup+0xa3/0x140
 [<c04765c4>] __link_path_walk+0x7d7/0xc2c
 [<c0489e3e>] sync_buffer+0x0/0x33
 [<c060f837>] out_of_line_wait_on_bit+0x63/0x6b
 [<c0476a5d>] link_path_walk+0x44/0xb3
 [<d0a9ba14>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<d0a9b035>] gfs2_glock_put+0x1e/0xfe [gfs2]
 [<d0aa50a6>] gfs2_readdir+0x78/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0476d55>] do_path_lookup+0x172/0x1c2
 [<c0475c47>] getname+0x59/0xad
 [<c0477514>] __user_walk_fd+0x2f/0x40
 [<c04713ee>] vfs_lstat_fd+0x16/0x3d
 [<d0a9ba14>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<d0a9b035>] gfs2_glock_put+0x1e/0xfe [gfs2]
 [<d0aa50a6>] gfs2_readdir+0x78/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c047145a>] sys_lstat64+0xf/0x23
 [<c044e5cb>] audit_syscall_entry+0x111/0x143
 [<c0407975>] do_syscall_trace+0x124/0x16b
 [<c0404e4c>] syscall_call+0x7/0xb
 =======================
```