Bug 231380 - GFS2 will hang if you run iozone on one node and do a du -h on another node
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Don Zickus
QA Contact: Martin Jenner
Reported: 2007-03-07 17:39 EST by Josef Bacik
Modified: 2007-11-30 17:07 EST
CC: 4 users

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:43:09 EST


Attachments
patch that resolves the problem. (571 bytes, patch)
2007-03-12 17:44 EDT, Josef Bacik

Description Josef Bacik 2007-03-07 17:39:10 EST
I ran iozone on one node on a gfs2 filesystem and then did a du -h on another
node on the same filesystem, and both iozone and du hang. Here is the sysrq
output:

SysRq : Show Blocked State

                         free                        sibling
  task             PC    stack   pid father child younger older
pdflush       D F7D19DF0  2340   169      7           170   168 (L-TLB)
       f7d19e04 00000046 00000002 f7d19df0 f7d19dec 00000000 ea6a25f0 01000000 
       00000000 f3ebd458 0000000a f7d17030 c06b1440 b27a79be 00000200 000837b6 
       f7d17154 c9806e60 00000000 eee71900 c042ded4 c98f5e7c ffffffff 00000000 
Call Trace:
 [<c042ded4>] getnstimeofday+0x30/0xb6
 [<c060f5dd>] io_schedule+0x3a/0x5c
 [<c0453696>] sync_page+0x0/0x3b
 [<c04536ce>] sync_page+0x38/0x3b
 [<c060f6e9>] __wait_on_bit_lock+0x2a/0x52
 [<c0453688>] __lock_page+0x58/0x5e
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<c045806c>] generic_writepages+0x11a/0x2ae
 [<f8b0fc6d>] gfs2_writepage+0x0/0x172 [gfs2]
 [<f8b0fddf>] gfs2_writepages+0x0/0x38 [gfs2]
 [<c0458220>] do_writepages+0x20/0x30
 [<c04860a2>] __writeback_single_inode+0x194/0x2c9
 [<c04864b4>] sync_sb_inodes+0x169/0x209
 [<c0486875>] writeback_inodes+0x6a/0xb1
 [<c045866f>] wb_kupdate+0x7b/0xde
 [<c0458926>] pdflush+0x0/0x19f
 [<c0458a2f>] pdflush+0x109/0x19f
 [<c04585f4>] wb_kupdate+0x0/0xde
 [<c043760c>] kthread+0xb0/0xd8
 [<c043755c>] kthread+0x0/0xd8
 [<c0405a27>] kernel_thread_helper+0x7/0x10
 =======================
iozone        D E84B1B80  1964  4663   4318                     (NOTLB)
       e84b1b94 00000082 00000002 e84b1b80 e84b1b7c 00000000 de8d8408 ffffffff 
       00000282 f3ebd458 00000007 ec404a90 c06b1440 9f38060f 000001f9 00e019ad 
       ec404bb4 c9806e60 00000000 eced8580 c042ded4 c98f5bdc ffffffff 00000000 
Call Trace:
 [<c042ded4>] getnstimeofday+0x30/0xb6
 [<c060f5dd>] io_schedule+0x3a/0x5c
 [<c0453696>] sync_page+0x0/0x3b
 [<c04536ce>] sync_page+0x38/0x3b
 [<c060f6e9>] __wait_on_bit_lock+0x2a/0x52
 [<c0453688>] __lock_page+0x58/0x5e
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<c045806c>] generic_writepages+0x11a/0x2ae
 [<f8b0fc6d>] gfs2_writepage+0x0/0x172 [gfs2]
 [<f8b0fddf>] gfs2_writepages+0x0/0x38 [gfs2]
 [<c0458220>] do_writepages+0x20/0x30
 [<c04542d7>] __filemap_fdatawrite_range+0x65/0x70
 [<c0454505>] filemap_fdatawrite+0x23/0x27
 [<f8b08bab>] inode_go_sync+0x44/0x8c [gfs2]
 [<f8b0844f>] gfs2_glock_xmote_th+0x23/0x14b [gfs2]
 [<f8b073aa>] gfs2_glmutex_lock+0x96/0x9d [gfs2]
 [<f8b076c2>] run_queue+0x21b/0x3a8 [gfs2]
 [<f8b07a10>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<f8b07a2d>] gfs2_glock_dq_m+0x13/0x1e [gfs2]
 [<f8b102c4>] gfs2_commit_write+0x276/0x2de [gfs2]
 [<f8b1004e>] gfs2_commit_write+0x0/0x2de [gfs2]
 [<c0454b67>] generic_file_buffered_write+0x3ff/0x60f
 [<c042aca8>] current_fs_time+0x4f/0x58
 [<c0455258>] __generic_file_aio_write_nolock+0x4e1/0x55a
 [<c0455326>] generic_file_aio_write+0x55/0xb3
 [<c04581b8>] generic_writepages+0x266/0x2ae
 [<c046e68d>] do_sync_write+0xc7/0x10a
 [<c04591b0>] pagevec_lookup_tag+0x24/0x2b
 [<c04376d5>] autoremove_wake_function+0x0/0x35
 [<c046e5c6>] do_sync_write+0x0/0x10a
 [<c046ee71>] vfs_write+0xa8/0x12a
 [<c046f3fe>] sys_write+0x41/0x67
 [<c0404e4c>] syscall_call+0x7/0xb
 =======================


and then from the other node

                         free                        sibling
  task             PC    stack   pid father child younger older
du            D CA8F6C2C  2160  4281   4180                     (NOTLB)
       ca8f6c40 00000082 00000002 ca8f6c2c ca8f6c28 00000000 c0f1e3c0 d0ba88a0 
       ca8f6c00 cf4e9c60 00000008 c0c94030 c06b1440 85d3986e 000000bc 00000551 
       c0c94154 c1204460 00000000 c0d53580 00010000 cfe09a00 ffffffff 00000000 
Call Trace:
 [<d0ba88a0>] gdlm_bast+0x0/0x8c [lock_dlm]
 [<d0a9adeb>] holder_wait+0x5/0x8 [gfs2]
 [<c060f7af>] __wait_on_bit+0x33/0x58
 [<d0a9ade6>] holder_wait+0x0/0x8 [gfs2]
 [<d0a9ade6>] holder_wait+0x0/0x8 [gfs2]
 [<c060f837>] out_of_line_wait_on_bit+0x63/0x6b
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<d0a9ade2>] wait_on_holder+0x2f/0x33 [gfs2]
 [<d0a9bb46>] glock_wait_internal+0xdc/0x1f7 [gfs2]
 [<d0a9bdd3>] gfs2_glock_nq+0x172/0x1a6 [gfs2]
 [<c043770a>] wake_bit_function+0x0/0x3c
 [<d0a9a207>] gfs2_ea_get+0x58/0x8a [gfs2]
 [<d0a9a200>] gfs2_ea_get+0x51/0x8a [gfs2]
 [<d0aa6d15>] gfs2_getxattr+0x64/0x70 [gfs2]
 [<c04c0c4a>] inode_doinit_with_dentry+0x15d/0x547
 [<d0a9b308>] gfs2_holder_uninit+0xb/0x1b [gfs2]
 [<d0a9da7b>] gfs2_lookupi+0x14e/0x166 [gfs2]
 [<c0490fff>] inotify_d_instantiate+0x41/0x67
 [<c047d16a>] d_instantiate+0x5c/0x60
 [<c047e15a>] d_splice_alias+0xd4/0xe3
 [<c04748a4>] do_lookup+0xa3/0x140
 [<c04765c4>] __link_path_walk+0x7d7/0xc2c
 [<c0489e3e>] sync_buffer+0x0/0x33
 [<c060f837>] out_of_line_wait_on_bit+0x63/0x6b
 [<c0476a5d>] link_path_walk+0x44/0xb3
 [<d0a9ba14>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<d0a9b035>] gfs2_glock_put+0x1e/0xfe [gfs2]
 [<d0aa50a6>] gfs2_readdir+0x78/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0476d55>] do_path_lookup+0x172/0x1c2
 [<c0475c47>] getname+0x59/0xad
 [<c0477514>] __user_walk_fd+0x2f/0x40
 [<c04713ee>] vfs_lstat_fd+0x16/0x3d
 [<d0a9ba14>] gfs2_glock_dq+0x6f/0x79 [gfs2]
 [<d0a9b035>] gfs2_glock_put+0x1e/0xfe [gfs2]
 [<d0aa50a6>] gfs2_readdir+0x78/0x90 [gfs2]
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c0478ae4>] filldir64+0x0/0xc5
 [<c047145a>] sys_lstat64+0xf/0x23
 [<c044e5cb>] audit_syscall_entry+0x111/0x143
 [<c0407975>] do_syscall_trace+0x124/0x16b
 [<c0404e4c>] syscall_call+0x7/0xb
 =======================
Comment 1 Steve Whitehouse 2007-03-08 05:01:23 EST
On the node running iozone, it looks like it's doing the "right thing": it has
obviously received a callback for the lock it's using and is trying to write
out the dirty data so that it can then release the lock. It appears to be
stuck obtaining the page lock for the page, which is a bit strange since I
can't see what else might be holding that lock.

On the node running du, the situation is far from clear. That backtrace looks
very odd to me as it seems to contain a jumble of information from different
syscalls. It does look as though it's trying to read xattrs, which I presume
is what is causing the request for the lock from the other node, since the
node running du does appear to be waiting for a glock to be released.

So the problem is most likely on the node running iozone, since that's the one
which appears not to be releasing the lock in question. We need to look at
what else on that node might be holding the page lock.

Comment 2 Josef Bacik 2007-03-08 15:33:05 EST
Well, I thought we might not be unlocking the page in gfs2_writepage (though I
couldn't really see a situation in block_write_full_page where the page
wouldn't be unlocked), but my printk never tripped, so we are unlocking the
page in gfs2_writepage and the problem has to be happening somewhere after
that.
Comment 3 Josef Bacik 2007-03-09 17:02:40 EST
Running a really stupid debug patch, it looks more like we are looping, much
like bonnie++ did with gfs2_prepare_write; the only reason I'm not getting
soft lockups is that the box doing the looping is a multiprocessor box. I'm
going to look into why this might be happening.
Comment 4 Josef Bacik 2007-03-09 17:43:27 EST
Hmm, or not. I'm going to try to rework the way I'm thinking about this
problem.
Comment 5 Josef Bacik 2007-03-12 16:50:37 EDT
Hmm, so I think I've figured out the problem, but I'm not sure how to fix it
properly. We come in through generic_file_buffered_write and get our page
through __grab_cache_page, which returns with the page locked. We then enter
gfs2_commit_write, which does its thing and then calls gfs2_glock_dq_m. Since
somebody is waiting for that lock, we go on to flush out all dirty pages and
then hang on lock_page(), because the page is still locked from
generic_file_buffered_write. So this is what I'm thinking: put an
unlock_page/lock_page around the gfs2_glock_dq_m to keep this from happening
(see the sketch below). I'm going to build a test kernel and see if that
resolves the problem, but I would like to know whether that's the best way to
go about it.
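
Roughly what I have in mind (just a sketch, not the attached patch; I'm
assuming the holder is ip->i_gh and that this sits in gfs2_commit_write):

/*
 * Sketch only.  generic_file_buffered_write() calls commit_write with the
 * page already locked, so if gfs2_glock_dq_m() ends up demoting the glock
 * it will try to flush dirty pages, call lock_page() on this same page,
 * and deadlock.  Dropping the page lock around the dequeue lets the flush
 * proceed; we re-take it before returning to the caller.
 */
unlock_page(page);              /* let the demote/flush path take the page lock */
gfs2_glock_dq_m(1, &ip->i_gh);  /* may write back dirty pages if another
                                   node is waiting for this glock */
lock_page(page);                /* restore the state the generic write
                                   path expects on return */
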
Comment 6 Josef Bacik 2007-03-12 17:44:46 EDT
Created attachment 149867 [details]
patch that resolves the problem.

OK, this withstands my horrible iozone/du testing and fixes the problem
nicely. I'm going to post it to cluster-devel to see if anybody has a better
idea of how to fix this.
Comment 7 Josef Bacik 2007-03-12 18:25:33 EDT
Hmm, I'm thinking this may be something of a performance killer in the case
where nobody is actually waiting on the glock. Instead of blindly
unlocking/relocking the page, we could check whether there are waiters: if
there are, unlock/relock the page, otherwise just do what we normally do. Of
course we could race and pick up a waiter in the middle of our dequeue
operation.
Comment 8 Steve Whitehouse 2007-03-13 07:22:34 EDT
I've just been looking at the OCFS2 code and I notice that they do not appear
to have hit this problem, since their code has no obvious solution for it. I
agree that the unlock and relock of the page is nasty; on the other hand, I
cannot see any reason why it shouldn't be correct.

You can test for a waiter with test_bit(GLF_DEMOTE, &gl->gl_flags) - this
assumes that you have the GLF_DEMOTE patch - but there will be a race in doing
that since it can be set at any time that the gl->gl_spin lock isn't held, so I
suspect that unlocking and relocking is the only option for now.
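
For illustration, the check would look roughly like this (a sketch only; it
assumes the GLF_DEMOTE patch mentioned above, that gl is the inode's glock,
and that the holder is ip->i_gh as in the earlier sketch):

/*
 * Racy sketch of the "only drop the page lock when there is a waiter"
 * idea from comment #7.  GLF_DEMOTE can be set at any moment gl->gl_spin
 * is not held, so a waiter arriving between the test and the dequeue
 * would still hit the original deadlock, which is why the unconditional
 * unlock/relock looks like the safer option for now.
 */
if (test_bit(GLF_DEMOTE, &gl->gl_flags)) {
        unlock_page(page);
        gfs2_glock_dq_m(1, &ip->i_gh);
        lock_page(page);
} else {
        gfs2_glock_dq_m(1, &ip->i_gh);
}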

Another approach would be to make the unlock operation ignore any pending
waiting lock/demote requests; however, there is then the question of when
those requests should run, so that seems to just introduce another problem.
Eventually we will need to tackle that problem too when we come to look at
preventing lock bounce, but I was hoping not to have to deal with it at this
stage.

Also, although you unlock the page, when the glops functions run due to the
demote request we need to check that the page is marked "not uptodate" rather
than just ignored, since it will still have an elevated ref count at this
stage because the VFS is holding a reference. This may have some bearing on
the other bug, bz #231910.
Comment 9 Josef Bacik 2007-03-13 09:00:38 EDT
How about this: we make a sort of "fast unlock" for these kinds of situations,
where we know we will likely be coming back and reclaiming the lock again
anyway. In the "fast unlock" we simply don't do anything with anybody who is
waiting on us; instead, a kernel thread runs every tunable number of seconds
through the list of glocks that have waiters, does the normal operations you
would do on a glock with waiters, and then lets them have the glock. That way
we don't have to worry about this problem: even if we get to the area between
the lock and gfs2_commit_write, the page shouldn't be dirty because it hasn't
yet been committed, so there's no chance of a race (I think). Of course this
has the downside of making other nodes wait longer for a lock while an
exclusive lock is held, but I'm thinking this is the only case where that
would be a problem.

Then again, if this is the only place where this is a problem it would be kind
of crappy to put a kernel thread in just to make sure this section doesn't
lock up. IIRC the GLF_DEMOTE flag is cleared as soon as you do the rq_demote()
stuff when you are dequeuing the glock. We could instead move that into the
enqueue code: grab the lock, then if GLF_DEMOTE is set, make sure all dirty
pages associated with that inode are flushed and then clear GLF_DEMOTE. But if
the glock is taken from another node that has dirty pages for that inode, I'm
not really sure how you would go about making that happen.

So all in all I have no idea how to fix this :)
Comment 10 Steve Whitehouse 2007-03-13 09:40:48 EDT
The "fast unlock" as you call it is something we'll need eventually anyway for a
variety of reasons, but we still need to be able to release the lock from here.
I'm hoping that the OCFS2 plan to have a single call in which to wrap both
prepare_write and commit_write will eventually solve this for us by allowing us
to get the locks in the right order, so we only really need a temporary fix to
this right now. So far I've not been able to work out anything better than your
original suggestion, provided we can be sure of all the pages being invalidated,
which is something I'm looking at now.
Comment 11 Steve Whitehouse 2007-03-14 10:33:01 EDT
The "OCFS2 plan" I referred to in comment #10 is progressing in the form of
patches which have just been posted for review by Nick Piggin on lkml and the
subject lines are:

Re: [patch 2/3] fs: introduce perform_write aop
[patch 1/5] fs: add an iovec iterator
[patch 2/5] fs: introduce new aops and infrastructure
[patch 3/5] fs: convert some simple filesystems
[patch 4/5] ext2: convert to new aops
[patch 5/5] ext3: convert to new aops
Re: [patch 1/5] fs: add an iovec iterator

This won't be a possible solution in RHEL5 since it's too invasive, but
upstream should eventually benefit from it.
Comment 12 Steve Whitehouse 2007-04-05 06:13:59 EDT
Rob K, please add some ACKs to this one. 
Comment 13 RHEL Product and Program Management 2007-04-05 06:21:54 EDT
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.
Comment 14 Don Zickus 2007-04-17 16:02:33 EDT
in 2.6.18-15.el5
Comment 18 errata-xmlrpc 2007-11-07 14:43:09 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
