Description of problem:
When I run brawl on my six-node cluster, I receive the
following error during the jdata coherency test:
GFS2: fsid=bobs_west6:brawl0.4: fatal: assertion "!buffer_busy(bh)" failed
GFS2: fsid=bobs_west6:brawl0.4: function = gfs2_ail_empty_gl, file = /home/bob/gfs2.perf.debug/glops.c, line = 69
GFS2: fsid=bobs_west6:brawl0.4: about to withdraw this file system
------------[ cut here ]------------
kernel BUG at /home/bob/gfs2.perf.debug/util.c:52!
[<ffffffffa064aa67>] inode_go_sync+0xc7/0x170 [gfs2]
[<ffffffffa0649862>] do_xmote+0x102/0x240 [gfs2]
[<ffffffffa0649a92>] run_queue+0xf2/0x170 [gfs2]
[<ffffffffa0649bf0>] glock_work_func+0x70/0x150 [gfs2]
[<ffffffffa0649b80>] ? glock_work_func+0x0/0x150 [gfs2]
[<ffffffff8108dd00>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81088530>] ? worker_thread+0x0/0x290
[<ffffffff8108d910>] ? kthread+0x0/0xa0
[<ffffffff81014180>] ? child_rip+0x0/0x20
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run brawl
2. Wait until jdata coherency test
All tests should pass.
This behavior was noticed while working on the fix for bug #583737
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release. This request is not yet committed for
1. I now have proof that the bd is still attached to the bh via
b_private at the time of failure
2. I have proof that the buffer_head was marked dirty by unpin.
3. I also have irrefutable proof that gfs2_aspace_writepage did
not touch that bd/bh pair. So the buffer was not written.
4. The page flags are always consistently the same:
page flags: 2164 / 2164 = 0xc0000000000874
= 00001000 0111 0100
5. In order to better debug this, I created my own metadata version
of write_cache_pages with some instrumentation to tell what's
going wrong. Then I happened to notice that jdata has its own
version of write_cache_pages called gfs2_write_cache_jdata.
6. I also noticed that gfs2_write_cache_jdata has not kept up with
recent upstream bug fixes and even fixes in the rhel6 kernel.
I noticed it was missing several from Nick Piggin, including
this one in particular:
Since this only seems to recreate on jdata files, and since this
affects jdata's writing of pages, I thought that maybe some bug
in this function may have made it affect the metadata pages as
well, like maybe it was marking metadata pages dirty improperly
or something. I tried crosswriting some of these fixes into
gfs2_write_cache_jdata but unfortunately I was still able to hit
the problem. So now I'm going to try instrumenting my new metadata
version of write_cache_pages to see if I can figure out why the
buffer_head isn't being written.
So we can rule out incorrect page flags in that case.
We ought to pull in the recent upstream changes into the jdata writepages as you say, but I think thats again the wrong area for this bug since it only seems to occur on metadata.
I've been looking through the vfs/vm side of filemap_fdatawrite/writepages this morning, but so far there is nothing obviously wrong at all.
RE: comment #3 if the page is no uptodate but it is dirty that suggests that it might have been truncated while still dirty. Maybe we have not been pulling such pages out of the journal correctly.
Can you alter your debugging code to print out the page index and also the two inode sizes? That way we could prove that theory that its truncate related.
Regarding comment #5: At the time of failure, the dinode in
question is a jdata file, not a system dinode. The values of
ip->i_disksize == inode->i_size == something not small.
For example in one case it's 88006656. Not sure if that rules
out a truncate problem.
With instrumentation I was able to tell that write_cache_pages
is getting called with the correct wbc->range_start==0.
In at least one case I can tell that it decided the wbc was not
range_cyclic. It only called first pagevec_lookup_tag once in
this case, and it returned nr_pages==0, so it exited. Yet the
page flags show the page dirty and the correct metadata "mapping."
I'm still at a loss how we could have a dirty page that's not
found by pagevec_lookup_tag but it looks like there was likely
another metawrite_cache_pages that happened around the same time
with the same metamapping. From my current instrumentation I
can't tell how far apart they were.
My latest instrumentation shot down a theory I had, namely,
a bug in write_cache_pages where this might theoretically happen:
1. write_cache_pages calculates the number of max pages as:
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1;
which turns out to be 14 pages (proven by instrumentation).
2. After 14 pages, it exits the for loop, and returns to the
3. The first call to pagevec_lookup_tag sets index somewhere
well beyond the 14 pages, thus we get a gap between the
14 pages written and the new index. That's because
pagevec_lookup_tag updates index regardless of where
write_cache_pages decides to stop for its cond_resched.
It was a good theory, I thought, but it's not borne out by what
I'm seeing with my latest instrumentation.
I don't have a solution yet, but I'll need one very soon.
Setting flags for inclusion into 6.0, including blocker.
Created attachment 414966 [details]
I wasted all day trying to refine a complex patch that never
worked properly. In the end I gave up on that one and devised
this patch which is much simpler. This one has had two
successful runs of the scenario that kills it 99% of the time.
I just started it in a loop of 200 more runs and I'll see how
it goes overnight.
Created attachment 415010 [details]
Patch attempt #2
The previous patch was also bogus and I scrapped it as well.
So to state the problem again, in terms of the code:
__gfs2_log_flush calls lops_after_commit protected by the log_flush_lock
lops_after_commit calls buf_lo_after_commit and databuf_lo_after_commit
buf_lo_after_commit calls gfs2_unpin for all buffers on sd_log_le_buf
gfs2_unpin adds buffers to the gl_ail_list
But as soon as __gfs2_log_flush exits, someone else can do
another __gfs2_log_flush because it lets go of the
gfs2_log_flush_lock and thereby pin a bunch more to the gl_ail_list
while it's being synced.
Function inode_go_sync is called from glock go_xmote_th but
it's not holding gl_spin or any other kind of locking so the
gl_ail_list can (and does) change while the inode is being synced.
Therefore, the solution is either to either:
(a) add a lock to prevent adding to gl_ail_list while it's being synced
(b) flag the buffers on the gl_ail_list individually as being synced and not require it be entirely emptied
at the conclusion of ail_empty_gl
(c) move the gl_ail_list to someplace safe that is locked.
In gfs1 I introduced a gfs_bd_ail_tryremove, which is option (b).
The patch I did earlier today, the huge waste of time was (a).
The previous patch was a gfs2 attempt at (b).
This patch is an attempt at option (c) but the problem I
encounter with this patch is this:
GFS2: fsid=bobs_west6:brawl0.0: fatal: assertion "!atomic_read(&gl->gl_ail_count)" failed
GFS2: fsid=bobs_west6:brawl0.0: function = inode_go_inval, file = /home/bob/gfs2.perf.debu
GFS2: fsid=bobs_west6:brawl0.0: about to withdraw this file system
Perhaps Steve has some ideas on what to do about that.
I'm too tired to continue tonight. Pick it up in the morning.
Created attachment 415076 [details]
Try running this (or something along these lines, anyway) through your tests and see if it picks anything up. If not then maybe we'll need to instrument the page dirtying paths, but I think there is a fair chance this will pick up the offending process.
We really need to narrow down what is dirtying the buffer before we can figure out how to fix it.
Here's the trap info requested in comment #10:
------------[ cut here ]------------
kernel BUG at fs/buffer.c:1177!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Modules linked in: sctp(U) libcrc32c(U) gfs2(U) dlm(U) configfs(U) sunrpc(U) cpu
Pid: 2322, comm: d_doio Not tainted 2.6.32 #7 PowerEdge SC1435
RIP: 0010:[<ffffffff81192740>] [<ffffffff81192740>] mark_buffer_dirty+0xb0/0xc0
RSP: 0018:ffff88021bb7da78 EFLAGS: 00010202
RAX: ffff88021a4fcaa0 RBX: ffff88021a5373a0 RCX: ffff880118aac690
RDX: 0000000000000000 RSI: ffff88021a4fcbc0 RDI: ffffea0007544ad0
RBP: ffff88021bb7da88 R08: 0000000000000000 R09: fdc7abf4fd1c6602
R10: 0000000000000000 R11: 0000000000000001 R12: ffff880218aad9d0
R13: ffff880118aac000 R14: ffff8801182647c0 R15: 0000000000000001
FS: 00007f9cb93ee700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f9cb86fc000 CR3: 0000000217776000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process d_doio (pid: 2322, threadinfo ffff88021bb7c000, task ffff880216326100)
ffff88021bb7da88 ffff88021a5373a0 ffff88021bb7dac8 ffffffffa044c180
<0> 0000000000000000 ffff880118aac000 ffff880118aac690 ffff8801182647c0
<0> ffff880118aac668 0000000000000001 ffff88021bb7daf8 ffffffffa044c3e1
[<ffffffffa044c180>] gfs2_unpin+0x60/0x1b0 [gfs2]
[<ffffffffa044c3e1>] buf_lo_after_commit+0x61/0xb0 [gfs2]
[<ffffffffa044b251>] __gfs2_log_flush+0x301/0x690 [gfs2]
[<ffffffff8104c603>] ? mutex_spin_on_owner+0x83/0xb0
[<ffffffff8108dda0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0450b27>] gfs2_fsync+0x47/0xb0 [gfs2]
[<ffffffff814c8b1b>] ? mutex_lock+0x2b/0x50
[<ffffffffa045135e>] gfs2_file_aio_write+0x7e/0xb0 [gfs2]
[<ffffffff8110de08>] ? __alloc_pages_nodemask+0xf8/0x6d0
[<ffffffffa04512e0>] ? gfs2_file_aio_write+0x0/0xb0 [gfs2]
[<ffffffff8108dda0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff811ffdbb>] ? selinux_file_permission+0xfb/0x150
[<ffffffff811f33a6>] ? security_file_permission+0x16/0x20
[<ffffffff8113000e>] ? handle_mm_fault+0x1ee/0x2b0
Code: 48 8b 03 eb 8e 48 c7 c6 40 f7 75 81 48 8b 06 48 85 c0 75 0e 31 d2 e8 f0 fd
RIP [<ffffffff81192740>] mark_buffer_dirty+0xb0/0xc0
---[ end trace 52bec22d0e766c2d ]---
This makes more sense I think.... fsync doesn't take any glocks and here we have it being called after a (sync) write. That explains how it can interfere with the glock demotion state machine. I wonder if we have a race relating to GLF_LFLUSH since I notice that we reset this bit on the every block leaving the log which means that potentially we will skip the log flush required for demoting the glock if the log flush being run for the sync write has already unpinned at least one block belonging to the inode/glock.
There is a simple way to test this theory... we can just remove the optimisation in log.h and make gfs2_log_flush call __gfs2_log_flush unconditionally.
A further thought is that we might need to add a counter to the glock and keep track of the exact number of blocks in the journal for each glock. It would have to be an atomic_t since we want to access it, lock free, in the gfs2_log_flush inline.
Created attachment 415425 [details]
Proposed upstream fix
This is the simple fix. We can figure out how to optimise this later.
After discussing with Steve, here's the plan we came up with:
1. I'm submitting a slightly modified patch for bug #583737
since it has all the right flags and its previous patch was
NACKed and never made it to the rhel6 git repo.
2. Along with that, I'm rolling the patch from comment #14 into
that patch and will test and post it as a stop-gap measure.
3. This bug record will be used to create a final, more complex
version of the patch that Steve is currently working on.
Due to the plan we implemented from comment #15, a revised patch
for this bug is unnecessary for 6.0. We can still ship it in
6.0 if we craft the replacement fix in time, but at this point,
Steve's circumvention fix will work nicely. Better to take the
proper time with the "real" fix and not rush it, lest we introduce
regressions. Therefore, I'm clearing the 6.0 and blocker flags.
I'm going to tentatively set 6.1 but we can always change it back
if we need.
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
This request was erroneously denied for the current release of Red Hat
Enterprise Linux. The error has been fixed and this request has been
re-proposed for the current release.
This request was erroneously denied for the current release of
Red Hat Enterprise Linux. The error has been fixed and this
request has been re-proposed for the current release.