Bug 252191

Summary:

GFS2: More problems unstuffing journaled files

Product:

Red Hat Enterprise Linux 5

Reporter:

Robert Peterson <rpeterso>

Component:

kernel

Assignee:

Don Zickus <dzickus>

Status:

CLOSED ERRATA

QA Contact:

GFS Bugs <gfs-bugs>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.0

CC:

adas, bmarzins, djansa, lwang, nobody+wcheng, rkenna, rpeterso, swhiteho

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

RHBA-2007-0959

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2007-11-07 19:59:16 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

250772

Attachments:

Description	Flags
My current patch	none
The updated patch	none
Latest patch	none
Latest patch	none
Another updated patch	none

Description Robert Peterson 2007-08-14 17:16:18 UTC

+++ This bug was initially created as a clone of Bug #250772 +++

Description of problem:

We removed some kludges from the code and now the problem is back
where unstuffing a journaled inode is causing the kernel to panic.
Part of the removed kludging looked like:

        if (!bh){
                list_move(&bd->bd_ail_st_list, &ai->ai_ail2_list);
                continue;
        }

in functions gfs2_ail1_start_one and gfs2_ail1_empty_one.  It was
removed as part of the recent 248176.  As an experiment, I tried
adding this back in, but it only pushed the problem further out and
it failed later on.

I discussed this at length with Steve Whitehouse and he agrees that
this was a kludge and was rightfully removed: A null bh should not be
on the ail lists to begin with.

I tried some experiments at Steve's suggestion, all of which failed,
but it told us more about the problem.

After Steve left for the day, I did a bunch of experiments and
debugging to find out exactly where these null entries are coming
from.  The sequence of events seems to be this:

As part of the write operation, gfs2_invalidatepage gets called when
it's done.  gfs2_invalidatepage calls discard_buffer, which sets:

bh->b_private = NULL;

But that buffer header hasn't been flushed out to disk yet.  So when
log_flush gets called and the bh gets passed down the active items
lists, that's where we encounter a problem.

I checked the source code in git, and this call to discard_buffer and
its zeroing out of b_private have been in there since day one.
I tried making discard_buffer not do that, but of course, that also
failed, most likely because discard_buffer still does a bunch of
resetting of the bits in the bh.  It fails in: gfs2_ail2_empty_one
on the statement:

bh_ip = GFS2_I(bd->bd_bh->b_page->mapping->host);

I think that perhaps we shouldn't call discard_buffer at all until
the log/lops functions, (the gfs2_log_flush process), have finished
with the buffers/buffer descriptors.  Having the log code reference
these pages and bds, especially after discard_buffer calls
kmem_cache_free on the bd, is asking for corruption and kernel panics.

Perhaps Steve has some more thoughts on how to do this properly.


-- Additional comment from rpeterso on 2007-08-13 13:06 EST --
Created an attachment (id=161189)
Patch to fix the unstuff problem

Abhi, can you test this patch with your quota unstuff patch?
This is half-Steve and half-me to fix these problems.


-- Additional comment from adas on 2007-08-13 18:42 EST --
Created an attachment (id=161236)
Patch to forcibly unstuff the quota inode in gfs2_adjust_quota()

Bob, I tried this unstuffing patch of mine on top of your patch in the previous
comment and things seem to work fine. Steve, can you look through this patch
and tell me if it looks good?

-- Additional comment from rpeterso on 2007-08-13 18:56 EST --
I've been doing some more testing on the above patch.  I found another
test that fails, even with this patch.  This used to work a month or 
so ago.  It's test #4 listed here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243899#c12

It basically hits the stuck_releasepage code in ops_address.c which
tells me somewhere it forgot to do a brelse on a buffer header.
Either that or someone did an extra bhget.  It gives the following values:

GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() d6c79ca0
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 147019, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 0
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = no
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL

BTW, that block didn't look interesting, and it's not in an RG or
anything.  Just binary data in the middle (174MB) of the file,
and it's not always the same block.

I tried backing off my patch and had the same problem.
I tried some older versions of the code, but they all have various
problems doing this same test.

Since the buffer isn't pinned, the pins must match the unpins, so
that's apparently not it.  Since the i_cache's are all NULL, that
tells me that gfs2_meta_cache_flush has already run and done its
brelse's as well, so that's apparently not it either.

I did notice in gfs2_unpin that it will only do the brelse 
if (bd->bd_ail), so perhaps we have a bd_ail management issue?  
I'm still investigating.

Perhaps Steve can try this on his system, since he invented the test.


-- Additional comment from swhiteho on 2007-08-14 04:18 EST --
Abhi, your patch from comment #13 looks ok to me.


-- Additional comment from swhiteho on 2007-08-14 05:47 EST --
Bob, I tried the test with just my patch and the change to gfs2_ail2_empty_one()
and it worked ok for me. Perhaps thats a clue as to where the problem lies?

To be honest I've never really understood why it is that we wait in
gfs2_releasepage() at all. There is a comment in the VFS code which indicates
that there is a plan to make releasepage non-blocking in the future anyway. ext3
doesn't block in its release page, so if that becomes a problem, then I think we
should adopt the same solution.


-- Additional comment from rpeterso on 2007-08-14 10:38 EST --
I added a call to dump_stack() when the failure is detected and got this:

GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c01e49c0>] nfs_access_cache_shrinker+0x20/0x182
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014ded7>] kswapd+0x288/0x405
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c011dd60>] complete+0x39/0x48
 [<c014dc4f>] kswapd+0x0/0x405
 [<c0133d97>] kthread+0x38/0x5d
 [<c0133d5f>] kthread+0x0/0x5d
 [<c0105a6f>] kernel_thread_helper+0x7/0x10
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146511, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c041f1e2>] __sched_text_start+0x53a/0x5c9
 [<c012bb1e>] lock_timer_base+0x19/0x35
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014e3f8>] try_to_free_pages+0x11d/0x1f1
 [<c014a2cf>] __alloc_pages+0x18f/0x28b
 [<c014732f>] generic_file_buffered_write+0x1b3/0x5d2
 [<c041f985>] __wait_on_bit_lock+0x4b/0x52
 [<c0147bd6>] __generic_file_aio_write_nolock+0x488/0x4e7
 [<c0147c87>] generic_file_aio_write+0x52/0xb0
 [<c01622f9>] do_sync_write+0xc7/0x10a
 [<c0128852>] tasklet_action+0x46/0x90
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c012872d>] irq_exit+0x53/0x6b
 [<c011896e>] smp_apic_timer_interrupt+0x74/0x80
 [<c01058ec>] apic_timer_interrupt+0x28/0x30
 [<c0162232>] do_sync_write+0x0/0x10a
 [<c0162a72>] vfs_write+0x8a/0x10c
 [<c0162fde>] sys_write+0x41/0x67
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146535, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: stuck in gfs2_releasepage() c3d1eca0
 [<e02b9763>] gfs2_releasepage+0xef/0x3b1 [gfs2]
 [<c011dbcc>] dequeue_entity+0x75/0x92
 [<e02b9674>] gfs2_releasepage+0x0/0x3b1 [gfs2]
 [<c0147e3f>] try_to_release_page+0x30/0x42
 [<c014d6ed>] shrink_inactive_list+0x4f9/0x7c7
 [<c012bb1e>] lock_timer_base+0x19/0x35
 [<c012bb7e>] try_to_del_timer_sync+0x44/0x4a
 [<c041f902>] schedule_timeout+0x79/0x8d
 [<c012b88c>] process_timeout+0x0/0x5
 [<c041f8f4>] schedule_timeout+0x6b/0x8d
 [<c014f0cd>] congestion_wait+0x5b/0x64
 [<c0133e5d>] autoremove_wake_function+0x0/0x35
 [<c014da85>] shrink_zone+0xca/0xef
 [<c014e3f8>] try_to_free_pages+0x11d/0x1f1
 [<c014a2cf>] __alloc_pages+0x18f/0x28b
 [<c015f317>] __slab_alloc+0x177/0x4a7
 [<c015f75d>] kmem_cache_alloc+0x3c/0x7c
 [<c017c08f>] alloc_buffer_head+0x10/0x37
 [<c017c08f>] alloc_buffer_head+0x10/0x37
 [<e02b6265>] gfs2_log_fake_buf+0x56/0x10e [gfs2]
 [<e02b79ac>] databuf_lo_before_commit+0x39d/0x509 [gfs2]
 [<e02b6003>] gfs2_log_flush+0x110/0x2eb [gfs2]
 [<e02b5579>] gfs2_ail1_empty+0x2e/0x84 [gfs2]
 [<e02aa63a>] gfs2_logd+0x89/0x13b [gfs2]
 [<c011dd60>] complete+0x39/0x48
 [<e02aa5b1>] gfs2_logd+0x0/0x13b [gfs2]
 [<c0133d97>] kthread+0x38/0x5d
 [<c0133d5f>] kthread+0x0/0x5d
 [<c0105a6f>] kernel_thread_helper+0x7/0x10
 =======================
GFS2: fsid=bob_cluster2:test_gfs.0: blkno = 146543, bh->b_count = 1
GFS2: fsid=bob_cluster2:test_gfs.0: pinned = 1
GFS2: fsid=bob_cluster2:test_gfs.0: bh->b_private = !NULL
GFS2: fsid=bob_cluster2:test_gfs.0: gl = (2, 99316)
GFS2: fsid=bob_cluster2:test_gfs.0: bd_list_tr = no, bd_le.le_list = yes
GFS2: fsid=bob_cluster2:test_gfs.0: ip = 22 99316
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[0] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[1] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[2] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[3] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[4] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[5] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[6] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[7] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[8] = NULL
GFS2: fsid=bob_cluster2:test_gfs.0: ip->i_cache[9] = NULL

I'm not using NFS, so this is apparently just vfs doing its page
maintenance.

Comment 1 Robert Peterson 2007-08-14 22:49:29 UTC

Here's what I know so far:

This problem definitely recreates all the time I try to do this test
on trin-10 and it recreates regardless of whether or not I'm running with
the latest few patches.  I'm guessing Steve wasn't able to recreate it
because his machine has more memory.

For the record, to reproduce this problem I paste this into a shell:

service cman start
service clvmd start
mkfs.gfs2 -O -t bob_cluster2:test_gfs -p lock_dlm -j 3 /dev/trin_vg/hell
mount -tgfs2 /dev/trin_vg/hell /mnt/hell
cd /mnt/hell
mkdir foo
cd foo
touch foo
chattr +j foo
cat /home/devel/gfs2-2.6.git.tgz >foo
cd /home/devel/gfs2-2.6.git/
umount /mnt/hell
service clvmd stop
service cman stop
rmmod lock_dlm
rmmod gfs2

The sequence of events looks kind of like this:

1. Approximately 32600 pages are pinned by gfs2_pin().
2. Approximately 32600 pages go through gfs2_unpin()
   but there's no bd_ail list, so it doesn't do brelse() on them.
   Instead, it moves them to the jurisdiction of the glock with b_count==1.
   I did have an indication that 5 buffers had bd->bd_ail and were
   therefore brelse'd during gfs2_unpin.
3. Function log_pull_tail gets called.
4. Function gfs2_ail2_empty_one gets called, doing brelse and
   therefore setting b_count = 0.
5. It pins another 24000 pages.
6. Eventually meta_go_sync{} gets called for the exclusive glock
   but only twice, and the GLF_DIRTY flag wasn't set for the glock,
   so it didn't do anything.
7. Eventually function gfs2_ail_empty_gl gets called but only finds
   two bh's in the gl_ail_list and it brelse's them, changing b_count
   from 1 to 0.

However, when the problem occurs, gfs2_releasepage is definitely
sitting on a bh that has bh->b_count == 1 and it seems like the middle
of data, at least on disk.

It seems like these formerly pinned pages that were transferred to 
the glock should have been marked dirty so that they would be
processed and released.  It also seems like the buffers should be
released sooner than waiting for a meta_go_sync or an inode_go_sync,
but I'm not sure about that yet.

Comment 2 Steve Whitehouse 2007-08-15 10:09:10 UTC

I have an altered gfs2_releasepage() which no longer blocks. As a result I see a
bunch of gfs2_bufdata which are "stuck" after the umount. If I repeat the test,
but not using journaled data, then I don't see any "stuck" gfs2_bufdata.

So the problem appears only to relate to flushing journaled data on umount
rather than flushing during normal operation (since the amount thats left is a
lot less than the total number of blocks written), which probably means its
relatively easy to fix.

Comment 3 Steve Whitehouse 2007-08-15 12:04:01 UTC

Created attachment 161347 [details]
My current patch

This is what I'm using at the moment for testing. Does this solve the blocking
problem for you? If so do you see any of the asserts in log_shutdown() ?

Comment 4 Steve Whitehouse 2007-08-15 13:38:17 UTC

Created attachment 161356 [details]
The updated patch

Sorry, I realised that I'd attached the wrong version of this.

Comment 5 Robert Peterson 2007-08-15 14:18:10 UTC

This patch from comment #4 seems to fix the problem and I can umount
without any asserts or problems in log_shutdown.  I'll run all my
"Hell" test cases with it and see how it fares.

Comment 6 Steve Whitehouse 2007-08-15 14:29:32 UTC

Do check cat /proc/slabinfo | grep gfs2 and see if all the bufdata's drain away
after a few moments after umount. If not then we are now both seeing the same
thing now. We just need to work out what extra flush needs to occur and where.

Comment 7 Robert Peterson 2007-08-15 14:50:10 UTC

The patch passes all the "Hell" test cases.  I don't have Abhi's patch
yet for unstuffing the quota file, but that should be re-tested as well.

My kernel seems to have been built with SLUB, not SLAB, so I don't have
/proc/slabinfo.  I'll recompile my kernel with SLAB and re-do.

Comment 8 Robert Peterson 2007-08-15 15:16:01 UTC

With SLAB enabled in the kernel, I do see the numbers in /proc/slabinfo
go down to zero after everything is done.  Then I can remove the gfs2
module safely.  However, if I immediately try to do the umount (as is
the case where I paste the test into the command line) I get this error:

slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
 [<c015e03d>] kmem_cache_destroy+0x84/0xc5
 [<e02c55cf>] exit_gfs2_fs+0x28/0x41 [gfs2]
 [<c014171c>] sys_delete_module+0x1a0/0x1c8
 [<c0153433>] remove_vma+0x36/0x3b
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
BUG: unable to handle kernel paging request at virtual address e02c7876
 printing eip:
c021d581
*pde = 1f5cc067
*pte = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: dlm configfs qla2xxx
CPU:    0
EIP:    0060:[<c021d581>]    Not tainted VLI
EFLAGS: 00010297   (2.6.23-rc2 #2)
EIP is at strnlen+0x6/0x15
eax: e02c7876   ebx: e02c7876   ecx: e02c7876   edx: fffffffe
esi: d56f40cc   edi: d7b49ee4   ebp: 00000011   esp: d7b49e78
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process cat (pid: 3927, ti=d7b48000 task=d5f52030 task.ti=d7b48000)
Stack: c021cce0 c150bc64 00000000 c0148255 c014a0a7 00000001 00000044 00000f34 
       d56f40cc 0050bd08 d6e05588 d56f5000 ffffffff 00000010 c04dc174 d56f40cc 
       ddcf0380 d6da8b80 00000000 c0175baf d7b49ee4 d7b49ee4 00005c04 e02c7876 
Call Trace:
 [<c021cce0>] vsnprintf+0x2af/0x47c
 [<c0148255>] filemap_fault+0x217/0x373
 [<c014a0a7>] get_page_from_freelist+0x250/0x2dd
 [<c0175baf>] seq_printf+0x2e/0x4b
 [<c015edd3>] s_show+0x1b9/0x22f
 [<c017603e>] seq_read+0xe7/0x271
 [<c0175f57>] seq_read+0x0/0x271
 [<c0189d9e>] proc_reg_read+0x5c/0x6f
 [<c0189d42>] proc_reg_read+0x0/0x6f
 [<c0161744>] vfs_read+0x88/0x10a
 [<c0161b3f>] sys_read+0x41/0x67
 [<c0104e1e>] sysenter_past_esp+0x5f/0x85
 [<c0410000>] xdr_partial_copy_from_skb+0x128/0x171
 =======================
Code: c9 74 0c f2 ae 74 05 bf 01 00 00 00 4f 89 fa 5f 89 d0 c3 85 c9 57 89 c7 89
d0 74 05 f2 ae 75 01 4f 89 f8 5f c3 89 c1 89 c8 eb 06 <80> 38 00 74 07 40 4a 83
fa ff 75 f4 29 c8 c3 57 83 c9 ff 56 89 
EIP: [<c021d581>] strnlen+0x6/0x15 SS:ESP 0068:d7b49e78

Perhaps gfs2_log_shutdown should wait for the buffers?
Or a call from exit_gfs2_fs at the very least.

Comment 9 Steve Whitehouse 2007-08-15 15:36:50 UTC

Created attachment 161370 [details]
Latest patch

See if this fixes things for you? It works for me.

If this works, then we need to rerun all the tests including revolver & ddio
etc. on it. I want to be 100% sure that we've not broken anything else along
the way this time.

Comment 10 Steve Whitehouse 2007-08-15 16:40:18 UTC

Created attachment 161376 [details]
Latest patch

This fixes the ->bd_ail problem.

Comment 11 Robert Peterson 2007-08-15 19:53:22 UTC

This version of the patch from comment #10 works properly if
we add "bd &&" to the "if (bd->bd_ail)" clause.  In other words:

if (bd && bd->bd_ail)

With this change, the code passes all six "hell" tests.  I'm running
revolver on the RHEL5 version now and it's up to iteration 3.3.

Comment 12 Steve Whitehouse 2007-08-15 20:00:08 UTC

That sounds promising. I will have another change for this patch by tomorrow
with a bit of luck. We need an extra flush in inode_go_sync() for journaled
files in order to fix 252392 and I need to look at that very carefully to work
out exactly in which order that extra flush should occur.

Comment 13 Robert Peterson 2007-08-15 22:26:11 UTC

Revolver is still running on RHEL5 with this patch.  It's currently
on iteration 18.2 with no problems.

Comment 14 Steve Whitehouse 2007-08-16 09:29:14 UTC

Created attachment 161639 [details]
Another updated patch

This includes the fix for NULL bufdata's and also the fix which I expect will
solve bz #252392. I've tested the latter here (I can't reproduce the exact bug,
but I was able to reproduce a problem which goes away with this patch applied)
so I think this is the right fix.

I had actually spotted the problem earlier when looking into this bug but
hadn't been able to work out why the fix apparently made the problem worse
rather than better. Now I know the real answer to this bug, all has become
clear and I can see why this is required and why it appeared to have the wrong
effect before.

The problem was related to the ordering of flushing when dropping or otherwise
sync'ing a glock. In the journaled case we must flush the data blocks _after_
the journal flush (the data blocks will not be marked dirty until the journal
flush has taken place), in the writeback & ordered cases we must flush the data
blocks _before_ the journal flush (we want them stable on disk before we write
metadata chaneges to the journal).

Providing this version passes the tests, I'd hope that we can declare this
patch the final version.

Comment 15 Robert Peterson 2007-08-16 19:15:14 UTC

The RHEL5.1 patch was tested on the roth-0{1,2,3} cluster.  I ran all
six "hell" tests against it, plus Dean's scenario for bug #252392 and
all of them were successful.  This was done by loading the
kernel-2.6.18-40.el5, applying the patch, compiling, rebooting, and
running the tests.

Steve put the patch into the upstream git tree, and I posted the
RHEL5.1 patch to rhkernel-list.  So I'm changing status to POST and
transferring to Don Zickus.

Comment 16 Robert Peterson 2007-08-16 19:16:55 UTC

*** Bug 252392 has been marked as a duplicate of this bug. ***

Comment 17 Don Zickus 2007-08-21 18:36:01 UTC

in 2.6.18-42.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 20 errata-xmlrpc 2007-11-07 19:59:16 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html