Bug 512552 - Can't write to XFS mount during raid5 resync
Summary: Can't write to XFS mount during raid5 resync
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eric Sandeen
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 526950
 
Reported: 2009-07-19 02:00 UTC by daryl herzmann
Modified: 2018-10-27 15:21 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:42:04 UTC
Target Upstream Version:
Embargoed:


Attachments
xfs module w/ more debugging (12.23 MB, application/octet-stream), 2009-07-20 16:42 UTC, Eric Sandeen
xfs metadump for reproducer (440.27 KB, application/octet-stream), 2009-12-01 18:40 UTC, Eric Sandeen
Reproducer (910 bytes, text/plain), 2009-12-01 18:45 UTC, Eric Sandeen


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0178 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update 2010-03-29 12:18:21 UTC

Description daryl herzmann 2009-07-19 02:00:12 UTC
I'm playing with the new XFS bits in the RHEL5.4beta kernel, and I can't seem to write a file while the software raid5 is doing a resync.  I wrote 1.5 TB of data to the array just fine while it was in a degraded state.  I ran xfs_repair once and ended up getting the same error during the next attempt at writing.

Linux mred 2.6.18-157.el5 #1 SMP Mon Jul 6 18:12:07 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
xfsprogs-2.9.4-4.el5


Jul 18 20:49:38 mred kernel: Filesystem "md0": Disabling barriers, trial barrier write failed
Jul 18 20:49:38 mred kernel: XFS mounting filesystem md0


Jul 18 20:50:10 mred kernel: Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff88673e87
Jul 18 20:50:10 mred kernel: 
Jul 18 20:50:10 mred kernel: Call Trace:
Jul 18 20:50:10 mred kernel:  [<ffffffff88664be2>] :xfs:xfs_btree_check_sblock+0xaf/0xbe
Jul 18 20:50:10 mred kernel:  [<ffffffff88673e87>] :xfs:xfs_inobt_increment+0x156/0x17e
Jul 18 20:50:10 mred kernel:  [<ffffffff886738f0>] :xfs:xfs_dialloc+0x4d0/0x80c
Jul 18 20:50:10 mred kernel:  [<ffffffff88679ac5>] :xfs:xfs_ialloc+0x5f/0x57f
Jul 18 20:50:10 mred kernel:  [<ffffffff8868babb>] :xfs:xfs_dir_ialloc+0x86/0x2b7
Jul 18 20:50:10 mred kernel:  [<ffffffff88680487>] :xfs:xlog_grant_log_space+0x204/0x25c
Jul 18 20:50:10 mred kernel:  [<ffffffff8868e590>] :xfs:xfs_create+0x237/0x45c
Jul 18 20:50:10 mred kernel:  [<ffffffff88653d37>] :xfs:xfs_attr_get+0x8e/0x9f
Jul 18 20:50:10 mred kernel:  [<ffffffff88697dfc>] :xfs:xfs_vn_mknod+0x144/0x215
Jul 18 20:50:10 mred kernel:  [<ffffffff8003abe3>] vfs_create+0xe6/0x158
Jul 18 20:50:10 mred kernel:  [<ffffffff8001b54e>] open_namei+0x1a1/0x6ed
Jul 18 20:50:10 mred kernel:  [<ffffffff80027978>] do_filp_open+0x1c/0x38
Jul 18 20:50:10 mred kernel:  [<ffffffff8001a35f>] do_sys_open+0x44/0xbe
Jul 18 20:50:10 mred kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0
Jul 18 20:50:10 mred kernel:

Comment 1 Eric Sandeen 2009-07-20 15:32:12 UTC
Hmm.  In this case xfs_btree_check_sblock is checking an on-disk btree block header for things like the magic number, extent tree integrity, etc.  Unfortunately it doesn't tell us just what is wrong, but this is the test:

        sblock_ok =
                be32_to_cpu(block->bb_magic) == xfs_magics[cur->bc_btnum] &&
                be16_to_cpu(block->bb_level) == level &&
                be16_to_cpu(block->bb_numrecs) <=
                        xfs_btree_maxrecs(cur, (xfs_btree_block_t *)block) &&
                (be32_to_cpu(block->bb_leftsib) == NULLAGBLOCK ||
                 be32_to_cpu(block->bb_leftsib) < agflen) &&
                block->bb_leftsib &&
                (be32_to_cpu(block->bb_rightsib) == NULLAGBLOCK ||
                 be32_to_cpu(block->bb_rightsib) < agflen) &&
                block->bb_rightsib;

It would have been nice to use an XFS_CORRUPTION_ERROR() call here to dump out the block in hex so we can see it.
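
Roughly what I have in mind (a sketch only, not the final patch; the real
check in xfs_btree_check_sblock also goes through the error-injection
macros):

        if (unlikely(!sblock_ok)) {
                /*
                 * Dump the start of the suspect block instead of just
                 * reporting the location, so the log shows what we
                 * actually read off disk.
                 */
                XFS_CORRUPTION_ERROR("xfs_btree_check_sblock",
                                     XFS_ERRLEVEL_LOW, cur->bc_mp, block);
                return XFS_ERROR(EFSCORRUPTED);
        }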

FWIW, this thread that hit the problem was going down a mknod path ...

Were there any other interesting messages prior to this?

Did xfs_repair find anything?

Thanks,
-Eric

Comment 2 daryl herzmann 2009-07-20 16:05:53 UTC
Hi Eric,

Thanks for your help!  I should mention that this RHEL5.4beta install is a VM on VMWare ESXi 4 (Dell PE610).

> Were there any other interesting messages prior to this?

Not that I noticed in /var/log/messages

> Did xfs_repair find anything?

Unfortunately, I don't recall it telling me anything meaningful.  It did fix the problem tho each time :)

Double unfortunately, I had a timeline to get this machine going, so it is running Ext3 now without issue.  I'm gonna try again on another machine this afternoon.  I'm very anxious to get back to using XFS again! 

PS.  Maybe, because xfsprogs is not available from RHN, the one I got from CentOS was causing issues?

daryl

Comment 3 Ric Wheeler 2009-07-20 16:29:17 UTC
Daryl, could you reproduce this without VMware (on bare metal)?

Can you provide us with a detailed overview of the storage, how VMware was set up, etc.?

Thanks!

Comment 4 Eric Sandeen 2009-07-20 16:42:32 UTC
Created attachment 354368 [details]
xfs module w/ more debugging

This xfs.ko should load cleanly and will dump the first 16 bytes of the block buffer that was found to have unexpected data, if you'd like to try it.  (It's for x86_64)

Thanks,
-Eric

Comment 5 daryl herzmann 2009-07-20 18:02:28 UTC
Hi guys,

Thanks again for the rapid help!  I tried setting up a reproducer on the VMWare box and of course, it works just fine this time.  Augh.  Until I can get you guys something useful, please feel free to close or do whatever.  Sorry about the goose chase for now.

daryl

Comment 6 Eric Sandeen 2009-08-01 21:32:39 UTC
FWIW, I've sent a patch upstream to use XFS_CORRUPTION_ERROR() so that we at least dump out a bit more info when we hit this.  Won't make 5.4, though.

-Eric

Comment 7 daryl herzmann 2009-08-01 23:58:50 UTC
Thanks again Eric for your help.  I tried again to reproduce this and was not able to.  Please feel free to close this as invalid as I must have done something silly.  I'm so excited to see XFS in RHEL!

Comment 8 Eric Sandeen 2009-08-02 02:35:52 UTC
Ok, closing based on lack of further info; it's not clear what went wrong - all we know is that xfs detected on-disk corruption for some reason during resync.

If you see it again, please do re-open w/ any info you can provide.

Thanks,
-Eric

Comment 10 Mark Goodwin 2009-11-10 23:02:40 UTC
Comment on attachment 354368 [details]
xfs module w/ more debugging

marking this attachment obsolete - it pre-dates RHEL54 and shouldn't be used

Comment 19 Eric Sandeen 2009-11-25 18:09:14 UTC
Ok, I have a testcase for this, and it appears that this code in raid5.c's make_request():

                } else {
                        /* cannot get stripe for read-ahead, just give-up */
                        clear_bit(BIO_UPTODATE, &bi->bi_flags);
                        finish_wait(&conf->wait_for_overlap, &w);
                        break;
                }

is the trigger.

When I run the testcase (basically a loopback-backed raid5 stripe with fragmented free space and thousands of files in a dir) and simply try an "ls", I get:

Filesystem "md0": Disabling barriers, trial barrier write failed
XFS mounting filesystem md0
canceling readahead <--- my printk, my dump_stack() follows:

Call Trace:
 [<ffffffff886b7eeb>] :raid456:make_request+0x469/0x4e9
 [<ffffffff8009fc91>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001c0af>] generic_make_request+0x211/0x228
 [<ffffffff80023112>] mempool_alloc+0x31/0xe7
 [<ffffffff80025b3e>] find_or_create_page+0x4b/0x72
 [<ffffffff80033581>] submit_bio+0xe4/0xeb
 [<ffffffff8847e363>] :xfs:_xfs_buf_ioapply+0x1ee/0x219
 [<ffffffff8847e3c7>] :xfs:xfs_buf_iorequest+0x39/0x64
 [<ffffffff8847e466>] :xfs:xfs_buf_iostart+0x74/0x8e
 [<ffffffff8847e4cf>] :xfs:xfs_buf_read_flags+0x4f/0x80
 [<ffffffff88455bc4>] :xfs:xfs_dir2_leaf_getdents+0x40c/0x5ec
 [<ffffffff8847ed90>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff8847ed90>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff88452868>] :xfs:xfs_readdir+0xa7/0xb6
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8847f425>] :xfs:xfs_file_readdir+0xff/0x14c
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8003538b>] vfs_readdir+0x77/0xa9
 [<ffffffff80038c45>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Filesystem "md0": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffff8845082e

Call Trace:
 [<ffffffff8845072d>] :xfs:xfs_da_do_buf+0x503/0x5b1
 [<ffffffff8845082e>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff8845082e>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff88455b0c>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
 [<ffffffff88455b0c>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
 [<ffffffff8847ed90>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff8847ed90>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff88452868>] :xfs:xfs_readdir+0xa7/0xb6
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8847f425>] :xfs:xfs_file_readdir+0xff/0x14c
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8002579c>] filldir+0x0/0xb7
 [<ffffffff8003538b>] vfs_readdir+0x77/0xa9
 [<ffffffff80038c45>] sys_getdents+0x75/0xbd
 [<ffffffff8005d229>] tracesys+0x71/0xe0
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 20 Eric Sandeen 2009-11-25 20:29:24 UTC
The logic needs to be cleaned up a bit, but something like this fixes it in my testing:

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..cbc0541 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1142,6 +1165,8 @@ xfs_buf_bio_end_io(
 		if (unlikely(bp->b_error)) {
 			if (bp->b_flags & XBF_READ)
 				ClearPageUptodate(page);
+		} else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+			ClearPageUptodate(page);
 		} else if (blocksize >= PAGE_CACHE_SIZE) {
 			SetPageUptodate(page);
 		} else if (!PagePrivate(page) &&

Need to think about this a little more, whether we should test XBF_READ(_AHEAD) in conjunction with the uptodate flag ...
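
Something along these lines, for example (untested, just to illustrate the
question):

		} else if ((bp->b_flags & (XBF_READ | XBF_READ_AHEAD)) &&
			   !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
			/* only clear uptodate for reads that really failed */
			ClearPageUptodate(page);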

Mark, I'd suggested disabling readahead earlier, but that won't help, because xfs is doing readahead on dir blocks on its own ...

-Eric

Comment 21 Eric Sandeen 2009-11-25 21:30:07 UTC
Hm, although - upstream seems to do better, because on these cancelled IOs, we also have EIO set.  So maybe this is a bug outside xfs in RHEL5... still looking.

Comment 22 Eric Sandeen 2009-11-25 22:27:05 UTC
Ok.  So, maybe xfs should also be testing for this, but it looks like md raid5 is partly at fault; it should have been setting an error on the bio when it was not made uptodate on the read:

commit c2b00852fbae4f8c45c2651530ded3bd01bde814
Author: NeilBrown <neilb>
Date:   Sun Dec 10 02:20:51 2006 -0800

    [PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5
    
    Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
    to higher levels.  While this should be sufficient, it is safer to explicitly
    set the error code as well - less room for confusion.
    
    Signed-off-by: Neil Brown <neilb>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

With this patch in place on rhel5, things work.

As Neil says, xfs should probably cope too by checking uptodate itself.

But I think this patch will be sufficient for this bug...
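
The gist of that commit (paraphrasing here, not quoting the upstream diff
verbatim) is that raid5's bio completions should pass an explicit error code
instead of relying solely on BIO_UPTODATE; with the 2.6.18-era bi_end_io()
signature that looks roughly like:

	/* complete the bio with -EIO if it never became uptodate */
	bi->bi_end_io(bi, bytes,
		      test_bit(BIO_UPTODATE, &bi->bi_flags) ? 0 : -EIO);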

Comment 23 Eric Sandeen 2009-11-25 23:14:02 UTC
Oh, one last note - if it reassures the user at all, this doesn't indicate on-disk corruption; we were simply handed a chunk of memory that looked like it came from disk, when actually it was uninitialized.  XFS did the right thing by noticing, and shutting down before any real damage happened.

Now, shutting down the filesystem is certainly a problem for the user, but the filesystem and the data on it prior to that error should be intact.

-Eric

Comment 24 Mark Goodwin 2009-11-26 00:57:02 UTC
(In reply to comment #23)

great work Eric :)

> Oh, one last note - if it reassures the user at all, this doesn't indicate
> on-disk corruption; we were simply handed a chunk of memory that looked like it
> came from disk, when actually it was uninitialized.  XFS did the right thing by
> noticing, and shutting down before any real damage happened.

yep

> Now, shutting down the filesystem is certainly a problem for the user, but the
> filesystem and the data on it prior to that error should be intact.

and that's exactly what has been observed - a clean bill of health from xfs_repair every time.

I'll brew a test RHEL54 kernel for the site with Neil's patch
(c2b00852fbae4f8c45c2651530ded3bd01bde814) and ask them to run their
reproducer again.

Cheers
-- Mark

Comment 26 Simon 2009-11-26 09:45:41 UTC
Is this bug likely to be the cause of an XFS shutdown I've just encountered on (software) RAID5?:

Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c.  Caller 0xffffffff883022a8

Call Trace:
 [<ffffffff88315be5>] :xfs:xfs_btree_check_sblock+0xaf/0xbe
 [<ffffffff883022a8>] :xfs:xfs_alloc_lookup+0x133/0x34c
 [<ffffffff88315713>] :xfs:xfs_btree_init_cursor+0x31/0x1a3
 [<ffffffff882ffced>] :xfs:xfs_free_ag_extent+0x64/0x67e
 [<ffffffff80064614>] __down_read+0x12/0x92
 [<ffffffff88301961>] :xfs:xfs_free_extent+0xa9/0xc9
 [<ffffffff8830a74d>] :xfs:xfs_bmap_finish+0xee/0x15f
 [<ffffffff8832989c>] :xfs:xfs_itruncate_finish+0x185/0x2af
 [<ffffffff8834143f>] :xfs:xfs_inactive+0x1de/0x40f
 [<ffffffff80063ad5>] mutex_lock+0xd/0x1d
 [<ffffffff8834bcc1>] :xfs:xfs_fs_clear_inode+0xa4/0xeb
 [<ffffffff80022c1c>] clear_inode+0xd2/0x123
 [<ffffffff8002f4a7>] generic_delete_inode+0xde/0x143
 [<ffffffff8000d46e>] dput+0xf6/0x114
 [<ffffffff80036c2a>] sys_renameat+0x18a/0x1eb
 [<ffffffff800b66f5>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

xfs_force_shutdown(md0,0x8) called from line 4269 of file fs/xfs/xfs_bmap.c.  Return address = 0xffffffff8830a782
Filesystem "md0": Corruption of in-memory data detected.  Shutting down filesystem: md0

The system was undergoing a resync/check as per /etc/cron.weekly/99-raid-check; I've not had the problem before the 5.4 update.

Comment 27 Simon 2009-11-26 09:47:51 UTC
I should note that xfs_repair reports no errors or inconsistencies (as noted in comment #23).

Comment 28 Eric Sandeen 2009-11-26 17:54:00 UTC
Simon, yes I think it could be.

Is it repeatable?

Mark, if it's OK w/ you, want to point Simon at your test build as well?

Thanks,
-Eric

Comment 29 Mark Goodwin 2009-11-26 21:54:49 UTC
Hi Simon, as Eric said, yes it looks to be the same md raid5 bug.

The patch was originally posted:
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg06628.html and
is now upstream with commit c2b00852fbae4f8c45c2651530ded3bd01bde814

I have back-ported this patch to RHEL54 and built a test kernel:
http://people.redhat.com/~mgoodwin/IT363374/kernel-2.6.18-164.6.1.el5.IT363374v3.x86_64.rpm

can you give that kernel a test please?

Thanks
-- Mark

Comment 30 Jonathan Peatfield 2009-11-27 14:37:52 UTC
In the last couple of days we have seen similar-looking xfs errors on two boxes which happen to have xfs on top of raid-5 MDs, though neither of them was in the process of a resync at the time the errors happened.

As with the cases above, xfs_repair showed nothing unusual and re-mounting made the problem go away - for a while.

In our case the xfs filesystems were being fairly heavily used over NFS at the time of the errors, but that may not be relevant.

I'm just downloading the test kernel and will try that shortly...  btw could we have the kernel-source package too please?  (or we can't build modules against it with dkms etc)...

 -- Jon

Comment 31 Jonathan Peatfield 2009-11-27 15:02:09 UTC
Of course I meant kernel-devel, not kernel-source <doh!>

Comment 32 Eric Sandeen 2009-11-28 02:49:31 UTC
I see the cc: list growing ... apparently folks are hitting this a lot.

Apologies for that; it's unique to the behavior of MD raid5, something that got missed in the testing matrix, obviously.

-Eric

Comment 34 Mark Goodwin 2009-11-29 23:12:11 UTC
(In reply to comment #31)
> Of course I meant kernel-devel, not kernel-source <doh!>

OK, I've put up the srpm, -devel and -headers packages.
http://people.redhat.com/~mgoodwin/IT363374/

Please test.

Thanks
-- Mark

Comment 37 Eric Sandeen 2009-11-30 17:09:26 UTC
I think we really need to try to push this in for RHEL5.5 at least.

-Eric

Comment 38 RHEL Program Management 2009-11-30 17:20:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 40 Eric Sandeen 2009-12-01 18:40:19 UTC
Created attachment 375151 [details]
xfs metadump for reproducer

Comment 41 Eric Sandeen 2009-12-01 18:45:49 UTC
Created attachment 375153 [details]
Reproducer

This bash script uses the previously attached metadata image.

It creates a 600G md0 device out of 3 sparse 300G files on loopback.

It then restores the metadata image and mounts the filesystem.

Finally, it runs some "ls" commands on the filesystem and looks for kernel-space errors in dmesg.

It could probably be cleaned up a bit but should be a good starting point.

Thanks,
-Eric

Comment 44 Don Zickus 2009-12-11 19:27:56 UTC
in kernel-2.6.18-179.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 47 Jonathan Peatfield 2009-12-17 21:59:27 UTC
For what it is worth, we have been stressing 3 machines which use raid-5 and xfs with the .IT363374 patched kernel for a while and haven't seen a single problem.

With the previous kernel it would routinely fail within a couple of hours, so the patch does seem to have cured it.

I don't suppose there is any hope of the patch for this being added to one of the kernel security/bugfix releases for el5.4, is there?

Currently we can have either the test version or a security update (without the raid5/xfs patch), or of course roll our own kernels with both (which is fine in the short term but might get a bit tedious after a while).

I've lost track of when el5.5 is due so I'm not sure whether it makes sense to aim for including it in a kernel before then.

 -- Jon

Comment 50 Chris Ward 2010-02-11 10:24:57 UTC
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this 
release that addresses your request. Please test and report back results 
here by March 3rd, 2010 (2010-03-03) or sooner.

Upon successful verification of this request, post your results and update 
the Verified field in Bugzilla with the appropriate value.

If you encounter any issues while testing, please describe them and set 
this bug into NEED_INFO. If you encounter new defects or have additional 
patch(es) to request for inclusion, please clone this bug per each request
and escalate through your support representative.

Comment 51 daryl herzmann 2010-02-11 15:21:29 UTC
Greetings Red Hat,

I tried again to reproduce this with rhel5.5beta kernel and was not able to.  Thank you!

Comment 54 errata-xmlrpc 2010-03-30 07:42:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

