Bug 1029820

Summary: XFS Corruption detected xfs_do_force_shutdown called on heavy deletes, no corruption
Product: Fedora
Reporter: Trevor Cordes <trevor>
Component: kernel
Assignee: fedora-kernel-xfs
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 19
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---
Flags: jforbes: needinfo?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-03-10 14:41:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  relevant log entries around fs crash time (Flags: none)

Description Trevor Cordes 2013-11-13 10:07:25 UTC
Description of problem:
File server with XFS fs, 11TB usable (RAID6), 92% full, 9.2 million inodes used.
While I was deleting a large directory (1.5TB, roughly 1M files, some of them quite small), the delete ran for over 14 hours.  The fs became increasingly sluggish, especially for writes (both daemons and interactive use), and eventually blew up:

Nov  2 02:36:53 piles kernel: [3582267.113743] XFS (md4): Corruption detected. Unmount and run xfs_repair
Nov  2 02:36:53 piles kernel: [3582267.329675] XFS (md4): xfs_do_force_shutdown(0x8) called from line 1365 of file fs/xfs/xfs_buf.c.  Return address = 0xf8ec8542
Nov  2 02:36:54 piles kernel: [3582268.457035] XFS (md4): Corruption of in-memory data detected.  Shutting down filesystem
Nov  2 02:36:54 piles kernel: [3582268.512041] XFS (md4): Please umount the filesystem and rectify the problem(s)
Nov  2 02:36:54 piles kernel: [3582268.585285] XFS (md4): xfs_difree: xfs_inobt_lookup() returned error 5.
Nov  2 02:36:54 piles kernel: [3582268.618031] XFS (md4): xfs_log_force: error 5 returned.
[...]
(See next attachment for full log including call trace.)


I was unable to umount the fs no matter what I tried.  lsof just blew up, and since that fs is shared via NFS, SMB, etc., it was impossible to clean up for umount; most things simply errored, froze, or died.  I was able to get a pseudo-clean reboot (so as not to destroy the RAID6) with some alt-sysrq syncs followed by a reboot, after first disabling the fs from mounting on the next boot.
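
For reference, roughly the emergency sequence I mean, driven through /proc instead of the keyboard (a sketch only; it assumes the magic sysrq interface is enabled on the system):

echo 1 > /proc/sys/kernel/sysrq      # allow all sysrq functions (assumption: not already enabled)
echo s > /proc/sysrq-trigger         # emergency sync of all mounted filesystems
echo u > /proc/sysrq-trigger         # remount all filesystems read-only
echo b > /proc/sysrq-trigger         # immediate reboot, no clean shutdown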

Ran:
xfs_repair -v -t 60 -m 1550 /dev/md4

(it would die unless -m was set to around 1550)

It ran for about 30 minutes and found zero errors!  I thought for sure there would be some on-disk corruption behind the fs blow-up.

So I mounted it and it was all good, like nothing ever happened.

I checked the results of my 16-hour rm and found it had deleted only around half the files.  So I did the rm again, but this time under heavy ionice and nice, and in smaller batches (see the sketch below).  It rm'd fine that way: quite quickly and with no blow-ups.
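
Roughly like this (a sketch only; the path and the per-subdirectory batching are illustrative, not exactly what I typed):

# Remove one subdirectory at a time at idle I/O priority and lowest CPU priority.
# /data/bigdir is a made-up path standing in for the real directory.
cd /data/bigdir
for d in */; do
    ionice -c3 nice -n 19 rm -rf -- "$d"
done
ionice -c3 nice -n 19 rm -rf /data/bigdir   # finally remove the mostly-empty top dir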

This fs has been hammered pretty hard most days for over 6 years and I've never seen anything like this.  Granted, I've probably never rm'd such a big dir in one go before.  This certainly seems to be related to a massive rm occurring while other normal things run in the background, like recording mythtv and running some light rsyncs, etc.

This bug may be hard to reproduce, and it may be quite nerve-wracking to reproduce intentionally, as restoring 10TB from backup would not be fun.

The computer is older (P-D) but has ECC and reliable-brand components and has never given any errors of any sort before.  It is not showing any bad caps on board or PS.

I should note that during the entire episode the RAID6 array was fine, with all U's in /proc/mdstat and no errors in the logs relating to dma, scsi, sda/etc, or the disks.


Version-Release number of selected component (if applicable):
kernel-PAE-3.10.11-200.fc19.i686
(Probably; it could also have been 3.10.9 or 3.10.10.  The computer hadn't been rebooted in a while and the boot-time logs had spooled off, from what I can see.)

How reproducible:
once, so far
perhaps hard to reproduce

Steps to Reproduce:
1. Run a massive rm -rf on a dir on a massive XFS fs that is under decent background load (see the sketch after these steps)
2. Wait a day
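
Something along these lines, if anyone wants to try (a purely hypothetical sketch; the mount point, file sizes, and directory name are made up — the real case was an existing ~1M-file, 1.5TB directory on a 92%-full 11TB fs):

# Keep some background write load going on the XFS fs (hypothetical mount point /mnt/xfs).
while true; do
    dd if=/dev/zero of=/mnt/xfs/loadfile bs=1M count=512 conv=fsync
done &

# Then delete a huge existing directory tree in one go and wait.
rm -rf /mnt/xfs/huge_dir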

Actual results:
The fs gets very slow.  XFS crashes and disables the fs.  The system still runs OK (no panic); you just can't access or umount the crashed fs.

Expected results:
Should delete the directory and do its normal processing happily

Additional info:

Comment 1 Trevor Cordes 2013-11-13 10:14:12 UTC
Created attachment 823324 [details]
relevant log entries around fs crash time

Comment 2 Trevor Cordes 2013-11-13 10:19:45 UTC
After compiling/redacting the log I realized more things:

The kernel running at the time was actually 3.10.10-200.fc19.i686.PAE.  (The grep I was using in comment #1 was wrong.)

Line 14 of the attachment says the kernel is not tainted, but on line 80, a few minutes later, it is suddenly tainted?  Huh?  I have no idea how that can happen.  I wasn't loading or unloading any modules, and I'm pretty sure there is no taint on that system (the only thing "weird" in it is a Hauppauge 250 card).

Also, I had not noticed the second call trace earlier.  It looks like something else went wonky during my attempts at shutdown.  It may not be relevant, as it may have happened during my frantic alt-sysrq work, which did take a while.  If md1 refers to md RAID device md1, that's my / partition, on an Intel SSD separate from the RAID6 array.

Comment 3 Justin M. Forbes 2014-01-03 22:09:08 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through, and several of them have gone stale.  Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 4 Justin M. Forbes 2014-03-10 14:41:06 UTC
*********** MASS BUG UPDATE **************

This bug has been in a needinfo state for more than 1 month and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 19, please feel free to reopen the bug and provide the additional information requested.