Red Hat Bugzilla – Bug 1029820
XFS "Corruption detected" / xfs_do_force_shutdown called on heavy deletes; no on-disk corruption found
Last modified: 2014-03-10 10:41:06 EDT
Description of problem:
File server with XFS fs, 11TB usable (RAID6), 92% full, 9.2 million inodes used.
While deleting a large directory (1.5TB, about 1M files, some of them quite small), the delete ran for over 14 hours. The fs became quite sluggish, especially for writes (both daemons and interactive), and eventually blew up:
Nov 2 02:36:53 piles kernel: [3582267.113743] XFS (md4): Corruption detected. Unmount and run xfs_repair
Nov 2 02:36:53 piles kernel: [3582267.329675] XFS (md4): xfs_do_force_shutdown(0x8) called from line 1365 of file fs/xfs/xfs_buf.c. Return address = 0xf8ec8542
Nov 2 02:36:54 piles kernel: [3582268.457035] XFS (md4): Corruption of in-memory data detected. Shutting down filesystem
Nov 2 02:36:54 piles kernel: [3582268.512041] XFS (md4): Please umount the filesystem and rectify the problem(s)
Nov 2 02:36:54 piles kernel: [3582268.585285] XFS (md4): xfs_difree: xfs_inobt_lookup() returned error 5.
Nov 2 02:36:54 piles kernel: [3582268.618031] XFS (md4): xfs_log_force: error 5 returned.
(See next attachment for full log including call trace.)
I was unable to umount the fs no matter what I tried. lsof just blew up, and since that fs is shared via NFS, SMB, etc., it was impossible to clean up for umount: most things simply errored, froze, or died. I managed a pseudo-clean reboot (to avoid destroying the RAID6) with some Alt-SysRq syncs and a reboot, after disabling the fs from mounting on the next boot.
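For reference, the Alt-SysRq sequence used here can also be driven from a shell via /proc/sysrq-trigger. This is a hedged sketch of that mechanism, not a record of the exact keys pressed; it requires root and magic SysRq support, and actually syncs and reboots the machine, so it is shown for illustration only:

```shell
# Emergency sync/reboot via magic SysRq from a shell (requires root and
# CONFIG_MAGIC_SYSRQ). Same effect as the Alt-SysRq key chords.
echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
echo s > /proc/sysrq-trigger      # 's': emergency sync of all filesystems
echo u > /proc/sysrq-trigger      # 'u': remount all filesystems read-only
echo b > /proc/sysrq-trigger      # 'b': immediate reboot, no clean shutdown
```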
xfs_repair -v -t 60 -m 1550 /dev/md4
(it died unless the memory limit was set to around -m 1550)
It ran for about 30 minutes and found zero errors! I was sure there would be some on-disk corruption behind the fs blow-up.
So I mounted it and it was all good, like nothing ever happened.
I checked the results of my 16-hour rm and found it had deleted only around half the files. So I ran the rm again, this time under heavy ionice and nice, and in smaller batches. It removed everything fine that way: quite quickly and with no blow-ups.
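The second, successful approach can be sketched as a small script. This is an illustration of the batching/throttling idea, not the exact commands used; paths and batch granularity (one subdirectory per rm) are assumptions:

```shell
#!/bin/sh
# Sketch of a batched, throttled delete: remove one subdirectory at a
# time at idle I/O class and lowest CPU priority, so other writers on
# the filesystem are not starved during the mass unlink.
throttled_rm() {
    dir=$1
    io=""
    if command -v ionice >/dev/null 2>&1; then
        io="ionice -c 3"          # idle I/O class, if ionice is present
    fi
    for sub in "$dir"/*/; do
        if [ -d "$sub" ]; then
            $io nice -n 19 rm -rf -- "$sub"
        fi
    done
    $io nice -n 19 rm -rf -- "$dir"   # leftover files, then the dir itself
}
```

Batching at subdirectory granularity also gives the log tail and background daemons a chance to make progress between unlink bursts.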
This fs has been hammered pretty hard most days for over six years and I've never seen anything like this. Granted, I've probably never rm'd such a big directory in one go before. This certainly seems related to a massive rm occurring while other normal things run in the background, like recording MythTV and running some light rsyncs.
This bug may be hard to reproduce, and intentionally reproducing it would be quite nerve-wracking, as restoring 10TB from backup would not be fun.
The computer is older (P-D) but has ECC and reliable-brand components and has never given any errors of any sort before. There are no bad caps visible on the board or in the power supply.
I should note that during the entire episode the RAID6 array was fine, with all U's in /proc/mdstat and no errors in the logs about dma, scsi, sda, the disks, etc.
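The "all U's" check can be scripted: /proc/mdstat prints one status bitmap per array, e.g. [UUUUUU] when all members are up, with an underscore marking a failed member. A minimal sketch (the helper reads mdstat-style text on stdin, purely for illustration):

```shell
# Succeeds if no md status bitmap in the input contains an underscore,
# i.e. every array member is up ("[UUUU]" good, "[UU_U]" degraded).
md_all_up() {
    ! grep -q '\[[U_]*_[U_]*\]'
}

# Typical use on a live system:
#   md_all_up < /proc/mdstat && echo "all md members up"
```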
Version-Release number of selected component (if applicable):
(Probably 3.10.9 or 3.10.10; the computer hadn't been rebooted in a while and the boot-time logs have spooled off, from what I can see.)
How reproducible:
Once, so far; perhaps hard to reproduce.
Steps to Reproduce:
1. Run a massive rm -rf on a dir on a massive XFS fs under decent load
2. Wait a day
Actual results:
The fs gets very slow; then XFS crashes and disables the fs. The system still runs fine (no panic), but you can't access or umount the crashed fs.
Expected results:
It should delete the directory and carry on with its normal processing happily.
Created attachment 823324 [details]
relevant log entries around fs crash time
After compiling/redacting the log I realized more things:
The kernel running at the time was 3.10.10-200.fc19.i686.PAE. (The grep I used in comment #1 was wrong.)
Line 14 of the attachment says the kernel is not tainted, but on line 80, a few minutes later, it is suddenly tainted? Huh? I have no idea how that can happen: I wasn't loading or unloading any modules, and I'm pretty sure there is no taint on that system (the only "weird" thing in it is a Hauppauge 250 card).
Also, I hadn't noticed the second call trace earlier. It looks like something else went wonky during my shutdown attempts. It may not be relevant, as it may have occurred during my frantic Alt-SysRq work, which took a while. If md1 refers to the md RAID device md1, that's my / partition on an Intel SSD, separate from the RAID6 array.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.
Fedora 19 has now been rebased to 3.12.6-200.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.
If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************
This bug has been in a needinfo state for more than 1 month and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 19, please feel free to reopen the bug and provide the additional information requested.