Bug 910635 - XFS filesystem + md RAID hang on reboot
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jes Sorensen
QA Contact: XiaoNi
Duplicates: 950460
 
Reported: 2013-02-13 04:00 UTC by storm9c1
Modified: 2014-06-16 18:24 UTC
CC List: 9 users

Doc Type: Bug Fix
Last Closed: 2014-06-03 11:44:39 UTC


Attachments
Traceback after 120 seconds of hang (3.28 KB, application/octet-stream), 2013-02-13 04:00 UTC, storm9c1
sysrq-t at time of hang (35.28 KB, application/octet-stream), 2013-02-13 04:04 UTC, storm9c1
Patch proposed for 5.10 (2.74 KB, patch), 2013-05-03 08:52 UTC, Jes Sorensen


Links: CentOS 0006217

Description storm9c1 2013-02-13 04:00:11 UTC
Created attachment 696702 [details]
Traceback after 120 seconds of hang

Description of problem:
As of RHEL 5.9, kernel 2.6.18-348.el5 (x86_64 and i386), a reboot of a system using XFS results in a hang immediately after printing:

Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
md: md2 switched to read-only mode.
md: md1 switched to read-only mode.
(hang)

After 120 seconds, a traceback is produced.  See the attached file for the traceback.

It's important to note that:
* Same hardware, kickstarted with the same configuration -- RHEL 5.8 does not hang on reboot.
* Same hardware, kickstarted with the same configuration -- RHEL 5.9 with the kernel downgraded to 2.6.18-308.el5 does NOT hang on reboot.
* Same hardware, kickstarted with the same configuration -- RHEL 5.[678] with the kernel upgraded to 2.6.18-348.el5 hangs on reboot.
* Same hardware, kickstarted with the same configuration -- RHEL 5.9 using ext3 for the root fs does not hang on reboot.
* Same hardware, kickstarted with the same configuration -- RHEL 5.9 using ext3 for the root fs, plus an XFS data partition (md and non-md), does not hang on reboot.
* Same hardware, kickstarted with the same configuration -- RHEL 5.9 with NO md raid1 does NOT hang on reboot.

I suspect something changed between kernel levels 308 and 348 that interferes with md raid1 and the root filesystem.

This bug is being filed not to receive support for using XFS as the root filesystem (since it is clear that this is an unsupported configuration), but because I suspect that XFS as a root filesystem is EXPOSING a more nefarious bug that could potentially affect other supported filesystem types.

Discussions with folks on the XFS forum have produced this explanation:
"A fsync on the superblock when the filesystem is mounted read only should not be writing anything.

And, indeed, there's the problem - the fsync_bdev() path in your
original stack trace (i.e. in the partition invalidation code) does
not do a read-only check before telling the filesystem to write it's
superblock. RHEL6 and upstream both have checks in this path to
prevent this from happening."
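For reference, the check upstream does looks roughly like the following sketch, modeled on fs/sync.c from later kernels (the exact RHEL6 code may differ):

/*
 * Sketch of the read-only guard in the upstream sync path (modeled on
 * fs/sync.c from later kernels; exact code varies by version).  If the
 * superblock is mounted read-only, the sync becomes a no-op instead of
 * asking the filesystem to write its superblock.
 */
int sync_filesystem(struct super_block *sb)
{
	/* No point in syncing anything if the filesystem is read-only. */
	if (sb->s_flags & MS_RDONLY)
		return 0;

	return __sync_filesystem(sb, 1);	/* the real writeback path */
}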


Version-Release number of selected component (if applicable):
kernel 2.6.18-348.el5
kernel 2.6.18-348.1.1.el5

How reproducible:
Always


Steps to Reproduce:
1. Create a md raid1 device.
2. For the sake of testing, put the root filesystem on XFS (either through a migration process or using a patched Anaconda).
3. Reboot.
  
Actual results:
System hang


Expected results:
Normal system reboot should happen.

Additional info:
See attachment(s) for tracebacks

Comment 1 storm9c1 2013-02-13 04:04:47 UTC
Created attachment 696703 [details]
sysrq-t at time of hang

Comment 2 Jes Sorensen 2013-02-13 11:57:29 UTC
Which disk controller do you use? Does this happen to be an MPT SAS?

Jes

Comment 3 storm9c1 2013-02-13 17:08:20 UTC
(In reply to comment #2)
> Which disk controller do you use? Does this happen to be an MPT SAS?
> 
> Jes

No, it's the sata_sil driver, with 2 identical Samsung SATA drives.  But this doesn't matter; I was able to reproduce it on other generic hardware.  No hardware RAID (nor any fancy RAID controller) is being used.

All other kernels in the 5.X series work fine until 348.

During the discussion on the XFS forum, this changeset was also brought up:

11ff4073: [md] Fix reboot stall with raid on megaraid_sas controller

Is that related to the MPT SAS question you are asking?

Comment 4 Eric Sandeen 2013-02-13 20:49:23 UTC
Jes, FWIW, we don't support XFS on root.  Still, it might bear investigation.  I agree that 11ff4073 seems to be the relevant change.

Comment 5 storm9c1 2013-02-13 22:07:30 UTC
Eric, I agree.  For the record, in the bug request, I wasn't asking for XFS on root support.  I am interested in having the possible fsync_bdev() on a read-only device bug investigated.  It appears that using XFS root is an easy way to expose this bug (though there might be other mechanisms to expose it as well).  I expect this bug to be dangerous for data integrity on other filesystem types.

Comment 6 Jes Sorensen 2013-05-03 08:45:15 UTC
I have a patch for problems caused by 11ff4073 which should make it into 5.10.

I am optimistic it will solve this problem as well.

Jes

Comment 7 Jes Sorensen 2013-05-03 08:49:20 UTC
*** Bug 950460 has been marked as a duplicate of this bug. ***

Comment 8 Jes Sorensen 2013-05-03 08:52:08 UTC
Created attachment 743120 [details]
Patch proposed for 5.10

Comment 9 Bernhard Erdmann 2013-05-03 11:19:47 UTC
How can I apply and test this patch?

Comment 10 storm9c1 2013-05-04 22:44:13 UTC
Patch applied to the md.c driver.  RHEL 5.9 (348) Kernel RPM rebuilt.  New kernel package applied to fresh system.  Does not appear to fix the problem by itself:

[root@test9][/root]# uname -a
Linux test9.skymagik.net 2.6.18-348.el5.999 #1 SMP Sat May 4 15:28:50 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

(I used kernel buildid 999 for testing this patch)


Despite the patch, something is still causing md to go read-only.  This is printed at the end of the reboot/halt process just before the hang:

md: stopping all md devices.
md: md2 switched to read-only mode.
md: md1 switched to read-only mode.
md: md3: sync done.
md: checkpointing recovery of md3.

Comment 11 Bernhard Erdmann 2013-05-08 10:40:42 UTC
I downloaded the appropriate kernel SRPM, installed it, integrated your patch into the spec file (changing the version from 2.6.18-348.4.1.el5 to 2.6.18-348.4.2.el5) and built a new kernel RPM (rpmbuild -bb kernel.rpm). I installed this kernel RPM (--nogpgcheck) and rebooted into the new kernel. When shutting it down, I do not see any difference in the shutdown procedure's output, and the machine stops in the same way, blocking the reboot.

[...]
Unmounting pipe file systems:
Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
md: md0 switched to read-only mode.
md: md1 switched to read-only mode.
md: md4 switched to read-only mode.
md: md5 switched to read-only mode.
md: md2 switched to read-only mode.
[...stop...]

Comment 12 Bernhard Erdmann 2013-05-08 10:41:54 UTC
The correct command was "rpmbuild -bb kernel.spec".

Comment 13 Bernhard Erdmann 2013-05-08 10:58:58 UTC
Here is how I patched the kernel spec file:

--- kernel.spec~        2013-04-16 21:20:51.000000000 +0200
+++ kernel.spec 2013-05-04 16:50:26.502136932 +0200
@@ -75,7 +75,7 @@
 #
 %define sublevel 18
 %define stablerev 4
-%define revision 348.4.1
+%define revision 348.4.2
 %define kversion 2.6.%{sublevel}.%{stablerev}
 %define rpmversion 2.6.%{sublevel}
 %define release %{revision}%{dist}%{?buildid}
@@ -424,6 +424,7 @@
 Patch1: kernel-2.6.18-redhat.patch
 Patch2: xen-config-2.6.18-redhat.patch
 Patch3: xen-2.6.18-redhat.patch
+Patch4: md.patch

 # empty final patch file to facilitate testing of kernel patches
 Patch99999: linux-kernel-test.patch

Comment 14 Jes Sorensen 2013-05-08 11:00:14 UTC
Note it is normal that the root device isn't stopped but just switched to read-only mode. Hence, I wonder if the problems you are seeing are due to XFS.

Jes

Comment 15 Bernhard Erdmann 2013-05-08 11:10:53 UTC
The root fs on this machine is on /dev/md3 - and md3 is not mentioned in the output before the stop occurs.

# df /
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md3               7813312   3285312   4528000  43% /

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      104320 blocks [2/2] [UU]

md4 : active raid1 sdb6[1] sda6[0]
      7823552 blocks [2/2] [UU]

md5 : active raid1 sdb7[1] sda7[0]
      609273024 blocks [2/2] [UU]

md2 : active raid1 sdd1[1] sdc1[0]
      1465135936 blocks [2/2] [UU]

md3 : active raid1 sdb5[1] sda5[0]
      7823552 blocks [2/2] [UU]

unused devices: <none>


[...]
Unmounting pipe file systems:
Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
md: md0 switched to read-only mode.
md: md1 switched to read-only mode.
md: md4 switched to read-only mode.
md: md5 switched to read-only mode.
md: md2 switched to read-only mode.
[...stop...]

Comment 16 storm9c1 2013-05-08 15:26:25 UTC
Yep, sounds like we both tested the patch the same way and ran into the same outcome (more work is needed before this patch can be fully utilized).

I'm not going to discount XFS's involvement, given what was mentioned on the XFS forum when this bug first showed up:

"A fsync on the superblock when the filesystem is mounted read only should not be writing anything.
And, indeed, there's the problem - the fsync_bdev() path in your
original stack trace (i.e. in the partition invalidation code) does
not do a read-only check before telling the filesystem to write it's
superblock. RHEL6 and upstream both have checks in this path to
prevent this from happening."

While XFS on the root filesystem exposes this bug, I think all filesystem types are affected as well, but simply ignore the condition.  I imagine this is also a problem because the other filesystem types could be silently corrupted in this case.  Trying to do a fsync_bdev() (or any writes, for that matter) on a read-only device should not be allowed, and should be handled before the underlying device becomes read-only.  I think the patch is a step in the right direction, but it needs to be fully implemented (or implemented differently) for the root filesystem as well to resolve this problem.
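As an illustration of where such a check could live, here is a hypothetical sketch against the 2.6.18-era fsync_bdev() in fs/buffer.c (this is NOT the proposed 5.10 patch, just a sketch of the idea):

/*
 * Hypothetical sketch only; not the actual patch.  A read-only check
 * in fsync_bdev() would stop the partition-invalidation path from
 * asking a read-only filesystem to write its superblock.
 */
int fsync_bdev(struct block_device *bdev)
{
	struct super_block *sb = get_super(bdev);

	if (sb) {
		int res;

		if (sb->s_flags & MS_RDONLY) {	/* hypothetical guard */
			drop_super(sb);
			return 0;	/* nothing to write on a r/o mount */
		}
		res = fsync_super(sb);
		drop_super(sb);
		return res;
	}
	return sync_blockdev(bdev);
}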

Comment 17 Jes Sorensen 2013-05-08 15:34:58 UTC
What you are saying is that the patch I proposed is causing this fsync_bdev()?

The reason for the patch was reports that a write of metadata would result in an OOPS. I have a couple of reports on that, and all confirm the patch solves the problem. The XFS case is somewhat different, but given it was related to the original patch pushed into 5.9, I figured it was worth a try.

Given we don't support XFS on root in 5.x, I haven't been looking closely
at this particular case, but if you have details please share.

Jes

Comment 18 storm9c1 2013-05-08 19:54:57 UTC
I am saying that the original traceback attached to this bug shows the fsync_bdev() happening after the device goes read-only, and this patch does not correct that.  Looking at the patch, it has the potential to correct it, but something else is putting the root md device into read-only mode.  For this patch to work, shouldn't we see the mode switched to safemode instead of read-only?

Here is a portion of the original traceback.  It appears to be hanging at md_write_start(), and reading further down, this is triggered by fsync_bdev().  These are all happening on an md device that has already been switched to read-only, so this is the root cause of the bug; it really has nothing to do with XFS.  But XFS is paranoid enough to hang and wait for the I/O request to complete (which it never will).  Other filesystems do not, and they should (which could be considered a bug as well).  Can you tell us why these fsync calls are happening on a read-only device (regardless of filesystem type)?  If we can answer this question, this bug will make much more sense.

Call Trace:
 [<ffffffff8002e4bc>] __wake_up+0x38/0x4f
 [<ffffffff80223bce>] md_write_start+0xf2/0x108
 [<ffffffff800a3bc2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000ab62>] get_page_from_freelist+0x380/0x442
 [<ffffffff880b102c>] :raid1:make_request+0x38/0x5d8
 [<ffffffff8001c839>] generic_make_request+0x211/0x228
 [<ffffffff8002389f>] mempool_alloc+0x31/0xe7
 [<ffffffff8001a98f>] vsnprintf+0x5d7/0xb54
 [<ffffffff80033695>] submit_bio+0xe6/0xed
 [<ffffffff8807f801>] :xfs:_xfs_buf_ioapply+0x1f2/0x254
 [<ffffffff8807f89c>] :xfs:xfs_buf_iorequest+0x39/0x64
 [<ffffffff8808386c>] :xfs:xfs_bdstrat_cb+0x36/0x3a
 [<ffffffff8807c0a8>] :xfs:xfs_bwrite+0x5e/0xba
 [<ffffffff88077669>] :xfs:xfs_syncsub+0x119/0x226
 [<ffffffff88084ce2>] :xfs:xfs_fs_sync_super+0x33/0xdd
 [<ffffffff8010aa44>] quota_sync_sb+0x2e/0xf0
 [<ffffffff800e55bd>] __fsync_super+0x1b/0x9e
 [<ffffffff800e578a>] fsync_super+0x9/0x16
 [<ffffffff800e57c1>] fsync_bdev+0x2a/0x3b
 [<ffffffff8014ea59>] invalidate_partition+0x28/0x40
 [<ffffffff802225a8>] do_md_stop+0xa0/0x2ec
 [<ffffffff80224d41>] md_notify_reboot+0x5f/0x120
 [<ffffffff80067565>] notifier_call_chain+0x20/0x32
 [<ffffffff8009de98>] blocking_notifier_call_chain+0x22/0x36
 [<ffffffff8009e220>] kernel_restart_prepare+0x18/0x29
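For context, here is a simplified sketch of md_write_start() from 2.6.18-era drivers/md/md.c (the exact RHEL source may differ).  The first write after the array leaves "in_sync" marks the superblock dirty and then sleeps until the superblock has been written back out; once the array has been forced read-only during shutdown, that write never happens and the caller sleeps forever, which is where the traceback above is stuck:

void md_write_start(mddev_t *mddev, struct bio *bi)
{
	if (bio_data_dir(bi) != WRITE)
		return;

	atomic_inc(&mddev->writes_pending);
	if (mddev->in_sync) {
		spin_lock_irq(&mddev->write_lock);
		if (mddev->in_sync) {
			/* first write: the superblock must be updated first */
			mddev->in_sync = 0;
			mddev->sb_dirty = 1;
			md_wakeup_thread(mddev->thread);
		}
		spin_unlock_irq(&mddev->write_lock);
	}
	/* never satisfied once the array has been forced read-only:
	 * nothing is left to write the superblock and clear sb_dirty */
	wait_event(mddev->sb_wait, mddev->sb_dirty == 0);
}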

Comment 19 justin.vegso 2013-06-05 18:06:05 UTC
I use XFS extensively for its data reliability and checksum/b+ tree commit features.

I'm a paying RH customer (additionally, I have specifically paid for XFS support), and quite frankly, I'm more than a little concerned that my XFS data LUNs of Enterprise-critical data might biff it and go corrupt on a reboot due to improper shutdown semantics if I were to upgrade to a 5.9 kernel.  This is holding me back in a few cases where I would like to use XFS on production machine data LUNs with RHEL 5.9.

Justin

Comment 20 Eric Sandeen 2013-06-05 20:48:39 UTC
Justin, as far as I know this issue affects XFS on the root filesystem, something which is not supported.  Non-root data LUNs should be unaffected by this behavior, as far as I understand it.

However, if you have further concerns I'd encourage you to talk to your support channels so that things are properly tracked and addressed.

Thanks,
-Eric

Comment 21 Eric Sandeen 2013-06-05 21:45:26 UTC
From the upstream thread on this, Dave found the same commit that I had identified:

(Subject: XFS appears to cause strange hang with md raid1 on reboot)

---
I can tell you the exact commit in the RHEL 5.9 tree that
caused this regression:

11ff4073: [md] Fix reboot stall with raid on megaraid_sas controller

The result is that the final shutdown of md devices now uses a
"force readonly" method, which means it ignores the fact that a
filesystem may still be active on top of it and rips the device out
from under the filesystem. This really only affects root devices,
and given that XFS is not supported as a root device on RHEL, it
isn't in the QE test matrix and so the problem was never noticed.
---

-eric
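Putting the pieces together, the shutdown path Dave describes looks roughly like this (a simplified sketch of the 2.6.18-era drivers/md/md.c; the exact RHEL 5.9 code after commit 11ff4073 differs in detail):

/*
 * Simplified sketch of the md reboot path (2.6.18 era; not the exact
 * RHEL 5.9 code).  The reboot notifier stops every array regardless
 * of whether a filesystem is still mounted on top of it, and
 * do_md_stop() calls invalidate_partition(), whose fsync_bdev() then
 * asks that still-mounted filesystem to write its superblock while
 * the array is being torn down, a write that can never complete
 * (see md_write_start in the traceback in comment 18).
 */
static int md_notify_reboot(struct notifier_block *this,
			    unsigned long code, void *x)
{
	struct list_head *tmp;
	mddev_t *mddev;

	if (code == SYS_DOWN || code == SYS_HALT || code == SYS_POWER_OFF) {
		printk(KERN_INFO "md: stopping all md devices.\n");

		ITERATE_MDDEV(mddev, tmp)
			if (mddev_trylock(mddev))
				do_md_stop(mddev, 1);	/* 1 == read-only stop */
	}
	return NOTIFY_DONE;
}

static int do_md_stop(mddev_t *mddev, int mode)
{
	struct gendisk *disk = mddev->gendisk;

	/* Syncs (and therefore writes to) whatever filesystem is still
	 * mounted on this array; this is the fsync_bdev() in the trace. */
	invalidate_partition(disk, 0);

	if (mode == 1) {
		mddev->ro = 1;
		printk(KERN_INFO "md: %s switched to read-only mode.\n",
		       mdname(mddev));
	}
	return 0;
}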

Comment 23 Jes Sorensen 2013-10-11 08:03:30 UTC
Could you please test this against 5.10 and let us know if you still see the
problem?

Thanks,
Jes

Comment 24 storm9c1 2013-10-17 02:32:15 UTC
Thanks for keeping me posted.  Unfortunately, using kernel-2.6.18-371..., the system still hangs on reboot as before, with an identical traceback.
Disappointing.  Let me know if you want me to try any other kernel versions.

Comment 25 RHEL Program Management 2014-03-07 12:11:30 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the last planned RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX. To request that Red Hat re-consider this request, please re-open the bugzilla via appropriate support channels and provide additional business and/or technical details about its importance to you.

Comment 26 RHEL Program Management 2014-06-03 11:44:39 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in the RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).

