Bug 429054
| Summary: | soft lockup while unmounting a read-only filesystem with errors | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Tim Mooney <mooney> |
| Component: | kernel | Assignee: | Eric Sandeen <esandeen> |
| Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
| Severity: | high | Docs Contact: | |
| Priority: | low | ||
| Version: | 5.1 | CC: | esandeen, mgahagan, r.mcmurdo, tumeya |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | rhts | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-01-20 20:22:22 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Tim Mooney
2008-01-17 00:15:06 UTC
Just to provide some more info:

In every case where we've experienced this error, it's been with a filesystem (ext3) that's on top of a software RAID1. The RAID1 is composed of two volumes coming off of two separate (and geographically distinct) SAN arrays. We're using QLogic fibre-channel cards, in some cases QLE2342s, in other cases QLA2200s.

Note also that we don't believe this would be impacting us so much if it weren't for what we believe is another bug: when half of a RAID1 mirror fails, that failure should not be visible at the ext3 filesystem level, yet it is. We've reported that as a separate problem, bug #430984.

The two problems together make for a nasty one-two punch: if we lose communication with one of our geographically separate mirror halves, that error isn't confined to the MD level, and eventually becomes a filesystem error. Then, when we try to umount that filesystem and fix it, we get a system lockup and are forced to do a hard reboot.

Ok, this is pretty easy to recreate I'm afraid!

    set up quotas on a filesystem
    do a bit of IO
    mount -o remount,abort
    umount

blam.

    BUG: soft lockup - CPU#1 stuck for 10s! [umount:28931]
    ...
    Pid: 28931, comm: umount Tainted: G 2.6.18-88.el5 #1
    RIP: 0010:[<ffffffff88056b17>] [<ffffffff88056b17>] :ext3:ext3_journal_start_sb+0x3d/0x46
    RSP: 0018:ffff81011feeddb8 EFLAGS: 00000202
    ...
    Call Trace:
    [<ffffffff88058f55>] :ext3:ext3_release_dquot+0x42/0x75
    [<ffffffff800f68d7>] dqput+0x15d/0x19f
    [<ffffffff800f755d>] vfs_quota_off+0xf6/0x3c7
    [<ffffffff800db1c8>] deactivate_super+0x5b/0x82
    [<ffffffff800e3fc4>] sys_umount+0x245/0x27b
    [<ffffffff800b3f7a>] audit_syscall_entry+0x16e/0x1a1
    [<ffffffff8005d28d>] tracesys+0xd5/0xe0

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9c3013e9b91ad23ecae88e45405e98208cce455d is the likely fix, and simple enough. I'll test it.

-Eric

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release.
Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

FWIW, tested with something like this, substitute your favorite IO generator (or maybe even none is needed...)

    mkfs.ext3 /dev/sdb5
    mount -o quota /dev/sdb5 /mnt/test
    quotacheck /mnt/test
    setquota -u root 10000 10000 10000 10000 /mnt/test
    quotaon /mnt/test
    cd /mnt/test
    cp ../linux-2.6.25.1.tar .
    tar xvf linux-2.6.25.1.tar
    sync
    cd
    mount -o remount,abort /mnt/test
    sleep 10
    umount /mnt/test

-Eric

in kernel-2.6.18-98.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Confirmed the bug on the 92.1.13 kernel.

Confirmed fix is working with the -116 kernel.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Hi,

I feel that this has regressed, as I have been able to reproduce this on RHEL5.4.

Kernel Version: 2.6.18-164.6.1.el5 x86_64

Steps to reproduce:

    mkfs.ext3 /dev/hdb1
    tune2fs -e remount-ro /dev/hdb1
    mount -o quota /dev/hdb1 /mnt
    quotacheck /mnt
    setquota -u root 10000 10000 10000 10000 /mnt
    quotaon /mnt
    dd if=/dev/zero of=/mnt/dump
    # Cause corruption
    dd if=/dev/zero of=/dev/hdb1
    # Cause filesystem to remount ro
    file /mnt/dump
    # Cause Soft lockup
    umount /mnt

I believe the initial patch didn't fix the problem. So this is not a regression; the problem simply was never fixed until now. I looked around for another patch upstream and this should fix the problem: b48d380541f634663b71766005838edbb7261685

Built on 5.4 and the issue is gone.
I've already started working on getting this incorporated into the tree.

Tim, looks like my handy testcase was not representative of your bug. Thanks for the testcase in comment #14, that makes it clearer. Please open another new bug for this, with that testcase - I don't think we can recycle this one, sorry!

-Eric

Hi Eric,

I thought that would be the case, so I already opened a new bug some time ago: https://bugzilla.redhat.com/show_bug.cgi?id=546060

Ah so you did, thanks. (and sorry for calling you Tim) :)
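
When re-running either testcase above, it can help to check the kernel log for the soft lockup message mechanically instead of by eye. The following is only a minimal sketch: `parse_soft_lockups` is a hypothetical helper name (not part of any kernel or quota tooling), and it assumes the message keeps the exact `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` shape quoted in the description, which may differ on other kernel versions.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an existing tool): reads kernel log
# text on stdin and prints one summary line per soft-lockup report.
# Assumes the message format quoted in this bug:
#   BUG: soft lockup - CPU#1 stuck for 10s! [umount:28931]
parse_soft_lockups() {
    # -n suppresses default output; the trailing 'p' prints only lines
    # where the substitution matched, so unrelated log lines are dropped.
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\([^]:]*\):\([0-9][0-9]*\)].*/cpu=\1 secs=\2 comm=\3 pid=\4/p'
}
```

A typical use would be `dmesg | parse_soft_lockups` from a second terminal while the umount is hanging; no output means no line in that format was found, which does not by itself prove the lockup is absent.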