Running Red Hat Enterprise Linux 5.1 (32-bit) with all updates applied on various Intel x86 systems. Kernel is 2.6.18-53.1.4.el5.

1) Set the "error behavior" for a filesystem to "remount read-only", via tune2fs:

   tune2fs -e remount-ro /dev/sanvg1/home

2) Cause an error on that filesystem, perhaps by removing the disk that the filesystem resides on. The kernel will log errors and then switch the filesystem so that it's mounted read-only:

   Jan 16 06:51:58 host kernel: Remounting filesystem read-only

3) Stop/kill anything that may be accessing that filesystem, and verify that nothing is open on it using lsof.

4) Attempt to unmount the filesystem. The system will log a lockup message, and then hang.

We've seen several different variations of the stack trace in the past week. Here are a couple:

Jan 16 08:07:34 imap4 kernel: BUG: soft lockup detected on CPU#0!
Jan 16 08:07:34 imap4 kernel:  [<c044d1ec>] softlockup_tick+0x96/0xa4
Jan 16 08:07:34 imap4 kernel:  [<c042ddb0>] update_process_times+0x39/0x5c
Jan 16 08:07:34 imap4 kernel:  [<c04196fb>] smp_apic_timer_interrupt+0x5b/0x6c
Jan 16 08:07:34 imap4 kernel:  [<c04059bf>] apic_timer_interrupt+0x1f/0x24
Jan 16 08:07:34 imap4 kernel:  [<c0498195>] dqput+0x15b/0x15d
Jan 16 08:07:34 imap4 kernel:  [<c049959a>] vfs_quota_sync+0x9b/0x131
Jan 16 08:07:34 imap4 kernel:  [<c049b321>] quota_sync_sb+0x11/0xcc
Jan 16 08:07:34 imap4 kernel:  [<c0472072>] __fsync_super+0x14/0x83
Jan 16 08:07:34 imap4 kernel:  [<c047214d>] fsync_super+0x8/0x14
Jan 16 08:07:34 imap4 kernel:  [<c04750a8>] do_remount_sb+0x47/0x121
Jan 16 08:07:34 imap4 kernel:  [<c0487ffb>] do_mount+0x1a3/0x646
Jan 16 08:07:34 imap4 kernel:  [<c04876d6>] mntput_no_expire+0x11/0x6a
Jan 16 08:07:34 imap4 kernel:  [<c047d3c7>] link_path_walk+0xb3/0xbd
Jan 16 08:07:34 imap4 kernel:  [<c045fd28>] __handle_mm_fault+0x353/0x87b
Jan 16 08:07:34 imap4 kernel:  [<c0460233>] __handle_mm_fault+0x85e/0x87b
Jan 16 08:07:34 imap4 kernel:  [<c0458a8a>] get_page_from_freelist+0x96/0x310
Jan 16 08:07:34 imap4 kernel:  [<c0606965>] do_page_fault+0x274/0x4b8
Jan 16 08:07:34 imap4 kernel:  [<c048733f>] copy_mount_options+0x26/0x109
Jan 16 08:07:34 imap4 kernel:  [<c048850b>] sys_mount+0x6d/0xa5
Jan 16 08:07:34 imap4 kernel:  [<c0404eff>] syscall_call+0x7/0xb
Jan 16 08:07:34 imap4 kernel: =======================

Jan 12 16:17:42 imap3 kernel: BUG: soft lockup detected on CPU#0!
Jan 12 16:17:42 imap3 kernel:  [<c044d1ec>] softlockup_tick+0x96/0xa4
Jan 12 16:17:42 imap3 kernel:  [<c042ddb0>] update_process_times+0x39/0x5c
Jan 12 16:17:42 imap3 kernel:  [<c04196fb>] smp_apic_timer_interrupt+0x5b/0x6c
Jan 12 16:17:42 imap3 kernel:  [<c04059bf>] apic_timer_interrupt+0x1f/0x24
Jan 12 16:17:42 imap3 kernel:  [<c06057ec>] _spin_lock+0x3/0xf
Jan 12 16:17:42 imap3 kernel:  [<c04980a5>] dqput+0x6b/0x15d
Jan 12 16:17:42 imap3 kernel:  [<c0498c90>] vfs_quota_off+0xd6/0x355
Jan 12 16:17:42 imap3 kernel:  [<c047574e>] deactivate_super+0x45/0x65
Jan 12 16:17:42 imap3 kernel:  [<c0488762>] sys_umount+0x1f0/0x218
Jan 12 16:17:42 imap3 kernel:  [<c0477bcc>] sys_stat64+0xf/0x23
Jan 12 16:17:42 imap3 kernel:  [<c044abaa>] audit_syscall_entry+0x11c/0x14e
Jan 12 16:17:42 imap3 kernel:  [<c0488795>] sys_oldumount+0xb/0xe
Jan 12 16:17:42 imap3 kernel:  [<c0404eff>] syscall_call+0x7/0xb
Jan 12 16:17:42 imap3 kernel: =======================

Note: quota-related calls appear in every one of the stack traces we've seen, so it may also be that you need to set up some quotas and enable quotas on the filesystem you test on.
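When scripting this reproduction, step 2's flip to read-only can be detected from /proc/mounts instead of by watching the kernel log. A minimal sketch, assuming the standard /proc/mounts field layout; the helper name `is_mounted_ro` and the `MOUNTS_FILE` override (useful for testing against a saved copy) are my additions, not from the report:

```shell
#!/bin/sh
# Report whether a mount point is currently mounted read-only by reading
# /proc/mounts (second field is the mount point, fourth the option list).
# Exit status: 0 = mounted ro, 1 = mounted rw, 2 = not mounted at all.
is_mounted_ro() {
    opts=$(awk -v mp="$1" '$2 == mp { print $4 }' "${MOUNTS_FILE:-/proc/mounts}")
    [ -z "$opts" ] && return 2              # no such mount point
    case ",$opts," in
        *,ro,*) return 0 ;;                 # the error behavior has remounted it ro
        *)      return 1 ;;                 # still read-write
    esac
}
```

For example, `is_mounted_ro /home && echo "remounted read-only"` could gate the unmount attempt in a test harness.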
Just to provide some more info: In every case where we've experienced this error, it's been with a filesystem (ext3) that's on top of a software RAID1. The RAID1 is composed of two volumes coming off of two separate (and geographically distinct) SAN arrays. We're using QLogic fibre-channel cards, in some cases QLE2342s, in other cases QLA2200s.
Note also that we don't believe this would be impacting us so much if it weren't for what we believe is another bug: when half of a RAID1 mirror fails, that failure should not be visible at the ext3 filesystem level, yet it is. We've reported that as a separate problem, bug #430984. The two problems together make for a nasty one-two punch: if we lose communication with one of our geographically separate mirror halves, that error isn't confined to the MD level, and eventually becomes a filesystem error. Then, when we try to umount that filesystem and fix it, we get a system lockup and are forced to do a hard reboot.
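Since the trigger here is one half of the RAID1 dropping out, it can be useful to catch the degraded array at the MD level before ext3 ever sees an error. A rough sketch, assuming the usual /proc/mdstat layout where a missing mirror member shows up as an underscore in the bracketed status string (e.g. [U_] instead of [UU]); the function name and the `MDSTAT_FILE` override are mine:

```shell
#!/bin/sh
# Print the names of md arrays whose status string shows a failed member,
# by scanning /proc/mdstat. "[UU]" means both mirror halves are present;
# an underscore, as in "[U_]", marks a missing device.
degraded_arrays() {
    awk '
        /^md/ { array = $1 }                      # e.g. "md0 : active raid1 ..."
        match($0, /\[[U_]+\]/) {                  # status string on the size line
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_")) print array   # "_" = missing mirror half
        }
    ' "${MDSTAT_FILE:-/proc/mdstat}"
}
```

Run from cron or a monitoring hook, an empty result means all arrays are whole; any output names an array to repair before its errors can propagate up to the filesystem.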
Ok, this is pretty easy to recreate I'm afraid!

  set up quotas on a filesystem
  do a bit of IO
  mount -o remount,abort
  umount

blam.

BUG: soft lockup - CPU#1 stuck for 10s! [umount:28931]
...
Pid: 28931, comm: umount Tainted: G 2.6.18-88.el5 #1
RIP: 0010:[<ffffffff88056b17>]  [<ffffffff88056b17>] :ext3:ext3_journal_start_sb+0x3d/0x46
RSP: 0018:ffff81011feeddb8  EFLAGS: 00000202
...
Call Trace:
 [<ffffffff88058f55>] :ext3:ext3_release_dquot+0x42/0x75
 [<ffffffff800f68d7>] dqput+0x15d/0x19f
 [<ffffffff800f755d>] vfs_quota_off+0xf6/0x3c7
 [<ffffffff800db1c8>] deactivate_super+0x5b/0x82
 [<ffffffff800e3fc4>] sys_umount+0x245/0x27b
 [<ffffffff800b3f7a>] audit_syscall_entry+0x16e/0x1a1
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9c3013e9b91ad23ecae88e45405e98208cce455d is the likely fix, and simple enough. I'll test it. -Eric
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
FWIW, tested with something like this, substitute your favorite IO generator (or maybe even none is needed...)

mkfs.ext3 /dev/sdb5
mount -o quota /dev/sdb5 /mnt/test
quotacheck /mnt/test
setquota -u root 10000 10000 10000 10000 /mnt/test
quotaon /mnt/test
cd /mnt/test
cp ../linux-2.6.25.1.tar .
tar xvf linux-2.6.25.1.tar
sync
cd
mount -o remount,abort /mnt/test
sleep 10
umount /mnt/test

-Eric
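The recipe above can be wrapped into a script. This is a sketch only: it substitutes a loop-mounted image file for /dev/sdb5 so no spare partition is needed, and the image path, sizes, `usrquota` option, and the `REPRO_RUN` guard are my additions. It needs root plus the quota tools, and on an unfixed kernel the final umount can soft-lockup the box, so nothing destructive runs unless REPRO_RUN=1:

```shell
#!/bin/sh
# Reproduce the umount soft lockup: quotas on ext3, a bit of IO,
# mount -o remount,abort, then umount.
IMG=/var/tmp/quota-lockup.img
MNT=/mnt/quota-lockup-test

run_repro() {
    set -e
    dd if=/dev/zero of="$IMG" bs=1M count=64    # scratch image instead of /dev/sdb5
    mkfs.ext3 -F "$IMG"
    mkdir -p "$MNT"
    mount -o loop,usrquota "$IMG" "$MNT"
    quotacheck -cu "$MNT"                       # -c creates the quota file
    setquota -u root 10000 10000 10000 10000 "$MNT"
    quotaon "$MNT"
    dd if=/dev/zero of="$MNT/io" bs=1M count=16 # "a bit of IO"
    sync
    mount -o remount,abort "$MNT"               # abort the ext3 journal
    sleep 10
    umount "$MNT"                               # lockup fires here on affected kernels
}

if [ "${REPRO_RUN:-0}" = "1" ]; then
    run_repro
else
    echo "dry run; set REPRO_RUN=1 to execute (may hang an unfixed kernel)"
fi
```

The guard matters because a successful reproduction leaves the machine needing a hard reboot, exactly as described earlier in this report.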
Fixed in kernel-2.6.18-98.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
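A quick way to check whether a given host is already running a build at or past the one named here. The script and its use of GNU `sort -V` are my own convenience, not part of any erratum, and later comments in this report show the first patch was incomplete, so treat the cutoff as illustrative:

```shell
#!/bin/sh
# Compare a kernel release string against the first build said to carry
# the fix (2.6.18-98.el5 per this comment), using GNU version sort.
FIXED="2.6.18-98.el5"

kernel_has_fix() {
    current=${1:-$(uname -r)}
    # Version-sort the two strings; if the fixed build does not sort
    # strictly after the current one, the current kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$current" "$FIXED" | sort -V | head -n1)
    [ "$lowest" = "$FIXED" ]
}
```

For example, `kernel_has_fix || echo "update this host"` flags the 2.6.18-53.1.4.el5 kernel from the original report as vulnerable.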
Confirmed the bug on the 92.1.13 kernel. Confirmed the fix is working with the -116 kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html
Hi, I feel that this has regressed, as I have been able to reproduce this on RHEL 5.4.

Kernel version: 2.6.18-164.6.1.el5 x86_64

Steps to reproduce:

mkfs.ext3 /dev/hdb1
tune2fs -e remount-ro /dev/hdb1
mount -o quota /dev/hdb1 /mnt
quotacheck /mnt
setquota -u root 10000 10000 10000 10000 /mnt
quotaon /mnt
dd if=/dev/zero of=/mnt/dump

# Cause corruption
dd if=/dev/zero of=/dev/hdb1

# Cause filesystem to remount ro
file /mnt/dump

# Cause soft lockup
umount /mnt
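The `tune2fs -e remount-ro` step above can be verified afterwards by parsing the superblock listing. A small sketch; the function name is mine, and "Errors behavior:" is the field label that `tune2fs -l` and `dumpe2fs` print:

```shell
#!/bin/sh
# Print an ext2/ext3 filesystem's configured error behavior ("Continue",
# "Remount read-only", or "Panic") from tune2fs -l output read on stdin.
error_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}
```

Typical use would be `tune2fs -l /dev/hdb1 | error_behavior`, confirming "Remount read-only" before running the rest of the reproduction.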
I believe the initial patch didn't fix the problem, so this is not a regression; it simply wasn't fixed until now. I looked around upstream for another patch, and this one should fix the problem: b48d380541f634663b71766005838edbb7261685. Built on 5.4 and the issue is gone. I've already started the process of incorporating this into the tree.
Tim, looks like my handy testcase was not representative of your bug. Thanks for the testcase in comment #14, that makes it clearer. Please open another new bug for this, with that testcase - I don't think we can recycle this one, sorry! -Eric
Hi Eric, I thought that would be the case, so I already opened a new bug some time ago: https://bugzilla.redhat.com/show_bug.cgi?id=546060
Ah so you did, thanks. (and sorry for calling you Tim) :)