Description of Problem:
When attempting to umount an ext3 filesystem, the process spins in 'R' state,
and cannot be killed (even with -9). The entire system becomes extremely
sluggish (keypresses take seconds, running most commands take minutes) except
for low-level network activity (i.e. pings). Other umounts also freeze,
requiring that the system be rebooted hard to fix it.
The remaining filesystems still function (even those on other partitions of the
same disk), although I/O seems to use up large amounts of system CPU time:
[mjeffery@hera mjeffery]$ time echo OK >file
[mjeffery@hera mjeffery]$ time cat file
[mjeffery@hera mjeffery]$ time cat file
[mjeffery@hera mjeffery]$ time rm file
Version-Release number of selected component (if applicable):
The problem has happened twice on very recent attempts to umount partitions.
They were different partitions, mounting with different journalling levels
(data=journal and data=writeback). However, I have shut the system down cleanly
at least once since the last kernel upgrade, and have mounted and umounted the
filesystems in question repeatedly without problems after rebooting.
I do not have a procedure that can reproduce the problem, even inconsistently.
4 CPU P4 Xeon system (Dell PowerEdge 6650)
Dell PERC 3/QC
All disks are SCSI, on the PERC RAID.
(Hangs were both on RAID-1, but on different logical and physical disks.)
The machine is not yet deployed into production (as a primary Oracle DB server).
Problem was first triggered by a nightly backup process of the root filesystem
to a second disk, which is mounted, rsynced, and umounted. Second time was a
manual attempt to umount a filesystem to switch it to ext2.
I've not seen this reported before, and from the description I can't immediately
tell whether the problem is in a device driver, the ext3 filesystem, or the VFS
or VM layers.
Next time this happens, could you please trap some log output from the kernel to
help us debug this? If you can trap the log to a serial console, that would be
ideal, otherwise you should be able to pick it up from /var/log/messages. The
key things we need are output from
alt+sysrq+p: prints register dump showing where the CPU is stuck
alt+sysrq+t: prints backtrace of each process in the system
The sysrq-p output only shows the output on the current CPU, so please repeat
that several times so that we can capture each CPU in your system.
Still waiting for another lock-up...
Got another freeze-up from the system, and I have some sysreq info which I will
attach shortly. Most was recorded by syslog, but I also have an additional bit
transcribed from a screen-dump. Alt-SysReq-P produced several hex-for-hex
identical dumps in syslog. I have included only the first. There are two
Alt-SysReq-T dumps in syslog; I left both in since they seem at least slightly
Please note that this particular crash did not conform to the same pattern as
the prior two. This one was not triggered by a umount, and the network
interfaces shut off or failed. The fact that user processes were still able to
write to disk is similar, and I the machine's user-level appears to have been
quite sluggish (sysreq response was instantaneous). I will continue trying to
produce a crash more closely matching the original.
The timestamps taken from /var/log/messages are wildly inaccurate---the sysreqs
were typed shortly before 7:00 PM. This may or may not mean anything. In any
case, the clock seemed fine after the reboot.
Created attachment 84746 [details]
syslog output from /var/log/messages
Created attachment 84747 [details]
tail end of sysreq-t output (hand transcribed)
Created attachment 84748 [details]
sysreq-p output (hand transcribed)
Looks like a driver problem. The only sign of a suspicious running ext3 task is:
kjournald R current 2380 225 1 226 224 (L-TLB)
Call Trace: [<c010a831>] do_IRQ [kernel] 0xc1 (0xf48e9e1c))
[<c010d088>] call_do_IRQ [kernel] 0x5 (0xf48e9e40))
[<f883c0cb>] journal_commit_transaction [jbd] 0xc9b (0xf48e9e6c))
[<c018358c>] batch_entropy_process [kernel] 0xac (0xf48e9ea8))
[<c0126575>] update_process_times [kernel] 0x25 (0xf48e9ef0))
[<c01169f9>] smp_apic_timer_interrupt [kernel] 0xa9 (0xf48e9f14))
[<c010d9dd>] call_apic_timer_interrupt [kernel] 0x5 (0xf48e9f38))
[<c011a362>] schedule [kernel] 0x362 (0xf48e9f64))
[<f883e556>] kjournald [jbd] 0x146 (0xf48e9fc0))
[<f883e3f0>] commit_timeout [jbd] 0x0 (0xf48e9fd8))
[<c0107286>] kernel_thread [kernel] 0x26 (0xf48e9ff0))
[<f883e410>] kjournald [jbd] 0x0 (0xf48e9ff8))
and that's simply servicing an interrupt at this time. However, there's also:
MegaServ R F4E65880 4352 790 1 887 740 (NOTLB)
Call Trace: [<f8835f84>] mimd_ioctl_sem [megaraid] 0x0 (0xf5da3bac))
[<f8833370>] megadev_ioctl_done [megaraid] 0x0 (0xf5da3bb4))
[<c0107838>] __down [kernel] 0x68 (0xf5da3bbc))
[<f8835f90>] mimd_ioctl_sem [megaraid] 0xc (0xf5da3bc8))
[<f8835f90>] mimd_ioctl_sem [megaraid] 0xc (0xf5da3bcc))
[<c01079e4>] __down_failed [kernel] 0x8 (0xf5da3bd8))
[<f8835f84>] mimd_ioctl_sem [megaraid] 0x0 (0xf5da3bdc))
[<f88338db>] .text.lock.megaraid [megaraid] 0x84 (0xf5da3be8))
[<f8835f98>] mimd_entry_mtx [megaraid] 0x0 (0xf5da3f64))
[<f8832900>] megadev_ioctl_entry [megaraid] 0x30 (0xf5da3f74))
[<c0153437>] sys_ioctl [kernel] 0x257 (0xf5da3f94))
[<c0108c6b>] system_call [kernel] 0x33 (0xf5da3fc0))
in the trace, so it looks like we have the megaraid driver spinning in a loop
here. It's possible that the unmount is simply getting stalled behind this
What PERC firmware revision are you using?
I believe the "Version" below (1.72:3.27) is the firmware revision. Please
correct me if I'm wrong.
# cat /proc/megaraid/0/config
Controller Type: 438/466/467/471/493
Controller Supports 40 Logical Drives
Controller / Driver uses 64 bit memory addressing
Base = f8837000, Irq = 21, Logical Drives = 5, Channels = 4
Version =1.72:3.27, DRAM = 128Mb
Controller Queue Depth = 254, Driver Queue Depth = 126
# rpm -qa | grep -i mega
We're getting problems with umount on a machine running 2.4.18-17.7. The stuck
umount process looks like:
Nov 20 11:47:31 xpc3 kernel: umount R current 72 19630 19616
Nov 20 11:47:31 xpc3 kernel: Call Trace: [<c013a7e3>] invalidate_bdev [kernel]
Nov 20 11:47:31 xpc3 kernel: [<c013e81d>] kill_bdev [kernel] 0xd (0xe4e1bf20))
Nov 20 11:47:32 xpc3 kernel: [<c013f3d0>] blkdev_put [kernel] 0x50 (0xe4e1bf30))
Nov 20 11:47:32 xpc3 kernel: [<c013da72>] remove_super [kernel] 0x62 (0xe4e1bf44))
Nov 20 11:47:32 xpc3 kernel: [<c013e53a>] kill_super [kernel] 0xca (0xe4e1bf54))
Nov 20 11:47:32 xpc3 kernel: [<c0141ff8>] path_release [kernel] 0x28 (0xe4e1bf6c))
Nov 20 11:47:32 xpc3 kernel: [<c014e0d8>] sys_umount [kernel] 0x78 (0xe4e1bf88))
Nov 20 11:47:32 xpc3 kernel: [<c0115788>] schedule [kernel] 0x218 (0xe4e1bf98))
Nov 20 11:47:32 xpc3 kernel: [<c014e0ec>] sys_oldumount [kernel] 0xc (0xe4e1bfb4))
Nov 20 11:47:32 xpc3 kernel: [<c0108933>] system_call [kernel] 0x33 (0xe4e1bfc0))
(This was umounting an ext3 partition on a standard IDE drive). The umount
process is stuck in "R", at 100% CPU, and isn't listening to any kill signals. I
can't get the journald trace as the machine has too many processes, so it
doesn't make it to the log file.
We also have had umount stick on more than one machine when upgrading the kernel
with rpm. RPM mounts a loopback initrd image, writes it, then umounts it. The
umount can get stuck there with the same symptoms as above (stuck in
invalidate_bdev, see bug 77851)
jss: your problem looks quite different. In the sysreq call trace from the
previous report, the megaraid driver is stuck waiting for a driver semaphore.
Your trace, showing a spin in invalidate_bdev, looks like something I've seen
before --- it's a bad interaction between the low-latency patches and certain
umount conditions, and should be easy to fix.
Actually, from the sounds of it the problem Jeremy encountered sounds very
similar to mine. Keep in mind that the call trace I attached did not match the
circumstances of my original report. I included the information on the theory
that it might be related to the original problem, and since I have not yet
produced trace info for the umount hang, which has steadfastly refused to
reproduce itself since I submitted this bug.
Since it sounds like you have an explanation for the umount hang, do you have an
ETA for a fix?
I've upgraded some of the software/firmware on the machine in question. In
particular, the megaraid firmware is now at 1.74, and the kernel is
I've split off the non-umount-related stuff into its own bug, and have added a
new sysreq trace to it. (Yes it hung again today.) See bug 78312.
This bug was inappropriately marked MODIFIED. (The MODIFIED state refers to
changes in the code itself, and not just the bug report).
Changing to ASSIGNED.
The kernel-2.4.18-19.7.x packages should fix the core umount problem. Please
reopen the bug if it does not. (Megaraid-specific updates should be directed to