Red Hat Bugzilla – Bug 137874
Process stuck in "D" when reading disk
Last modified: 2007-11-30 17:07:04 EST
One of my RHEL 3 ES systems is regularly getting a process stuck in an
uninterruptible wait, state "D".
This system is a network backup server running Arkeia. Arkeia
creates a tree with one directory per system backed up and, below
that, a subdirectory tree that mirrors the backed-up system (i.e.,
if I back up directory / on server ant, I get .../ant/etc,
.../ant/bin, etc.). At the end of each night's backup, I have a
script that runs to clean and back up the tree. It does a find across
the whole tree to clean out old stuff, and then runs tar to save it.
Lately, at least once a week, either the find or the tar process gets
stuck. For example, right now I have:
# ps wwwl -p 5471
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
0 0 5471 2206 15 0 3964 704 wait_o D ? 0:54 tar
--exclude=*.lck -jcf /var/knox/sav/o3dbtree.tar.bz2-new o3dbtree
The only way to clear the process is to reboot the system.
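For reference, a minimal sketch of how the stuck processes can be picked out of a full process listing instead of checking one PID at a time. The function name filter_d_state is made up for illustration; it only assumes that the second column of the ps output is STAT, as it is with the format shown in the usage comment.

```shell
# Keep the header line plus every process whose STAT field starts with "D"
# (uninterruptible sleep). Reads ps output on stdin so the filter can be
# tried against saved output as well as a live system.
# NOTE: filter_d_state is a hypothetical helper name, not part of procps.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on a live system (STAT must be the second column):
#   ps -eo pid,stat,wchan,args | filter_d_state
```

The WCHAN column in that listing gives a first hint of where in the kernel each process is sleeping, though it is no substitute for a full stack trace.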
This is with all RHEL 3 ES updates applied, running
kernel-smp-2.4.21-20.EL.athlon.rpm. This is a dual Opteron system (we
weren't ready to jump to the 64-bit OS) with an LSI MegaRAID adapter
for the storage (hardware RAID 1+0). I'm running LVM, and the
filesystem that contains the Arkeia database is ext3 mounted with
"noatime". Since Arkeia creates lots of small directories and files,
I made the filesystem with lots of inodes (df reports 11902144 1k
blocks and 3080192 inodes with 6057432 and 1695502 used respectively).
Please let me know if there is any additional data or information I
Chris, this problem is likely fixed by our latest U4 kernel respin,
which is kernel version 2.4.21-23.EL. It will appear in the RHN beta
channel within a few days, but will not be the exact kernel that is
released with U4. (One more respin is expected.)
If you can reproduce the tar hang in a non-production environment,
would you be willing to test the -23.EL beta kernel to confirm
whether it resolves the problem?
This looks like something for which you might be best off going through support,
either by calling 1-800-REDHAT1, or by going to http://www.redhat.com/support
Chris, like Ernie said, a problem similar to this was fixed in the
RHEL3-U4 kernel, so it's worth upgrading and testing the newer kernel.
In any case, please try to get me an AltSysrq-T output when this
happens so I can see the kernel stack trace-back of the process stuck
in D state. This will let me determine exactly what that process is
I will watch for the updated kernel in RHN and give it a try. I don't
have a test system that I've been able to reproduce this on, so I
guess I will try it "live".
Is there a way to trigger the magic-sysrq codes remotely? The system
is in an unmanned NOC (although I can probably get some hands on it,
it would be easier if I could do something over an SSH connection to
Created attachment 106122 [details]
Output from sysrq-T
Okay, I found /proc/sysrq-trigger. Here is the output. There are two
processes stuck in "D": tar (pid 5471) from yesterday, and o3flow (pid 1203)
from today's backup. I did not reboot yesterday to clear the stuck tar
process, so it may be that the stuck o3flow is because of that.
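For anyone else who needs this without console access, a minimal sketch of the approach, assuming a root shell (e.g. over SSH) and a kernel that provides /proc/sysrq-trigger. The function name sysrq_trigger and its optional second argument are made up for illustration; the second argument exists only so the function can be dry-run against a scratch file.

```shell
# Send a magic-SysRq command key by writing it to the trigger file.
# NOTE: sysrq_trigger is a hypothetical helper; the path argument is an
# assumption added for dry runs and should be left at its default on a
# real system.
sysrq_trigger() {
    key=$1
    trigger=${2:-/proc/sysrq-trigger}
    printf '%s\n' "$key" > "$trigger"
}

# Usage on a live system, as root:
#   sysrq_trigger t        # ask the kernel to dump every task's stack
#   dmesg > sysrq-t.txt    # save the kernel log containing the dump
```

The task dump can be long, so if the dmesg output looks cut off, the copy written to the system log by syslog may be more complete.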
Chris, I think the problem that you are running into was fixed in the
latest RHEL3-U4 release candidate kernel. It appears that the tar
process stuck in D state downed an inode semaphore and then called
wait_on_inode(), which raced with prune_icache (called from kswapd)
locking and unlocking the inode without issuing a wakeup. I fixed this
race by waking up any inode waiters after unlocking the inode in
prune_icache. The o3flow process stuck in D state just happened to try
to down the same inode semaphore that tar had already downed, so it is
also stuck behind the same inode. Can you please verify that the
latest kernel actually fixes the problem you are seeing? The latest
kernel is located here:
Thanks for your help, Larry Woodman
I've got the kernel downloaded and will reboot into it in a little
bit. I'll just have to watch and see; this would sometimes happen in
a day or two and sometimes not for a week. I'll let you know.
Thanks, Chris. Based on Larry's analysis, I'm changing the state of this
bug to MODIFIED. Larry's fix was committed to kernel version 2.4.21-23.EL,
which is a precursor to what will be released as the RHEL3 U4 kernel.
The system has been up with no problem for a week now, so I think it
is fixed. Thanks.
I was going to close this, but I wasn't sure which state it should go to.
Thanks for your verification, Chris. There's no further action required
on your part. This bug should remain in MODIFIED state for now. It will
automatically be transitioned to CLOSED/ERRATA by our Errata System when
RHEL3 U4 is officially released on RHN.
An errata has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.