Bug 137874

Summary: Process stuck in "D" when reading disk
Product: Red Hat Enterprise Linux 3
Reporter: Chris Adams <linux>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 3.0
CC: petrides, riel, shillman
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-12-20 20:56:52 UTC
Attachments: Output from sysrq-T

Description Chris Adams 2004-11-02 14:12:05 UTC
One of my RHEL 3 ES systems is regularly getting a process stuck in an
uninterruptible wait, state "D".

This system is a network backup server, running Arkeia.  Arkeia
creates a tree with a directory per system backed up and, below that,
a subdirectory tree that matches the backed-up system (i.e.
if I back up directory / on server ant, I get .../ant/etc,
.../ant/bin, etc.).  At the end of each night's backup, I have a
script that runs to clean and back up the tree.  It does a find across
the whole tree to clean out old stuff, and then runs tar to save it. 
At least once a week lately either the find or the tar process gets
stuck.  For example, right now I have:

# ps wwwl -p 5471
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
0     0  5471  2206  15   0  3964  704 wait_o D    ?          0:54 tar
--exclude=*.lck -jcf /var/knox/sav/o3dbtree.tar.bz2-new o3dbtree

The only way to clear the process is to reboot the system.
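
A stuck process like this can also be spotted from a script rather than by eye. The following is only a minimal sketch, assuming a Linux /proc filesystem whose per-process stat file starts with the pid, the command name in parentheses, and then the state letter. The function names are hypothetical and were not part of the original report.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state "D"."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

A process that shows up in this list briefly is normal (it is waiting on I/O); one that stays there across repeated runs, like the tar above, is the case worth reporting.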

This is with all RHEL 3 ES updates applied, running
kernel-smp-2.4.21-20.EL.athlon.rpm.  This is a dual Opteron system (we
weren't ready to jump to the 64 bit OS) with an LSI MegaRAID adapter
for the storage (hardware RAID 1+0).  I'm running LVM, and the
filesystem that contains the Arkeia database is ext3 mounted with
"noatime".  Since Arkeia creates lots of small directories and files,
I made the filesystem with lots of inodes (df reports 11902144 1k
blocks and 3080192 inodes with 6057432 and 1695502 used respectively).

Please let me know if there is any additional data or information I
can gather.

Comment 1 Ernie Petrides 2004-11-02 21:01:59 UTC
Chris, this problem is likely fixed by our latest U4 kernel respin,
which is kernel version 2.4.21-23.EL.  It will appear in the RHN beta
channel within a few days, but will not be the exact kernel that is
released with U4.  (One more respin is expected.)

If you can reproduce the tar hang in a non-production environment,
would you be willing to test the -23.EL beta kernel to confirm
whether it resolves the problem?


Comment 2 Suzanne Hillman 2004-11-02 21:03:25 UTC
This looks like something that would be best handled through support,
either by calling 1-800-REDHAT1, or by going to http://www.redhat.com/support.

Comment 3 Larry Woodman 2004-11-02 21:14:06 UTC
Chris, like Ernie said, a problem similar to this was fixed in the
RHEL3-U4 kernel so it's worth upgrading and testing the newer kernel. 
In any case, please try to get me an AltSysrq-T output when this
happens so I can see the kernel stack trace-back of the process stuck
in D state.  This will let me determine exactly what that process is
waiting on.

Larry Woodman


Comment 4 Chris Adams 2004-11-03 14:38:03 UTC
I will watch for the updated kernel in RHN and give it a try.  I don't
have a test system that I've been able to reproduce this on, so I
guess I will try it "live".

Is there a way to trigger the magic-sysrq codes remotely?  The system
is in an unmanned NOC (although I can probably get some hands on it,
it would be easier if I could do something over an SSH connection to
trigger it).


Comment 5 Chris Adams 2004-11-03 14:42:08 UTC
Created attachment 106122
Output from sysrq-T

Okay, I found /proc/sysrq-trigger.  Here is the output.  There are two
processes stuck in "D": tar (pid 5471) from yesterday, and o3flow (pid 1203)
from today's backup.  I did not reboot yesterday to clear the stuck tar
process, so it may be that the stuck o3flow is because of that.
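
For anyone else needing to do this over SSH, here is a minimal sketch of triggering the task dump by writing the sysrq command letter to /proc/sysrq-trigger. It assumes the script runs as root; the function name and its parameters are hypothetical. The resulting stack traces go to the kernel log, not to the script's output.

```python
def trigger_sysrq(command="t", trigger_path="/proc/sysrq-trigger"):
    """Write a single sysrq command letter to the trigger file.

    "t" asks the kernel to dump the state and stack of every task.
    trigger_path is a parameter only so the function can be exercised
    against an ordinary file when testing.
    """
    if len(command) != 1:
        raise ValueError("sysrq command must be a single character")
    with open(trigger_path, "w") as f:
        f.write(command)


if __name__ == "__main__":
    trigger_sysrq("t")  # then read the dump from the kernel log (e.g. dmesg)
```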

Comment 6 Larry Woodman 2004-11-03 20:44:12 UTC
Chris, I think the problem that you are running into was fixed in the
latest RHEL3-U4 release candidate kernel.  It appears that the tar process
that is stuck in D state downed an inode semaphore and then did a
wait_on_inode(), which raced with prune_icache (called from kswapd)
locking and unlocking the inode without issuing a wakeup.  I fixed this
race by waking up any inode waiters after unlocking the inode in
prune_icache.  The o3flow process stuck in D state just happened to try to
down the same inode semaphore that tar had already downed, so it is
also stuck behind the same inode.  Can you please verify that the
latest kernel actually fixes the problem you are seeing?  The latest
kernel is located here:

>>>http://people.redhat.com/~lwoodman/RHEL3/


Thanks for your help, Larry Woodman
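
The missed wakeup described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code or the actual patch: a Python threading.Condition stands in for the inode wait queue, a boolean stands in for the inode lock state, and the names Inode, unlock_inode and run are hypothetical. The timeout exists only so the demo can report a hang instead of hanging; the real waiter sleeps with no timeout.

```python
import threading


class Inode:
    def __init__(self):
        self.cond = threading.Condition()   # stands in for the wait queue
        self.locked = True                  # stands in for the inode lock state
        self.sleeping = threading.Event()   # demo only: waiter is about to sleep


def wait_on_inode(inode, timeout=0.5):
    """Sleep until woken; return False if no wakeup ever arrived."""
    with inode.cond:
        while inode.locked:
            inode.sleeping.set()
            if not inode.cond.wait(timeout):
                return False  # timed out: in the kernel this task stays in "D"
        return True


def unlock_inode(inode, wake_waiters):
    """Model of the unlock path, with or without the wakeup."""
    inode.sleeping.wait()  # make sure the waiter is already asleep
    with inode.cond:
        inode.locked = False
        if wake_waiters:
            inode.cond.notify_all()  # the fix: wake anyone waiting on the inode


def run(wake_waiters):
    inode = Inode()
    result = []
    waiter = threading.Thread(
        target=lambda: result.append(wait_on_inode(inode)))
    waiter.start()
    unlock_inode(inode, wake_waiters)
    waiter.join()
    return result[0]


if __name__ == "__main__":
    print("unlock without wakeup, waiter resumed:", run(False))
    print("unlock with wakeup, waiter resumed:", run(True))
```

In this model the waiter resumes only when the unlock path notifies it, which mirrors why waking the inode waiters after the unlock removes the hang.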


Comment 7 Chris Adams 2004-11-03 20:47:10 UTC
I've got the kernel downloaded and will reboot into it in a little
bit.  I'll just have to watch and see; this would sometimes happen in
a day or two and sometimes not for a week.  I'll let you know.


Comment 8 Ernie Petrides 2004-11-03 21:49:27 UTC
Thanks, Chris.  Based on Larry's analysis, I'm changing the state of this
bug to MODIFIED.  Larry's fix was committed to kernel version 2.4.21-23.EL,
which is a precursor to what will be released as the RHEL3 U4 kernel.


Comment 9 Chris Adams 2004-11-11 00:29:30 UTC
The system has been up with no problem for a week now, so I think it
is fixed.  Thanks.

I was going to close this, but I wasn't sure which state it should go to.


Comment 10 Ernie Petrides 2004-11-11 02:10:36 UTC
Thanks for your verification, Chris.  There's no further action required
on your part.  This bug should remain in MODIFIED state for now.  It will
automatically be transitioned to CLOSED/ERRATA by our Errata System when
RHEL3 U4 is officially released on RHN.


Comment 11 John Flanagan 2004-12-20 20:56:52 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-550.html