One of my RHEL 3 ES systems is regularly getting a process stuck in an uninterruptible wait, state "D". This system is a network backup server, running Arkeia. Arkeia creates a tree with a directory per system backed up and a subdirectory tree below that which matches the backed-up system (i.e. if I back up directory / on server ant, I get .../ant/etc, .../ant/bin, etc.). At the end of each night's backup, I have a script that runs to clean and back up the tree. It does a find across the whole tree to clean out old stuff, and then runs tar to save it. At least once a week lately either the find or the tar process gets stuck. For example, right now I have:

# ps wwwl -p 5471
F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY  TIME COMMAND
0     0  5471  2206  15   0  3964  704 wait_o D    ?    0:54 tar --exclude=*.lck -jcf /var/knox/sav/o3dbtree.tar.bz2-new o3dbtree

The only way to clear the process is to reboot the system. This is with all RHEL 3 ES updates applied, running kernel-smp-2.4.21-20.EL.athlon.rpm. This is a dual Opteron system (we weren't ready to jump to the 64-bit OS) with an LSI MegaRAID adapter for the storage (hardware RAID 1+0). I'm running LVM, and the filesystem that contains the Arkeia database is ext3 mounted with "noatime". Since Arkeia creates lots of small directories and files, I made the filesystem with lots of inodes (df reports 11902144 1k blocks and 3080192 inodes, with 6057432 and 1695502 used respectively). Please let me know if there is any additional data or information I can gather.
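For anyone wanting to spot processes like the stuck tar without eyeballing ps, here is a minimal sketch that scans /proc for tasks in state "D". It assumes a reasonably modern Python 3 (not the interpreter shipped with RHEL 3), and the function names `parse_stat` and `find_d_state` are made up for illustration. A process is only suspicious if it stays in "D" across repeated checks, since short disk waits are normal.

```python
import os

def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state

def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)

if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Running this from cron a few minutes apart and comparing the results would show which PIDs are persistently stuck rather than briefly waiting on I/O.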
Chris, this problem is likely fixed by our latest U4 kernel respin, which is kernel version 2.4.21-23.EL. It will appear in the RHN beta channel within a few days, but will not be the exact kernel that is released with U4. (One more respin is expected.) If you can reproduce the tar hang in a non-production environment, would you be willing to test the -23.EL beta kernel to confirm whether it resolves the problem?
This looks like something for which you might be best off to go through support, either by calling 1-800-REDHAT1, or by going to http://www.redhat.com/support
Chris, like Ernie said, a problem similar to this was fixed in the RHEL3-U4 kernel, so it's worth upgrading and testing the newer kernel. In any case, please try to get me an AltSysrq-T output when this happens so I can see the kernel stack traceback of the process stuck in D state. This will let me determine exactly what that process is waiting on. Larry Woodman
I will watch for the updated kernel in RHN and give it a try. I don't have a test system that I've been able to reproduce this on, so I guess I will try it "live". Is there a way to trigger the magic-sysrq codes remotely? The system is in an unmanned NOC (although I can probably get some hands on it, it would be easier if I could do something over an SSH connection to trigger it).
Created attachment 106122: Output from sysrq-T

Okay, I found /proc/sysrq-trigger. Here is the output. There are two processes stuck in "D": tar (pid 5471) from yesterday, and o3flow (pid 1203) from today's backup. I did not reboot yesterday to clear the stuck tar process, so it may be that the stuck o3flow is because of that.
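For reference, triggering the task dump remotely amounts to writing the single character "t" to /proc/sysrq-trigger as root. The sketch below wraps that in a small helper; `trigger_sysrq` is a hypothetical name, and the `path` parameter exists only so the function can be exercised against an ordinary file.

```python
def trigger_sysrq(key, path="/proc/sysrq-trigger"):
    """Write one SysRq command character (e.g. 't' for a task dump) to path."""
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    # Writing to the real trigger file requires root privileges.
    with open(path, "w") as fh:
        fh.write(key)

if __name__ == "__main__":
    trigger_sysrq("t")  # the stack traces land in the kernel log
```

After the call, the traces can be read from the kernel log, for example with dmesg.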
Chris, I think the problem that you are running into was fixed in the latest RHEL3-U4 release candidate kernel. It appears that the tar process that is stuck in D state downed an inode semaphore and then did a wait_on_inode(), which raced with prune_icache (called from kswapd) locking and unlocking the inode without issuing a wakeup. I fixed this race by waking up any inode waiters after unlocking the inode in prune_icache. The o3flow process stuck in D state just happened to try to down the same inode semaphore that tar had already downed, so it is also stuck behind the same inode. Can you please verify that the latest kernel actually fixes the problem you are seeing? The latest kernel is located here: http://people.redhat.com/~lwoodman/RHEL3/ Thanks for your help, Larry Woodman
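The race described above is a lost wakeup: the waiter goes to sleep while the inode is locked, and the unlocker clears the lock without waking anyone. The following is a toy user-space model of that pattern using Python threads, not the kernel code; `ToyInode` and `waiter_wakes` are invented names, and the timeout only stands in for "sleeps forever in D state".

```python
import threading

class ToyInode:
    """Toy stand-in for an inode: a lock flag plus a wait queue."""
    def __init__(self):
        self.cond = threading.Condition()
        self.locked = True  # starts out locked by the "pruner"

def waiter_wakes(wake_after_unlock, timeout=0.5):
    """Return True if the waiter was woken, False if it slept until timeout."""
    inode = ToyInode()
    sleeping = threading.Event()
    result = []

    def waiter():
        with inode.cond:
            sleeping.set()  # signalled while still holding the condition lock
            # Sleep until explicitly woken; no one re-checks the flag for us.
            result.append(inode.cond.wait(timeout))

    thread = threading.Thread(target=waiter)
    thread.start()
    sleeping.wait()
    # The lock below is only obtainable once the waiter is asleep in wait().
    with inode.cond:
        inode.locked = False  # unlock the inode
        if wake_after_unlock:
            inode.cond.notify_all()  # the fix: wake any waiters after unlocking
    thread.join()
    return result[0]

if __name__ == "__main__":
    print("without wakeup:", waiter_wakes(False))
    print("with wakeup:", waiter_wakes(True))
```

In the model, unlocking without the notify leaves the waiter asleep until its timeout expires, which mirrors why the real process could only be cleared by a reboot.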
I've got the kernel downloaded and will reboot into it in a little bit. I'll just have to watch and see; this would sometimes happen in a day or two and sometimes not for a week. I'll let you know.
Thanks, Chris. Based on Larry's analysis, I'm changing the state of this bug to MODIFIED. Larry's fix was committed to kernel version 2.4.21-23.EL, which is a precursor to what will be released as the RHEL3 U4 kernel.
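To check whether a given machine is already on a kernel at or past the -23.EL build mentioned above, a rough sketch follows. It assumes release strings of the form `2.4.21-NN.EL...` as seen in this report, the helper names are hypothetical, and it deliberately refuses other kernel series because a higher version number elsewhere says nothing about this particular patch.

```python
import re

FIRST_FIXED_BUILD = 23  # 2.4.21-23.EL, per the comment above

def kernel_key(release):
    """Parse '2.4.21-20.ELsmp' into (2, 4, 21, 20)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(group) for group in match.groups())

def has_inode_wakeup_fix(release):
    """True if a RHEL 3 2.4.21 kernel is at build 23 or later."""
    key = kernel_key(release)
    if key[:3] != (2, 4, 21):
        raise ValueError("only meaningful for RHEL 3 2.4.21 kernels")
    return key[3] >= FIRST_FIXED_BUILD

if __name__ == "__main__":
    import platform
    print(has_inode_wakeup_fix(platform.release()))
```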
The system has been up with no problem for a week now, so I think it is fixed. Thanks. I was going to close this, but I wasn't sure which state it should go to.
Thanks for your verification, Chris. There's no further action required on your part. This bug should remain in MODIFIED state for now. It will automatically be transitioned to CLOSED/ERRATA by our Errata System when RHEL3 U4 is officially released on RHN.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html