Bug 137874
Summary: | Process stuck in "D" when reading disk
---|---
Product: | Red Hat Enterprise Linux 3
Component: | kernel
Version: | 3.0
Hardware: | i686
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | medium
Reporter: | Chris Adams <linux>
Assignee: | Larry Woodman <lwoodman>
QA Contact: | Brian Brock <bbrock>
CC: | petrides, riel, shillman
Doc Type: | Bug Fix
Last Closed: | 2004-12-20 20:56:52 UTC

Description
Chris Adams
2004-11-02 14:12:05 UTC
Chris, this problem is likely fixed by our latest U4 kernel respin, which is kernel version 2.4.21-23.EL. It will appear in the RHN beta channel within a few days, but will not be the exact kernel that is released with U4. (One more respin is expected.) If you can reproduce the tar hang in a non-production environment, would you be willing to test the -23.EL beta kernel to confirm whether it resolves the problem?

This looks like something for which you might be best off going through support, either by calling 1-800-REDHAT1, or by going to http://www.redhat.com/support

Chris, like Ernie said, a problem similar to this was fixed in the RHEL3-U4 kernel, so it's worth upgrading and testing the newer kernel. In any case, please try to get me an AltSysrq-T output when this happens so I can see the kernel stack trace-back of the process stuck in D state. This will let me determine exactly what that process is waiting on.
Larry Woodman

I will watch for the updated kernel in RHN and give it a try. I don't have a test system that I've been able to reproduce this on, so I guess I will try it "live". Is there a way to trigger the magic-sysrq codes remotely? The system is in an unmanned NOC (although I can probably get some hands on it, it would be easier if I could do something over an SSH connection to trigger it).

Created attachment 106122
Output from sysrq-T
Okay, I found /proc/sysrq-trigger. Here is the output. There are two
processes stuck in "D": tar (pid 5471) from yesterday, and o3flow (pid 1203)
from today's backup. I did not reboot yesterday to clear the stuck tar
process, so it may be that the stuck o3flow is because of that.
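
For anyone who needs to do the same thing over SSH, here is a rough sketch of what writing to /proc/sysrq-trigger looks like, plus a way to list processes currently in "D" state. It is written in present-day Python purely for illustration and is not what was run on this system. The function names (`trigger_sysrq`, `parse_stat`, `find_d_state`) are made up for this example, and the `path` and `proc_root` parameters exist only so the code can be tried against ordinary files. Writing to the real trigger file requires root, and the task dump ends up in the kernel log rather than on the terminal.

```python
import os


def trigger_sysrq(command="t", path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the trigger file (needs root)."""
    if len(command) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(path, "w") as trigger:
        trigger.write(command)


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start])
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep ("D")."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, comm, state = parse_stat(stat_file.read())
        except (OSError, ValueError):
            continue  # process exited, or the entry was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling `find_d_state()` first and then `trigger_sysrq("t")` as root would identify the stuck PIDs and dump all task stacks to the kernel log, where the traces for those PIDs can be looked up.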
Chris, I think the problem that you are running into was fixed in the
latest RHEL3-U4 release candidate kernel. It appears that the tar process
that is stuck in D state downed an inode semaphore and then did a
wait_on_inode(), which raced with prune_icache (called from kswapd)
locking and unlocking the inode without a wakeup. I fixed this
race by waking up any inode waiters after unlocking the inode in
prune_icache. The o3flow process stuck in D state just happened to try to
down the same inode semaphore that tar had already downed, so it is
also stuck behind the same inode. Can you please verify that the
latest kernel actually fixes the problem you are seeing? The latest
kernel is located here:
>>>http://people.redhat.com/~lwoodman/RHEL3/
Thanks for your help, Larry Woodman
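
The missed wakeup described above can be illustrated with a small userspace analogy. This is not the kernel code and not the actual patch: the `Inode` class, its methods, and `simulate` are hypothetical names, a `threading.Condition` stands in for the kernel's wait mechanism, and a timeout stands in for "sleeps forever". The sketch only shows why clearing a lock flag without waking sleepers leaves a waiter stuck, and why adding the wakeup after the unlock releases it.

```python
import threading


class Inode:
    """Toy stand-in for an inode with a lock flag and a set of sleepers."""

    def __init__(self):
        self.locked = False
        self.cond = threading.Condition()

    def lock(self):
        with self.cond:
            self.locked = True

    def unlock(self, wake_waiters):
        with self.cond:
            self.locked = False
            if wake_waiters:
                # The fix in spirit: wake anyone sleeping on this inode.
                self.cond.notify_all()

    def wait_on(self, timeout, about_to_sleep=None):
        """Sleep while the inode is locked.

        Returns True if the wait ended normally, False if the sleeper was
        never woken (the timeout models a process stuck in "D" forever).
        """
        with self.cond:
            while self.locked:
                if about_to_sleep is not None:
                    about_to_sleep.set()
                if not self.cond.wait(timeout):
                    return False
            return True


def simulate(wake_waiters, timeout=0.5):
    """Lock, let a waiter go to sleep, unlock, and report if it was woken."""
    inode = Inode()
    inode.lock()
    asleep = threading.Event()
    outcome = {}

    def waiter():
        outcome["woken"] = inode.wait_on(timeout, asleep)

    thread = threading.Thread(target=waiter)
    thread.start()
    asleep.wait()  # the waiter has seen the lock and is about to sleep
    inode.unlock(wake_waiters)  # blocks until the waiter is really asleep
    thread.join()
    return outcome["woken"]
```

With `simulate(wake_waiters=False)` the waiter sleeps through the unlock and only gives up at the timeout; with `simulate(wake_waiters=True)` it is released as soon as the inode is unlocked.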
I've got the kernel downloaded and will reboot into it in a little bit. I'll just have to watch and see; this would sometimes happen in a day or two and sometimes not for a week. I'll let you know.

Thanks, Chris. Based on Larry's analysis, I'm changing the state of this bug to MODIFIED. Larry's fix was committed to kernel version 2.4.21-23.EL, which is a precursor to what will be released as the RHEL3 U4 kernel.

The system has been up with no problem for a week now, so I think it is fixed. Thanks. I was going to close this, but I wasn't sure which state it should go to.

Thanks for your verification, Chris. There's no further action required on your part. This bug should remain in MODIFIED state for now. It will automatically be transitioned to CLOSED/ERRATA by our Errata System when RHEL3 U4 is officially released on RHN.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2004-550.html