Bug 78312 - Kernel hang (possibly from disk or network load)
Status: CLOSED ERRATA
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686 Linux
Priority: high
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2002-11-20 22:33 EST by Need Real Name
Modified: 2005-10-31 17:00 EST (History)
Doc Type: Bug Fix
Last Closed: 2003-02-28 22:05:40 EST

Attachments
Syslog output containing sysreq-T/P from 1st hang (duplicated from bug 77508) (168.92 KB, text/plain)
2002-11-20 22:35 EST, Need Real Name
Serial console log of sysreq-T/P from 2nd hang (43.26 KB, text/plain)
2002-11-20 22:37 EST, Need Real Name
Diff between sysreq-T/P attempts for 2nd hang (2.69 KB, text/plain)
2002-11-20 22:38 EST, Need Real Name

Description Need Real Name 2002-11-20 22:33:33 EST
Description of Problem:

System hangs unexpectedly, apparently after significant disk activity.  Network
access fails, alt-sysreq works (but alt-sysreq-b completely jams system).

This has happened twice with two separate kernels, separated by about a week.


Version-Release number of selected component (if applicable):

kernel-bigmem-2.4.18-17.7.x.i686    (1st)
kernel-bigmem-2.4.18-18.7.x.i686    (2nd)


How Reproducible:

Unclear.  Putting the machine under load seems to trigger it.

First hang happened while lightly loading Oracle (including DB changes) and
filesystem.

Second happened while importing large files (Oracle backup files) over network
using ssh.


Additional Information:

4 CPU P4 Xeon system (Dell PowerEdge 6650)
Dell PERC 3/QC
All disks are SCSI, on the PERC RAID.

See bug 77508.  This hang occurred while trying to diagnose another problem.  I
now think this one is unrelated.  This is all the same machine.

This machine did do similar tasks using an earlier kernel (2.4.18-3 or
2.4.18-10---I forget which) without the problems described here.

I have two sets of sysreq-T/Ps to attach from the second hang.  They are
separated by about 20 minutes and a sysreq-S/U.  Actually, they are so similar I
will attach the first and a diff.
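
For anyone repeating the comparison of two sysreq-T/P captures, here is a minimal sketch of how such a diff could be produced.  It is not how the attached diff was made.  It assumes the captures may carry a syslog prefix of the form "Nov 20 21:05:13 host kernel: " (an assumed format), which has to be stripped so that timestamps alone do not show up as changes.  The names SYSLOG_PREFIX, normalize and diff_dumps are hypothetical.

```python
import difflib
import re

# Assumed syslog prefix, e.g. "Nov 20 21:05:13 host kernel: ".
# Serial console captures usually have no such prefix and pass through unchanged.
SYSLOG_PREFIX = re.compile(
    r"^[A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s?"
)


def normalize(lines):
    """Strip the syslog prefix and trailing whitespace from each line."""
    return [SYSLOG_PREFIX.sub("", line).rstrip() for line in lines]


def diff_dumps(first, second):
    """Return a unified diff (list of lines) between two task dumps."""
    return list(
        difflib.unified_diff(
            normalize(first),
            normalize(second),
            fromfile="dump1",
            tofile="dump2",
            lineterm="",
        )
    )
```

An empty result means the two dumps differ only in their timestamps, which is what one would expect if every task is stuck in the same place.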
Comment 1 Need Real Name 2002-11-20 22:35:50 EST
Created attachment 85824 [details]
Syslog output containing sysreq-T/P from 1st hang (duplicated from bug 77508)
Comment 2 Need Real Name 2002-11-20 22:37:08 EST
Created attachment 85825 [details]
Serial console log of sysreq-T/P from 2nd hang
Comment 3 Need Real Name 2002-11-20 22:38:19 EST
Created attachment 85826 [details]
Diff between sysreq-T/P attempts for 2nd hang
Comment 4 Stephen Tweedie 2002-11-21 07:14:14 EST
In the second case, there is nothing suspicious at all in the logs.  There are
just a few processes waiting for disk IO.  There are absolutely no deadlocked
processes.

So, all we know is that a disk IO went missing --- we scheduled it to disk, but
it never completed, and one by one a bunch of kernel processes started waiting
uninterruptibly for that lost IO.

In conjunction with the first hang's trace, which indicates being locked in the
MegaServ process, this looks even more like a driver-level problem --- either a
driver bug, a firmware problem or a hardware problem causing lost IOs.  Are
there any other IO messages in the logs?
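
On a machine that still responds, the uninterruptible waits described above can also be spotted from userspace without sysreq.  The following is a rough sketch, assuming the usual /proc/<pid>/stat layout of "pid (comm) state ..."; the helper names parse_stat and uninterruptible_tasks are made up for illustration.  It only lists which tasks are in state D and says nothing about which IO they are waiting on.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was malformed
        if state == "D":
            found.append((pid, comm))
    return found
```

If the same PIDs stay in state D across repeated runs, that is consistent with an IO that was issued and never completed.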
Comment 5 Need Real Name 2002-11-21 16:23:49 EST
There are absolutely no other messages that seem connected to the crash---the
only things close in time are a few securelog messages recording my use of ssh
and su.  I see no evidence that a single sector was written to the disk once it
hung (for the 2nd hang), and the last thing logged to the console before the
sysreqs are boot messages.

I've upped sys.printk to 15.  Is there anything else I can do to get more
debugging info out of this system?
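
A small sketch for checking what the printk setting actually permits, on the assumption that "sys.printk" above means the kernel.printk sysctl (/proc/sys/kernel/printk).  That file holds four integers: console_loglevel, default_message_loglevel, minimum_console_loglevel and default_console_loglevel.  The names PrintkLevels, parse_printk and console_shows are hypothetical.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text):
    """Parse the four whitespace-separated integers of kernel.printk."""
    fields = [int(x) for x in text.split()]
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return PrintkLevels(*fields)


def console_shows(levels, message_level):
    """True if a message at message_level (0 = most severe) reaches the console."""
    # The kernel prints a message to the console only when its level is
    # numerically lower than console_loglevel.
    return message_level < levels.console_loglevel
```

With console_loglevel at 15, every message level from 0 (emergency) through 7 (debug) passes this check, so raising it further would not surface anything more on the console.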
Comment 6 Need Real Name 2003-02-28 22:05:40 EST
Upgrading to 2.4.18-19.7.xbigmem fixed this.
We've gotten over 45 days of uptime and no freezes since.
