Bug 101650 - after heavy load - kernel stuck - networking still active but userland dead
Status: CLOSED NOTABUG
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 8.0
Hardware: i686 Linux
Priority: high Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2003-08-05 03:16 EDT by yuval yeret
Modified: 2005-10-31 17:00 EST (History)
Doc Type: Bug Fix
Last Closed: 2003-09-29 16:52:22 EDT

Attachments
full magic keys output after the hangup occurs (106.20 KB, text/plain)
2003-08-06 02:54 EDT, yuval yeret
sorted ksyms for the relevant kernel (41.76 KB, text/plain)
2003-08-06 02:56 EDT, yuval yeret
magic keys output without IANS loaded (134.50 KB, text/plain)
2003-08-06 05:11 EDT, yuval yeret
ksyms without iANS loaded (41.13 KB, text/plain)
2003-08-06 05:13 EDT, yuval yeret
magic keys without IANS with symbols looked up (218.81 KB, text/plain)
2003-08-06 05:37 EDT, yuval yeret

Description yuval yeret 2003-08-05 03:16:58 EDT
Description of problem:
We are running 2.4.18-24 on SMP machines with 2 CPUs and hyperthreading
(SuperMicro Xeon servers), doing heavy I/O to disk and network (QLogic/Emulex
HBAs and Intel e1000 NICs are used).

At some point the machine oopses or hangs.

Version-Release number of selected component (if applicable):
2.4.18-24 SMP kernel

How reproducible:
not easily reproducible

Steps to Reproduce:
Run a heavy I/O load on the system (we have a proprietary application, but it
mainly writes a lot of files into ext3 partitions and does a lot of networking
to clients and other machines in the cluster).

A variant is to somehow reduce the I/O capabilities of the system for a little
while when it is under heavy load, so that I/O requests pile up. Once they
can be serviced again, the problem is more likely to occur.
    
Actual results:
The system is stuck in a zombie-like state: pings are answered and the kernel
seems to be functioning in a limited way, but some userland processes do not
work (e.g. ssh).

The oops doesn't appear in logs or on the console, but I've been able to use 
the diagnostic keys to get the following information: 


right ALT+Scroll lock
CPU 0 : swapper
[<c0106f32>] (0xc0323fc4) default_idle
[<c0105000>] (0xc0323fd4) empty_zero_page
CPU 1 : swapper
[<c0106f32>] (0xc6597fb0) default_idle
[<c011d29b>] (0xc6597fd0) out_of_line_bug
[<c011d449>] (0xc659ffc) printk
CPU 3 : swapper
[<c0106f32>] (0xc8257fb0) default_idle
[<c011d449>] (0xc8257fd0) printk


Expected results:
system continues to work without oops...

Additional info:
Comment 1 Arjan van de Ven 2003-08-05 04:33:59 EDT
You're using a way old kernel, and binary only kernel modules... not a lot to do
here.
Can you reproduce this with the current erratum kernel and without bin only
kernel modules?
Comment 2 yuval yeret 2003-08-06 02:52:46 EDT
We are currently not utilizing any binary-only kernel modules :

this is our lsmod:
ians                  110128   2       ====> source taken from Intel
qla2300               231120   4       ====> source taken from Qlogic
sg                     31184   0  (unused)
e1000_5.0.43           69152   4       ====> source taken from Intel
exastore_mod             912   0  (unused) =====> internal patch to allow panic 
from userspace via the procfs
e100                   52144   2  (autoclean)
md                     62944   0  (unused)


We will try to reproduce with the newest errata kernel from redhat, but can you 
point to fixes made between 24 and the latest that could have addressed this ? 

I'm also attaching a more detailed trace we got together with our ksyms dump



all drivers are built from source:
Comment 3 yuval yeret 2003-08-06 02:54:51 EDT
Created attachment 93422 [details]
full magic keys output after the hangup occurs
Comment 4 yuval yeret 2003-08-06 02:56:14 EDT
Created attachment 93423 [details]
sorted ksyms for the relevant kernel
Comment 5 Arjan van de Ven 2003-08-06 03:48:54 EDT
ians is very much a binary only module (with a .c glue layer)

*** This bug has been marked as a duplicate of 78616 ***
Comment 6 yuval yeret 2003-08-06 05:04:49 EDT
Well, this happened even without iANS loaded.

(P.S. I don't understand why iANS is considered a binary module. Looking at the
sources, I can't find a reference to any firmware or binary there.)



Comment 7 yuval yeret 2003-08-06 05:11:07 EDT
Created attachment 93426 [details]
magic keys output without IANS loaded
Comment 8 yuval yeret 2003-08-06 05:13:41 EDT
Created attachment 93427 [details]
ksyms without iANS loaded
Comment 9 yuval yeret 2003-08-06 05:37:41 EDT
Created attachment 93428 [details]
magic keys without IANS with symbols looked up
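
For readers without access to the attachments, the symbol lookup can be sketched roughly as follows. This is a minimal illustration, not the script actually used for the attachment: it assumes the ksyms dump has lines of the form "hex-address name [module]", and the start addresses in SAMPLE_KSYMS are hypothetical values chosen for the example, not taken from the real dump. The function names load_ksyms and resolve are likewise made up here.

```python
import bisect

# Hypothetical ksyms excerpt: the start addresses are illustrative only.
SAMPLE_KSYMS = [
    "c0106f00 default_idle",
    "c011d280 out_of_line_bug",
    "c011d400 printk",
    "f8a3c060 some_symbol [some_module]",  # optional module column is ignored
]


def load_ksyms(lines):
    """Parse 'address name [module]' lines into a list sorted by address."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue  # first column is not a hex address
        symbols.append((addr, parts[1]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the nearest symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return name, addr - base


if __name__ == "__main__":
    syms = load_ksyms(SAMPLE_KSYMS)
    for addr in (0xC0106F32, 0xC011D29B, 0xC011D449):
        print(hex(addr), resolve(syms, addr))
```

Note that a ksyms dump lists only exported symbols, so the nearest symbol below an address is not necessarily the function that contains it. Names resolved this way should be treated as hints.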
Comment 10 Mike A. Harris 2003-09-29 16:52:22 EDT
>We are currently not utilizing any binary-only kernel modules :

ians                  110128   2       ====> source taken from Intel
                                             ^^^^^^^^^^^^^^^^^^^^^^^
Unsupported.  Red Hat does not support 3rd party kernel modules in any way,
whether they are proprietary/binary only, open source, GPL, or otherwise.  We
also do not support user compiled kernels or kernel modules.  We only support
the binary kernel we ship, with the binary modules we ship.


qla2300               231120   4       ====> source taken from Qlogic
                                             ^^^^^^^^^^^^^^^^^^^^^^^^
Unsupported.  Red Hat does not support 3rd party kernel modules in any way,
whether they are proprietary/binary only, open source, GPL, or otherwise.  We
also do not support user compiled kernels or kernel modules.  We only support
the binary kernel we ship, with the binary modules we ship.

e1000_5.0.43           69152   4       ====> source taken from Intel
                                             ^^^^^^^^^^^^^^^^^^^^^^^

Unsupported.  Red Hat does not support 3rd party kernel modules in any way,
whether they are proprietary/binary only, open source, GPL, or otherwise.  We
also do not support user compiled kernels or kernel modules.  We only support
the binary kernel we ship, with the binary modules we ship.

exastore_mod             912   0  (unused) =====> internal patch to allow panic 
from userspace via the procfs

Unsupported.  Red Hat does not support 3rd party kernel modules in any way,
whether they are proprietary/binary only, open source, GPL, or otherwise.  We
also do not support user compiled kernels or kernel modules.  We only support
the binary kernel we ship, with the binary modules we ship.  This includes
patched sources as well as unmodified sources.

>We will try to reproduce with the newest errata kernel from redhat, but can you 
>point to fixes made between 24 and the latest that could have addressed this ? 

You'll have to reproduce it with our supplied binary kernel, not recompiled,
and not using any 3rd party modules or recompiled sources.

>all drivers are built from source:

Which is never supported.
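
Before filing a report, a module list can be checked against this policy with a small script along the following lines. This is only a sketch: the SHIPPED set is an assumption made for illustration (it simply contains the modules from the list above that were not flagged in this thread), and the real set would have to come from the contents of the vendor kernel package. The function names are made up for the example.

```python
# lsmod output from this report, with the annotations removed.
LSMOD = """\
Module                  Size  Used by
ians                  110128   2
qla2300               231120   4
sg                     31184   0  (unused)
e1000_5.0.43           69152   4
exastore_mod             912   0  (unused)
e100                   52144   2  (autoclean)
md                     62944   0  (unused)
"""

# ASSUMPTION: hypothetical allowlist of modules shipped with the vendor kernel.
SHIPPED = {"sg", "e100", "md"}


def parse_lsmod(text):
    """Return module names from lsmod output, skipping the header and wrapped lines."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        # A module line is: name, numeric size, numeric use count, optional extras.
        if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        modules.append(fields[0])
    return modules


def unsupported(modules, shipped):
    """Return loaded modules that are not in the shipped set, in load-list order."""
    return [name for name in modules if name not in shipped]


if __name__ == "__main__":
    for name in unsupported(parse_lsmod(LSMOD), SHIPPED):
        print("not shipped with the vendor kernel:", name)
```

Matching by name alone is a weak check, since a recompiled copy of a shipped module keeps the same name. It can only flag obvious third-party modules.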
