82416 – OOPS - Frequently system lockup/crash under some load

Summary: OOPS - Frequently system lockup/crash under some load

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Raw Hide
Classification:	Retired
Component:	kernel
Sub Component:
Version:	1.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Nalin Dahyabhai
QA Contact:	Jay Turner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-01-21 22:59 UTC by Daniel Khan
Modified:	2015-01-08 00:03 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-10-30 04:01:29 UTC
Embargoed:

Description Daniel Khan 2003-01-21 22:59:36 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
Hello,

For months I have been trying to debug a problem on my Red Hat 7.3 machine, which
uses nss_ldap.
The following post and some other bug reports finally brought me to the
conclusion that nss_ldap could be the cause:
http://www.netsys.com/nssldap/2002/09/msg00014.html

I have exactly the same problem that user describes.
After some uptime and under some load the system locks up completely.
All ports and interfaces remain open and pings still work, but no kind of login
or connection is possible.
I don't run nscd, which seems to have caused some bugs lately.

Nearly the same issue has been reported several times in Red Hat's and PADL's
Bugzilla, but those reports are all marked as resolved and the problem shouldn't
occur in nss_ldap-189-4.

I have already replaced my hardware. I have two similarly set up machines running
in a cluster. The error occurs on both.

Possibly relevant:
I had no crashes for some weeks after I removed the LDAP maps from my postfix
setup. Now these lookups are active again, and since then the problem seems to
have returned. Before that, the only application doing LDAP lookups was pam.

Please help

Thanks in advance

Daniel Khan


Version-Release number of selected component (if applicable):


How reproducible:
Couldn't Reproduce


Additional info:

Comment 1 Daniel Khan 2003-01-24 11:48:17 UTC

I was able to reproduce the crash by running rsyncs over the 1000 Mbit NIC.
I now have an oops and have also posted it to the kernel list.
I tried the Rawhide kernel 2.4.20 and the problem is the same.

Scenario:
2.4.20-2.25smp from RawHide

Running an rsync from the crashing host _to_ another host over a 1000 Mbit 3com
NIC (tg3 driver).
The rsynced files include large files of about 1.5 GB.
Heartbeat is running.

The oops follows.

<------------------------CUT---------------------------->
NMI Watchdog detected LOCKUP on CPU0, eip c02499ac, registers:
via686a eeprom lm80 i2c-proc i2c-isa i2c-viapro i2c-core tg3 eepro100 mii
ipt_LOG ipt_limit ipt_state ipt_REJECT iptable_nat ip_cona
CPU:    0
EIP:    0060:[<c02499ac>]    Not tainted
EFLAGS: 00000086

EIP is at .text.lock.tcp_ipv4 [kernel] 0x182 (2.4.20-2.25smp)
eax: 00000001   ebx: d400010a   ecx: 00000000   edx: f78837d8
esi: f6f22ae0   edi: c3d3ad40   ebp: f74939f4   esp: f1335d8c
ds: 0068   es: 0068   ss: 0068
Process rsync (pid: 3151, stackpage=f1335000)
Stack: c3d3ad40 f3121f38 00000001 f1335e28 00000000 03ff0202 00000004 000003ff
       00000000 00000006 c3d3ad40 f74939e0 c022d67e c3d3ad40 f1335e28 c3d5a000
       00000000 00000006 00000000 00000001 00000000 c022d530 c021ce67 c3d3ad40
Call Trace:   [<c022d67e>] ip_local_deliver_finish [kernel] 0x14e (0xf1335dbc))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335de0))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335de4))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e00))
[<c022d2b3>] ip_local_deliver [kernel] 0x53 (0xf1335e1c))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e34))
[<c022d8b9>] ip_rcv_finish [kernel] 0x219 (0xf1335e38))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e5c))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e6c))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335e70))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e8c))
[<c022d480>] ip_rcv [kernel] 0x1a0 (0xf1335ea8))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335ec0))
[<c021566e>] netif_receive_skb [kernel] 0x14e (0xf1335ed8))
[<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
[<f89d2e71>] tg3_poll [tg3] 0x81 (0xf1335f38))
[<c0215917>] net_rx_action [kernel] 0xa7 (0xf1335f58))
[<c01289f9>] do_softirq [kernel] 0xd9 (0xf1335f80))
[<c010b81b>] do_IRQ [kernel] 0xfb (0xf1335f9c))
[<c010e7c8>] call_do_IRQ [kernel] 0x5 (0xf1335fc0))


Code: 7e f8 e9 68 e5 ff ff e8 2c ed eb ff e9 c3 ee ff ff e8 22 ed
console shuts up ...
 NMMI Watchdog detected LOCKUP on CPU1, eip f89d9f3b, registers:
<------------------------CUT---------------------------->
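For anyone reading the trace above, here is a minimal sketch in Python of pulling the symbol, module and offset out of call-trace entries in this 2.4-style format. The names `ENTRY`, `parse_call_trace` and `likely_frames` are made up for this illustration and are not part of any kernel tool. The filter on offset 0x0 rests on an assumption: such entries are probably function pointers left on the stack (for example a callback handed to nf_hook_slow) rather than real return addresses.

```python
import re

# Matches 2.4-style oops call-trace entries such as:
#   [<c022d67e>] ip_local_deliver_finish [kernel] 0x14e (0xf1335dbc))
ENTRY = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[([^\]]+)\]\s+0x([0-9a-f]+)"
)


def parse_call_trace(text):
    """Return (address, symbol, module, offset) tuples in printed order."""
    return [
        (int(addr, 16), symbol, module, int(offset, 16))
        for addr, symbol, module, offset in ENTRY.findall(text)
    ]


def likely_frames(entries):
    """Keep entries with a non-zero offset.

    Assumption: an entry at offset 0x0 points at the start of a function,
    so it is more likely a stored function pointer than a return address.
    """
    return [entry for entry in entries if entry[3] != 0]


# A few entries copied from the trace in this comment.
SAMPLE = """\
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335de0))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335de4))
[<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
[<f89d2e71>] tg3_poll [tg3] 0x81 (0xf1335f38))
"""

if __name__ == "__main__":
    for addr, symbol, module, offset in likely_frames(parse_call_trace(SAMPLE)):
        print(f"{addr:08x} {module}:{symbol}+{offset:#x}")
```

Read bottom to top, the filtered entries show the receive path entering through the tg3 driver (tg3_poll, tg3_rx) before reaching the IP and TCP code where the lockup was detected.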

Comment 2 Daniel Khan 2003-01-27 06:14:10 UTC

I have now got rid of the failures by replacing the tg3 driver with the latest
bcm5700 driver from 3com.
It seems that there is a bug in tg3 with the BCM5701 Gigabit Ethernet card.

regards
Daniel Khan
