Bug 82416

Summary:	OOPS - Frequently system lockup/crash under some load
Product:	[Retired] Red Hat Raw Hide
Component:	kernel
Version:	1.0
Hardware:	i686
OS:	Linux
Status:	CLOSED WONTFIX
Severity:	high
Priority:	medium
Reporter:	Daniel Khan <dk>
Assignee:	Nalin Dahyabhai <nalin>
QA Contact:	Jay Turner <jturner>
CC:	davej, dk, srevivo
Doc Type:	Bug Fix
Last Closed:	2004-10-30 04:01:29 UTC

Description Daniel Khan 2003-01-21 22:59:36 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
Hello,

For months I have been trying to debug a problem on my Red Hat 7.3 machine, 
which uses nss_ldap.
The following post and some other bug reports finally brought me to the 
conclusion that nss_ldap could be the cause:
http://www.netsys.com/nssldap/2002/09/msg00014.html

I have exactly the same problem the user describes.
After some uptime and under some load, the system locks up completely.
All ports and interfaces remain open and pings still work, but no login or 
connection of any kind is possible.
I don't run nscd, which seems to have caused some bugs lately.

Nearly the same issue has been reported several times in Red Hat's and PADL's 
Bugzilla, but those reports are all marked as resolved and the problem should 
not occur in nss_ldap-189-4.

I have already replaced my hardware. I have two similarly configured machines 
running in a cluster, and the error occurs on both.

Possibly relevant:
I had no crashes for some weeks after I removed the LDAP maps from my Postfix 
setup. Now these lookups are active again, and since then the problem seems to 
have returned. Before, the only application that did LDAP lookups was PAM.

Please help

Thanks in advance

Daniel Khan


Version-Release number of selected component (if applicable):


How reproducible:
Couldn't Reproduce


Additional info:

Comment 1 Daniel Khan 2003-01-24 11:48:17 UTC

I was able to reproduce the crash by running rsync over the 1000 Mbit NIC.
I now have an oops, which I have also posted to the kernel list.
I tried the Rawhide kernel 2.4.20 and the problem is the same.

Scenario:
2.4.20-2.25smp from RawHide

Running rsync from the crashing host _to_ another host over a 1000 Mbit 3Com
(TG3).
The rsynced files include large files of about 1.5 GB.
Heartbeat is running.

The oops follows.

<------------------------CUT---------------------------->
NMI Watchdog detected LOCKUP on CPU0, eip c02499ac, registers:
via686a eeprom lm80 i2c-proc i2c-isa i2c-viapro i2c-core tg3 eepro100 mii
ipt_LOG ipt_limit ipt_state ipt_REJECT iptable_nat ip_cona
CPU:    0
EIP:    0060:[<c02499ac>]    Not tainted
EFLAGS: 00000086

EIP is at .text.lock.tcp_ipv4 [kernel] 0x182 (2.4.20-2.25smp)
eax: 00000001   ebx: d400010a   ecx: 00000000   edx: f78837d8
esi: f6f22ae0   edi: c3d3ad40   ebp: f74939f4   esp: f1335d8c
ds: 0068   es: 0068   ss: 0068
Process rsync (pid: 3151, stackpage=f1335000)
Stack: c3d3ad40 f3121f38 00000001 f1335e28 00000000 03ff0202 00000004 000003ff
       00000000 00000006 c3d3ad40 f74939e0 c022d67e c3d3ad40 f1335e28 c3d5a000
       00000000 00000006 00000000 00000001 00000000 c022d530 c021ce67 c3d3ad40
Call Trace:   [<c022d67e>] ip_local_deliver_finish [kernel] 0x14e (0xf1335dbc))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335de0))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335de4))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e00))
[<c022d2b3>] ip_local_deliver [kernel] 0x53 (0xf1335e1c))
[<c022d530>] ip_local_deliver_finish [kernel] 0x0 (0xf1335e34))
[<c022d8b9>] ip_rcv_finish [kernel] 0x219 (0xf1335e38))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e5c))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e6c))
[<c021ce67>] nf_hook_slow [kernel] 0x107 (0xf1335e70))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335e8c))
[<c022d480>] ip_rcv [kernel] 0x1a0 (0xf1335ea8))
[<c022d6a0>] ip_rcv_finish [kernel] 0x0 (0xf1335ec0))
[<c021566e>] netif_receive_skb [kernel] 0x14e (0xf1335ed8))
[<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
[<f89d2e71>] tg3_poll [tg3] 0x81 (0xf1335f38))
[<c0215917>] net_rx_action [kernel] 0xa7 (0xf1335f58))
[<c01289f9>] do_softirq [kernel] 0xd9 (0xf1335f80))
[<c010b81b>] do_IRQ [kernel] 0xfb (0xf1335f9c))
[<c010e7c8>] call_do_IRQ [kernel] 0x5 (0xf1335fc0))


Code: 7e f8 e9 68 e5 ff ff e8 2c ed eb ff e9 c3 ee ff ff e8 22 ed
console shuts up ...
NMI Watchdog detected LOCKUP on CPU1, eip f89d9f3b, registers:
<------------------------CUT---------------------------->
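
To see at a glance which modules the trace passes through, the call-trace entries can be split into address, symbol, module and offset. The following is a minimal sketch in Python and is not part of the original report. The names `parse_call_trace` and `modules_in_trace` and the regular expression are illustrative assumptions, derived only from the entry format visible in the oops above.

```python
import re

# Matches entries such as: [<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
# The pattern is an assumption based on the lines in this report only.
TRACE_ENTRY = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+"
    r"(?P<symbol>\S+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+"
    r"(?P<offset>0x[0-9a-f]+)"
)


def parse_call_trace(text):
    """Return (address, symbol, module, offset) tuples found in oops text."""
    return [
        (m.group("addr"), m.group("symbol"), m.group("module"),
         int(m.group("offset"), 16))
        for m in TRACE_ENTRY.finditer(text)
    ]


def modules_in_trace(entries):
    """Return non-core module names in order of first appearance."""
    seen = []
    for _addr, _symbol, module, _offset in entries:
        if module != "kernel" and module not in seen:
            seen.append(module)
    return seen


# Four entries copied from the call trace in this comment.
SAMPLE = """\
[<c021566e>] netif_receive_skb [kernel] 0x14e (0xf1335ed8))
[<f89d2c7c>] tg3_rx [tg3] 0x27c (0xf1335ef8))
[<f89d2e71>] tg3_poll [tg3] 0x81 (0xf1335f38))
[<c0215917>] net_rx_action [kernel] 0xa7 (0xf1335f58))
"""

if __name__ == "__main__":
    entries = parse_call_trace(SAMPLE)
    for addr, symbol, module, offset in entries:
        print(f"{addr} {symbol}+{offset:#x} [{module}]")
    print("non-core modules:", modules_in_trace(entries))
```

Run against the trace above, the only non-core module on the receive path is tg3, which is where the packet enters the stack before the lockup in the TCP code.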

Comment 2 Daniel Khan 2003-01-27 06:14:10 UTC

I have now got rid of the failures by replacing the tg3 driver with the latest 
bcm5700 driver from 3Com.
There seems to be a bug in tg3 with the BCM5701 Gigabit Ethernet card.

regards
Daniel Khan