Bug 151054

Summary: kernel panic when bringing up and down multiple interfaces simultaneously
Product: Red Hat Enterprise Linux 3 Reporter: David Knierim <new_galoot>
Component: kernel Assignee: John W. Linville <linville>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 3.0 CC: davem, petrides, riel, trevor
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2005-663 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-09-28 14:51:16 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 156320    
Attachments:
Description Flags
Script to reproduce failure. none
companion script to "master" none
Oops output none
panic with untainted kernel none
panic with HT/SMP (I think) and SMC NICs (I think) none
another panic HT/SMP (I think) and Realtek cards (I think) none
panic with noapic boot (maybe SMP off?) none
panic on 766 UP (others were 770), my workstation none
more oops output none
Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp) none
oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp) none
jwltest-init_module-vfree.patch none

Description David Knierim 2005-03-14 14:11:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
While attempting to come up with a simple test script to demonstrate
the issue found in Bugzilla bug id 150130, I wrote a simple pair of
scripts to bring all interfaces up and down.   To my surprise, the box
panic'ed less than a minute after starting the test script.   I have
reproduced the problem a number of times (I tried 5 times and the box
panic'ed 5 times).

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.2.EL

How reproducible:
Always

Steps to Reproduce:
1. Configure a box with a lot of ethernet interfaces (I use 11 e1000
interfaces)
2. Attach interfaces in question to network with dhcp server.
3. Configure all interfaces to start up using dhcp.
4. Run the attached script.
    

Actual Results:  The box will panic.

Expected Results:  The box should never panic.

Additional info:

Comment 1 David Knierim 2005-03-14 14:13:23 UTC
Created attachment 111979 [details]
Script to reproduce failure.

Modify the "master" script to match your configuration.  Place both scripts in
the same directory.   Run master.

Comment 2 David Knierim 2005-03-14 14:14:49 UTC
Created attachment 111980 [details]
companion script to "master"

This is the other half of the script pair to reproduce this problem.
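
The attached scripts themselves are not reproduced in this report.  As a rough
illustration only, a master/companion pair of this kind could be sketched as
below.  The function names, the iteration count and the IFDOWN/IFUP override
variables are all hypothetical; only the use of ifdown and ifup on many
interfaces at once comes from the description.

```shell
#!/bin/sh
# Hypothetical stand-in for the attached "master"/companion pair; the real
# attachments are not shown in this report.

# Commands are overridable so the loop can be dry-run without root.
IFDOWN=${IFDOWN:-ifdown}
IFUP=${IFUP:-ifup}

# Companion role: cycle one interface down and up a fixed number of times.
cycle_iface() {
    iface=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        $IFDOWN "$iface"
        $IFUP "$iface"
        i=$((i + 1))
    done
}

# Master role: start one background cycler per interface, then wait for all.
run_master() {
    count=$1
    shift
    for iface in "$@"; do
        cycle_iface "$iface" "$count" &
    done
    wait
}
```

On a disposable test box this would be invoked as root with something like
"run_master 1000 eth0 eth1 eth2".  Setting IFDOWN="echo ifdown" and
IFUP="echo ifup" first prints the commands instead of running them.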

Comment 3 David Knierim 2005-03-14 14:17:25 UTC
Created attachment 111981 [details]
Oops output

This is the oops output for the 5 oops'es that I have recorded.

Comment 4 Ernie Petrides 2005-03-14 21:16:57 UTC
David, can this problem be reproduced with an untainted kernel?

Comment 6 David Knierim 2005-03-14 23:14:19 UTC
Created attachment 112001 [details]
panic with untainted kernel

Yes, the problem occurs with an untainted kernel.

Comment 7 David Miller 2005-03-14 23:17:43 UTC
All of the EIP values in the OOPS traces look wrong.
They all point into areas outside of the kernel image
or modules area, and thus the symbols it matches up to
are garbage as well.

Can some x86 guru interpret this or suggest a way to
get more reasonable dump output?
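
One rough way to check the first half of that observation is to compare an EIP
value against the _text and _etext symbols in the System.map of the running
kernel.  The sketch below assumes the usual "address type symbol" layout of
System.map; the function name is made up for illustration, and the check covers
only the static kernel image, not the modules area.

```shell
#!/bin/sh
# Illustrative only: checks the static kernel text range, not module space.

# Print "inside" if hex address $1 (no 0x prefix) lies in [_text, _etext)
# according to the System.map file $2, "outside" if it does not, and
# "unknown" if either symbol is missing from the map.
eip_in_kernel_text() {
    addr=$1
    map=$2
    start=$(awk '$3 == "_text" { print $1; exit }' "$map")
    end=$(awk '$3 == "_etext" { print $1; exit }' "$map")
    if [ -z "$start" ] || [ -z "$end" ]; then
        echo "unknown"
    elif [ $((0x$addr)) -ge $((0x$start)) ] && [ $((0x$addr)) -lt $((0x$end)) ]; then
        echo "inside"
    else
        echo "outside"
    fi
}
```

For example, "eip_in_kernel_text c0123456 /boot/System.map-$(uname -r)" would
classify a single address, assuming the map file is installed under that name.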


Comment 8 Trevor Cordes 2005-03-21 11:46:25 UTC
I think I just hit this same bug.  I set up a new firewall box the way I have
dozens of others, but this new one (the first running FC3 770) has panic'd a lot
during my testing.  I narrowed it down (I think) to the ifdown/ifup my watchdog
scripts do on the interfaces when they are down/unpingable/unplugged.

As a definitive test, I was reliably able to make the kernel panic by typing
repeatedly:   ifdown eth2; ifup eth2

After about 5-15 times as fast as I can hit up-arrow/return, the machine panics.
 It did this with cheap SMC NIC's (Linksys-chip driver I think), and my
tried-and-true Realtek 8139 and the onboard r8169.  The only NIC that was
constant in the testing was the onboard r8169, for obvious reasons.  I've never
had problems with either the SMC or 8139's in the past, and I run them on many
machines.  The 8169 is a new one for me, so I can't vouch for it.

The interfaces I am resetting to cause the panic are NOT running dhclient
(dhcp), they have static addresses.  So in that sense what I've hit appears
different than David's.  Perhaps, David, you can try setting those interfaces
static for a brief test and see if yours still panics?  I bet it's not related
to DHCP vs. static.

The machine appears 100% stable if not many ifdown/ifups are done.  It's been up
2 days with no panic.  If I ifdown/ifup it right now I guarantee it will panic.
 Also, the ifdown/ifup count seems to be cumulative over a long period of time.
 If my scripts do it once every 10 mins then it will crash the box after 30-60 mins.

The new system I tested is also the first P4 HT box I've done, so I tried
turning off HT, booting with noapic, booting with noacpi, BIOS set to MPS 1.1
and 1.4, running UP instead of SMP, but nothing affected the panic.

I have screenshots of many of the panics I can attach if it looks like this is
the same bug -- otherwise I'll open a new bug.  Right now I live in fear of
network problems that will cause my scripts to ifdown/ifup and hang the (remote)
box!


Comment 9 Trevor Cordes 2005-03-22 03:35:23 UTC
This is a definite bug in the kernel.  I have reproduced this on 3 different
firewall boxes running FC3 766 and 770, using at least 5 different brand/models
of NIC.  I think it's a timing issue.  I do not think it is hardware dependent
(panic'd on both i865 and i7205).

On my own main workstation (4 eth interfaces), I can crash it in 20 seconds by
running: ifdown eth1; ifup eth1 (or eth2) repeatedly.  Always crashes after 2-12
iterations.  The interfaces I tested were NOT running dhclient -- they were
static interfaces.

However, if you "sync; ifdown eth1" -- pause -- "sync; ifup eth1", the system
DOES NOT seem to crash.
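
The paced sequence described above can be sketched as a small function.  The
pause length is not given in the comment, so the 2-second default here is a
guess, and the PAUSE, IFDOWN and IFUP variables are hypothetical conveniences.
This only reproduces the observation; it is not a fix.

```shell
#!/bin/sh
# Sketch of the paced ifdown/ifup sequence; the pause length is a guess.
IFDOWN=${IFDOWN:-ifdown}
IFUP=${IFUP:-ifup}
PAUSE=${PAUSE:-2}   # seconds; hypothetical value, not from the report

# Take one interface down, wait, then bring it back up, syncing before each
# step as in the sequence that did not seem to crash.
paced_cycle() {
    iface=$1
    sync
    $IFDOWN "$iface"
    sleep "$PAUSE"
    sync
    $IFUP "$iface"
}
```

Calling "paced_cycle eth1" repeatedly, instead of back-to-back ifdown and ifup,
would match the case that did not seem to crash.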

It appears the ifdown has not completed its entire process before the ifup
starts its thing.  The stack trace is interesting.

I do not _think_ that this bug has been in the kernel for long because I'm sure
that my watchdog scripts would have crashed machines before now.  It's probably
safe to say that it was not in 2.4, but I can't be sure.

A friend tested this on a box with only 1 NIC (static) and it did NOT crash for
him (FC3 770).  It must be dependent on multiple NICs, or something else weird I
am doing like custom iptables scripts, or named/dhcpd/smbd/etc listening on the
interface.

I will attach images of the panic screenshots I took with my dig cam.


Comment 10 Trevor Cordes 2005-03-22 04:33:01 UTC
Created attachment 112205 [details]
panic with HT/SMP (I think) and SMC NICs (I think)

Comment 11 Trevor Cordes 2005-03-22 04:34:07 UTC
Created attachment 112206 [details]
another panic HT/SMP (I think) and Realtek cards (I think)

Comment 12 Trevor Cordes 2005-03-22 04:35:09 UTC
Created attachment 112207 [details]
panic with noapic boot (maybe SMP off?)

Comment 13 Trevor Cordes 2005-03-22 04:36:07 UTC
Created attachment 112208 [details]
panic on 766 UP (others were 770), my workstation

Comment 15 John W. Linville 2005-03-22 16:26:52 UTC
I have done some research, and I believe that comment 8 through comment 13
relate to a different problem.  In fact, I believe I have the patch to fix that
problem.

Trevor, please open a bug against Fedora Core 3 to cover the issue you are
seeing.  Assign it to me if you can, and please post the bug number here for
reference.

This bug will remain open to deal with the (strikingly similar, but different)
problem observed on RHEL3.

Thanks!

Comment 17 David Knierim 2005-03-22 22:01:21 UTC
Created attachment 112232 [details]
more oops output

This file has the oops data from three failures with an untainted kernel.
Hopefully, these will work better.

Comment 18 David Knierim 2005-03-22 22:04:49 UTC
Oh yeah.   The interfaces are configured with static IP addresses now, so it's
not dhcp.   The box now has 15 interfaces that I am bringing up and down.

Comment 19 Trevor Cordes 2005-03-23 05:11:25 UTC
For comment 8 through comment 13, see new bug 151874

Comment 20 John W. Linville 2005-03-23 15:03:30 UTC
Hmmm...well, I don't doubt that there is a problem...but the oopses from comment
17, while consistent, don't seem to narrow down the problem.  In fact, they just
don't make sense... :-(

I speculate that there is a connection between this and bug 150130, and probably
ug 145959 as well...I'm just not sure what it is yet...

Comment 21 John W. Linville 2005-03-23 15:04:39 UTC
Hmmm...that should be "bug 145959 as well..."

Comment 22 David Knierim 2005-03-24 23:11:41 UTC
I have no idea if this is related, but I have seen this same configuration hang
hard a number of times (3+), too.  Magic sysrq doesn't work.   Box doesn't
respond to pings, etc...

Comment 23 John W. Linville 2005-05-03 20:57:56 UTC
Please see bug 150130 comment 9...thanks! 

Comment 24 David Knierim 2005-05-06 22:06:29 UTC
Created attachment 114108 [details]
Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp)

This took 384 seconds to happen on a box with 4 interfaces.

Comment 25 David Knierim 2005-05-11 16:15:20 UTC
Created attachment 114251 [details]
oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp)

Ran for 1456 seconds before failing.

Comment 26 John W. Linville 2005-05-13 14:13:35 UTC
Could you find these lines in /etc/sysconfig/network-scripts/ifup?
 
# Is there a firewall running, and does it look like one we configured? 
FWACTIVE= 
if iptables -L -n 2>/dev/null | LC_ALL=C grep -q RH-Lokkit-0-50-INPUT ; then 
    FWACTIVE=1 
else 
    modprobe -r iptable_filter >/dev/null 2>&1 
fi 
 
Once you find them, comment them out (i.e. put a "#" at the beginning of each 
of those lines).  Then please attempt your test again, and post the results 
here. 
 
If the problem persists, please attach a copy of your 
modified /etc/sysconfig/network-scripts/ifup to ensure that I told you to do 
the right thing... :-) 
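
For anyone who would rather not edit the file by hand, the change can be
sketched with sed on a copy of the script.  This assumes the block starts at
the "# Is there a firewall running" comment, is not indented, and ends at the
first following line that begins with "fi", as in the lines quoted above.  The
function name is made up, and the result should be checked by eye before it
replaces the real ifup.

```shell
#!/bin/sh
# Sketch: comment out the firewall-check block in a COPY of ifup.
# Assumes the block runs from the "# Is there a firewall running" comment
# to the first following line that begins with "fi".
comment_out_fw_block() {
    src=$1
    dst=$2
    # Prefix every line in the address range with "#"; other lines pass
    # through unchanged.
    sed '/^# Is there a firewall running/,/^fi/ s/^/#/' "$src" > "$dst"
}
```

A cautious run would be "comment_out_fw_block
/etc/sysconfig/network-scripts/ifup /tmp/ifup.new", followed by a diff of the
two files.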

Comment 27 David Knierim 2005-05-16 13:02:13 UTC
I commented out the requested lines and my test script is still running (after 2
days, 19 hours and over 41,000 iterations).  Looks like a clue :^)

Comment 28 John W. Linville 2005-06-07 19:23:28 UTC
*** Bug 150130 has been marked as a duplicate of this bug. ***

Comment 29 John W. Linville 2005-06-15 16:33:58 UTC
Looks like doing a loop which inserts and removes iptable_filter repeatedly 
will trigger the same problem. 
 
iptable_filter depends on ip_tables...doing the loop w/ ip_tables causes the 
same problem as well...getting closer? 

Comment 30 John W. Linville 2005-06-15 19:27:25 UTC
Looks like most any module will do...loop does it as well... 
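
The exact loop used is not shown in these comments.  A minimal sketch of such
an insert/remove loop, assuming modprobe and modprobe -r are used (insmod and
rmmod may have been used instead), could look like this.  The function name,
the count argument and the MODPROBE override are hypothetical.

```shell
#!/bin/sh
# Sketch of a module insert/remove loop; run only on a disposable test box.
MODPROBE=${MODPROBE:-modprobe}

# Load and unload one module a fixed number of times, stopping early if
# either step fails.
module_loop() {
    module=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        $MODPROBE "$module" || return 1
        $MODPROBE -r "$module" || return 1
        i=$((i + 1))
    done
}
```

As root, "module_loop iptable_filter 10000" or "module_loop loop 10000" would
exercise the same path.  Setting MODPROBE="echo modprobe" first gives a dry run.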

Comment 31 John W. Linville 2005-06-15 20:20:08 UTC
I've posted some test kernels here: 
 
   http://people.redhat.com/linville/kernels/rhel3/ 
 
I no longer seem to be able to recreate the insmod failure when using these 
kernels.  Would you mind giving them a try and posting the results?  Thanks! 

Comment 32 John W. Linville 2005-06-15 20:21:43 UTC
Created attachment 115505 [details]
jwltest-init_module-vfree.patch

Comment 33 David Knierim 2005-06-24 17:03:20 UTC
I have retested with kernel version 2.4.21-32.8.EL.jwltest.32smp on i686.   It
is working great.

Comment 40 Ernie Petrides 2005-07-12 01:07:29 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.10.EL).


Comment 41 Ernie Petrides 2005-07-12 01:11:46 UTC
Removing dependency of bug 145959 on this one, since the former is against Fedora.

Comment 42 Ernie Petrides 2005-07-22 00:05:26 UTC
*** Bug 150130 has been marked as a duplicate of this bug. ***

Comment 45 Red Hat Bugzilla 2005-09-28 14:51:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html