Bug 151054
Description
David Knierim
2005-03-14 14:11:27 UTC
Created attachment 111979 [details]
Script to reproduce failure.
Modify the "master" script to match your configuration. Place both scripts in
the same directory. Run master.
Created attachment 111980 [details]
companion script to "master"
This is the other half of the script pair to reproduce this problem.
Created attachment 111981 [details]
Oops output
This is the oops output for the 5 oopses that I have recorded.

David, can this problem be reproduced with an untainted kernel?

Created attachment 112001 [details]
panic with untainted kernel
Yes, the problem occurs with an untainted kernel.
All of the EIP values in the OOPS traces look wrong. They all point into areas outside of the kernel image or modules area, and thus the symbols they are matched up to are garbage as well. Can some x86 guru interpret this or suggest a way to get more reasonable dump output?

I think I just hit this same bug. I set up a new firewall box the way I have dozens of others, but this new one (the first running FC3 770) has panic'd a lot during my testing. I narrowed it down (I think) to the ifdown/ifup my watchdog scripts do on the interfaces when they are down/unpingable/unplugged. As a definitive test, I was reliably able to make the kernel panic by typing repeatedly:
    ifdown eth2; ifup eth2
After about 5-15 times, as fast as I can hit up-arrow/return, the machine panics. It did this with cheap SMC NICs (Linksys-chip driver I think), my tried-and-true Realtek 8139, and the onboard r8169. The only NIC that was constant in the testing was the onboard r8169, for obvious reasons. I've never had problems with either the SMC or 8139s in the past, and I run them on many machines. The 8169 is a new one for me, so I can't vouch for it.
The interfaces I am resetting to cause the panic are NOT running dhclient (dhcp); they have static addresses. So in that sense what I've hit appears different from David's. Perhaps, David, you can try setting those interfaces static for a brief test and see if yours still panics? I bet it's not related to DHCP vs. static.
The machine appears 100% stable if not many ifdown/ifups are done. It's been up 2 days with no panic. If I ifdown/ifup it right now I guarantee it will panic. Also, the ifdown/ifup count seems to be cumulative over a long period of time. If my scripts do it once every 10 mins then it will crash the box after 30-60 mins.
The new system I tested is also the first P4 HT box I've done, so I tried turning off HT, booting with noapic, booting with noacpi, BIOS set to MPS 1.1 and 1.4, and running UP instead of SMP, but nothing affected the panic.
I have screenshots of many of the panics I can attach if it looks like this is the same bug -- otherwise I'll open a new bug. Right now I live in fear of network problems that will cause my scripts to ifdown/ifup and hang the (remote) box!

This is a definite bug in the kernel. I have reproduced this on 3 different firewall boxes running FC3 766 and 770, using at least 5 different brands/models of NIC. I think it's a timing issue. I do not think it is hardware dependent (panic'd on both i865 and i7205). On my own main workstation (4 eth interfaces), I can crash it in 20 seconds by running:
    ifdown eth1; ifup eth1
(or eth2) repeatedly. It always crashes after 2-12 iterations. The interfaces I tested were NOT running dhclient -- they were static interfaces.
However, if you "sync; ifdown eth1" -- pause -- "sync; ifup eth1", the system DOES NOT seem to crash. It appears the ifdown has not completed its entire process before the ifup starts its thing. The stack trace is interesting.
I do not _think_ that this bug has been in the kernel for long, because I'm sure that my watchdog scripts would have crashed machines before now. It's probably safe to say that it was not in 2.4, but I can't be sure.
A friend tested this on a box with only 1 NIC (static) and it did NOT crash for him (FC3 770). It must be dependent on multiple NICs, or on something else weird I am doing, like custom iptables scripts, or named/dhcpd/smbd/etc listening on the interface.
I will attach images of the panic screenshots I took with my dig cam.

Created attachment 112205 [details]
panic with HT/SMP (I think) and SMC NICs (I think)
Created attachment 112206 [details]
another panic HT/SMP (I think) and Realtek cards (I think)
Created attachment 112207 [details]
panic with noapic boot (maybe SMP off?)
Created attachment 112208 [details]
panic on 766 UP (others were 770), my workstation
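The manual ifdown/ifup test described in the comments above could be scripted roughly as follows. This is only a sketch, not the attached "master" script: the names `run`, `bounce_iface`, and `DRY_RUN` are made up for illustration, and the optional pause mirrors the "sync; ifdown -- pause -- sync; ifup" variant that was reported not to crash. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it. Only set DRY_RUN=0 as root on a disposable test box.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bounce_iface IFACE COUNT [PAUSE_SECONDS]
# Takes IFACE down and up COUNT times in a row.
# With PAUSE_SECONDS > 0, a sync precedes each step and a sleep separates
# the ifdown from the ifup, as in the variant that did not crash.
bounce_iface() {
    iface=$1
    count=$2
    pause=${3:-0}
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ "$pause" -gt 0 ]; then run sync; fi
        run ifdown "$iface"
        if [ "$pause" -gt 0 ]; then run sleep "$pause"; run sync; fi
        run ifup "$iface"
        i=$((i + 1))
    done
}
```

For example, `bounce_iface eth2 15` prints fifteen back-to-back ifdown/ifup pairs for eth2, while `bounce_iface eth1 15 2` adds the sync and a 2 second pause between the two steps.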
I have done some research, and I believe that comment 8 through comment 13 relate to a different problem. In fact, I believe I have the patch to fix that problem.
Trevor, please open a bug against Fedora Core 3 to cover the issue you are seeing. Assign it to me if you can, and please post the bug number here for reference.
This bug will remain open to deal with the (strikingly similar, but different) problem observed on RHEL3. Thanks!

Created attachment 112232 [details]
more oops output
This file has the oops data from three failures with untainted kernel.
Hopefully, these will work better.
Oh yeah. The interfaces are configured with static IP addresses now, so it's not dhcp. The box now has 15 interfaces that I am bringing up and down.

For comment 8 through comment 13, see new bug 151874

Hmmm...well, I don't doubt that there is a problem...but the oopses from comment 17, while consistent, don't seem to narrow down the problem. In fact, they just don't make sense... :-(
I speculate that there is a connection between this and bug 150130, and probably ug 145959 as well...I'm just not sure what it is yet...

Hmmm...that should be "bug 145959 as well..."

I have no idea if this is related, but I have seen this same configuration hang hard a number of times (3+), too. Magic sysrq doesn't work. Box doesn't respond to pings, etc...

Please see bug 150130 comment 9...thanks!

Created attachment 114108 [details]
Oops with latest kernel (2.4.21-32.3.EL.jwltest.22smp)
This took 384 seconds to happen on box with 4 interfaces.
Created attachment 114251 [details]
oops with an even newer kernel (2.4.21-32.3.EL.jwltest.24smp)
Ran for 1456 seconds before failing.
Could you find these lines in /etc/sysconfig/network-scripts/ifup?
    # Is there a firewall running, and does it look like one we configured?
    FWACTIVE=
    if iptables -L -n 2>/dev/null | LC_ALL=C grep -q RH-Lokkit-0-50-INPUT ; then
        FWACTIVE=1
    else
        modprobe -r iptable_filter >/dev/null 2>&1
    fi
Once you find them, comment them out (i.e. put a "#" at the beginning of each of those lines). Then please attempt your test again, and post the results here.
If the problem persists, please attach a copy of your modified /etc/sysconfig/network-scripts/ifup to ensure that I told you to do the right thing... :-)

I commented out the requested lines and my test script is still running (after 2 days, 19 hours and over 41,000 iterations). Looks like a clue :^)

*** Bug 150130 has been marked as a duplicate of this bug. ***

Looks like doing a loop which inserts and removes iptable_filter repeatedly will trigger the same problem.
iptable_filter depends on ip_tables...doing the loop w/ ip_tables causes the same problem as well...getting closer?

Looks like most any module will do...loop does it as well...

I've posted some test kernels here:
http://people.redhat.com/linville/kernels/rhel3/
I no longer seem to be able to recreate the insmod failure when using these kernels. Would you mind giving them a try and posting the results? Thanks!

Created attachment 115505 [details]
jwltest-init_module-vfree.patch
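The module insert/remove loop mentioned above might look something like the sketch below. The exact commands used in the original test were not posted, so this is an assumption: it uses `modprobe` and `modprobe -r` (the latter appears in the ifup snippet quoted earlier), and the names `run`, `cycle_module`, and `DRY_RUN` are hypothetical. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it. Only set DRY_RUN=0 as root on a disposable test box.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cycle_module MODULE COUNT
# Loads and then unloads MODULE, COUNT times in a row.
cycle_module() {
    module=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        run modprobe "$module"
        run modprobe -r "$module"
        i=$((i + 1))
    done
}
```

For example, `cycle_module iptable_filter 1000` prints 1000 load/unload pairs; per the comments above, ip_tables or loop could be substituted for the module name.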
I have retested with kernel version 2.4.21-32.8.EL.jwltest.32smp on i686. It is working great.

A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.10.EL).

Removing dependency of bug 145959 on this one, since the former is against Fedora.

*** Bug 150130 has been marked as a duplicate of this bug. ***

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2005-663.html