Red Hat Bugzilla – Bug 157209
Server reboots on initial bringup of e1000 after power up occasionally
Last modified: 2007-11-30 17:07:07 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050416 Fedora/1.0.3-1.3.1 Firefox/1.0.3
Description of problem:
I have a number of servers with multiple e1000 interfaces (4+). The servers are dual-processor 3.2 GHz Xeon boxes with 2 - 8 GB of DRAM. Occasionally, when a box is booting up after being powered off, the server will reboot when it tries to bring up an Ethernet interface. After it reboots, the server never has any trouble with an interface that was already brought up, or with the interface that caused the reboot, but it may reboot again on other interfaces that have not been accessed previously. I have seen a server with 8 interfaces reboot twice due to this issue.
Once all interfaces have been brought up, this problem does not occur again until the box is power cycled (warm reboots do not cause any issues).
Yes, this is really strange, but I have observed it many times on many boxes.
In an effort to help solve this issue, I installed Intel's driver from their web site (version 5.7.6) and, if anything, the problem is more severe.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Configure a dual 3.2 GHz Xeon box with lots of e1000 interfaces
2. Set all interfaces to get IP addresses (I have been using DHCP)
3. Power box off and on again.
4. Observe console to see if box reboots during boot up.
5. Repeat steps 3 and 4.
Actual Results: Sometimes, the server will reboot while bringing up Ethernet interfaces.
I have test kernels w/ a later e1000 driver available here:
Please give them a try to see if the problem has already been corrected.
Please post the results here. Thanks!
Closed due to lack of response...please reopen if/when requested information
I finally got a chance to retest this. It failed booting up after power cycle
running Red Hat test kernel (kernel-smp-2.4.21-32.9.EL.jwltest.35.i686.rpm):
Bringing up interface eth8: C<P0>UK 1e:rn Melac phainnei c:Ch Uencka bElex cteo
p ticoonnt: i0nu00e0
00In0 0i00d0le0 00t4as
-CP nUo t0 :s yMancchinigne
It failed the second time I booted the box up (with a power cycle before each
boot). The box in question has 16 e1000 interfaces.
Hmmm, not a lot to go on...just curious, is this the same box as in bug
Please attach the output of running "sysreport" on the box...thanks!
The problem described in the link from comment 5 sounds to me like it _might_
be related. I have hacked-up a patch that I think _might_ help...wanna try
Test kernels available at the same location as in comment 1. Please give them
a try to see if you can reproduce the issue with them, and post the results
Created attachment 116720
The box in question is from the same family as the box I reported bug
154680 against. It's physically a different box, but it has exactly the same
hardware configuration except it only has 2GB of DRAM.
I just tried to boot it up running the kernel:
I have seen several issues when booting this kernel. Several times the box
hung when starting up the first interface. Several times it hung running
kudzu. One time it spewed forth loads of shell code on the screen and failed to
configure the interface. After doing this, the server rebooted a minute or so
later. Sorry, but I don't have a capture of the output from this failure.
When I went back to the latest errata kernel, the box booted just fine and all
interfaces came up.
Hmmm...well, that sucks...don't use those kernels... :-)
Thanks for the testing, anyway! I'll have to get back to you...
Created attachment 116799
Output of sysreport
sysreport output from server in question.
Do the "acpi=off" or "acpi=noirq" kernel command line parameters have any
effect on this issue?
I just did some testing in this area and here are my results:
Initially I configured a box with 16 e1000 ports. One was configured for DHCP
and the rest were static. The static ports did not have a network attached.
Test         pass  fail
acpi=off        5     0
acpi=noirq      3     1
I then decided to make the test a bit more realistic. I configured all ports
for DHCP and attached 15 of them to a hub with a DHCP server and the remaining
port went to our main network, which has a different DHCP server.
Test         pass  fail
acpi=off       10     0
acpi=noirq      7     3
std acpi       10     1
"std acpi" means there is no entry for acpi on the kernel boot line.
I will be doing more testing with acpi=off to see if it actually clears up the
I ran an automated test overnight with acpi=off. Out of 111 attempts, 91
passed. I will run additional testing today.
Running with no settings for acpi on the kernel line gave the following results:
33 attempts, 27 passes. I will be out of the office next week. I will not be
doing further testing until I return.
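
To put the two overnight runs side by side, the pass counts can be turned into failure rates with a rough confidence interval. This is a minimal sketch and not part of the original test setup: the function name `wilson_interval` and the choice of a 95% Wilson score interval are assumptions made for illustration. Only the attempt and pass counts come from the runs above.

```python
import math


def wilson_interval(failures, attempts, z=1.96):
    """Return (low, high) Wilson score interval for the failure proportion."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = failures / attempts
    denom = 1 + z * z / attempts
    centre = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z * z / (4 * attempts ** 2)) / denom
    return centre - half, centre + half


# (attempts, passes) as reported in the comments above.
runs = {
    "acpi=off": (111, 91),
    "std acpi": (33, 27),
}

for name, (attempts, passes) in runs.items():
    failures = attempts - passes
    low, high = wilson_interval(failures, attempts)
    print(f"{name}: {failures}/{attempts} failed "
          f"({failures / attempts:.1%}, 95% interval {low:.1%} to {high:.1%})")
```

Both runs fail roughly 18% of the time (20 of 111 and 6 of 33), so on these numbers acpi=off shows no measurable improvement over the default.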
Is there an oops or panic before the box reboots? Can you do anything to
capture that oops (perhaps using netconsole)?
I set up netconsole and attempted to capture anything further. Netconsole is
working, but it does not capture anything with this failure.
I just tried this using the latest driver from Intel (6.2.15). The problem
exists there, too.
Determining IP information for eth1...C<P0U> 0K:e rnMaelch ipanne iCch:e Uckn
aEbxlec eptoti coon:nt i0n00ue0
000I0n 00i0dl00e 04t 0
sk C-PU no1:t Msaynchciinnge
I have observed something that might be useful. I have been testing using a
box with 16 interfaces. 1 goes to our production network and the rest all go
to a single 10/100 hub with a DHCP server attached. No matter which order the
interfaces are brought up, the failure always happens when bringing up the
interface attached to the production net. Up until now, all interfaces have
been configured to use DHCP. I will set the production interface to be static
and see what happens.
Did that configuration change make any difference? Are you still confident
that it is the "production" interface that always causes the problem?
The problem only happens on the interface attached to the "production" network.
I have tried both DHCP and static configured IP address. Both fail. In a few
cases, the interface came up with a static address, but I then got the failure
when I tried to ssh out the interface to a second server.
I have also moved the network cable from the production network to different
ports, but that did not make any difference.
Since I noticed this behavior, I have never seen any of the 15 interfaces
attached to the test network fail (The production interface is running 100Mbps,
full duplex, while the test network is 100Mbps, half duplex).
When I get a chance to test this more, I can try to put some traffic on the test
network to see if that makes any difference. Let me know if you have any other
It certainly would not be the first time I've seen drivers have problems with
incoming traffic during initialization...I'll have to get back to you...
Big e1000 update available in the test kernels here:
Please give those a try and post the results here...thanks!
Belay that...RHEL3 kernels not ready yet...
Ok, now they are here:
Please give those a try and post the results...thanks!
Sorry it has taken so long to retest this. Getting the test setup back together
was a PITA.
The problem is still happening. The failure does not seem to be happening as
often as before. I will retest to see if this observation is accurate.
I am currently running with one interface attached to a production network. I
am testing with a different server and different production network, now that I
think about it, so the failure rate does not necessarily correlate.
In any case, I have a script that does the following:
- boot up the server
- run and log online test on 1 interface
- run and log offline test on 1 interface
- log before bringing up interface
- bring up interface
- log after bringing up interface
- log and run ssh <remote host> /bin/true
- log and shut down interface
- log and bring up interface again (a leftover from when the script did more
- log and ping remote host
- ssh to remote host and initiate script to cause power cycle (with logging)
- shut down server (the previous script will fire it back up)
After running this for about 24 hours, the logs have 275 entries for each of the
steps before bringing up the interface. There were 233 entries for the
remaining steps. This indicates that out of 275 attempts, 233 succeeded and 42 failed.
I am retesting with an older kernel to see what rate the error is happening with
this server and network.
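
The pass/fail split above comes from comparing how many times each step was logged. A minimal sketch of that tally follows. The log format (one step name per line) and the step names `before_ifup` and `after_ifup` are hypothetical, since the actual script and its log layout are not shown in this report.

```python
from collections import Counter


def tally_steps(lines):
    """Count how many times each step name appears in the log lines."""
    return Counter(line.strip() for line in lines if line.strip())


def bringup_result(counts, before="before_ifup", after="after_ifup"):
    """Return (attempts, successes, failures) for the interface bring-up.

    An attempt is a 'before' entry. It succeeded if an 'after' entry was
    also logged; otherwise the box hung or rebooted in between.
    """
    attempts = counts[before]
    successes = counts[after]
    return attempts, successes, attempts - successes


# Synthetic log reproducing the counts reported above:
# 275 'before' entries and 233 'after' entries.
log = ["before_ifup"] * 275 + ["after_ifup"] * 233
attempts, successes, failures = bringup_result(tally_steps(log))
print(f"{attempts} attempts, {successes} succeeded, {failures} failed "
      f"({failures / attempts:.1%})")
```

With the reported counts this gives 42 failures in 275 attempts, a failure rate of about 15%.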
I retested the same configuration, but only changed the kernel to the kernel we
normally use (2.4.21-32.0.1.ELsmp). The scripts attempted to bring up the
ethernet interface 29 times. The server hung twice and rebooted a third time
without bringing up the interface. The test kernel never hung when attempting
to bring up the interface. However, the failure rate does not appear to be
improved by the new test kernel.
RHEL3 is now late enough in its life cycle that I don't think this will ever
be fixed. I'm sorry.