Bug 157209 - Server reboots on initial bringup of e1000 after power up occasionally
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: John W. Linville
QA Contact: Brian Brock
 
Reported: 2005-05-09 09:45 EDT by David Knierim
Modified: 2007-11-30 17:07 EST

Doc Type: Bug Fix
Last Closed: 2006-05-17 10:12:59 EDT

Attachments
jwltest-e1000-alloc_rx_buf.patch (1.49 KB, patch)
2005-07-13 16:20 EDT, John W. Linville
Output of sysreport (286.13 KB, application/x-bzip2)
2005-07-15 10:08 EDT, David Knierim

Description David Knierim 2005-05-09 09:45:33 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050416 Fedora/1.0.3-1.3.1 Firefox/1.0.3

Description of problem:
I have a number of servers with multiple e1000 interfaces (4+).  The servers are dual processor 3.2 GHz Xeon boxes with 2 - 8GB of DRAM.  Occasionally, when the box is booting up after being powered off, the server will reboot when it tries to bring up an Ethernet interface.   After such a reboot, the server never has any trouble with any interface that was already brought up, including the interface that caused the reboot, but it may reboot again on other interfaces that have not been accessed previously.   I have seen a server with 8 interfaces reboot twice due to this issue.

Once all interfaces have been brought up, this problem does not occur again until the box is power cycled (warm reboots do not cause any issues).

Yes, this is really strange, but I have observed it many times on many boxes.  

In an effort to help solve this issue, I installed Intel's driver from their web site (version 5.7.6) and, if anything, the problem is more severe.

David



Version-Release number of selected component (if applicable):
2.4.21-27.0.2.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Configure dual 3.2 GHz Xeon box with lots of e1000 interfaces
2. Set all interfaces to get IP addresses (I have been using DHCP)
3. Power box off and on again.
4. Observe console to see if box reboots during boot up.
5. Repeat steps 3 and 4.
  

Actual Results:  Sometimes, the server will reboot while bringing up Ethernet interfaces.

Additional info:
Comment 1 John W. Linville 2005-05-10 13:55:59 EDT
I have test kernels w/ a later e1000 driver available here: 
 
   http://people.redhat.com/linville/kernels/rhel3/ 
 
Please give them a try to see if the problem has already been corrected.  
Please post the results here.  Thanks! 
Comment 2 John W. Linville 2005-06-27 10:34:09 EDT
Closed due to lack of response...please reopen if/when requested information 
becomes available... 
Comment 3 David Knierim 2005-07-07 10:18:28 EDT
I finally got a chance to retest this.  It failed while booting after a power cycle,
running the Red Hat test kernel (kernel-smp-2.4.21-32.9.EL.jwltest.35.i686.rpm):

Bringing up interface eth8:  C<P0>UK 1e:rn Melac phainnei c:Ch Uencka bElex cteo
p ticoonnt: i0nu00e0
00In0 0i00d0le0 00t4as
 -CP nUo t0 :s yMancchinigne
 Ch e

It failed the second time I booted the box up (with a power cycle before each
boot).  The box in question has 16 e1000 interfaces.  
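
The garbled console text above looks like two kernel messages printed at the same time by the two CPUs, one character from each in turn. One plausible reading (an assumption, not confirmed anywhere in this report) is that the first garbled line mixes "CPU 1: Machine Check Exce..." with "<0>Kernel panic: Unable to...". A minimal Python sketch to check that reading; the candidate strings and the helper name is_interleaving are illustrative only:

```python
def is_interleaving(mixed: str, a: str, b: str) -> bool:
    """Return True if `mixed` can be formed by merging `a` and `b`
    while keeping the characters of each in their original order."""
    if len(mixed) != len(a) + len(b):
        return False
    # reachable[j]: mixed[:i + j] can be built from a[:i] and b[:j]
    reachable = [False] * (len(b) + 1)
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                reachable[j] = True
                continue
            ch = mixed[i + j - 1]
            # reachable[j] still holds the previous row (i - 1) here
            from_a = i > 0 and reachable[j] and a[i - 1] == ch
            # reachable[j - 1] already holds the current row (i)
            from_b = j > 0 and reachable[j - 1] and b[j - 1] == ch
            reachable[j] = from_a or from_b
    return reachable[len(b)]


# First garbled line of the console capture (the "Bringing up..." prefix removed).
GARBLED = "C<P0>UK 1e:rn Melac phainnei c:Ch Uencka bElex cteo"
# Assumed candidate messages (hypothetical reading, truncated to fit the line).
CANDIDATE_A = "CPU 1: Machine Check Exce"
CANDIDATE_B = "<0>Kernel panic: Unable to"

if __name__ == "__main__":
    print(is_interleaving(GARBLED, CANDIDATE_A, CANDIDATE_B))
```

A True result only shows that the reading is consistent with the captured text; it does not prove what the kernel actually printed.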
Comment 4 John W. Linville 2005-07-13 11:11:03 EDT
Hmmm, not a lot to go on...just curious, is this the same box as in bug 
154680? 
 
Please attach the output of running "sysreport" on the box...thanks! 
Comment 6 John W. Linville 2005-07-13 16:18:22 EDT
The problem described in the link from comment 5 sounds to me like it _might_ 
be related.  I have hacked-up a patch that I think _might_ help...wanna try 
it? 
 
Test kernels available at the same location as in comment 1.  Please give them 
a try to see if you can reproduce the issue with them, and post the results 
here...thanks! 
Comment 7 John W. Linville 2005-07-13 16:20:19 EDT
Created attachment 116720 [details]
jwltest-e1000-alloc_rx_buf.patch
Comment 8 David Knierim 2005-07-15 09:57:46 EDT
The box in question is from the same family as the box I reported bug 
154680 against.  It's physically a different box, but it has exactly the same
hardware configuration except it only has 2GB of DRAM.

I just tried to boot it up running the kernel:
kernel-smp-2.4.21-32.10.EL.jwltest.37.i686.rpm

I have seen several issues when booting this kernel.   Several times the box
hung when starting up the first interface.   Several times it hung running
kudzu.  One time it spewed forth loads of shell code on the screen and failed to
configure the interface.  After doing this, the server rebooted a minute or so
later.  Sorry, but I don't have a capture of the output from this failure.

When I went back to the latest errata kernel, the box booted just fine and all
interfaces came up.
Comment 9 John W. Linville 2005-07-15 10:02:55 EDT
Hmmm...well, that sucks...don't use those kernels... :-) 
 
Thanks for the testing, anyway!  I'll have to get back to you... 
Comment 10 David Knierim 2005-07-15 10:08:34 EDT
Created attachment 116799 [details]
Output of sysreport

sysreport output from server in question.
Comment 11 John W. Linville 2005-08-24 10:32:46 EDT
Do the "acpi=off" or "acpi=noirq" kernel command line parameters have any 
effect on this issue? 
Comment 12 David Knierim 2005-09-08 17:58:57 EDT
I just did some testing in this area and here are my results:
Initially I configured a box with 16 e1000 ports.  One was configured for DHCP
and the rest were static.   The static ports did not have a network attached.

Test        pass  fail
acpi=off    5     0
acpi=noirq  3     1

I then decided to make the test a bit more realistic.  I configured all ports
for DHCP and attached 15 of them to a hub with a DHCP server and the remaining
port went to our main network, which has a different DHCP server.

Test        pass  fail
acpi=off    10    0
acpi=noirq  7     3
std acpi    10    1

"std acpi" means there is no entry for acpi on the kernel boot line.

I will be doing more testing with acpi=off to see if it actually clears up the
problem.

Comment 13 David Knierim 2005-09-09 10:39:08 EDT
I ran an automated test overnight with acpi=off.  Out of 111 attempts, 91
passed.  I will run additional testing today.   
Comment 14 David Knierim 2005-09-09 15:11:02 EDT
Running with no settings for acpi on the kernel line gave the following results:
 33 attempts, 27 passes.  I will be out of the office next week. I will not be
doing further testing until I return.
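
Putting the counts from comments 12 to 14 side by side may make the settings easier to compare. A small Python sketch that computes the failure rate for each run; the run labels are made up for illustration, and the failure counts for the two automated runs are derived as attempts minus passes:

```python
# (passes, failures) per test run, taken from comments 12-14.
RUNS = {
    "acpi=off, 1 DHCP + 15 static": (5, 0),
    "acpi=noirq, 1 DHCP + 15 static": (3, 1),
    "acpi=off, all DHCP": (10, 0),
    "acpi=noirq, all DHCP": (7, 3),
    "std acpi, all DHCP": (10, 1),
    "acpi=off, overnight automated run": (91, 111 - 91),
    "std acpi, automated run": (27, 33 - 27),
}


def failure_rate(passes: int, failures: int) -> float:
    """Fraction of attempts that failed."""
    return failures / (passes + failures)


if __name__ == "__main__":
    for label, (passes, failures) in RUNS.items():
        attempts = passes + failures
        rate = failure_rate(passes, failures)
        print(f"{label:36s} {failures:3d} of {attempts:3d} failed ({rate:.1%})")
```

On these numbers the two automated runs fail at about the same rate (20 of 111 with acpi=off versus 6 of 33 with no acpi setting, roughly 18% each), which suggests the small manual samples overstated the effect of acpi=off.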
Comment 15 John W. Linville 2005-09-23 10:57:25 EDT
Is there an oops or panic before the box reboots?  Can you do anything to 
capture that oops (perhaps using netconsole)? 
Comment 16 David Knierim 2005-10-07 09:05:09 EDT
I set up netconsole and attempted to capture anything further.  Netconsole is
working, but it does not capture anything with this failure.
Comment 17 David Knierim 2005-11-02 14:20:23 EST
I just tried this using the latest driver from Intel (6.2.15).  The problem
exists there, too.

Determining IP information for eth1...C<P0U> 0K:e rnMaelch ipanne iCch:e Uckn
aEbxlec eptoti coon:nt i0n00ue0
000I0n 00i0dl00e 04t          0
sk C-PU  no1:t  Msaynchciinnge
Ch ec      
Comment 18 David Knierim 2005-11-17 11:41:42 EST
I have observed something that might be useful.   I have been testing using a
box with 16 interfaces.   1 goes to our production network and the rest all go
to a single 10/100 hub with a DHCP server attached.  No matter which order the
interfaces are brought up, the failure always happens when bringing up the
interface attached to the production net.   Up until now, all interfaces have
been configured to use DHCP.   I will set the production interface to be static
and see what happens.
Comment 19 John W. Linville 2005-12-12 09:57:55 EST
Did that configuration change make any difference?  Are you still confident 
that it is the "production" interface that always causes the problem? 
Comment 20 David Knierim 2005-12-16 16:53:31 EST
The problem only happens on the interface attached to the "production" network.
I have tried both DHCP and a statically configured IP address.  Both fail.  In a few
cases, the interface came up with a static address, but I then got the failure
when I tried to ssh out through that interface to a second server.

I have also moved the network cable from the production network to different
ports, but that did not make any difference.

Since I noticed this behavior, I have never seen any of the 15 interfaces
attached to the test network fail (The production interface is running 100Mbps,
full duplex, while the test network is 100Mbps, half duplex).

When I get a chance to test this more, I can try to put some traffic on the test
network to see if that makes any difference.   Let me know if you have any other
ideas.
Comment 21 John W. Linville 2005-12-16 17:07:30 EST
It certainly would not be the first time I've seen drivers have problems with 
incoming traffic during initialization...I'll have to get back to you... 
Comment 23 John W. Linville 2006-02-08 15:58:26 EST
Big e1000 update available in the test kernels here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give those a try and post the results here...thanks! 
Comment 24 John W. Linville 2006-02-08 15:59:10 EST
Belay that...RHEL3 kernels not ready yet... 
Comment 25 John W. Linville 2006-02-09 09:58:17 EST
Ok, now they are here:  
  
   http://people.redhat.com/linville/kernels/rhel3/ 
 
Please give those a try and post the results...thanks! 
Comment 26 David Knierim 2006-03-07 15:52:41 EST
Sorry it has taken so long to retest this.  Getting the test setup back together
was a PITA.

The problem is still happening.  The failure does not seem to be happening as
often as before.  I will retest to see if this observation is accurate.

I am currently running with one interface attached to a production network.  I
am testing with a different server and different production network, now that I
think about it, so the failure rate does not necessarily correlate.

In any case, I have a script that does the following:
- boot up the server
- run and log online test on 1 interface
- run and log offline test on 1 interface
- log before bringing up interface
- bring up interface
- log after bringing up interface
- log and run ssh <remote host> /bin/true
- log and shut down interface 
- log and bring up interface again (a leftover from when the script did more
interfaces)
- log and ping remote host
- ssh to remote host and initiate script to cause power cycle (with logging)
- shut down server (the previous script will fire it back up)

After running this for about 24 hours, the logs have 275 entries for each of the
steps before bringing up the interface.  There were 233 entries for the
remaining steps.   This indicates that out of 275 attempts, 233 succeeded and 42
failed.

I am retesting with an older kernel to see what rate the error is happening with
this server and network.
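
The pass/fail count above comes from comparing how many log entries exist for the steps before and after the interface is brought up. A minimal Python sketch of that tally; the log line format and the marker strings are hypothetical, since the actual script and logs are not attached to this bug:

```python
from collections import Counter

# Hypothetical marker strings; the real script's log format is not shown here.
BEFORE = "before bringing up interface"
AFTER = "after bringing up interface"


def tally(log_lines):
    """Count attempts, successes and failures from per-step log entries.

    An attempt is a BEFORE entry and a success is an AFTER entry. A boot that
    dies during bring-up leaves a BEFORE entry with no matching AFTER entry.
    """
    counts = Counter()
    for line in log_lines:
        if BEFORE in line:
            counts["attempts"] += 1
        elif AFTER in line:
            counts["successes"] += 1
    counts["failures"] = counts["attempts"] - counts["successes"]
    return counts


if __name__ == "__main__":
    # Synthetic log that reproduces the counts reported in this comment.
    log = []
    for boot in range(275):
        log.append(f"boot {boot}: {BEFORE}")
        if boot < 233:
            log.append(f"boot {boot}: {AFTER}")
    result = tally(log)
    print(result["attempts"], result["successes"], result["failures"])
```

With the counts in this comment (275 and 233) the tally gives 42 failures, about 15% of attempts.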


Comment 27 David Knierim 2006-03-08 08:18:10 EST
I retested the same configuration, but only changed the kernel to the kernel we
normally use (2.4.21-32.0.1.ELsmp).  The scripts attempted to bring up the
ethernet interface 29 times.   The server hung twice and rebooted a third time
without bringing up the interface.  The test kernel never hung when attempting
to bring up the interface.  However, the failure rate does not appear to be
improved by the new test kernel.
Comment 28 John W. Linville 2006-05-17 10:12:59 EDT
RHEL3 is now late enough in its life cycle that I don't think this will ever 
be fixed.  I'm sorry. 
