Bug 442829

Summary: On-board NICs re-ordered wrt. NICs in PCI-e slot on HP DL-360-G5
Product: Red Hat Enterprise Linux 5
Reporter: Vinod Kutty <vendor-redhat>
Component: distribution
Assignee: RHEL Program Management <pm-rhel>
Status: CLOSED NOTABUG
QA Contact: Release Test Team <release-test-team-automation>
Severity: high
Docs Contact:
Priority: high
Version: 5.3
CC: akarlsso, andy.coull, apevec, bernd.bartmann, dag, danzani, ddomingo, duck, dzickus, emcnabb, fleite, hpetty, jfeeney, jlaska, k.georgiou, liko, linux-bugs, matt_domsch, notting, peterm, pjones, rlerch, robert_hentosh, syeghiay, tao, travellig, wwlinuxengineering
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-02-10 18:54:38 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 391221, 448732

Description Vinod Kutty 2008-04-17 03:45:36 UTC
Description of problem:

Testing with 5.2 beta, we've discovered that any HP DL-360-G5s that we either
upgrade from 5.1 or perform a fresh install on (PXE + kickstart) become
immediately unusable in our environment after rebooting with the new kernel,
because the on-board NICs are re-ordered wrt the add-on NICs.

We've also had some issues in the past with NIC ordering in 5.1 that we thought
pci=bfsort as a kernel arg may have helped with, but we'd have to go back and
re-check, because we don't have many 5.1 systems.

Version-Release number of selected component (if applicable):
Kernel: 2.6.18-84.el5 x86_64
Hardware: HP DL-360-G5 with Intel Pro/1000 PT dual-port PCI-e card installed in
low-profile PCI-e slot

Also tested with Intel Pro/1000 PT quad-port PCI-e card with similar results.


How reproducible:
Always (on same hardware config)

Steps to Reproduce:
1. 
a) Either:
Install RH 5.2 beta on vanilla HP DL-360-G5 with above NIC config and allow
anaconda to reboot after install

or:

b) Patch existing 5.1 system in same hardware config to 5.2  

2. Reboot with new kernel ( 2.6.18-84.el5 )

  
Actual results:
- eth2 and eth3 map to on-board NICs (bnx2)
- eth0 and eth1 map to Intel NICs in PCI-e low-profile slot


Expected results:
- eth0 and eth1 should correspond to on-board NICs

Additional info:

On a test server, lspci shows these on-board NICs:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit
Ethernet (rev 12)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit
Ethernet (rev 12)


and these Intel NICs in the low-profile PCI slot:

0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)


After a fresh install, /etc/modprobe.conf looks like this:
-------------------------------------
alias eth0 bnx2
alias eth1 bnx2
alias eth2 e1000e
alias eth3 e1000e
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
-------------------------------------

And 'ip addr' shows the following after the anaconda reboot:
-------------------------------------
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f6 brd ff:ff:ff:ff:ff:ff
3: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f7 brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:e9:d1:94 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.252/24 brd 10.1.1.255 scope global eth0
    inet6 fe80::21e:bff:fee9:d194/64 scope link
       valid_lft forever preferred_lft forever
5: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:f1:a6 brd ff:ff:ff:ff:ff:ff
6: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
-----------------------------------------------

At this point eth0/eth1 map to on-board NICs.

Also note that dmesg after this boot shows:

-------------------------------------------------
e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:0b:00.0 to 64
input: PC Speaker as /class/input/input2
0000:0a:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:63:61:f6
0000:0a:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth0: MAC: 0, PHY: 4, PBA No: d50868-003
ACPI: PCI Interrupt 0000:0b:00.1[B] -> GSI 17 (level, low) -> IRQ 177
PCI: Setting latency timer of device 0000:0b:00.1 to 64
Floppy drive(s): fd0 is 1.44M
EDAC MC: Ver: 2.0.1 Feb 29 2008
intel_rng: FWH not detected
0000:0a:00.0: eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:63:61:f7
0000:0a:00.0: eth1: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth1: MAC: 0, PHY: 4, PBA No: d50868-003
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.6.9 (December 8, 2007)
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 18 (level, low) -> IRQ 185
eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 185, node addr 001e0be9d194
ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 19 (level, low) -> IRQ 82
eth1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem fa000000, IRQ 82, node addr 001e0be9f1a6

-------------------------------------------------

Rebooting after this install-initiated reboot results in:
-------------------------------------------------
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f7 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:d1:94 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:f1:a6 brd ff:ff:ff:ff:ff:ff
6: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
-------------------------------------------------

dmesg shows:
-------------------------------------------------
0000:0a:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:63:61:f6
0000:0a:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth0: MAC: 0, PHY: 4, PBA No: d50868-003
ACPI: PCI Interrupt 0000:0b:00.1[B] -> GSI 17 (level, low) -> IRQ 177
PCI: Setting latency timer of device 0000:0b:00.1 to 64
0000:0a:00.0: eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:63:61:f7
0000:0a:00.0: eth1: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth1: MAC: 0, PHY: 4, PBA No: d50868-003
Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.6.9 (December 8, 2007)
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 18 (level, low) -> IRQ 185
EDAC MC0: Giving out device to i5000_edac.c I5000: DEV 0000:00:10.0
eth2: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 185, node addr 001e0be9d194
ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 19 (level, low) -> IRQ 82
eth3: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem fa000000, IRQ 82, node addr 001e0be9f1a6
-------------------------------------------------

At this point eth0/eth1 and eth2/3 are swapped and the system is unreachable
over the network.
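
The swap between the two 'ip addr' listings above can be checked mechanically by keying on MAC address rather than interface name. The following is only a minimal sketch: the helper names parse_ip_addr and renamed_interfaces are invented for illustration, and the sample text is a trimmed copy of the listings in this report.

```python
import re

# Trimmed copies of the two 'ip addr' listings from this report.
FIRST_BOOT = """\
2: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f6 brd ff:ff:ff:ff:ff:ff
3: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f7 brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:1e:0b:e9:d1:94 brd ff:ff:ff:ff:ff:ff
5: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:f1:a6 brd ff:ff:ff:ff:ff:ff
"""

SECOND_BOOT = """\
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f6 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:15:17:63:61:f7 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:d1:94 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop qlen 1000
    link/ether 00:1e:0b:e9:f1:a6 brd ff:ff:ff:ff:ff:ff
"""


def parse_ip_addr(text):
    """Return {interface name: MAC} for the Ethernet links in 'ip addr' output."""
    macs = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^\d+:\s+([^:\s]+):", line)
        if header:
            current = header.group(1)
            continue
        link = re.match(r"^\s+link/ether\s+([0-9a-f:]{17})", line)
        if link and current:
            macs[current] = link.group(1)
    return macs


def renamed_interfaces(before, after):
    """Return {mac: (old name, new name)} for every MAC whose name changed."""
    old_by_mac = {mac: name for name, mac in before.items()}
    new_by_mac = {mac: name for name, mac in after.items()}
    return {
        mac: (old_by_mac[mac], new_by_mac[mac])
        for mac in old_by_mac
        if mac in new_by_mac and old_by_mac[mac] != new_by_mac[mac]
    }


changes = renamed_interfaces(parse_ip_addr(FIRST_BOOT), parse_ip_addr(SECOND_BOOT))
for mac, (old, new) in sorted(changes.items()):
    print(mac, old, "->", new)
```

Keying on the MAC only serves to compare two boots of the same chassis; it is not a suggestion to pin configuration to MAC addresses.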


This is a recurring theme on x86 hardware and one that needs to be addressed
once and for all. There should be a predictable ordering of PCI-e devices,
without hacks such as hard-coding MAC addresses (unacceptable, not robust under
several scenarios).

It just happens to have regressed in 5.2 beta on this hardware.

Comment 2 Andy Gospodarek 2008-04-17 17:53:35 UTC
Vinod, I would like to clarify a few things so I can try and address this.  

1.  Are you stating there is a difference in device ordering between the system
that is running anaconda (did you hit CTRL-ALT-F2 to get that info?) and the
system after it is installed?

2.  Are you stating that simply booting a 5.2 kernel on a 5.1 system (which is
fine with me) causes eth0/1 to switch with eth2/3?

Thanks!



Comment 3 Vinod Kutty 2008-04-19 14:39:25 UTC
Andy,

I've been traveling and wanted to verify your 2nd question before responding.

RE: 1. Yes, I believe so based on the fact that it PXE boots then installs and
boots with the correct ordering just once.  One of our feature requests for many
years has been to make the whole install process more friendly for "headless"
servers. In this case, I cannot get CTRL-ALT-F2 info via the serial console ...
unless there's something new I should be aware of? For now a log file is
probably the easiest way to collect this info if you feel it's worthwhile.

RE: 2. Yes. I installed just the kernel (the prior test case used a yum update
from 5.1 -> 5.2, which pulled in other RPMs). Note that installing a new kernel
leaves the 'e1000' driver in place in /etc/modprobe.conf, but I see that
'e1000e' is actually loaded.


Comment 5 Andy Gospodarek 2008-04-21 17:11:44 UTC
Ok, let's address #2 first.  

Did your ifcfg-ethX files have entries for HWADDR=<mac address> for all 4
interfaces?  I know that users often delete these entries, so I want to check.

Comment 6 Vinod Kutty 2008-04-21 23:01:45 UTC
No.

We don't put MAC addresses in there as our FRU is an entire server. We pull
drives from a failed chassis into a replacement and so it's important that no
state is preserved on the drives that would cause probs in a new chassis with
new NICs, etc.

Comment 7 Vinod Kutty 2008-04-21 23:06:45 UTC
Also another observation: building with pci=bfsort seems to be a workaround that
works in a simple test, but I haven't tested more thoroughly. We've used it
before in RH5.1 and it seemed to be inconsistent across different hardware
platforms.

Is the DL-360-G5 part of the white list for this? (i.e. so we don't have to
explicitly specify it)

Comment 8 Andy Gospodarek 2008-04-22 00:14:00 UTC
No it is not currently in the bfsort-whitelist.

Comment 9 Bill Nottingham 2008-04-22 00:50:48 UTC
HWADDR (or similar mappings) is pretty much required in any case with multiple
types of ethernet drivers. You will not get consistent results otherwise.

Comment 10 Tim Burke 2008-04-22 02:00:41 UTC
Comment #9 suggests this is a user error.


Comment 11 Andy Gospodarek 2008-04-22 02:36:03 UTC
I agree with comment #9 and comment #10.

Comment 12 Vinod Kutty 2008-04-22 02:50:49 UTC
HWADDR is a really poor workaround for a bigger problem. It's fine in a static,
small environment, but unacceptable in a large enterprise with lots of hardware
change. How does a crash dump kernel figure out what NIC to use? What about
rescue images? Or PXE boot through OS install lifetime consistency? Or the
chassis swap capability mentioned earlier? How do you reduce human error from
replacing failed NICs?

We've been discussing this with HP/Dell ... they're working on longer term
solutions that are enterprise-ready. This is a serious issue that has languished
for years without resolution; fixing it will make Linux on x86 more enterprise friendly.

In the mean time pci=bfsort seems to have arisen from this need for something
better than hard coding MAC addresses.

The short term solution seems to be to add this to the bfsort whitelist until a
real solution can be found.
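
For context, pci=bfsort asks the kernel to sort PCI devices in breadth-first rather than depth-first order. The toy sketch below shows why the two orders can differ. The bus tree is invented purely for illustration and is not the DL-360-G5 topology; the names TOPOLOGY, depth_first and breadth_first are likewise hypothetical.

```python
from collections import deque

# Invented bus tree: bus 00 has two bridges, leading to buses 01 and 02.
# Bus 01 has a further bridge to bus 03, which holds the add-on NIC.
TOPOLOGY = {
    "00": {"devices": [], "children": ["01", "02"]},
    "01": {"devices": [], "children": ["03"]},
    "02": {"devices": ["onboard-nic"], "children": []},
    "03": {"devices": ["addon-nic"], "children": []},
}


def depth_first(topology, bus):
    """List devices by fully descending each bridge before moving to the next."""
    order = list(topology[bus]["devices"])
    for child in topology[bus]["children"]:
        order.extend(depth_first(topology, child))
    return order


def breadth_first(topology, root):
    """List devices level by level, shallowest buses first."""
    order = []
    queue = deque([root])
    while queue:
        bus = queue.popleft()
        order.extend(topology[bus]["devices"])
        queue.extend(topology[bus]["children"])
    return order


print("depth-first:  ", depth_first(TOPOLOGY, "00"))
print("breadth-first:", breadth_first(TOPOLOGY, "00"))
```

In this made-up tree the add-on NIC sits deeper but is reached earlier by a depth-first walk, so only the breadth-first order lists the on-board NIC first. Whether that matches any given server depends on its real bridge layout.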

Comment 13 Bill Nottingham 2008-04-22 03:05:51 UTC
The modules are initialized in parallel - other than changing the timing, bfsort
will not help significantly.

Comment 14 Vinod Kutty 2008-04-22 04:04:20 UTC
These are all workarounds for a prob that has tended to bite us between major
releases, but bites us now between 5.1 and 5.2beta. From our limited testing,
bfsort is essentially the least evil workaround. We are concerned about timing
issues, but would appreciate some more details. We do want the *long term*
solution to be better.

If the list of models listed here under bfsort support:
http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Release_Notes/RELEASE-NOTES-U2-x86_64-en.html

is expanded to include one more model, IMHO it will get us to a happier place in
the short term until a better solution is found. 

Please check with HP on this one as well, to confirm if they see any issue with
this. So far my conversations with them suggest it should apply to this model of
hardware and the limited tests we've done so far indicate this is true.

Comment 15 Andy Gospodarek 2008-04-22 13:55:10 UTC
I've been following some of what Dell wants to do for some of this and I find it
interesting.  I also toyed with the idea of something that will create some udev
rules during install that could then take effect when the system boots for the
first time after installation.  This might be a nice way to use pci device
ordering as a clear way to order devices after an installation.  It may not
solve the issue of device ordering when swapping hard drives if the udev rules
contain mac addresses, but that could be considered when trying to design
something that will create rules.

Comment 16 Kevin Krafthefer 2008-04-22 14:12:06 UTC
The appropriate design is still being debated; given the 5.2 release schedule,
we're going to have to address this in an EUS or ASYNC.

Comment 17 Kevin Krafthefer 2008-04-22 14:51:54 UTC
We need to release note this for 5.2

Comment 18 Bill Nottingham 2008-04-22 15:28:15 UTC
When I say timing issues:

- udev, on start, emits uevents for all the hardware in the system, in pci bus
order (IIRC)
- these events are handled in parallel
- Hence, the modules that are loaded to handle these events can race against
each other on initialization

For many modules, they do not have significant delays on startup, so it seems
'more or less' consistent. However, if the modules do have some sort of delay in
their initialization (for example, if they load firmware), then the order can
change from boot to boot.
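
The race described in this comment can be modelled very roughly as "names are handed out in the order drivers finish initialising". The sketch below is a deliberately simplified simulation, not kernel or udev behaviour: the delays, the EMIT_GAP constant and the assign_names helper are all assumptions made up for illustration, and only the PCI addresses and driver names come from this report.

```python
EMIT_GAP = 0.01  # hypothetical seconds between successive uevents

# (PCI address, driver) in PCI bus order, taken from the lspci output above.
DEVICES = [
    ("03:00.0", "bnx2"),
    ("05:00.0", "bnx2"),
    ("0b:00.0", "e1000e"),
    ("0b:00.1", "e1000e"),
]


def assign_names(probe_order, init_delay):
    """Assign ethN names in the order each device finishes initialising.

    probe_order: list of (device, driver), in the order events are emitted.
    init_delay:  {driver: seconds}, a hypothetical per-driver start-up cost.
    """
    finish_times = []
    for position, (device, driver) in enumerate(probe_order):
        finished_at = position * EMIT_GAP + init_delay[driver]
        finish_times.append((finished_at, position, device))
    finish_times.sort()
    return {
        device: "eth%d" % index
        for index, (_, _, device) in enumerate(finish_times)
    }


# Equal delays: names follow bus order.
print(assign_names(DEVICES, {"bnx2": 0.1, "e1000e": 0.1}))
# One driver is slower to initialise: the other driver's ports take eth0/eth1.
print(assign_names(DEVICES, {"bnx2": 0.5, "e1000e": 0.1}))
```

The point of the model is only that the emit order stays the same in both runs while the resulting names differ, which is why bus ordering alone cannot guarantee stable names.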


Comment 19 Bill Nottingham 2008-04-22 15:29:25 UTC
To continue, this is why we always set up networking configurations with HWADDR,
so that even if the devices show up 'out of order', they can be renamed so that
you have a consistent policy.  If you remove HWADDR, it is really best to have
*some* other mechanism of determining this (iftab, udev rules, etc.)
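
A quick way to audit which interface configs have lost their HWADDR line is sketched below. It is a minimal example under assumptions: the helper names read_ifcfg and missing_hwaddr are hypothetical, the sample file contents are invented (only the MAC is taken from this report), and real files would be read from wherever the ifcfg-ethX files live on the system.

```python
def read_ifcfg(text):
    """Parse simple KEY=value lines from an ifcfg-style file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_hwaddr(configs):
    """Given {filename: file text}, return the sorted names lacking HWADDR."""
    return sorted(
        name for name, text in configs.items()
        if not read_ifcfg(text).get("HWADDR")
    )


# Invented sample contents for illustration.
SAMPLE_CONFIGS = {
    "ifcfg-eth0": "DEVICE=eth0\nHWADDR=00:1E:0B:E9:D1:94\nONBOOT=yes\n",
    "ifcfg-eth1": "# HWADDR removed for chassis swaps\nDEVICE=eth1\nONBOOT=no\n",
}

print(missing_hwaddr(SAMPLE_CONFIGS))
```

Files reported by such a check are the ones that need some other naming mechanism (iftab, udev rules, etc.) if consistent names are required.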

Comment 20 Bernd Bartmann 2008-06-26 08:12:12 UTC
We have the same problem here, but even if we set HWADDR the
interfaces get mixed up at each reboot.
In our case we have the via-rhine module for eth0 and the sundance
module for a 4-port network card, i.e. eth1 - eth4.

Also, the problem seems to be somehow related to the init scripts and
not to the kernel as the same problem now occurs if we go back to the
last 5.1 kernel.

Comment 21 Ryan Lerch 2008-08-12 04:30:39 UTC
Tracking this bug for the Red Hat Enterprise Linux 5.3 Release Notes.

Comment 23 travellig 2008-09-23 10:25:27 UTC
Hi,

Just thought of updating the Red Hat Engineering guys on this issue.

The NIC enumeration problem is seen and confirmed on HP BL680C RHEL5u1 and in agospoda's latest test kernel-2.6.18-115.el5.gtest.56.x86_64.rpm
 
The system rebooted 182 times in a period of 14 hours (the slot I had for testing); two patterns were seen:

NIC Specs on this box (I could provide a sosreport if needed too, just let me know and ready to help/test on this hardware):

1 x Mezz BCM5708s (QUAD onboard so kernel sees 4 cards)
2 x PCIs BCM5715s

Patterns seen:

a) 178 times

eth0-eth3 (BCM5708s)
eth4-eth5 (BCM5715s)

b) 4 times

eth4-eth5 (BCM5715s) 178
eth0-eth3 (BCM5708s)

As I said above, I am ready to test anything agospoda has and ready to provide quick feedback. I have some slots available for testing but I do not own the box so the quicker the better...

travellig

Comment 24 Dario Anzani 2008-10-17 11:05:05 UTC
Hi

I think the same problem with the same hardware configuration happens also in
Fedora, there are people working on it there.

https://bugzilla.redhat.com/show_bug.cgi?id=408891

The guy from Broadcom pointed out a link, I am not sure if it is relevant, but
maybe it can help.

Thanks to all for working on the problem

Dario

Comment 25 Matt Domsch 2008-10-17 12:18:51 UTC
#408891 is completely different, but thanks for looking.

Comment 30 RHEL Program Management 2008-11-18 13:11:11 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 36 Bill Nottingham 2009-02-10 18:54:38 UTC
Just to clarify:

- This is not a regression in any new update release - all RHEL 5 releases behave the same way
- In the default way that we configure devices (with HWADDR in the ifcfg file), customers will see consistent device names across reboots. If they don't like which devices get which names, they can edit their configuration accordingly.
- If they remove the HWADDR line from the ifcfg file, they are likely to see inconsistent device names across reboots

In the next major release, udev persistent names will be used so that even without HWADDR there will be consistent names across reboots. However, that change requires changes to udev, anaconda, kudzu, and initscripts at a minimum, and is not really feasible to backport to RHEL 5 at this time.

Comment 37 Issue Tracker 2009-02-17 20:02:21 UTC
Hi,

I did some tests with udev and it worked nicely here. 
I have three network cards, so I wrote these udev rules below:
# cat /etc/udev/rules.d/99-ethernet.rules 
KERNEL=="eth*", ID=="0000:05:05.0", NAME="eth0"
KERNEL=="eth*", ID=="0000:0b:00.0", NAME="eth1"
KERNEL=="eth*", ID=="0000:0c:02.0", NAME="eth2"

That renames the interface based on PCI slot, so if you replace the
NIC board with another one, the interface name will remain the same.

It came back correctly after a few reboots.
It also worked swapping slots of two boards.
No HWADDR in configs.

I think all their systems are the same model, so the PCI IDs
would be the same, and they can work around this by writing those rules
in the %post section of a generic kickstart config.

Is this acceptable?

thanks,
Flavio

Internal Status set to 'Waiting on Support'

This event sent from IssueTracker by fleitner 
 issue 224660
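
The rules in comment 37 could be generated per hardware model rather than typed by hand. The sketch below only builds the rule text in the same syntax as that comment. The slot-to-name mapping uses the PCI addresses from the lspci output earlier in this report and is an assumption about the desired layout; the udev_rules helper is hypothetical, and the script is meant for whatever host prepares the kickstart rather than for the RHEL 5 target.

```python
# Assumed desired layout for this report's DL-360-G5: on-board bnx2 ports first.
DL360_G5_SLOTS = {
    "0000:03:00.0": "eth0",
    "0000:05:00.0": "eth1",
    "0000:0b:00.0": "eth2",
    "0000:0b:00.1": "eth3",
}


def udev_rules(slot_to_name):
    """Build udev rule text naming interfaces by PCI address.

    Uses the rule syntax shown in comment 37; rules are sorted by target name.
    """
    lines = []
    for pci_address, name in sorted(slot_to_name.items(), key=lambda item: item[1]):
        lines.append('KERNEL=="eth*", ID=="%s", NAME="%s"' % (pci_address, name))
    return "\n".join(lines) + "\n"


# The output would be written to a rules file such as the
# /etc/udev/rules.d/99-ethernet.rules used in comment 37.
print(udev_rules(DL360_G5_SLOTS), end="")
```

Because the rules key on PCI address only, they stay valid across a chassis swap within the same model, but a different model needs its own mapping.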

Comment 38 Bill Nottingham 2009-02-17 20:22:41 UTC
As a local workaround, sure. Obviously we can't genericize that across hardware and releases.

Comment 39 Andy Coull 2009-02-26 15:27:29 UTC
Hi,

  We have installed 5u1 on a BL460c. This has an internal NIC (BCM5708S) and a mezzanine quad port card (BL5715S). The MAC addresses are all set in the ifcfg-ethX files via the HWADDR field.
 
  The NIC numbering (eth0 - eth5) changes on reboot and is not consistent, irrespective of the existence of HWADDR.

  As suggested (#37) I created a udev script which appears to resolve this.

Comment 45 Matt Domsch 2011-02-08 13:35:32 UTC
As many are aware by now, we're solving this in Fedora 15 using biosdevname (latest source at http://linux.dell.com/cgi-bin/gitweb/gitweb.cgi?p=biosdevname.git;a=summary). This is way too intrusive to add to RHEL5, but there's an older copy in epel5, and I'll get a build out hopefully late this week that includes the most recent code.

Thanks,
Matt

Comment 46 Vinod Kutty 2011-03-15 01:28:47 UTC
Matt,

Thanks again for your work on this, as well as others involved. This is an important part of making our enterprise Linux environment more robust.

--
Vinod