Bug 320841
| Summary: | [Broadcom 5.1 Regression] tg3 3.77 to 3.79 polarity-bit-revert patch | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Andrius Benokraitis <andriusb> | ||||
| Component: | kernel | Assignee: | Andy Gospodarek <agospoda> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Martin Jenner <mjenner> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 5.1 | CC: | amit_bhutani, anaconda-maint-list, andriusb, bill.hayes, bugproxy, dcantrell, dchapman, ddomingo, eriley, jbaker, jbaron, jeanne.colon-bonet, jfeeney, jtorrice, konradr, mcarlson, mchan, peterm, pgraner, poelstra, rick.bieber, rick.hester, rpacheco, tao, wwlinuxengineering | ||||
| Target Milestone: | rc | Keywords: | OtherQA, Regression | ||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2007-10-10 19:42:25 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 225466, 239782 | ||||||
| Bug Blocks: | 221461, 222082 | ||||||
Description
Andrius Benokraitis
2007-10-05 20:55:24 UTC
Created attachment 218021: Patch to fix regression in tg3 v. 3.79
Patch moved from bug 225466.
I discovered that I could use ddiskit to create tg3 drivers to test in the installer environment. I had built several tg3 driver update disk images and was happily testing to isolate the problem patch when the problem seemed to change. I had moved from trying to isolate the problem on snapshot 8 to rc1.
So at this point I was a little confused about what I really knew and did not know. I went back and retested, with the following results:
rhel5.1s1 - worked
rhel5.1s2 - worked
rhel5.1s3 - failed (and retry OK many times and it does not come up)
kernel: 2.6.18-41.el
tg3 patches added in this release
Fix msi issue with kexec/kdump.
Update version to 3.78.
Add missing NVRAM strapping.
Enable auto MDI.
Fix the polarity bit.
Fix irq_sync race condition.
rhel5.1s4
rhel5.1s5
rhel5.1s6 - failed (and retry many times and it does not come up) 2.6.18-45.el5
rhel5.1s7
rhel5.1s8 - failed (and retry many times and it does not come up) 2.6.18-48.el5
tg3 patches added in this release: None!
rhel5.1rc1 - works after hitting "OK" twice on the "Configure TCP/IP" screen
kernel 2.6.18-52.el5
So something in rc1 caused the original problem to change, but I do not see any
changes to the tg3 driver specifically. One bonding patch was removed and
3 other non-tg3 networking driver changes were made.
So there is still a startup problem: you have to hit "OK" twice on the
"Configure TCP/IP" screen to get the tg3 ports to come up:
Welcome to Red Hat Enterprise Linux Server
+----------------+ Configure TCP/IP +----------------+
| |
| [*] Enable IPv4 support |
| (*) Dynamic IP configuration (DHCP) |
| ( ) Manual configuration |
| |
| [*] Enable IPv6 support |
| (*) Automatic neighbor discovery (RFC 2461) |
| ( ) Dynamic IP configuration (DHCP) |
| ( ) Manual configuration |
| |
| +----+ +------+ |
| | OK | | Back | |
| +----+ +------+ |
| |
| |
+----------------------------------------------------+
<Tab>/<Alt-Tab> between elements | <Space> selects | <F12> next screen
What happened in rc1 to change this problem?
What is likely causing the tg3 ports not to start up the first time?
I will try more ifup/ifdown/rmmod/modprobe testing with rc1 to see if I can
learn anything more.
IIRC, there were some anaconda issues related to network configuration that were supposed to be addressed in RC1. I wonder if these are more at fault than the tg3 polarity fix?
Looks like those were already fixed (though it doesn't mean this isn't a regression).
Bill, thanks for doing all that testing! Were you testing only the kernels or the full distro? I'll add some anaconda people to this list to see if they can address the "Double <ENTER>" aspect of this issue.
Please test against RC3. Bug #230525 caused many types of network installs to fail, exhibiting various bizarre behavior.
I tested with the 3.71b, 3.81c and the rhel5.1-rd1 (3.80) using driver update disks and they all behave the same. When you first hit OK on the "Configure TCP/IP" screen in Anaconda, it does the following: (a) puts up the message "Sending request for IP information for eth12", (b) sends no DHCP traffic from the system to the DHCP server, and (c) after 1 to 1.5 seconds returns to the "Configure TCP/IP" screen. Hitting OK a second time does cause DHCP traffic, and Anaconda then continues on to ask for the install server and directory.
Sounds like there is some confusion on the RH side here based on a conversation with Ron P. Bill tested this on "RC3" bits sent out last week. He tried both with and without a driver disk; both gave the same problem with anaconda asking for TCP/IP configuration twice.
I am trying to reproduce this in the Red Hat lab now.
Anaconda waits for the link to be up before we send out the DHCP request. We check both via ethtool and via the MII registers. If either indicates the link is ready, we proceed with the DHCP request. We wait up to 5 seconds, checking the link status each second, and continue as soon as either method indicates the link is ready. This hasn't changed in RHEL-5 during the 5.1 development cycle, so my guess is that it's a driver problem.
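
The link wait described in the comment above can be sketched as a small polling loop. This is only an illustration in Python, not Anaconda's loader code: `wait_for_link`, `check_ethtool`, and `check_mii` are hypothetical names, and the two link checks are passed in as callables so the loop can be exercised without real hardware.

```python
import time


def wait_for_link(check_ethtool, check_mii, attempts=5, interval_s=1,
                  sleep=time.sleep):
    """Poll two link checks; return True as soon as either reports link up.

    check_ethtool and check_mii are hypothetical callables returning a bool.
    The defaults mirror the "up to 5 seconds, once per second" description.
    """
    for _ in range(attempts):
        # Either method reporting link-up is enough to proceed with DHCP.
        if check_ethtool() or check_mii():
            return True
        sleep(interval_s)
    # Link never came up inside the window; the caller would not send DHCP.
    return False


if __name__ == "__main__":
    # Simulated port whose MII status only reports link on the third poll.
    polls = {"count": 0}

    def mii_status():
        polls["count"] += 1
        return polls["count"] >= 3

    print(wait_for_link(lambda: False, mii_status, sleep=lambda s: None))
```

Under these assumptions a port that takes longer than the polling window to report link would fail the first attempt and could succeed on a retry. The sketch does not model the 1 to 1.5 second return to the "Configure TCP/IP" screen reported earlier in this bug.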
When you get to the Configure TCP/IP screen, what happens if you wait 10 or 15 seconds and then select OK? Does it work after just one attempt or do you have to perform two attempts before it succeeds?
I ran the test that David suggested: I waited 30 seconds after the Configure TCP/IP screen came up and then hit F12. Same behavior: the link does NOT come up. Hitting F12 again, the link comes up.
The system I have been seeing this problem on has 12 Intel and 4 Broadcom Ethernet ports. I just tried this on a different system that has only 4 Broadcom Ethernet ports, and I did not see the problem (single OK). Both systems are connected to the pass-thru interconnect modules. So this suggests that the problem may be related to the number of Ethernet ports.
I ran another test on the system with 16 Ethernet ports, and it will install without the double OKs if you use the old 'ethtool="speed=1000, duplex=full, autoneg=off"' trick on the kernel command line.
What is the max # of devices that you *can* see before you run into this bug? To date, we have another partner who has successfully tested up to 15 NICs, hence the request in comment #13.
I tried an rx4640 with 6 LAN cards that had 13 ports, and they all worked the
first time. No need to hit OK a second time here.
eth0 - S2io Inc. Xframe II 10Gbps Ethernet - OK first time
eth1 - S2io Inc. Xframe 10 Gigabit Ethernet PCI-X - OK first time
eth2 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth3 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth4 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth5 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth6 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth7 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth8 - Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet - OK first time
eth9 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth10 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth11 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth12 - Digital Equipment Corporation DECchip 21142/43 - OK first time
I will now try removing Intel mezz cards from the failing system to see
where the limit is.
Eliminating one of the Intel mezz NIC cards (4 ports) allowed 1 or 2 of the Broadcom ports to work on the first OK on the Configure TCP/IP screen. This configuration has 12 ports in the system: 8 e1000 and 4 tg3 ports.
Eliminating an additional Intel mezz NIC card (4 ports) enabled all the ports to come up the first time with a single OK. This configuration has 8 ports in the system: 4 e1000 and 4 tg3 ports.
With the 12-port configuration I got either 1 or 2 Broadcom ports to work the first time, depending on the order in which I tried to bring up the ports! See the details below if this is not clear.
12 ports - 8 Intel / 4 Broadcom
Cycle thru Anaconda screens as follows:
Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back
(in order tried)
eth11 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth10 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth9 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth8 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth7 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth6 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth5 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
Reboot the box
Cycle thru Anaconda screens as follows:
Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back
(in order tried)
eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth5 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth6 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth7 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth8 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth9 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth10 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth11 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
Remove another Intel mezz NIC card
Cycle thru Anaconda screens as follows:
Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back
(in order tried)
eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth5 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth6 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth7 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
I ran a test on an rx2660 server with 14 Ethernet ports, and both of the Broadcom ports required hitting OK twice on the "Configure TCP/IP" screen. This is a rack server and the other failures were seen on blade servers, so this is a hardware-independent problem. The problem has only been seen on Broadcom ports, but that might just have to do with the tg3 driver being loaded last, etc.
Cycle thru Anaconda screens as follows:
Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back
(in order tried)
eth0 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth1 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth2 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth3 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth4 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth5 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth6 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth7 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth8 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth9 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth10 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth11 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth12 - Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet - OK twice
eth13 - Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet - OK twice
I also ran some tests today with kickstart, and this problem will affect users trying to automate installation. It also directly impacts HP's internal ability to preload RHEL 5.1 onto servers. When kickstart fails it just sits on the "Configure TCP/IP" screen. I have seen the kickstart failure on an rx2660 with 14 Ethernet ports and a bl860c with 16 Ethernet ports.
Revising release note for Release Notes Updates:
<quote>
Installing Red Hat Enterprise Linux 5 on HP BL860c blade systems may hang during
the IP information request stage. This issue manifests when you have to select
OK twice on the Configure TCP/IP screen.
If this occurs, reboot and perform the installation with Ethernet
autonegotiation disabled. To do this, use the parameter ethtool="autoneg=off"
when booting from the installation media. Doing so does not affect the final
installed system.
</quote>
please advise if any further revisions are required. thanks!
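
The `ethtool` boot parameter quoted in this bug carries a short list of `key=value` link settings. As a rough illustration of that format only, the Python sketch below splits such a string into a dictionary. `parse_ethtool_option` is a hypothetical helper, not the installer's own parser, and accepting commas and/or spaces as separators is an assumption based on the two strings quoted in this bug.

```python
import re


def parse_ethtool_option(value):
    """Split an ethtool boot option value into a dict of settings.

    Hypothetical helper for illustration; separators (commas and/or
    whitespace) are assumed from the strings quoted in this bug.
    """
    settings = {}
    for item in re.split(r"[,\s]+", value.strip()):
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"expected key=value, got {item!r}")
        settings[key] = val
    return settings


if __name__ == "__main__":
    # Workaround string from the release note draft.
    print(parse_ethtool_option("autoneg=off"))
    # Longer form used on the 16-port system earlier in this bug.
    print(parse_ethtool_option("speed=1000, duplex=full, autoneg=off"))
```

Both forms disable autonegotiation; the longer one additionally pins speed and duplex.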
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Red Hat Management has concluded that this be deferred to RHEL 5.2, with a planned fix to be ready in the 3.8x version of the tg3 driver. A workaround will be stated in the online release notes for RHEL 5.1 (see Comment #21).
*** This bug has been marked as a duplicate of 253344 ***
Does someone need to create an Issue Tracker corresponding to this Bugzilla?
Bill, I don't think an IT ticket would be necessary for this specific bug since we will address this problem in 5.2, but if you would like to open one and attach it to bug 253344, that seems reasonable.
Since fixed, replacing note with the following (under "Resolved Issues"):
<quote>
Previous versions of Red Hat Enterprise Linux 5 on HP BL860c blade systems could hang during the IP information request stage of installation. When this occurred, you were required to reboot and perform the installation with Ethernet autonegotiation disabled. This issue is now fixed in this update.
</quote>
Please advise if any further revisions are required.
This issue ended up being an Anaconda problem and has been fixed in RHEL 5.2. The Anaconda work was done under Bug 429968: Anaconda stage 1 installer does NOT work with network installs on ports about eth10.
*** This bug has been marked as a duplicate of 429968 ***