-- Additional comment from bill.hayes on 2007-10-03 14:21 EST --

There seems to be a regression in either the tg3 driver or anaconda, because we can't install from any of the Broadcom links in our BL860c blades with RHEL 5.1 Snapshot 8 with the pass-thru modules. This is something that used to work; see comment #28. When we try to install from the Broadcom links, a message "Sending request for IP information for eth15" appears for just a split second before it comes back to the same "Configure TCP/IP" screen. It seems that the Broadcom links are failing to autonegotiate, or autonegotiation is taking too long from anaconda's perspective. If I use the kernel command line option 'ethtool="speed=1000, duplex=full, autoneg=off"' then I can install from the Broadcom ports. The Broadcom ports seem to work fine once Linux is installed, so maybe this is indeed an anaconda issue.

-- Additional comment from andriusb on 2007-10-03 14:37 EST --

Bill @ HP: I think a new bugzilla is needed, especially since it may be anaconda related, and not kernel driver related, which is what this bugzilla tracked back to the original public Beta.

-- Additional comment from mchan on 2007-10-03 15:00 EST --

Bill, is the serdes linking up within 5 seconds after Linux is installed? Does anaconda also wait 5 seconds for link up?

-- Additional comment from agospoda on 2007-10-03 15:07 EST --

We should open a new bug for this -- it sounds like the driver is working fine. As a side note: I recently found a problem on rhel4 that appears when using the latest tg3 driver and an older ethtool, so if 5.1 snap 8 doesn't have a new enough ethtool we could have issues. The basic problem is that running:

# ethtool -s eth0 autoneg on

with an old ethtool and a new tg3 driver can result in the port being disabled for 1G speed. A tg3 driver update now strictly follows the settings used by ethtool.
Since ethtool-5.1 (the version we are shipping with rhel 5.1) enables 1000Mbit, we should be OK for 5.1, but anyone running an older version of ethtool with the newer kernel will need to update ethtool.

-- Additional comment from dcantrell on 2007-10-03 15:34 EST --

anaconda waits for up to 5 seconds for link up. We wait for 1 second and check the status. If the status is not up, we continue. The loop runs up to 5 times before we fail and say there is no network available. So you get a 5 second wait at most, but we check each second.

-- Additional comment from bill.hayes on 2007-10-03 17:54 EST --

I went back and checked to see when it started failing. Snapshot 2 and earlier work fine installing over the BCM5704S and the Pass-Thru module. Snapshot 3 and later fail. I checked the kernels to see whether any tg3-related changes were introduced between these releases, and 2 new tg3 patches were added:

+Patch21712: linux-2.6-net-tg3-small-update-for-kdump-fix.patch
+Patch21733: linux-2.6-net-tg3-pci-ids-missed-during-backport.patch

The linux-2.6-net-tg3-pci-ids-missed-during-backport.patch is very small and could not be the source of the problem. On the other hand, linux-2.6-net-tg3-small-update-for-kdump-fix.patch is 507 lines and could be. Here is the description of that change:

From: Andy Gospodarek <gospo>
Subject: [RHEL5.1 PATCH] tg3: small update for kdump fix
Date: Fri, 20 Jul 2007 14:14:38 -0400
Bugzilla: 239782
Message-Id: <20070720181437.GL28248.redhat.com>
Changelog: [net] tg3: small update for kdump fix

This pulls in 6 changes since my most recent tg3 post. Most importantly, this fixes a kdump issue found by a partner a while ago. Michael Chan contacted me this week and asked me to include this before I even found out about the partner problem, so this worked out nicely.
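The link-wait loop dcantrell describes (wait 1 second, check status, repeat up to 5 times) can be sketched roughly as follows. This is an illustration, not anaconda's actual code; the sysfs-based `link_is_up` here is a stand-in for the driver queries anaconda really performs.

```shell
#!/bin/sh
# Rough sketch of the installer's link-wait loop: sleep 1 second,
# check link status, repeat up to 5 times before giving up.

link_is_up() {
    # Stand-in link check via sysfs; anaconda itself queries the
    # driver through the ethtool and MII interfaces instead.
    [ "$(cat "/sys/class/net/$1/carrier" 2>/dev/null)" = "1" ]
}

wait_for_link() {
    dev="$1"
    tries=0
    while [ "$tries" -lt 5 ]; do
        sleep 1
        if link_is_up "$dev"; then
            return 0    # link came up within the ~5 second window
        fi
        tries=$((tries + 1))
    done
    return 1            # no link; installer reports no network available
}
```

If the serdes takes longer than about 5 seconds to autonegotiate, a loop like this gives up and drops back to the Configure TCP/IP screen, which matches the behavior Bill reports.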
Here are the upstream commits included:

commit ee6a99b539a50b4e9398938a0a6d37f8bf911550
Author: Michael Chan <mchan>
[TG3]: Fix msi issue with kexec/kdump.

commit 15028aad00ddf241581fbe74a02ec89cbb28d35d
Author: Michael Chan <mchan>
[TG3]: Update version to 3.78.

commit 70b65a2d628d2e66bbf044bb764be64949f3580c
Author: Matt Carlson <mcarlson>
[TG3]: Add missing NVRAM strapping.

commit 9ef8ca99749784644602535691f8cf201ee2a225
Author: Matt Carlson <mcarlson>
[TG3]: Enable auto MDI.

commit e8f3f6cad7e423253090887bc4afe7bc844162da
Author: Matt Carlson <mcarlson>
[TG3]: Fix the polarity bit.

commit 469665459d26da8d0b46c70d070da1e192e48e46
Author: Michael Chan <mchan>
[TG3]: Fix irq_sync race condition.

This resolves BZ 239782 (and satisfies Michael's request in 225466 for the later version).

-- Additional comment from agospoda on 2007-10-03 18:46 EST --

Hmmmm, this *could* be the one:

commit 9ef8ca99749784644602535691f8cf201ee2a225
Author: Matt Carlson <mcarlson>
[TG3]: Enable auto MDI.

Bill, is there a 5906 in that box?

-- Additional comment from agospoda on 2007-10-03 18:48 EST --

Geeee, Andy, read much? :-) It seems my nemesis, the 5704S, is the one failing....

-- Additional comment from mchan on 2007-10-03 19:09 EST --

The polarity fix sounds more likely since it is a 5704S. Matt, what do you think? Bill, does it consistently link up within 5 seconds after Linux is installed?

-- Additional comment from bill.hayes on 2007-10-03 19:18 EST --

Michael, yes, the link comes up consistently within 5 seconds (since it's RHEL with DHCP...). I did several iterations of '/etc/init.d/network restart' earlier today and the link came up 97% of the time. The 3% of failures can be attributed to a hardware problem on the pass-thrus that I am using.

-- Additional comment from mcarlson on 2007-10-03 19:41 EST --

Bill, I agree with Michael. It sounds like the polarity bit change is the most likely culprit. Can we try reverting that change and see if it fixes the problem?
-- Additional comment from andriusb on 2007-10-03 21:30 EST --

It looks like the change that bumped this to 3.79 in bug 239782 is the main culprit.

-- Additional comment from mchan on 2007-10-03 21:56 EST --

Matt and I went over the "polarity" patch in great detail. The net effect of that patch on the 5704S is that a redundant IO to set the polarity bit in the mac_mode register has been eliminated. Because it was redundant, it theoretically should have no effect. The fact that it works 97% of the time after install also supports that. It's possible that the patch has a very subtle effect that we don't understand yet, so the only way to know for sure is to revert the patch and see what happens. We don't have that machine with the pass-through PHY, so we are unable to do the experiment here.

-- Additional comment from agospoda on 2007-10-04 08:41 EST --

Created an attachment (id=215741): tg3-polarity-bit-fix.patch

Bill, I've attached the patch in question -- can you try building a kernel with just this patch reverted?

-- Additional comment from agospoda on 2007-10-05 10:48 EST --

Created an attachment (id=217581): tg3-polarity-bit-fix2.patch

Wow, that patch was worthless. I'm also going to integrate this into my latest rhel5 test kernel.

-- Additional comment from agospoda on 2007-10-05 10:54 EST --

Created an attachment (id=217591): tg3-polarity-bit-revert.patch

This one doesn't have to be applied with a '-R', so it can just be put in linux-kernel-test.patch in the srpm.

-- Additional comment from bill.hayes on 2007-10-05 12:15 EST --

Sorry for the delay; I am building the kernel now and then I will test it. Since the problem I was seeing occurs at install time, I am not sure what you expect me to see with a patched kernel. I suspect we will only see the problem with an install image containing the patched kernel. I will see if I can find any difference between the patched kernel and the original kernel. Is there anything you want me to test specifically?
-- Additional comment from mchan on 2007-10-05 13:21 EST --

So what's the difference between install time and run time? Perhaps during install time, the driver is always loaded right after a PCI reset or after the PXE driver. Bill, instead of doing a network restart, can you do repeated reboots to see if there's any issue?

-- Additional comment from agospoda on 2007-10-05 13:25 EST --

Bill, you've got at least 2 options to test this (or any other kernel) with the installer:

1. Take the vmlinuz that would normally be installed in /boot and put it in the appropriate spot on your PXE server, so you use it with PXE boot and start your install.

2. Copy the vmlinuz that is created and the installer initrd.img to /boot on one of your running systems and make the necessary entry in grub.conf so you can boot it. This might be best since you wouldn't even have to install -- just try to start the network install and see if you get link-up as expected.

-- Additional comment from poelstra on 2007-10-05 13:27 EST --

Given all the discussion above, it is far from clear whether or not this bug is considered fixed for 5.1. Changing status to ASSIGNED.

-- Additional comment from mchan on 2007-10-05 13:48 EST --

Bill, I suggest that you try Andy's suggestions using the unpatched kernel first to see if you can reproduce the problem. Once reproduced, then try switching to the patched kernel with the polarity patch reverted.

-- Additional comment from agospoda on 2007-10-05 15:32 EST --

I updated my test kernels with a revert of the tg3 polarity patch. As always, you can get them here: http://people.redhat.com/agospoda/#rhel5

-- Additional comment from andriusb on 2007-10-05 16:48 EST --

This bug is getting far too complicated. I'm going to clone a BZ based on the fact that it is a regression, and that it appears in the 3.79 version of the tg3 driver rather than 3.77.
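Option 2 above might look something like the sketch below. The kernel version, file names, and grub entry title are all illustrative placeholders, and the stanza is written under a scratch directory; in real use you would copy the files into /boot and append the stanza to /boot/grub/grub.conf.

```shell
#!/bin/sh
# Sketch of staging an installer kernel + initrd in /boot and adding
# a grub.conf entry that boots straight into the network install.
# Version, paths, and title below are illustrative assumptions.

KVER=2.6.18-48.el5                 # hypothetical test-kernel version
BOOTDIR=${BOOTDIR:-$(mktemp -d)}   # stand-in for /boot in this sketch

# After copying vmlinuz-$KVER and the installer initrd.img into
# $BOOTDIR, generate the boot stanza:
cat >> "$BOOTDIR/grub.conf" <<EOF
title Installer test kernel ($KVER)
	root (hd0,0)
	kernel /vmlinuz-$KVER
	initrd /installer-initrd.img
EOF
```

Booting this entry starts the installer's stage 1 directly, so the link-up behavior can be checked without completing an install.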
Created attachment 218021 [details]: Patch to fix regression in tg3 v. 3.79

Patch moved from bug 225466.
I discovered that I could use ddiskit to create tg3 drivers to test in the installer environment. I had built different tg3 driver update disk images and was happily testing to isolate the problem patch when the problem seemed to change. I had moved from trying to isolate the problem on snapshot 8 to rc1, so at that point I was a little confused about what I really knew and did not know. So I went back and retested, getting the following:

rhel5.1s1 - worked
rhel5.1s2 - worked
rhel5.1s3 - failed (and retrying many times does not bring it up)
  kernel: 2.6.18-41.el
  tg3 patches added in this release:
    Fix msi issue with kexec/kdump.
    Update version to 3.78.
    Add missing NVRAM strapping.
    Enable auto MDI.
    Fix the polarity bit.
    Fix irq_sync race condition.
rhel5.1s4, rhel5.1s5, rhel5.1s6 - failed (and retrying many times does not bring it up)
  kernel: 2.6.18-45.el5
rhel5.1s7, rhel5.1s8 - failed (and retrying many times does not bring it up)
  kernel: 2.6.18-48.el5
  tg3 patches added in this release: None!
rhel5.1rc1 - works after hitting "OK" twice on the "Configure TCP/IP" screen
  kernel: 2.6.18-52.el5

So something in rc1 caused the original problem to change, but I do not see any changes to the tg3 driver specifically. There was one bonding patch removed, and 3 other non-tg3 networking driver changes were made.
So there is still a startup problem where you have to hit "OK" twice on the "Configure TCP/IP" screen to get the tg3 ports to come up:

Welcome to Red Hat Enterprise Linux Server

+-----------------+ Configure TCP/IP +-----------------+
|                                                      |
|  [*] Enable IPv4 support                             |
|      (*) Dynamic IP configuration (DHCP)             |
|      ( ) Manual configuration                        |
|                                                      |
|  [*] Enable IPv6 support                             |
|      (*) Automatic neighbor discovery (RFC 2461)     |
|      ( ) Dynamic IP configuration (DHCP)             |
|      ( ) Manual configuration                        |
|                                                      |
|        +----+          +------+                      |
|        | OK |          | Back |                      |
|        +----+          +------+                      |
|                                                      |
+------------------------------------------------------+

<Tab>/<Alt-Tab> between elements | <Space> selects | <F12> next screen

What happened in rc1 to change this problem? What is likely causing the tg3 startup problem that makes it not start up the first time? I will try more ifup/ifdown/rmmod/modprobe testing with rc1 to see if I can learn anything more.
IIRC, there were some anaconda issues related to network configuration that were supposed to be addressed in RC1 -- I wonder if those are more at fault than the tg3 polarity fix?
Looks like those were already fixed (though it doesn't mean this isn't a regression). Bill, Thanks for doing all that testing! Were you testing only the kernels or the full distro? I'll add some anaconda people to this list to see if they can address the "Double <ENTER>" aspect of this issue.
Please test against RC3; bug #230525 caused many types of network installs to fail, exhibiting various bizarre behavior.
I tested with the 3.71b, 3.81c, and rhel5.1-rc1 (3.80) drivers using driver update disks, and they all behave the same. When you first hit OK on the "Configure TCP/IP" screen in anaconda, it does the following: (a) puts up the message "Sending request for IP information for eth12", (b) sends no DHCP traffic from the system to the DHCP server, and (c) after between 1 and 1.5 seconds comes back to the "Configure TCP/IP" screen. Hitting OK the second time does cause DHCP traffic, and anaconda then continues on to ask for the install server and directory.
Sounds like there is some confusion on the RH side here, based on a conversation with Ron P. Bill tested this on the "RC3" bits sent out last week. He tried both with and without a driver disk; both gave the same problem, with anaconda asking for TCP/IP configuration twice. I am trying to reproduce this in the Red Hat lab now.
Anaconda waits for the link to be up before we send out the DHCP request. We check both the ethtool interface and the MII registers; if either indicates the link is ready, we proceed with the DHCP request. We wait up to 5 seconds, checking the link status each second, and continue as soon as either method indicates the link is ready. This hasn't changed in RHEL-5 during the 5.1 development cycle, so my guess is that it's a driver problem. When you get to the Configure TCP/IP screen, what happens if you wait 10 or 15 seconds and then select OK? Does it work after just one attempt, or do you have to perform two attempts before it succeeds?
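The two checks described above could be approximated from a shell prompt as below. Anaconda itself talks to the driver through ioctls; `ethtool` and `mii-tool` are just the command-line equivalents, and the `grep` patterns match those tools' normal output. This is a sketch, not anaconda's code.

```shell
#!/bin/sh
# Approximate anaconda's two link checks: an ethtool-style query and
# an MII-register-style query. Succeeds if either reports link up.

link_ready() {
    dev="$1"
    # ethtool-style check
    if ethtool "$dev" 2>/dev/null | grep -q 'Link detected: yes'; then
        return 0
    fi
    # MII-register-style check
    if mii-tool "$dev" 2>/dev/null | grep -q 'link ok'; then
        return 0
    fi
    return 1
}
```

Running `link_ready eth12` by hand on an affected box, at the moment anaconda gives up, would show whether either method ever reports the link as up.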
I ran the test that David suggested: I waited 30 seconds after the Configure TCP/IP screen came up and then hit F12. Same behavior: the link does NOT come up. Hitting F12 again, the link comes up.
The system I have been seeing this problem on has 12 Intel and 4 Broadcom Ethernet ports. I just tried this on a different system that has only 4 Broadcom Ethernet ports, and I did not see the problem (single OK). Both systems are connected to the pass-thru interconnect modules. So this suggests that maybe this is related to the number of Ethernet ports? I ran another test on the system with 16 Ethernet ports, and it will install without the double OKs if you use the old 'ethtool="speed=1000, duplex=full, autoneg=off"' trick on the kernel command line.
What is the max # of devices that you *can* see before you run into this bug?
To date, we have another partner who has successfully tested up to 15 NICs, hence the request in comment #13.
I tried an rx4640 with 6 LAN cards providing 13 ports, and they all worked the first time. No hitting OK a second time here.

eth0 - S2io Inc. Xframe II 10Gbps Ethernet - OK first time
eth1 - S2io Inc. Xframe 10 Gigabit Ethernet PCI-X - OK first time
eth2 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth3 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth4 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth5 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth6 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth7 - Intel Corporation 82546GB Gigabit Ethernet Controller - OK first time
eth8 - Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet - OK first time
eth9 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth10 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth11 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth12 - Digital Equipment Corporation DECchip 21142/43 - OK first time

I will now try removing Intel mezz cards from the failing system to see where the limit is there.
Eliminating one of the Intel mezz NIC cards (4 ports) allowed 1 or 2 of the Broadcom ports to work on the first OK on the Configure TCP/IP screen. This configuration leaves 12 ports in the system: 8 e1000 and 4 tg3. Eliminating an additional Intel mezz NIC card (4 ports) enabled all the ports to come up the first time with a single OK. This configuration leaves 8 ports in the system: 4 e1000 and 4 tg3. With the 12-port configuration, I got either 1 or 2 Broadcom ports to work the first time depending on the order in which I tried to bring up the ports! See the details below if this is not clear.

12 ports - 8 Intel / 4 Broadcom

Cycle thru anaconda screens as follows: Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back (in order tried)

eth11 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth10 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth9 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth8 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth7 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth6 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth5 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time

Reboot the box.

Cycle thru anaconda screens as follows: Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back (in order tried)

eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth5 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth6 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth7 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth8 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth9 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth10 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice
eth11 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK twice

Remove another mezz NIC card.

Cycle thru anaconda screens as follows: Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back (in order tried)

eth0 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth1 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth2 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth3 - Intel Corporation 82571EB Quad Port Gigabit Mezzanine Adapter - OK first time
eth4 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth5 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth6 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
eth7 - Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet - OK first time
I ran a test on an rx2660 server with 14 Ethernet ports, and both of the Broadcom ports required hitting OK twice on the "Configure TCP/IP" screen. This is a rack server and the other failures were seen on blade servers, so this is a hardware-independent problem. So far the problem has only been seen on Broadcom ports, but that might just have to do with the tg3 driver being loaded last, etc.

Cycle thru anaconda screens as follows: Networking Device (selecting a different device) -> Configure TCP/IP -> HTTP Setup -> Back (in order tried)

eth0 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth1 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth2 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth3 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth4 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth5 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth6 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth7 - Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) - OK first time
eth8 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth9 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth10 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth11 - Digital Equipment Corporation DECchip 21142/43 - OK first time
eth12 - Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet - OK twice
eth13 - Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet - OK twice
I also ran some tests today with kickstart; this will affect users trying to automate installation. It also directly impacts HP's internal ability to preload RHEL 5.1 onto servers. When kickstart fails, it just sits at the "Configure TCP/IP" screen. I have seen the kickstart failure on an rx2660 with 14 Ethernet ports and a bl860c with 16 Ethernet ports.
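For kickstart installs hit by this, the same workaround mentioned earlier in the thread (disabling autonegotiation via the ethtool= boot option) can be passed on the boot line. This is only an illustration; the ks= URL below is a placeholder, and the exact prompt depends on the platform bootloader.

```
# Illustrative boot line for an automated kickstart install, applying
# the autonegotiation workaround from earlier in this thread:
boot: linux ks=http://server/ks.cfg ethtool="speed=1000, duplex=full, autoneg=off"
```

With autonegotiation forced off, the link is up before anaconda's first DHCP attempt, so the install no longer stalls at the Configure TCP/IP screen.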
Revising release note for Release Notes Updates:

<quote> Installing Red Hat Enterprise Linux 5 on HP BL860c blade systems may hang during the IP information request stage. This issue manifests when you have to select OK twice on the Configure TCP/IP screen. If this occurs, reboot and perform the installation with Ethernet autonegotiation disabled. To do this, use the parameter ethtool="autoneg=off" when booting from the installation media. Doing so does not affect the final installed system. </quote>

Please advise if any further revisions are required. Thanks!
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Red Hat Management has concluded that this will be deferred to RHEL 5.2, with a planned fix in the 3.8x version of the tg3 driver. A workaround will be stated in the online release notes for RHEL 5.1 (see comment #21).

*** This bug has been marked as a duplicate of 253344 ***
Does someone need to create an Issue Tracker corresponding to this Bugzilla?
Bill, I don't think an IT ticket will be necessary for this specific bug since we will address the problem in 5.2, but if you would like to open one and attach it to bug 253344, that seems reasonable.
Since this is fixed, replacing the note with the following (under "Resolved Issues"):

<quote> Previous versions of Red Hat Enterprise Linux 5 on HP BL860c blade systems could hang during the IP information request stage of installation. When this occurred, you were required to reboot and perform the installation with Ethernet autonegotiation disabled. This issue is now fixed in this update. </quote>

Please advise if any further revisions are required.
This issue ended up being an anaconda problem and has been fixed in RHEL 5.2. The anaconda work was done under bug 429968: "Anaconda stage 1 installer does NOT work with network installs on ports above eth10."
*** This bug has been marked as a duplicate of 429968 ***