Red Hat Bugzilla – Bug 466983
[RHEL5.3] sky2, Can not get ip address while installing from PXE
Last modified: 2009-04-09 13:32:57 EDT
Description of problem:
When installing the latest RHEL5.3-Server-20081011.nightly x86_64 via PXE, the system cannot get an IP address.
The failure happens while running through the anaconda install.
Ethernet controller: Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Use a system with sky2 driver and Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
2. Try to PXE install
The system never gets an IP address at the anaconda screen when IPv4 and IPv6 are selected.
This should work
I can install the same release from a DVD; after the reboot the network card works fine.
I can install the i386 version of the distro and PXE works fine.
I am not 100% sure that this is a kernel issue; it may be in anaconda.
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
Subsystem: Marvell Technology Group Ltd. Marvell RDK-8052
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size 10
Interrupt: pin A routed to IRQ 217
Region 0: Memory at febfc000 (64-bit, non-prefetchable) [size=16K]
Region 2: I/O ports at e800 [size=256]
Expansion ROM at febc0000 [disabled] [size=128K]
Capabilities:  Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities:  Vital Product Data
Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
Address: 00000000fee01000 Data: 40d9
Capabilities: [e0] Express Legacy Endpoint IRQ 0
Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s unlimited, L1 unlimited
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
Device: MaxPayload 128 bytes, MaxReadReq 2048 bytes
Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 2
Link: Latency L0s <256ns, L1 unlimited
Link: ASPM Disabled RCB 128 bytes CommClk+ ExtSynch-
Link: Speed 2.5Gb/s, Width x1
Capabilities:  Advanced Error Reporting
Jeff, can you provide the console output while the sky2 card tries to obtain an address? I'd like to see if anaconda reports any further errors while it's trying to dhcp. A tcpdump of the process from the server or an observing machine would also help, if possible (let's not worry about that just yet though).
Also, are you familiar with the nicdelay parameter to anaconda? It might be good to set it to 120 on the kernel command line to force a 2 minute pause after link up, just to make sure we're not encountering some weird initialization delay in the card on 64 bit systems.
Here is the console output while doing the install:
<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: Link is up at 1000Mbps, full duplex, flow control rx
<6>ADDRCONF(NETDEV_UP): eth0 link becomes ready
<6>sky2: eth0: no ipv6 routers present
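The console output above can be checked mechanically for link events. The following is a minimal sketch, assuming the message formats are exactly those quoted in this comment; the function name summarize_link_events and both regular expressions are hypothetical and only cover these sky2 and ADDRCONF lines.

```python
import re

# Patterns are assumptions based only on the console lines quoted above.
LINK_UP = re.compile(
    r"^<\d>sky2: (?P<iface>\w+): Link is up at (?P<speed>\d+)Mbps, "
    r"(?P<duplex>full|half) duplex"
)
NOT_READY = re.compile(
    r"^<\d>ADDRCONF\(NETDEV_UP\): (?P<iface>\w+) link is not ready"
)


def summarize_link_events(lines):
    """Return (not_ready_count, link_up_info) for console lines.

    link_up_info is None if no "Link is up" line was seen, otherwise a
    dict with the interface, speed in Mbps and duplex of the first one.
    """
    not_ready = 0
    link_up = None
    for line in lines:
        line = line.strip()
        if NOT_READY.match(line):
            not_ready += 1
            continue
        m = LINK_UP.match(line)
        if m and link_up is None:
            link_up = {
                "iface": m.group("iface"),
                "speed_mbps": int(m.group("speed")),
                "duplex": m.group("duplex"),
            }
    return not_ready, link_up


# The console output from this comment, copied verbatim.
CONSOLE = """\
<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: Link is up at 1000Mbps, full duplex, flow control rx
<6>ADDRCONF(NETDEV_UP): eth0 link becomes ready
<6>sky2: eth0: no ipv6 routers present
"""

if __name__ == "__main__":
    print(summarize_link_events(CONSOLE.splitlines()))
```

In this log the interface is enabled twice and reported not ready twice before the link comes up, so the link itself does negotiate; the failure is somewhere after link up.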
Adding the nicdelay=120 option to the kernel command line worked. I was able to proceed with the install.
Running the DHCP loop in anaconda indefinitely isn't a great idea to me (or really, running any loop in anaconda indefinitely isn't a great idea). We should be able to determine that the sky2 driver is still initializing the device and we should wait on that.
Is there something else we can check via an ioctl() that will tell us the driver is still initializing? That can be added to the get_link_status() code and we can wait on that to settle before checking for link up/down.
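To illustrate the "wait for it to settle" idea, here is a minimal sketch in Python. It is not the loader's get_link_status() code and it does not use an ioctl(). As a stand-in it assumes the sysfs file /sys/class/net/<iface>/carrier is available, which may not hold in the installer environment. The names read_carrier and wait_for_stable_link are hypothetical, and the sketch does not answer whether sky2 exposes an "still initializing" state at all.

```python
import time


def read_carrier(iface):
    """Assumed probe: read /sys/class/net/<iface>/carrier (1 = link up).

    Returns False if the file is missing or unreadable, for example
    while the interface is administratively down.
    """
    try:
        with open("/sys/class/net/%s/carrier" % iface) as f:
            return f.read().strip() == "1"
    except (IOError, OSError):
        return False


def wait_for_stable_link(probe, timeout=120.0, interval=1.0, stable_polls=3,
                         clock=time.time, sleep=time.sleep):
    """Poll probe() until it reports link up stable_polls times in a row.

    Returns True once the link has been stable, False if timeout seconds
    pass first. clock and sleep are injectable so the loop can be
    exercised without real delays.
    """
    deadline = clock() + timeout
    consecutive = 0
    while True:
        if probe():
            consecutive += 1
            if consecutive >= stable_polls:
                return True
        else:
            # Any "down" reading restarts the stability count.
            consecutive = 0
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # "eth0" is only an example interface name.
    print(wait_for_stable_link(lambda: read_carrier("eth0"), timeout=5.0))
```

Requiring several consecutive "up" readings, rather than a single one, is a design choice meant to ride out a link that flaps while the device is still coming up; the bounded timeout avoids the indefinite loop objected to above.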
I noticed this BZ is still in needinfo
james.antill: needinfo? (anaconda-maint). Did Comment #15 answer your question? Or were you looking for additional information?
Reply to Comment#19
1) configure a sufficiently long nicdelay timeout in your kickstart file
Despite what I said in Comment #4, I have gone back to retest this and it is _NOT_ working. I tried nicdelay values from 10 to 240; it never gets an address.
2) manually retry the operation
I can't get this installed unless I put in a second NIC to do the install. Once the installation is finished and the system reboots, the NIC works.
3) default the installer to retry indefinitely, adding the ability to specify a
timeout or retry count, and the ability to manually break out of the operation.
Even if that were the case now, it still would not work. I think we may have a bigger issue with this system.
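For reference, option 3 (retry indefinitely by default, with an optional timeout or retry count and a manual way out) could look roughly like the sketch below. This is an illustration only, not anaconda or loader code; retry_until, attempt and should_cancel are hypothetical names, and attempt stands in for whatever performs one DHCP request.

```python
import time


def retry_until(attempt, max_attempts=None, timeout=None, should_cancel=None,
                delay=1.0, clock=time.time, sleep=time.sleep):
    """Call attempt() until it returns a truthy value.

    With max_attempts=None and timeout=None the loop retries
    indefinitely. should_cancel, if given, is polled before each attempt
    so a user can break out manually. Returns (result, reason), where
    reason is "ok", "cancelled", "retries" or "timeout".
    """
    start = clock()
    tries = 0
    while True:
        if should_cancel is not None and should_cancel():
            return None, "cancelled"
        result = attempt()
        tries += 1
        if result:
            return result, "ok"
        if max_attempts is not None and tries >= max_attempts:
            return None, "retries"
        if timeout is not None and clock() - start >= timeout:
            return None, "timeout"
        sleep(delay)


if __name__ == "__main__":
    # Stand-in for a DHCP request that succeeds on the third try.
    # 192.0.2.10 is a documentation address, not a real lease.
    leases = iter([None, None, "192.0.2.10"])
    print(retry_until(lambda: next(leases), delay=0.0))
```

As noted in the reply above, no amount of retrying helps if the device never passes traffic, so a loop like this only addresses slow initialization, not the failure seen on this system.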
1) Ok, well if the nicdelay parameter doesn't work, that changes everything. If it's not about just waiting long enough for the hardware to init, then this is a driver bug or a hardware problem. Do you know if it's reproducible on another, identical sky2 card (if one is available)?
2) I assume this is an RHTS system? For the purposes of getting this system up, can't you do a serial console install?
3) I agree, it's either a hardware bug, or something is up with the driver. Has this system been installable in the past like this? i.e. if you try an older build, does it work?
I'll start looking through rhts for a similar card to try this on.
1.) We have two systems that are identical. Both of them have the issue.
2.) These are not in RHTS. They are in the 3rd floor minilab area. I can get you access.
3.) I am not sure what I did when it actually worked :/ I am thinking I had the second NIC in the system and did not realize it. I also went back to RHEL5.2 and that also failed.
Jeff, I'm trying to tinker with this system this morning, but the pxe code keeps failing. It's getting an address, but then it tries to load a file that the pxe server seems to say is unavailable. The flicker from the kvm is preventing me from getting into the bios, so I seem unable to reconfigure it. Any advice you can offer?
As for the actual problem, as you noted it works fine after a local install. My best guess is that the pxe boot process is leaving it in a state where the driver can't init it properly. If you can help me get pxe working again I can go look for signs of that. Thanks!
Ok, Jeff and I just went over all this, and we've discovered that in addition to pxe being broken, once you ifdown/ifup this card, dhcp doesn't work on it until you remove and reload the module. I'm sure these are related issues. I'm guessing that we're doing something in the close routine that prevents a re-open of the driver from working properly, similar to what the pxe firmware does.
I'll try to fix the driver ifdown/ifup issue and see if we can get the pxe boot to work from there.
Created attachment 330504 [details]
patch to enforce read/write order on pci bus
So I was doing some reading and came across this:
I was wondering if this had anything to do with our problem: whether our use of pxe, or shutting down and restarting the NIC, had somehow caused reads or writes on the pci bus to happen out of order. So I wrote this patch to try to enforce order, and it works for the case of an ifdown/ifup of the interface. I tried it several times over and each time it could re-dhcp for an address. Jeff, can you build a kernel with this patch, and try pxe booting with it to see if it fixes the problem there too?
Please bear in mind, this is a complete throwaway patch; I just want to make sure it fixes both problems before we address a more correct solution. The easiest thing to do might be to just take the upstream driver into RHEL5 (which may fix this differently, the changelog is vague on it). Let's figure out if this is on the right track, then we'll tackle fixing it properly. Thanks!
OK finally got to do the test. Here is what I did:
I built a test kernel with Neil's patch.
I grabbed the pxe versions of vmlinuz and initrd.img from the 2.6.18-128.el5 distro. Unpacked the initrd.img, unpacked the modules.cgz. Removed the sky2.ko.
Then copied the sky2.ko from the tree where I built from. Stripped it to remove the key. Packed the modules.cgz up. Packed the initrd.img up.
Bill Peck then created a pxe entry on the lab server that pointed to the vmlinuz and initrd.img I had given him.
It had the same issues as the driver that is in 2.6.18-128.el5, so it doesn't look like the patch helps with this issue.
I have 2 machines with the Marvell EN NIC (on loan) in my office to address this problem.
Jeff, do me a quick favor, try to pxe boot this system with 5.1 and 5.2. I'd like to make sure this isn't a regression. Thanks!
I have found the same issue while installing RHEL5 U2 on a Sun X4100 (NIC e1000): I get the same dhcp problem. I do not have this problem when kickstarting from the RHEL5 U1 distro.
BTW, the NIC available for PXE boot is an Intel Gigabit Card 82571EB. I have not tried the nicdelay option yet, but I will let you know how it goes very soon.
Jeff, as part of bz 484712, I had to rev the sky2 driver. It's probably worth making sure that this problem still exists when that patch gets into the build.
Reply to Comment #36 From Need Real Name (email@example.com)
The e1000 issue you are seeing may be a side effect of adding the e1000e driver to Red Hat Enterprise Linux Server 5.2. It would be good to know if the problem exists in the current release, 5.3. If it does, then please open a new BZ to track the issue for e1000 or e1000e. This BZ is only for the sky2 driver issue.
Updating PM score.
Ok, is there a way I can try pxe booting this machine myself with the initramfs that you set up so that I can tinker a bit?
Absolutely, contact me on irc and I will give you the information on how to access this machine.
Ok, I've tinkered for a while, and unfortunately, I can't get any visibility into what's going on here.
So here's my plan: I'm going to set up a local RHEL5 U3 system with the -128 kernel, and I'm going to build a custom initrd using kdump, so that I have busybox access in the initramfs. I'll load the appropriate sky2 driver from the 133 kernel there, and that will give me some visibility into this problem, since I'll be able to get shell access.
Jeff, if I give you an initramfs, can you get an entry set up on the pxe server so that I can boot to it with an appropriate kernel?
Ok, bpeck set me up with my custom busybox initramfs, and the results are interesting.
Most interesting (or perhaps frustrating), is that, if I boot with my initramfs, and do nothing else, and then insmod the sky2 module, and manually set the ip address, the interface works fine.
I need to rebuild the initramfs, as I forgot the dhcp client app, so that I can verify that dhcp works with it, but I expect it will.
This leads me to two conclusions:
1) Something about the previous initrd that we built to test the new driver with anaconda was broken
2) Something about how anaconda (or perhaps NetworkManager) is dealing with this interface is broken.
After I test with my new initramfs and confirm that dhcp works, we should probably have an anaconda or NetworkManager person look at this bug again.
Ok, so I've confirmed it, with my busybox initramfs, I'm able to load the sky2 module from the -133 kernel, bring it up, obtain a dhcp address and ping the router and other hosts on the network. Something about anaconda or its NetworkManager component (if we use that in RHEL5) is causing this problem. At the very least we need to get someone familiar with anaconda to look at this again so that we can get some visibility into the problem here.
Jeff, any suggestions who that might be, I don't recall who owns anaconda at the moment.
David Cantrell is cc'd on this bz, he should be able to answer any of your anaconda questions.
(In reply to comment #46)
> Ok, so I've confirmed it, with my busybox initramfs, I'm able to load the sky2
> module from the -133 kernel, bring it up, obtain a dhcp address and ping the
> router and other hosts on the network. Something about anaconda or its
> NetworkManager component (if we use that in RHEL5) is causing this problem. At
> the very least we need to get someone familiar with anaconda to look at this
> again so that we can get some visibility into the problem here.
> Jeff, any suggestions who that might be, I don't recall who owns anaconda at
> the moment.
RHEL5 doesn't use NetworkManager during installation, so it'll be one of the following:
- We're missing the required kernel module in the initrd.img
- loader is not communicating with the device correctly
I'd be happy to take a look at the issue, but I need a sky2 NIC. Anyone know where I can get one?
I'm almost certain it's not #1. I see messages indicating that sky2 is getting loaded from modules.cgz when booting in this environment, and subsequent messages about eth0, so I think the interface was at least created.
That would leave the loader/driver miscommunication, although I'm not sure what it would be doing to cause this.
Jeff, can you forward minilab access info to David, and set up an appropriate pxe boot entry for him to reproduce the problem? (The one you were using previously was co-opted by bpeck to set up the busybox test that I requested.)
I am not sure what the magic is/was. But currently with the RHEL5.4-Server-20090408.nightly I can install both systems that I could not install before. If I go back and try to install RHEL5.3 it fails.
So I am going to assume that your driver update fixed it and I did not do the test correctly :/
The driver update may have had some impact on it, although I'm not sure what. I suppose we can close this as currentrelease if your testing shows consistent failure with 5.3 and consistent success with 5.4. Let me know. Thanks!
The behavior is consistent. I went back several times just to make sure I was not losing my mind. Thanks for all your help and assistance with this issue. I think you can close this as "currentrelease".
No problem. Thanks!