Bug 466983 - [RHEL5.3] sky2, Can not get ip address while installing from PXE
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks: 483701
Reported: 2008-10-14 17:27 EDT by Jeff Burke
Modified: 2009-04-09 13:32 EDT

Doc Type: Bug Fix
Last Closed: 2009-04-09 13:32:57 EDT
Attachments
patch to enforce read/write order on pci bus (2.88 KB, patch)
2009-01-30 15:17 EST, Neil Horman

Description Jeff Burke 2008-10-14 17:27:37 EDT
Description of problem:
 When installing the latest RHEL5.3-Server-20081011.nightly x86_64, the system cannot get an IP address.

 While running through the anaconda install the system fails to get an ip address.
Ethernet controller: Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)

Version-Release number of selected component (if applicable):
2.6.18-118.el5

How reproducible:
Always

Steps to Reproduce:
1. Use a system with sky2 driver and Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
2. Try to PXE install
  
Actual results:
 The system never gets an IP address at the anaconda screen when IPv4 and IPv6 are selected.

Expected results:
This should work

Additional info:
I can install the same release from a DVD; after the reboot, the network card works fine.
I can install the i386 version of the distro and PXE works fine.

I am not 100% sure that this is a kernel issue; it may be in anaconda.
Comment 1 Jeff Burke 2008-10-14 17:29:03 EDT
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8052 PCI-E ASF Gigabit Ethernet Controller (rev 20)
	Subsystem: Marvell Technology Group Ltd. Marvell RDK-8052
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size 10
	Interrupt: pin A routed to IRQ 217
	Region 0: Memory at febfc000 (64-bit, non-prefetchable) [size=16K]
	Region 2: I/O ports at e800 [size=256]
	Expansion ROM at febc0000 [disabled] [size=128K]
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [5c] Message Signalled Interrupts: 64bit+ Queue=0/1 Enable+
		Address: 00000000fee01000  Data: 40d9
	Capabilities: [e0] Express Legacy Endpoint IRQ 0
		Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag-
		Device: Latency L0s unlimited, L1 unlimited
		Device: AtnBtn- AtnInd- PwrInd-
		Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
		Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
		Device: MaxPayload 128 bytes, MaxReadReq 2048 bytes
		Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 2
		Link: Latency L0s <256ns, L1 unlimited
		Link: ASPM Disabled RCB 128 bytes CommClk+ ExtSynch-
		Link: Speed 2.5Gb/s, Width x1
	Capabilities: [100] Advanced Error Reporting
Comment 2 Neil Horman 2008-10-14 20:06:09 EDT
Jeff, can you provide the console output while the sky2 card tries to obtain an address?  I'd like to see if anaconda reports any further errors while it's trying to dhcp.  A tcpdump of the process from the server or an observing machine would also help, if possible (let's not worry about that just yet though).

Also, are you familiar with the nicdelay parameter to anaconda? It might be good to set it to 120 on the kernel command line to force a 2 minute pause after link up, just to make sure we're not encountering some weird initialization delay in the card on 64 bit systems.

Thanks!
Comment 3 Jeff Burke 2008-10-15 09:05:50 EDT
Here is the console output while doing the install:

<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: enabling interface
<6>sky2: eth0: ram buffer 48K
<6>ADDRCONF(NETDEV_UP): eth0 link is not ready
<6>sky2: eth0: Link is up at 1000Mbps, full duplex, flow control rx
<6>ADDRCONF(NETDEV_UP): eth0 link becomes ready
<6>sky2: eth0: no ipv6 routers present
Comment 4 Jeff Burke 2008-10-15 09:06:47 EDT
Adding the nicdelay=120 option to the kernel command line worked. I was able to proceed with the install.
Comment 15 David Cantrell 2008-11-06 15:19:41 EST
Running the DHCP loop in anaconda indefinitely isn't a great idea to me (or really, running any loop in anaconda indefinitely isn't a great idea).  We should be able to determine that the sky2 driver is still initializing the device and we should wait on that.

Is there something else we can check via an ioctl() that will tell us the driver is still initializing?  That can be added to the get_link_status() code and we can wait on that to settle before checking for link up/down.
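
To illustrate the bounded wait suggested above, here is a minimal sketch in Python (the loader itself is C, so this is not loader code). It assumes a hypothetical callable, read_link, standing in for whatever get_link_status() actually queries; the timeout, polling interval, and number of consecutive "up" reads are made-up values, not ones taken from anaconda.

```python
import time
from typing import Callable


def wait_for_stable_link(
    read_link: Callable[[], bool],          # hypothetical: True when the link reads "up"
    timeout: float = 30.0,                  # assumed upper bound, in seconds
    interval: float = 0.5,                  # assumed polling interval, in seconds
    stable_reads: int = 3,                  # assumed number of consecutive "up" reads
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the link has read "up" stable_reads times in a row.

    Returns False after roughly `timeout` seconds, so the caller never
    loops indefinitely.
    """
    deadline = clock() + timeout
    consecutive_up = 0
    while True:
        if read_link():
            consecutive_up += 1
            if consecutive_up >= stable_reads:
                return True
        else:
            # Any "down" reading restarts the count: the link has not settled.
            consecutive_up = 0
        if clock() >= deadline:
            return False
        sleep(interval)
```

The caller would start DHCP only when this returns True and report a link failure otherwise. The sleep and clock parameters exist only so the loop can be exercised without real delays.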
Comment 16 Jeff Burke 2008-12-10 22:28:25 EST
James,
   I noticed this BZ is still in needinfo
james.antill: needinfo? (anaconda-maint). Did Comment #15 answer your question? Or were you looking for additional information?
Comment 20 Jeff Burke 2009-01-27 19:29:06 EST
Reply to Comment#19
1) configure a sufficiently long nicdelay timeout in your kickstart file
 Despite what I said in Comment #4, I have gone back to retest this and it is _NOT_ working. I tried values from 10 to 240; it never gets an address.

2) manually retry the operation
 I can't get this installed unless I put in a second NIC card to do the install. Once the installation is finished and the system reboots the NIC card works.

3) default the installer to retry indefinitely, adding the ability to specify a
timeout or retry count, and the ability to manually break out of the operation.
 If that were the case now, it still would not work. I think we may have a bigger issue with this system.
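
For reference, option 3 (retry indefinitely by default, with an optional timeout or retry count and a manual break-out) could look roughly like the Python sketch below. It is only an illustration of the control flow, not installer code: try_once and should_cancel are hypothetical callables, and the default pause is an assumed value.

```python
import time
from typing import Callable, Optional


def retry_dhcp(
    try_once: Callable[[], Optional[str]],        # hypothetical: returns an address or None
    max_attempts: Optional[int] = None,           # None means no retry-count limit
    timeout: Optional[float] = None,              # None means no time limit
    should_cancel: Callable[[], bool] = lambda: False,  # hypothetical manual break-out
    pause: float = 2.0,                           # assumed delay between attempts, in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call try_once until it yields an address or a configured limit is hit.

    With max_attempts=None and timeout=None the loop retries indefinitely,
    and should_cancel is the only way out short of success.
    """
    start = clock()
    attempts = 0
    while True:
        if should_cancel():
            return None
        address = try_once()
        attempts += 1
        if address is not None:
            return address
        if max_attempts is not None and attempts >= max_attempts:
            return None
        if timeout is not None and clock() - start >= timeout:
            return None
        sleep(pause)
```

As noted above, no amount of retrying helps if the interface never passes traffic, so this only addresses the slow-initialization case.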
Comment 21 Neil Horman 2009-01-28 06:59:44 EST
1) Ok, well, if the nicdelay parameter doesn't work, that changes everything.  If it's not about just waiting long enough for the hardware to init, then this is a driver bug or a hardware problem.  Do you know if it's reproducible on another, identical sky2 card (if one is available)?

2) I assume this is an RHTS system?  For the purposes of getting this system up, can't you do a serial console install?

3) I agree, it's either a hardware bug, or something is up with the driver.  Has this system been installable in the past like this?  i.e. if you try an older build, did it work?

I'll start looking through rhts for a similar card to try this on.
Comment 22 Jeff Burke 2009-01-28 08:25:05 EST
1.) We have two systems that are identical. Both of them have the issue.

2.) These are not in RHTS. They are in the 3rd floor minilab area. I can get you access.

3.) I am not sure what I did when it actually worked :/ I am thinking I had the second NIC in the system and did not realize it. I also went back to RHEL5.2 and that also failed. 

Thanks,
Jeff
Comment 24 Neil Horman 2009-01-30 10:09:35 EST
Jeff, I'm trying to tinker with this system this morning, but the pxe code keeps failing.  It's getting an address, but then it tries to load a file that the pxe server seems to say is unavailable.  The flicker from the kvm is preventing me from getting into the bios, so I seem unable to reconfigure it.  Any advice you can offer?

As for the actual problem, as you noted, it works fine after a local install.  My best guess is that the pxe boot process is leaving it in a state where the driver can't init it properly.  If you can help me get pxe working again I can go look for signs of that.  Thanks!
Comment 25 Neil Horman 2009-01-30 11:17:13 EST
Ok, Jeff and I just went over all this, and we've discovered that in addition to pxe being broken, dhcp doesn't work on this card if you ifdown/ifup it until you remove and re-install the module.  I'm sure these are related issues.  I'm guessing that we're doing something in the close routine that prevents a re-open on the driver from working properly, and that it is similar to what the pxe firmware does.
I'll try to fix the driver ifdown/up issue and see if we can get the pxe boot to work from there.
Comment 27 Neil Horman 2009-01-30 15:17:56 EST
Created attachment 330504 [details]
patch to enforce read/write order on pci bus

So I was doing some reading and came across this:
http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/1766.html

I was wondering if this had anything to do with our problem: whether our use of pxe, or shutting down and restarting our NIC, had somehow caused us to get out of order reads or writes on the pci bus.  So I wrote this patch to try to enforce order, and it works for the case of ifdown/ifup-ing the interface.  I tried it several times over and each time it could re-dhcp for an address.  Jeff, can you build a kernel with this patch, and try pxe booting with it to see if it fixes the problem there too?

Please bear in mind, this is a complete throwaway patch; I just want to make sure it fixes both problems before we address a more correct solution.  The easiest thing to do might be to just take the upstream driver into RHEL5 (which may fix this differently, the changelog is vague on it).  Let's figure out if this is on the right track, then we'll tackle fixing it properly.  Thanks!
Comment 28 Jeff Burke 2009-02-03 17:31:54 EST
OK finally got to do the test. Here is what I did:
I built a test kernel with Neil's patch.
I grabbed the pxe versions of vmlinuz and initrd.img from the 2.6.18-128.el5 distro. Unpacked the initrd.img, unpacked the modules.cgz. Removed the sky2.ko.
Then copied the sky2.ko from the tree where I built from. Stripped it to remove the key. Packed the modules.cgz up. Packed the initrd.img up.

Bill Peck then created a pxe entry on the lab server that pointed to the vmlinuz and initrd.img I had given him.

It had the same issues as the driver that is in 2.6.18-128.el5, so it doesn't look like the patch helps this issue.
Comment 29 Larry Troan 2009-02-05 20:00:57 EST
I have 2 machines with the Marvell EN NIC (on loan) in my office to address this problem.
Comment 31 Neil Horman 2009-02-06 08:36:38 EST
Jeff, do me a quick favor, try to pxe boot this system with 5.1 and 5.2.  I'd like to make sure this isn't a regression.  Thanks!
Comment 35 Ahmed Taha 2009-02-12 15:01:30 EST
I have found the same issue while installing RHEL5 U2 on a Sun X4100 (NIC e1000), and I get the same dhcp problem. I do not have this problem when kickstarting from the RHEL5 U1 distro.

Thanks,
Comment 36 Ahmed Taha 2009-02-12 15:07:39 EST
BTW the NIC available for PXE boot is Intel Gigabit Card 82571EB .. I have not tried the nicdelay option yet .. but very soon I will let you know how it goes ..

Thanks,
Comment 37 Neil Horman 2009-02-12 15:16:32 EST
Jeff, as part of bz 484712, I had to rev the sky2 driver.  It's probably worth making sure that this problem still exists when that patch gets into the build.
Comment 38 Jeff Burke 2009-02-12 16:27:52 EST
Reply to Comment #36 From Need Real Name (consult.itlinux@gmail.com)

The e1000 issue you are seeing may be a side effect of adding the e1000e driver to Red Hat Enterprise Linux Server 5.2. It would be good to know if the problem exists in the current release, 5.3. If it does, then please open a new BZ to track the issue for e1000 or e1000e. This current BZ is only for the sky2 driver issue.

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5.2/html/Release_Notes/x86/ar01s04.html

Thanks,
Jeff
Comment 40 RHEL Product and Program Management 2009-02-16 10:25:55 EST
Updating PM score.
Comment 42 Neil Horman 2009-03-06 06:50:12 EST
k, is there a way I can try pxe booting this machine myself with the initramfs that you set up so that I can tinker a bit?
Comment 43 Jeff Burke 2009-03-06 08:45:53 EST
Neil,
   Absolutely, contact me on irc and I will give you the information on how to access this machine.
Comment 44 Neil Horman 2009-03-09 16:05:38 EDT
Ok, I've tinkered for a while, and unfortunately, I can't get any visibility into what's going on here.

So here's my plan: I'm going to set up a local RHEL5 U3 system with the -128 kernel, and I'm going to build a custom initrd using kdump, so that I have busybox access in the initramfs.  I'll load the appropriate sky2 driver from the 133 kernel there, and that will give me some visibility into this problem, since I'll be able to get shell access.

Jeff, if I give you an initramfs, can you get an entry set up on the pxe server so that I can boot to it with an appropriate kernel?
Comment 45 Neil Horman 2009-03-10 15:44:44 EDT
Ok, bpeck set me up with my custom busybox initramfs, and the results are interesting.

Most interesting (or perhaps frustrating), is that, if I boot with my initramfs, and do nothing else, and then insmod the sky2 module, and manually set the ip address, the interface works fine.

I need to rebuild the initramfs, as I forgot the dhcp client app, so that I can verify that dhcp works with it, but I expect it will.

This leads me to two conclusions:
1) Something about the previous initrd that we built to test the new driver with anaconda was broken.

2) Something about how anaconda (or perhaps NetworkManager) is dealing with this interface is broken.

After I test with my new initramfs and confirm that dhcp works, we should probably have an anaconda or NetworkManager person look at this bug again.
Comment 46 Neil Horman 2009-03-11 07:17:07 EDT
Ok, so I've confirmed it, with my busybox initramfs, I'm able to load the sky2 module from the -133 kernel, bring it up, obtain a dhcp address and ping the router and other hosts on the network.  Something about anaconda or its NetworkManager component (if we use that in RHEL5) is causing this problem.  At the very least we need to get someone familiar with anaconda to look at this again so that we can get some visibility into the problem here.

Jeff, any suggestions who that might be, I don't recall who owns anaconda at the moment.
Comment 47 Don Zickus 2009-03-11 09:49:00 EDT
Neil,

David Cantrell is cc'd on this bz, he should be able to answer any of your anaconda questions.  

-Don
Comment 48 David Cantrell 2009-03-11 15:03:17 EDT
(In reply to comment #46)
> Ok, so I've confirmed it, with my busybox initramfs, I'm able to load the sky2
> module from the -133 kernel, bring it up, obtain a dhcp address and ping the
> router and other hosts on the network.  Something about anaconda or its
> NetworkManager component (if we use that in RHEL5) is causing this problem.  At
> the very least we need to get someone familiar with anaconda to look at this
> again so that we can get some visibility into the problem here.
> 
> Jeff, any suggestions who that might be, I don't recall who owns anaconda at
> the moment.  

RHEL5 doesn't use NetworkManager during installation, so it'll be one of the following:

- We're missing the required kernel module in the initrd.img
- loader is not communicating with the device correctly

I'd be happy to take a look at the issue, but I need a sky2 NIC.  Anyone know where I can get one?
Comment 49 Neil Horman 2009-03-11 15:22:10 EDT
David-

Almost certain it's not #1. I see messages indicating that sky2 is getting loaded from modules.cgz when booting in this environment, and subsequent messages about eth0, so the interface was at least created, I think.

That would leave the loader/driver miscommunication, although I'm not sure what it would be doing to cause this.

Jeff, can you forward minilab access info to David, and set up an appropriate pxe boot entry for him to reproduce the problem? (the one you were using previously was co-opted by bpeck to setup the busybox test that I requested).

Thanks guys!
Comment 51 Jeff Burke 2009-04-09 08:45:25 EDT
Neil,
  I am not sure what the magic is/was. But currently with the RHEL5.4-Server-20090408.nightly I can install both systems that I could not install before. If I go back and try to install RHEL5.3 it fails.

  So I am going to assume that your driver update fixed it and I did not do the test correctly :/

Thanks,
Jeff
Comment 52 Neil Horman 2009-04-09 09:30:38 EDT
The driver update may have had some impact on it, although I'm not sure what.  I suppose we can close this as currentrelease if your testing shows consistent failure with 5.3 and consistent success with 5.4.  Let me know.  Thanks!
Comment 53 Jeff Burke 2009-04-09 09:34:34 EDT
Neil,
  The behavior is consistent. I went back several times just to make sure I was not losing my mind. Thanks for all your help and assistance with this issue. I think you can close this "currentrelease".

Thanks,
Jeff
Comment 54 Neil Horman 2009-04-09 13:32:57 EDT
No problem.  Thanks!