Yesterday morning, on my first cold boot after a yum update, my network connection didn't work. After many cold and warm boots I've come to the following conclusions: My network card did not work on 4 out of 5 cold boot attempts with 2.6.14-1.1729_FC5.x86_64. It did work on 1 out of 1 cold boot attempts with 2.6.14-1.1663_FC5.x86_64, but that may just be dumb luck. Once it has failed, no amount of warm booting 2.6.14-1.1644_FC5 or any newer kernel will bring it back to life. Booting 2.6.13-1.1594_FC5.x86_64 will bring it back. Once it is alive, a warm boot of any kernel will work fine. Not working / not alive in this case means that the card is detected fine, but according to the ifconfig statistics frames get sent while nothing is received. I found a similar bug, but with totally different hardware, when looking for duplicates: bug 174022. My card is a Compex ReadyLink 2000 (rev 0a). lspci -nv for my card gives:

00:0a.0 Class 0200: 11f6:1401 (rev 0a)
        Flags: medium devsel, IRQ 177
        I/O ports at b000 [size=32]
        [virtual] Expansion ROM at 30000000 [disabled] [size=32K]

This is with the network working. Note that this might not be network-related at all, but instead a PCI IRQ routing problem, although considering the behaviour I find this unlikely. One last note: I'm using the BNC (coax) / 10base-2 connector of my card, not RJ45/UTP.
I've upgraded to the latest rawhide kernel, but that's no help. I've now also experienced the problem with a warm boot.
There was a somewhat recent change that could be touching this. I've got test kernels that back out that change available here: http://people.redhat.com/linville/kernels/fc5/ Please give them a try and post the results here...thanks!
I'll give them a spin. It could be a couple of days before you hear from me again, because if they seem to fix the problem I want to make sure it is really fixed, which means gathering statistics (aka cold boots).
I'm afraid that the problem is not solved by kernel-2.6.14-1.1739.2.2_FC5.jwltest.6.x86_64
Well, thanks for the info. There isn't much else that changed in that driver between the kernels you cite as working and the ones that don't. Have you tried booting w/ "acpi=off" or "acpi=noirq" as kernel command-line options?
I've tried:
- cold booting 2.6.14-1.1739_FC5 with acpi=noirq: no luck
- warm booting 2.6.14-1.1739_FC5 after the acpi=noirq attempt, with acpi=off: this seems to do the trick.
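For anyone following along: on FC5 these options go on the kernel line, either typed at the GRUB prompt for a one-off test or added to /boot/grub/grub.conf to make them persistent. A hypothetical stanza (paths and root device are examples, not taken from this machine) might look like:

```
title Fedora Core (2.6.14-1.1739_FC5, acpi=off)
        root (hd0,0)
        kernel /vmlinuz-2.6.14-1.1739_FC5 ro root=/dev/VolGroup00/LogVol00 acpi=off
        initrd /initrd-2.6.14-1.1739_FC5.img
```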
And cold boots w/ acpi=off?
Also works
I just (accidentally) did a cold boot of the latest Rawhide kernel 2.6.14-1.1743_FC5 (without acpi=off) and it worked. This might just be a lucky shot, but I'll try it as my default kernel for the next couple of days; maybe an upstream acpi change has fixed things.
Do you have the latest BIOS available for your motherboard? If not, please upgrade it. Hopefully that will stabilize things for you. Barring that, it is often difficult to tell if something like this is a BIOS problem or a Linux ACPI problem. I'm copying Len Brown (the Linux ACPI "dude") to see if he has any insight.
2.6.14-1.1743_FC5 seems to be working a lot better, so far only one failed cold boot. My BIOS is indeed a bit ancient, I'll try upgrading it and keep you posted.
I've upgraded my BIOS but that didn't help. After some fiddling, though, I have found the real reason; this might not be a kernel bug at all but just a timing race condition, which might be caused from userspace. What I've done is dump the entire dmesg of a successful boot and of a failed boot of the same kernel to 2 files and diff them, which reveals the real problem: I've got a via network interface integrated on my motherboard, but since I have an old coax network here at home I still use my old trusty ne2k-pci with the BNC connector. The problem is that it does not always get assigned the same interface name: on some boots it's eth0, on others eth1 (and vice versa for the via interface). Now that the problem is clear, why does this happen and what can I do to try and fix it?
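For reference, the dmesg comparison described above can be done roughly like this (file names are just examples):

```shell
# Capture the kernel log after each kind of boot, then compare.
dmesg > /tmp/dmesg-good.txt   # after a boot where the ne2k-pci came up as eth0
dmesg > /tmp/dmesg-bad.txt    # after a boot where it came up as eth1
# Show only the lines mentioning interface names, to spot the swap.
diff -u /tmp/dmesg-good.txt /tmp/dmesg-bad.txt | grep -i eth
```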
you should be able to bind an ethX name to a specific interface by using HWADDR=00:11:22:33:44:55 in the /etc/sysconfig/networking/devices/ifcfg-ethX files. That should stop it jumping around.
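A hypothetical ifcfg file for the ne2k-pci card might then look like this (the MAC address below is a placeholder; substitute the real one as shown by ifconfig):

```
# /etc/sysconfig/networking/devices/ifcfg-eth0
DEVICE=eth0
HWADDR=00:11:22:33:44:55
BOOTPROTO=dhcp
ONBOOT=yes
```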
So I should replace/remove the DEVICE= line then? Also, this will work for me, but this is still a bug; modifying the scripts is not a solution for non-technically-savvy users.
DEVICE= stays. The HWADDR= line is additional. Different kernel versions, BIOS updates, kernel command-line options, and probably other things can account for the order being different at different times. Can you provide a definite, reproducible means of reproducing the detection one way or the other? If so, we might be able to narrow it down to a real problem.
system-config-network also allows a 'bind to hardware' option that sets this.
Actually the problem is that with kernels > 2.6.13-1.1594_FC5.x86_64 I can't get the detection order stable in any way. Sometimes the ne2k-pci card becomes eth0, other times eth1. Whose / what's task is it to load the modules? I have the feeling that the modules are loaded in parallel and that this is timing-related.
Some modules will be loaded by rc.sysinit, others will be loaded on-demand as the hardware is accessed. Did you try adding the HWADDR= lines as mentioned in comment 13?
Yes, I did add the HWADDR= line as suggested and that fixes my problem. So, this bug might be closed. I say might because IMHO this behaviour should never happen; the HWADDR= line is a hack in this case. It's not like I'm swapping cards from one slot to another, I'm only turning off the PC and turning it back on again, and with recent kernels this causes an inconsistent probing order, which IMHO is _bad_.
I'm sorry that you don't like the HWADDR= option, but it is the best we have to offer. I'm going to close this as WORKSFORME since you have a working solution. Thanks!
I won't reopen this since it indeed works for me, but can you please explain how booting the same kernel twice, with nothing changed between the boots, and still getting a different probe order is not a _bug_?
There is no defined order for detection in the first place, so the fact that it changes is really just an annoyance. The HWADDR= fixes that annoyance. The NIC drivers are modular. Since you have NICs that are not covered by the same driver, it is the order in which the driver modules load AND _initialize_ that will determine which gets which name by default. Any number of factors might influence the order of initialization even if they are always loaded in the same order. Such factors include locking issues, event delays, and other code details stemming from differences between the two drivers. Hth...thanks!
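To make the driver/name pairing concrete: on a 2.6 kernel you can see which driver ended up behind which interface name on a given boot by walking sysfs (a quick sketch; virtual devices like lo have no driver link and are skipped):

```shell
# Print "interface -> driver module" for every physical network device.
for dev in /sys/class/net/*; do
    name=$(basename "$dev")
    # Only physical devices have a driver symlink under device/.
    if [ -e "$dev/device/driver" ]; then
        drv=$(basename "$(readlink "$dev/device/driver")")
        echo "$name -> $drv"
    fi
done
```

Running this after a good boot and a bad boot should show ne2k-pci and via-rhine swapping between eth0 and eth1.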