174922 – ne2k-pci cold boot problem with 2.6.14-1.1644_FC5 and newer

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	John W. Linville
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-12-04 08:17 UTC by Hans de Goede
Modified:	2007-11-30 22:11 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-01-04 14:32:32 UTC
Type:	---
Embargoed:
Dependent Products:

Description Hans de Goede 2005-12-04 08:17:33 UTC

Yesterday morning, on my first cold boot after a yum update, my network
connection didn't work. After many cold and warm boots I've come to the
following conclusions:

My network card did not work 4 out of 5 cold boot attempts with
2.6.14-1.1729_FC5.x86_64.

It did work 1 out of 1 cold boot attempts with 2.6.14-1.1663_FC5.x86_64, but
that may be just dumb luck.

Once it has failed, no amount of warm booting 2.6.14-1.1644_FC5 or any newer
kernel will bring it back to life. Booting 2.6.13-1.1594_FC5.x86_64 will bring
it back. Once it is alive, any kernel warm boot works fine.

Not working / not alive in this case means that the card is detected fine and,
according to the ifconfig statistics, frames get sent fine, but nothing is
received. While looking for duplicates I found a similar bug, but with totally
different hardware: bug 174022

My card is a Compex ReadyLink 2000 (rev 0a). lspci -nv for my card gives:
00:0a.0 Class 0200: 11f6:1401 (rev 0a)
        Flags: medium devsel, IRQ 177
        I/O ports at b000 [size=32]
        [virtual] Expansion ROM at 30000000 [disabled] [size=32K]

This is with the network working. Note that this might not be network related
at all, but a PCI IRQ routing problem instead, although considering the
behaviour I find this unlikely.

One last note: I'm using the BNC (coax) / 10BASE-2 connector of my card, not RJ45/UTP.

Comment 1 Hans de Goede 2005-12-05 17:34:15 UTC

I've upgraded to the latest rawhide kernel, but that's no help. I've now also
experienced the problem with a warm boot.

Comment 2 John W. Linville 2005-12-05 21:51:48 UTC

There was a somewhat recent change that could be touching this.  I've got test
kernels that back out that change available here:
  
   http://people.redhat.com/linville/kernels/fc5/  
  
Please give them a try and post the results here...thanks!

Comment 3 Hans de Goede 2005-12-06 19:37:27 UTC

I'll give them a spin. It could be a couple of days before you hear from me
again, because if they seem to fix the problem I want to make sure it is really
fixed, which means gathering statistics (i.e. more cold boots).

Comment 4 Hans de Goede 2005-12-06 20:27:35 UTC

I'm afraid that the problem is not solved by
kernel-2.6.14-1.1739.2.2_FC5.jwltest.6.x86_64

Comment 5 John W. Linville 2005-12-06 20:47:47 UTC

Well, thanks for the info.  There isn't much else that changed in that driver 
between the kernels you cite as working and the ones that don't. 
 
Have you tried booting w/ "acpi=off" or "acpi=noirq" as kernel command-line 
options?

Comment 6 Hans de Goede 2005-12-07 20:57:17 UTC

I've tried:
-cold booting 2.6.14-1.1739_FC5 with acpi=noirq, no luck
-warm booting 2.6.14-1.1739_FC5 after the acpi=noirq attempt, with acpi=off;
 this seems to do the trick.

Comment 7 John W. Linville 2005-12-07 21:03:54 UTC

And cold boots w/ acpi=off?

Comment 8 Hans de Goede 2005-12-07 21:09:14 UTC

Also works

Comment 9 Hans de Goede 2005-12-09 18:23:45 UTC

I just (accidentally) did a cold boot of the latest Rawhide kernel
2.6.14-1.1743_FC5 (without acpi=off) and it worked. This might just be a lucky
shot, but I'll try it as my default kernel for the next couple of days; maybe an
upstream ACPI change has fixed things.

Comment 10 John W. Linville 2005-12-09 19:01:25 UTC

Do you have the latest BIOS available for your motherboard?  If not, please 
upgrade it.  Hopefully that will stabilize things for you. 
 
Barring that, it is often difficult to tell if something like this is a BIOS 
problem or a Linux ACPI problem.  I'm copying Len Brown (the Linux ACPI 
"dude") to see if he has any insight.

Comment 11 Hans de Goede 2005-12-12 12:01:24 UTC

2.6.14-1.1743_FC5 seems to be working a lot better; so far only one failed
cold boot. My BIOS is indeed a bit ancient. I'll try upgrading it and keep you
posted.

Comment 12 Hans de Goede 2005-12-15 23:33:04 UTC

I've upgraded my BIOS but that didn't help. After some fiddling, though, I have
found the real reason. This might not be a kernel bug at all but just a timing
race condition, which might be caused by userspace.

What I've done is dump the entire dmesg of a successful boot and of a failed
boot of the same kernel to 2 files and diff them, which reveals the real
problem: I've got a via network interface integrated on my motherboard, but
since I have an old coax network here at home I still use my old trusty ne2k-pci
with the BNC connector. The problem is that it does not always get assigned the
same interface name: on some boots it's eth0, on others eth1 (and vice versa
for the via interface).
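
The comparison described above can be sketched in Python. This is only an
illustration, not what was actually run here: the function names, the file
labels dmesg.good / dmesg.bad, and the assumption that log lines may start with
a "[   12.345678]" style timestamp are all hypothetical.

```python
import difflib
import re

# Kernel timestamps, if present, differ on every boot and would drown the
# real differences, so strip them first (assumption: "[  1.234567] " format).
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    """Drop leading timestamps and trailing whitespace from each log line."""
    return [_TIMESTAMP.sub("", line.rstrip()) for line in lines]


def diff_boot_logs(good_lines, bad_lines):
    """Return unified-diff lines between a working and a failed boot log."""
    return list(difflib.unified_diff(
        normalize(good_lines), normalize(bad_lines),
        fromfile="dmesg.good", tofile="dmesg.bad", lineterm="", n=0))
```

To use it, save the output of dmesg after each boot, read both files with
readlines(), and pass the two lists to diff_boot_logs(). Lines that mention
eth0 or eth1 with a different driver are the ones to look for.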

Now that the problem is clear, why does this happen and what can I do to try and
fix it?

Comment 13 Dave Jones 2005-12-16 01:18:57 UTC

You should be able to bind an ethX name to a specific interface by using
HWADDR=00:11:22:33:44:55 in the /etc/sysconfig/networking/devices/ifcfg-ethX files.

That should stop it jumping around.
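
To find the MAC address to put after HWADDR=, something like the following
Python sketch may help. It assumes the kernel exposes each interface under
/sys/class/net/<name>/address; the function names are made up for illustration,
and the sysfs_net parameter exists only so the code can be tried against a
scratch directory.

```python
import os


def list_hwaddrs(sysfs_net="/sys/class/net"):
    """Map interface name -> MAC address string as read from sysfs."""
    result = {}
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "address")
        try:
            with open(path) as fh:
                result[name] = fh.read().strip()
        except OSError:
            continue  # entry is not an interface directory, or is unreadable
    return result


def hwaddr_line(mac):
    """Format the extra line to add to the matching ifcfg-ethX file."""
    return "HWADDR=" + mac
```

Running list_hwaddrs() on a boot where the network works shows which MAC
belongs to which card; hwaddr_line() then gives the line to add to the ifcfg
file of the name that card should keep.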

Comment 14 Hans de Goede 2005-12-16 08:36:53 UTC

So I should replace/remove the DEVICE= line then?

Also, this will work for me, but it is still a bug; modifying the scripts is not
a solution for users who are not technically savvy.

Comment 15 John W. Linville 2005-12-16 20:05:47 UTC

DEVICE= stays.  The HWADDR= line is additional. 
 
Different kernel versions, BIOS updates, kernel command-line options, and 
probably other things can account for the order being different at different 
times. 
 
Can you provide a definite, reliable way to reproduce the detection order
one way or the other?  If so, we might be able to narrow it down to a real
problem.

Comment 16 Dave Jones 2005-12-16 20:49:25 UTC

system-config-network also allows a 'bind to hardware' option that sets this.

Comment 17 Hans de Goede 2005-12-17 08:36:41 UTC

Actually the problem is that with kernels > 2.6.13-1.1594_FC5.x86_64 I can't get
the detection order stable in any way. Sometimes the ne2k-pci card becomes eth0,
other times eth1.

Whose task is it to load the modules? I have the feeling that the modules
are loaded in parallel and that this is timing related.

Comment 18 John W. Linville 2006-01-03 19:04:29 UTC

Some modules will be loaded by rc.sysinit, others will be loaded on-demand as 
the hardware is accessed. 
 
Did you try adding the HWADDR= lines as mentioned in comment 13?

Comment 19 Hans de Goede 2006-01-03 21:43:31 UTC

Yes, I did add the HWADDR= line as suggested, and that fixes my problem. So this
bug might be closed. I say might because IMHO this behaviour should never
happen; the HWADDR= line is a hack in this case. It's not like I'm swapping
cards from one slot to another: I'm only turning off the PC and turning it back
on again, and with recent kernels this causes an inconsistent probing order,
which IMHO is _bad_.

Comment 20 John W. Linville 2006-01-04 14:32:32 UTC

I'm sorry that you don't like the HWADDR= option, but it is the best we have  
to offer.  I'm going to close this as WORKSFORME since you have a working 
solution.  Thanks!

Comment 21 Hans de Goede 2006-01-04 14:43:07 UTC

I won't reopen this since it indeed works for me, but can you please explain how
booting the same kernel twice, with nothing changed between the boots, and still
getting a different probe order is not a _bug_?

Comment 22 John W. Linville 2006-01-04 15:10:49 UTC

There is no defined order for detection in the first place, so the fact that
it changes is really just an annoyance.  The HWADDR= line fixes that annoyance.
 
The NIC drivers are modular.  Since you have NICs that are not covered by the 
same driver, it is the order in which the driver modules load AND _initialize_ 
that will determine which gets which name by default.  Any number of factors 
might influence the order of initialization even if they are always loaded in 
the same order.  Such factors include locking issues, event delays, and other 
code details stemming from differences between the two drivers. 
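
As a toy model of the above (not kernel code), the Python sketch below hands
out names in the order in which drivers finish initializing. The timings are
invented and "via" is only a stand-in label for the onboard NIC's driver.

```python
def assign_names(init_done_at):
    """Hand out eth0, eth1, ... in the order drivers finish initializing.

    init_done_at maps a driver label to the (hypothetical) time, in seconds,
    at which its probe routine registers a network device.
    """
    order = sorted(init_done_at, key=init_done_at.get)
    return {driver: "eth%d" % index for index, driver in enumerate(order)}


# Two boots of the same kernel; only ne2k-pci's timing differs slightly.
boot_a = assign_names({"ne2k-pci": 1.20, "via": 1.35})
boot_b = assign_names({"ne2k-pci": 1.40, "via": 1.35})
```

A shift of a fraction of a second in one driver is enough to swap the two
names, which is why pinning a name to a MAC address with HWADDR= gives a stable
result where the load order alone does not.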
 
Hth...thanks!
