Description of problem:
We are currently deploying a large number of IBM x3850 servers (w/ 4 x Xeon MP 7120).
Each machine is equipped with one or two Intel GbE NICs (either two dual-port
82572EI/Fiber or a single quad-port 82571EB/Copper).
No matter which type of NIC we connect to the machine and/or in which combination
(slot/copper-fiber/etc.), we get multiple "ADDRCONF(NETDEV_UP): ethX: link is not
ready" messages and the NICs refuse to work, even though ethtool reports the link
status as Link up/1000/Full duplex.
Ping doesn't work, including broadcast - nor does our own in-kernel raw packet
generator (dev_queue_xmit).
Just to test newer kernels we attempted to install Fedora 10, and the network
devices worked out of the box.
We didn't attempt to use the latest e1000e driver from Intel.com (0.8.6.1) - we
assume that this driver should work more or less out of the box.
While we are not using the latest IBM firmware (mostly because the machines are
not ours to modify), looking at the changelog we didn't see anything concerning
RHEL 5.2 support and/or PCI-E bug fixes.

Version-Release number of selected component (if applicable):
$ uname -a
Linux IP-Probe-Hadar 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
Kernel \r on an \m

How reproducible:
Always. (On a number of machines.)

Additional info:
$ dmesg -c
...
$ ifconfig eth0 down
$ ifconfig eth1 down
$ rmmod e1000e
$ ifup eth0
$ ifup eth1
$ sleep 30s
$ dmesg -c
ACPI: PCI interrupt for device 0000:1a:00.0 disabled
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
PCI: Enabling device 0000:0b:00.0 (0040 -> 0043)
ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 52 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:0b:00.0 to 64
0000:0a:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:7c:e0:2a
0000:0a:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:0a:00.0: eth0: MAC: 1, PHY: 1, PBA No: d76489-001
PCI: Enabling device 0000:1a:00.0 (0040 -> 0043)
ACPI: PCI Interrupt 0000:1a:00.0[A] -> GSI 55 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:1a:00.0 to 64
0000:19:00.0: eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:23:51:7c
0000:19:00.0: eth1: Intel(R) PRO/1000 Network Connection
0000:19:00.0: eth1: MAC: 1, PHY: 1, PBA No: d76489-001
ADDRCONF(NETDEV_UP): eth0: link is not ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
$
$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:15:17:7C:E0:2A
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea620000-ea640000

$ ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:15:17:23:51:7C
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:ea720000-ea740000
$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:   99967994          0          0          0          0          0          0          0   local-APIC-edge  timer
  1:       2384          0          0          0        738          0          0          0   IO-APIC-edge     i8042
  8:          1          0          0          0          0          0          0          0   IO-APIC-edge     rtc
  9:          0          0          0          0          0          0          0          0   IO-APIC-level    acpi
 12:          4          0          0          0          0          0          0          0   IO-APIC-edge     i8042
 14:     883627          0          0          0       7038          0          0          0   IO-APIC-edge     ide0
 50:          3          0          0          0          0          0          0          0   IO-APIC-level    eth3
 66:          0          0          0          0          0          0          0          0   PCI-MSI          eth0
 74:          0          0          0          0          0          0          0          0   PCI-MSI          eth1
201:      20095          0          0          0       2514          0          0          0   IO-APIC-level    ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
209:     108957          0      91641          0       2056          0          0          0   IO-APIC-level    aacraid
233:         29          0          0     762632          0          0          0       1584   IO-APIC-level    eth2
NMI:       1885       1819       1331       1270        963        902       1351       1290
LOC:   99954204   99954131   99954062   99953994   99953911   99953839   99953772   99953709
ERR:          0
MIS:          0

$ ethtool eth0
Settings for eth0:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes

$ ethtool eth1
Settings for eth1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: pumbag
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes
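(For reference, the "in-kernel raw packet generator" mentioned in the description
is built directly on dev_queue_xmit(). A minimal sketch of the approach -
hypothetical code, written against the 2.6.18-era API used by RHEL 5, where
dev_get_by_name() takes no namespace argument - looks like this:)

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

static int send_test_frame(const char *ifname)
{
        struct net_device *dev;
        struct sk_buff *skb;
        struct ethhdr *eth;
        int ret;

        dev = dev_get_by_name(ifname);  /* 2.6.18 API: no netns argument */
        if (!dev)
                return -ENODEV;

        skb = alloc_skb(ETH_ZLEN + LL_RESERVED_SPACE(dev), GFP_KERNEL);
        if (!skb) {
                dev_put(dev);
                return -ENOMEM;
        }
        skb_reserve(skb, LL_RESERVED_SPACE(dev));

        /* Broadcast frame with an experimental EtherType, padded to minimum size. */
        eth = (struct ethhdr *)skb_put(skb, ETH_ZLEN);
        memset(eth, 0, ETH_ZLEN);
        memset(eth->h_dest, 0xff, ETH_ALEN);
        memcpy(eth->h_source, dev->dev_addr, ETH_ALEN);
        eth->h_proto = htons(0x88b5);

        skb->dev = dev;
        skb->protocol = eth->h_proto;

        ret = dev_queue_xmit(skb);      /* consumes the skb */
        dev_put(dev);
        return ret;
}

(If a frame sent this way never reaches the wire even though the link is
reported up, the problem sits below the IP stack - exactly the symptom
reported here.)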
Created attachment 327757 [details] lspci -vv output
Older quad-port PCI-X/Copper and dual port PCI-X/Fiber NICs (both using e1000) work out of the box. - Gilboa
You are probably hitting the problem that this upstream patch fixes:

commit f8d59f7826aa73c5e7682fbed6db38020635d466
Author: Bruce Allan <bruce.w.allan>
Date:   Fri Aug 8 18:36:11 2008 -0700

    e1000e: test for unusable MSI support

    Some systems do not like 82571/2 use of 16-bit MSI messages and some
    other systems claim to support MSI, but neither really works. Setup a
    test MSI handler to detect whether or not MSI is working properly, and
    if not, fallback to legacy INTx interrupts.

    Signed-off-by: Bruce Allan <bruce.w.allan>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher>
    Signed-off-by: Jeff Garzik <jgarzik>

This patch has been added to the e1000e driver for our RHEL5.3 update that will
be shipping quite soon. You can confirm the fix by trying kernels from here:

http://people.redhat.com/dzickus/el5/128.el5/

Please reopen this bug if testing of the kernels mentioned above does not yield
positive results. Thanks!

*** This bug has been marked as a duplicate of bug 436045 ***
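(For readers unfamiliar with the technique the commit describes, the general
shape of such an MSI self-test is roughly the following. This is an
illustrative sketch only, not the driver's actual code: msi_works() and
trigger_irq are made-up names, trigger_irq stands in for the device-specific
register write that raises an interrupt - e1000e writes the ICS register - and
the handler signature is the post-2.6.19 one; 2.6.18 handlers take an extra
pt_regs argument.)

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/delay.h>

static volatile int msi_test_fired;

static irqreturn_t msi_test_isr(int irq, void *data)
{
        msi_test_fired = 1;     /* proves the MSI message was delivered */
        return IRQ_HANDLED;
}

/* Returns 1 if MSI delivery works, 0 if the caller should use INTx. */
static int msi_works(struct pci_dev *pdev,
                     void (*trigger_irq)(struct pci_dev *))
{
        msi_test_fired = 0;

        if (pci_enable_msi(pdev))
                return 0;                       /* no usable MSI capability */

        if (request_irq(pdev->irq, msi_test_isr, 0, "msi-test", pdev)) {
                pci_disable_msi(pdev);
                return 0;
        }

        trigger_irq(pdev);      /* device-specific: force an interrupt */
        msleep(50);             /* allow the message time to arrive */

        free_irq(pdev->irq, pdev);
        if (!msi_test_fired) {
                pci_disable_msi(pdev);          /* broken: fall back to INTx */
                return 0;
        }
        return 1;
}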
Thanks for the prompt response. We'll test the pre-5.3 kernel and see how it goes. - Gilboa
No go. Seeing the same ("ethX: link is not ready") problem after the kernel upgrade.

- Gilboa
That's interesting. I'm a bit surprised that didn't work around it. Can you send me the output from the dmesg command that shows this doesn't work? I'd like all of it, too, so I can see the complete initialization of the hardware.
I would also be curious if your machine still booted and if the NIC in question works when adding pci=nomsi to the kernel command line.
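(For anyone following along: on RHEL 5 the option is added to the kernel line
in GRUB legacy. A sketch, assuming the stock /boot/grub/grub.conf layout:)

# /boot/grub/grub.conf -- append pci=nomsi to the kernel line, e.g.:
title Red Hat Enterprise Linux Server (2.6.18-92.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-92.el5 ro root=/dev/VolGroup00/LogVol00 pci=nomsi
        initrd /initrd-2.6.18-92.el5.img

# Or, for a single boot only: press 'e' at the GRUB menu, edit the
# kernel line, and boot with 'b'.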
OK. The latest kernel seems to give more information. As expected, MSI is acting up.

$ ifup eth0 && ifup eth1
$ dmesg
e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k4
e1000e: Copyright (c) 1999-2008 Intel Corporation.
PCI: Enabling device 0000:0b:00.0 (0440 -> 0443)
ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 52 (level, low) -> IRQ 74
PCI: Setting latency timer of device 0000:0b:00.0 to 64
eth0: (PCI Express:2.5GB/s:Width x4) 00:15:17:7c:e0:2a
eth0: Intel(R) PRO/1000 Network Connection
eth0: MAC: 1, PHY: 1, PBA No: d76489-001
PCI: Enabling device 0000:1a:00.0 (0440 -> 0443)
ACPI: PCI Interrupt 0000:1a:00.0[A] -> GSI 55 (level, low) -> IRQ 98
PCI: Setting latency timer of device 0000:1a:00.0 to 64
eth1: (PCI Express:2.5GB/s:Width x4) 00:15:17:23:51:7c
eth1: Intel(R) PRO/1000 Network Connection
eth1: MAC: 1, PHY: 1, PBA No: d76489-001
eth0: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth0: link is not ready
eth1: MSI interrupt test failed, using legacy interrupt.
ADDRCONF(NETDEV_UP): eth1: link is not ready
Gilboa, this is definitely an interesting problem. There are a few changes between the current version of e1000e (from F10) and what is in the e1000e driver in RHEL5.3. Nothing immediately stands out as the fix for this. Can I ask why you chose this particular NIC to be added to the system rather than using the on-board bnx2-based interfaces?
Andy,

The NICs are being used to passively monitor GbE links using dev_add_pack in
promisc mode. The Intel cards yield far better results than non-Intel ones
when monitoring multiple GbE ports.

In general, the ideal solution would have been to use the PCI-X version of
this card instead, but for some weird reason IBM only certifies the PCI-E
card... (Go figure.)

- Gilboa
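(For context, a dev_add_pack()-based tap looks roughly like the following - a
minimal, hypothetical sketch; binding to ETH_P_ALL delivers a clone of every
received frame to the handler:)

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

static int monitor_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
        /* Inspect/count the frame here; we own this clone, so free it. */
        kfree_skb(skb);
        return 0;
}

static struct packet_type monitor_pt = {
        .type = __constant_htons(ETH_P_ALL),    /* every protocol */
        .func = monitor_rcv,
        /* .dev = NULL means all interfaces; set it to tap a single NIC */
};

static int __init monitor_init(void)
{
        dev_add_pack(&monitor_pt);
        return 0;
}

static void __exit monitor_exit(void)
{
        dev_remove_pack(&monitor_pt);
}

module_init(monitor_init);
module_exit(monitor_exit);
MODULE_LICENSE("GPL");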
Please ignore my previous comments. I had a typo in the kernel command line (pci=nnomsi :() Device seems to work just fine with MSI disabled. We'll conduct some low-level testing and report back. Thanks! - Gilboa
Glad to hear the system is working with pci=nomsi on the command line. At least we know something works. :-) I'm a bit surprised that my test kernels that disable msi on the NIC don't work. All cases I've run into with the 8257x and friends that don't work with MSI are resolved by the patch that tests and then switches to INTx. Could you send me the output from lspci -t?
lspci -t
-+-[0000:19]---00.0-[0000:1a-1d]----00.0
 +-[0000:14]---00.0-[0000:15-18]--
 +-[0000:0f]---00.0-[0000:10-13]--
 +-[0000:0a]---00.0-[0000:0b-0e]----00.0
 +-[0000:06]-+-00.0
 |           \-01.0-[0000:07]--+-04.0
 |                             +-04.1
 |                             +-06.0
 |                             \-06.1
 +-[0000:02]---00.0
 +-[0000:01]-+-00.0
 |           +-01.0
 |           +-01.1
 |           \-02.0
 \-[0000:00]-+-00.0
             +-01.0
             +-03.0
             +-03.1
             +-03.2
             +-0f.0
             +-0f.1
             \-0f.3
This should be fixed with the latest kernel available from RHN. Can you verify that the latest kernels on RHN, or my test kernels at http://people.redhat.com/agospoda#rhel5, are fully functional? Thanks!
Should we enable msi? - Gilboa
Yes, I would try it with msi enabled. The changes that were made in the e1000e driver should now allow the card to work correctly on a system without pci=nomsi on the command-line.
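(A quick way to confirm whether the NIC actually came up with MSI: the
interrupt type shows directly in /proc/interrupts, and lspci shows the
capability state. Illustrative output - the exact wording varies by lspci
version, and the bus address below is this system's, not a constant:)

$ grep eth /proc/interrupts
 66:          0          0   ...   PCI-MSI          eth0
233:         29          0   ...   IO-APIC-level    eth2
# "PCI-MSI" entries are using MSI; "IO-APIC-level" entries are legacy INTx.

$ lspci -vv -s 0b:00.0 | grep -i 'message signal'
        Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
# "Enable+" means MSI is active on the device; "Enable-" means INTx.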
No go. Once we remove pci=nomsi, the e1000e devices no longer work.

- Gilboa
That is not at all what I expected. I will do some more looking and post a patch or comments.
I can confirm this too. I also have an x3850 and ran into the same problem:
e1000e was not working at all. The link was there, but I was not able to talk
to other devices on the local LAN.

I also tried your latest test kernel
(http://people.redhat.com/agospoda/rhel5/kernel-2.6.18-140.el5.gtest.70.x86_64.rpm)
and it is still not working without the pci=nomsi option. With that option it
is working fine.

Andy, you mentioned in one of your posts the onboard Broadcom card and the
bnx2 module. The onboard card in the x3850 is a NetXtreme, so the tg3 module
is used. The reason we added an additional card is that the application owner
wanted an EtherChannel with more than 2 GBit.

Anyway, the server where I had the problems will stay in test mode for another
2 weeks, so if you want I can still test something. But after 2 weeks, I need
to put that box into production.

Marco
Hitting the same problem in kernel-2.6.18-164.11.1
As I am investigating a related issue with the e1000e driver, I have noticed the following: on a system where MSI interrupts don't work, and the e1000e driver attempts to fallback to legacy interrupts, the driver won't work. However, if I force legacy interrupts myself by passing the e1000e driver the IntMode=0,0,0,0 option _after a reboot_, then it works fine. So I am under the impression that the e1000e driver's code that falls back to legacy interrupts doesn't really work and leaves the card in an incorrect state, while legacy interrupt support itself does work. I am curious if others would be able to reproduce this.
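(For anyone wanting to try the same experiment: IntMode is a per-port module
parameter of e1000e - 0 = legacy, 1 = MSI, 2 = MSI-X - so with four ports it
can be set as below; the modprobe.conf line makes it persistent on RHEL 5:)

$ modprobe -r e1000e
$ modprobe e1000e IntMode=0,0,0,0

# persistent variant, in /etc/modprobe.conf:
options e1000e IntMode=0,0,0,0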
(In reply to comment #24)
> So I am under the impression that the e1000e driver's code that falls back to
> legacy interrupts doesn't really work and leaves the card in an incorrect
> state, while legacy interrupt support itself does work.
>
> I am curious if others would be able to reproduce this.

I was able to reproduce the problem reported by this BZ (and I believe
mentioned by you in your comment) on an internal system (nec-em17), for which
'lspci -tv' produced:

> -[0000:00]-+-00.0  Intel Corporation 5000V Chipset Memory Controller Hub
>            +-02.0-[0000:01-08]--+-00.0-[0000:02-07]--+-00.0-[0000:03]--
>            |                    |                    \-02.0-[0000:07]--+-00.0  Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper)
>            |                    |                                      \-00.1  Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper)
>            |                    \-00.3-[0000:08]--
>            +-03.0-[0000:0c-0e]--+-00.0-[0000:0d]----0e.0  Promise Technology, Inc. 80333 [SuperTrak EX8350/EX16350], 80331 [SuperTrak EX8300/EX16300]
>            |                    \-00.2-[0000:0e]--
>            +-10.0  Intel Corporation 5000 Series Chipset FSB Registers
>            +-10.1  Intel Corporation 5000 Series Chipset FSB Registers
>            +-10.2  Intel Corporation 5000 Series Chipset FSB Registers
>            +-11.0  Intel Corporation 5000 Series Chipset Reserved Registers
>            +-13.0  Intel Corporation 5000 Series Chipset Reserved Registers
>            +-15.0  Intel Corporation 5000 Series Chipset FBD Registers
>            +-16.0  Intel Corporation 5000 Series Chipset FBD Registers
>            +-1c.0-[0000:12]--
>            +-1c.1-[0000:18]----00.0  Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1)
>            +-1d.0  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1
>            +-1d.1  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2
>            +-1d.2  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3
>            +-1d.3  Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4
>            +-1d.7  Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller
>            +-1e.0-[0000:19]--
>            +-1f.0  Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller
>            +-1f.1  Intel Corporation 631xESB/632xESB IDE Controller
>            \-1f.3  Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller

For debugging purposes, I added two printk()s to
e1000e_set_interrupt_capability(): one, marked 'A', just before the switch
statement, and the other, marked 'B', just before the last return statement. I
also added a debug function to msi_capability_init() that dumped out the
devices as it followed bus->parent. The bus number appears after
"msi_capability_init:".
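(A rough reconstruction of the debug walk just described - hypothetical code,
not the actual patch; pci_bus->parent, pci_bus->self and pci_dev->msi_enabled
are real fields in kernels of this era:)

#include <linux/pci.h>

/* Walk up the bridge chain from a device and report each bridge's
 * MSI state -- the pattern the msi_capability_init() output below
 * was produced with. */
static void dump_msi_path(struct pci_dev *pdev)
{
        struct pci_bus *bus;

        for (bus = pdev->bus; bus; bus = bus->parent) {
                struct pci_dev *bridge = bus->self;     /* NULL at the root */

                printk(KERN_DEBUG "bus %02x: flags=0x%x self=%s msi_enabled=%d\n",
                       bus->number, (unsigned int)bus->bus_flags,
                       bridge ? pci_name(bridge) : "(root)",
                       bridge ? bridge->msi_enabled : -1);
                if (!bridge)
                        break;
        }
}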
The following is a small portion pulled from dmesg:

> e1000e_set_interrupt_capability:A: int_mode=0x2
>
> msi_capability_init:7: pci_dev=0xffff88012b08d000 subord=0x(null) hdr_type=0x0 vendor=0x8086 device=0x1096 msi_enabled=1
> msi_capability_init:7: pci_bus=0xffff88012bbe1c00 parent=0xffff88012bbe1400 primary=2 secondary=7 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:2: pci_dev=0xffff88012b08b000 subord=0xffff88012bbe1c00 hdr_type=0x1 vendor=0x8086 device=0x3518 msi_enabled=1
> msi_capability_init:2: pci_bus=0xffff88012bbe1400 parent=0xffff88012bbe1000 primary=1 secondary=2 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:1: pci_dev=0xffff88012b088000 subord=0xffff88012bbe1400 hdr_type=0x1 vendor=0x8086 device=0x3500 msi_enabled=0
> msi_capability_init:1: pci_bus=0xffff88012bbe1000 parent=0xffff88012bbe0c00 primary=0 secondary=1 PCI_BUS_FLAGS_NO_MSI=0x0
>
> msi_capability_init:0: pci_dev=0xffff88012b009000 subord=0xffff88012bbe1000 hdr_type=0x1 vendor=0x8086 device=0x25f7 msi_enabled=1
> msi_capability_init:0: pci_bus=0xffff88012bbe0c00 parent=0x(null) primary=0 secondary=0 PCI_BUS_FLAGS_NO_MSI=0x0
>
> e1000e_set_interrupt_capability:B: int_mode=0x1

When e1000_test_msi_interrupt()/e1000_test_msi() tries to switch back to
legacy interrupts, it only deals with the e1000e device and none of the
bridges. So the only device in the above list that gets its msi_enabled flag
set to 0 is the e1000e. (Note that the bridge on bus 1 had msi_enabled set to
0 all along.)

I've verified that this is a problem on RHEL5 and RHEL6, but I haven't
verified whether it fails on RHEL4.9, though from just looking at the code it
should. I've also not verified whether upstream has this issue, but the code
is pretty much the same as RHEL6, so I suspect upstream hasn't addressed this
problem either.

I'm in the process of looking for a solution, but haven't been actively
pursuing one for the last few months. This area of the kernel/hardware is new
to me, so there are a number of areas I need to become familiar with, like how
interrupts are propagated through bridges, and the impact of switching a
bridge from MSI to legacy on any of the other devices that may be on its bus.
Dean, I'm not sure if you're on the right track. Even in the case where legacy interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot), the bridge settings don't change, MSI support is still enabled on the bridge. I don't think MSI settings need to be the same on the e1000e device and the bridge. But then again I am no expert in the area.
(In reply to comment #27)
> Dean, I'm not sure if you're on the right track. Even in the case where legacy
> interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot),
> the bridge settings don't change, MSI support is still enabled on the bridge.
> I don't think MSI settings need to be the same on the e1000e device and the
> bridge.
>
> But then again I am no expert in the area.

Jean, at the moment, I'm inclined to agree with... Back in mid-February I'd
run on two systems provisioned with RHEL6, nec-em17 (mentioned above in
comment #25) and hp-z200-02. The latter system looked as follows (notice that
it lacks the multiple bridges of nec-em17):

> [root@hp-z200-02 ~]# lspci -tv
> -[0000:00]-+-00.0  Intel Corporation Core Processor DRAM Controller
>            +-02.0  Intel Corporation Core Processor Integrated Graphics Controller
>            +-04.0  Intel Corporation Core Processor Thermal Management Controller
>            +-19.0  Intel Corporation 82578DM Gigabit Network Connection
>            +-1a.0  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.1  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.2  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1a.7  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
>            +-1b.0  Intel Corporation 5 Series/3400 Series Chipset High Definition Audio
>            +-1c.0-[18]--
>            +-1c.4-[24]--
>            +-1d.0  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.1  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.2  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.3  Intel Corporation 5 Series/3400 Series Chipset USB Universal Host Controller
>            +-1d.7  Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
>            +-1e.0-[10]--
>            +-1f.0  Intel Corporation 5 Series/3400 Series Chipset LPC Interface Controller
>            \-1f.2  Intel Corporation 82801 SATA RAID Controller
> [root@hp-z200-02 ~]#

It should be noted that in my analysis, I had forced the
e1000_msi_test_interrupt() to fail by ensuring that FLAG_MSI_TEST_FAILED
remained set in the adapter->flags. This caused e1000_msi_test() to switch
from MSI to legacy interrupts. The following is from the dmesg output of one
of my runs on hp-z200-02:

> eth0: MSI interrupt test failed!
> eth0: MSI interrupt test failed, using legacy interrupt.
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> eth0: no IPv6 routers present

Because I could ssh into this system after it had booted (and the ssh session
seemed quite normal and responsive), and because things weren't so good on
nec-em17, I was led to consider that something with the bridges was at issue.

But this morning I tried doing the same experiment on hp-z200-02, and
discovered I couldn't even ping the system. From the serial console I saw
that the output of dmesg had:

> 0000:00:19.0: eth0: MSI interrupt test failed!
> 0000:00:19.0: eth0: MSI interrupt test failed, using legacy interrupt.
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> ADDRCONF(NETDEV_UP): eth0: link is not ready

And ifconfig yielded:

> [root@hp-z200-02 test]# ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:1A:4B:0C:29:05
>           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>           Memory:e0400000-e0420000
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:7 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:576 (576.0 b)  TX bytes:576 (576.0 b)
> [root@hp-z200-02 test]#

This morning I also tried a few other systems, all with the same result. So,
something has definitely changed from mid-February to now. And like you, I'm
no longer thinking I was on the right track. I need to do some more
investigation. Did I mention that all of this is new to me? :)
(In reply to comment #28)
> (In reply to comment #27)
> > Dean, I'm not sure if you're on the right track. Even in the case where legacy
> > interrupts work for me (loading e1000e with IntMode=0,0,0,0 after a reboot),
> > the bridge settings don't change, MSI support is still enabled on the bridge.
> > I don't think MSI settings need to be the same on the e1000e device and the
> > bridge.
> >
> > But then again I am no expert in the area.
>
> Jean, at the moment, I'm inclined to agree with...

BTW, the "inclined to agree with..." didn't come out quite right. It was
supposed to be "inclined to agree with you..." and was not meant in reference
to your last sentence, but rather to your first sentence. Sorry for the
confusion or insult this may have led to.
No problem, I didn't feel offended at all :D

I too hacked e1000_msi_test_interrupt() to simulate an MSI interrupt failure
and force the fallback to legacy interrupts, on my laptop. Note that the
network failure doesn't necessarily trigger immediately for me. The only
immediate sign that something is wrong is a ping much higher and less stable
than usual (in the 1-16 ms range, sawtooth style, when it is normally in the
0-2 ms range). The actual network failure can happen after a few minutes, or
an hour.
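(For anyone wanting to reproduce this without affected hardware, the hack
both of us used amounts to keeping the test ISR from clearing the failure
flag. A sketch against the driver source of that era - treat the exact names
and context lines as approximate:)

--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ static irqreturn_t e1000_intr_msi_test(int irq, void *data)
 	if (icr & E1000_ICR_RXSEQ) {
-		adapter->flags &= ~FLAG_MSI_TEST_FAILED;
+		/* leave FLAG_MSI_TEST_FAILED set so e1000_test_msi()
+		 * always takes the legacy-interrupt fallback path */
 	}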
After some investigation, I believe I've found the problem reported in this BZ
- at least it's the one I've been seeing running RHEL5.6, RHEL6.0 and
net-next-2.6 on a number of different systems. A description of the problem
can be seen in the following patch I've posted to the netdev mailing list:

http://patchwork.ozlabs.org/patch/56224/

I've also updated my test kernel rpms to include this patch, and they can be
found under RHEL5 Test Packages at:

http://people.redhat.com/dnelson/#rhel5

If you've been seeing this problem, please install and test one of the rpms.
And if you do, please report back whether the problem has been resolved or
not.

Thanks,
Dean
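(If you test, install the rpm side-by-side rather than upgrading, so the
known-good kernel stays bootable; generic rpm usage, with the exact package
name substituted:)

$ rpm -ivh kernel-2.6.18-<version>.x86_64.rpm  # -i adds a new boot entry; -U would replace the current kernel package
$ reboot                                       # then pick the test kernel from the GRUB menu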
Wow, excellent job, Dean. I've been looking at this code for hours and couldn't see the bug. Of course, now that you explained it, it is pretty clear... As I had 2 machines I was able to reproduce this bug on, I'll try your patch immediately and report the results.
Patch tested successfully on both machines I had that were exhibiting the problem. Great job, thanks again!
Both bug 496127 and this bug are reporting the same problem. Since that one has ACKs, blocker bugs and Issue-Tracker attachments, and this bug has none of those things, I'll mark this bug a duplicate of bug 496127.

*** This bug has been marked as a duplicate of bug 496127 ***
Trying again to close this bug as a duplicate of bug 496127. It seems there was a timing issue: the adding of comment #36 collided with the adding of comment #35, and with everything each of those included.

*** This bug has been marked as a duplicate of bug 496127 ***