Bug 242572
Description
Randy
2007-06-04 20:45:27 UTC
Same issue as me: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242301

*** Bug 242357 has been marked as a duplicate of this bug. ***

*** Bug 242301 has been marked as a duplicate of this bug. ***

So, John, can we get the priority moved to high/urgent, as this issue renders my laptop totally unusable unless I'm sitting next to a router/switch?

So did somebody just change this back to medium priority? I also had a comment with a log dump of the error.

Oops, I can also tell you the .3212 fc8 kernel displays the same issues...

Randy, can you please change the priority on this to Urgent/High? This bug is causing serious issues, especially for remote management of servers using RTL8169 chipsets, or laptops with RTL8169 chipsets that are used to manage these remote devices. Here is a dump of the error. Sorry for being pushy, but it just blew our testbed, and guess who they're pointing fingers at?

Here is a log dump of the error:

Jun 2 20:13:52 localhost kernel: r8169: eth0: link up
Jun 2 20:14:10 localhost kernel: r8169: eth0: link up
Jun 2 20:14:10 localhost kernel: BUG: soft lockup detected on CPU#0!
Jun 2 20:14:10 localhost kernel: [<c0451f3e>] softlockup_tick+0xa5/0xb4
Jun 2 20:14:10 localhost kernel: [<c042e930>] update_process_times+0x3b/0x5e
Jun 2 20:14:10 localhost kernel: [<c043d2bd>] tick_sched_timer+0x78/0xbb
Jun 2 20:14:10 localhost kernel: [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jun 2 20:14:10 localhost kernel: [<c043d245>] tick_sched_timer+0x0/0xbb
Jun 2 20:14:10 localhost kernel: [<c0408534>] timer_interrupt+0x2c/0x32
Jun 2 20:14:10 localhost kernel: [<c04521aa>] handle_IRQ_event+0x1a/0x3f
Jun 2 20:14:10 localhost kernel: [<c04535ea>] handle_level_irq+0x81/0xc7
Jun 2 20:14:10 localhost kernel: [<c04072c7>] do_IRQ+0xb8/0xd1
Jun 2 20:14:10 localhost kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jun 2 20:14:10 localhost kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jun 2 20:14:10 localhost kernel: [<c0561704>] yenta_interrupt+0x13/0xb4
Jun 2 20:14:10 localhost kernel: [<c04521aa>] handle_IRQ_event+0x1a/0x3f
Jun 2 20:14:10 localhost kernel: [<c04535ea>] handle_level_irq+0x81/0xc7
Jun 2 20:14:10 localhost kernel: [<c0453569>] handle_level_irq+0x0/0xc7
Jun 2 20:14:10 localhost kernel: [<c04072bb>] do_IRQ+0xac/0xd1
Jun 2 20:14:10 localhost kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jun 2 20:14:10 localhost kernel: [<c042b2dc>] __do_softirq+0x54/0xba
Jun 2 20:14:10 localhost kernel: [<c04071b7>] do_softirq+0x59/0xb1
Jun 2 20:14:10 localhost kernel: [<c0453569>] handle_level_irq+0x0/0xc7
Jun 2 20:14:10 localhost kernel: [<c042b194>] irq_exit+0x38/0x6b
Jun 2 20:14:10 localhost kernel: [<c04072cc>] do_IRQ+0xbd/0xd1
Jun 2 20:14:10 localhost kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jun 2 20:14:10 localhost kernel: [<f8b0007b>] rtl8169_init_one+0x5c7/0x9d7 [r8169]
Jun 2 20:14:10 localhost kernel: [<c060171d>] _spin_unlock_irqrestore+0x8/0x9
Jun 2 20:14:10 localhost kernel: [<f8aff1f7>] rtl8169_open+0x139/0x194 [r8169]
Jun 2 20:14:10 localhost kernel: [<c05a2f8d>] dev_open+0x2b/0x62
Jun 2 20:14:10 localhost kernel: [<c05a19e1>] dev_change_flags+0x47/0xe4
Jun 2 20:14:10 localhost kernel: [<c05a977b>] rtnl_setlink+0x264/0x365
Jun 2 20:14:10 localhost kernel: [<c05a9517>] rtnl_setlink+0x0/0x365
Jun 2 20:14:10 localhost kernel: [<c05a8dad>] rtnetlink_rcv_msg+0x1c1/0x1e6
Jun 2 20:14:10 localhost kernel: [<c05b4e19>] netlink_run_queue+0x50/0xbe
Jun 2 20:14:10 localhost kernel: [<c05a8bec>] rtnetlink_rcv_msg+0x0/0x1e6
Jun 2 20:14:10 localhost kernel: [<c05a8bab>] rtnetlink_rcv+0x25/0x3d
Jun 2 20:14:10 localhost kernel: [<c05b51b6>] netlink_data_ready+0x12/0x4c
Jun 2 20:14:10 localhost kernel: [<c05b426a>] netlink_sendskb+0x19/0x30
Jun 2 20:14:10 localhost kernel: [<c05b5198>] netlink_sendmsg+0x277/0x283
Jun 2 20:14:10 localhost kernel: [<c0599180>] sock_sendmsg+0xd0/0xeb
Jun 2 20:14:10 localhost kernel: [<c0436e71>] autoremove_wake_function+0x0/0x35
Jun 2 20:14:10 localhost kernel: [<c0436e71>] autoremove_wake_function+0x0/0x35
Jun 2 20:14:10 localhost kernel: [<c04e7100>] copy_from_user+0x3a/0x66
Jun 2 20:14:10 localhost kernel: [<c059932d>] sys_sendmsg+0x192/0x1f7
Jun 2 20:14:10 localhost kernel: [<c0599e0d>] sys_recvmsg+0x1b9/0x1cd
Jun 2 20:14:10 localhost kernel: [<c04e7350>] copy_to_user+0x3c/0x50
Jun 2 20:14:10 localhost kernel: [<c0599c3c>] move_addr_to_user+0x50/0x68
Jun 2 20:14:13 localhost kernel: [<c059a0d6>] sys_getsockname+0x9f/0xb0
Jun 2 20:14:13 localhost kernel: [<c06016f4>] _spin_lock_bh+0x8/0x18
Jun 2 20:14:13 localhost kernel: [<c059adb6>] release_sock+0x12/0x9d
Jun 2 20:14:13 localhost kernel: [<c059a4fc>] sys_socketcall+0x240/0x261
Jun 2 20:14:13 localhost kernel: [<c0404f70>] syscall_call+0x7/0xb
Jun 2 20:14:13 localhost kernel: =======================
Jun 2 20:14:13 localhost kernel: r8169: eth0: link down

Andy, let me know if you need help locating suitable hardware...

What device do you have? I've got one of these:

05:04.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
    Subsystem: Netgear Unknown device 311a
    Flags: bus master, 66MHz, medium devsel, latency 66, IRQ 19
    I/O ports at 2000 [size=256]
    Memory at f2004800 (32-bit, non-prefetchable) [size=256]
    [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2

05:04.0 0200: 10ec:8169 (rev 10)
    Subsystem: 1385:311a
    Flags: bus master, 66MHz, medium devsel, latency 66, IRQ 19
    I/O ports at 2000 [size=256]
    Memory at f2004800 (32-bit, non-prefetchable) [size=256]
    [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2

and it seems to boot both with and without the cable connected. Can you give me some more specific information about the Dell System you are using as well.

I'm also attaching a test patch, to see if that might help out. I'm doubtful that it will, but it might be worth trying.

Created attachment 156603 [details]
r8169-irq-reorder.patch
test patch -- untested as I cannot reproduce the problem on my system
Here's my device, a Sager laptop. There has never been any issue with Fedora 4, 5 or 6, only with 7. So how might the patch be applied? There is another bug report, "eth0 boot fail", with the same issues and the RTL8169 mentioned.

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
    Subsystem: CLEVO/KAPOK Computer Unknown device 0470
    Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 10
    I/O ports at a200 [size=256]
    Memory at d0008000 (32-bit, non-prefetchable) [size=256]
    [virtual] Expansion ROM at 88000000 [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2

OK, I just found a little bit more. When booting Fedora 7, if I select "I" for interactive mode, when it gets to the line:

Start Service Network Y/N (C)ontinue

I select Yes and an error happens:

FATAL: Module not found
(Next Line) Bringing up loopback interface (OK)
FATAL: Module not found

But the computer boots without plugging the ethernet cable in...

But again starting with the cable plugged, I go to "I"nteractive mode, and it gets to the line:

Start NetworkManagerDispatcher and it hangs until I disconnect the cable, it will probably get past this part if I shut off the Service for Dispatcher

I found a driver for the 8169 chipset on the Realtek website that was released May 23, but when I try to compile it, I get a few errors. You can find the driver here:

http://www.realtek.com.tw/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=4&Level=5&Conn=4&DownTypeID=3&GetDown=false&Downloads=true#5,7,8,10,982

Can someone try to install it?

I'm currently installing a machine that exhibits the problem. I didn't know that it made a difference if you (re)plug the cable, but found that after loading "fail-safe", i.e. very conservative, BIOS settings, there's a 50/50 chance that it works instead of hanging. Once it's installed, I can try the patch above and see if it makes a difference.
BTW, my card is in "lspci" (transcribed from the installer shell):

00:09.0 Ethernet controller: D-Link System Inc DGE-528T Gigabit Ethernet Adapter (rev 10)

and "lspci -n":

00:09.0 0200: 1186:4300 (rev 10)

I would like to try the patch above, but do not know how to install it. Can someone instruct me on how to do so?

(In reply to comment #13)
> But again starting with the cable plugged, I go to "I"nteractive mode, and it
> gets to the line:
>
> Start NetworkManagerDispatcher and it hangs until I disconnect the cable, it
> will probably get past this part if I shut off the Service for Dispatcher

Interesting. NetworkManager should poll to determine if there is link on the system, so if checking for link never returns then this might explain the hang. I'd be curious to know if the system boots ok without NetworkManager enabled, and then what happens when calling `ethtool eth0` and `mii-tool eth0`, to see if that call ever returns.

(In reply to comment #15)
> I'm currently installing a machine that exhibits the problem. I didn't know
> that it made a difference if you (re)plug the cable, but found that after
> loading "fail-safe" i.e. very conservative BIOS settings, there's a 50/50 chance
> that it works instead of hanging. Once it's installed, I can try the patch above
> and see if it makes a difference.

Great info, Nils! I still cannot reproduce this, but it seems the only r8169-based card I have isn't the same as everyone else's.

(In reply to comment #17)
> I would like to try the patch above, but do not know how to install it. Can
> someone instruct me on how to do so?

You will need to download and install the necessary SRPM (usually 'src' in place of 'i386' or 'x86_64' as the arch) and build a new rpm using the rpmbuild command. A guide to using rpm/rpmbuild can be found here:

http://www.redhat.com/magazine/002dec04/features/betterliving-part2/

But what you basically want to do is:

1. install the source rpm
2. copy the attachment from comment #10 to the file called linux-kernel-test.patch (should be in /usr/src/redhat/SOURCES)
3. modify the kernel-2.6.spec file (in /usr/src/redhat/SPECS) and remove the '#' from the line that says #%define buildid .local
4. type `rpmbuild -ba kernel-2.6.spec`
5. wait for a while
6. go get your new rpms from the /usr/src/redhat/RPMS directory and install them.

I would *strongly* recommend reading up on rpms first, but those commands should get you going.

(In reply to comment #18)
> (In reply to comment #13)
> > But again starting with the cable plugged, I go to "I"nteractive mode, and it
> > gets to the line:
> >
> > Start NetworkManagerDispatcher and it hangs until I disconnect the cable, it
> > will probably get past this part if I shut off the Service for Dispatcher
> Interesting. NetworkManager should poll to determine if there is link on the
> system so if checking for link never returns then this might explain the hang.
> I'd be curious to know if the system boots ok without NetworkManager enabled
> and then what happens when calling `ethtool eth0` and `mii-tool eth0` to see if
> that call ever returns.

I can confirm that when the NetworkManager and NetworkManagerDispatcher services are disabled, the computer will boot fine with the cable plugged in, and it still boots when the cable is disconnected.

With the services disabled, I can open a terminal window and type:
"service NetworkManager start"
and it will verify it with the "OK"
and the system will hang, which is cleared by plugging the cable in or disconnecting the cable as the case may be....

(In reply to comment #21)
>
> I can confirm that when NetworkManager and NetworkManagerDispatcher services
> are disabled, the computer will boot fine with the cable plugged in or whether
> the cable is disconnected it still boots.
>
> With the services disabled, I can open a terminal window and type:
> "service NetworkManager start"
> and it will verify it with the "OK"
> and the system will hang which is cleared by plugging the cable in or
> disconnecting the cable as the case may be....
>

Thanks for the feedback!

With NetworkManager disabled, can you also try to run these two commands:

# ethtool eth0
# mii-tool eth0

And let me know if the box hangs?

I'll start looking at the NetworkManager sources and see what ioctl's/sysfs entries might be getting called.

Hmm, with the test patch networking doesn't work at all, not even with replugging the cable.

Well that's not good.... Sounds like we might have this narrowed down to something in NetworkManager anyway, so it probably isn't an initialization problem.

Created attachment 156759 [details]
NetworkManager-0.6.5-5.fc7.linktest.src.rpm
I'm not sure this will make a difference, but I did pull 2 patches from
NetworkManager to see if they might help.
I'd be pretty surprised if this made a difference, but it might be worth a try
to build it and try it.
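
Step 3 of the kernel rebuild instructions above (removing the '#' from the buildid line in the spec file) can also be done non-interactively. The following is only a minimal sketch, assuming GNU sed and the spec path quoted in the thread; the function name enable_buildid is made up for illustration and is not part of any Fedora tool.

```shell
#!/bin/sh
# Sketch: enable the local build id in a kernel spec file, as described in
# step 3 of the rebuild instructions. enable_buildid is a hypothetical helper.

enable_buildid() {
    # $1: path to the spec file to edit in place (requires GNU sed for -i)
    spec="$1"
    # Turn "#%define buildid .local" into "%define buildid .local".
    sed -i 's/^#%define buildid \.local/%define buildid .local/' "$spec"
}

# Example, using the paths given in the thread:
#   enable_buildid /usr/src/redhat/SPECS/kernel-2.6.spec
#   rpmbuild -ba /usr/src/redhat/SPECS/kernel-2.6.spec
```

Editing the spec with sed rather than by hand makes the change easy to repeat when several test builds are needed.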
(In reply to comment #22)
> (In reply to comment #21)
> >
> > I can confirm that when NetworkManager and NetworkManagerDispatcher services
> > are disabled, the computer will boot fine with the cable plugged in or whether
> > the cable is disconnected it still boots.
> >
> > With the services disabled, I can open a terminal window and type:
> > "service NetworkManager start"
> > and it will verify it with the "OK"
> > and the system will hang which is cleared by plugging the cable in or
> > disconnecting the cable as the case may be....
> >
>
> Thanks for the feedback!
>
> With NetworkManager disabled, can you also try to run these two commands:
>
> # ethtool eth0
> # mii-tool eth0
>
> And let me know if the box hangs?
>
> I'll start looking at the NetworkManager sources and see what ioctl's/sysfs
> entries might be getting called.

OK, with the NM services disabled and the cable disconnected, running ethtool and mii-tool I get this without any hangs:

(ethtool)
Settings for eth0:
    Supported ports: [ TP ]
    Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Half
    Port: Twisted Pair
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000033 (51)
    Link detected: yes

(mii-tool)
SIOCGMIIPHY on 'eth0' failed: No such device

Now with NM services enabled and the cable connected:

(ethtool)
Settings for eth0:
    Supported ports: [ TP ]
    Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 100Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000033 (51)
    Link detected: yes

(mii-tool)
eth0: negotiated 100baseTx-FD flow-control, link ok

One thing I did notice was that ethtool is displaying "Link detected: yes" when the cable is disconnected.

(In reply to comment #25)
> Created an attachment (id=156759)
> NetworkManager-0.6.5-5.fc7.linktest.src.rpm
>
> I'm not sure this will make a difference, but I did pull 2 patches from
> NetworkManager to see if they might help.
>
> I'd be pretty surprised if this made a difference, but it might be worth a try
> to build it and try it.

I'm getting an error "cannot install source packages"

re comment #24: I don't think it's something in NetworkManager, as I don't use it on the machine. In fact, I saw it first while installing the machine, both when trying to use DHCP and static IP. Now the machine is set up for DHCP on bootup. Sorry for not saying this in so many words earlier...

I'm leaning towards something other than NetworkManager also. I just turned off the NM services, changed the network device configuration to "activate on boot", shut down the computer, plugged in the cable and turned the computer on. The computer booted fine and had network connectivity. Tried it with the same settings BUT with the cable disconnected, and it hung on "starting network" and wouldn't continue until I plugged the cable in.

Tried the same scenario, but this time using the wireless and settings of "activate on boot". It hesitated on "starting network" and jumped to the line by line activation sequence, BUT it didn't hang the computer; it just said FAILED and continued on to login. I am using static IP addressing and haven't tried with DHCP.

(In reply to comment #27)
>
> I'm getting an error "cannot install source packages"
>

Check to make sure you have the 'rpmbuild' rpm installed and that the directory /usr/src/redhat exists.

Thanks for the feedback Nils (comment #28) and Mike (comment #29). Did either of you notice that this was a hard lockup?
Could you switch to another vty with ctrl+alt+F2/F3/etc? Also, does anyone happen to remember what the latest working fc6 kernel might have been on their system? If you can verify with some reliability that kernel-2.6.20-1.2952.fc6 worked well on your system, then it will help me narrow down the differences between that last working kernel and the current one.

It worked fine with release "kernel-2.6.20-1.2952.fc6". I was updated all the way until Fedora 7 final.

Well, there appear to be 5 patches that directly affect r8169 that have been pushed upstream between kernel-2.6.20-1.2952.fc6 and kernel-2.6.21-1.3194.fc7. Those changes are:

commit 1371fa6db0bbb8e23f988a641f5ae7361bc629dd
Author: Francois Romieu <romieu.com>
Date: Mon Apr 2 23:01:11 2007 +0200

    r8169: fix suspend/resume for down interface

    The PM hooks are no-op if the r8169 interface is down (i.e. !IFF_UP).
    However, as the chipset is enabled, the device will not work after a
    suspend/resume cycle.

    The patch always issue the required PCI suspend sequence and removes
    the module unload/reload workaround.

    Signed-off-by: Arnaud Patard <apatard>
    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date: Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized

    The irq handler schedules a NAPI poll request unconditionally as soon
    as the status register is not clean. It has been there - and wrong -
    for ages but a recent timing change made it apparently easier to
    trigger.

    Signed-off-by: Francois Romieu <romieu.com>
    Cc: Jay Cliburn <jacliburn>
    Signed-off-by: Jeff Garzik <jeff>

commit 2efa53f373ed811d4860904f5205b8a3b376e253
Author: Francois Romieu <romieu.com>
Date: Fri Mar 9 00:00:05 2007 +0100

    r8169: fix a race between PCI probe and dev_open

    Initialize the timer with the rest of the private-struct.

    Signed-off-by: Jeff Garzik <jeff>
    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit 9e0db8ef4a8c8fd6f3a506259975d7f8db962421
Author: Francois Romieu <romieu.com>
Date: Thu Mar 8 23:59:54 2007 +0100

    r8169: revert bogus BMCR reset

    Added during bf793295e1090af84972750898bf8470df5e5419

    The current code requests a reset but prohibits autoneg, 1000 Mb/s,
    100 Mb/s and full duplex. The 8168 does not like it at all.

    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit eb2a021c4710b98081daa797d5a729ac23c240cd
Author: Francois Romieu <romieu.com>
Date: Thu Feb 15 23:37:21 2007 +0100

    r8169: RTNL and flush_scheduled_work deadlock

    flush_scheduled_work() in net_device->close has a slight tendency to
    deadlock with tasks on the workqueue that hold RTNL.

    rtl8169_close/down simply need the recovery tasks to not meddle with
    the hardware while the device is going down.

    Signed-off-by: Francois Romieu <romieu.com>
    Signed-off-by: Jeff Garzik <jeff>

commit dcb92f8804717b845db70939b523c5d152a2e0ea
Author: Al Viro <viro.org.uk>
Date: Fri Feb 9 16:39:00 2007 +0000

    [PATCH] uintptr_t is unsigned long, not u32

    Signed-off-by: Al Viro <viro.org.uk>
    Signed-off-by: Linus Torvalds <torvalds>

None of them jump out to me as the obvious cause of this regression, but it seems unlikely that the problem comes from:

commit 1371fa6db0bbb8e23f988a641f5ae7361bc629dd
Author: Francois Romieu <romieu.com>
Date: Mon Apr 2 23:01:11 2007 +0200

    r8169: fix suspend/resume for down interface

(since we aren't doing suspend/resume in these cases)

commit 2efa53f373ed811d4860904f5205b8a3b376e253
Author: Francois Romieu <romieu.com>
Date: Fri Mar 9 00:00:05 2007 +0100

    r8169: fix a race between PCI probe and dev_open

(this fix prevents a panic -- it should not cause a hang)

commit 9e0db8ef4a8c8fd6f3a506259975d7f8db962421
Author: Francois Romieu <romieu.com>
Date: Thu Mar 8 23:59:54 2007 +0100

    r8169: revert bogus BMCR reset

(hardware specific call that was seen as an error previously)

commit dcb92f8804717b845db70939b523c5d152a2e0ea
Author: Al Viro <viro.org.uk>
Date: Fri Feb 9 16:39:00 2007 +0000

    [PATCH] uintptr_t is unsigned long, not u32

(this is a simple fix that should not cause our problem)

IF I ruled these out correctly, then we are only left with these 2 patches as the possible problems:

commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date: Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized

commit eb2a021c4710b98081daa797d5a729ac23c240cd
Author: Francois Romieu <romieu.com>
Date: Thu Feb 15 23:37:21 2007 +0100

    r8169: RTNL and flush_scheduled_work deadlock

I'm leaning towards the first one being the issue, since that patch reordered when we first expect to start getting interrupts, but the second one could be an issue as well if there is some bad interaction with the linkwatch code.

Is there any way that someone with the hardware could do a few builds and try to back out these 2 patches (maybe one at a time) and see if it makes a difference?

Created attachment 156815 [details]
/tmp/eb2a021c4710b98081daa797d5a729ac23c240cd.patch
Created attachment 156816 [details]
/tmp/99f252b097a3bd6280047ba2175b605671da4a23.patch
Created attachment 156817 [details]
f6-f7.patch
differences between fc6 and f7 for r8169 driver
You can forget Rolf's problem: it is a different one.

-- Ueimor

OK Andy, I just tried the patches from comments 35, 36 and 37. Long story short, no change in the hang.

1st. Installed comment 35 patch and rebooted, no change
2nd. Removed comment 35 patch, installed comment 36 patch and rebooted, no change
3rd. Installed comment 35 and 36 patches and rebooted, no change
4th. Removed comment 35 and 36 patches, installed comment 37 patch and rebooted, no change....

Just wanted to confirm that the latest FC6 kernel works here as well.

Hmmm, I see lots of NetworkManager here, and in my case it's not NM that hangs the deal. The system will freeze if the eth0 interface is brought up during the BOOT process here, regardless of whether I'm using NM or not. The only remedy I found to be able to boot into X was to untick the "load at system startup" check box in the network settings for eth0. Once I do this, the interface is ignored and the system boots normally. *Now* once I boot and I'm in X, I can bring up a terminal, type "ifup eth0" and everything works like there was no problem EVER.

Note1: The cable is plugged in through the whole process. No tampering with it.
Note2: I, too, verify that everything works perfectly (with this NIC) in Fedora 6.
Note3: For the record, a cheap lame NIC will work perfectly on boot or any other instance.

I hope it helps.

I found this bug after just doing a fresh install on a machine. In reading through, it seemed there was some confusion as to whether it would work on boot with the cable plugged in or unplugged, or whether it could bring up the ethernet card at boot at all. I started testing, and I found some interesting results:

1.) Started PC with network cable unplugged. The system froze at Starting Networking. I plugged in the network cable. The computer resumed the boot process. ifconfig revealed that eth0 had come up and successfully grabbed an IP address from my dhcp server.

2.) Started PC with network cable plugged in. The system froze at Starting Networking. I removed the network cable. The computer resumed the boot process. Logged in and issued "service network restart", and eth0 came up and successfully grabbed an IP address from my dhcp server.

3.) Started the PC with the network cable plugged in. The system froze at Starting Networking. I unplugged the network cable. As soon as the connectivity light switched off, I put the cable back in as quickly as possible. The computer resumed the boot process. ifconfig revealed that eth0 had come up and successfully grabbed an IP address from my dhcp server.

So it seems that the card is waiting for some kind of status, and it can't get it until there is a change in connectivity. It doesn't seem to matter whether the cable is already plugged in or removed; it's the act of plugging it in or removing it. As soon as it gets going, it doesn't matter if it's plugged in or not. Since it takes a while for the network to time out, there is plenty of time to plug the cable back in and get an IP address after removing it. I've now performed the actions outlined in step 3 at least 6 times successfully.

Any update on this bug? I still have two machines doing this.

It seems this issue has fallen by the wayside....

Sorry this issue hasn't been worked on for the last few days. I was out of town and unable to look at it further until today (I JUST got back). I also have been unable to locate hardware that can actually reproduce this issue, so I'd rather not just toss out any old idea to solve the problem and make everyone else test this for me.

If a kernel with the patch from comment #36 backed out (applied with -R) still showed the same problem, I certainly question how this could be driver related, since the code produced would be exactly the same as what was shipped with the latest FC6 kernel.

Has anyone tried a forced install of the FC6 kernel on their F7 box and indicated whether or not they see the same failure? Also, when the system locks up, is the console still responsive (like can you type <ENTER> and see that the screen scrolls)?

Hi Andy, I have force-installed the last FC6 kernel and the machine fires up perfectly, no more pulling ethernet cables! When it locks up on the F7 kernel it's locked up COMPLETELY, I mean even the number lock and caps lock are frozen! If you see my bug # 242181, which was around longer than this bug, I already tried the FC6 kernel on the 18th of this month. Cheers, David

Hi Andy, I concur with David, forcing the FC6 kernel seems to alleviate the issues with lockups. The kernel I have now installed is: kernel-2.6.20-1.2948.fc6.i686.rpm Mike....

Hi Mike, I tried the latest FC6 kernel 2952 and it was fine.

Hi Andy, there are a number of bugs with the F7 kernel. I have a bug open because it also won't scan with my scanner (bug # 243953), and the last F7 kernel won't start X Windows on another machine as it's messing up the PCI cards badly (bug # 242391).

Hello Andy! Since I am fairly new to Linux (and FC) I don't know how to force install the FC6 kernel (I tried installing FC6 altogether and it works fine), but I can tell you that the system hangs and freezes during startup, so basically there's no console yet. If I boot with interactive boot and I don't bring up eth0, everything boots fine, and then when I'm logged on, I can just go to a console and "ifup eth0" and everything will work perfectly without messing with the -already connected- cable. Hope it helps.

Thanks for the feedback, Andreas. I'm still trying to narrow down what specifically might be causing this problem, so I hope to provide some F7 test kernels sometime soon so we can work through this on F7 rather than FC6. Look for those soon!

Any progress being made on this???

The new .3255 kernel in the "updates-testing" repo doesn't correct the issues with the Realtek 8169.

Created attachment 158726 [details]
startup script for realtek NIC so system doesn't freeze
I would like to add that I have been seeing this issue with F7 as well. My
temporary solution has been to disable the "activate on boot" option for this
card. I wrote a startup script to separately activate the Realtek 8169 NIC
after the normal network startup. I am including this script as an attachment.
Note that if I change the "# chkconfig: 2345 75 90" line to make the boot
priority closer to 10 (which is /etc/init.d/network's default boot priority),
the freeze will still occur. I haven't experimented enough to see just how
much I can change the boot priority, but 75 works for me.
The only problem with this approach is that named and dhcpd (this computer is a
router for my network) have to be restarted after I log in. At least the
system doesn't freeze, though.
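
The attached script itself is not reproduced in this thread. As a rough illustration of the approach described, here is a hypothetical sketch of a late-start init script. It assumes the stock /sbin/ifup and /sbin/ifdown helpers and reuses the chkconfig header quoted above; the IFACE, IFUP and IFDOWN variables and the late_nic function are illustrative names, not taken from the attachment.

```shell
#!/bin/sh
# chkconfig: 2345 75 90
# description: bring up the Realtek 8169 interface late in the boot sequence
#
# Hypothetical sketch, not the attached script. The variables below can be
# overridden from the environment, which also makes a dry run possible.

IFACE="${IFACE:-eth0}"
IFUP="${IFUP:-/sbin/ifup}"
IFDOWN="${IFDOWN:-/sbin/ifdown}"

late_nic() {
    case "$1" in
        start)   "$IFUP" "$IFACE" ;;
        stop)    "$IFDOWN" "$IFACE" ;;
        restart) "$IFDOWN" "$IFACE"; "$IFUP" "$IFACE" ;;
        *)       echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}

# Act only when called with an argument, as init does ("start" or "stop").
if [ $# -gt 0 ]; then
    late_nic "$1"
fi
```

With "activate on boot" disabled for the interface, a script like this would be registered with chkconfig so that it runs at start priority 75, well after /etc/init.d/network.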
Yesterday I upgraded from Fedora Core 6 to Fedora 7 and I'm experiencing the same problem. I'll append some information about my system; maybe someone will finally be able to see a pattern.

I still have the latest FC6 kernel (kernel-2.6.20-1.2962.fc6) installed and the NIC works perfectly with it. When I boot the F7 kernel (kernel-2.6.21-1.3228.fc7) the system hangs when the initscripts try to bring up the interface. It's configured to use a static IP. Boot continues when I unplug and replug the network cable.

Mainboard: Asus P4B266
NIC: Eusso UEC2300-32R

/var/log/messages:

Jul 9 10:44:45 linux kernel: r8169: eth0: link down
Jul 9 10:44:45 linux kernel: BUG: soft lockup detected on CPU#0!
Jul 9 10:44:45 linux kernel: [<c0451ea2>] softlockup_tick+0xa5/0xb4
Jul 9 10:44:45 linux kernel: [<c042e930>] update_process_times+0x3b/0x5e
Jul 9 10:44:45 linux kernel: [<c043d298>] tick_sched_timer+0x57/0x9a
Jul 9 10:44:45 linux kernel: [<c0439df5>] hrtimer_interrupt+0x12b/0x1b6
Jul 9 10:44:45 linux kernel: [<c043d241>] tick_sched_timer+0x0/0x9a
Jul 9 10:44:45 linux kernel: [<c0408534>] timer_interrupt+0x2c/0x32
Jul 9 10:44:45 linux kernel: [<c045210e>] handle_IRQ_event+0x1a/0x3f
Jul 9 10:44:45 linux kernel: [<c045354e>] handle_level_irq+0x81/0xc7
Jul 9 10:44:45 linux kernel: [<c04534cd>] handle_level_irq+0x0/0xc7
Jul 9 10:44:45 linux kernel: [<c04072bb>] do_IRQ+0xac/0xd1
Jul 9 10:44:45 linux kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jul 9 10:44:45 linux kernel: [<c042b2dc>] __do_softirq+0x54/0xba
Jul 9 10:44:45 linux kernel: [<c04071b7>] do_softirq+0x59/0xb1
Jul 9 10:44:45 linux kernel: [<c04534cd>] handle_level_irq+0x0/0xc7
Jul 9 10:44:45 linux kernel: [<c042b194>] irq_exit+0x38/0x6b
Jul 9 10:44:45 linux kernel: [<c04072cc>] do_IRQ+0xbd/0xd1
Jul 9 10:44:45 linux kernel: [<c04058ff>] common_interrupt+0x23/0x28
Jul 9 10:44:45 linux kernel: [<c04200d8>] find_busiest_group+0x264/0x4c5
Jul 9 10:44:45 linux kernel: [<c0601895>] _spin_unlock_irqrestore+0x8/0x9
Jul 9 10:44:45 linux kernel: [<c042e863>] __mod_timer+0xa1/0xab
Jul 9 10:44:45 linux kernel: [<f8a4e1ec>] rtl8169_open+0x12e/0x194 [r8169]
Jul 9 10:44:45 linux kernel: [<c05a3054>] dev_open+0x2b/0x62
Jul 9 10:44:45 linux kernel: [<c05a1aa1>] dev_change_flags+0x47/0xe4
Jul 9 10:44:45 linux kernel: [<c05de45c>] devinet_ioctl+0x250/0x56a
Jul 9 10:44:45 linux kernel: [<c04e72c0>] copy_to_user+0x3c/0x50
Jul 9 10:44:45 linux kernel: [<c0598b47>] sock_ioctl+0x19f/0x1be
Jul 9 10:44:45 linux kernel: [<c05989a8>] sock_ioctl+0x0/0x1be
Jul 9 10:44:45 linux kernel: [<c047f713>] do_ioctl+0x1f/0x62
Jul 9 10:44:45 linux kernel: [<c047f99a>] vfs_ioctl+0x244/0x256
Jul 9 10:44:45 linux kernel: [<c047f9f8>] sys_ioctl+0x4c/0x64
Jul 9 10:44:45 linux kernel: [<c0404f70>] syscall_call+0x7/0xb
Jul 9 10:44:45 linux kernel: =======================
Jul 9 10:44:45 linux kernel: r8169: eth0: link up

lspci -v:

02:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
    Subsystem: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet
    Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 5
    I/O ports at b800 [size=256]
    Memory at f2800000 (32-bit, non-prefetchable) [size=256]
    [virtual] Expansion ROM at f3e00000 [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2

lspci -vn:

02:0a.0 0200: 10ec:8169 (rev 10)
    Subsystem: 10ec:8169
    Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 5
    I/O ports at b800 [size=256]
    Memory at f2800000 (32-bit, non-prefetchable) [size=256]
    [virtual] Expansion ROM at f3e00000 [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2

ethtool eth0:

    Supported ports: [ TP ]
    Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: on
    Supports Wake-on: pumbg
    Wake-on: g
    Current message level: 0x00000033 (51)
    Link detected: yes

mii-tool eth0:

    eth0: negotiated 100baseTx-FD flow-control, link ok

Please tell me if you need any additional information.

When will this ever get fixed - much later rather than sooner :(

I just installed kernel 3255 from updates-testing. It fixed my sane and Canon scanner (another bug), but the damn eth hangup bug is STILL there :(

Has this bug been ignored?? The bug is still present in the 2.6.22 kernel that is currently in "updates-testing".

I certainly hope it hasn't been ignored. I just had to deal with bringing my servers back up after a power outage. Of course, I ran into issues from this card again.
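
Several reports in this thread paste the full `ethtool eth0` output to show link state, including one where "Link detected: yes" appeared with the cable unplugged. For comparing states before and after replugging the cable, a small filter that prints only that field may be handy. This is a sketch only; link_detected is a made-up helper name, and it relies on nothing more than the "Link detected:" line shown in the outputs above.

```shell
#!/bin/sh
# Sketch: print only the "Link detected" value from ethtool output on stdin.
# link_detected is a hypothetical helper, not an ethtool feature.

link_detected() {
    # ethtool prints a line such as "Link detected: yes"; split on ": ".
    awk -F': ' '/Link detected/ { print $2 }'
}

# Example:
#   ethtool eth0 | link_detected
```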
webmaster: > When will this ever get fixed - much later rather than sooner :( > > I just installed kernel 3255 from updates-testing It fixed my sane and cannon > scanner (another bug), but the damn bug is eth hangup is STILL there :( Is the bug still there if you add an 'clocksource=pit' option in the kernel boot command line ? -- Ueimor Yes, it still hangs with 'clocksource=pit' Are all of you that are noticing problems using SMP or UP systems? I'm starting to wonder if this could be an issue with uni-processor kernels/systems. I am using an uniprocessor system. Success! I found a 32-bit UP system and managed to reproduce this with an F7 LiveCD. The system hung when trying to start NetworkManager and when I connected the cable on the r8169 mentioned in comment #9 it continued to boot. I'll do an install and see what I can figure out! Reproduced this on an installed system. Looks like the trick might be a 32-bit UP system with an smp kernel (the default f7 kernel is smp). A UP kernel might die too, but we don't build one of those anymore, so I can't say if that makes a difference. I did notice that it doesn't *always* hang while booting, but it seems to happen most of the time. Seeing soft-lockups like that makes me wonder if some code was added recently that works well on true SMP systems but not on UP ones.... Uniprocessor system here also...... Yes I agree both my machines that do this are single threaded Pentium 4 2.6 and 2.8 processor machines. But with this FC6 kernel it works fine..... 2.6.20-1.2952.fc6 #1 SMP Agree there is something not quite right with F7 kernels as regards this. I also have another 2 Pentium 4 (one 2.6 and other 2.8) single threaded machines with same exact ethernet card and these have NEVER had the bug. Its got to be some timing issue that depending on your motherboard BIOS and PCI address space. 
I say this because I have bugs 242391 and 247913, which concern PCI graphics card detection problems on one machine with F7 but never with an FC6 kernel. Maybe they are somehow related?
No one mentioned it before, so I just want to add that this problem is not limited to Fedora kernels. I have the same problem with vanilla kernels.
Good to know, Thomas.
Thomas Müller (thomas):
[...]
> I have the same problem with vanilla kernels.
Thomas, can you bisect the problem or do all kernels fail?
--
Ueimor
(In reply to comment #69)
> Thomas, can you bisect the problem or do all kernels fail?
vanilla 2.6.20 works fine, vanilla 2.6.21 hangs. I'll try to bisect it.
Do we really think this is a kernel issue? Once you're booted up, everything runs fine. You can ifup/ifdown all you want, and 'service network restart' works fine as well. That would seem to suggest that both the drivers and the kernel are fine. It seems more likely that we have a problem with how something in the boot process is interacting with the kernel. Also, has anyone done a search to see if this is affecting any other distribution between these two kernels?
How can it not be a kernel issue? Force any fc6 kernel and boot the machine over and over all day long and it works perfectly. If I change only the kernel and nothing else, then it works. Explain how it's not the kernel. Also, I forgot to mention: if you have the server at a remote location, how do you start and stop the network when it's frozen during bootup, requiring the ethernet cable to be physically removed and reinserted? This has happened twice, so I stayed on an fc6 kernel, but this is no long-term solution. Fixing the kernel is the only solution.
(In reply to comment #71)
*shrug* 2.6.20 works, 2.6.21 doesn't. Exactly the same userspace tools. Even if Fedora is the only distribution unlucky enough to trigger this, I think it's something that has to be fixed within the kernel.
(In reply to comment #70)
> (In reply to comment #69)
> > Thomas, can you bisect the problem or do all kernels fail?
>
> vanilla 2.6.20 works fine, vanilla 2.6.21 hangs.
> I'll try to bisect it
I just did a 2.6.21 build from Linus' tree and it works fine for me....
(In reply to comment #75)
> I just did a 2.6.21 build from Linus' tree and it works fine for me....
That's very strange... I tested 2.6.21.6, 2.6.22.1 and 2.6.22-git8 and every one failed. Did you use the default configuration shipped with the vanilla kernel or did you use the Fedora one?
I started bisecting and the first version git suggested (somewhere between 2.6.20 and 2.6.21) also failed. I'm not at home for the weekend, but I will retest everything when I get home on Sunday and continue bisecting if no one has a new idea by then. :)
I used the Fedora one and then, when doing a 'make oldconfig', just held down <ENTER> (I know, I know, brilliant). The stable trees failing is interesting, since you can't ever get exactly those fixes out of Linus's trees. My last build had some flavor of 2.6.22-rc2 working, so I wonder if the change got introduced after that and fed into 2.6.21-stable sometime before 2.6.21.6 (but that seems unlikely). I'll check out my config a little more and see what might be different.
I have the same problem. My hardware is
00:0a.3 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
        Subsystem: Melco Inc Unknown device 0237
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 11
        I/O ports at 8800 [size=256]
        Memory at c5800000 (32-bit, non-prefetchable) [size=256]
        [virtual] Expansion ROM at 20000000 [disabled] [size=128K]
        Capabilities: <access denied>
It works fine with kernel-2.6.20-1.2962.fc6. It doesn't work with 2.6.21-1.3228.fc7, 2.6.22.1-27.fc7 or 2.6.23-0.41.rc0.git14.fc8. The rawhide kernel sometimes works but sometimes doesn't. In addition to the normal startup process, I also tested by starting in runlevel 1 and running "/etc/init.d/network start".
I tried the driver mentioned in comment #14 and it seems to work for now. The working environment is kernel-2.6.22.1-27.fc7 + r8169-6.002.00 driver. This driver needs some small modifications to compile on kernel 2.6.22.
Created attachment 159748 [details]
configuration used to bisect
Comment on attachment 159748 [details]
configuration used to bisect
I'm back home, retested everything and finished bisecting the kernel. I started with the configuration shipped with the Fedora kernel 2.6.21-1.3228.fc7 but deactivated some drivers to speed things up.
According to git-bisect the first bad commit is a304e1b82808904c561b7b149b467e338c53fcce
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a304e1b82808904c561b7b149b467e338c53fcce
Before this commit the kernel works fine and never hung during multiple reboots. With this commit it hangs almost every time I reboot.
The commit adds a new configuration option "Debug shared IRQ handlers" (CONFIG_DEBUG_SHIRQ) that is set to 'y' in the Fedora configuration. If I set this to 'n', the kernel also works fine.
Andy: Did you build the vanilla 2.6.21 with a 2.6.20 Fedora configuration? If yes, maybe CONFIG_DEBUG_SHIRQ was deactivated by default, so you didn't trigger this.
Output of git-bisect:
a304e1b82808904c561b7b149b467e338c53fcce is first bad commit
commit a304e1b82808904c561b7b149b467e338c53fcce
Author: David Woodhouse <dwmw2>
Date: Mon Feb 12 00:52:00 2007 -0800

    [PATCH] Debug shared irqs

    Drivers registering IRQ handlers with SA_SHIRQ really ought to be able to
    handle an interrupt happening before request_irq() returns. They also
    ought to be able to handle an interrupt happening during the start of
    their call to free_irq(). Let's test that hypothesis....

    [bunk: Kconfig fixes]
    Signed-off-by: David Woodhouse <dwmw2>
    Cc: Arjan van de Ven <arjan>
    Signed-off-by: Jesper Juhl <jesper.juhl>
    Signed-off-by: Ingo Molnar <mingo>
    Signed-off-by: Adrian Bunk <bunk>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

:040000 040000 21b0f48e3c9aa1d1880b8cfcc2509fa37ab16ff0 57bfbda5aa3dc95f31216ce99e2222c23e4adced M kernel
:040000 040000 35b70986b0b39798d573fec0b73b9035d0e85739 1839fb518a55d408690014fcd43c6629bc2b9a50 M lib
Nice find, Thomas!
That patch would certainly explain the softirq lockup messages that appear in comment #7 where there seem to be constant IRQs. I'll take a look at my config when I get to work today and let you know, but I bet I didn't add CONFIG_DEBUG_SHIRQ=y. I'll also take a look at this patch and see what we can do to resolve the issue.
CONFIG_DEBUG_SHIRQ was not set in my kernel builds....
(In reply to comment #80)
I also confirm that kernel-2.6.22.1-27.fc7 + CONFIG_DEBUG_SHIRQ=n works fine. I checked 10 times, both cold boot (power on) and warm boot (shutdown -r).
I also checked kernel-2.6.22.1-27.fc7 + revert of
commit 99f252b097a3bd6280047ba2175b605671da4a23
Author: Francois Romieu <romieu.com>
Date: Mon Apr 2 22:59:59 2007 +0200

    r8169: issue request_irq after the private data are completely initialized

but that doesn't help. Sometimes it boots fine but sometimes it doesn't. I would happily test any patches. My hardware is a UP system.
By the way, the patch mentioned in comment #80 seems strange. CONFIG_DEBUG_SHIRQ=y has no effect in the function "free_irq". Is this intentional behavior? I think all the "return;" statements in the "for(;;)" loop should be "break;".
Should I make a patch and send it upstream?
(In reply to comment #83)
> By the way, the patch mentioned in comment #80 seems strange.
> CONFIG_DEBUG_SHIRQ=y has no effect in the function "free_irq".
> Is this intentional behavior?
> I think all the "return;" statements in the "for(;;)" loop should be "break;".
>
> Should I make a patch and send it upstream?
actually it does have an effect on free_irq:
@@ -403,6 +406,17 @@ void free_irq(unsigned int irq, void *dev_id)
                 spin_unlock_irqrestore(&desc->lock, flags);
                 return;
         }
+#ifdef CONFIG_DEBUG_SHIRQ
+        if (handler) {
+                /*
+                 * It's a shared IRQ -- the driver ought to be prepared for it
+                 * to happen even now it's being freed, so let's make sure....
+                 * We do this after actually deregistering it, to make sure that
+                 * a 'real' IRQ doesn't run in parallel with our fake
+                 */
+                handler(irq, dev_id);
+        }
+#endif
 }

 EXPORT_SYMBOL(free_irq);
(In reply to comment #84)
> actually it does have an effect on free_irq:
What?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=kernel/irq/manage.c;h=203a518b6f1437d134c115ce7d43e1e04e1d5c25;hb=HEAD
void free_irq(unsigned int irq, void *dev_id)
{
        struct irq_desc *desc;
        struct irqaction **p;
        unsigned long flags;
        irqreturn_t (*handler)(int, void *) = NULL;

        WARN_ON(in_interrupt());
        if (irq >= NR_IRQS)
                return;

        desc = irq_desc + irq;
        spin_lock_irqsave(&desc->lock, flags);
        p = &desc->action;
        for (;;) {
...
                spin_unlock_irqrestore(&desc->lock, flags);
                return;
        }
#ifdef CONFIG_DEBUG_SHIRQ
        if (handler) {
                /*
                 * It's a shared IRQ -- the driver ought to be prepared for it
                 * to happen even now it's being freed, so let's make sure....
                 * We do this after actually deregistering it, to make sure that
                 * a 'real' IRQ doesn't run in parallel with our fake
                 */
                handler(irq, dev_id);
        }
#endif
}
How can I break this "for(;;)" loop?
Yes, you are correct. Sorry I misunderstood your first statement. A patch like this should probably be submitted:
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 203a518..03edf45 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -448,11 +448,11 @@ void free_irq(unsigned int irq, void *dev_id)
                         if (action->flags & IRQF_SHARED)
                                 handler = action->handler;
                         kfree(action);
-                        return;
+                        break;
                 }
                 printk(KERN_ERR "Trying to free already-free IRQ %d\n", irq);
                 spin_unlock_irqrestore(&desc->lock, flags);
-                return;
+                break;
         }
 #ifdef CONFIG_DEBUG_SHIRQ
         if (handler) {
(In reply to comment #82)
> CONFIG_DEBUG_SHIRQ was not set in my kernel builds....
Ok, now everything seems to make sense.
It looks like the r8169 driver does have a problem with interrupts happening before request_irq() returns and that's exactly what the patch tries to test. (In reply to comment #86) > Yes, you are correct. Sorry I misunderstood your first statement. Thanks. I'm sure my English is bad. > A patch like this should be probably be submitted: Yes. But this patch introduces totally untested code path and it will break many drivers like this bug. So I think it's dangerous to send to upstream. (Or it may be a good time to send it because the merge window for 2.6.23 is just closed.) (In reply to comment #88) > Yes. But this patch introduces totally untested code path and > it will break many drivers like this bug. > So I think it's dangerous to send to upstream. Nevertheless, I think you should send this to LKML or directly to David Woodhouse <dwmw2> as he seems to be the original author of this patch. I'm pretty sure he never intended to just unconditionally skip his own code. :) Even if they don't want to change it now, they should know about this. After looking at this, it seemed there was a chance that rtl8169_interrupt might not like getting run if there weren't any interrupts to service. Causing a link change is just the thing that makes an interrupt pop. Whenever CONFIG_DEBUG_SHIRQ=y the 'handler' (in this case rtl8169_interrupt) is always called once before an actual interrupt is available. It seems our system hangs until one is ready, so I decided to add the following hack to see how successfully the system could be booted if this call was basically ignored. 
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index bb6896a..a864194 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -2723,6 +2723,9 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
         int status;
         int handled = 0;

+        if (!in_atomic())
+                return IRQ_RETVAL(1);
+
         do {
                 status = RTL_R16(IntrStatus);
It seemed to perform well, so now it seems one of 2 things needs to happen:
1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be modified more cleanly to not block if there are no interrupts to handle.
2) the 'test' call to the handler in request_irq needs to be removed.
(In reply to comment #89)
I will. But it's midnight and I can't access my mail server right now. I can do it in 7 hours, so I'd be happy if someone would kindly tell David Woodhouse about this. If no one does, I'll do it.
Masaki, I'll talk to him about it.
(In reply to comment #90)
> It seemed to perform well, so now it seems one of 2 things needs to happen:
>
> 1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be
> modified more cleanly to not block if there are no interrupts to handle.
>
> 2) the 'test' call to the handler in request_irq needs to be removed.
>
I think we might want to notify LKML about this. Problems with this patch seem to be rare, but we triggered one and maybe it's only a matter of time until someone else has the same problem with a totally different driver.
Until this is finally solved, Fedora could deactivate CONFIG_DEBUG_SHIRQ in its default configuration. It shouldn't break anything and I would finally be able to reboot without having to unplug and replug cables ;)
I've talked with the f7 kernel maintainer and I think we can unset CONFIG_DEBUG_SHIRQ for the next f7 build.
agospoda:
[...]
> It seemed to perform well, so now it seems one of 2 things needs to happen:
> 1) rtl8169_interrupt (and other driver interrupt handlers) need(s) to be
> modified more cleanly to not block if there are no interrupts to handle.
1. So far I fail to see where rtl8169_interrupt could block. I'll try harder.
2. Without CONFIG_DEBUG_SHIRQ, IRQ handlers are run with IRQs enabled and are
non-reentrant. With CONFIG_DEBUG_SHIRQ, IRQs are disabled and the usual
protection against reentrant handlers is not there. Imho it is anything
but safe.
--
Ueimor
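The return-versus-break point from comments #83 to #86 can be reproduced outside the kernel. What follows is only a minimal userspace sketch, not kernel code: struct fake_action, unlink_action() and the use_break flag are hypothetical names invented for illustration. It assumes the same shape as the quoted free_irq(): a for(;;) loop whose exits are all "return;", followed by a debug step placed after the loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for a registered handler entry; not a kernel type. */
struct fake_action {
    int dev_id;                 /* identifies the owner of this entry */
    int shared;                 /* models a shared-IRQ registration */
    struct fake_action *next;
};

/*
 * Unlink the entry matching dev_id from the list at *head.
 * Returns 1 if the post-loop debug step ran, 0 otherwise.
 * use_break selects how the loop is left: 0 = return, 1 = break.
 */
int unlink_action(struct fake_action **head, int dev_id, int use_break)
{
    struct fake_action **p = head;
    int shared = 0;

    for (;;) {
        struct fake_action *a = *p;

        if (a != NULL) {
            if (a->dev_id != dev_id) {
                p = &a->next;   /* keep searching */
                continue;
            }
            *p = a->next;       /* found: unlink it */
            shared = a->shared;
            if (use_break)
                break;
            return 0;           /* leaves the function, skipping the code below */
        }
        /* End of list: nothing matched. */
        if (use_break)
            break;
        return 0;
    }

    /* Debug step placed after the loop: only reachable through break. */
    if (shared)
        return 1;
    return 0;
}
```

With use_break set to 0 the entry is still unlinked, but the step after the loop never runs, which is the behavior described for the CONFIG_DEBUG_SHIRQ block in free_irq(). With use_break set to 1 the same call reaches it.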
(In reply to comment #62)
> Reproduced this on an installed system.
>
> Looks like the trick might be a 32-bit UP system with an smp kernel (the default
> f7 kernel is smp). A UP kernel might die too, but we don't build one of those
> anymore, so I can't say if that makes a difference.
>
> I did notice that it doesn't *always* hang while booting, but it seems to happen
> most of the time. Seeing soft-lockups like that makes me wonder if some code
> was added recently that works well on true SMP systems but not on UP ones....
Hi, I've been pondering this freeze since F7 first came out and I upgraded to it. I had disabled hyperthreading on my laptop for some reason or another, and after reading about UP systems here, I decided to give it a shot and try re-enabling hyperthreading. No more lockups! This might be a little-too-late response... but it's feedback nonetheless :-)
(In reply to comment #92)
> Masaki, I'll talk to him about it.
Thanks Andy.
(In reply to comment #90)
> It seems our system hangs
> until one is ready, so I decided to add the following hack to see how
> successfully the system could be booted if this call was basically ignored.
I tested this patch (2.6.22.1-27.fc7 + the patch + CONFIG_DEBUG_SHIRQ=y) and it works fine.
This should be fixed with 2.6.22.1-33.
I notice that Ingo has been tracking an IRQ problem with an ne2k NIC on LKML. See the following thread.
Re: 2.6.20->2.6.21 - networking dies after random time
http://www.ussg.iu.edu/hypermail/linux/kernel/0707.3/0057.html
I wonder whether this may give some hints for our bug.
(In reply to comment #95)
> 2. Without CONFIG_DEBUG_SHIRQ, IRQ handlers are run with IRQs enabled and are
> non-reentrant. With CONFIG_DEBUG_SHIRQ, IRQs are disabled and the usual
> protection against reentrant handlers is not there. Imho it is anything
> but safe.
Even with CONFIG_DEBUG_SHIRQ, I think that IRQs are not disabled. In "request_irq()",
#ifdef CONFIG_LOCKDEP
        /*
         * Lockdep wants atomic interrupt handlers:
         */
        irqflags |= IRQF_DISABLED;
#endif
but CONFIG_LOCKDEP isn't defined anywhere. We have only CONFIG_LOCKDEP_SUPPORT. I think all uses of CONFIG_LOCKDEP in the kernel tree are wrong. Or am I missing something?
Oh, I'm wrong. Please ignore. Sorry.
How about this workaround? This patch clears IntrStatus and discards any pending interrupt. It comes from an LKML post.
http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.3/0635.html
+++ src/r8169.c 2007-07-25 22:08:48.000000000 +0900
@@ -1737,6 +1737,7 @@
 {
         struct rtl8169_private *tp = netdev_priv(dev);
         struct pci_dev *pdev = tp->pci_dev;
+        void __iomem *ioaddr = tp->mmio_addr;
         int retval = -ENOMEM;

@@ -1764,6 +1765,8 @@

         smp_mb();

+        RTL_W16(IntrStatus, 0xffff);
+
         retval = request_irq(dev->irq, rtl8169_interrupt, IRQF_SHARED,
                              dev->name, dev);
         if (retval < 0)
I think rtl8169_interrupt() should not be called before rtl8169_open() is finished.
Masaki Chikama:
> How about this workaround ?
[...]
Useless. See rtl8169_irq_mask_and_ack() in rtl8169_init_one.
#53 contains a nice trace. The answer is there. It is not exactly fun to
read because of the gazillion debug options, which change the code like
hell. :o/
--
Ueimor
This should be fixed with 2.6.22.1-33.
Andy, I just checked this build in updates-testing and can confirm that it works: 10 reboots, both with the cable plugged in and with it unplugged. Thanks to everyone that helped on this....
Good news, Mike. I still plan to keep this open because I'd like to enable CONFIG_DEBUG_SHIRQ when the r8169 problem is fixed.
2.6.22.1-33.fc7 introduces this (or a similar) problem for me. I noticed no problems with any earlier F7 kernels, and I've run them all, including some testing updates. The lockup is not hard: I can reboot with Ctrl+Alt+Del from the state where it gets stuck (determining IP info for eth0). Booting with the cable unplugged goes through, but ifup eth0 doesn't work after booting either; it hangs while getting an IP address from DHCP. DHCPOFFER is found in /var/log/messages, but after that nothing happens and nothing is logged. This is the same card as Nils reported in comment 16, i686, NM not involved, very minimal setup. It happens on every boot with 2.6.22.1-33.fc7; reverting to 2.6.22.1-27.fc7 fixes it.
Hm, 33.fc7 introduces the same problem also for a different, older, 8139too driven card, so maybe what I'm seeing is something else.
Ville Skyttä :
> Hm, 33.fc7 introduces the same problem also for a different, older, 8139too
> driven card, so maybe what I'm seeing is something else.
Can you trigger the bug after boot-up with the r8169 and, while the network
is locked, post the result of:
$ echo d > /proc/sysrq-trigger
$ echo t > /proc/sysrq-trigger
$ echo q > /proc/sysrq-trigger
$ echo w > /proc/sysrq-trigger
(of course it will be nicer if there are not too many processes)
--
Ueimor
Created attachment 160162 [details]
Requested info
Created attachment 160173 [details]
debug helper
Could someone who experiences the bug on a UP system (no HT please) try
the attached patch with a UP-built kernel and send the resulting dmesg?
Thanks in advance.
--
Ueimor
Created attachment 160185 [details]
dmesg output of patched r8169 driver
I applied the patch to a kernel freshly pulled from the official git repository.
SMP is deactivated, CONFIG_DEBUG_SHIRQ is set to 'y'. System hangs on boot.
Thomas, the hang happens between the up/down change in: [...] r8169: eth0: status = 00000020 r8169: eth0: link up r8169: eth0: link up r8169: eth0: link down Right ? -- Ueimor (In reply to comment #112) > Thomas, the hang happens between the up/down change in: I think so, yes. It's just after iptables got initialized and before all other network interfaces are brought up. Disaster guys! After I've upgraded kernel recently from 2.6.21-1.3228.fc7 to 2.6.22.1-33.fc7 I cannot use my 01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8101E PCI Express Fast Ethernet controller (rev 01). Fortunately, I can revert to old kernel. The effect is that it hangs totally trying to get IP address from DHCP (DHCP server log shows that it gathered address successfully!), probably the problem is as you've mention above: it hangs during initialization. My host is: Asus notebook F9F Core 2 Duo 1GB RAM. Hi, I have the exact same problem as comment 106 describes: dhcp doesn't work for the r8169 driver in 2.6.22.1-33.fc7, but it works for the older kernel 2.6.22.1-27.fc7. lspci output: 05:05.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10) Subsystem: Fujitsu Siemens Computer GmbH Unknown device 10B0 Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 20 I/O ports at 3000 [size=256] Memory at b0304800 (32-bit, non-prefetchable) [size=256] [virtual] Expansion ROM at 88000000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 (a fixed IP works just fine) Franky (In reply to comment #110) > Created an attachment (id=160173) [edit] > debug helper > > Could someone who experiences the bug on a UP system (no HT please) try > the attached patch with an UP built kernel and send the resulting dmesg ? I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch. It never boots with runlevel 5. So I firstly boot in single user mode, then "rmmod r8169" and "/etc/init.d/network start". 
I get a slightly different result from comment #111.
r8169: eth0: status = 00000020
r8169: eth0: link down
r8169: eth0: link up
r8169: eth0: link up
(In reply to comment #116)
> I get a slightly different result from comment #111.
>
> r8169: eth0: status = 00000020
> r8169: eth0: link down
> r8169: eth0: link up
> r8169: eth0: link up
>
Did you perhaps boot with the cable unplugged and then replug it when the system hung?
2.6.22.1-41.fc7 from updates-testing fixes the issue for me.
Created attachment 160277 [details]
More debug helper
Can someone apply the attached patch on top of the previous one and send
the updated dmesg ?
Testing with CONFIG_PRINTK_TIME enabled would be welcome.
--
Ueimor
Created attachment 160280 [details]
dmesg output of patched r8169 driver (2)
I applied the patch to the same kernel I used last time. Config is the same,
except that CONFIG_PRINTK_TIME is activated.
The system never hung during several reboots. Is this a possible effect of your
patch or should I try harder to get the system to lock up?
The important lines from dmesg are probably the following:
[ 37.110992] r8169: eth0: status = 00000020
[ 37.111000] r8169: eth0: link up
[ 37.111003] r8169: eth0: in your hands ... (00000020)
[ 37.111024] r8169: eth0: link up
(In reply to comment #117)
> Did you perhaps boot with the cable unplugged and then replug it when the
> system hung?
No.
1. Boot into single user mode with the cable plugged in.
2. rmmod r8169 (because the module is already loaded by udev(?)).
3. Run "/etc/init.d/network start" (the module is auto-loaded and starts to talk to the DHCP server).
The chance of a hang is higher when I don't run "rmmod r8169".
I found a patch that seems to fix the problem with my r8169.
http://lkml.org/lkml/2007/6/28/326
They say that the patch will be included in kernel 2.6.23.
Thomas Müller:
[...]
> System never hang during several reboots. Is this a possible effect from your
> patch or should I try harder to get the system to lock up?
It is a possible/expected effect of the patch. It is not clear why the
scheduling of the NAPI call (soft-)locks the system but I should have
made it more dependent on the status register :o/
I'll polish the patch.
--
Ueimor
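The idea of making the poll scheduling "more dependent on the status register" can be pictured with a small userspace model. This is a sketch under stated assumptions, not the attached patch and not the r8169 code: struct fake_nic, fake_interrupt() and the EV_* bit names are hypothetical. The value 0x0020 is picked only because it matches the status printed in the dmesg excerpts above; treating it as a link-change bit, and the other bit values, are assumptions.

```c
#include <stdint.h>

/* Hypothetical event bits; the real driver defines its own masks. */
enum {
    EV_RX_OK    = 0x0001,
    EV_TX_OK    = 0x0004,
    EV_LINK_CHG = 0x0020    /* assumed meaning of the 00000020 status above */
};

/* Events that actually give the poll routine something to do (assumption). */
#define POLL_EVENTS (EV_RX_OK | EV_TX_OK)

/* Minimal model of a device; not a kernel structure. */
struct fake_nic {
    uint16_t intr_status;   /* pending event bits */
    int polls_scheduled;    /* how often the poll routine was queued */
    int link_changes;       /* how often a link change was handled */
};

/*
 * Model of a shared interrupt handler.
 * Returns 0 when nothing was pending (e.g. a simulated test call),
 * 1 when at least one event was handled.
 */
int fake_interrupt(struct fake_nic *nic)
{
    uint16_t status = nic->intr_status;

    if (status == 0)
        return 0;                   /* not ours: return without side effects */

    nic->intr_status = 0;           /* acknowledge everything we saw */

    if (status & EV_LINK_CHG)
        nic->link_changes++;

    /* Queue polling only if the status says there is work for it. */
    if (status & POLL_EVENTS)
        nic->polls_scheduled++;

    return 1;
}
```

In this model a call with an empty status, or with only the link-change bit set, never queues a poll, so an extra handler invocation of the kind CONFIG_DEBUG_SHIRQ makes has nothing to leave behind.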
Jouni Valiaho:
[...]
> I found a patch, that seems to fix problem in my r8169.
> http://lkml.org/lkml/2007/6/28/326
> They say that the patch will included in kernel 2.6.23.
2.6.23-rc1 ought to behave as described in the message from lkml.
It addresses different issues though.
--
Ueimor
Created attachment 160474 [details]
avoid useless NAPI poll scheduling
The attached patch should be enough to fix the broken kernels.
I'll welcome people to test it.
The patch should be usable against your favorite FC kernel or
against 2.6.22 or later kernels.
--
Ueimor
Created attachment 160507 [details]
modified patch for 2.6.22
I can confirm that 2.6.22 and the current git kernel work fine with your patch
(or some version of it) applied.
However, the patch does not cleanly apply to 2.6.22 because a variable was
renamed/moved. The attached patch should do exactly the same as yours, but can
be applied to 2.6.22.
Regarding my previous complaint: I've updated to kernel-2.6.22.1-41.fc7 and my RTL8101E PCI Express Fast Ethernet controller (rev 01) works again.
(In reply to comment #125)
> Created an attachment (id=160474) [edit]
> avoid useless NAPI poll scheduling
I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch. It works fine so far. Is this the final patch to be merged upstream? Thank you for resolving this problem.
Masaki Chikama:
> I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.
(just to be sure)
Did you use the same .config as was used with the non-working kernel ?
> Is this the final patch to be merged upstream?
If it really works, yes. :o)
I would not mind someone helping me to pinpoint the reason of the problem with further testing though.
--
Ueimor
(In reply to comment #129)
> Masaki Chikama:
> > I tested with 2.6.22.1-27.fc7 + latest r8169.c from linux-2.6.git + the patch.
>
> (just to be sure)
> Did you use the same .config as was used with the non-working kernel ?
Yes. I only added CONFIG_PRINTK_TIME=y for the previous test. CONFIG_DEBUG_SHIRQ=y is still there.
(In reply to comment #129)
> (just to be sure)
> Did you use the same .config as was used with the non-working kernel ?
>
Yes. I compiled the unpatched kernel and tried to boot it --> the system hung. Then I patched r8169.c and recompiled without touching the config or anything else --> the system worked fine.
> I would not mind someone helping me to pinpoint the reason of the problem
> with further testing though.
Just tell me what you need. I have no problem rebooting my system; it's not an important server or anything like that. :)
I recently bought an 8169 and suffered from the startup hangs. However, for the last few kernel updates I have not experienced the freeze. I think the last 2-3 startups with kernel-2.6.22.1-41.fc7 have all gone without a problem.
Related or not; about an hour ago the server froze (pointer, network, everything) after being up 1 day. I unplugged the network cable and reinserted it: system unfroze and is still running. No nothing in the log. Just "r8169: eth1: link down" and "r8169: eth1: link up" Sorry for the noise. (In reply to comment #129) > > If it really works, yes. :o) > > I would not mind someone helping me to pinpoint the reason of the problem > with further testing though. Are you going to submit something for -stable after a patch is merged? Chuck Ebbert :
[...]
> Are you going to submit something for -stable after a patch is merged ?
I have not thought about it so far. Is there a strong demand for it ?
--
Ueimor
For those of us who have the problem, yes :D I've been waiting for months to set up my file server because of this bug :P
This bug is still persistent in the F8 Test 1 kernel also.....
Created attachment 161084 [details]
avoid useless NAPI poll scheduling (against 2.6.22.2)
I have diffed/updated the patch against 2.6.22.2. Can the interested
parties check that it is ok before I submit it for inclusion in -stable ?
Thanks in advance.
--
Ueimor
(In reply to comment #136)
> This bug is still persistent in the F8 Test 1 kernel also.....
It's fixed upstream, so will be fixed in the next F8 kernel.
(In reply to comment #137)
> I have diffed/updated the patch against 2.6.22.2. Can the interested
> parties check that it is ok before I submit it for inclusion in -stable ?
It's similar to the modified patch I already tested and posted earlier in comment #126, however one line changed:
Instead of
        RTL_W16(IntrMask, rtl8169_intr_mask & ~rtl8169_napi_event);
you now call
        RTL_W16(IntrMask, rtl8169_napi_event & ~rtl8169_napi_event);
which (if I'm not mistaken) is equal to
        RTL_W16(IntrMask, 0x00000000);
Was that intentional?
Thomas Müller <thomas> :
[...]
> Was that intentional?
No.
I'll push your patch to 2.6.22-stable.
--
Ueimor
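The arithmetic behind the question quoted above can be checked in isolation. The helpers below are a hedged sketch: mask_intended() and mask_as_posted() are hypothetical names, and they only compute the value that would be passed as the second argument of the two RTL_W16(IntrMask, ...) calls being compared. No real r8169 mask values are assumed.

```c
#include <stdint.h>

/* Value written by the earlier patch: keep every enabled bit except the NAPI ones. */
uint16_t mask_intended(uint16_t intr_mask, uint16_t napi_event)
{
    return (uint16_t)(intr_mask & ~napi_event);
}

/* Value written by the changed line: x & ~x clears every bit, whatever x is. */
uint16_t mask_as_posted(uint16_t napi_event)
{
    return (uint16_t)(napi_event & ~napi_event);
}
```

For any input, mask_as_posted() yields 0, which confirms the "equal to RTL_W16(IntrMask, 0x00000000)" reading. mask_intended() leaves the non-NAPI bits of the first argument untouched, so with the changed line those other interrupt sources would be masked off as well.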
Wasn't this also fixed in 2.6.22.1-41 by disabling shared IRQ debugging in the kernel config?
Chuck Ebbert :
> Wasn't this also fixed in 2.6.22.1-41 by disabling shared IRQ debugging in the
> kernel config ?
Yes. I understood that Andy would prefer a fix which does not exclude this config option though (see comment #105).
--
Ueimor
Francois, you are correct. I'd like us to enable that config option again as soon as we feel the r8169 driver is safe.
*** Bug 245367 has been marked as a duplicate of this bug. ***
Is this still a problem with the latest f7 kernels? Francois' patch should be upstream and included in these kernels:
commit 313b0305b5a1e7e0fb39383befbf79558ce68a9c
Author: Francois Romieu <romieu.com>
Date: Thu Aug 2 00:00:48 2007 +0200

    r8169: avoid needless NAPI poll scheduling

    Theory : though needless, it should not have hurt.
    Practice: it does not play nice with DEBUG_SHIRQ + LOCKDEP + UP
    (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=242572).

    The patch makes sense in itself but I should dig why it has an effect
    on #242572 (assuming that NAPI do not change in a near future).

    Signed-off-by: Francois Romieu <romieu.com>
    Cc: Edward Hsu <edward_hsu.tw>

Hello, I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the Fedora kernel.
http://fedoraproject.org/wiki/KernelBugTriage
I am CC'ing myself to this bug and will try to assist you in resolving it if I can. There hasn't been much activity on this bug for a while. Could you tell me if you are still having problems with the latest kernel? If the problem no longer exists then please close this bug, or I'll do so in a few days if no additional information is lodged.
Closing as per previous comment indicating this should now be resolved.