Description of problem: Kernel dumps stacktrace after tulip driver initializes. Version-Release number of selected component (if applicable): 2.6.24-0.167.rc8.git4.fc9 2.6.24-2.fc9.i686 How reproducible: Only happened twice, not occurring yet in 2.6.24-7. Steps to Reproduce: 1. kernel oops while bringing eth0 up during boot 2. two backtraces for 2.6.24-2, one on reboot, second was after issuing: ifdown eth0; ifup eth0 0000:02:0a.0: tulip_stop_rxtx() failed (CSR5 0xfc06c012 CSR6 0xff970111) 0000:02:0a.0: tulip_stop_rxtx() failed (CSR5 0xfc664010 CSR6 0xff972113) eth0: Setting full-duplex based on MII#1 link partner capability of cde1. eth1: setting full-duplex. eth0: no IPv6 routers present eth1: no IPv6 routers present ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.24-0.167.rc8.git4.fc9 #1 ------------------------------------------------------- events/0/6 is trying to acquire lock: (rtnl_mutex){--..}, at: [<c05deea7>] rtnl_lock+0xf/0x11 but task is already holding lock: ((linkwatch_work).work){--..}, at: [<c043d511>] run_workqueue+0x85/0x1ac which lock already depends on the new lock.
Created attachment 293570 [details] dmesg-eth0-ifup.txt for 2.6.24-2
Created attachment 293571 [details] dmesg_eth0_threadlock.txt for 2.6.24-0.167
Created attachment 293572 [details] lspci-verbose.txt
Created attachment 293574 [details] kernel 2.6.24-7, shows first call trace after ifdown, second after ifup Actually only one trace from 2.6.24-2, it was when issuing ifdown at console manually. Just decided to try and provoke this in 2.6.24-7 and it did occur. I was logged into X, and issued ifdown eth0, then checked kernel messages. Seeing the problem occurred I logged out of X and then did ifup eth0. The system responded to ctrl-alt-del but nothing else after the ifup.
Created attachment 295760 [details] kernel-ifdown-ifup.txt Here is another example of this with kernel-2.6.25-0.54.rc2.fc9.i686. I came back from the weekend to find my desktop not moving any traffic (it was left seeding torrents) and after resetting the external router/modem I did ifdown/ifup for eth0 which is the external routed interface. The attached bt is what happened, and it has kernel-debug installed so it looks like a bit more info.
[just replied with this to Dave's mail to the netdev list, figured the bugtracker could use a copy too. Comment #4 actually sounds like the deadlock was hit because I frequently see keyboard problems when causing an RTNL deadlock.] On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote: > https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info, > but the trace is below... > I'll get an rc3 kernel built and ask the user to retest, but in case this > isn't a known problem, I'm forwarding this here. I can't fix it but I can explain it. > Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock: > Feb 24 17:53:21 cirithungol kernel: (events){--..}, at: [<c0436f9a>] flush_workqueue+0x0/0x85 > Feb 24 17:53:21 cirithungol kernel: > Feb 24 17:53:21 cirithungol kernel: but task is already holding lock: > Feb 24 17:53:21 cirithungol kernel: (rtnl_mutex){--..}, at: [<c05cea31>] rtnetlink_rcv+0x12/0x26 > Feb 24 17:53:21 cirithungol kernel: > Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock. What's happening here is that the linkwatch_work runs on the generic schedule_work() workqueue. > Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}: The function that is called is linkwatch_event(), which acquires the RTNL as you can see here: > Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}: > Feb 24 17:53:21 cirithungol kernel: [<c04458f7>] __lock_acquire+0xa7c/0xbf4 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel: [<c04415dc>] tick_program_event+0x31/0x55 > Feb 24 17:53:21 cirithungol kernel: [<c0445ad9>] lock_acquire+0x6a/0x90 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel: [<c0638d21>] mutex_lock_nested+0xdb/0x271 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times > Feb 24 17:53:21 cirithungol kernel: [<c05cf755>] linkwatch_event+0x8/0x22 The problem with that is that tulip_down() calls flush_scheduled_work() while holding the RTNL: > Feb 24 17:53:21 cirithungol kernel: [<c0436f9a>] flush_workqueue+0x0/0x85 > Feb 24 17:53:21 cirithungol kernel: [<c043702c>] flush_scheduled_work+0xd/0xf > Feb 24 17:53:21 cirithungol kernel: [<f8f4380a>] tulip_down+0x20/0x1a3 [tulip] [...] > Feb 24 17:53:21 cirithungol kernel: [<c05cea3d>] rtnetlink_rcv+0x1e/0x26 (rtnetlink_rcv will acquire the RTNL) The deadlock that can now happen is that linkwatch_work is scheduled on the workqueue but not running yet. During tulip_down(), flush_scheduled_work() is called which will wait for everything that is scheduled to complete. Among those things could be linkwatch_event() which will start running and try to acquire the RTNL. Because that is already locked it will wait for the RTNL, but on the other hand we're waiting for linkwatch_event() to finish while holding the RTNL. The fix here would most likely be to not use flush_scheduled_work() but rather cancel_work_sync(). This should be a correct change afaict, unless tulip has more work structs than the media work. @@ tulip_down - flush_scheduled_work(); + cancel_work_sync(&tp->media_work); johannes
Created attachment 298138 [details] dmesg-ifup-2.6.25-113-errors.txt Here is an error that occurred with kernel-2.6.25-0.113.rc5.git2. It has some similarities and some obvious differences. It was produced in the same manner, by issuing the command 'ifdown eth0; ifup eth0' on the system while there were applications with open connections. There appear to be two problems in this attachment. The machine had been running this kernel about 24 hours with actively seeding torrents and the folding at home client running the entire time. No kernel or net issues occurred until I decided to provoke this. Also, I'm wondering if bug# 433541 is related to this bug. The second trace in this attachment is different from any that I've posted in either of these bugs, but it occurred immediately with the first (this is directly copied out of dmesg).
Posted to netdev 2008-02-24 by davej as "lockdep trace from rc2" but after some discussion the thread died.
Created attachment 299658 [details] st-tulip-ifup.txt 2.6.25-0.121.rc5 Still occurs in rc5, I'm trying to get an rc7 kernel to run on this machine, the latest in rawhide does not work (0.171) but I'm trying the others. The attachment stacktrace here occurred after I removed NetworkManager and it crashed (bug# 431038). This still does fail during ifup/ifdown though.
Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
*** Bug 449969 has been marked as a duplicate of this bug. ***
This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.