Bug 431038 (tulip-lockdep)
Summary: | oops in net tulip driver on ifup, threadlock, tulip_stop_rxtx() failed | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Andrew Farris <lordmorgul> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 9 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-07-14 15:41:32 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 202141 | ||
Attachments: |
Description
Andrew Farris
2008-01-31 11:43:52 UTC
Created attachment 293570 [details]
dmesg-eth0-ifup.txt for 2.6.24-2
Created attachment 293571 [details]
dmesg_eth0_threadlock.txt for 2.6.24-0.167
Created attachment 293572 [details]
lspci-verbose.txt
Created attachment 293574 [details]
kernel 2.6.24-7, shows first call trace after ifdown, second after ifup
Actually only one trace from 2.6.24-2, it was when issuing ifdown at console
manually.
Just decided to try and provoke this in 2.6.24-7 and it did occur. I was
logged into X, and issued ifdown eth0, then checked kernel messages. Seeing
the problem occurred I logged out of X and then did ifup eth0. The system
responded to ctrl-alt-del but nothing else after the ifup.
Created attachment 295760 [details]
kernel-ifdown-ifup.txt
Here is another example of this with kernel-2.6.25-0.54.rc2.fc9.i686. I came
back from the weekend to find my desktop not moving any traffic (it was left
seeding torrents) and after resetting the external router/modem I did
ifdown/ifup for eth0 which is the external routed interface.
The attached bt is what happened, and it has kernel-debug installed so it looks
like a bit more info.
[just replied with this to Dave's mail to the netdev list, figured the bugtracker could use a copy too. Comment #4 actually sounds like the deadlock was hit because I frequently see keyboard problems when causing an RTNL deadlock.] On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote: > https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info, > but the trace is below... > I'll get an rc3 kernel built and ask the user to retest, but in case this > isn't a known problem, I'm forwarding this here. I can't fix it but I can explain it. > Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock: > Feb 24 17:53:21 cirithungol kernel: (events){--..}, at: [<c0436f9a>] flush_workqueue+0x0/0x85 > Feb 24 17:53:21 cirithungol kernel: > Feb 24 17:53:21 cirithungol kernel: but task is already holding lock: > Feb 24 17:53:21 cirithungol kernel: (rtnl_mutex){--..}, at: [<c05cea31>] rtnetlink_rcv+0x12/0x26 > Feb 24 17:53:21 cirithungol kernel: > Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock. What's happening here is that the linkwatch_work runs on the generic schedule_work() workqueue. > Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}: The function that is called is linkwatch_event(), which acquires the RTNL as you can see here: > Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}: > Feb 24 17:53:21 cirithungol kernel: [<c04458f7>] __lock_acquire+0xa7c/0xbf4 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel: [<c04415dc>] tick_program_event+0x31/0x55 > Feb 24 17:53:21 cirithungol kernel: [<c0445ad9>] lock_acquire+0x6a/0x90 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel: [<c0638d21>] mutex_lock_nested+0xdb/0x271 > Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11 > Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times > Feb 24 17:53:21 cirithungol kernel: [<c05cf755>] linkwatch_event+0x8/0x22 The problem with that is that tulip_down() calls flush_scheduled_work() while holding the RTNL: > Feb 24 17:53:21 cirithungol kernel: [<c0436f9a>] flush_workqueue+0x0/0x85 > Feb 24 17:53:21 cirithungol kernel: [<c043702c>] flush_scheduled_work+0xd/0xf > Feb 24 17:53:21 cirithungol kernel: [<f8f4380a>] tulip_down+0x20/0x1a3 [tulip] [...] > Feb 24 17:53:21 cirithungol kernel: [<c05cea3d>] rtnetlink_rcv+0x1e/0x26 (rtnetlink_rcv will acquire the RTNL) The deadlock that can now happen is that linkwatch_work is scheduled on the workqueue but not running yet. During tulip_down(), flush_scheduled_work() is called which will wait for everything that is scheduled to complete. Among those things could be linkwatch_event() which will start running and try to acquire the RTNL. Because that is already locked it will wait for the RTNL, but on the other hand we're waiting for linkwatch_event() to finish while holding the RTNL. The fix here would most likely be to not use flush_scheduled_work() but rather cancel_work_sync(). This should be a correct change afaict, unless tulip has more work structs than the media work. @@ tulip_down - flush_scheduled_work(); + cancel_work_sync(&tp->media_work); johannes Created attachment 298138 [details] dmesg-ifup-2.6.25-113-errors.txt Here is an error that occurred with kernel-2.6.25-0.113.rc5.git2. It has some similarities and some obvious differences. It was produced in the same manner, by issuing the command 'ifdown eth0; ifup eth0' on the system while there were applications with open connections. There appear to be two problems in this attachment. The machine had been running this kernel about 24 hours with actively seeding torrents and the folding at home client running the entire time. No kernel or net issues occurred until I decided to provoke this. Also, I'm wondering if bug# 433541 is related to this bug. The second trace in this attachment is different from any that I've posted in either of these bugs, but it occurred immediately with the first (this is directly copied out of dmesg). Posted to netdev 2008-02-24 by davej as "lockdep trace from rc2" but after some discussion the thread died. Created attachment 299658 [details] st-tulip-ifup.txt 2.6.25-0.121.rc5 Still occurs in rc5, I'm trying to get an rc7 kernel to run on this machine, the latest in rawhide does not work (0.171) but I'm trying the others. The attachment stacktrace here occurred after I removed NetworkManager and it crashed (bug# 431038). This still does fail during ifup/ifdown though. Changing version to '9' as part of upcoming Fedora 9 GA. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping *** Bug 449969 has been marked as a duplicate of this bug. *** This message is a reminder that Fedora 9 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 9. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '9'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 9's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 9 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed. |