Bug 431038 (tulip-lockdep)

Summary: oops in net tulip driver on ifup, threadlock, tulip_stop_rxtx() failed
Product: [Fedora] Fedora Reporter: Andrew Farris <lordmorgul>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: low    
Version: 9   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-07-14 15:41:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 202141    
Attachments:
Description Flags
dmesg-eth0-ifup.txt for 2.6.24-2
none
dmesg_eth0_threadlock.txt for 2.6.24-0.167
none
lspci-verbose.txt
none
kernel 2.6.24-7, shows first call trace after ifdown, second after ifup
none
kernel-ifdown-ifup.txt
none
dmesg-ifup-2.6.25-113-errors.txt
none
st-tulip-ifup.txt 2.6.25-0.121.rc5 none

Description Andrew Farris 2008-01-31 11:43:52 UTC
Description of problem:
Kernel dumps stacktrace after tulip driver initializes.

Version-Release number of selected component (if applicable):
2.6.24-0.167.rc8.git4.fc9
2.6.24-2.fc9.i686

How reproducible:
Only happened twice, not occurring yet in 2.6.24-7.

Steps to Reproduce:
1. kernel oops while bringing eth0 up during boot
2. two backtraces for 2.6.24-2, one on reboot, second was after issuing: ifdown
eth0; ifup eth0


0000:02:0a.0: tulip_stop_rxtx() failed (CSR5 0xfc06c012 CSR6 0xff970111)
0000:02:0a.0: tulip_stop_rxtx() failed (CSR5 0xfc664010 CSR6 0xff972113)
eth0: Setting full-duplex based on MII#1 link partner capability of cde1.
eth1:  setting full-duplex.
eth0: no IPv6 routers present
eth1: no IPv6 routers present

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.24-0.167.rc8.git4.fc9 #1
-------------------------------------------------------
events/0/6 is trying to acquire lock:
 (rtnl_mutex){--..}, at: [<c05deea7>] rtnl_lock+0xf/0x11

but task is already holding lock:
 ((linkwatch_work).work){--..}, at: [<c043d511>] run_workqueue+0x85/0x1ac

which lock already depends on the new lock.

Comment 1 Andrew Farris 2008-01-31 11:43:52 UTC
Created attachment 293570 [details]
dmesg-eth0-ifup.txt  for 2.6.24-2

Comment 2 Andrew Farris 2008-01-31 11:45:06 UTC
Created attachment 293571 [details]
dmesg_eth0_threadlock.txt for 2.6.24-0.167

Comment 3 Andrew Farris 2008-01-31 11:45:36 UTC
Created attachment 293572 [details]
lspci-verbose.txt

Comment 4 Andrew Farris 2008-01-31 12:17:40 UTC
Created attachment 293574 [details]
kernel 2.6.24-7, shows first call trace after ifdown, second after ifup

Actually only one trace from 2.6.24-2, it was when issuing ifdown at console
manually.

Just decided to try and provoke this in 2.6.24-7 and it did occur.  I was
logged into X, and issued ifdown eth0, then checked kernel messages.  Seeing
the problem occurred I logged out of X and then did ifup eth0.	The system
responded to ctrl-alt-del but nothing else after the ifup.

Comment 5 Andrew Farris 2008-02-25 02:02:02 UTC
Created attachment 295760 [details]
kernel-ifdown-ifup.txt

Here is another example of this with kernel-2.6.25-0.54.rc2.fc9.i686.  I came
back from the weekend to find my desktop not moving any traffic (it was left
seeding torrents) and after resetting the external router/modem I did
ifdown/ifup for eth0 which is the external routed interface.

The attached bt is what happened, and it has kernel-debug installed so it looks
like a bit more info.

Comment 6 Johannes Berg 2008-02-25 10:55:05 UTC
[just replied with this to Dave's mail to the netdev list, figured the
bugtracker could use a copy too.

Comment #4 actually sounds like the deadlock was hit because I frequently see
keyboard problems when causing an RTNL deadlock.]


On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.

I can't fix it but I can explain it.

> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel:  (events){--..}, at: [<c0436f9a>]
flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel:  (rtnl_mutex){--..}, at: [<c05cea31>]
rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock.

What's happening here is that the linkwatch_work runs on the generic
schedule_work() workqueue.

> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:

The function that is called is linkwatch_event(), which acquires the
RTNL as you can see here:

> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel:        [<c04458f7>] __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c04415dc>]
tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel:        [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c0638d21>]
mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel:        [<c05cf755>] linkwatch_event+0x8/0x22

The problem with that is that tulip_down() calls flush_scheduled_work()
while holding the RTNL:

> Feb 24 17:53:21 cirithungol kernel:        [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:        [<c043702c>]
flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel:        [<f8f4380a>] tulip_down+0x20/0x1a3
[tulip]
[...]
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea3d>] rtnetlink_rcv+0x1e/0x26

(rtnetlink_rcv will acquire the RTNL)


The deadlock that can now happen is that linkwatch_work is scheduled on
the workqueue but not running yet. During tulip_down(),
flush_scheduled_work() is called which will wait for everything that is
scheduled to complete. Among those things could be linkwatch_event()
which will start running and try to acquire the RTNL. Because that is
already locked it will wait for the RTNL, but on the other hand we're
waiting for linkwatch_event() to finish while holding the RTNL.

The fix here would most likely be to not use flush_scheduled_work() but
rather cancel_work_sync().

This should be a correct change afaict, unless tulip has more work
structs than the media work.

@@ tulip_down
-       flush_scheduled_work();
+       cancel_work_sync(&tp->media_work);

johannes

Comment 7 Andrew Farris 2008-03-15 09:03:38 UTC
Created attachment 298138 [details]
dmesg-ifup-2.6.25-113-errors.txt

Here is an error that occurred with kernel-2.6.25-0.113.rc5.git2.  It has some
similarities and some obvious differences.  It was produced in the same manner,
by issuing the command 'ifdown eth0; ifup eth0' on the system while there were
applications with open connections.  There appear to be two problems in this
attachment.

The machine had been running this kernel about 24 hours with actively seeding
torrents and the folding at home client running the entire time.  No kernel or
net issues occurred until I decided to provoke this.

Also, I'm wondering if bug# 433541 is related to this bug.  The second trace in
this attachment is different from any that I've posted in either of these bugs,
but it occurred immediately with the first (this is directly copied out of
dmesg).

Comment 8 Chuck Ebbert 2008-03-18 16:42:22 UTC
Posted to netdev 2008-02-24 by davej as "lockdep trace from rc2" but after some
discussion the thread died.

Comment 9 Andrew Farris 2008-03-31 02:03:51 UTC
Created attachment 299658 [details]
st-tulip-ifup.txt 2.6.25-0.121.rc5

Still occurs in rc5, I'm trying to get an rc7 kernel to run on this machine,
the latest in rawhide does not work (0.171) but I'm trying the others.	The
attachment stacktrace here occurred after I removed NetworkManager and it
crashed (bug# 431038).	This still does fail during ifup/ifdown though.

Comment 10 Bug Zapper 2008-05-14 04:56:44 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 11 Dave Jones 2008-06-04 15:19:59 UTC
*** Bug 449969 has been marked as a duplicate of this bug. ***

Comment 12 Bug Zapper 2009-06-09 23:28:12 UTC
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 13 Bug Zapper 2009-07-14 15:41:32 UTC
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.