Bug 169502

Summary: kernel: unregister_netdevice: waiting for ppp0 to become free. pppoe + tc qos?
Product: [Fedora] Fedora Reporter: Trevor Cordes <trevor>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: medium    
Version: 8CC: charlieb-fedora-bugzilla, mattdm, rvokal
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.24.4-64 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-04-17 01:08:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
annotated log excerpts showing 4 instances of bug hitting none

Description Trevor Cordes 2005-09-28 23:49:31 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
I was playing around with advanced policy routing on a multihomed (2 DSL links) FC3 router.  During the tests I ran ifup ppp1 immediately (ie: before ifup had completed) followed by ip route show table 2 to confirm the previous ifdown had deleted the ppp1 routes.  The ip route show immediately hung and would not respond to ^C or ^\ or kills (even -9) on another terminal.  The console showed an error:

kernel: unregister_netdevice: waiting for ppp1 to become free. usage count=1

The ppp1 link eventually came up but the ip route show was still hung.  I could not kill it.  I issued a reboot command.  /var/log/messages showed everything shutting down but the ssh terminals I was using were still open and accepting commands!  The system refused to shutdown or reboot!  I issued a shutdown command and it did nothing.  I had to soft reset with the reset button on the box.

My biggest concern is that if this can happen on non-multihomed boxes that aren't using advanced routing (or even multihomed boxes with advanced routing!), you have a race condition where someone (or a script) might be ip route show'ing while the interface is coming up, causing very bizarre kernel behaviour and instabilities.

The only boxes I have with pppoe are remote production boxes and I cannot easily test this bug.  I'm not sure how much of it is related to my advanced routing and how much is just basic ppp stuff.  A good test would be to, on a test box with 1 pppoe connection, do: "ifup ppp0 &; ip route show".  If that doesn't hang then perhaps it's related to advanced routing.

To reproduce my case it would be easy to put in some rules:

ip rule add from 1.2.3.4 table 2
ip rule add to 1.2.3.5 table 2
ip route add 192.168.1.0/24 via <your-ptp-next-hop> table 2
ip route add default via <your-ptp-next-hop> table 2

then try "ifup ppp0 &; ip route show table 2".  I'm guessing the bogus 1.2.3.4 IP's will be good enough to trigger this bug.  I doubt you need to setup "real" routes, like I was using.

Lastly, it might be related to the multihomed aspect of the box, which means you'd need to pppoe connections -- but I seriously doubt this is the problem.  It seems more basic than that to me.


Version-Release number of selected component (if applicable):
initscripts-7.93.7-1 rp-pppoe-3.5-22 iproute-2.6.9-3

How reproducible:
Didn't try

Steps to Reproduce:
1. perhaps setup some policy routing
2. ifup ppp0 &; ip route show
3.
  

Actual Results:  bizarre command hang and inability to shutdown/reboot the kernel

Expected Results:  the ip route show command should immediately return with the results, or perhaps an error


Additional info:

Comment 2 Trevor Cordes 2005-10-02 13:55:24 UTC
I was onsite doing some stuff on this same box tonight and I ran into this
problem again: but worse!  I initiated a shutdown and it did the same thing as
above but was outputting console messages:

kernel: unregister_netdevice: waiting for ppp0 to become free. usage count=1

So now it's doing it on ppp0 AND ppp1, AND I did not issue any manual ip route
commands before the shutdown!  Of course, I have background scripts that could
easily have done an ip route show, but the main point is that this bug is very
easy to trigger.  Since it now shows up on ppp0, I'm starting to think this is
somehow related either to a) 2 pppoe connections, or b) advanced policy routing.


Comment 5 Trevor Cordes 2005-10-26 23:12:31 UTC
This problem is real.  It's really starting to ruin my day on my multihomed
boxes.  Let me know if there's anything I can do to test.

Comment 6 Trevor Cordes 2005-12-15 19:27:20 UTC
Just bit me again.  I had the same server have the internet die again:

Dec 14 21:05:05 firewall pppd[9908]: Connection terminated.
Dec 14 21:05:05 firewall pppd[9908]: Connect time 15433.7 minutes.
Dec 14 21:05:05 firewall pppd[9908]: Sent 1520690784 bytes, received 2367693942
bytes.
Dec 14 21:05:15 firewall kernel: unregister_netdevice: waiting for ppp0 to
become free. Usage count = 1
Dec 14 21:05:46 firewall last message repeated 3 times
Dec 14 21:05:56 firewall kernel: unregister_netdevice: waiting for ppp0 to
become free. Usage count = 1
Dec 14 21:06:02 firewall pppoe[9909]: Inactivity timeout... something wicked
happened on session 2405
Dec 14 21:06:02 firewall pppoe[9909]: Sent PADT
Dec 14 21:06:06 firewall kernel: unregister_netdevice: waiting for ppp0 to
become free. Usage count = 1

The pppoe connection refused to come up for hours all night until someone onsite
rebooted the machine.  I don't yet know if it was able to self-reboot (clean) or
had to be hard reset.


Comment 7 Matthew Miller 2006-07-10 20:51:58 UTC
Fedora Core 3 is now maintained by the Fedora Legacy project for security
updates only. If this problem is a security issue, please reopen and
reassign to the Fedora Legacy product. If it is not a security issue and
hasn't been resolved in the current FC5 updates or in the FC6 test
release, reopen and change the version to match.

Thank you!


Comment 8 Trevor Cordes 2006-10-30 02:11:01 UTC
I just upgraded the box to FC4 -- bug remains.  In fact, worse than ever, hits
on almost every reboot attempt even after the box has only been on a few minutes.

I then upgraded the box to FC5.  Exactly the same.

The only way I can reboot this box now is using reboot -f.  Every attempt at
normal reboot fails with endless loop of waiting for ppp0 to become free errors.

My money says it's in FC6 too, but I can't test that as my production boxes are
all configured for FC5.

Comment 9 Trevor Cordes 2006-10-30 02:36:17 UTC
I think I should provide a little more info.  As time has gone on, I've gained
more insight into this bug.

First off, on the box in question I almost never run ip route show from the
command line interactively.  But I have a script that runs it every time a VPN
goes down to determine how to bring it back up.  So in pathological cases where
the ISP is down for a while, my script is executing many ip route flush, ip rule
add, ip show, ip route add, ip rule show, commands (a batch every minute or so)
until it comes back up.

So it's general "ip" command manipulation that is probably causing this bug.  I
can't pinpoint it to one particular subcommand, though the example in comment #1
is sure to cause me trouble everytime.

Regardless of how many route/rule changes I do, the system should not hang on
reboot "waiting for ppp0".  At the very least have it detect it's trying to
reboot and give up "waiting for ppp0" after 5 tries or something!


Comment 10 Trevor Cordes 2007-03-02 18:27:40 UTC
Bug still occurs as of: 2.6.18-1.2257.fc5smp.  Testing 2288 now.


Comment 11 Trevor Cordes 2007-06-16 09:13:19 UTC
Bug still occurs as of kernel-2.6.20-1.2316.fc5!  Hits once every couple of
months.  Still a pain to go onsite to hard reset.

I think my initial assessment may have been off.  I've been reading up on the
many "unregister_netdevice" bugs that are out there.  It appears these bugs
occur because some kernel code is holding/queuing packets for an interface when
the interface is brought down.

That got me thinking that instead of the multihoming theory in comment #1,
perhaps it's the HTB QOS that I'm doing on the box that's to blame.  If that's
the case, it must be a bug that only occurs with HTB (or all tc QOS) + PPPoE
because I have a very similarly configured box that is cable modem (non-PPPoE)
that has never exhibited this bug.  Also, my similarly configured boxes without
QOS (HTB or otherwise) with or without PPPoE do not exhibit this bug.  Also very
interesting, all the bug hits in the last year have been on ppp0, the only
interface I do QOS on.  That said, obviously from comment #1 I have seen this
bug on ppp1 which I am pretty sure I never ran QOS on.  It could be a red herring.

My tc htb qos setup:
#tc qdisc
qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth4 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc htb 1: dev ppp0 r2q 10 default 13 direct_packets_stat 0
qdisc sfq 110: dev ppp0 parent 1:10 limit 128p quantum 1492b perturb 10sec
qdisc sfq 111: dev ppp0 parent 1:11 limit 128p quantum 1492b perturb 10sec
qdisc sfq 112: dev ppp0 parent 1:12 limit 128p quantum 1492b perturb 10sec
qdisc sfq 113: dev ppp0 parent 1:13 limit 128p quantum 1492b perturb 10sec
qdisc pfifo_fast 0: dev ppp1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
#tc class show dev ppp0
class htb 1:11 parent 1:1 leaf 111: prio 0 rate 10000bit ceil 270000bit burst
1601b cburst 1632b
class htb 1:10 parent 1:1 leaf 110: prio 0 rate 10000bit ceil 270000bit burst
1601b cburst 1632b
class htb 1:1 root rate 270000bit ceil 270000bit burst 1632b cburst 1632b
class htb 1:13 parent 1:1 leaf 113: prio 0 rate 10000bit ceil 270000bit burst
1601b cburst 1632b
class htb 1:12 parent 1:1 leaf 112: prio 0 rate 240000bit ceil 270000bit burst
1629b cburst 1632b
#tc filter show dev ppp0
filter parent 1: protocol ip pref 49151 fw
filter parent 1: protocol ip pref 49151 fw handle 0xc classid 1:12
filter parent 1: protocol ip pref 49151 fw
filter parent 1: protocol ip pref 49151 fw handle 0xb classid 1:11
filter parent 1: protocol ip pref 49152 fw
filter parent 1: protocol ip pref 49152 fw handle 0xa classid 1:10

I now have quite a few log histories of these bugs occurring.  I can pretty much
determine what the sequence of events are from the 4 examples in the last year.
 See the attachment.

Also interesting, unlike many other unregister_netdevice bug reports in google,
my "Usage count" in the bug line is ALWAYS = 1.

This is a persistent and horribly awful bug.  I am going to see if I can open a
kernel.org bugzilla for it too (will post link here).  Obviously it must be
something fairly unique to this box for no one else to have hit upon this yet. 
Unfortunately, this box is the bleeding edge use-every-technology one and it's a
mega-important production box!


Comment 12 Trevor Cordes 2007-06-16 09:15:26 UTC
Created attachment 157183 [details]
annotated log excerpts showing 4 instances of bug hitting

Comment 13 Trevor Cordes 2007-06-16 09:44:48 UTC
I should add that looking back this bug did only start appearing when I added
the 2nd PPPoE line to that box.  I had done QoS a few months before that and I
don't believe this bug hit.  That would indicate it was the 2-connection /
multihoming and relevant routing changes that triggered it.  Maybe my first
impression was correct.  This one is so hard to pin down.

Comment 14 Trevor Cordes 2007-06-19 09:38:41 UTC
new kernel bugzilla link
http://bugzilla.kernel.org/show_bug.cgi?id=8638

Comment 15 Thomas Woerner 2007-07-23 15:31:13 UTC
This seems to be a kernel problem, assigning to kernel.

Comment 16 petrosyan 2008-03-10 15:52:17 UTC
Fedora Core 5 is no longer maintained. Is this bug still present in Fedora 7 or
Fedora 8?

Comment 17 Trevor Cordes 2008-03-10 21:13:21 UTC
As of F8 2.6.23.9-85.fc8.i686 this bug still occurs.  I can confirm the same
error message shows up.  However, I cannot confirm the system(network) hanging
occurs as per earlier comments (my guess is it does) because I now have a script
in place that does a reboot -f immediately when it sees the error message.  A
bad kludge, but it works to keep the system unhung.

I will upgrade to the latest kernel asap and try it out too, but as far as I'm
concerned this bug is still open.  I'll check my upstream bug to see if any
progress has been made there.

Comment 18 Chuck Ebbert 2008-04-17 01:08:45 UTC
According to the upstream bug this is fixed in 2.6.24. Closing.

Comment 19 Charlie Brady 2012-01-18 19:26:15 UTC
(In reply to comment #18)
> According to the upstream bug this is fixed in 2.6.24. Closing.

The upstream bug

http://bugzilla.kernel.org/show_bug.cgi?id=8638

was closed as a duplicate of:

https://bugzilla.kernel.org/show_bug.cgi?id=6197

Bug 6197 was closed after:

https://www.linux-cr.org/redmine/projects/kernel-cr/repository/revisions/5dba4797115c8fa05c1a4d12927a6ae0b33ffc41/diff

fixed a reference leak in IPsec related code. But notice comment #59:

https://bugzilla.kernel.org/show_bug.cgi?id=6197#c59

 "this report is a complete mess of I don't know how many different problems"

Trevor, is your problem really fixed upstream?

Comment 20 Trevor Cordes 2012-01-19 06:02:23 UTC
This bug hasn't occurred in a very long time on the (previously) affected system.  The config and hardware never changed, so the upstream must have fixed it along the way.

It is running Fedora 14 currently so it must have been fixed between Fedora 8 and 14.  I can't recall exactly when ;-)

Since it used to hit every few weeks/months, and it's been years now, I would say for sure this bug has been fixed upstream.

Interesting that it turned out to be an ipsec bug, not a tc/qos bug as I had guessed.  The affected system was/is running a bunch of ipsec connections.

Thanks!!