Bug 169502
| Field | Value |
|---|---|
| Summary | kernel: unregister_netdevice: waiting for ppp0 to become free. pppoe + tc qos? |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 8 |
| Hardware | i386 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Trevor Cordes <trevor> |
| Assignee | Kernel Maintainer List <kernel-maint> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | charlieb-fedora-bugzilla, mattdm, rvokal |
| Fixed In Version | 2.6.24.4-64 |
| Doc Type | Bug Fix |
| Last Closed | 2008-04-17 01:08:45 UTC |
Description
Trevor Cordes
2005-09-28 23:49:31 UTC
I was onsite doing some stuff on this same box tonight and I ran into this problem again, but worse! I initiated a shutdown and it did the same thing as above, but was outputting console messages:

    kernel: unregister_netdevice: waiting for ppp0 to become free. usage count=1

So now it's doing it on ppp0 AND ppp1, AND I did not issue any manual ip route commands before the shutdown. Of course, I have background scripts that could easily have done an ip route show, but the main point is that this bug is very easy to trigger. Since it now shows up on ppp0, I'm starting to think this is somehow related either to a) 2 PPPoE connections, or b) advanced policy routing. This problem is real. It's really starting to ruin my day on my multihomed boxes. Let me know if there's anything I can do to test.

Just bit me again. I had the same server have the internet die again:

    Dec 14 21:05:05 firewall pppd[9908]: Connection terminated.
    Dec 14 21:05:05 firewall pppd[9908]: Connect time 15433.7 minutes.
    Dec 14 21:05:05 firewall pppd[9908]: Sent 1520690784 bytes, received 2367693942 bytes.
    Dec 14 21:05:15 firewall kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
    Dec 14 21:05:46 firewall last message repeated 3 times
    Dec 14 21:05:56 firewall kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
    Dec 14 21:06:02 firewall pppoe[9909]: Inactivity timeout... something wicked happened on session 2405
    Dec 14 21:06:02 firewall pppoe[9909]: Sent PADT
    Dec 14 21:06:06 firewall kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1

The PPPoE connection refused to come up for hours all night, until someone onsite rebooted the machine. I don't yet know if it was able to self-reboot (clean) or had to be hard reset.

Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!

I just upgraded the box to FC4 -- bug remains. In fact, it is worse than ever, hitting on almost every reboot attempt even after the box has only been on a few minutes. I then upgraded the box to FC5. Exactly the same. The only way I can reboot this box now is using reboot -f. Every attempt at a normal reboot fails with an endless loop of "waiting for ppp0 to become free" errors. My money says it's in FC6 too, but I can't test that as my production boxes are all configured for FC5.

I think I should provide a little more info. As time has gone on, I've gained more insight into this bug. First off, on the box in question I almost never run ip route show from the command line interactively. But I have a script that runs it every time a VPN goes down to determine how to bring it back up. So in pathological cases where the ISP is down for a while, my script is executing many ip route flush, ip rule add, ip route show, ip route add, and ip rule show commands (a batch every minute or so) until it comes back up. So it's general "ip" command manipulation that is probably causing this bug. I can't pinpoint it to one particular subcommand, though the example in comment #1 is sure to cause me trouble every time.

Regardless of how many route/rule changes I do, the system should not hang on reboot "waiting for ppp0". At the very least, have it detect that it's trying to reboot and give up "waiting for ppp0" after 5 tries or something!
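For context, here is a minimal sketch of the kind of policy-routing batch a VPN-recovery script like the one described above might issue. The table name, addresses, and interface assignment are placeholders invented for illustration, not taken from the reporter's actual script:

```sh
# Hypothetical multihomed policy-routing batch (names/addresses are
# placeholders, not from the reporter's script).
ip route flush table isp2                # clear the per-ISP routing table
ip route add default dev ppp1 table isp2 # point-to-point default via the 2nd PPPoE link
ip rule add from 192.0.2.10 table isp2   # source-route this host's traffic out ppp1
ip route show table isp2                 # the "ip route show" check the script performs
ip rule show
ip route flush cache                     # force re-evaluation of cached routes
```

Repeated batches of this sort, run every minute while a link is down, are the "general ip command manipulation" the reporter suspects of leaking a reference on the ppp device.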
Bug still occurs as of 2.6.18-1.2257.fc5smp. Testing 2288 now.

Bug still occurs as of kernel-2.6.20-1.2316.fc5! It hits once every couple of months. Still a pain to go onsite to hard reset.

I think my initial assessment may have been off. I've been reading up on the many "unregister_netdevice" bugs that are out there. It appears these bugs occur because some kernel code is holding/queuing packets for an interface when the interface is brought down. That got me thinking that instead of the multihoming theory in comment #1, perhaps it's the HTB QoS that I'm doing on the box that's to blame. If that's the case, it must be a bug that only occurs with HTB (or all tc QoS) + PPPoE, because I have a very similarly configured box on a cable modem (non-PPPoE) that has never exhibited this bug. Also, my similarly configured boxes without QoS (HTB or otherwise), with or without PPPoE, do not exhibit this bug. Also very interesting: all the bug hits in the last year have been on ppp0, the only interface I do QoS on. That said, obviously from comment #1 I have seen this bug on ppp1, which I am pretty sure I never ran QoS on. It could be a red herring.

My tc HTB QoS setup:

    #tc qdisc
    qdisc pfifo_fast 0: dev eth1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: dev eth4 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc htb 1: dev ppp0 r2q 10 default 13 direct_packets_stat 0
    qdisc sfq 110: dev ppp0 parent 1:10 limit 128p quantum 1492b perturb 10sec
    qdisc sfq 111: dev ppp0 parent 1:11 limit 128p quantum 1492b perturb 10sec
    qdisc sfq 112: dev ppp0 parent 1:12 limit 128p quantum 1492b perturb 10sec
    qdisc sfq 113: dev ppp0 parent 1:13 limit 128p quantum 1492b perturb 10sec
    qdisc pfifo_fast 0: dev ppp1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

    #tc class show dev ppp0
    class htb 1:11 parent 1:1 leaf 111: prio 0 rate 10000bit ceil 270000bit burst 1601b cburst 1632b
    class htb 1:10 parent 1:1 leaf 110: prio 0 rate 10000bit ceil 270000bit burst 1601b cburst 1632b
    class htb 1:1 root rate 270000bit ceil 270000bit burst 1632b cburst 1632b
    class htb 1:13 parent 1:1 leaf 113: prio 0 rate 10000bit ceil 270000bit burst 1601b cburst 1632b
    class htb 1:12 parent 1:1 leaf 112: prio 0 rate 240000bit ceil 270000bit burst 1629b cburst 1632b

    #tc filter show dev ppp0
    filter parent 1: protocol ip pref 49151 fw
    filter parent 1: protocol ip pref 49151 fw handle 0xc classid 1:12
    filter parent 1: protocol ip pref 49151 fw
    filter parent 1: protocol ip pref 49151 fw handle 0xb classid 1:11
    filter parent 1: protocol ip pref 49152 fw
    filter parent 1: protocol ip pref 49152 fw handle 0xa classid 1:10

I now have quite a few log histories of these bugs occurring. I can pretty much determine what the sequence of events is from the 4 examples in the last year. See the attachment. Also interesting: unlike many other unregister_netdevice bug reports in Google, my "Usage count" in the bug line is ALWAYS = 1.

This is a persistent and horribly awful bug. I am going to see if I can open a kernel.org bugzilla for it too (will post link here). Obviously it must be something fairly unique to this box for no one else to have hit upon it yet. Unfortunately, this box is the bleeding-edge use-every-technology one and it's a mega-important production box!
Created attachment 157183 [details]
annotated log excerpts showing 4 instances of bug hitting
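For readers reproducing the setup, here is a sketch of the tc commands that would roughly recreate the ppp0 HTB + SFQ configuration dumped above. The rates, class IDs, and fw marks are read off the reporter's output, but the filter priorities and the iptables mangle rules that set the marks are assumptions, not taken from the report:

```sh
# Approximate reconstruction of the ppp0 HTB + SFQ setup shown above
# (filter priorities and the rules that set marks 0xa-0xc are guesses).
tc qdisc add dev ppp0 root handle 1: htb default 13 r2q 10
tc class add dev ppp0 parent 1:  classid 1:1  htb rate 270kbit ceil 270kbit
tc class add dev ppp0 parent 1:1 classid 1:10 htb rate 10kbit  ceil 270kbit
tc class add dev ppp0 parent 1:1 classid 1:11 htb rate 10kbit  ceil 270kbit
tc class add dev ppp0 parent 1:1 classid 1:12 htb rate 240kbit ceil 270kbit
tc class add dev ppp0 parent 1:1 classid 1:13 htb rate 10kbit  ceil 270kbit
tc qdisc add dev ppp0 parent 1:10 handle 110: sfq perturb 10 quantum 1492
tc qdisc add dev ppp0 parent 1:11 handle 111: sfq perturb 10 quantum 1492
tc qdisc add dev ppp0 parent 1:12 handle 112: sfq perturb 10 quantum 1492
tc qdisc add dev ppp0 parent 1:13 handle 113: sfq perturb 10 quantum 1492
# Traffic is classified by fwmark (presumably set elsewhere via iptables -t mangle):
tc filter add dev ppp0 parent 1: protocol ip prio 1 handle 0xa fw classid 1:10
tc filter add dev ppp0 parent 1: protocol ip prio 1 handle 0xb fw classid 1:11
tc filter add dev ppp0 parent 1: protocol ip prio 1 handle 0xc fw classid 1:12
```

The relevant point for the bug is simply that ppp0 is the only device with an HTB hierarchy attached; all other interfaces keep the default pfifo_fast qdisc.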
I should add that, looking back, this bug only started appearing when I added the 2nd PPPoE line to that box. I had done QoS a few months before that and I don't believe this bug hit. That would indicate it was the 2-connection / multihoming and the associated routing changes that triggered it. Maybe my first impression was correct. This one is so hard to pin down.

New kernel bugzilla link: http://bugzilla.kernel.org/show_bug.cgi?id=8638

This seems to be a kernel problem, assigning to kernel.

Fedora Core 5 is no longer maintained. Is this bug still present in Fedora 7 or Fedora 8?

As of F8 2.6.23.9-85.fc8.i686 this bug still occurs. I can confirm the same error message shows up. However, I cannot confirm that the system (network) hang occurs as per earlier comments (my guess is it does), because I now have a script in place that does a reboot -f immediately when it sees the error message. A bad kludge, but it works to keep the system unhung. I will upgrade to the latest kernel ASAP and try it out too, but as far as I'm concerned this bug is still open. I'll check my upstream bug to see if any progress has been made there.

According to the upstream bug this is fixed in 2.6.24. Closing.

(In reply to comment #18)
> According to the upstream bug this is fixed in 2.6.24. Closing.

The upstream bug http://bugzilla.kernel.org/show_bug.cgi?id=8638 was closed as a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=6197. Bug 6197 was closed after https://www.linux-cr.org/redmine/projects/kernel-cr/repository/revisions/5dba4797115c8fa05c1a4d12927a6ae0b33ffc41/diff fixed a reference leak in IPsec-related code. But notice comment #59 (https://bugzilla.kernel.org/show_bug.cgi?id=6197#c59): "this report is a complete mess of I don't know how many different problems". Trevor, is your problem really fixed upstream?

This bug hasn't occurred in a very long time on the (previously) affected system. The config and hardware never changed, so upstream must have fixed it along the way. It is running Fedora 14 currently, so it must have been fixed between Fedora 8 and 14. I can't recall exactly when ;-) Since it used to hit every few weeks/months, and it's been years now, I would say for sure this bug has been fixed upstream. Interesting that it turned out to be an ipsec bug, not a tc/qos bug as I had guessed. The affected system was/is running a bunch of ipsec connections. Thanks!!
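For completeness, a rough sketch of the "reboot -f" watchdog kludge mentioned in the F8 comment above. The actual script is not in the report; the polling interval, log source, and match string here are assumptions:

```sh
#!/bin/sh
# Hedged sketch of the reboot -f workaround described above: poll the
# kernel ring buffer and force a reboot when the refcount message appears.
# (Interval and exact match are assumptions, not from the report.)
while sleep 30; do
    if dmesg | grep -q 'unregister_netdevice: waiting for ppp0 to become free'; then
        logger -t ppp0-watchdog "refcount leak detected, forcing reboot"
        /sbin/reboot -f
    fi
done
```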