Bug 160340 - vconfig rem causes unregister_netdevice messages when HWADDR was used on base interface
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
: ---
Assignee: Andy Gospodarek
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On: 245197
Blocks: 234251 245198
 
Reported: 2005-06-14 15:27 UTC by Sean E. Millichamp
Modified: 2014-06-29 22:57 UTC (History)

Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-15 16:13:02 UTC
Target Upstream Version:
Embargoed:


Attachments
bond-2.6.13-ref-leak.patch (775 bytes, text/x-patch)
2007-03-16 17:41 UTC, Archana K. Raghavan


Links
System ID: Red Hat Product Errata RHBA-2007:0791 (Private: 0, Priority: normal, Status: SHIPPED_LIVE)
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 6
Last Updated: 2007-11-14 18:25:55 UTC

Description Sean E. Millichamp 2005-06-14 15:27:39 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.2-1.3.1 StumbleUpon/1.9993 Firefox/1.0.4

Description of problem:
If you use the 'HWADDR' option in ifcfg-eth[0-9]+ (or the underlying /sbin/nameif
that it calls) and then create a VLAN interface on top of the base eth[0-9]
device, vconfig rem cannot remove it. It hangs forever with:
unregister_netdevice: waiting for eth0.2 to become free.  Usage count = 7

This is a critical problem because the initscripts call vconfig rem when you ifdown a VLAN interface or shut down the system, and all interfaces then hang.


Version-Release number of selected component (if applicable):
vconfig-1.8-4

How reproducible:
Always

Steps to Reproduce:
1. Create a base Ethernet device configuration in (for example) ifcfg-eth0
2. Make sure to use HWADDR in ifcfg-eth0 and specify a proper MAC address
3. Create a VLAN interface such as ifcfg-eth0.2
4. Do '/sbin/service network start'
5. Do '/sbin/service network stop'

Actual Results:  You will get something like:
unregister_netdevice: waiting for eth0.2 to become free.  Usage count = 7
repeated forever.  All networking related tools on the system seem to 
subsequently hang.

A 'ps' reveals that it is 'vconfig rem eth0.2' which is causing the problem.

Expected Results:  It should have removed the vlan interface from the system without problems.

Additional info:

If I remove the HWADDR specifications from my ifcfg-eth[0-9] files then it works just fine.  As soon as I add them back the problem returns.
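
The kernel message quoted above has a fixed shape, so the device name and usage count can be pulled out of a captured console log and compared across reports. The following is a minimal Python sketch, assuming the message format matches the lines quoted in this report; the function name parse_wait_messages is hypothetical and not part of any existing tool.

```python
import re

# Pattern for the kernel message as quoted in this report.
# The whitespace after "free." varies between reports, hence \s+.
WAIT_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\.\s+"
    r"Usage count = (?P<count>\d+)"
)


def parse_wait_messages(log_text):
    """Return a (device, usage_count) tuple for each matching message."""
    return [
        (m.group("dev"), int(m.group("count")))
        for m in WAIT_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "unregister_netdevice: waiting for eth0.2 to become free.  "
        "Usage count = 7\n"
    )
    for dev, count in parse_wait_messages(sample):
        print(f"{dev}: usage count {count}")
```

A nonzero count that never changes between repeated messages suggests that something is still holding a reference to the device.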

Comment 1 Sean E. Millichamp 2005-06-14 19:01:27 UTC
Hmm... It seems that I get the 'unregister_netdevice' messages even without the
ifcfg HWADDR option being set.

I had them unset and performed a reboot and it hung on shutdown again.

There is still something not right with the system, but it is less clear to me
what the triggering factor is.

Comment 2 Philipp Gantert 2005-07-14 07:29:43 UTC
I have two machines. The one with kernel 2.6.9-11 has the same problem,
and it seems that the other machine, with kernel 2.6.9-5.0.5, doesn't have
this problem.

Comment 3 Phil Knirsch 2005-08-15 15:35:14 UTC
From the descriptions of the problem, I strongly suspect that this is more likely
to be a kernel problem than a userland vconfig problem.

Reassigning bug to kernel component.

Read ya, Phil

Comment 6 Buck Huppmann 2006-08-08 12:14:48 UTC
we have the same problem with a machine that's got VLANs configured
on top of a bonded device. the network rc script is able to shut down
a few of them, but when it tries to shut down the interface that's
configured with the machine's default route, it hangs with a usage
count of 1. (i'm not saying the default route is in any way germane to
the problem. it has always been that particular interface that brings
the process to a halt, but i can't rule out that the interfaces
following it would do the same.)

this RH4 box replaced a Red Hat 3 machine that had vconfig-ed interfaces
directly ``atop'' the physical device, and that never exhibited
such problems, FWIW. i thought it was maybe an interaction between
bond-ing and vlan-ning, but from the other reports here it sounds like
maybe not

Comment 7 Buck Huppmann 2006-08-08 12:16:38 UTC
(In reply to comment #6)

should have mentioned the machine is 32-bit ix86, not x86_64, like the original
submitter's

Comment 8 Steve Snodgrass 2007-02-06 16:55:26 UTC
I've seen this problem on two different RHEL4 systems with VLAN interfaces. 
Only one of the two was using HWADDR on the underlying interface and both of
them are running the 32-bit kernel.  The latest incidence was this morning
running this kernel:

2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 athlon i386 GNU/Linux

The machine hung on shutdown with:

unregister_netdevice: waiting for eth0.15 to become free. Usage count = 1

Comment 11 Steve Snodgrass 2007-03-06 14:55:22 UTC
I just saw this problem again this morning on the same kernel.

2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 athlon i386 GNU/Linux

I'll upgrade to the latest version but I don't think there's any indication that
this will be fixed.  :(

Comment 12 David Miller 2007-03-07 19:45:15 UTC
As is often the case I bet ipv6 is to blame here.

Here is a test people can run to verify, but please be careful :)
Find the ipv6.ko module for the kernel you are running, and move
it out of the way (for example "mv ipv6.ko ipv6.ko.BAK") then
reboot.

See if that makes the problem go away.  If not, then bonding is likely
the next thing that should be investigated.
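
Before moving anything, it may help to see which files would be affected. The following is a minimal Python sketch of a dry run, assuming the modules for the running kernel live under /lib/modules/<release>; the helper names find_module_files and plan_renames are hypothetical, and the script only prints the mv commands without renaming anything.

```python
import os


def find_module_files(modules_root, filename="ipv6.ko"):
    """Walk modules_root and return sorted paths of files named filename."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(modules_root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def plan_renames(paths, suffix=".BAK"):
    """Return (source, destination) pairs; nothing is renamed here."""
    return [(path, path + suffix) for path in paths]


if __name__ == "__main__":
    # Assumption: the running kernel's modules are under /lib/modules/<release>.
    root = os.path.join("/lib/modules", os.uname().release)
    for src, dst in plan_renames(find_module_files(root)):
        print(f"mv {src} {dst}")
```

Printing the commands instead of running them leaves the decision, and the way back, with whoever is doing the test.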


Comment 13 Sean E. Millichamp 2007-03-08 03:24:11 UTC
Unfortunately, I am not certain I'll be able to test this any time soon.  The
machines where I was regularly observing this were ones I worked on at a prior
job.  That machine wasn't using IPv6, but it did have the module loaded.  No
bonding was in use.

I do believe I just recently saw this with a CentOS machine though which used
tagged VLANs on a bonded (active-passive) interface.  I'll see if I can find an
opportunity to reproduce it and narrow it down.

If someone else on this ticket reliably sees this failure and can do this test
I'd appreciate it.

Comment 16 Archana K. Raghavan 2007-03-16 17:41:57 UTC
Created attachment 150261
bond-2.6.13-ref-leak.patch

Comment 20 David Miller 2007-04-11 21:05:09 UTC
Archana please post the patch for inclusion, it looks fine.


Comment 25 Andy Gospodarek 2007-04-12 00:05:00 UTC
This patch seems fine -- I'll add it to my test kernels.

commit ed4b9f8014db4f343e89b44b7c5ca355f439ce36
Author: Jay Vosburgh <fubar.com>
Date:   Wed Sep 14 14:52:09 2005 -0700

    [PATCH] bonding: plug reference count leak
    
        Bonding leaks route structures when the ARP monitor is
    configured to send probes over VLANs.
    
        Originally reported by Ian Abel <ian.abel>; his
    original fix was modified by Jay Vosburgh to correct coding style and to
    close a leak it missed.
    
    Signed-off-by: Jay Vosburgh <fubar.com>
    Signed-off-by: Jeff Garzik <jgarzik>
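
To illustrate why a leaked reference produces the symptom in this report, here is a toy Python model of reference counting on a device. It is a sketch under stated assumptions, not kernel code: the class ToyNetDevice, the function unregister, and the max_polls limit are all hypothetical. The real kernel keeps waiting and printing; this model gives up after a fixed number of polls so that it terminates.

```python
class ToyNetDevice:
    """Toy stand-in for a network device with a reference count."""

    def __init__(self, name):
        self.name = name
        self.refcnt = 0

    def hold(self):
        """Take a reference (for example, on behalf of a cached route)."""
        self.refcnt += 1

    def put(self):
        """Release a reference previously taken with hold()."""
        assert self.refcnt > 0, "put() without a matching hold()"
        self.refcnt -= 1


def unregister(dev, max_polls=3):
    """Poll until the refcount reaches zero.

    Returns (freed, messages). In this toy model nothing releases a
    reference while we poll, so a leaked reference means the device is
    never freed and one message is recorded per poll.
    """
    messages = []
    for _ in range(max_polls):
        if dev.refcnt == 0:
            return True, messages
        messages.append(
            f"unregister_netdevice: waiting for {dev.name} "
            f"to become free. Usage count = {dev.refcnt}"
        )
    return dev.refcnt == 0, messages


if __name__ == "__main__":
    leaky = ToyNetDevice("eth0.15")
    leaky.hold()
    leaky.hold()
    leaky.put()  # one hold() is never matched by a put(): the leak
    freed, messages = unregister(leaky)
    print(freed)
    print("\n".join(messages))
```

In this model, every code path that calls hold() needs a matching put(); a single missed release is enough to keep the count above zero indefinitely.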



Comment 27 Steve Snodgrass 2007-04-17 14:05:45 UTC
I just hit this problem again this morning on the latest RHEL4 kernel.

Shutting down interface eth0.13:  [  OK  ]
Shutting down interface eth0.15:  unregister_netdevice: waiting for eth0.15 to become free. Usage count = 1
unregister_netdevice: waiting for eth0.15 to become free. Usage count = 1
...and so on....

Kernel info:
2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:17:21 EST 2007 i686 athlon i386 GNU/Linux

Note that I am *not* using the bonding driver, so this problem is not specific
to bonding.  These are just VLAN interfaces on top of an e1000 interface.

Comment 28 Andy Gospodarek 2007-04-17 21:37:09 UTC
Thanks for pointing that out, Steve.  Is this something you can pretty easily
recreate or was this just a freak occurrence?


Comment 29 Steve Snodgrass 2007-04-18 03:10:07 UTC
(In reply to comment #28)
> Thanks for pointing that out, Steve.  Is this something you can pretty easily
> recreate or was this just a freak occurrence?

Unfortunately I can't recreate it on demand, but it's been happening on one of
my production firewalls about once a month for the past 4-5 months when the
system tries to do an automatic reboot for software maintenance reasons.

Comment 30 Andy Gospodarek 2007-04-18 14:12:40 UTC
Thanks for the feedback, Steve.  Though it isn't a direct reproducer, it helps to
know this system is used as a firewall, so the conntrack code might be
problematic as well.

Comment 31 Andy Gospodarek 2007-04-23 17:45:06 UTC
Bonding fix added to my test kernels available here:

http://people.redhat.com/agospoda/#rhel4

I'll continue to investigate other possible patches, but please test this one
against bonding configurations that are problematic.

Comment 32 Steve Snodgrass 2007-05-03 03:47:23 UTC
I have disabled the IPv6 module on my problem system to see if that has any effect.

Comment 38 Jason Baron 2007-06-19 14:17:31 UTC
Committed in stream U6 build 55.9. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 41 Steve Snodgrass 2007-09-15 00:43:36 UTC
I think the -55 kernels may have actually fixed this for me, but I'm a little
confused because I thought the patch that went in was only for the bonding
driver, which I'm not using.

Comment 44 errata-xmlrpc 2007-11-15 16:13:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html


