Description of problem:

The underlying problem that delays the release of the device refcnt is still under investigation. When avahi-daemon tries to leave a multicast group, it hangs waiting for the device's refcnt to drop to zero. Meanwhile, all networking operations hang as well, and the following message is repeated on the console:

kernel: unregister_netdevice: waiting for bond0.200 to become free. Usage count = 5

Here is avahi-daemon waiting while holding the lock:

avahi-daemon  D ffff810009025e20     0 10038      1  10039 10066  9993 (NOTLB)
 ffff81041af53c28 0000000000000082 0000000000000000 ffff81041f722cc0
 00000000fb0000e0 0000000000000008 ffff8104157f7080 ffff81041fc7e080
 000001247bf44bc6 00000000001c1d41 ffff8104157f7268 000000048003dbe6
Call Trace:
 [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
 [<ffffffff80099854>] process_timeout+0x0/0x5
 [<ffffffff80099f1d>] msleep+0x21/0x2c             <-- wait until all references are gone
 [<ffffffff80234d79>] netdev_run_todo+0x10e/0x222  <-- lock held
 [<ffffffff8026a4ba>] ip_mc_leave_group+0x78/0xc1
 [<ffffffff802545b3>] do_ip_setsockopt+0x6e6/0x9d3
 [<ffffffff8000d4ab>] dput+0x2c/0x114
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8002ca7b>] mntput_no_expire+0x19/0x89
 [<ffffffff8000ead8>] link_path_walk+0xa6/0xb2
 [<ffffffff8000c71c>] _atomic_dec_and_lock+0x39/0x57
 [<ffffffff8002ca7b>] mntput_no_expire+0x19/0x89
 [<ffffffff8022b14d>] sys_sendto+0x131/0x164
 [<ffffffff80254958>] ip_setsockopt+0x22/0x78
 [<ffffffff8022a978>] sys_setsockopt+0x91/0xb7

And here are some tools hanging because of that:

vconfig       D ffff810009025e20     0 11130  10751                    (NOTLB)
 ffff81040edc5d68 0000000000000082 0000000000402140 ffff81041bcd7140
 ffff81040edc5d88 0000000000000008 ffff81041de1e040 ffff81041fc7e080
 00000116a0729c90 000000000000024e ffff81041de1e228 00000004157f7080
Call Trace:
 [<ffffffff80046fc0>] try_to_wake_up+0x472/0x484
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff8008fc68>] __cond_resched+0x1c/0x44
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff80234c7f>] netdev_run_todo+0x14/0x222   <--
 [<ffffffff80063be6>] __mutex_unlock_slowpath+0x2a/0x33
 [<ffffffff885de9ae>] :8021q:vlan_ioctl_handler+0x563/0x615
 [<ffffffff8022a519>] sock_ioctl+0x161/0x1e5
 [<ffffffff8004226a>] do_ioctl+0x21/0x6b
 [<ffffffff8003026e>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800b9609>] audit_syscall_entry+0x1a4/0x1cf
 [<ffffffff8004c73b>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

ifconfig      D ffff81000903f1a0     0 11177  11144                    (NOTLB)
 ffff81040e34bd48 0000000000000086 ffff810417faa021 ffff81040e34bea8
 0000000000000000 0000000000000008 ffff81041a463100 ffff81041fda8100
 00000120f6b7d273 0000000000176090 ffff81041a4632e8 00000007093d3b70
Call Trace:
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff80234c7f>] netdev_run_todo+0x14/0x222
 [<ffffffff802347ac>] dev_ioctl+0x38d/0x465
 [<ffffffff800237f5>] __path_lookup_intent_open+0x87/0x97
 [<ffffffff8001b0a6>] open_namei+0x73/0x712
 [<ffffffff8022a58c>] sock_ioctl+0x1d4/0x1e5
 [<ffffffff8004226a>] do_ioctl+0x21/0x6b
 [<ffffffff8003026e>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800b9609>] audit_syscall_entry+0x1a4/0x1cf
 [<ffffffff8004c73b>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

ntpq          D ffff810009015120     0 11449  11448 11450              (NOTLB)
 ffff810401971b38 0000000000000082 0000000000000202 ffffffff8005578e
 ffff810401f77c00 0000000000000001 ffff8104154b9860 ffff81041fc1d080
 0000013f4a80a009 0000000000104449 ffff8104154b9a48 0000000201f77c00
Call Trace:
 [<ffffffff8005578e>] skb_queue_tail+0x17/0x3e
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff80234c7f>] netdev_run_todo+0x14/0x222
 [<ffffffff8023af0a>] rtnetlink_rcv+0x41/0x4e
 [<ffffffff8024987e>] netlink_data_ready+0x12/0x50
 [<ffffffff80248a35>] netlink_sendskb+0x26/0x40
 [<ffffffff80249859>] netlink_sendmsg+0x2b4/0x2c7
 [<ffffffff8000ba76>] touch_atime+0x67/0xaa
 [<ffffffff8005522d>] sock_sendmsg+0xf8/0x14a
 [<ffffffff8000c71c>] _atomic_dec_and_lock+0x39/0x57
 [<ffffffff8000a814>] __link_path_walk+0xf79/0xfb9
 [<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80008d56>] __handle_mm_fault+0x5f3/0x1039
 [<ffffffff8002e261>] __wake_up+0x38/0x4f
 [<ffffffff8022b14d>] sys_sendto+0x131/0x164
 [<ffffffff8022b33c>] move_addr_to_user+0x5d/0x78
 [<ffffffff8022b14d>] sys_sendto+0x131/0x164
 [<ffffffff8022b33c>] move_addr_to_user+0x5d/0x78
 [<ffffffff8022b499>] sys_getsockname+0x82/0xb2
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Version-Release number of selected component (if applicable):
2.6.18-238.el5

How reproducible:
Customer can reproduce easily using their application. I couldn't manage to recreate the problem.

Steps to Reproduce:
1. unknown

Actual results:
Processes hanging at that lock.

Although fixing the underlying problem (assuming there is one) would probably get rid of this problem as well, there is a patch that removes the serialization in netdev_run_todo(). After applying the patch, the system no longer hangs as before: after a couple of minutes the interface is released and everything returns to normal.

Upstream patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=58ec3b4db9eb5a28e3aec5f407a54e28f7039c19

----8<------
From: Herbert Xu <herbert.org.au>
Date: Tue, 7 Oct 2008 22:50:03 +0000 (-0700)
Subject: net: Fix netdev_run_todo dead-lock
X-Git-Tag: v2.6.27~6^2~1

net: Fix netdev_run_todo dead-lock

Benjamin Thery tracked down a bug that explains many instances
of the error

unregister_netdevice: waiting for %s to become free. Usage count = %d

It turns out that netdev_run_todo can dead-lock with itself if
a second instance of it is run in a thread that will then free
a reference to the device waited on by the first instance.

The problem is really quite silly. We were trying to create
parallelism where none was required. As netdev_run_todo always
follows a RTNL section, and that todo tasks can only be added
with the RTNL held, by definition you should only need to wait
for the very ones that you've added and be done with it.

There is no need for a second mutex or spinlock.

This is exactly what the following patch does.

Signed-off-by: Herbert Xu <herbert.org.au>
Signed-off-by: David S. Miller <davem>
----8<------

Another commit is needed to satisfy a dependency for Red Hat Enterprise Linux 5:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1536cc0d55a2820b71daf912060fe43ec15630c2

----8<------
From: Denis V. Lunev <den>
Date: Thu, 11 Oct 2007 04:12:58 +0000 (-0700)
Subject: [NET]: rtnl_unlock cleanups
[...]
----8<------

Customer reports that after applying those two patches the system no longer hangs and the command is released after a couple of minutes.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-246.el5.

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
> Description of problem:
>
> The underlying problem that delays the release
> of the device refcnt is still under investigation.

I think this bug deals with the underlying problem:
https://bugzilla.redhat.com/show_bug.cgi?id=723411

fbl