Description of problem: When I installed win2003 guest from pxe server, host hung. note: "unregister_netdevice: waiting for vbr0 to become free. Usage count = 4" Also, I could not execute any shell cmd, host could not be shutdown by 'Ctrl+Alt+Delete'. Version-Release number of selected component (if applicable): host kernel: 2.6.18-231.el5 kvm-83-207.el5 How reproducible: not always Steps to Reproduce: 1.setup private bridge brname='vbr0' /usr/sbin/brctl addbr $brname echo 1 > /proc/sys/net/ipv6/conf/$brname/disable_ipv6 echo 1 > /proc/sys/net/ipv4/ip_forward /usr/sbin/brctl stp $brname on /usr/sbin/brctl setfd $brname 0 ifconfig $brname 192.168.58.1 ifconfig $brname up iptables -t nat -A POSTROUTING -s 192.168.58.254/24 ! -d 192.168.58.254/24 -j MASQUERADE 2.setup dnsmasq pxe/tftp # dnsmasq --strict-order --bind-interfaces --listen-address 192.168.58.1 --dhcp-range 192.168.58.2,192.168.58.254 --enable-tftp --tftp-root $AUTODIR/tests/kvm/images/tftpboot --dhcp-boot pxelinux.0 --dhcp-no-override 3.start to install guest from network 4.repeat step 1~3 Actual results: host hung Expected results: Installation completed Additional info: 1. detail of unattended_install testcase: https://github.com/autotest/autotest/blob/master/client/tests/kvm/tests/unattended_install.py 2. dmesg info: unregister_netdevice: waiting for vbr0 to become free. Usage count = 4 unregister_netdevice: waiting for vbr0 to become free. Usage count = 4 unregister_netdevice: waiting for vbr0 to become free. Usage count = 4 INFO: task irqbalance:3425 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. irqbalance D ffff810001025e20 0 3425 1 3445 3269 (NOTLB) ffff810874f85d48 0000000000000086 ffff810874f85d48 0000000000000310 000000000000000a 0000000000000007 ffff81087bb7e0c0 ffff81011dda5080 00009d2b7e9a7a3d 000000000005a386 ffff81087bb7e2a8 0000000400001065 Call Trace: [<ffffffff80022242>] __up_read+0x19/0x7f [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14 [<ffffffff802349b4>] netdev_run_todo+0x14/0x222 [<ffffffff8023448d>] dev_ioctl+0x339/0x465 [<ffffffff80047652>] sprintf+0x51/0x59 [<ffffffff8022a300>] sock_ioctl+0x1d4/0x1e5 [<ffffffff80042268>] do_ioctl+0x21/0x6b [<ffffffff80030262>] vfs_ioctl+0x457/0x4b9 [<ffffffff800b95cb>] audit_syscall_entry+0x1a4/0x1cf [<ffffffff8004c737>] sys_ioctl+0x59/0x78 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 INFO: task snmpd:8421 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. snmpd D ffff810001025e20 0 8421 1 8424 8388 (NOTLB) ffff810861c15b38 0000000000000086 0000000000000206 ffffffff8005578a ffff8108744e0400 000000000000000a ffff81085485b0c0 ffff81011dda5080 00009d124ef2184e 000000000002bbed ffff81085485b2a8 00000004744e0400 Call Trace: [<ffffffff8005578a>] skb_queue_tail+0x17/0x3e [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14 [<ffffffff802349b4>] netdev_run_todo+0x14/0x222 [<ffffffff8023ac3f>] rtnetlink_rcv+0x41/0x4e [<ffffffff8024959d>] netlink_data_ready+0x12/0x50 [<ffffffff80248754>] netlink_sendskb+0x26/0x40 [<ffffffff80249578>] netlink_sendmsg+0x2b4/0x2c7 [<ffffffff80055229>] sock_sendmsg+0xf8/0x14a [<ffffffff8001a7c4>] vsnprintf+0x3f8/0x627 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874 [<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e [<ffffffff8001a9ab>] vsnprintf+0x5df/0x627 [<ffffffff80047652>] sprintf+0x51/0x59 [<ffffffff8022aeac>] sys_sendto+0x11c/0x14f [<ffffffff800fa171>] inotify_d_instantiate+0x12/0x3c [<ffffffff8003b1ba>] d_instantiate+0x4a/0x82 [<ffffffff80042146>] d_rehash+0x21/0x34 [<ffffffff8022aade>] sock_attach_fd+0x8f/0xfd [<ffffffff8005d28d>] tracesys+0xd5/0xe0 unregister_netdevice: waiting for vbr0 to become free. Usage count = 4 unregister_netdevice: waiting for vbr0 to become free. Usage count = 4 3. qemu-kvm cmdline: # qemu-kvm -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20101116-040615-Ij08',server,nowait -serial unix:'/tmp/serial-20101116-040615-Ij08',server,nowait -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/images/win2003-32-virtio.qcow2',index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 -net nic,vlan=0,model=virtio,macaddr='9a:71:5c:4f:b6:50' -net tap,vlan=0,ifname='t0-040615-Ij08',script='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/scripts/qemu-ifup-vbr0',downscript='no' -m 2048 -smp 2,cores=1,threads=1,sockets=2 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/ISO/Win2003/32/en_win_srv_2003_r2_enterprise_with_sp2_cd1_X13-05460.iso',media=cdrom,index=1 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/windows/winutils.iso',media=cdrom,index=2 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/windows/virtio-win.iso',media=cdrom,index=3 -cpu qemu64,+sse2 -soundhw ac97 -fda '/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/images/win2003-32/answer.vfd' -redir tcp:5000::10023 -vnc :0 -spice port=8000,disable-ticketing -qxl 1 -rtc-td-hack -M rhel5.5.0 -usbdevice tablet -boot d 4. host cpuinfo: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Xeon(R) CPU E5310 @ 1.60GHz stepping : 11 cpu MHz : 1595.926 cache size : 4096 KB physical id : 1 siblings : 4 core id : 3 cpu cores : 4 apicid : 7 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr lahf_lm bogomips : 3191.91 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: 5. host mem info# cat /proc/meminfo MemTotal: 32940120 kB MemFree: 32542400 kB Buffers: 81428 kB Cached: 201532 kB SwapCached: 0 kB Active: 166644 kB Inactive: 150780 kB HighTotal: 0 kB HighFree: 0 kB LowTotal: 32940120 kB LowFree: 32542400 kB SwapTotal: 34996216 kB SwapFree: 34996216 kB Dirty: 28 kB Writeback: 72 kB AnonPages: 34348 kB Mapped: 13796 kB Slab: 25532 kB PageTables: 3880 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 51466276 kB Committed_AS: 176964 kB VmallocTotal: 34359738367 kB VmallocUsed: 286272 kB VmallocChunk: 34359451971 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 2048 kB
Created attachment 473979 [details] Debug vbr0 ref count leak Can you please try applying this patch on the host and then reproduce the problem? The kernel log should then point us in the direction of the culprit. Thanks!
Created attachment 474236 [details] dmesg I try to release private bridge when guest is using it. attached the dmesg of host.
Thanks! Unfortunately it seems that the log isn't complete as I only see a single hold in there. Can you please try grabbing the syslog copy of the kernel messages? If it still appears imcomplete, you may need to increase the kernel printk buffer size.
Created attachment 475747 [details] calltrace of hold/put vbr0 host hung about 60 seconds, but not outputed "unregister_netdevice: waiting for vbr0 to become free. Usage count = .."
Created attachment 475748 [details] hold/put vbr0 caused host fs read-only When I hold/put vbr0, it caused host fs read-only. host hung about 60 seconds, and not outputted "unregister_netdevice: waiting for vbr0 to become free. Usage count = .." I'll try in other machines.
Created attachment 475757 [details] reproduced reference-count issue I re-tested in another machine, I try to delete vbr0 when guest is using it, kernel outputted """ Message from syslogd@ at Fri Jan 28 17:26:06 2011 ... intel-8400-8-2 last message repeated 6 times unregister_netdevice: waiting for vbr0 to become free. Usage count = 7 ... """ I try to kill guest process by kill -9 $pid, but it becomes zombie. # ps aux|grep qemu root 3371 2.2 0.0 0 0 pts/1 Zl 17:20 0:10 [qemu-kvm] <defunct> attached the detail dmesg.
Thanks a lot for the data! I tried analysing it and there may be multiple sources of leaks. But the most obvious should be cured if we backport this patch: commit 1e493d1946a0b26b79001c18d7312d536156ff5a Author: David S. Miller <davem> Date: Wed Sep 10 17:27:15 2008 -0700 ipv6: On interface down/unregister, purge icmp routes too. Johannes Berg reported that occaisionally, bringing an interface down or unregistering it would hang for up to 30 seconds. Using debugging output he provided it became clear that ICMP6 routes were the culprit. The problem is that ICMP6 routes live in their own world totally separate from normal ipv6 routes. So there are all kinds of special cases throughout the ipv6 code to handle this. While we should really try to unify all of this stuff somehow, for the time being let's fix this by purging the ICMP6 routes that match the device in question during rt6_ifdown(). Signed-off-by: David S. Miller <davem> I'll get a patch to you in a few days to test.