Bug 654189 - Host hung when installing win2003 with a private bridge
Host hung when installing win2003 with a private bridge
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.6
Unspecified Unspecified
low Severity medium
: rc
: ---
Assigned To: Herbert Xu
Red Hat Kernel QE team
: Triaged
Depends On:
Blocks: Rhel5KvmTier2
  Show dependency treegraph
 
Reported: 2010-11-17 00:04 EST by Amos Kong
Modified: 2015-05-24 20:06 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-10-22 14:19:26 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Debug vbr0 ref count leak (726 bytes, patch)
2011-01-18 02:00 EST, Herbert Xu
no flags Details | Diff
dmesg (16.53 KB, application/octet-stream)
2011-01-19 05:21 EST, Amos Kong
no flags Details
calltrace of hold/put vbr0 (121.69 KB, text/plain)
2011-01-28 03:30 EST, Amos Kong
no flags Details
hold/put vbr0 caused host fs read-only (156.81 KB, text/plain)
2011-01-28 03:36 EST, Amos Kong
no flags Details
reproduced reference-count issue (124.46 KB, application/octet-stream)
2011-01-28 04:33 EST, Amos Kong
no flags Details

  None (edit)
Description Amos Kong 2010-11-17 00:04:32 EST
Description of problem:
When I installed win2003 guest from pxe server, host hung.
note: "unregister_netdevice: waiting for vbr0 to become free. Usage count = 4"
Also, I could not execute any shell cmd, host could not be shutdown by 'Ctrl+Alt+Delete'.

Version-Release number of selected component (if applicable):
host kernel: 2.6.18-231.el5
kvm-83-207.el5

How reproducible:
not always

Steps to Reproduce:
1.setup private bridge

brname='vbr0'
/usr/sbin/brctl addbr $brname
echo 1 > /proc/sys/net/ipv6/conf/$brname/disable_ipv6
echo 1 > /proc/sys/net/ipv4/ip_forward
/usr/sbin/brctl stp $brname on
/usr/sbin/brctl setfd $brname 0
ifconfig $brname 192.168.58.1
ifconfig $brname up
iptables -t nat -A POSTROUTING -s 192.168.58.254/24 ! -d 192.168.58.254/24 -j MASQUERADE

2.setup dnsmasq pxe/tftp
# dnsmasq --strict-order --bind-interfaces --listen-address 192.168.58.1 --dhcp-range 192.168.58.2,192.168.58.254  --enable-tftp --tftp-root $AUTODIR/tests/kvm/images/tftpboot --dhcp-boot pxelinux.0 --dhcp-no-override

3.start to install guest from network
4.repeat step 1~3
 
Actual results:
host hung

Expected results:
Installation completed

Additional info:
1. detail of unattended_install testcase:
https://github.com/autotest/autotest/blob/master/client/tests/kvm/tests/unattended_install.py

2. dmesg info:
unregister_netdevice: waiting for vbr0 to become free. Usage count = 4
unregister_netdevice: waiting for vbr0 to become free. Usage count = 4
unregister_netdevice: waiting for vbr0 to become free. Usage count = 4
INFO: task irqbalance:3425 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
irqbalance    D ffff810001025e20     0  3425      1          3445  3269 (NOTLB)
 ffff810874f85d48 0000000000000086 ffff810874f85d48 0000000000000310
 000000000000000a 0000000000000007 ffff81087bb7e0c0 ffff81011dda5080
 00009d2b7e9a7a3d 000000000005a386 ffff81087bb7e2a8 0000000400001065
Call Trace:
 [<ffffffff80022242>] __up_read+0x19/0x7f
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff802349b4>] netdev_run_todo+0x14/0x222
 [<ffffffff8023448d>] dev_ioctl+0x339/0x465
 [<ffffffff80047652>] sprintf+0x51/0x59
 [<ffffffff8022a300>] sock_ioctl+0x1d4/0x1e5
 [<ffffffff80042268>] do_ioctl+0x21/0x6b
 [<ffffffff80030262>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800b95cb>] audit_syscall_entry+0x1a4/0x1cf
 [<ffffffff8004c737>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task snmpd:8421 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
snmpd         D ffff810001025e20     0  8421      1          8424  8388 (NOTLB)
 ffff810861c15b38 0000000000000086 0000000000000206 ffffffff8005578a
 ffff8108744e0400 000000000000000a ffff81085485b0c0 ffff81011dda5080
 00009d124ef2184e 000000000002bbed ffff81085485b2a8 00000004744e0400
Call Trace:
 [<ffffffff8005578a>] skb_queue_tail+0x17/0x3e
 [<ffffffff80063c4f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80063c99>] .text.lock.mutex+0xf/0x14
 [<ffffffff802349b4>] netdev_run_todo+0x14/0x222
 [<ffffffff8023ac3f>] rtnetlink_rcv+0x41/0x4e
 [<ffffffff8024959d>] netlink_data_ready+0x12/0x50
 [<ffffffff80248754>] netlink_sendskb+0x26/0x40
 [<ffffffff80249578>] netlink_sendmsg+0x2b4/0x2c7
 [<ffffffff80055229>] sock_sendmsg+0xf8/0x14a
 [<ffffffff8001a7c4>] vsnprintf+0x3f8/0x627
 [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
 [<ffffffff800a2896>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a9ab>] vsnprintf+0x5df/0x627
 [<ffffffff80047652>] sprintf+0x51/0x59
 [<ffffffff8022aeac>] sys_sendto+0x11c/0x14f
 [<ffffffff800fa171>] inotify_d_instantiate+0x12/0x3c
 [<ffffffff8003b1ba>] d_instantiate+0x4a/0x82
 [<ffffffff80042146>] d_rehash+0x21/0x34
 [<ffffffff8022aade>] sock_attach_fd+0x8f/0xfd
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

unregister_netdevice: waiting for vbr0 to become free. Usage count = 4
unregister_netdevice: waiting for vbr0 to become free. Usage count = 4

3. qemu-kvm cmdline:
# qemu-kvm -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20101116-040615-Ij08',server,nowait -serial unix:'/tmp/serial-20101116-040615-Ij08',server,nowait -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/images/win2003-32-virtio.qcow2',index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 -net nic,vlan=0,model=virtio,macaddr='9a:71:5c:4f:b6:50' -net tap,vlan=0,ifname='t0-040615-Ij08',script='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/scripts/qemu-ifup-vbr0',downscript='no' -m 2048 -smp 2,cores=1,threads=1,sockets=2 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/ISO/Win2003/32/en_win_srv_2003_r2_enterprise_with_sp2_cd1_X13-05460.iso',media=cdrom,index=1 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/windows/winutils.iso',media=cdrom,index=2 -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/isos/windows/virtio-win.iso',media=cdrom,index=3 -cpu qemu64,+sse2 -soundhw ac97 -fda '/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/images/win2003-32/answer.vfd' -redir tcp:5000::10023 -vnc :0 -spice port=8000,disable-ticketing -qxl 1 -rtc-td-hack -M rhel5.5.0 -usbdevice tablet -boot d

4. host cpuinfo:
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           E5310  @ 1.60GHz
stepping        : 11
cpu MHz         : 1595.926
cache size      : 4096 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3191.91
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

5. host mem info# cat /proc/meminfo 
MemTotal:     32940120 kB
MemFree:      32542400 kB
Buffers:         81428 kB
Cached:         201532 kB
SwapCached:          0 kB
Active:         166644 kB
Inactive:       150780 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     32940120 kB
LowFree:      32542400 kB
SwapTotal:    34996216 kB
SwapFree:     34996216 kB
Dirty:              28 kB
Writeback:          72 kB
AnonPages:       34348 kB
Mapped:          13796 kB
Slab:            25532 kB
PageTables:       3880 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  51466276 kB
Committed_AS:   176964 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    286272 kB
VmallocChunk: 34359451971 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
Comment 2 Herbert Xu 2011-01-18 02:00:08 EST
Created attachment 473979 [details]
Debug vbr0 ref count leak

Can you please try applying this patch on the host and then reproduce the problem? The kernel log should then point us in the direction of the culprit.  Thanks!
Comment 3 Amos Kong 2011-01-19 05:21:56 EST
Created attachment 474236 [details]
dmesg

I try to release private bridge when guest is using it.
attached the dmesg of host.
Comment 4 Herbert Xu 2011-01-19 07:52:32 EST
Thanks! Unfortunately it seems that the log isn't complete as I only see a single hold in there.

Can you please try grabbing the syslog copy of the kernel messages? If it still appears imcomplete, you may need to increase the kernel printk buffer size.
Comment 5 Amos Kong 2011-01-28 03:30:44 EST
Created attachment 475747 [details]
calltrace of hold/put vbr0


host hung about 60 seconds, but not outputed 
"unregister_netdevice: waiting for vbr0 to become free. Usage count = .."
Comment 6 Amos Kong 2011-01-28 03:36:35 EST
Created attachment 475748 [details]
hold/put vbr0 caused host fs read-only

When I hold/put vbr0, it caused host fs read-only.
host hung about 60 seconds, and not outputted "unregister_netdevice: waiting for vbr0 to become free. Usage count = .."

I'll try in other machines.
Comment 7 Amos Kong 2011-01-28 04:33:24 EST
Created attachment 475757 [details]
reproduced reference-count issue

I re-tested in another machine, I try to delete vbr0 when guest is using it, kernel outputted 
"""
Message from syslogd@ at Fri Jan 28 17:26:06 2011 ...
intel-8400-8-2 last message repeated 6 times

unregister_netdevice: waiting for vbr0 to become free. Usage count = 7
...
"""

I try to kill guest process by kill -9 $pid, but it becomes zombie.
# ps aux|grep qemu
root      3371  2.2  0.0      0     0 pts/1    Zl   17:20   0:10 [qemu-kvm] <defunct>

attached the detail dmesg.
Comment 8 Herbert Xu 2011-02-18 00:04:55 EST
Thanks a lot for the data! I tried analysing it and there may be multiple sources of leaks.  But the most obvious should be cured if we backport this patch:

commit 1e493d1946a0b26b79001c18d7312d536156ff5a
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Sep 10 17:27:15 2008 -0700

    ipv6: On interface down/unregister, purge icmp routes too.
    
    Johannes Berg reported that occaisionally, bringing an interface
    down or unregistering it would hang for up to 30 seconds.  Using
    debugging output he provided it became clear that ICMP6 routes
    were the culprit.
    
    The problem is that ICMP6 routes live in their own world totally
    separate from normal ipv6 routes.  So there are all kinds of special
    cases throughout the ipv6 code to handle this.
    
    While we should really try to unify all of this stuff somehow,
    for the time being let's fix this by purging the ICMP6 routes
    that match the device in question during rt6_ifdown().
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

I'll get a patch to you in a few days to test.

Note You need to log in before you can comment on or make changes to this bug.