Bug 1240187

Summary: virsh attach-interface / attach-device hangs VMs with high amount of memory for several seconds
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.6
Status: CLOSED WONTFIX
Severity: high
Priority: high
Reporter: Martin Tessun <mtessun>
Assignee: jason wang <jasowang>
QA Contact: xiywang
CC: ailan, chayang, cww, hhuang, jasowang, juzhang, mkenneth, mst, rbalakri, sfroemer, virt-bugs, virt-maint, weliao, xiywang
Flags: jasowang: needinfo? (mst)
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2017-05-31 21:07:22 UTC
Type: Bug
Bug Blocks: 1172231, 1269194, 1359965    

Description Martin Tessun 2015-07-06 07:23:39 UTC
Description of problem:
When attaching a device to a running VM with one of the following commands, the command may hang for several seconds if the VM has a large amount of memory:
virsh attach-interface <interface definition>
virsh attach-device <xml-file>
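
For reference, attach-device takes a small interface XML file; a minimal sketch (the bridge, model and MAC mirror the attach-interface call used below, and the file name is arbitrary):

# cat > vnic.xml <<'EOF'
<interface type='bridge'>
  <source bridge='virbr0'/>
  <model type='virtio'/>
  <mac address='00:16:3e:00:08:24'/>
</interface>
EOF
# virsh attach-device <VM-NAME> vnic.xml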

Version-Release number of selected component (if applicable):
Most probably all versions, reproduced with the following versions:
qemu-kvm-0.12.1.2-2.445.el6.x86_64
libvirt-0.10.2-46.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Define a VM with 120 GB+ main memory (125829120 KiB = 120 GiB):
     <memory unit='KiB'>125829120</memory>
     <currentMemory unit='KiB'>125829120</currentMemory>
     <vcpu placement='static'>4</vcpu>

2. Start the VM
   # virsh start <VM-NAME>
3. Attach an interface to the domain:
   [root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
   Interface attached successfully
   
   
   real	0m4.332s
   user	0m0.041s
   sys	0m0.012s
   [root@ibm-hx5-7872-01 ~]#
4. During the operation also run a simple while-loop in the VM:
   [root@rhel6 ~]# while true; do date; sleep 1; done
   Mon Jul  6 09:05:03 CEST 2015
   Mon Jul  6 09:05:04 CEST 2015
   Mon Jul  6 09:05:05 CEST 2015
   Mon Jul  6 09:05:06 CEST 2015
   Mon Jul  6 09:05:11 CEST 2015 <=== 5 seconds gap!!
   Mon Jul  6 09:05:12 CEST 2015
   Mon Jul  6 09:05:13 CEST 2015
   ^C
   [root@rhel6 ~]# 

Actual results:
The VM "hangs" for about 4-5 seconds.

Expected results:
The VM shouldn't hang, and the attach operation should take about as long as with less memory (e.g. 4 GB):

[root@ibm-hx5-7872-01 ~]# virsh dumpxml rhel6
<domain type='kvm' id='2'>
  <name>rhel6</name>
  <uuid>b72d1061-517b-292a-b989-07d66e5151b6</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>4</vcpu>
[...]

[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m0.405s
user	0m0.034s
sys	0m0.027s
[root@ibm-hx5-7872-01 ~]# 

[root@rhel6 ~]# while true; do date; sleep 1; done
Mon Jul  6 09:19:43 CEST 2015
Mon Jul  6 09:19:44 CEST 2015
Mon Jul  6 09:19:45 CEST 2015
Mon Jul  6 09:19:46 CEST 2015
Mon Jul  6 09:19:47 CEST 2015
Mon Jul  6 09:19:48 CEST 2015
Mon Jul  6 09:19:49 CEST 2015
Mon Jul  6 09:19:50 CEST 2015
Mon Jul  6 09:19:51 CEST 2015
Mon Jul  6 09:19:52 CEST 2015
Mon Jul  6 09:19:53 CEST 2015
Mon Jul  6 09:19:54 CEST 2015
Mon Jul  6 09:19:55 CEST 2015
^C
[root@rhel6 ~]# 


Additional info:
An strace of the qemu process suggests that the main reason for this is the futex call:

10269 13:20:02.221543 write(23, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000022>
10269 13:20:02.221624 ioctl(53, 0x400454d4, 0x7fff9c19681c) = 0 <0.000023>
10269 13:20:02.221710 ioctl(54, 0xaf01 <unfinished ...>
10281 13:20:02.303602 <... ioctl resumed> , 0) = 0 <0.241958>
10281 13:20:02.303689 futex(0x7f86a5acb7a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
10269 13:20:02.487004 <... ioctl resumed> , 0) = 0 <0.265269>
10269 13:20:02.487166 ioctl(54, 0x8008af00, 0x7fff9c1967e8) = 0 <0.001335>

[2.5 Seconds nothing happens]

10269 13:20:04.987971 ioctl(53, 0x800454d7, 0x7fff9c1967ec) = 0 <0.002352>
10269 13:20:04.990440 ioctl(53, 0x400454d8, 0x7fff9c1967dc) = 0 <0.000023>
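
(For reference, these request numbers appear to decode, with the standard Linux ioctl encoding, to the tap/vhost setup path: 0x400454d4 = TUNSETSNDBUF, 0xaf01 = VHOST_SET_OWNER, 0x8008af00 = VHOST_GET_FEATURES, and 0x800454d7/0x400454d8 = TUNGETVNETHDRSZ/TUNSETVNETHDRSZ; i.e. fd 53 looks like the tap fd and fd 54 the vhost-net fd.)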

That is also reflected in the durations of the system calls (times in seconds):
system-call, total, count, mean, min, max, unfinished
open, 0.000061, 2, 0.000030, 0.000029, 0.000032, 0
select, 0.541804, 13896, 0.000039, 0.000000, 0.001208, 0
brk, 0.000069, 3, 0.000023, 0.000022, 0.000024, 0
futex, 24.476751, 15157, 0.001615, 0.000000, 9.285405, 2
fdatasync, 0.000838, 15, 0.000056, 0.000000, 0.000080, 0
[...]

Comment 2 Martin Tessun 2015-07-09 13:57:36 UTC
Some additional tests on the influence of main memory on this operation.
All sizes mentioned below are main memory sizes.

96GB:
[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m3.707s
user	0m0.033s
sys	0m0.028s
[root@ibm-hx5-7872-01 ~]# 


64GB:
[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m2.690s
user	0m0.037s
sys	0m0.017s

32GB:
[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m1.448s
user	0m0.031s
sys	0m0.031s
[root@ibm-hx5-7872-01 ~]# 

16GB:
[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m1.419s
user	0m0.026s
sys	0m0.025s
[root@ibm-hx5-7872-01 ~]# 

8GB:
[root@ibm-hx5-7872-01 ~]# time virsh attach-interface rhel6 --type bridge --model virtio --source virbr0 --mac 00:16:3e:00:08:24
Interface attached successfully


real	0m0.759s
user	0m0.037s
sys	0m0.022s
[root@ibm-hx5-7872-01 ~]# 
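
Summarised (times taken from the runs above plus the 120 GB and 4 GB runs in the description):

memory    real time of attach-interface
120 GB    4.33 s
 96 GB    3.71 s
 64 GB    2.69 s
 32 GB    1.45 s
 16 GB    1.42 s
  8 GB    0.76 s
  4 GB    0.41 s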

So it does look dependent on the amount of main memory, although there is no difference between 16 and 32 GB; maybe that was just coincidence.

Comment 6 juzhang 2016-01-08 11:02:58 UTC
Hi Chayang,

Could you take care of this issue?

Best Regards,
Junyi

Comment 8 weliao 2016-01-12 03:00:48 UTC
Hi Jason, QE can reproduce this issue.
One finding: adding a NIC without vhost does not trigger the issue; adding a NIC with vhost=on does.

Host version:
2.6.32-599.el6.x86_64
qemu-kvm-0.12.1.2-2.483.el6.x86_64
Guest: 2.6.32-595.el6.x86_64

Steps:
1. Launch a guest with a large amount of memory.
2. Execute a while loop in the guest:
while true; do date; sleep 1; done
3. Add a NIC with vhost=on (a vhost-off variant for comparison is sketched after the QMP commands below):
{ 'execute' : 'qmp_capabilities' }
{"return": {}}
{ "execute": "netdev_add", "arguments": { "type":"tap","id":"hostnet1","vhost":"on"}}
{"return": {}}

qemu command:
/usr/libexec/qemu-kvm -name aaa  \
-M rhel6.6.0 -enable-kvm \
-m 163840  -realtime mlock=off \
-smp 2,sockets=2,cores=1,threads=1 \
-uuid 8f751f99-1dc9-c004-2b1a-595876325a48 \
-nodefconfig -nodefaults  \
-rtc base=utc -no-shutdown \
-device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 \
-device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 \
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 \
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 \
-drive file=/mnt/RHEL-Server-6.7-64-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b7:eb:79,bus=pci.0,addr=0x3 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-device usb-tablet,id=input0 -vnc :0 \
-vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 \
-device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
-msg timestamp=on \
-monitor stdio \
-qmp tcp:0:5555,nowait,server

scenarios:
1.120G
-m 122880 ; hang about 5s
Tue Jan 12 10:53:11 CST 2016
Tue Jan 12 10:53:16 CST 2016

2.64G
-m 65536 hang about 2-3s
Tue Jan 12 10:49:41 CST 2016
Tue Jan 12 10:49:44 CST 2016

3.32G
-m 65536 hang about 2s
Tue Jan 12 10:55:41 CST 2016
Tue Jan 12 10:55:43 CST 2016

4.16G hang about 1s

Comment 9 jason wang 2016-02-23 01:44:03 UTC
(In reply to weliao from comment #8)
> Hi jason, QE can reproduce this issue.
> and I have a finding: if add NIC without vhost, no this issue, just add NIC
> with vhost=on, hit this issue.
> 
> Host version:
> 2.6.32-599.el6.x86_64
> qemu-kvm-0.12.1.2-2.483.el6.x86_64
> Guest: 2.6.32-595.el6.x86_64
> 
> step:
> 1.launch guest with big Memory
> 2.execute while loop in guest:
> while true; do date; sleep 1; done
> 3.add NIC with vhost=on
> { 'execute' : 'qmp_capabilities' }
> {"return": {}}
> { "execute": "netdev_add", "arguments": {
> "type":"tap","id":"hostnet1","vhost":"on"}}
> {"return": {}}
> 
> qemu command:
> /usr/libexec/qemu-kvm -name aaa  \
> -M rhel6.6.0 -enable-kvm \
> -m 163840  -realtime mlock=off \
> -smp 2,sockets=2,cores=1,threads=1 \
> -uuid 8f751f99-1dc9-c004-2b1a-595876325a48 \
> -nodefconfig -nodefaults  \
> -rtc base=utc -no-shutdown \
> -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 \
> -device
> ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,
> addr=0x5 \
> -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 \
> -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 \
> -drive
> file=/mnt/RHEL-Server-6.7-64-virtio.qcow2,if=none,id=drive-virtio-disk0,
> format=qcow2,cache=none \
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,
> id=virtio-disk0,bootindex=1 \
> -netdev tap,id=hostnet0,vhost=on \
> -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b7:eb:79,bus=pci.0,
> addr=0x3 \
> -chardev pty,id=charserial0 \
> -device isa-serial,chardev=charserial0,id=serial0 \
> -device usb-tablet,id=input0 -vnc :0 \
> -vga cirrus -device intel-hda,id=sound0,bus=pci.0,addr=0x4 \
> -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
> -msg timestamp=on \
> -monitor stdio \
> -qmp tcp:0:5555,nowait,server
> 
> scenarios:
> 1.120G
> -m 122880 ; hang about 5s
> Tue Jan 12 10:53:11 CST 2016
> Tue Jan 12 10:53:16 CST 2016
> 
> 2.64G
> -m 65536 hang about 2-3s
> Tue Jan 12 10:49:41 CST 2016
> Tue Jan 12 10:49:44 CST 2016
> 
> 3.32G
> -m 65536 hang about 2s
> Tue Jan 12 10:55:41 CST 2016
> Tue Jan 12 10:55:43 CST 2016
> 
> 4.16G hang about 1s

Can you reproduce this on RHEL7? And how about retrying without cirrus?

Comment 10 weliao 2016-03-01 07:18:53 UTC
Hi Jason,
  On RHEL7, with the same test command and the same test steps, I can't reproduce this bug.
Host:
  3.10.0-351.el7.x86_64
  qemu-kvm-1.5.3-105.el7_2.3.x86_64
Guest:
  neither the rhel7.3 nor the rhel6.8 guest hits this issue.

 Retested without cirrus (-spice port=5900,disable-ticketing -vga qxl) on a RHEL6.8 host; this also hit the issue.
Host:
  2.6.32-621.el6.x86_64
  qemu-kvm-0.12.1.2-2.490.el6.x86_64
Guest:
  both the rhel6.8 and rhel7.3 guests hit this issue.

Comment 11 jason wang 2016-03-01 09:41:34 UTC
(In reply to weliao from comment #10)
> hi, jason
>   on RHEL7,the same test command and the same test steps I can't reproduce
> this bug.
> Host:
>   3.10.0-351.el7.x86_64
>   qemu-kvm-1.5.3-105.el7_2.3.x86_64
> Guest:
>   rhel7.3 and rhel6.8 guest all no hit this issue.

Cool, can you do a reverse bisection to see which commit fixes the issue? If you wish, I can tell you the steps.

> 
>  retested without cirrues(-spice port=5900,disable-ticketing -vga qxl) on
> RHEL6.8 Host, also had this issue.
> Host:
>   2.6.32-621.el6.x86_64
>   qemu-kvm-0.12.1.2-2.490.el6.x86_64
> Guest:
>   rhel6.8 and rhel7.3 guest all hit this issue.

Comment 32 Steffen Froemer 2017-04-24 10:10:28 UTC
Hi Jason, can I get a current status on this?

Comment 36 jason wang 2017-04-26 03:43:50 UTC
Michael:

Any commit in mind that could be back-ported to fix this issue? I remember we fixed similar issues in the past.

Thanks

Comment 38 Chris Williams 2017-05-31 21:07:22 UTC
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification.  Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com