Bug 525323 - QEMU terminates without warning with virtio-net and SMP enabled
Summary: QEMU terminates without warning with virtio-net and SMP enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Michael S. Tsirkin
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 548013 561022 595301
 
Reported: 2009-09-23 22:13 UTC by Dor Laor
Modified: 2014-06-16 14:57 UTC (History)
15 users

Fixed In Version: kvm-83-154.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 595301 (view as bug list)
Environment:
Last Closed: 2010-03-30 07:56:46 UTC


Attachments (Terms of Use)
Adding IxChariot win2k8 R2 workaround (1.88 MB, application/pdf)
2009-09-24 08:05 UTC, Dor Laor
drivers\net\tun.c from Cisco's kernel 2.6.23 (22.42 KB, text/x-csrc)
2009-11-23 17:32 UTC, James Ko
applied net-tun-add-iff_vnet_hdr-tungetfeatures-tungetiff.patch (10.04 KB, application/octet-stream)
2009-11-24 18:18 UTC, James Ko
after applying this patch, and rebuilding, qemu will crash (546 bytes, patch)
2010-02-07 14:30 UTC, Michael S. Tsirkin


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0271 normal SHIPPED_LIVE Important: kvm security, bug fix and enhancement update 2010-03-29 13:19:48 UTC

Description Dor Laor 2009-09-23 22:13:12 UTC
Description of problem:
-----------------
IxChariot sends and receives TCP packets to measure network throughput
between endpoints. In our test one endpoint was Windows 2008 R2 running
in a virtual machine and the other endpoint was a Linux machine.
 
The traffic consists of TCP packets. It would be best if
Red Hat could install IxChariot in your labs. It's very easy to reproduce.
 
The endpoint libraries can be downloaded from here
 
http://www.ixiacom.com/support/endpoint_library/
 
and windows 2008 R2 can be downloaded here
 
http://www.microsoft.com/windowsserver2008/en/us/default.aspx
 
You will need an IxChariot license for the GUI, which starts, stops, and collects test results:
 
http://www.ixchariot.com/products/datasheets/ixchariot.html
 
Be sure to run the virtual machine in SMP mode (two cores). The crash does not happen with just one core or if the emulated network device is e1000. 

Version-Release number of selected component (if applicable):
pre-5.4 snapshot 105
virtio network driver from virtio-win-1.0.0-3.31351.el5.noarch.rpm 

Actual results:
No error codes, no core file, nothing on stderr/stdout.
The crash is almost instantaneous once we start the test.

Expected results:


Additional info:

Comment 1 Dor Laor 2009-09-24 08:05:43 UTC
Created attachment 362444 [details]
Adding IxChariot win2k8 R2 workaround

Comment 4 Dor Laor 2009-09-25 09:42:35 UTC
From the customer: "We use the high-performance script included as part of the ixChariot
software

#1: launch ixChariot
#2: define endpoints (IP addresses). 
#3: click on select script and choose high-performance-throughput.scr
that comes bundled with ixChariot.

Also when running traffic tests with w2k8-r2 as an endpoint, you need to
disable the windows firewall.
One way is to open the cmd window and type

'netsh firewall set opmode disable'
"

Comment 6 Dor Laor 2009-09-30 08:15:43 UTC
From the customer: "I may been able to reproduce this with iperf v 2.0.4
server is a virtual machine running windows server 2008 R2 with 2 CPU in SMP mode
iperf -s -w 256k

client a real machine running windows server 2003
iperf -c <ip> -w 256k -n 100000M -M 65535 "

Comment 7 lihuang 2009-10-18 09:38:47 UTC
(In reply to comment #6)
> From the customer: "I may been able to reproduce this with iperf v 2.0.4
> server is a virtual machine running windows server 2008 R2 with 2 CPU in SMP
> mode
> iperf -s -w 256k
> 
> client a real machine running windows server 2003
> iperf -c <ip> -w 256k -n 100000M -M 65535 "  

Hi Dor
I tested with iperf 1.7.0 several times but cannot reproduce the issue.
My questions are:
 1. Is iperf _v2.0.4_ a must?
 2. Is there any configuration of the v-NIC or bridge I've missed to reproduce the issue? (I just used the default settings.)

Thanks
Lijun Huang

Comment 9 Dor Laor 2009-11-10 10:09:07 UTC
Moving under virtio-win.

Adding info received by email through James:
"I was able to get the driver to run.
The Driver details in adapter properties is showing
Driver Date 10/25/2009
Driver Version 6.0.209.427

But it does not solve the issue.  I made other changes to the virtio.c
file so it retries getting the virtqueue_num_heads and it is still showing
an index difference < 256 on the second call to vring_avail_idx(vq)
after an initial failure.
Guest moved used index from 62670 to 62736, max 256
Guest moved used index from 6107 to 6168, max 256
Guest moved used index from 35495 to 35593, max 256
Guest moved used index from 57570 to 57615, max 256
Guest moved used index from 56259 to 56325, max 256
Guest moved used index from 51156 to 51220, max 256
Guest moved used index from 15062 to 15126, max 256
Guest moved used index from 54207 to 54272, max 256

Looks like a race condition in updating these values from different threads.

James

FYI... my patch is as follows and this workaround has successfully prevented the unwanted exit so far.

diff -urp a/qemu/hw/virtio.c b/qemu/hw/virtio.c
--- a/qemu/hw/virtio.c  2009-10-23 15:42:18.000000000 -0700
+++ b/qemu/hw/virtio.c  2009-10-23 14:08:01.000000000 -0700
@@ -328,12 +328,15 @@ void virtqueue_push(VirtQueue *vq, const
static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
{
    uint16_t num_heads = vring_avail_idx(vq) - idx;
+    int retry = 3;

    /* Check it isn't doing very strange things with descriptor numbers. */
-    if (num_heads > vq->vring.num) {
-        fprintf(stderr, "Guest moved used index from %u to %u",
-                idx, vring_avail_idx(vq));
-        exit(1);
+    while (num_heads > vq->vring.num) {
+        fprintf(stderr, "Guest moved used index from %u to %u, max %u\n",
+                idx, vring_avail_idx(vq), vq->vring.num);
+        if (retry-- == 0)
+            exit(1);
+        num_heads = vring_avail_idx(vq) - idx;
    }

    return num_heads;

"

Michael/Yan, any ideas?

Comment 10 Dor Laor 2009-11-18 12:31:56 UTC
Did you manage to reproduce it over RHEL?

Comment 11 James Ko 2009-11-18 17:19:46 UTC
Still working on trying to get RHEL reproduction.

Network performance under RHEL is rather poor, however.  The traffic curve is very jagged and we have not been able to get the traffic rate over 600Mbps.

One other issue with the test on RHEL is that ixChariot endpoint on Windows is reporting an error in the high precision timer which causes the test to abort.
This is the case for when the endpoint is primarily the sender rather than receiver.  This is happening soon after the traffic starts, seconds in some cases.

Comment 12 Dor Laor 2009-11-19 12:46:37 UTC
(In reply to comment #11)
> Still working on trying to get RHEL reproduction.
> 
> Network performance under RHEL is rather poor however.  The traffic curve is
> very jagged and we have not been unable to get the traffic rate up over
> 600Mbps.

Added Mark Wagner to help.

How do you test traffic?
Is it bidirectional/rx/tx? UDP or TCP? What packet sizes? What's the load on the host?

In addition, there are registry settings on the VM that increase performance: http://www.linux-kvm.org/page/WindowsGuestDrivers/kvmnet/registry

For large packet sizes we can go up to several Gb. 

> 
> One other issue with the test on RHEL is that ixChariot endpoint on Windows is
> reporting an error in the high precision timer which causes the test to abort.
> This is the case for when the endpoint is primarily the sender rather than
> receiver.  This is happening soon after the traffic starts, seconds in some
> cases.  

Can you try using pmtimer in the guest? It is recommended anyway:
http://support.microsoft.com/kb/833721

Comment 13 James Ko 2009-11-19 19:35:20 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > Still working on trying to get RHEL reproduction.
> > 
> > Network performance under RHEL is rather poor however.  The traffic curve is
> > very jagged and we have not been unable to get the traffic rate up over
> > 600Mbps.
> 
> Added Mark Wagner to help.
> 
> How do you test traffic? :
> Is it bi-direction/rx/tx? UDP/TCP? What packet sizes? What's the load on the
> host?
> 
We are primarily using ixChariot to test but also iperf.
Traffic is mostly uni-directional and the failure appears more often
when the guest is receiving a high traffic rate.
See Comment #6 for more on iperf settings.
The host load average is showing 1.98 0.91 0.67 after a couple minutes
of running the test.
The qemu process is consuming 99.9% of the CPUs assigned to it:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7679 admin     20   0  571m 524m 1304 S 99.9 13.7  23:35.00 qemu

If the test is not running the host load average is :  0.28, 0.92, 0.76
And qemu consumes only about 5% to 6% of the CPU.

> In addition, there are registry settings on the VM that increases performance:
> http://www.linux-kvm.org/page/WindowsGuestDrivers/kvmnet/registry
> 
> For large packet sizes we can go up to several Gb. 
> 
> > 
> > One other issue with the test on RHEL is that ixChariot endpoint on Windows is
> > reporting an error in the high precision timer which causes the test to abort.
> > This is the case for when the endpoint is primarily the sender rather than
> > receiver.  This is happening soon after the traffic starts, seconds in some
> > cases.  
> 
> Can you try using pmtimer in the guest. It is recommended anyway:
> http://support.microsoft.com/kb/833721  

From what I could find online, the problem pmtimer fixes is not present in Server 2008.

Comment 14 Dor Laor 2009-11-19 22:22:05 UTC
What's your qemu cmdline? Do you add -no-hpet (you should)?

Regarding the network performance, iperf and the Windows/Linux stack are sensitive to socket buffer sizes; you can play with them a bit to see if you get better numbers. 
Also don't forget to configure the win registry settings.
We expect much better performance with 64k packets.
Yan, any idea?

Comment 15 Yan Vugenfirer 2009-11-22 12:54:08 UTC
You should see performance boost with packets of 4K and up.

There are several important things to configure in registry:
TCP window scaling:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"Tcp1323Opts"=dword:00000003

And TCP window size:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpWindowSize"=dword:00100000


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters]
"DefaultReceiveWindow"=dword:00100000
"DefaultSendWindow"=dword:00100000


With UDP it is important to configure the fast copy threshold option (otherwise you will see a performance drop with messages bigger than 1K):

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters]
"FastSendDatagramThreshold"=dword:00004000 


Also check the following wiki page:
http://www.linux-kvm.org/page/WindowsGuestDrivers/kvmnet/registry

Comment 16 Yaniv Kaul 2009-11-22 14:12:39 UTC
According to http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv-R2.mspx:
If your hardware supports TOE, then you must enable that option in the operating system to benefit from the hardware’s capability. You can enable TOE by running the following command:
netsh int tcp set global chimney = enabled

Comment 17 Dor Laor 2009-11-22 15:14:13 UTC
Chimney won't be supported ATM since it needs multiple virtio receive queues.

Comment 19 James Ko 2009-11-23 04:03:49 UTC
I don't really need to boost the traffic rate any higher.

I am able to reproduce the failure with our host with just 100 to 200 Mbps throughput and 32k transaction sizes.

The ixChariot High Performance test script is configured with 32k send and receive buffer sizes (bytes of data in each SEND/RECEIVE).

Still unable to reproduce it with RHEL as host however.

Can you point me to the relevant kernel specific virtio-net files or tun/tap files which the kvm module and qemu depends on?

Comment 20 Michael S. Tsirkin 2009-11-23 10:00:30 UTC
For tun, look for uses of macro TUN_VNET_HDR
in drivers/net/tun.c

Relevant routines 
tun_get_iff
tun_set_iff
tun_get_user
tun_put_user

Comment 21 James Ko 2009-11-23 17:32:02 UTC
Created attachment 373182 [details]
drivers\net\tun.c from Cisco's kernel 2.6.23

Comment 22 James Ko 2009-11-23 17:45:39 UTC
Our version of the tun driver does not have the TUN_VNET_HDR macro.
I've included our version of the file as an attachment.

Looks like I need this patch from your kernel sources...
linux-2.6-net-tun-add-iff_vnet_hdr-tungetfeatures-tungetiff.patch

Looks like there are a few conflicts so it's not a simple patch application.

Any other patches I should be looking at?

Comment 23 Mark Wagner 2009-11-23 18:08:32 UTC
Have you tried to disable the HW acceleration on the host NIC ?
ethtool -K ethX rx off tx off

These will most likely hurt throughput but will help narrow down where the issue is.

Comment 24 James Ko 2009-11-24 01:35:28 UTC
I've applied the patch above as best I could with a minor change to csum_start calculation since our sk_buff does not have h and nh struct members.

The patch did not solve the problem.

I also disabled the HW acceleration as instructed but the failures persist.
HW acceleration setting did not appear to make any difference to throughput.

Comment 25 Dor Laor 2009-11-24 11:08:33 UTC
We can try running without vnet_hdr on rhel and check if we can reproduce the bug.

Comment 26 Michael S. Tsirkin 2009-11-24 16:51:48 UTC
James, could you attach the actual
patch that you applied?

Comment 27 James Ko 2009-11-24 18:18:23 UTC
Created attachment 373504 [details]
applied net-tun-add-iff_vnet_hdr-tungetfeatures-tungetiff.patch

This is the patch applied to Cisco's kernel 2.6.23 for testing possible fix.

Comment 28 James Ko 2009-11-24 18:19:40 UTC
(In reply to comment #25)
> We can try running without vnet_hdr on rhel and check if we can reproduce the
> bug.  

Please provide instructions on how to remove this for testing.

Comment 29 Michael S. Tsirkin 2009-11-24 19:55:10 UTC
in recent qemu tap has an option vnet_hdr.
you just set it to false.

Comment 30 James Ko 2009-11-25 09:19:02 UTC
(In reply to comment #29)
> in recent qemu tap has an option vnet_hdr.
> you just set it to false.  

I added ,vnet_hdr=false to the -net tap,... option list but I did not notice any performance difference nor did it help in reproducing the bug.

Comment 31 Michael S. Tsirkin 2009-11-25 09:25:52 UTC
Can you reproduce the bug on kernel v2.6.23.17 from kernel.org?

Comment 32 Michael S. Tsirkin 2009-11-25 09:31:26 UTC
(In reply to comment #27)
> Created an attachment (id=373504) [details]
> applied net-tun-add-iff_vnet_hdr-tungetfeatures-tungetiff.patch
> 
> This is the patch applied to Cisco's kernel 2.6.23 for testing possible fix.  

BTW, do you see the "GSO" debug print when you use this?

Comment 33 James Ko 2009-11-25 10:23:58 UTC
Not sure I'll be able to test on that version from kernel.org.  How would I go about doing that on our RHEL 5.4 installation?

I do not see the GSO debug print messages.  Just to be sure, I changed it to pr_info in case DEBUG wasn't enabled in the kernel.

Comment 34 Michael S. Tsirkin 2009-11-25 10:31:32 UTC
to test on kernel from kernel.org,
git clone it, make oldconfig and
make install. sometimes you need to
create initrd yourself.

The fact you do not get GSO explains
why you do not get performance changes
from disabling vnet header.

Comment 35 Michael S. Tsirkin 2009-11-25 11:36:23 UTC
clarification: when I say make oldconfig,
I really mean you should take the .config from
your 2.6.23 kernel.
Hope this makes sense.

Comment 36 Dor Laor 2009-11-30 16:19:54 UTC
(In reply to comment #29)
> in recent qemu tap has an option vnet_hdr.
> you just set it to false.  

Michael can you try reproduce it with vnet_hdr=false on rhel5.4?
Also, send a similar patch to Cisco so they can try it out too.

Comment 37 James Ko 2009-11-30 18:45:03 UTC
Would a driver with specific debugs covering the code segment of interest help with the testing? Enabling the currently available debugs impacts the performance and timing such that the problem is no longer reproducible.

I will try to run with 2.6.23.17 kernel.

Comment 38 James Ko 2009-12-01 00:47:38 UTC
I compiled and booted our RHEL test box with kernel 2.6.23.17 but kvm-qemu won't start with the following errors.

TUNGETIFF ioctl() failed
TUNSETSNDBUF ioctl() failed
Hypervisor too old: KVM_CAP_USER_MEMORY extension not supported

Looks like I need to apply some patches for these items. I can find patches for the first two but not for the last one.

Comment 39 James Ko 2009-12-01 09:58:34 UTC
(In reply to comment #37)
> Would a driver with specific debugs covering the code segment of interest help
> with the testing since enabling the currently available debugs impacts the
> performance and timing such that the problem is no longer reproducible.
> 
> I will try to run with 2.6.23.17 kernel.  

I do think a debug version of the driver may help us in narrowing down the problem.  From what I can tell, there is a race condition which allows the virt queue window to overflow.

Comment 40 Michael S. Tsirkin 2009-12-01 12:35:46 UTC
(In reply to comment #38)
> I compiled and booted our RHEL test box with kernel 2.6.23.17 but kvm-qemu
> won't start with the following errors.
> 
> TUNGETIFF ioctl() failed
> TUNSETSNDBUF ioctl() failed
> Hypervisor too old: KVM_CAP_USER_MEMORY extension not supported
> 
> Looks like I need to apply some patches for these items. I can find patches for
> the first two but not for the last one.  

How does it work on Cisco's kernel  2.6.23?
Does it have backported kvm or are you using kvm_kmod?

Comment 41 James Ko 2009-12-01 20:41:32 UTC
(In reply to comment #40)
> 
> How does it work on Cisco's kernel  2.6.23?
> Does it have backported kvm or are you using kvm_kmod?  

We are using kvm_kmod in our system.

debugshell# modinfo /sw/kvm/lib/modules/kvm-intel.ko
filename:       /sw/kvm/lib/modules/kvm-intel.ko
version:
author:         Qumranet
license:        GPL
vermagic:       2.6.23waas64. SMP mod_unload
depends:        kvm,kvm
srcversion:     3A2787BAF8EFED8D06C2554
parm:           emulate_invalid_guest_state:bool
parm:           enable_ept:bool
parm:           flexpriority_enabled:bool
parm:           enable_vpid:bool
parm:           bypass_guest_pf:bool
debugshell# modinfo /sw/kvm/lib/modules/kvm.ko
filename:       /sw/kvm/lib/modules/kvm.ko
version:
author:         Qumranet
license:        GPL
vermagic:       2.6.23waas64. SMP mod_unload
depends:
srcversion:     1959A152FDDD3E22E0C87D9
parm:           oos_shadow:bool
parm:           force_kvmclock:bool

Comment 42 James Ko 2009-12-02 00:41:18 UTC
I have successfully compiled the kvm_kmod version and loaded it instead of the kernel default version.

Test traffic so far is running without failure.
I'm getting about a 50% received traffic rate.

Comment 43 Michael S. Tsirkin 2009-12-02 16:49:05 UTC
I would like to understand whether the issues
are attributable to using SMP on host.
Could you please verify whether
the issue happens if you limit qemu
to run on a single cpu on host?
I think you can do this using taskset command.

Comment 44 James Ko 2009-12-02 17:41:43 UTC
The failure is not reproducible when -smp 1 is used.
The failure is reproducible when -smp 2 is used.
We are not using more than 2 for -smp configuration.

Our qemu tasks are configured to run on real CPUs 2 & 3.
All other tasks on the system are limited to real CPUs 0 & 1.

If starting with -smp 2 and then changing all tasks' affinity to
use only real CPU 2, the throughput drops to about 25% and the
problem is not reproducible.

debugshell# cat /var/run/vb1.pid
7666

debugshell# ps -L -p 7666 -o tid
  TID
 7666
 7699
 7700
 7701
 7703
debugshell# taskset -pc 2 7666
bash: taskset: command not found
debugshell# /sw/kvm/bin/taskset -pc 2 7666
pid 7666's current affinity list: 2,3
pid 7666's new affinity list: 2
debugshell# /sw/kvm/bin/taskset -pc 2 7699
pid 7699's current affinity list: 2,3
pid 7699's new affinity list: 2
debugshell# /sw/kvm/bin/taskset -pc 2 7700
pid 7700's current affinity list: 2,3
pid 7700's new affinity list: 2
debugshell# /sw/kvm/bin/taskset -pc 2 7701
pid 7701's current affinity list: 2,3
pid 7701's new affinity list: 2
debugshell# /sw/kvm/bin/taskset -pc 2 7703
pid 7703's current affinity list: 2,3
pid 7703's new affinity list: 2

Our stdout and stderr are redirected to /var/run/vb1.qemu
debugshell# tail -f /var/run/vb1.qemu
char device redirected to /dev/pts/0
rom checksum: c9f19e31


After changing the CPU affinity back to use 2 & 3, the throughput
is around 70% to 80% and the problem reoccurs.

debugshell# /sw/kvm/bin/taskset -pc 2,3 7666
pid 7666's current affinity list: 2
pid 7666's new affinity list: 2,3
debugshell# /sw/kvm/bin/taskset -pc 2,3 7699
pid 7699's current affinity list: 2
pid 7699's new affinity list: 2,3
debugshell# /sw/kvm/bin/taskset -pc 2,3 7700
pid 7700's current affinity list: 2
pid 7700's new affinity list: 2,3
debugshell# /sw/kvm/bin/taskset -pc 2,3 7701
pid 7701's current affinity list: 2
pid 7701's new affinity list: 2,3
debugshell# /sw/kvm/bin/taskset -pc 2,3 7703
pid 7703's current affinity list: 2
pid 7703's new affinity list: 2,3
debugshell# tail -f /var/run/vb1.qemu
char device redirected to /dev/pts/0
rom checksum: c9f19e31
Guest moved used index from 56535 to 56600, max 256
Guest moved used index from 29902 to 29962, max 256
Guest moved used index from 48348 to 48413, max 256

Comment 45 Michael S. Tsirkin 2009-12-02 18:01:14 UTC
Sorry I was not clear.
Please try launching qemu with taskset
so that *all* qemu threads run *on the same real CPU*,
while still using -smp 2 with qemu.

Does the problem re-occur?

Comment 46 James Ko 2009-12-02 18:08:50 UTC
How is running with taskset at startup any different from running taskset on each of the tasks after start?

I have changed all threads to run on the same real CPU.

Comment 47 Michael S. Tsirkin 2009-12-02 18:14:49 UTC
Hmm, it won't be different unless guest manages to do something
with virtio meanwhile. Might this be the case?

Also, I assumed that since you say "-pc 2,3" qemu will use
2 host processors?

Comment 48 Michael S. Tsirkin 2009-12-02 18:18:12 UTC
Um, I did not read Comment 44 correctly.
So I think we can confirm that the problem does
not happen when run on a single host CPU.

Unfortunately this is no guarantee that
the problem is SMP related as we know
the problem only presents itself when
throughput is high, and, when run on a single
CPU, throughput is low.

Comment 49 James Ko 2009-12-02 23:24:30 UTC
I don't think the throughput has much to do with it, since I have also limited it in ixChariot to 155Mbps (~16%) and still reproduced the problem.

Comment 50 James Ko 2009-12-10 07:59:23 UTC
We're still at a loss as to why it's only happening on our test setup currently.

FYI... these are our config options.

Error: libpci check failed
Disable KVM Device Assignment capability.

Install prefix    /sw/kvm
BIOS directory    /sw/kvm/share/qemu
binary directory  /sw/kvm/bin
Manual directory  /sw/kvm/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path       /work/koj/ros_vb_enh/x86_64-derived/src/kvm/qemu
C compiler        /adbu-waas-tools/nptl/linux-2.6.10/gcc-4.1.1-glibc-2.3.6-mallocfix/x86_64-unknown-linux-gnu/bin/x86_64-unknown-linux-gnu-gcc
Host C compiler   gcc
ARCH_CFLAGS       -m64
make              make
install           install
host CPU          x86_64
host big endian   no
target list       x86_64-softmmu
gprof enabled     no
sparse enabled    no
profiler          no
static build      no
-Werror enabled   no
SDL support       yes
SDL static link   yes
curses support    no
mingw32 support   no
Audio drivers     oss
Extra audio cards ac97
Mixer emulation   no
VNC TLS support   no
kqemu support     no
kvm support       yes
CPU emulation     yes
brlapi support    no
Documentation     yes
NPTL support      yes
vde support       no
AIO support       yes
QXL               yes
Spice             no
SMB directores    yes
SCSI devices      yes
ISAPC support     yes
KVM nested        yes
USB storage       yes
USB wacom         yes
USB serial        yes
USB net           yes
USB bluez         no
VMware drivers    yes
NBD support       yes
bluetooth support no
Only generic cpus no

Comment 51 Michael S. Tsirkin 2010-01-23 22:38:08 UTC
This is a qemu-kvm bug, not a virtio-win bug:
we are using memcpy to read the index value,
which is not guaranteed to be atomic.

Verified that replacing memcpy with a direct
read/write for index accesses fixes the
problem.

Comment 60 Chris Ward 2010-02-03 10:02:12 UTC
@James @Cisco,

It appears our QE team is having trouble reproducing this issue and we'll need your help to confirm that the fix proposed for RHEL 5.5 update performs as expected. 

RHEL 5.5 Beta will be out soon and should contain the updated KVM packages including this fix. I will post an announcement to this list when the Beta bits have been made available on RHN. If you could, please grab those bits when they're available and let us know the results of your testing.

Also, if you will not be able to complete this test feedback request, we would appreciate knowing that in advance too.

Thank you for your support.

Comment 61 James Ko 2010-02-03 17:24:08 UTC
(In reply to comment #60)

Our release stream is currently based on RHEL 5.4 so I will not be able to pull a RHEL 5.5 Beta for this testing.  Bug 561022 copies this bug and it is expected that this be used for the backporting to 5.4 z-stream which we can update to.

Comment 62 Lawrence Lim 2010-02-04 08:28:16 UTC
(In reply to comment #61)

So do you have access to kvm-83-105.el5_4.20 as pointed out in #c4?

<https://bugzilla.redhat.com/show_bug.cgi?id=561022#c4>

Comment 63 James Ko 2010-02-04 17:49:42 UTC
Latest version on RHN is only at kvm-83-105.el5_4.13.x86_64 currently.
Do you know when el5_4.20 will be posted?

Comment 64 Michael S. Tsirkin 2010-02-07 14:30:11 UTC
Created attachment 389384 [details]
after applying this patch, and rebuilding, qemu will crash

The problem only triggers when the qemu RPM is built from source with a
custom compiler and running on a custom kernel.
The only way I found to approximately reproduce this issue on our systems
is by replacing the memcpy implementation with a custom routine.

Comment 65 Michael S. Tsirkin 2010-02-07 14:37:30 UTC
How to reproduce:
On both host and guest, build netperf from source:
find it here: ftp://ftp.netperf.org/netperf/netperf-2.4.5.tar.bz2

Apply the patch above (attachment id=389384)
and rebuild qemu-kvm from source.

This replaces memcpy with a custom function
while preventing the compiler from optimizing it.

Run qemu-kvm with userspace networking. E.g.

qemu-kvm -drive file=$HOME/disk.raw -net user -net nic,model=virtio -redir tcp:8022::22 

on host, run netserver.

ssh into guest: ssh -P 8022 <host>

once there, run netperf repeatedly on guest: 

while
date
do
netperf -H <host address>
done

qemu will crash shortly.

Comment 66 Michael S. Tsirkin 2010-02-07 15:00:58 UTC
sorry ssh -P 8022 <host> should have been:
ssh -p 8022 <host>

Comment 69 Chris Ward 2010-03-05 09:45:08 UTC
@Cisco, @Michael,

As far as I understand, this issue should be fixed in the latest RHEL 5.5 Beta snapshot. Could you please verify this and report back your test results as soon as possible. Thanks!

Comment 70 Michael S. Tsirkin 2010-03-07 14:14:45 UTC
I have verified that this is fixed as of kvm-83-156.

Comment 71 Michael S. Tsirkin 2010-03-07 14:32:10 UTC
kvm-83-154 is also fine.

Comment 75 errata-xmlrpc 2010-03-30 07:56:46 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0271.html

