Bug 643751

Summary: writing to a virtio serial port while no one is listening on the host side hangs the guest
Product: Red Hat Enterprise Linux 6
Reporter: Hans de Goede <hdegoede>
Component: kernel
Assignee: Amit Shah <amit.shah>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 6.0
CC: dawu, dhoward, fhrbata, juzhang, mjenner, plyons, virt-maint
Target Milestone: rc
Keywords: ZStream
Target Release: 6.1
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.32-91.el6
Doc Type: Bug Fix
Doc Text:
If a host was slow in reading data or did not read data at all, blocking write() calls blocked not only the program that called write() but also the entire guest. This was caused by the write() calls waiting for the host to acknowledge that it had consumed the data. With this update, write() calls no longer wait for such acknowledgment: control is immediately returned to the user space application. This ensures that even if the host is busy processing other data or is not consuming data at all, the guest is not blocked.
Story Points: ---
Clone Of:
: 644735 (view as bug list) Environment:
Last Closed: 2011-05-23 20:26:31 UTC Type: ---
Bug Depends On:    
Bug Blocks: 580954, 644735, 678562    
Attachments:
Description Flags
serial-console.py none

Description Hans de Goede 2010-10-17 20:06:43 UTC
The problem is this "beauty" in virtio_console.c: send_buf()

        /*
         * Wait till the host acknowledges it pushed out the data we
         * sent.  This is done for ports in blocking mode or for data
         * from the hvc_console; the tty operations are performed with
         * spinlocks held so we can't sleep here.
         */
        while (!virtqueue_get_buf(out_vq, &len))
                cpu_relax();

I see a number of possible (partial) solutions here:

1) The code says it is using cpu_relax() rather than sleeping because the tty
functions are called with a spinlock held. But fops_write is not called with
any spinlock held, I believe. How about a parameter to send_buf, called
"may_sleep", and then sleep rather than relax if may_sleep is true?

2) The comment says the waiting "is done for ports in blocking mode or for
data from the hvc_console". I wonder why the waiting is done in blocking mode
too. I guess this is some sort of workaround for the missing waitqueue wakeups,
see bug 643750. With those waitqueue wakeups added, I would expect that
waiting for the host acknowledgment is only needed for tty usage, and that we
could skip the wait entirely (making 1 moot) when called from fops_write?

I know that work is being done on a more permanent solution, but if the above two are possible, this would be a nice way to reduce the cases where this problem happens, which will also help when running in VMs without the new, more permanent fix.
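
The difference between waiting for the host acknowledgment and skipping the wait can be illustrated with a toy userspace model. This is only a sketch of the control flow, not kernel code: FakeOutVq, its methods, and the max_spins bound are hypothetical names invented for illustration. The real loop in send_buf() has no bound, which is why the guest hangs.

```python
class FakeOutVq:
    """Toy stand-in for the guest's output virtqueue (hypothetical)."""

    def __init__(self, host_consumes):
        self.host_consumes = host_consumes
        self.pending = []

    def add_buf(self, buf):
        self.pending.append(buf)

    def get_buf(self):
        # A buffer comes back only once the host has consumed it.
        if self.host_consumes and self.pending:
            return self.pending.pop(0)
        return None


def send_buf(vq, buf, wait_for_ack, max_spins=1000):
    """Queue buf and optionally spin until the host acknowledges it.

    Returns the number of failed polls before the acknowledgment, or
    None if max_spins ran out. The bound exists only so that this model
    terminates; it is an assumption of the sketch.
    """
    vq.add_buf(buf)
    if not wait_for_ack:
        return 0  # proposal 2: return to the caller right away
    for spins in range(max_spins):
        if vq.get_buf() is not None:
            return spins
    return None  # the host never consumed the data
```

With a host that never reads, send_buf(FakeOutVq(False), b"x", wait_for_ack=True) exhausts its spin budget, while the same call with wait_for_ack=False returns immediately.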

Note: I believe this bug should be assigned to Amit Shah (but I'm not sure
whether it is OK to do this myself with respect to the kernel team's procedures).

Comment 2 Hans de Goede 2010-10-19 07:58:10 UTC
Amit has posted a patch for this, assigning to Amit.

Comment 3 Amit Shah 2010-10-20 06:07:51 UTC
To test:

- open guest virtio-console port
- open host virtio-console port

Without reading from the host side, keep writing to the guest port.  After a
few writes, the guest will freeze.

After applying the patch that fixes this, the guest will not freeze, but the
application writing data to the guest port will wait till the host side data is
read off.

This test is the test_blocking_write() in test-virtserial.git:

http://fedorapeople.org/gitweb?p=amitshah/public_git/test-virtserial.git;a=commitdiff;h=e5cbe2be47ca7cf5fce86da694869fc8e922d41c
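
The guest-side half of this procedure can be sketched as follows. This is not the test_blocking_write() code from test-virtserial.git, only a minimal illustration of "keep writing while nothing reads the other end". It uses non-blocking mode so that the test program itself cannot hang, and it assumes the file descriptor reports a full queue with EAGAIN the way a pipe does. The function name and the max_bytes cap are hypothetical.

```python
import os


def write_until_full(fd, chunk=b"x" * 4096, max_bytes=16 * 1024 * 1024):
    """Write to fd in non-blocking mode until a write would block.

    Returns (bytes_written, stalled). stalled is True if the descriptor
    refused further data before max_bytes was reached.
    """
    os.set_blocking(fd, False)
    written = 0
    while written < max_bytes:
        try:
            written += os.write(fd, chunk)
        except BlockingIOError:
            # Nobody is draining the other end.
            return written, True
    return written, False
```

Inside a guest, the descriptor would presumably come from something like os.open("/dev/vport0p1", os.O_WRONLY), using the port path that appears in the later verification comments. Outside a guest, a pipe whose read end is never read behaves similarly for the purpose of trying the function.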

Comment 4 RHEL Program Management 2010-10-20 12:10:05 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 5 Amit Shah 2010-10-21 07:45:56 UTC
Additional testing note: the effect of this patch will be visible with host-side qemu modifications which aren't yet in RHEL6.  When testing, ask me for a brew build.

Comment 6 Aristeu Rozanski 2010-12-15 16:05:30 UTC
Patch(es) available on kernel-2.6.32-91.el6

Comment 11 juzhang 2011-03-23 05:48:11 UTC
(In reply to comment #3)
> To test:
> 
> - open guest virtio-console port
> - open host virtio-console port
> 
> Without reading from the host side, keep writing to the guest port.  After a
> few writes, the guest will freeze.
Reproduced on kernel-2.6.32-90.el6.
After step 3, the guest hangs 30 seconds later.

Verified on kernel-2.6.32-118.el6 with qemu-kvm-0.12.1.2-2.151.el6.x86_64

Steps:
1. Boot the guest.
#/usr/libexec/qemu-kvm -m 2G -smp 4 -drive file=/root/zhangjunyi/rhel6.1-ide.qcow2,if=none,id=test,cache=none,format=qcow2,werror=stop,rerror=stop -device virtio-blk-pci,drive=test -cpu qemu64,+sse2,+x2apic -boot c -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:11:22:45:66:94 -vnc :10  -device virtio-serial-pci,id=virtio-serial0,max_ports=31 -chardev socket,id=channel0,path=/var/zhangjunyi0,server,nowait -device virtserialport,bus=virtio-serial0.0,chardev=channel0,name=org.port.0,id=port1 -serial stdio -qmp tcp:0:4444,server,nowait

2. In the guest, write a big file:
cat partaa > /dev/vport0p1

3. On the host, just open the virtio-console port without reading from it.

Results:
10 minutes later, the guest is still running fine.

Comment 12 juzhang 2011-03-23 05:49:02 UTC
According to comment 11, setting this issue as verified.

Comment 13 Martin Prpič 2011-04-12 12:42:10 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
If a host was slow in reading data or did not read data at all, blocking write() calls blocked not only the program that called write() but also the entire guest. This was caused by the write() calls waiting for the host to acknowledge that it had consumed the data. With this update, write() calls no longer wait for such acknowledgment: control is immediately returned to the user space application. This ensures that even if the host is busy processing other data or is not consuming data at all, the guest is not blocked.

Comment 14 dawu 2011-04-19 03:25:47 UTC
Verified on kernel-2.6.32-133.el6 with qemu-kvm-0.12.1.2-2.158.el6.x86_64.
This issue does not reproduce; 10 minutes later, the guest is still running fine. Details follow:

Steps:
1. Boot the guest.
/usr/libexec/qemu-kvm -m 2G -smp 4 -drive file=RHEL-Server-6.1-64-virtio.qcow2,if=none,id=test,cache=none,format=qcow2,werror=stop,rerror=stop -device virtio-blk-pci,drive=test -cpu qemu64,+sse2,+x2apic -boot c -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:11:22:45:66:94 -vnc :1  -device virtio-serial-pci,id=virtio-serial0,max_ports=31 -chardev socket,id=channel0,path=/var/zhangjunyi0,server,nowait -device virtserialport,bus=virtio-serial0.0,chardev=channel0,name=org.port.0,id=port1-serial -qmp tcp:0:4444,server,nowait

2. In the guest, write a big file:
cat partaa > /dev/vport0p1

3. On the host, just open the virtio-console port without reading from it (please refer to the attached Python script, "serial-console.py"):
# python serial-console.py /var/zhangjunyi0

Results:
10 minutes later, the guest is still running fine.

Comment 15 dawu 2011-04-19 03:27:28 UTC
Created attachment 493062 [details]
serial-console.py
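
The contents of the attachment are not reproduced in this report. As a rough idea of what the host-side step involves, the following hypothetical stand-in connects to the chardev's UNIX socket (qemu is started with "server,nowait", so it is the listening side) and deliberately never reads. The function name and the seconds parameter are assumptions made for this sketch, not taken from serial-console.py.

```python
import socket
import time


def hold_port_open(path, seconds):
    """Connect to the chardev's UNIX socket and never read from it."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        # Deliberately no recv(): the host side stays connected but idle.
        time.sleep(seconds)
```

For the scenario above, this might be called as hold_port_open("/var/zhangjunyi0", 600) to keep the port open for the 10 minutes used in the verification runs.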

Comment 16 errata-xmlrpc 2011-05-23 20:26:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0542.html