Bug 1451470

Summary: RHEL 7.2 based VM (Virtual Machine) hung for several hours apparently waiting for lock held by main_loop
Product: Red Hat Enterprise Linux 7
Reporter: Bimal Chollera <bcholler>
Component: qemu-kvm
Assignee: Fam Zheng <famz>
Status: CLOSED ERRATA
QA Contact: Sitong Liu <siliu>
Severity: urgent
Docs Contact:
Priority: high
Version: 7.2
CC: chayang, coli, dvacek, ehabkost, famz, jcoscia, jherrman, juzhang, knoel, michen, mkalinin, mrezanin, pbonzini, rbalakri, rhodain, salmy, sjohnsto, slopezpa, tlavigne, virt-bugs, virt-maint, xfu
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-140.el7
Doc Type: Bug Fix
Doc Text:
Previously, guest virtual machines in some cases became unresponsive when the "pty" back end of a serial device performed an irregular I/O communication. This update improves the handling of serial I/O on guests, which prevents the described problem from occurring.
Story Points: ---
Clone Of:
: 1452331 1452332
Environment:
Last Closed: 2017-08-01 17:49:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1452331, 1452332

Description Bimal Chollera 2017-05-16 17:53:18 UTC
Description of problem:

A RHEL 7.2 based VM hung and could only be recovered with the virsh destroy command.
According to the gcore dump, the vCPU threads are waiting for a lock held by main_loop.


Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 7.2 - 3.10.0-327.28.3.el7.x86_64
qemu-kvm-1.5.3-105.el7_2.7.x86_64 


How reproducible:

In the end user's environment, the problem recurred on a second host and VM unrelated to the first.

Attempts to reproduce the problem in our own lab have been partially successful, although it does not reproduce 100% of the time yet.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 14 Paolo Bonzini 2017-05-17 14:06:29 UTC
> I believe that Fam's scratch build will fix it. serial_xmit() is the only 
> caller of qemu_chr_fe_add_watch() that ignores its return value.

It wouldn't necessarily fix it, but it would assert as soon as two watches are set up at the same time.

My plan is to:

1) try and reproduce with downstream QEMU

2) try and reproduce with Fam's patch

3) try moving the patch to add the assertion early in the series, to see which patch fixes it

Comment 15 Eduardo Habkost 2017-05-17 14:41:48 UTC
(In reply to Paolo Bonzini from comment #14)
> > I believe that Fam's scratch build will fix it. serial_xmit() is the only 
> > caller of qemu_chr_fe_add_watch() that ignores its return value.
> 
> It wouldn't necessarily fix it, but it would assert as soon as two watches
> are set up at the same time.

That's true about a1df76da57aa8772a75e7c49f8e3829d07b4c46c. However, the scratch build also includes commit f702e62a193e9ddb41cef95068717e5582b39a64, which rewrites the retry logic completely.
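
The assertion discussed in comments 14 and 15 can be pictured with a small standalone sketch. It is not the QEMU code: FakeSerial, fake_add_watch() and the other names are hypothetical stand-ins, and the only idea taken from the discussion is that the device remembers its pending watch and asserts if a second one would be set up while the first is still outstanding.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the serial device's watch bookkeeping. */
typedef struct {
    unsigned watch_tag;  /* 0 means no watch is pending */
    unsigned next_tag;   /* source of fake, non-zero tags */
} FakeSerial;

/* Hypothetical stand-in for qemu_chr_fe_add_watch(): returns a non-zero tag. */
static unsigned fake_add_watch(FakeSerial *s)
{
    return ++s->next_tag;
}

static bool fake_watch_pending(const FakeSerial *s)
{
    return s->watch_tag != 0;
}

/* Arm a watch; asserts if one is already outstanding (the "two watches" case). */
static void fake_arm_watch(FakeSerial *s)
{
    assert(!fake_watch_pending(s));
    s->watch_tag = fake_add_watch(s);
}

/* Called when the watch fires: the tag is cleared so a new watch may be armed. */
static void fake_watch_fired(FakeSerial *s)
{
    s->watch_tag = 0;
}
```

A caller that ignores the returned tag, as serial_xmit() is said to do above, has no way to perform the check in fake_arm_watch(), which is why the return value matters.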

Comment 16 Paolo Bonzini 2017-05-17 21:51:27 UTC
Right.  So I think this hunk is the one that fixes the bug:

@@ -293,7 +298,9 @@ static void serial_ioport_write(void *opaque,
             s->thr_ipending = 0;
             s->lsr &= ~UART_LSR_THRE;
             serial_update_irq(s);
-            serial_xmit(NULL, G_IO_OUT, s);
+            if (s->tsr_retry <= 0) {
+                serial_xmit(NULL, G_IO_OUT, s);
+            }
         }
         break;
     case 1:

although I still support backporting all the patches that Fam identified, especially 0d931d70 ("serial: clean up THRE/TEMT handling") and 62c339c52 ("qemu-char: ignore flow control if a PTY's slave is not connected").
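
The effect of the guard in the hunk above can be illustrated with a simplified model. This is only a sketch under stated assumptions: FakeUart, fake_xmit() and fake_thr_write() are hypothetical names, and the model assumes that a positive tsr_retry means a retry of the previous byte is already scheduled, so a new guest write must not enter the transmit path a second time.

```c
#include <assert.h>

/* Hypothetical, simplified model of the UART transmit state. */
typedef struct {
    int tsr_retry;   /* > 0 while a retry of the previous byte is pending */
    int xmit_calls;  /* how many times the transmit path was entered */
} FakeUart;

/* Stand-in for the transmit routine: a busy back end schedules a retry. */
static void fake_xmit(FakeUart *u, int backend_busy)
{
    u->xmit_calls++;
    if (backend_busy) {
        u->tsr_retry++;
    } else {
        u->tsr_retry = 0;
    }
}

/* Stand-in for the guest writing the transmit register. */
static void fake_thr_write(FakeUart *u, int backend_busy)
{
    /* Same condition as the hunk: skip transmit while a retry is pending. */
    if (u->tsr_retry <= 0) {
        fake_xmit(u, backend_busy);
    }
}
```

In this model, a second guest write while the back end is still busy leaves xmit_calls unchanged; transmission resumes only once the pending retry has run and reset tsr_retry.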

Comment 29 Miroslav Rezanina 2017-05-23 06:42:19 UTC
Fix included in qemu-kvm-1.5.3-138.el7

Comment 41 Fam Zheng 2017-05-24 12:13:29 UTC
Not sure about 2), but O_NONBLOCK was added to pty fd in upstream:

commit fac6688a18574b6f2caa8c699a936e729ed53ece
Author: Don Slutz <dslutz>
Date:   Mon Dec 22 10:04:00 2014 -0500

    Do not hang on full PTY
    
    Signed-off-by: Don Slutz <dslutz>
    Reviewed-by: Paolo Bonzini <pbonzini>
    Signed-off-by: Michael Tokarev <mjt.ru>

diff --git a/qemu-char.c b/qemu-char.c
index 5430b87..98d4342 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -1402,6 +1402,7 @@ static CharDriverState *qemu_chr_open_pty(const char *id,
     }
 
     close(slave_fd);
+    qemu_set_nonblock(master_fd);
 
     chr = qemu_chr_alloc();
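
What the one-line change buys can be shown with plain POSIX calls. The following is a minimal sketch, not QEMU code: set_nonblock() is a hypothetical helper built on fcntl(), and a pipe is assumed as a stand-in for the pty master because it is easier to fill in a test. With O_NONBLOCK set, a write to a full descriptor fails with EAGAIN instead of blocking the calling thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Set O_NONBLOCK on fd, preserving the other status flags.
 * Returns 0 on success, -1 on error. */
static int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Write until a write fails or max_bytes have been written.
 * Returns the errno of the failing write, or 0 if the limit was reached. */
static int fill_until_full(int fd, long max_bytes)
{
    char buf[4096];
    long total = 0;

    memset(buf, 'x', sizeof(buf));
    while (total < max_bytes) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            return errno;  /* EAGAIN here means "full", not "hung" */
        }
        total += n;
    }
    return 0;
}
```

Without the set_nonblock() call, the same loop would block in write() once the buffer fills and nothing reads the other end, which matches the hang described in this bug.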

Comment 43 Sergio Lopez 2017-05-25 08:00:35 UTC
(In reply to Fam Zheng from comment #41)
> Not sure about 2), but O_NONBLOCK was added to pty fd in upstream:

I've tried with just O_NONBLOCK first, and while QEMU no longer blocks on write(), the cost of writing to the serial port is so high for the VM that it still feels like it's hung.

Comment 53 FuXiangChun 2017-06-14 07:38:15 UTC
According to comment 48 and comment 52, setting this bug as verified.

Comment 54 errata-xmlrpc 2017-08-01 17:49:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1856