Red Hat Bugzilla – Bug 1451470
RHEL 7.2 based VM (Virtual Machine) hung for several hours apparently waiting for lock held by main_loop
Last modified: 2017-08-01 13:49:19 EDT
Description of problem:
A RHEL 7.2 based VM hung and could only be recovered with the virsh destroy command. According to the gcore dump, the vCPU threads are waiting for a lock held by main_loop.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 7.2 - 3.10.0-327.28.3.el7.x86_64
qemu-kvm-1.5.3-105.el7_2.7.x86_64

How reproducible:
In the end user's environment, the problem recurred on a second host and VM unrelated to the first. Attempts to reproduce it in our own lab have been mildly successful, although not 100% reproducible at this time.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
> I believe that Fam's scratch build will fix it. serial_xmit() is the only
> caller of qemu_chr_fe_add_watch() that ignores its return value.

It wouldn't necessarily fix it, but it would assert as soon as two watches are set up at the same time. My plan is to:

1) try and reproduce with downstream QEMU
2) try and reproduce with Fam's patch
3) try moving the patch to add the assertion early in the series, to see which patch fixes it
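To make the "assert as soon as two watches are set up" idea concrete, here is a minimal sketch. It is not QEMU code: ToySerial and the toy_* functions are hypothetical stand-ins, and the only assumption carried over from the discussion is that a device should hold at most one pending watch at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the serial device state; not QEMU's struct. */
typedef struct {
    unsigned watch_tag;   /* 0 means no watch is currently registered */
    unsigned next_tag;    /* source of fresh non-zero tags */
} ToySerial;

/* Returns true if registering a watch now would create a second one. */
static bool toy_watch_pending(const ToySerial *s)
{
    return s->watch_tag != 0;
}

/* Register a watch and remember its tag; aborts if one is already pending. */
static unsigned toy_add_watch(ToySerial *s)
{
    assert(!toy_watch_pending(s));
    s->watch_tag = ++s->next_tag;
    return s->watch_tag;
}

/* Called when the watch fires: the slot becomes free again. */
static void toy_watch_fired(ToySerial *s)
{
    s->watch_tag = 0;
}
```

With this shape, a caller that drops the returned tag and registers again before the first watch has fired trips the assertion immediately, rather than leaving two watches behind to be discovered later.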
(In reply to Paolo Bonzini from comment #14)
> > I believe that Fam's scratch build will fix it. serial_xmit() is the only
> > caller of qemu_chr_fe_add_watch() that ignores its return value.
>
> It wouldn't necessarily fix it, but it would assert as soon as two watches
> are set up at the same time.

That's true about a1df76da57aa8772a75e7c49f8e3829d07b4c46c. However, the scratch build also includes commit f702e62a193e9ddb41cef95068717e5582b39a64, which rewrites the retry logic completely.
Right. So I think this hunk is the one that fixes the bug:

@@ -293,7 +298,9 @@ static void serial_ioport_write(void *opaque,
             s->thr_ipending = 0;
             s->lsr &= ~UART_LSR_THRE;
             serial_update_irq(s);
-            serial_xmit(NULL, G_IO_OUT, s);
+            if (s->tsr_retry <= 0) {
+                serial_xmit(NULL, G_IO_OUT, s);
+            }
         }
         break;
     case 1:

although I still support backporting all the patches that Fam identified, especially 0d931d70 ("serial: clean up THRE/TEMT handling") and 62c339c52 ("qemu-char: ignore flow control if a PTY's slave is not connected").
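For readers who want to see why the guard matters, here is a toy model of the hunk. It is only a sketch under assumptions: ToyUart, toy_xmit() and toy_thr_write() are hypothetical, the field name tsr_retry merely echoes the hunk, and the transmit path is reduced to "a busy backend schedules a retry through a watch".

```c
#include <stdbool.h>

/* Toy model only: field names echo the hunk, the logic is simplified. */
typedef struct {
    int tsr_retry;        /* > 0 while a retry (and its watch) is pending */
    int watches;          /* number of watches currently registered */
    bool backend_busy;    /* true when the chardev cannot accept data */
} ToyUart;

/* Simplified transmit: on a busy backend, schedule a retry via a watch. */
static void toy_xmit(ToyUart *s)
{
    if (s->backend_busy) {
        s->tsr_retry++;
        s->watches++;
    } else {
        s->tsr_retry = 0;
    }
}

/* Guest write to the transmit register, with or without the guard. */
static void toy_thr_write(ToyUart *s, bool guarded)
{
    if (!guarded || s->tsr_retry <= 0) {
        toy_xmit(s);
    }
}
```

In this model, two guest writes against a busy backend leave two watches registered without the guard and one with it, which is the condition the assertion discussed earlier in the thread is meant to catch.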
Fix included in qemu-kvm-1.5.3-138.el7
Not sure about 2), but O_NONBLOCK was added to pty fd in upstream:

commit fac6688a18574b6f2caa8c699a936e729ed53ece
Author: Don Slutz <dslutz@verizon.com>
Date:   Mon Dec 22 10:04:00 2014 -0500

    Do not hang on full PTY

    Signed-off-by: Don Slutz <dslutz@verizon.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>

diff --git a/qemu-char.c b/qemu-char.c
index 5430b87..98d4342 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -1402,6 +1402,7 @@ static CharDriverState *qemu_chr_open_pty(const char *id,
     }
     close(slave_fd);
+    qemu_set_nonblock(master_fd);
     chr = qemu_chr_alloc();
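As an illustration of what O_NONBLOCK changes for the writer, here is a small POSIX sketch. It is not the QEMU implementation: toy_set_nonblock() and toy_fill() are hypothetical helpers, and the accompanying check uses a pipe rather than a PTY purely to keep the example portable.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK on fd. Returns 0 on success, -1 on error. */
static int toy_set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Write until a write fails. Returns the number of bytes written; on a
 * non-blocking fd with no reader draining it, the loop ends with errno set
 * to EAGAIN instead of hanging in write(). */
static long toy_fill(int fd)
{
    char buf[4096];
    long total = 0;

    memset(buf, 'x', sizeof(buf));
    for (;;) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal, try again */
            }
            return total;       /* caller inspects errno */
        }
        total += n;
    }
}
```

Without the O_NONBLOCK flag, the same loop would block in write() once the kernel buffer filled and nobody was reading the other end, which matches the "hang on full PTY" that the commit title describes.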
(In reply to Fam Zheng from comment #41)
> Not sure about 2), but O_NONBLOCK was added to pty fd in upstream:

I've tried with just O_NONBLOCK first. While QEMU no longer blocks on write(), the cost of writing to the serial port is so high for the VM that it still feels like it's hung.
According to comment 48 and comment 52, set this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1856