Bug 1473536

Summary: Hangs in serial console under qemu
Product: Red Hat Enterprise Linux 7
Reporter: YongkuiGuo <yoguo>
Component: qemu-kvm
Assignee: Paolo Bonzini <pbonzini>
Status: CLOSED ERRATA
QA Contact: Sitong Liu <siliu>
Severity: medium
Docs Contact:
Priority: high
Version: 7.4
CC: berrange, chayang, hhuang, juzhang, knoel, lmiksik, michen, pbonzini, ptoscano, rbalakri, rjones, salmy, siliu, virt-maint, xchen, xfu, yoguo
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-153.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 14:35:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1474405
Bug Blocks: 910269
Attachments:
  Description: the output with "libguestfs-test-tool"
  Flags: none

Description YongkuiGuo 2017-07-21 06:38:09 UTC
Created attachment 1302176 [details]
the output with "libguestfs-test-tool"

Description of problem:

  The libguestfs-test-tool command failed on an Intel host (Intel(R) Xeon(R) CPU E3-1225 v3) reserved from Beaker. The last part of the output is shown below.

--------------------------------------------------------------------------------
... ...
 + uname -a
Linux (none) 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
+ ls -lR /dev
/dev:
total 0
crw------- 1 root root  10, 235 Jul 20 08:40 autofs
drwxr-xr-x 2 root root       80 Jul 20 08:40 block
drwxr-xr-x 2 root root       80 Jul 20 08:40 bsg
crw------- 1 root root  10, 234 Jul 20 08:40 btrfs-control
drwxr-xr-x 2 root root     2260 Jul 20 08:40 char
crw------- 1 root root   5,   1 Jul 20 08:40 console
lrwxrwxrwx 1 root root       11 Jul 20 08:40 core -> /proc/kcore
drwxr-xr-x 3 root root       80 Jul 20 08:40 cpu
crw------- 1 root root  10,  61 Jul 20 08:40 cpu_dma_latency
crw------- 1 root root  10,  62 Jul 20 08:40 crash
drwxr-xr-x 5 root root      100 Jul 20 08:40 disk
lrwxrwxrwx 1 root root       13 Jul 20 08:40 fd -> /proc/self/fd
crw-rw-rw- 1 root root   1,   7 Jul 20 08:40 full
crw-rw-rw- 1 root root  10, 229 Jul 20 08:40 fuse
crw------- 1 root root  10, 183 Jul 20 08:40 hwrng
drwxr-xr-x 3 root root      220 Jul 20 08:40 input
crw-r--r-- 1 root root   1,  11 Jul 20 08:40 kmsg
crw-rw---- 1 root disk  10, 237 Jul 20 08:40 loop-control
drwxr-xr-x 2 root root       60 Jul 20 08:40 mapper
crw------- 1 root root  10, 227 Jul 20 08:40 mcelog
crw------- 1 root root   1,   1 Jul 20 08:40 mem
drwxr-xr-x 2 root root       60 Jul 20 08:40 net
crw------- 1 root root  10,  60 Jul 20 08:40 network_latency
crw------- 1 root root  10,  59 Jul 20 08:40 network_throughput
crw-rw-rw- 1 root root   1,   3 Jul 20 08:40 null
crw------- 1 root root  10, 144 Jul 20 08:40 nvram
crw------- 1 root root   1,  12 Jul 20 08:40 oldmem
crw------- 1 root root   1,   4 Jul 20 08:40 port
crw------- 1 root root 108,   0 Jul 20 08:40 ppp
crw-rw-rw- 1 root root   5,   2 Jul 20 08:40 ptmx
drwxr-xr-x 2 root root        0 Jul 20 08:40 pts
crw-rw-rw- 1 root root   1,   8 Jul 20 08:40 random
drwxr-xr-x 2 root root       60 Jul 20 08:40 raw
lrwxrwxrwx 1 root root        4 Jul 20 08:40 rtc -> rtc0
crw------- 1 root root 253,   0 Jul 20 08:40 rtc0
brw------- 1 root root   8,   0 Jul 20 08:40 sda
brw------- 1 root root   8,  16 Jul 20 08:40 sdb
crw-rw---- 1 root disk  21,   0 Jul 20 08:40 sg0
crw-rw---- 1 root disk  21,   1 Jul 20 08:40 sg1
crw------- 1 root root  10, 231 Jul 20 08:40 snapshot
drwxr-xr-x 2 root root       80 Jul 20 08:40 snd
lrwxrwxrwx 1 root root       15 Jul 20 08:40 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root       15 Jul 20 08:40 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root       15 Jul 20 08:40 stdout -> /proc/self/fd/1
crw-rw-rw- 1 root root   5,   0 Jul 20 08:40 tty
crw------- 1 root root   4,   0 Jul 20 08:40 tty0
crw------- 1 root root   4,   1 Jul 20 08:40 tty1
crw------- 1 root root   4,  10 Jul 20 08:40 tty10
crw------- 1 root root   4,  11 Jul 20 08:40 tty11
crw------- 1 root root   4,  12 Jul 20 08:40 tty12
crw------- 1 root root   4,  13 Jul 20 08:40 tty13
crw------- 1 root root   4,  14 Jul 20 08:40 tty14
crw------- 1 root root   4,  15 Jul 20 08:40 tty15
crw------- 1 root root   4,  16 Jul 20 08:40 tty16
crw------- 1 root root   4,  17 Jul 20 08:40 tty17
crw------- 1 root root   4,  18 Jul 20 08:40 tty18
crw------- 1 root root   4,  19 Jul 20 08:40 tty19
crw------- 1 root root   4,   2 Jul 20 08:40 tty2
crw------- 1 root root   4,  20 Jul 20 08:40 tty20
crw------- 1 root root   4,  21 Jul 20 08:40 tty21
crw------- 1 root root   4,  22 Jul 20 08:40 tty22
crw------- 1 root root   4,  23 Jul 20 08:40 tty23
crw------- 1 root root   4,  24 Jul 20 08:40 tty24
crw------- 1 root root   4,  25 Jul 20 08:40 tty25
crw------- 1 root root   4,  26 Jul 20 08:40 tty26
crw------- 1 root root   4,  27 Jul 20 08:40 tty27
crw------- 1 root root   4,  28 Jul 20 08:40 tty28
crw------- 1 root root   4,  29 Jul 20 08:40 tty29
--------------------------------------------------------------------------------

  While running 'ls -lR /dev', the libguestfs-test-tool command hangs and later exits.


Version-Release number of selected component (if applicable):
libguestfs-1.36.3-6.el7_4.2.x86_64
kernel-3.10.0-693.el7.x86_64
qemu-kvm-1.5.3-141.el7_4.1.x86_64
glib2-2.50.3-3.el7.x86_64

How reproducible:
100% (only in my test environment)

Steps to Reproduce:
1. On a RHEL 7.4 machine (test env: 10.66.144.29/redhat), run:
# libguestfs-test-tool

Actual results:
The command hangs and then exits abnormally.


Expected results:
... ...
===== TEST FINISHED OK =====

Additional info:
It's difficult to reproduce the problem on other machines.

Comment 2 YongkuiGuo 2017-07-21 06:49:47 UTC
I asked rjones for help. He thinks this is a bug we have seen before: https://bugzilla.redhat.com/show_bug.cgi?id=1435432. It is caused by the glib2 rebase (to glib2-2.50.3-3.el7.x86_64).

Comment 3 YongkuiGuo 2017-07-24 02:36:55 UTC
This bug is the same as bug 1473605 (https://bugzilla.redhat.com/show_bug.cgi?id=1473605), which provides more detailed info.

Comment 4 Richard W.M. Jones 2017-07-24 07:49:23 UTC
If we were to just fix this in qemu alone, I believe it would require
this patch:

commit ecbddbb106114f90008024b4e6c3ba1c38d7ca0e
Author: Richard W.M. Jones <rjones>
Date:   Fri Mar 31 21:51:33 2017 +0100

    main-loop: Acquire main_context lock around os_host_main_loop_wait.

However, this is really a bug in glib2, not in qemu.  The above is
a workaround, as I remember it.

Comment 5 Richard W.M. Jones 2017-07-24 07:50:09 UTC
Also the patch would be needed for both qemu-kvm and qemu-kvm-rhev,
and is not present in either.

Comment 6 Richard W.M. Jones 2017-07-24 09:35:33 UTC
(In reply to Richard W.M. Jones from comment #5)
> Also the patch would be needed for both qemu-kvm and qemu-kvm-rhev,
> and is not present in either.

Oops, as danpb pointed out, this is present in qemu-kvm-rhev because
we rebased, but not in qemu-kvm.

Comment 7 Daniel Berrangé 2017-07-25 14:27:06 UTC
*** Bug 1474405 has been marked as a duplicate of this bug. ***

Comment 11 Paolo Bonzini 2018-01-10 16:04:22 UTC
The qemu-kvm patch is being applied instead of the glib2 fix.  Fixing the "Depends On" field.

Comment 12 Miroslav Rezanina 2018-01-12 09:48:48 UTC
Fix included in qemu-kvm-1.5.3-153.el7
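
To check whether a given host already carries this build, the installed version can be compared against the "Fixed In Version" string. The following is a minimal sketch, not an official procedure: the helper name version_at_least is hypothetical, and GNU `sort -V` only approximates RPM's own version ordering, so treat the result as a hint.

```shell
# Hypothetical helper: succeeds if VERSION sorts at or after MIN_VERSION.
# Assumption: GNU `sort -V` is available; it only approximates RPM ordering.
version_at_least() {
    local ver=$1 min=$2
    # The smaller of the two sorts first; if that is MIN_VERSION, VERSION >= MIN_VERSION.
    [ "$(printf '%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

fixed_in="1.5.3-153.el7"
# On a real host the installed version could be read with:
#   rpm -q --qf '%{VERSION}-%{RELEASE}\n' qemu-kvm
# Here the version from the original report is used as sample input.
installed="1.5.3-141.el7_4.1"

if version_at_least "$installed" "$fixed_in"; then
    echo "qemu-kvm $installed includes the fix"
else
    echo "qemu-kvm $installed predates the fix"
fi
```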

Comment 14 Sitong Liu 2018-01-29 02:49:42 UTC
QE tried to reproduce this bug with 'virt-rescue' (the version QE can get is 1.36), using the commands:

    $ virt-rescue -e ^] --scratch
    ><rescue> while true; do ls -l /usr/bin; done

But this did not reproduce the hang.

QE then tried another way, using the command:

    $ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done

After running this for a long time (one day), QE still could not trigger the hang.

As the developer said, this bug is hard to reproduce, so QE did a sanity test and
patch review for this BZ.

Versions:
libguestfs-1.36.10-4.el7.x86_64
kernel-3.10.0-830.el7.x86_64
qemu-kvm-1.5.3-153.el7.x86_64
glib2-2.54.2-2.el7.x86_64

Ran the command below on a bare-metal host for two days; 'libguestfs-test-tool' finished OK 54000 times and no hang occurred.

    $ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done
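
The one-line loop above stops at the first failure but does not record how many runs passed, and the next run would overwrite the failing log. A possible variant is sketched below. It is an illustration only: the function name run_until_failure and the ".failed" suffix are hypothetical, and it assumes bash.

```shell
# Hypothetical helper (bash): run a command repeatedly until it fails
# or MAX_RUNS successful runs have completed.
# Usage: run_until_failure MAX_RUNS LOGFILE COMMAND [ARGS...]
run_until_failure() {
    local max_runs=$1 log=$2
    shift 2
    local passed=0
    while [ "$passed" -lt "$max_runs" ]; do
        # Each run overwrites LOGFILE, as in the original one-liner.
        if ! "$@" >"$log" 2>&1; then
            # Keep a copy of the failing run's log for later inspection.
            cp "$log" "$log.failed"
            echo "failed after $passed successful runs"
            return 1
        fi
        passed=$((passed + 1))
    done
    echo "passed $passed runs"
    return 0
}
```

For example, `run_until_failure 54000 /tmp/log libguestfs-test-tool -t 60` would run the test tool up to 54000 times. A failure reported this way still needs the check described in the patch: confirm in the saved log that the run hung during early boot and did not fail for some other reason.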

# git checkout origin/rhel7/master-1.5.3
Note: checking out 'origin/rhel7/master-1.5.3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 2beece2... Update to qemu-kvm-1.5.3-153.el7

# git log --grep="Bugzilla: 1473536"
commit 6baaf82a7742a1de9160146b08ba0cc86b3d4e79
Author: Paolo Bonzini <pbonzini>
Date:   Wed Jan 10 17:02:21 2018 +0100

    main-loop: Acquire main_context lock around os_host_main_loop_wait.
    
    RH-Author: Paolo Bonzini <pbonzini>
    Message-id: <20180110170221.28975-1-pbonzini>
    Patchwork-id: 78541
    O-Subject: [RHEL7.5 qemu-kvm PATCH] main-loop: Acquire main_context lock around os_host_main_loop_wait.
    Bugzilla: 1473536
    RH-Acked-by: Jeffrey Cody <jcody>
    RH-Acked-by: John Snow <jsnow>
    RH-Acked-by: Miroslav Rezanina <mrezanin>
    
    Bugzilla: 1473536
    
    Brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14912977
    
    When running virt-rescue the serial console hangs from time to time.
    Virt-rescue runs an ordinary Linux kernel "appliance", but there is
    only a single idle process running inside, so the qemu main loop is
    largely idle.  With virt-rescue >= 1.37 you may be able to observe the
    hang by doing:
    
      $ virt-rescue -e ^] --scratch
      ><rescue> while true; do ls -l /usr/bin; done
    
    The hang in virt-rescue can be resolved by pressing a key on the
    serial console.
    
    Possibly with the same root cause, we also observed hangs during very
    early boot of regular Linux VMs with a serial console.  Those hangs
    are extremely rare, but you may be able to observe them by running
    this command on baremetal for a sufficiently long time:
    
      $ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done
    
    (Check in /tmp/log that the failure was caused by a hang during early
    boot, and not some other reason)
    
    During investigation of this bug, Paolo Bonzini wrote:
    
    > glib is expecting QEMU to use g_main_context_acquire around accesses to
    > GMainContext.  However QEMU is not doing that, instead it is taking its
    > own mutex.  So we should add g_main_context_acquire and
    > g_main_context_release in the two implementations of
    > os_host_main_loop_wait; these should undo the effect of Frediano's
    > glib patch.
    
    This patch exactly implements Paolo's suggestion in that paragraph.
    
    This fixes the serial console hang in my testing, across 3 different
    physical machines (AMD, Intel Core i7 and Intel Xeon), over many hours
    of automated testing.  I wasn't able to reproduce the early boot hangs
    (but as noted above, these are extremely rare in any case).
    
    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
    Reported-by: Richard W.M. Jones <rjones>
    Tested-by: Richard W.M. Jones <rjones>
    Signed-off-by: Richard W.M. Jones <rjones>
    Message-Id: <20170331205133.23906-1-rjones>
    [Paolo: this is actually a glib bug: recent glib versions are also
    expecting g_main_context_acquire around g_poll---but that is not
    documented and probably not even intended].
    Signed-off-by: Paolo Bonzini <pbonzini>
    (cherry picked from commit ecbddbb106114f90008024b4e6c3ba1c38d7ca0e)
    
    Signed-off-by: Miroslav Rezanina <mrezanin>

Comment 19 errata-xmlrpc 2018-04-10 14:35:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0816