Bug 1473536
Summary: | Hangs in serial console under qemu | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | YongkuiGuo <yoguo>
Component: | qemu-kvm | Assignee: | Paolo Bonzini <pbonzini>
Status: | CLOSED ERRATA | QA Contact: | Sitong Liu <siliu>
Severity: | medium | Docs Contact: |
Priority: | high | |
Version: | 7.4 | CC: | berrange, chayang, hhuang, juzhang, knoel, lmiksik, michen, pbonzini, ptoscano, rbalakri, rjones, salmy, siliu, virt-maint, xchen, xfu, yoguo
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qemu-kvm-1.5.3-153.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-04-10 14:35:07 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1474405 | |
Bug Blocks: | 910269 | |
Attachments: |
|
Asked rjones for help. He thought this is a bug we have seen before: https://bugzilla.redhat.com/show_bug.cgi?id=1435432. It is caused by the *glib2* rebase (to glib2-2.50.3-3.el7.x86_64).

This bug is the same as bug 1473605 (https://bugzilla.redhat.com/show_bug.cgi?id=1473605), which provides more detailed info.

If we were to just fix this in qemu alone, I believe it would require this patch:

commit ecbddbb106114f90008024b4e6c3ba1c38d7ca0e
Author: Richard W.M. Jones <rjones>
Date: Fri Mar 31 21:51:33 2017 +0100

    main-loop: Acquire main_context lock around os_host_main_loop_wait.

However, this is really a bug in glib2, not in qemu. The above is a workaround, as I remember it.

Also the patch would be needed for both qemu-kvm and qemu-kvm-rhev, and is not present in either.

(In reply to Richard W.M. Jones from comment #5)
> Also the patch would be needed for both qemu-kvm and qemu-kvm-rhev,
> and is not present in either.

Oops, as danpb pointed out, this is present in qemu-kvm-rhev because we rebased, but not in qemu-kvm.

*** Bug 1474405 has been marked as a duplicate of this bug. ***

This fix is in place of the glib2 fix. Fixing the "Depends On" field.

Fix included in qemu-kvm-1.5.3-153.el7

QE tried to reproduce this bug with 'virt-rescue'; the version QE can get is 1.36. The command used was:

$ virt-rescue -e ^] --scratch
><rescue> while true; do ls -l /usr/bin; done

The hang did not reproduce. QE then tried the other way, with the command:

$ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done

After running it for long enough (one day), the hang still did not occur. As the developer said, this bug is hard to reproduce, so QE did a sanity test and a patch review for this bz.

Versions:
libguestfs-1.36.10-4.el7.x86_64
kernel-3.10.0-830.el7.x86_64
qemu-kvm-1.5.3-153.el7.x86_64
glib2-2.54.2-2.el7.x86_64

Ran the command below on a bare-metal host for two days. 'libguestfs-test-tool' finished OK 54000 times and no hang occurred.

$ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done

Patch review:

# git checkout origin/rhel7/master-1.5.3
Note: checking out 'origin/rhel7/master-1.5.3'.

You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 2beece2... Update to qemu-kvm-1.5.3-153.el7

# git log --grep="Bugzilla: 1473536"
commit 6baaf82a7742a1de9160146b08ba0cc86b3d4e79
Author: Paolo Bonzini <pbonzini>
Date: Wed Jan 10 17:02:21 2018 +0100

    main-loop: Acquire main_context lock around os_host_main_loop_wait.

    RH-Author: Paolo Bonzini <pbonzini>
    Message-id: <20180110170221.28975-1-pbonzini>
    Patchwork-id: 78541
    O-Subject: [RHEL7.5 qemu-kvm PATCH] main-loop: Acquire main_context lock around os_host_main_loop_wait.
    Bugzilla: 1473536
    RH-Acked-by: Jeffrey Cody <jcody>
    RH-Acked-by: John Snow <jsnow>
    RH-Acked-by: Miroslav Rezanina <mrezanin>

    Bugzilla: 1473536

    Brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14912977

    When running virt-rescue the serial console hangs from time to time.
    Virt-rescue runs an ordinary Linux kernel "appliance", but there is
    only a single idle process running inside, so the qemu main loop is
    largely idle. With virt-rescue >= 1.37 you may be able to observe the
    hang by doing:

      $ virt-rescue -e ^] --scratch
      ><rescue> while true; do ls -l /usr/bin; done

    The hang in virt-rescue can be resolved by pressing a key on the
    serial console.

    Possibly with the same root cause, we also observed hangs during very
    early boot of regular Linux VMs with a serial console. Those hangs
    are extremely rare, but you may be able to observe them by running
    this command on baremetal for a sufficiently long time:

      $ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done

    (Check in /tmp/log that the failure was caused by a hang during early
    boot, and not some other reason)

    During investigation of this bug, Paolo Bonzini wrote:

    > glib is expecting QEMU to use g_main_context_acquire around accesses to
    > GMainContext. However QEMU is not doing that, instead it is taking its
    > own mutex. So we should add g_main_context_acquire and
    > g_main_context_release in the two implementations of
    > os_host_main_loop_wait; these should undo the effect of Frediano's
    > glib patch.

    This patch exactly implements Paolo's suggestion in that paragraph.

    This fixes the serial console hang in my testing, across 3 different
    physical machines (AMD, Intel Core i7 and Intel Xeon), over many hours
    of automated testing. I wasn't able to reproduce the early boot hangs
    (but as noted above, these are extremely rare in any case).

    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
    Reported-by: Richard W.M. Jones <rjones>
    Tested-by: Richard W.M. Jones <rjones>
    Signed-off-by: Richard W.M. Jones <rjones>
    Message-Id: <20170331205133.23906-1-rjones>
    [Paolo: this is actually a glib bug: recent glib versions are also
    expecting g_main_context_acquire around g_poll---but that is not
    documented and probably not even intended].
    Signed-off-by: Paolo Bonzini <pbonzini>
    (cherry picked from commit ecbddbb106114f90008024b4e6c3ba1c38d7ca0e)
    Signed-off-by: Miroslav Rezanina <mrezanin>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0816 |
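The reproduction loop quoted in the comments above runs until the first failure and overwrites /tmp/log on every pass. For unattended runs it may help to bound the number of iterations and report which iteration failed. The following is only a sketch: `repeat_until_failure` is a hypothetical helper name, not part of libguestfs, and the iteration count and log path are passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not part of libguestfs): run a command repeatedly
# until it fails or a maximum number of iterations is reached.
# Usage: repeat_until_failure MAX LOGFILE COMMAND [ARGS...]
repeat_until_failure() {
    max=$1
    log=$2
    shift 2
    i=0
    while [ "$i" -lt "$max" ]; do
        # Overwrite the log on each run so it holds only the latest output,
        # which is the output of the failing run when the loop stops early.
        if ! "$@" >"$log" 2>&1; then
            echo "failed on iteration $((i + 1)); see $log"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $i iterations without failure"
    return 0
}

# Example invocation (commented out because it needs libguestfs installed):
# repeat_until_failure 54000 /tmp/log libguestfs-test-tool -t 60
```

As in the original loop, a non-zero exit only says that a run failed; the log still has to be inspected to confirm that the cause was a hang and not something else.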
Created attachment 1302176
the output with "libguestfs-test-tool"

Description of problem:
The libguestfs-test-tool command failed on an Intel host (Intel(R) Xeon(R) CPU E3-1225 v3) reserved from Beaker. The last part of the output is shown below.

--------------------------------------------------------------------------------
... ...
+ uname -a
Linux (none) 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
+ ls -lR /dev
/dev:
total 0
crw------- 1 root root 10, 235 Jul 20 08:40 autofs
drwxr-xr-x 2 root root 80 Jul 20 08:40 block
drwxr-xr-x 2 root root 80 Jul 20 08:40 bsg
crw------- 1 root root 10, 234 Jul 20 08:40 btrfs-control
drwxr-xr-x 2 root root 2260 Jul 20 08:40 char
crw------- 1 root root 5, 1 Jul 20 08:40 console
lrwxrwxrwx 1 root root 11 Jul 20 08:40 core -> /proc/kcore
drwxr-xr-x 3 root root 80 Jul 20 08:40 cpu
crw------- 1 root root 10, 61 Jul 20 08:40 cpu_dma_latency
crw------- 1 root root 10, 62 Jul 20 08:40 crash
drwxr-xr-x 5 root root 100 Jul 20 08:40 disk
lrwxrwxrwx 1 root root 13 Jul 20 08:40 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Jul 20 08:40 full
crw-rw-rw- 1 root root 10, 229 Jul 20 08:40 fuse
crw------- 1 root root 10, 183 Jul 20 08:40 hwrng
drwxr-xr-x 3 root root 220 Jul 20 08:40 input
crw-r--r-- 1 root root 1, 11 Jul 20 08:40 kmsg
crw-rw---- 1 root disk 10, 237 Jul 20 08:40 loop-control
drwxr-xr-x 2 root root 60 Jul 20 08:40 mapper
crw------- 1 root root 10, 227 Jul 20 08:40 mcelog
crw------- 1 root root 1, 1 Jul 20 08:40 mem
drwxr-xr-x 2 root root 60 Jul 20 08:40 net
crw------- 1 root root 10, 60 Jul 20 08:40 network_latency
crw------- 1 root root 10, 59 Jul 20 08:40 network_throughput
crw-rw-rw- 1 root root 1, 3 Jul 20 08:40 null
crw------- 1 root root 10, 144 Jul 20 08:40 nvram
crw------- 1 root root 1, 12 Jul 20 08:40 oldmem
crw------- 1 root root 1, 4 Jul 20 08:40 port
crw------- 1 root root 108, 0 Jul 20 08:40 ppp
crw-rw-rw- 1 root root 5, 2 Jul 20 08:40 ptmx
drwxr-xr-x 2 root root 0 Jul 20 08:40 pts
crw-rw-rw- 1 root root 1, 8 Jul 20 08:40 random
drwxr-xr-x 2 root root 60 Jul 20 08:40 raw
lrwxrwxrwx 1 root root 4 Jul 20 08:40 rtc -> rtc0
crw------- 1 root root 253, 0 Jul 20 08:40 rtc0
brw------- 1 root root 8, 0 Jul 20 08:40 sda
brw------- 1 root root 8, 16 Jul 20 08:40 sdb
crw-rw---- 1 root disk 21, 0 Jul 20 08:40 sg0
crw-rw---- 1 root disk 21, 1 Jul 20 08:40 sg1
crw------- 1 root root 10, 231 Jul 20 08:40 snapshot
drwxr-xr-x 2 root root 80 Jul 20 08:40 snd
lrwxrwxrwx 1 root root 15 Jul 20 08:40 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Jul 20 08:40 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Jul 20 08:40 stdout -> /proc/self/fd/1
crw-rw-rw- 1 root root 5, 0 Jul 20 08:40 tty
crw------- 1 root root 4, 0 Jul 20 08:40 tty0
crw------- 1 root root 4, 1 Jul 20 08:40 tty1
crw------- 1 root root 4, 10 Jul 20 08:40 tty10
crw------- 1 root root 4, 11 Jul 20 08:40 tty11
crw------- 1 root root 4, 12 Jul 20 08:40 tty12
crw------- 1 root root 4, 13 Jul 20 08:40 tty13
crw------- 1 root root 4, 14 Jul 20 08:40 tty14
crw------- 1 root root 4, 15 Jul 20 08:40 tty15
crw------- 1 root root 4, 16 Jul 20 08:40 tty16
crw------- 1 root root 4, 17 Jul 20 08:40 tty17
crw------- 1 root root 4, 18 Jul 20 08:40 tty18
crw------- 1 root root 4, 19 Jul 20 08:40 tty19
crw------- 1 root root 4, 2 Jul 20 08:40 tty2
crw------- 1 root root 4, 20 Jul 20 08:40 tty20
crw------- 1 root root 4, 21 Jul 20 08:40 tty21
crw------- 1 root root 4, 22 Jul 20 08:40 tty22
crw------- 1 root root 4, 23 Jul 20 08:40 tty23
crw------- 1 root root 4, 24 Jul 20 08:40 tty24
crw------- 1 root root 4, 25 Jul 20 08:40 tty25
crw------- 1 root root 4, 26 Jul 20 08:40 tty26
crw------- 1 root root 4, 27 Jul 20 08:40 tty27
crw------- 1 root root 4, 28 Jul 20 08:40 tty28
crw------- 1 root root 4, 29 Jul 20 08:40 tty29
--------------------------------------------------------------------------------

While running 'ls -lR /dev', the libguestfs-test-tool command gets stuck and exits later.
Version-Release number of selected component (if applicable):
libguestfs-1.36.3-6.el7_4.2.x86_64
kernel-3.10.0-693.el7.x86_64
qemu-kvm-1.5.3-141.el7_4.1.x86_64
glib2-2.50.3-3.el7.x86_64

How reproducible:
100% (only in my test environment)

Steps to Reproduce:
1. On a RHEL 7.4 machine (test env: 10.66.144.29/redhat), run:
# libguestfs-test-tool

Actual results:
The command hangs and exits abnormally.

Expected results:
... ...
===== TEST FINISHED OK =====

Additional info:
It is difficult to reproduce the problem on other machines.
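
When a saved log from libguestfs-test-tool needs a quick first check, one option is to look for the success banner quoted under "Expected results". The sketch below assumes the banner text is exactly `===== TEST FINISHED OK =====` as it appears in this report; `log_finished_ok` and `classify_log` are hypothetical helper names. A missing banner only shows that the run did not finish, not that a hang was the cause.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given log file contains the
# success banner quoted in this report's "Expected results".
log_finished_ok() {
    grep -q -F -e '===== TEST FINISHED OK =====' -- "$1"
}

# Hypothetical helper: print a one-word verdict for a log file.
# "incomplete" means the banner is missing; the log still has to be read
# to tell a hang apart from any other kind of failure.
classify_log() {
    if log_finished_ok "$1"; then
        echo "ok"
    else
        echo "incomplete"
    fi
}

# Example (commented out; assumes a log was saved to /tmp/log):
# classify_log /tmp/log
```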