Bug 970047
Summary: | query mem info from monitor would cause qemu-kvm hang [RHEL-7] | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Xiaoqing Wei <xwei> | ||||||||
Component: | qemu-kvm | Assignee: | Laszlo Ersek <lersek> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | medium | ||||||||||
Version: | 7.0 | CC: | acathrow, hhuang, juzhang, michen, qzhang, shuang, virt-maint | ||||||||
Target Milestone: | rc | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | qemu-kvm-1.5.2-3.el7 | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | |||||||||||
: | 985334 (view as bug list) | Environment: | |||||||||
Last Closed: | 2014-06-13 13:29:23 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
I don't understand step 2 in comment 0: qemu#c Thanks. (In reply to Laszlo Ersek from comment #2) > I don't understand step 2 in comment 0: > > qemu#c > > Thanks. Hi Laszlo, the vm was launched w/ a '-S' option, that means paused state, and a 'c' would push it into running status. if launch the vm w/o a -S, then the step 2 is not necessary. Regards, Xiaoqing. Hi, another question: what tool generated the qemu-kvm command line? For example, where does "-no-kvm-pit-reinjection" come from, and why is it necessary? I'll try to reserve a similar machine in Beaker (given '-cpu host'). Thanks. (In reply to Laszlo Ersek from comment #4) > Hi, > > another question: what tool generated the qemu-kvm command line? For > example, where does "-no-kvm-pit-reinjection" come from, and why is it > necessary? > > I'll try to reserve a similar machine in Beaker (given '-cpu host'). > > Thanks. Hi Laszlo, the cmd is generated from autotest, to prevent some guest timedrift. Regards. The IO thread is spinning as follows: main_loop_wait() slirp_pollfds_fill() qemu_iohandler_fill() (1) os_host_main_loop_wait() glib_pollfds_fill() g_main_context_query() (2) g_poll() glib_pollfds_poll() qemu_iohandler_poll() slirp_pollfds_poll() At (1), just before entering os_host_main_loop_wait(), this is the contents of the gpollfds array (length 3): (gdb) print *(GPollFD (*)[3])gpollfds->data $24 = {{fd = 21, events = 25, revents = 0}, {fd = 11, events = 25, revents = 0}, {fd = 3, events = 25, revents = 0}} Inside glib_pollfds_fill(), the value of "glib_n_poll_fds" is huge: (gdb) print glib_n_poll_fds $26 = 2897 The new size of the array is 3 + 2897 == 2900. After glib_pollfds_fill() returns, ie. at (2), just before os_host_main_loop_wait() calls g_poll(), the contents of the array is (length 2900): (gdb) print *(GPollFD (*)[2900])gpollfds->data $34 = {{fd = 21, events = 25, revents = 0}, {fd = 11, events = 25, revents = 0}, {fd = 3, events = 25, revents = 0}, {fd = 5, events = 1, revents = 0}, {fd = 6, events = 1, revents = 0}, {fd = 7, events = 1, revents = 0}, {fd = 8, events = 1, revents = 0}, {fd = 16, events = 1, revents = 0}, {fd = 14, events = 1, revents = 0}, {fd = 13, events = 1, revents = 0}, {fd = 4, events = 1, revents = 0}, {fd = 1, events = 4, revents = 0} <repeats 2888 times>, {fd = 0, events = 1, revents = 0}} The standard output is placed 2888 times in the array. events = 4 is POLLOUT. (G_IO_OUT / GLIB_SYSDEF_POLLOUT are identity mapped to POLLOUT, most likely.) When we invoke g_poll() (which is only a thin wrapper around poll(); GPollFD being "struct pollfd" in fact), it returns -1, and sets errno to EINVAL (errno 22). The poll(2) manual says, EINVAL The nfds value exceeds the RLIMIT_NOFILE value. and do_sys_poll() in "fs/select.c" seems to confirm that that's the only EINVAL case. This instance of qemu-kvm was started with "-monitor stdio". Apparently, each one of the huge number of lines printed to the stdout monitor by "info tlb" leaks a GPollFD. Printing to the monitor is implemented by: tlb_info() ... print_pte() monitor_printf() monitor_vprintf() monitor_puts() monitor_flush() qemu_chr_fe_write() io_channel_send() (*) g_io_channel_write_chars() qemu_chr_fe_add_watch() fd_chr_add_watch() (*) g_io_create_watch() g_source_set_callback() g_source_attach() (*) (gdb) print default_mon->chr->chr_write $44 = (int (*)(struct CharDriverState *, const uint8_t *, int)) 0x7f6010ab29f0 <fd_chr_write> (gdb) print default_mon->chr->chr_add_watch $45 = (GSource *(*)(struct CharDriverState *, GIOCondition)) 0x7f6010ab2870 <fd_chr_add_watch> monitor_flush() intends to reinstall the watch (by calling qemu_chr_fe_add_watch()) only if the write in qemu_chr_fe_write() is partial, but I think that logic doesn't work for some reason. Something's wrong between monitor_flush() and io_channel_send(). The former doesn't handle negative return values (errors), it reinstalls the watch all the same. The latter (even in upstream!) fails to increment "offset" with "bytes_written" in some cases, possibly reporting a partial write to monitor_flush() rather than a full write. This would be consistent with the symptoms. The GLib documentation clearly states that "bytes_written" can be set to nonzero independently of g_io_channel_write_chars()'s retval. I'll write a patch for QE to test (apparently this test case can easily reproduce the problem). Created attachment 774372 [details]
proposed patch
Hmm, no. Even though I think io_channel_send() could be improved, that function doesn't fully explain how so many watches / GPollFDs accrue. Extraneously reinstalling a watch in monitor_flush() would not *increase* the number of watches, just keep it unchanged in the longer term, because monitor_unblocked(), the callback for the watch, returns FALSE, which removes the watch automatically. Perhaps the other call chain to monitor_flush() is the culprit. tlb_info() (indirectly) invokes monitor_flush() very many times in quick succession. (Each print_pte() call ends with a newline character.) mon->outbuf may keep growing and growing, hence the chance to a complete write decreases. After a point probably all writes are incomplete, and from that point onward each print_pte() -- ie. monitor_flush() -- call will add a brand new watch. On the non-callback-initiated call path, no watch should be added if there's already one watch in place. Indeed. See usbredir_write() and usbredir_write_unblocked(): they track the one watch in dev->watch. (Upstream commit c874ea97.) Exact same pattern in flush_buf() [hw/char/virtio-console.c] and chr_write_unblocked(), same file: they track vcon->watch. (Upstream commit c3d6b96e, again by Hans.) Created attachment 774410 [details]
proposed patch (additional)
Posted upstream series: http://thread.gmane.org/gmane.comp.emulators.qemu/222525 upstream commits: 1 ac8c26f char: io_channel_send: don't lose written bytes 2 293d2a0 monitor: maintain at most one G_IO_OUT watch Fix included in qemu-kvm-1.5.2-3.el7 Verified the bug on qemu-kvm-1.5.3-31.el7.x86_64. Steps: (1) In terminal A, issue the following commands: mkfifo fifo.in fifo.out /usr/libexec/qemu-kvm -chardev pipe,id=fifo,path=fifo \ -mon chardev=fifo,default (2) In terminal B, issue the following command (same directory): cat fifo.out (3) In terminal C, issue the following command (same directory): cat >fifo.in (4) Still in terminal C, type the following command, and verify that its output appears in terminal B: info registers (5) In terminal B, press ^Z (ie. stop (but do not kill) the "cat" process reading from "fifo.out"). (6) In terminal C, repeat the following command indefinitely (it's simples to keep pasting it from the clipboard): info registers Result: Qemu process does not die after 300 times "info registers" input. So, this issue is fixed. This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request. |
Created attachment 756263 [details] gdb -p `pidof qemu-kvm` thread apply all bt full Description of problem: query mem info from monitor would cause qemu-kvm hang Version-Release number of selected component (if applicable): qemu-kvm-1.5.0-2.el7.x86_64 kernel-3.9.0-0.55.el7.x86_64 How reproducible: 5/5 Steps to Reproduce: 1. qemu-kvm -monitor stdio -S \ -name 'vm1' \ -chardev socket,id=qmp_id_qmpmonitor1,path=/tmp/monitor-qmpmonitor1-20130530-191349-ZMSDl1gm,server,nowait \ -mon chardev=qmp_id_qmpmonitor1,mode=control \ -chardev socket,id=serial_id_serial1,path=/tmp/serial-serial1-20130530-191349-ZMSDl1gm,server,nowait \ -device isa-serial,chardev=serial_id_serial1 \ -chardev socket,id=seabioslog_id_20130530-191349-ZMSDl1gm,path=/tmp/seabios-20130530-191349-ZMSDl1gm,server,nowait \ -device isa-debugcon,chardev=seabioslog_id_20130530-191349-ZMSDl1gm,iobase=0x402 \ -device ich9-usb-uhci1,id=usb1,bus=pci.0,addr=0x4 \ -drive file='/home/staf-kvm-devel/autotest-devel/client/tests/virt/shared/data/images/RHEL-Server-7.0-64.qcow2',if=none,id=drive-ide0-0-0,media=disk,cache=none,snapshot=off,format=qcow2,aio=native \ -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0 \ -device rtl8139,netdev=idIg594w,mac=9a:3b:3c:3d:3e:3f,bus=pci.0,addr=0x3,id='idCSLDpm' \ -netdev tap,id=idIg594w \ -m 4096 \ -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \ -cpu 'host' \ -M pc \ -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \ -vnc :0 \ -vga cirrus \ -rtc base=utc,clock=host,driftfix=slew \ -boot order=cdn,once=c,menu=off \ -no-kvm-pit-reinjection \ -enable-kvm 2.qemu#c 3. repeatly do info mem info tlb Actual results: qemu-kvm hangs Expected results: qemu-kvm work well Additional info: processor : 23 vendor_id : GenuineIntel cpu family : 6 model : 45 model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz stepping : 7 microcode : 0x70d cpu MHz : 1200.000 cache size : 15360 KB physical id : 1 siblings : 12 core id : 5 cpu cores : 6 apicid : 43 initial apicid : 43 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4004.81 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: