Red Hat Bugzilla – Bug 765733
Error reporting when qemu terminates unexpectedly is inconsistent and sometimes unhelpful
Last modified: 2016-04-26 11:38:39 EDT
Description of problem:
Starting a guest fails with different errors when memory cannot be allocated.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.210.el6.x86_64
libvirt-0.9.8-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. On the host, check how much memory you can allocate:
[root@zhpeng ~]# sysctl -a | grep overcommit
vm.overcommit_memory = 0
[root@zhpeng ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7815       1243       6571          0         45        456
-/+ buffers/cache:        742       7073
Swap:         4095          0       4095

2. Define a guest with more memory than can be allocated; in this case we will use 20G:
...
<memory>20971520</memory>
<currentMemory>20971520</currentMemory>
...

3. Boot the guest:
[root@zhpeng ~]# while true; do virsh start kvm1; sleep 1; done

Actual results:
Step 3 produces 3 kinds of results:

error: Failed to start domain kvm1
error: Unable to read from monitor: Connection reset by peer

error: Failed to start domain kvm1
error: internal error Process exited while reading console log output: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory

error: Failed to start domain kvm1
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory

Expected results:
Only one kind of error message.
This is also reproducible with qemu-kvm-0.12.1.2-2.209.
These error messages represent errors that may happen while connecting to qemu's monitor. The two latter messages appear when qemu exits during the early initialisation phases. In that case we assume something went wrong and print the last line of qemu's log file. In most cases (there are some scenarios where this would not hold) the last line contains information relevant to the exit: here it is out of memory, but it might be something else.

The first message appears when the qemu process terminates after a successful connection to the monitor. This can happen after qemu has been running for a longer time, for example because it was killed. In that case the last log line would usually be inadequate.

A quick test shows that qemu returns exit code 134 if memory allocation fails. This could be used to check whether OOM caused qemu's death. I couldn't find any documentation on qemu's return values, but if code 134 is consistent for OOM cases, it could be a way to consolidate the error messages.
The error message comes from qemu_memalign():

void *qemu_memalign(size_t alignment, size_t size)
{
    void *ptr;
#if defined(_POSIX_C_SOURCE) && !defined(__sun__)
    int ret;
    ret = posix_memalign(&ptr, alignment, size);
    if (ret != 0) {
        fprintf(stderr, "Failed to allocate %zu B: %s\n",
                size, strerror(ret));
        abort();
    }
#elif defined(CONFIG_BSD)
    ptr = oom_check(valloc(size));
#else
    ptr = oom_check(memalign(alignment, size));
#endif
    trace_qemu_memalign(alignment, size, ptr);
    return ptr;
}

where

#if !defined(_POSIX_C_SOURCE) || defined(_WIN32) || defined(__sun__)
static void *oom_check(void *ptr)
{
    if (ptr == NULL) {
#if defined(_WIN32)
        fprintf(stderr, "Failed to allocate memory: %lu\n", GetLastError());
#else
        fprintf(stderr, "Failed to allocate memory: %s\n", strerror(errno));
#endif
        abort();
    }
    return ptr;
}
#endif

So the error message comes in a slightly different form on machines where we don't use posix_memalign().

abort() raises SIGABRT, and since the signal isn't caught, this terminates the process with exit status 134 (and dumps core). Upstream has basically the same code, arranged differently.

Note that we call abort() for numerous other fatal error conditions, both directly and via assert() and similar. So it's fair to say that asking for too much guest RAM will yield exit status 134 for a wide range of qemu versions, but the converse is not true: exit status 134 does *not* imply you asked for too much guest RAM.

Calling abort() when out of memory is somewhat controversial. Upstream feels strongly it's the right thing to do, except perhaps when allocating guest RAM. The place that handles out of memory doesn't know the difference, so guest RAM doesn't get special treatment. But that could be changed.

In that light, relying on exit status 134 doesn't seem advisable.
See also BZ 867921
This bug is about error reporting when qemu-kvm terminates unexpectedly. The reporter uses a sufficiently large guest memory size to make qemu-kvm terminate, but that's just one way to trigger unexpected termination. libvirt reports different errors depending on how it detects the termination. The bug report lists three detections:

1. qemuMonitorIORead(): read() fails with ECONNRESET

   error: Unable to read from monitor: Connection reset by peer

2. qemuProcessReadLogOutput(): saferead() succeeds after qemu-kvm died

   error: internal error Process exited while reading console log output: LOG

   where LOG is whatever this invocation of qemuProcessReadLogOutput() has read so far, with certain patterns removed (I don't understand that part). It can be multiple lines. The reporter got "char device redirected to /dev/pts/6\nFailed to allocate 21474836480 B: Cannot allocate memory\n".

3. qemuProcessWaitForMonitor(): detects that qemu-kvm died after qemuProcessReadLogOutput() succeeded

   error: internal error process exited while connecting to monitor: LOG

   where LOG is as in 2. Racy; if qemu dies a bit earlier, we get 2.

There might be more.

Aspects of the bug:

A. Cases 2 and 3 include the latest qemu output in the error message. With luck, the output is exactly the error message qemu printed before it died. But I'm afraid it could be more, or less, depending on timing.

B. Case 1's error message is useless. "Unable to read from monitor: Connection reset by peer" is unintelligible unless you know that ECONNRESET means qemu died. Even then, the message gives you no clue why it died, because it doesn't include any qemu output.

C. Additional ways to report qemu-kvm termination may exist, and their error reporting could need improvement, too.

Right now, this bug is about all three aspects. Updating the bug summary accordingly. If some of them need to be tracked separately, please clone this bug for them.

Non-aspect:
* qemu's error message when it can't allocate guest RAM could use some love.
qemu-kvm bug 867921 blocks this bug, but I really can't see how a different qemu error message and/or exit status could help libvirt with the problems described above. Therefore, I'm breaking the dependency. Feel free to restore it with an explanation of why you need the qemu-kvm change, preferably as a comment on bug 867921.
Dave Allan suggested adding this to this bug: OpenStack Nova calls into libvirt, which calls into qemu, which fails to start an instance, with the message:

QEMU error : internal error Process exited while reading console log output: chardev: opening backend "file" failed: Permission denied

So this tells us that qemu failed to open the "file" backend due to permission problems, but not *which* file it was trying to open when it failed. That information would be helpful.
For an example of how bad qemu is, qemu 1.4 would exit with success even when failing incoming migration. I'm hoping this patch makes qemu 1.5 and gets backported to RHEL: http://thread.gmane.org/gmane.comp.emulators.qemu/206758
I think the core problem is that we don't check for a qemu error when we already have an error about connecting to the monitor (usually -- this depends on the point at which qemu fails). The same problem happens, for example, in bug 959888, where qemu fails due to an invalid device being specified (which exists on different qemu versions). The problem is that we rely on qemu telling us that there was an invalid device/driver type, but we don't use its error message, and thus management applications (in this case virt-manager) cannot properly say what went wrong. Am I right that if we fix this error reporting, both of these issues (and possibly others as well) will be solved, or did I miss something?
*** Bug 892007 has been marked as a duplicate of this bug. ***
I think the core of this bug should be fixed by patches that are already present in 7.0 for bug 1001738. I'm marking this bug as TestOnly to confirm the error messages are clear now.
The "Connection reset by peer" error message seems to have disappeared when a guest fails to start. The output error messages now look like:

# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain r7
error: internal error: early end of file from monitor: possible problem: Cannot set up guest memory 'pc.ram': Cannot allocate memory

The error messages are useful now. But I found another case where the error message is not clear:

1. When shutting down a guest fails (use two terminals to do the shutdown at the same time):

# virsh shutdown r6 acpi
error: Failed to shutdown domain r6
error: Unable to read from monitor: Connection reset by peer

2. When a guest is creating a snapshot and is destroyed in another terminal:

# virsh snapshot-create-as test6 s105
error: Unable to read from monitor: Connection reset by peer

It seems the error messages are not clear when a guest shuts off.
An error message like this can still be seen when starting a guest:

1. Prepare a guest with 160 vcpus and set memtune:

# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>

2. Start it; sometimes it fails, and sometimes it starts and then the guest crashes:

# virsh start test3
error: Failed to start domain test3
error: Unable to read from monitor: Connection reset by peer

# virsh start test3
Domain test3 started

# virsh domstate test3 --reason
shut off (crashed)
What version of libvirt?
libvirt-1.2.8-2.el7.x86_64
Verified on build libvirt-1.2.8-9.el7.x86_64.

Steps:
1. Check free memory on the host:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3653         172        1259          16        2221        3205
Swap:          1023           0        1023

2. Edit a guest to set its memory larger than the free memory:

# virsh dumpxml rhel7
<domain type='kvm'>
  <name>rhel7</name>
  <uuid>49a3438a-70a3-4ba8-92ce-3a05e0934608</uuid>
  <description>desc</description>
  <memory unit='KiB'>81048576</memory>
  <currentMemory unit='KiB'>8048576</currentMemory>
  <vcpu placement='static'>1</vcpu>

3. Start the guest; it reports the error below:

# virsh start rhel7
error: Failed to start domain rhel7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory
Hi Jiri,

I have verified this bug following the steps in the description; please refer to comment 19. As for the similar errors found by lhuang, described in comments 15 and 16: I cannot reproduce comment 15, and I can reproduce comment 16 with the following result:

# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>

# virsh start vm1
error: Failed to start domain vm1
error: Unable to read from monitor: Connection reset by peer

I want to confirm this error with you: is it acceptable from libvirt's point of view? If not, do we need to file a new bug to track it?

Thanks
In this case kernel just kills QEMU because of the low memory limit. No message will be printed to qemu log about it so the error we currently report is the best one we can reasonably provide.
(In reply to Jiri Denemark from comment #21)
> In this case kernel just kills QEMU because of the low memory limit. No
> message will be printed to qemu log about it so the error we currently
> report is the best one we can reasonably provide.

Thanks for your explanation. I will move this bug to VERIFIED.
After updating libvirt to libvirt-1.2.8-15.el7.x86_64, I noticed a new error message when starting a guest fails (not always):

error: Failed to start domain test4
error: internal error: early end of file from monitor: possible problem: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: operation failed: domain is no longer running

error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: operation failed: domain is no longer running

error: Failed to start domain test4
error: operation failed: domain is no longer running
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0323.html