Bug 765733 - Error reporting when qemu terminates unexpectedly is inconsistent and sometimes unhelpful
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.0
Hardware: x86_64 Linux
Priority: medium  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Eric Blake
QA Contact: Virtualization Bugs
Keywords: TestOnly
Duplicates: 892007
Depends On:
Blocks:

Reported: 2011-12-09 04:42 EST by zhpeng
Modified: 2016-04-26 11:38 EDT
CC: 13 users

Doc Type: Bug Fix
Last Closed: 2015-03-05 02:19:08 EST
External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0323 normal SHIPPED_LIVE Low: libvirt security, bug fix, and enhancement update 2015-03-05 07:10:54 EST

Description zhpeng 2011-12-09 04:42:42 EST
Description of problem:
Starting a guest fails with different error messages when memory cannot be allocated.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.210.el6.x86_64
libvirt-0.9.8-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. On the host, check how much memory can be allocated:
   [root@zhpeng ~]# sysctl -a|grep overcommit
   vm.overcommit_memory = 0
   [root@zhpeng ~]# free -m
                 total       used       free     shared    buffers     cached
   Mem:          7815       1243       6571          0         45        456
   -/+ buffers/cache:        742       7073
   Swap:         4095          0       4095

2. Define a guest with more memory than can be allocated; in this case we will use 20G:
    ...
  <memory>20971520</memory>
  <currentMemory>20971520</currentMemory>
    ...
3. Boot the guest:
  [root@zhpeng ~]# while true; do virsh start kvm1; sleep 1; done

Actual results:
  Step 3 produces three kinds of results:

error: Failed to start domain kvm1
error: Unable to read from monitor: Connection reset by peer


error: Failed to start domain kvm1
error: internal error Process exited while reading console log output: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory


error: Failed to start domain kvm1
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory


Expected results:
  Only one kind of error message.
Comment 2 zhpeng 2011-12-11 21:52:36 EST
qemu-kvm-0.12.1.2-2.209 reproduces this too.
Comment 3 Peter Krempa 2012-01-03 08:20:58 EST
These error messages represent errors that may happen while connecting to qemu's monitor.

The latter two messages appear when qemu exits during its early initialisation phases. We then assume something went wrong and print the last line of qemu's log file. In most cases (though there are some possibilities where this would not hold) that last line contains information relevant to the exit, here an out-of-memory condition, but it might be something else.

The first message appears when the qemu process terminates after a successful connection to the monitor. This can happen after qemu has been running for a long time and, for example, exits because it is killed. In that case the last log line would usually be inadequate.

A quick test shows that qemu exits with status 134 when memory allocation fails. This could be used to check whether OOM is the cause of qemu's death. I couldn't find any documentation on qemu's exit status values, but if status 134 is consistent for OOM cases this could be a way to consolidate the error messages.
Comment 4 Markus Armbruster 2012-10-29 11:03:28 EDT
The error message comes from qemu_memalign():

void *qemu_memalign(size_t alignment, size_t size)
{
    void *ptr;
#if defined(_POSIX_C_SOURCE) && !defined(__sun__)
    int ret;
    ret = posix_memalign(&ptr, alignment, size);
    if (ret != 0) {
        fprintf(stderr, "Failed to allocate %zu B: %s\n",
                size, strerror(ret));
        abort();
    }
#elif defined(CONFIG_BSD)
    ptr = oom_check(valloc(size));
#else
    ptr = oom_check(memalign(alignment, size));
#endif
    trace_qemu_memalign(alignment, size, ptr);
    return ptr;
}

where

#if !defined(_POSIX_C_SOURCE) || defined(_WIN32) || defined(__sun__)
static void *oom_check(void *ptr)
{
    if (ptr == NULL) {
#if defined(_WIN32)
        fprintf(stderr, "Failed to allocate memory: %lu\n", GetLastError());
#else
        fprintf(stderr, "Failed to allocate memory: %s\n", strerror(errno));
#endif
        abort();
    }
    return ptr;
}
#endif

So the error message comes in a slightly different form on machines where we don't use posix_memalign().

abort() unblocks SIGABRT and raises it, and since the signal isn't caught, the process is killed by SIGABRT, which the shell reports as exit status 134 (and dumps core).

Upstream has basically the same code arranged differently.

Note that we call abort() for numerous other fatal error conditions, both directly and via assert() and similar.  So it's fair to say that asking for too much guest RAM will yield exit status 134 for a wide range of qemu versions, but the converse is not true: exit status 134 does *not* imply you asked for too much guest RAM.

Calling abort() when out of memory is somewhat controversial.  Upstream feels strongly it's the right thing to do, except perhaps when allocating guest RAM.  The place that handles out of memory doesn't know the difference, so guest RAM doesn't get special treatment.  But that could be changed.  In that light, relying on exit status 134 doesn't seem advisable.
Comment 5 Dave Allan 2012-10-29 11:35:33 EDT
See also BZ 867921
Comment 7 Markus Armbruster 2012-11-16 04:32:57 EST
This bug is about error reporting when qemu-kvm terminates unexpectedly.  The reporter uses a sufficiently large guest memory size to make qemu-kvm terminate, but that's just one way to trigger unexpected termination.

libvirt reports different errors, depending on how it detects the termination.  The bug report lists three detections:

1. qemuMonitorIORead(): read() fails with ECONNRESET
   error: Unable to read from monitor: Connection reset by peer

2. qemuProcessReadLogOutput(): saferead() succeeds after qemu-kvm died
   error: internal error Process exited while reading console log output: LOG
where LOG is whatever this invocation of qemuProcessReadLogOutput() has read so far, with certain patterns removed (I don't understand that part).  Can be multiple lines.  The reporter got "char device redirected to /dev/pts/6\nFailed to allocate 21474836480 B: Cannot allocate memory\n"

3. qemuProcessWaitForMonitor(): detects qemu-kvm died after qemuProcessReadLogOutput() succeeded
   error: internal error process exited while connecting to monitor: LOG
where LOG is as in 2.  Racy; if it dies a bit earlier, we get 2.

There might be more.

Aspects of the bug:

A. Cases 2 and 3 include the latest qemu output in the error message. With luck, the output is exactly the error message qemu printed before it died.  But I'm afraid it could be more, or less, depending on timing.

B. Case 1's error message is useless.  "Unable to read from monitor: Connection reset by peer" is unintelligible unless you know that ECONNRESET means qemu died.  Even then, the message gives you no clue why it dies, because it doesn't include any qemu output.

C. Additional ways to report qemu-kvm termination may exist, and their error reporting could need improvement, too.

Right now, this bug is about all three aspects.  Updating bug summary accordingly.  If some of them need to be tracked separately, please clone this bug for them.

Non-aspect:

* qemu's error message when it can't allocate guest RAM could use some love.  qemu-kvm bug 867921.  Blocks this bug, but I really can't see how a different qemu error message and/or exit status could help libvirt with the problems described above.  Therefore, I'm breaking the dependency.  Feel free to restore it with an explanation why you need the qemu-kvm change, preferably as comment on bug 867921.
Comment 8 David Ripton 2013-03-20 15:13:13 EDT
Dave Allan suggested adding this to this bug:

OpenStack Nova calls into libvirt which calls into qemu which fails to start an instance, with message:

QEMU error : internal error Process exited while reading console log output: chardev: opening backend "file" failed: Permission denied

So this tells us that qemu failed to open the "file" backend due to permission problems, but not *which* file it was trying to open when it failed.  That information would be helpful.
Comment 9 Eric Blake 2013-04-16 17:59:26 EDT
For an example of how bad qemu is, qemu 1.4 would exit with success even when failing incoming migration.  I'm hoping this patch makes qemu 1.5 and gets backported to RHEL:

http://thread.gmane.org/gmane.comp.emulators.qemu/206758
Comment 11 Martin Kletzander 2013-05-10 08:39:16 EDT
I think the core problem is that we don't check for a qemu error when we already have an error about connecting to the monitor (usually; this depends on the point at which qemu fails).  The same problem happens, for example, in bug 959888, where qemu fails because an invalid device is specified (this occurs on various qemu versions).  We rely on qemu telling us there was an invalid device/driver type, but we don't use its error message, so management applications (in this case virt-manager) cannot properly say what went wrong.  Am I right that fixing this error reporting would solve both of these issues (and possibly others), or did I miss something?
Comment 12 Eric Blake 2013-07-12 15:21:04 EDT
*** Bug 892007 has been marked as a duplicate of this bug. ***
Comment 14 Jiri Denemark 2014-07-21 11:42:11 EDT
I think the core of this bug should be fixed by patches that are already present in 7.0 for bug 1001738. I'm marking this bug as TestOnly to confirm the error messages are clear now.
Comment 15 Luyao Huang 2014-08-12 22:11:05 EDT
I think the "Connection reset by peer" error message no longer appears when a guest fails to start. The output now looks like:
# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain r7
error: internal error: early end of file from monitor: possible problem:
Cannot set up guest memory 'pc.ram': Cannot allocate memory


The error messages are useful now, but I found other cases where the error message is unclear:

1. When shutting down a guest fails (run shutdown in two terminals at the same time):

 # virsh shutdown r6 acpi
error: Failed to shutdown domain r6
error: Unable to read from monitor: Connection reset by peer

2. When a guest is creating a snapshot and is destroyed from another terminal:

# virsh snapshot-create-as test6 s105
error: Unable to read from monitor: Connection reset by peer

It seems the error messages are not clear when a guest shuts off.
Comment 16 Luyao Huang 2014-09-18 04:29:11 EDT
I can still get an error message like this when starting a guest:

1. Prepare a guest with 160 vCPUs and memtune limits:
# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>

2. Start it; sometimes it fails to start, and sometimes it starts and then the guest crashes:

# virsh start test3
error: Failed to start domain test3
error: Unable to read from monitor: Connection reset by peer

# virsh start test3
Domain test3 started

# virsh domstate test3 --reason
shut off (crashed)
Comment 17 Jiri Denemark 2014-09-18 04:53:42 EDT
What version of libvirt?
Comment 18 Luyao Huang 2014-09-18 05:06:27 EDT
libvirt-1.2.8-2.el7.x86_64
Comment 19 vivian zhang 2014-12-01 03:32:01 EST
Verified on build libvirt-1.2.8-9.el7.x86_64.

Steps:

1. Check free memory on the host:
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3653         172        1259          16        2221        3205
Swap:          1023           0        1023

2. Edit a guest to set its memory larger than the free memory:
# virsh dumpxml rhel7
<domain type='kvm'>
  <name>rhel7</name>
  <uuid>49a3438a-70a3-4ba8-92ce-3a05e0934608</uuid>
  <description>desc</description>
  <memory unit='KiB'>81048576</memory>
  <currentMemory unit='KiB'>8048576</currentMemory>
  <vcpu placement='static'>1</vcpu>

3. Start the guest; it reports the error below:
# virsh start rhel7
error: Failed to start domain rhel7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory
Comment 20 vivian zhang 2014-12-09 03:37:38 EST
Hi Jiri,

I have verified this bug using the steps from the description; please refer to comment 19.

But regarding the similar errors lhuang found in comments 15 and 16: I cannot reproduce comment 15, and I can reproduce comment 16 with the following result:
# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>
# virsh start vm1
error: Failed to start domain vm1
error: Unable to read from monitor: Connection reset by peer

I want to confirm this error with you: is it acceptable from libvirt's point of view? If not, do we need to file a new bug to track it?

Thanks
Comment 21 Jiri Denemark 2014-12-09 03:54:06 EST
In this case kernel just kills QEMU because of the low memory limit. No message will be printed to qemu log about it so the error we currently report is the best one we can reasonably provide.
Comment 22 vivian zhang 2014-12-09 04:09:50 EST
(In reply to Jiri Denemark from comment #21)
> In this case kernel just kills QEMU because of the low memory limit. No
> message will be printed to qemu log about it so the error we currently
> report is the best one we can reasonably provide.

Thanks for your explanation; I will move this bug to VERIFIED.
Comment 23 Luyao Huang 2015-01-26 04:02:29 EST
After updating libvirt to libvirt-1.2.8-15.el7.x86_64, I noticed a new error message when starting a guest fails (not every attempt hits this error):

error: Failed to start domain test4
error: internal error: early end of file from monitor: possible problem:
Cannot set up guest memory 'pc.ram': Cannot allocate memory


error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory


error: Failed to start domain test4
error: operation failed: domain is no longer running


error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory


error: Failed to start domain test4
error: operation failed: domain is no longer running


error: Failed to start domain test4
error: operation failed: domain is no longer running
Comment 25 errata-xmlrpc 2015-03-05 02:19:08 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html
