Red Hat Bugzilla – Bug 765733
Error reporting when qemu terminates unexpectedly is inconsistent and sometimes unhelpful
Last modified: 2016-04-26 11:38:39 EDT
Description of problem:
Starting a guest fails with different errors when memory cannot be allocated.

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.210.el6.x86_64
libvirt-0.9.8-1.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. On the host, check how much memory you can allocate:
[root@zhpeng ~]# sysctl -a | grep overcommit
vm.overcommit_memory = 0
[root@zhpeng ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7815       1243       6571          0         45        456
-/+ buffers/cache:        742       7073
Swap:         4095          0       4095

2. Define a guest with more memory than can be allocated; in this case we will use 20G:
...
<memory>20971520</memory>
<currentMemory>20971520</currentMemory>
...

3. Boot the guest:
[root@zhpeng ~]# while true; do virsh start kvm1; sleep 1; done

Actual results:
Step 3 produces 3 kinds of results:

error: Failed to start domain kvm1
error: Unable to read from monitor: Connection reset by peer

error: Failed to start domain kvm1
error: internal error Process exited while reading console log output: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory

error: Failed to start domain kvm1
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/6
Failed to allocate 21474836480 B: Cannot allocate memory

Expected results:
Only one kind of error message.
This is also reproducible with qemu-kvm-0.12.1.2-2.209.
These error messages represent errors that may happen while connecting to qemu's monitor. The two latter messages appear when qemu exits during the early initialisation phases. In that case we assume something went wrong and print the last line of qemu's log file. In most cases (there are some scenarios where this would not hold) the last line contains information relevant to the exit: here it is out of memory, but it might be something else.

The first message appears when the qemu process terminates after a successful connection to the monitor. This can happen after qemu has been running for a longer time, for example because it was killed. In that case the last log line would usually be inadequate.

A quick test shows that qemu returns exit code 134 if memory allocation fails. This could be used to check whether OOM caused qemu's death. I couldn't find any documentation on qemu's return values, but if code 134 is consistent for OOM cases, it could be a way to consolidate the error messages.
The error message comes from qemu_memalign():

void *qemu_memalign(size_t alignment, size_t size)
{
    void *ptr;
#if defined(_POSIX_C_SOURCE) && !defined(__sun__)
    int ret;
    ret = posix_memalign(&ptr, alignment, size);
    if (ret != 0) {
        fprintf(stderr, "Failed to allocate %zu B: %s\n",
                size, strerror(ret));
        abort();
    }
#elif defined(CONFIG_BSD)
    ptr = oom_check(valloc(size));
#else
    ptr = oom_check(memalign(alignment, size));
#endif
    trace_qemu_memalign(alignment, size, ptr);
    return ptr;
}

where

#if !defined(_POSIX_C_SOURCE) || defined(_WIN32) || defined(__sun__)
static void *oom_check(void *ptr)
{
    if (ptr == NULL) {
#if defined(_WIN32)
        fprintf(stderr, "Failed to allocate memory: %lu\n", GetLastError());
#else
        fprintf(stderr, "Failed to allocate memory: %s\n", strerror(errno));
#endif
        abort();
    }
    return ptr;
}
#endif

So the error message comes in a slightly different form on machines where we don't use posix_memalign().

abort() raises SIGABRT, and since the signal isn't caught, this terminates the process with exit status 134 (and dumps core). Upstream has basically the same code, arranged differently.

Note that we call abort() for numerous other fatal error conditions, both directly and via assert() and similar. So it's fair to say that asking for too much guest RAM will yield exit status 134 for a wide range of qemu versions, but the converse is not true: exit status 134 does *not* imply you asked for too much guest RAM.

Calling abort() when out of memory is somewhat controversial. Upstream feels strongly it's the right thing to do, except perhaps when allocating guest RAM. The place that handles out of memory doesn't know the difference, so guest RAM doesn't get special treatment. But that could be changed.

In that light, relying on exit status 134 doesn't seem advisable.
See also BZ 867921
This bug is about error reporting when qemu-kvm terminates unexpectedly. The reporter uses a sufficiently large guest memory size to make qemu-kvm terminate, but that's just one way to trigger unexpected termination. libvirt reports different errors depending on how it detects the termination. The bug report lists three detections:

1. qemuMonitorIORead(): read() fails with ECONNRESET

   error: Unable to read from monitor: Connection reset by peer

2. qemuProcessReadLogOutput(): saferead() succeeds after qemu-kvm died

   error: internal error Process exited while reading console log output: LOG

   where LOG is whatever this invocation of qemuProcessReadLogOutput() has read so far, with certain patterns removed (I don't understand that part). It can be multiple lines. The reporter got "char device redirected to /dev/pts/6\nFailed to allocate 21474836480 B: Cannot allocate memory\n".

3. qemuProcessWaitForMonitor(): detects that qemu-kvm died after qemuProcessReadLogOutput() succeeded

   error: internal error process exited while connecting to monitor: LOG

   where LOG is as in 2. Racy; if qemu dies a bit earlier, we get 2.

There might be more.

Aspects of the bug:

A. Cases 2 and 3 include the latest qemu output in the error message. With luck, the output is exactly the error message qemu printed before it died. But I'm afraid it could be more, or less, depending on timing.

B. Case 1's error message is useless. "Unable to read from monitor: Connection reset by peer" is unintelligible unless you know that ECONNRESET means qemu died. Even then, the message gives you no clue why it died, because it doesn't include any qemu output.

C. Additional ways to report qemu-kvm termination may exist, and their error reporting could need improvement, too.

Right now, this bug is about all three aspects. Updating the bug summary accordingly. If some of them need to be tracked separately, please clone this bug for them.

Non-aspect:
* qemu's error message when it can't allocate guest RAM could use some love.
qemu-kvm bug 867921 blocks this bug, but I really can't see how a different qemu error message and/or exit status could help libvirt with the problems described above. Therefore, I'm breaking the dependency. Feel free to restore it with an explanation of why you need the qemu-kvm change, preferably as a comment on bug 867921.
Dave Allan suggested adding this to this bug: OpenStack Nova calls into libvirt, which calls into qemu, which fails to start an instance, with the message:

QEMU error : internal error Process exited while reading console log output: chardev: opening backend "file" failed: Permission denied

So this tells us that qemu failed to open the "file" backend due to permission problems, but not *which* file it was trying to open when it failed. That information would be helpful.
For an example of how bad qemu is, qemu 1.4 would exit with success even when failing incoming migration. I'm hoping this patch makes qemu 1.5 and gets backported to RHEL: http://thread.gmane.org/gmane.comp.emulators.qemu/206758
I think the core problem is that we don't check for a qemu error when we already have an error about connecting to the monitor (usually -- this depends on the point at which qemu fails). The same problem happens, for example, in bug 959888, where qemu fails due to an invalid device being specified (which exists on different qemu versions). The problem is that we rely on qemu telling us that there was an invalid device/driver type, but we don't use its error message, and thus management applications (in this case virt-manager) cannot properly say what went wrong. Am I right that if we fix this error reporting, both of these issues (and possibly others as well) will be solved, or did I miss something?
*** Bug 892007 has been marked as a duplicate of this bug. ***
I think the core of this bug should be fixed by patches that are already present in 7.0 for bug 1001738. I'm marking this bug as TestOnly to confirm the error messages are clear now.
The "Connection reset by peer" error message seems to have disappeared when a guest fails to start. The output error messages now look like:

# virsh start r7
error: Failed to start domain r7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain r7
error: internal error: early end of file from monitor: possible problem: Cannot set up guest memory 'pc.ram': Cannot allocate memory

The error messages are useful now. But I found another case where the error message is not clear:

1. When shutting down a guest fails (use two terminals to do the shutdown at the same time):

# virsh shutdown r6 acpi
error: Failed to shutdown domain r6
error: Unable to read from monitor: Connection reset by peer

2. When a guest is creating a snapshot and is destroyed in another terminal:

# virsh snapshot-create-as test6 s105
error: Unable to read from monitor: Connection reset by peer

It seems the error messages are not clear when a guest shuts off.
An error message like this can still be seen when starting a guest:

1. Prepare a guest with 160 vcpus and set memtune:

# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>

2. Start it; sometimes it fails, and sometimes it starts and then the guest crashes:

# virsh start test3
error: Failed to start domain test3
error: Unable to read from monitor: Connection reset by peer

# virsh start test3
Domain test3 started

# virsh domstate test3 --reason
shut off (crashed)
What version of libvirt?
libvirt-1.2.8-2.el7.x86_64
Verified on build libvirt-1.2.8-9.el7.x86_64.

Steps:
1. Check free memory on the host:

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3653         172        1259          16        2221        3205
Swap:          1023           0        1023

2. Edit a guest to set its memory larger than the free memory:

# virsh dumpxml rhel7
<domain type='kvm'>
  <name>rhel7</name>
  <uuid>49a3438a-70a3-4ba8-92ce-3a05e0934608</uuid>
  <description>desc</description>
  <memory unit='KiB'>81048576</memory>
  <currentMemory unit='KiB'>8048576</currentMemory>
  <vcpu placement='static'>1</vcpu>

3. Start the guest; it reports the error below:

# virsh start rhel7
error: Failed to start domain rhel7
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory
Hi Jiri,

I have verified this bug following the steps in the description; please refer to comment 19. As for the similar errors found by lhuang, described in comments 15 and 16: I cannot reproduce comment 15, and I can reproduce comment 16 with the following result:

# virsh dumpxml test3
<domain type='kvm'>
  <name>test3</name>
  <uuid>7347d748-f7ce-448f-8d49-3d29c9bcac30</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>714400</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>23112</hard_limit>
    <soft_limit unit='KiB'>1</soft_limit>
    <swap_hard_limit unit='KiB'>23411</swap_hard_limit>
  </memtune>
  <vcpu placement='static'>160</vcpu>

# virsh start vm1
error: Failed to start domain vm1
error: Unable to read from monitor: Connection reset by peer

I want to confirm this error with you: is it acceptable from libvirt's point of view? If not, do we need to file a new bug to track it?

Thanks
In this case kernel just kills QEMU because of the low memory limit. No message will be printed to qemu log about it so the error we currently report is the best one we can reasonably provide.
(In reply to Jiri Denemark from comment #21)
> In this case kernel just kills QEMU because of the low memory limit. No
> message will be printed to qemu log about it so the error we currently
> report is the best one we can reasonably provide.

Thanks for your explanation. I will move this bug to VERIFIED.
After updating libvirt to libvirt-1.2.8-15.el7.x86_64, I noticed a new error message when starting a guest fails (not always):

error: Failed to start domain test4
error: internal error: early end of file from monitor: possible problem: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: operation failed: domain is no longer running

error: Failed to start domain test4
error: internal error: process exited while connecting to monitor: Cannot set up guest memory 'pc.ram': Cannot allocate memory

error: Failed to start domain test4
error: operation failed: domain is no longer running

error: Failed to start domain test4
error: operation failed: domain is no longer running
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0323.html