Bug 1212722 - qemu-kvm crashes with iwp->src == NULL in io_watch_poll_finalize
Summary: qemu-kvm crashes with iwp->src == NULL in io_watch_poll_finalize
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: glib2
Version: 6.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Colin Walters
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1172231 1269194
Reported: 2015-04-17 08:25 UTC by Takayuki Nagata
Modified: 2020-05-14 14:58 UTC

Fixed In Version: glib2-2.28.8-6.el6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-21 09:02:18 UTC
Target Upstream Version:
Embargoed:


Attachments
backport of upstream patches (2.73 KB, patch), 2015-09-28 13:50 UTC, Paolo Bonzini
qemu: Port to GSource API (5.59 KB, patch), 2015-09-28 14:41 UTC, Colin Walters


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 1412633 | 0 | None | None | None | Never
Red Hat Product Errata | RHBA-2017:0567 | 0 | normal | SHIPPED_LIVE | glib2 bug fix update | 2017-03-21 12:22:56 UTC

Comment 2 Ademar Reis 2015-04-17 17:19:34 UTC
The code that includes the assertion was introduced by Paolo back in 2013, in this commit:

commit 2b316774f60291f57ca9ecb6a9f0712c532cae34
Author: Paolo Bonzini <pbonzini>
Date:   Fri Apr 19 17:32:09 2013 +0200

    qemu-char: do not operate on sources from finalize callbacks
    
    Due to a glib bug, the finalize callback is called with the GMainContext
    lock held.  Thus, any operation on the context from the callback will
    cause recursive locking and a deadlock.  This happens, for example,
    when a client disconnects from a socket chardev.
    
    The fix for this is somewhat ugly, because we need to forego polymorphism
    and implement our own function to destroy IOWatchPoll sources.  The
    right thing to do here would be child sources, but we support older
    glib versions that do not have them.  Not coincidentially, glib developers
    found and fixed the deadlock as part of implementing child sources.
    
    Signed-off-by: Paolo Bonzini <pbonzini>
    Tested-by: Sander Eikelenboom <linux>
    Message-id: 1366385529-10329-5-git-send-email-pbonzini
    Signed-off-by: Anthony Liguori <aliguori.com>
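
To illustrate the deadlock that the commit message describes, here is a minimal Python sketch. It is a toy model, not QEMU or glib code: ToyContext, remove_source and unref_last are hypothetical names, threading.Lock stands in for the non-recursive GMainContext lock, and a non-blocking acquire is used so that the sketch reports the would-be deadlock instead of hanging.

```python
import threading


class ToyContext:
    """Hypothetical stand-in for a main context guarded by a non-recursive lock."""

    def __init__(self):
        self.lock = threading.Lock()  # not re-entrant, like the lock in the report
        self.sources = []

    def remove_source(self, source):
        # Every context operation needs the lock. A blocking acquire would hang
        # forever if the caller already holds it, so this sketch only tries once.
        if not self.lock.acquire(blocking=False):
            return False  # this is where the real code would deadlock
        try:
            if source in self.sources:
                self.sources.remove(source)
            return True
        finally:
            self.lock.release()

    def unref_last(self, source, finalize):
        # Models the behaviour described above: finalize runs with the lock held.
        with self.lock:
            return finalize(self, source)


if __name__ == "__main__":
    ctx = ToyContext()
    ctx.sources.append("child")
    # A finalize callback that touches the context cannot take the lock.
    print(ctx.unref_last("watch", lambda c, s: c.remove_source("child")))
    # The same operation succeeds once the lock is no longer held.
    print(ctx.remove_source("child"))
```

In this model the first call reports failure because the lock is already held by the caller of finalize, and the second succeeds. That is why the commit avoids touching sources from the finalize callback.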

Comment 3 Paolo Bonzini 2015-04-21 16:13:46 UTC
The source that is removed in pty_chr_rearm_timer should not have been an io_watch_poll source, so the assertion is correct.  I cannot find anything wrong in the code.

Is it possible to get the core files?

Comment 8 Adrian 2015-09-23 16:45:14 UTC
We've been having this issue on a semi-regular basis and I'd like to help troubleshoot. What files/logs/etc would be helpful?

Comment 10 Paolo Bonzini 2015-09-25 12:24:23 UTC
Adrian,

please attach to this bug any backtraces that you can get.  It would be very helpful to know if they all look the same or they are different.

Comment 12 Paolo Bonzini 2015-09-28 13:00:18 UTC
This is upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=687098.

Backporting the patches from https://bugzilla.gnome.org/show_bug.cgi?id=724839 would actually be better though, as they are simpler.  I'm building a modified glib package now.

-------------------------------------------
Reproduction steps from CentOS bug tracker:
-------------------------------------------

1. Add vm-channels to the guest
  a. virsh -c qemu:///system edit {guest-name}
  b. Add the following snippet of xml under devices:

    <channel type='pty'>
      <target type='virtio' name='inbound'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='pty'>
      <target type='virtio' name='outbound'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>

  c. Fully shutdown the guest (if not already). Start the guest.

2. Open the vm channel for reading on the host side

  a. virsh -c qemu:///system dumpxml {guest-name}
  b. Find the snippet that looks like your outbound channel:

    <channel type='pty'>
      <source path='/dev/pts/4'/>
      <target type='virtio' name='outbound'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>

  c. cat /dev/pts/N (in this case, /dev/pts/4) (this call will block)

3. In your guest, start sending data to the outbound vm-channel

  a. This bash script can be used:

    #!/bin/sh
    (while true; do
      echo 123456789012345678901234567890123456789012345678901234567890123
      sleep 1
    done) > /dev/virtio-ports/aapps_outbound

4. Kill your read side cat (on the host side)

5. Let this run. This may take hours, but it will crash eventually.
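
Step 2b can also be scripted. The following is a minimal sketch that pulls the pty path for a named channel out of "virsh dumpxml" output using Python's standard XML parser. The helper name find_channel_pty and the trimmed SAMPLE document are illustrative assumptions, not part of the original report.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical domain XML containing only the outbound channel from step 2b.
SAMPLE = """<domain type='kvm'>
  <devices>
    <channel type='pty'>
      <source path='/dev/pts/4'/>
      <target type='virtio' name='outbound'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
  </devices>
</domain>"""


def find_channel_pty(domain_xml, target_name):
    """Return the host pty path of the pty channel whose target has the given name."""
    root = ET.fromstring(domain_xml)
    for channel in root.iter("channel"):
        if channel.get("type") != "pty":
            continue
        target = channel.find("target")
        source = channel.find("source")
        if target is None or source is None:
            continue  # channel not started yet, or not a named virtio target
        if target.get("name") == target_name:
            return source.get("path")
    return None


if __name__ == "__main__":
    print(find_channel_pty(SAMPLE, "outbound"))
```

The returned path is the one to pass to cat in step 2c. The function returns None when the guest is not running, because the source element is only present while the pty is allocated.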

Comment 13 Colin Walters 2015-09-28 13:37:13 UTC
Hmm.  Is this one source created per write on a virtio channel?  That sounds bad.

Have you considered porting qemu to the GSource* API instead of the "convenience" API?

Comment 14 Paolo Bonzini 2015-09-28 13:50:28 UTC
Created attachment 1077939 [details]
backport of upstream patches

Comment 15 Paolo Bonzini 2015-09-28 14:00:09 UTC
The source is only created when flow control kicks in, which should be rare.  This testcase is more or less the worst case because the host side "cat" was terminated and the guest keeps writing.

> Have you considered porting qemu to the GSource* API instead of the 
> "convenience" API?

Yes, but replacing every g_*_add with g_source_new + g_source_set_callback + g_source_attach is a bit clunky.  The incentive to do it upstream is low, since it is already fixed in upstream glib and it only happens very rarely.  RHEL7 also works, which makes a QEMU patch even less appealing.  Plus I have done (though not tested) the backport already. :)

Comment 16 Colin Walters 2015-09-28 14:41:06 UTC
Created attachment 1077943 [details]
qemu: Port to GSource API

Comment 17 Colin Walters 2015-09-28 14:45:13 UTC
It wasn't too bad to type out the patch, though I didn't test it.  What do you think?

Your backport of the glib patch looks sane, but I see these costs:
 - It's in a very complex area of the code, there is risk of regression
 - glib2 in RHEL6 is old, but a lot of things use it

If there were *two* known programs broken by the glib2 overflow, I think that would argue strongly for a backport.  (I'm sure there are more, but hopefully most everyone is on RHEL7 now)

Comment 19 Paolo Bonzini 2015-09-28 15:08:26 UTC
That patch also looks sane, but the very same objections apply to it (complex area of the code / risk of regression, old codebase).

https://bugzilla.gnome.org/show_bug.cgi?id=684526 shows another guy that stumbled upon the same issue.  In general it seems to me that the bug will be hard to debug (because it's in a library that's generally taken for granted, and because it's hard to reproduce) so I'd really prefer to have it fixed for real.

While I agree that gmain is complex, ids are hardly used internally, which makes the patch small.  In fact it is much simpler than the one originally written for GNOME bug 687098.  It is also relatively easy to review by looking at uses of context->next_id, source->source_id, g_source_list_add and g_source_list_remove.
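
For readers unfamiliar with the id handling discussed here, the following Python sketch is a toy model of the assumed failure mode: a source-id counter that wraps and hands out an id still owned by a long-lived source, so that a later remove-by-id hits the wrong source. It is not glib code. ToyContext, ID_LIMIT and run are hypothetical, the id space is shrunk to 8 so the wraparound shows up immediately, and only the name next_id is borrowed from the comment above.

```python
class ToyContext:
    ID_LIMIT = 8  # tiny id space for illustration; the real counter is far wider

    def __init__(self, check_in_use):
        self.next_id = 1
        self.check_in_use = check_in_use
        self.sources = {}  # source id -> names of sources attached under that id

    def _allocate(self):
        if self.check_in_use and len(self.sources) >= self.ID_LIMIT:
            raise RuntimeError("toy id space exhausted")
        while True:
            candidate = self.next_id
            self.next_id = self.next_id % self.ID_LIMIT + 1  # ids cycle 1..ID_LIMIT
            # The unchecked variant returns the id even if a live source owns it.
            if not self.check_in_use or candidate not in self.sources:
                return candidate

    def attach(self, name):
        source_id = self._allocate()
        self.sources.setdefault(source_id, []).append(name)
        return source_id

    def remove(self, source_id):
        # Remove-by-id takes the first source found under that id.
        names = self.sources.get(source_id)
        if not names:
            return None
        removed = names.pop(0)
        if not names:
            del self.sources[source_id]
        return removed


def run(check_in_use):
    """Return True if a short-lived remove ever destroyed the long-lived source."""
    ctx = ToyContext(check_in_use)
    ctx.attach("long-lived")
    removed = []
    for _ in range(ToyContext.ID_LIMIT):
        temp_id = ctx.attach("short-lived")
        removed.append(ctx.remove(temp_id))
    return "long-lived" in removed


if __name__ == "__main__":
    print(run(check_in_use=False))
    print(run(check_in_use=True))
```

In this model the unchecked allocator ends up removing the long-lived source once the counter wraps, while the variant that skips ids still in use does not. Whether this matches the exact mechanism in the RHEL 6 glib2 is an assumption based on the "overflow" wording in this thread.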

Comment 24 Adrian 2016-06-06 15:55:21 UTC
For what it's worth, this is still a persistent problem (our primary production database guest is crashing about twice a week with this error). We're now on fully-patched 6.7 and the issue seems to be getting worse.

Comment 25 Paolo Bonzini 2016-06-06 19:12:34 UTC
Junyi, can you qa_ack this?

Comment 26 Chao Yang 2016-06-07 02:41:41 UTC
(In reply to Paolo Bonzini from comment #25)
> Junyi, can you qa_ack this?

Hi Paolo,

If we ack this bug, QE will verify it by following the instructions in Comment 12. Is there any additional concern?

Comment 27 Paolo Bonzini 2016-06-07 09:59:21 UTC
No, it's okay.

Comment 28 Paolo Bonzini 2016-07-15 07:46:22 UTC
Junyi, please qa_ack this.

Comment 29 Paolo Bonzini 2016-07-15 07:47:29 UTC
I'd like to have this in 6.8.z too since there are customer cases waiting on Red Hat.  Can GSS add the flag?

Comment 37 Paolo Bonzini 2016-12-02 09:23:54 UTC
Unfortunately there's no way to accelerate.

Did you get the output before you killed the cat process?  I noted that the bash script uses "aapps_outbound" while the XML uses "outbound" only.

Comment 38 Yumei Huang 2016-12-06 01:57:57 UTC
Thanks Paolo. The script should use /dev/virtio-ports/outbound instead of aapps_outbound.

KVM QE has reproduced with glib2-2.26.1-3.el6.x86_64. After running the script in guest for 15 hours, guest crashed with "qemu-kvm: /builddir/build/BUILD/qemu-kvm-0.12.1.2/qemu-char.c:634: io_watch_poll_finalize: Assertion `iwp->src == ((void *)0)' failed."

Also KVM QE has run the same steps with glib2-2.28.8-6.el6.x86_64.  Guest still works well after more than 20 hours. 

So the fixed glib2 rpm is fine from KVM QE's perspective.

Comment 39 Tomas Pelka 2016-12-06 07:31:49 UTC
Thanks, moving to VERIFIED based on comment 38.

Comment 41 errata-xmlrpc 2017-03-21 09:02:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0567.html

