Bug 1212722
| Summary: | qemu-kvm crashes with iwp->src == NULL in io_watch_poll_finalize | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Takayuki Nagata <tnagata> | ||||||
| Component: | glib2 | Assignee: | Colin Walters <walters> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||
| Severity: | medium | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 6.5 | CC: | adevolder, adrian.allen, chayang, cww, james.hofmeister, jkoten, juzhang, mkenneth, pbonzini, qzhang, rbalakri, rpacheco, tlavigne, tnagata, tpelka, virt-maint, walters, yuhuang | ||||||
| Target Milestone: | rc | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | glib2-2.28.8-6.el6 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2017-03-21 09:02:18 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1172231, 1269194 | ||||||||
| Attachments: |
|
||||||||
The source that is removed in pty_chr_rearm_timer should not have been an io_watch_poll source, so the assertion is correct. I cannot find anything wrong in the code. Is it possible to get the core files? We've been having this issue on a semi-regular basis and I'd like to help troubleshoot. What files/logs/etc would be helpful? Adrian, please attach to this bug any backtraces that you can get. It would be very helpful to know if they all look the same or they are different. This is upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=687098. Backporting the patches from https://bugzilla.gnome.org/show_bug.cgi?id=724839 would actually be better though, as they are simpler. I'm building a modified glib package now. ------------------------------------------- Reproduction steps from CentOS bug tracker: ------------------------------------------- 1. Add vm-channels to the guest a. virsh -c qemu:///system edit {guest-name} b. Add the following snippet of xml under devices: <channel type='pty'> <target type='virtio' name='inbound'/> <address type='virtio-serial' controller='0' bus='0' port='1'/> </channel> <channel type='pty'> <target type='virtio' name='outbound'/> <address type='virtio-serial' controller='0' bus='0' port='2'/> </channel> c. Fully shutdown the guest (if not already). Start the guest. 2. Open the vm channel for reading on the host side a. virsh -c qemu:///system dumpxml {guest-name} b. Find the snippet that looks like your outbound channel: <channel type='pty'> <source path='/dev/pts/4'/> <target type='virtio' name='outbound'/> <alias name='channel2'/> <address type='virtio-serial' controller='0' bus='0' port='2'/> </channel> c. cat /dev/pts/N (in this case, /dev/pts/4) (this call will block) 3. In your guest, start sending data to the outbound vm-channel a. This bash script can be used: #!/bin/sh (while true; do echo 123456789012345678901234567890123456789012345678901234567890123 sleep 1 done) > /dev/virtio-ports/aapps_outbound 4. Kill your read side cat (on the host side) 5. Let this run. This may take hours, but it will crash eventually. Hmm. Is this one source per created per write on a virtio channel? That sounds bad. Have you considered porting qemu to the GSource* API instead of the "convenience" API? Created attachment 1077939 [details]
backport of upstream patches
The source is only created when flow control kicks in, which should be rare. This testcase is more or less the worst case because the host side "cat" was terminated and the guest keeps writing.
> Have you considered porting qemu to the GSource* API instead of the
> "convenience" API?
Yes, but replacing every g_*_add with g_source_new + g_source_set_callback + g_source_attach is a bit clunky. The incentive to do it upstream is low, since it is already fixed in upstream glib and it only happens very rarely. RHEL7 also works, which a QEMU patch even less appealing. Plus I have done (though not tested) the backport already. :)
Created attachment 1077943 [details]
qemu: Port to GSource API
It wasn't too bad to type out the patch, though I didn't test it. What do you think? Your backport of the glib patch looks sane, but I see these costs: - It's in a very complex area of the code, there is risk of regression - glib2 in RHEL6 is old, but a lot of things use it If there were *two* known programs broken by the glib2 overflow, I think that would argue strongly for a backport. (I'm sure there are more, but hopefully most everyone is on RHEL7 now) That patch also looks sane, but the very same objections apply to it (complex area of the code / risk of regression, old codebase). https://bugzilla.gnome.org/show_bug.cgi?id=684526 shows another guy that stumbled upon the same issue. In general it seems to me that the bug will be hard to debug (because it's in a library that's generally taken for granted, and because it's hard to reproduce) so I'd really prefer to have it fixed for real. While I agree that gmain is complex, ids are hardly used internally, which makes the patch is small. In fact it is much simpler than the one originally written for GNOME bug 687098. It is also relatively easy to review by looking at uses of context->next_id, source->source_id, g_source_list_add and g_source_list_remove. For what it's worth, this is still a persistent problem (our primary production database guest is crashing about twice a week with this error). We're now on fully-patched 6.7 and the issue seems to be getting worse. Junyi, can you qa_ack this? (In reply to Paolo Bonzini from comment #25) > Junyi, can you qa_ack this? Hi Paolo, QE will verify this bug by following instructions in Comment 12 if we ack this bug, is there any additional concern? No, it's okay. Junyi, please qa_ack this. I'd like to have this in 6.8.z too since there are customer cases waiting on Red Hat. Can GSS add the flag? Unfortunately there's no way to accelerate. Did you get the output before you killed the cat process. I noted that the bash script uses "aapps_outbound" while the XML uses "outbound" only. Thanks Paolo. The script should use /dev/virtio-ports/outbound instead of aapps_outbound. KVM QE has reproduced with glib2-2.26.1-3.el6.x86_64. After running the script in guest for 15 hours, guest crashed with "qemu-kvm: /builddir/build/BUILD/qemu-kvm-0.12.1.2/qemu-char.c:634: io_watch_poll_finalize: Assertion `iwp->src == ((void *)0)' failed." Also KVM QE has run the same steps with glib2-2.28.8-6.el6.x86_64. Guest still works well after more than 20 hours. So the fixed glib2 rpm is fine from KVM QE's perspective. Thanks moving to VERIFIED based on c38. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0567.html |
The code that includes the assertion was introduced by Paolo back in 2013, in this commit: commit 2b316774f60291f57ca9ecb6a9f0712c532cae34 Author: Paolo Bonzini <pbonzini> Date: Fri Apr 19 17:32:09 2013 +0200 qemu-char: do not operate on sources from finalize callbacks Due to a glib bug, the finalize callback is called with the GMainContext lock held. Thus, any operation on the context from the callback will cause recursive locking and a deadlock. This happens, for example, when a client disconnects from a socket chardev. The fix for this is somewhat ugly, because we need to forego polymorphism and implement our own function to destroy IOWatchPoll sources. The right thing to do here would be child sources, but we support older glib versions that do not have them. Not coincidentially, glib developers found and fixed the deadlock as part of implementing child sources. Signed-off-by: Paolo Bonzini <pbonzini> Tested-by: Sander Eikelenboom <linux> Message-id: 1366385529-10329-5-git-send-email-pbonzini Signed-off-by: Anthony Liguori <aliguori.com>