Bug 1142857
| Summary: | [abrt] qemu-kvm: bdrv_error_action(): qemu-kvm killed by SIGABRT |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | qemu-kvm |
| Version: | 7.0 |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | Tomas Dolezal <todoleza> |
| Assignee: | Paolo Bonzini <pbonzini> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | areis, armbru, dgilbert, famz, gwatson, hhuang, huding, jherrman, jraju, juzhang, michen, mklika, mrezanin, mzheng, pbonzini, rbalakri, todoleza, uobergfe, virt-maint, xfu |
| Target Milestone: | rc |
| Keywords: | ZStream |
| Hardware: | x86_64 |
| OS: | Unspecified |
| Whiteboard: | abrt_hash:070748678ed842e5f195e7365ca2467ac9f559ab |
| Fixed In Version: | qemu-kvm-1.5.3-93.el7 |
| Doc Type: | Bug Fix |
| Doc Text: | Due to incorrect implementation of portable memory barriers, the QEMU emulator in some cases terminated unexpectedly when a virtual disk was under heavy I/O load. This update fixes the implementation in order to achieve correct synchronization between QEMU's threads. As a result, the described crash no longer occurs. |
| : | 1231335 1233643 |
| Last Closed: | 2015-11-19 04:56:45 UTC |
| Bug Blocks: | 1231335, 1233643 |

Description
Tomas Dolezal
2014-09-17 14:06:59 UTC
Created attachment 938477
File: backtrace
Created attachment 938478
File: cgroup
Created attachment 938479
File: core_backtrace
Created attachment 938480
File: dso_list
Created attachment 938481
File: environ
Created attachment 938482
File: limits
Created attachment 938483
File: maps
Created attachment 938484
File: open_fds
Created attachment 938485
File: proc_pid_status

for the record: I was using virt-manager from remote el7 via qemu+ssh (remote: virt-manager-0.10.0-20.el7.noarch)

Backtrace suggests bdrv_co_em_bh() called virtio_blk_rw_complete() through acb->common.cb with a positive ret argument. Passes -ret through virtio_blk_handle_rw_error() to bdrv_error_action(), tripping its assertion.

What's the contract for acb->common.cb? Two possibilities come to mind:

1. Positive ret argument means success
   virtio_blk_rw_complete() needs to be fixed not to call virtio_blk_handle_rw_error() then.

2. Positive ret argument must not happen
   Whatever created the argument needs to be found and fixed.

A closer look at the backtrace:

    #5 0x00007f61da3fbc89 in virtio_blk_handle_rw_error (req=req@entry=0x7f61dfb97e80, error=-1641789906, is_read=true) at /usr/src/debug/qemu-1.5.3/hw/block/virtio-blk.c:69

-1641789906 is not a negative errno. Did something scribble over the acb?

We got a core, but no straightforward reproducer. Too late for 7.1 without a heroic effort. The bug looks too exotic to justify that. Punting to 7.2.

Can't say whether it's the same bug without a core to inspect at least. If it got triggered the same way, using this BZ to track it is best.

FaF reports from CentOS users: https://retrace.fedoraproject.org/faf/problems/670281/ looks like the same bug; 3 reports on versions
10:1.5.3-60.el7_0.7.0.1
10:1.5.3-86.el7_1.1
10:1.5.3-86.el7_1.2

More discussion is happening on the mailing list, and it looks like the compiler is reordering the stores despite the barrier. Note that it's expected that smp_rmb() and smp_wmb() produce no assembly code on x86. We've contacted the tools team to understand if this is a QEMU bug, a GCC bug, or both.

Paolo,
it seems to me that there are actually two issues:

- The compiler should not generate a series of machine instructions
  that reorders the sequence in which 'state' and 'ret' are stored in
  the ThreadPoolElement.

      req->state = THREAD_DONE;
      // r12->state = 2 (THREAD_DONE)
      0x00007fa71f51254b <+235>: movl $0x2,0x38(%r12)
      req->ret = ret;
      // r12->ret = eax (ret)
      0x00007fa71f512554 <+244>: mov %eax,0x3c(%r12)

  worker_thread() should store 'ret' _before_ 'state' as intended by
  the C code.

      req->ret = ret;
      /* Write ret before state. */
      smp_wmb();
      req->state = THREAD_DONE;

- However, even if the compiler generated the intended series of
  machine instructions, i.e.

      // r12->ret = eax (ret)
      [1] mov %eax,0x3c(%r12)
      // r12->state = 2 (THREAD_DONE)
      [2] movl $0x2,0x38(%r12)

  wouldn't we also need an explicit memory barrier instruction between
  [1] and [2] similar to the fix for BZ 804578 to prevent the processor
  from reordering the two stores internally?

Regards,
Uli

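The ordering requirement discussed in this comment can be illustrated with a minimal, self-contained sketch. This is not the QEMU patch: DemoElement, demo_init(), demo_complete() and demo_poll() are hypothetical stand-ins for ThreadPoolElement and the worker/completion code, and C11 fences are used here as an assumed substitute for QEMU's smp_wmb()/smp_rmb() macros. The point is that a release fence constrains the compiler as well as the CPU, so the store to ret may not be moved after the store to state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's thread pool states. */
enum { DEMO_THREAD_ACTIVE = 1, DEMO_THREAD_DONE = 2 };

/* Hypothetical stand-in for ThreadPoolElement: a result plus a state flag. */
typedef struct {
    _Atomic int state;
    int ret;
} DemoElement;

static void demo_init(DemoElement *req)
{
    req->ret = 0;
    atomic_init(&req->state, DEMO_THREAD_ACTIVE);
}

/* Worker side: publish ret first, then flip state to DONE. */
static void demo_complete(DemoElement *req, int ret)
{
    req->ret = ret;
    /* Write ret before state: neither the compiler nor the CPU may
     * move the store above past this fence and the store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&req->state, DEMO_THREAD_DONE,
                          memory_order_relaxed);
}

/* Completion side: read ret only after state has been observed as DONE. */
static bool demo_poll(DemoElement *req, int *ret_out)
{
    if (atomic_load_explicit(&req->state, memory_order_relaxed)
        != DEMO_THREAD_DONE) {
        return false;
    }
    /* Read state before ret. */
    atomic_thread_fence(memory_order_acquire);
    *ret_out = req->ret;
    return true;
}
```

In real use demo_complete() would run in the worker thread and demo_poll() in the thread that delivers the completion callback. On x86, compilers typically emit no instruction for release and acquire fences, which is consistent with the earlier remark that smp_wmb() and smp_rmb() produce no assembly code there; on that architecture their remaining job is to keep the compiler from reordering the accesses.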
Paolo,

I think I found the answer to my question in comment #31 myself.

Intel SDM Vol. 3 states under "Memory Ordering in P6 and More Recent Processor Families":

"... Writes to memory are not reordered with other writes, with the following exceptions:
- writes executed with the CLFLUSH instruction;
- streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
- string operations ..."

Since the 'mov' instructions [1] and [2] in comment #31 don't fall into one of the above exception categories, we don't need a barrier between them.

Please correct me if I'm wrong.

Regards,
Uli

Reassigning to Paolo, at least for now.

Ulrich, Tomas: we'll need GSSApproved in the whiteboard to get the fix to the z-stream. Can you add it somehow?

Fix included in qemu-kvm-1.5.3-93.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2213.html