Bug 1375520

Summary:	qemu core dump when there is an I/O error on AHCI
Product:	Red Hat Enterprise Linux 7	Reporter:	jingzhao <jinzhao>
Component:	qemu-kvm-rhev	Assignee:	John Snow <jsnow>
Status:	CLOSED ERRATA	QA Contact:	Xueqiang Wei <xuwei>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.3	CC:	chayang, coli, jen, jherrman, jinzhao, jsnow, juzhang, knoel, kraxel, kwolf, mrezanin, nerijus, pbonzini, virt-bugs, virt-maint, xfu
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Due to asynchronous I/O control blocks (AIOCBs) not being properly cleared, guests that use the Advanced Host Controller Interface (AHCI) in some cases terminated unexpectedly when an I/O error occurred. With this update, AIOCBs are cleared properly, and I/O errors on guests with AHCI are resolved gracefully.	Story Points:	---
Clone Of:	887844
Clones:	1393736 (view as bug list)		Environment:
Last Closed:	2017-08-01 23:34:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	887844, 953062
Bug Blocks:	1227278, 1393736

Comment 2 John Snow 2016-09-14 21:45:06 UTC

jingzhao, can you please do me a favor and try running this reproducer using the fix for BZ #1299876 ?

I haven't been able to reproduce yet, but by removing the obvious source of the segfault, maybe the problem will manifest differently in a way that helps us move forward with this issue.

I have a build based on qemu-kvm-rhev-2.6.0-25.el7 that includes the fixes for #1299876 that I think *might* stop the segfault here. If it does or it doesn't, it will tell us a lot about the nature of the problem that may help diagnose it better.

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11754826

In the meantime, I'd like to make sure I have my facts straight about the nature of this bug without the fix posted above:

(1) This is observed under qemu-kvm-rhev-2.6.0-*, most recently #24.

(2) The crash happens when using which guest? RHEL of some version?

(3) It does not appear to happen when using the loopback device, but does appear to happen when using iSCSI.

(4) It happens on both Q35 and PC machines when using the AHCI controller.

(5) When it happens, there is no opportunity to resume the VM, as it crashes before it pauses.

(6) No "STOP" event is emitted via the QMP stream.

(7) The crash appears to happen immediately after the disk becomes FULL with no further interaction from the user.

Would it be correct to say that the only difference you can observe is the different backing storage technique?

Sorry if I am being redundant, but I thank you for your patience and diligence.
--John

Comment 5 jingzhao 2016-09-21 09:21:52 UTC

(In reply to John Snow from comment #2)
> jingzhao, can you please do me a favor and try running this reproducer using
> the fix for BZ #1299876 ?
> 
> I haven't been able to reproduce yet, but by removing the obvious source of
> the segfault, maybe the problem will manifest differently in a way that
> helps us move forward with this issue.
> 
> I have a build based on qemu-kvm-rhev-2.6.0-25.el7 that includes the fixes
> for #1299876 that I think *might* stop the segfault here. If it does or it
> doesn't, it will tell us a lot about the nature of the problem that may help
> diagnose it better.
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11754826

Sorry for the late reply.

I can also reproduce this bz and bz1299876 with qemu-kvm-rhev-2.6.0-25.el7.x86_64 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=513032). Could you share the private build with me again? I lost it.

> 
> -
> 
> In the meantime, I'd like to make sure I have my facts straight about the
> nature of this bug without the fix posted above:
> 
> (1) This is observed under qemu-kvm-rhev-2.6.0-*, most recently #24.
> 
> (2) The crash happens when using which guest? RHEL of some version?
--A RHEL guest (kernel-3.10.0-481.el7.x86_64).
> 
> (3) It does not appear to happen when using the loopback device, but does
> appear to happen when using iSCSI.
--yeap
> 
> (4) It happens on both Q35 and PC machines when using the AHCI controller.
--yeap
> 
> (5) When it happens, there is no opportunity to resume the VM, as it crashes
> before it pauses.
--yeap
> 
> (6) No "STOP" event is emitted via the QMP stream.

--Actually, it seems the guest does stop, because a stop action is reported via QMP:

{"timestamp": {"seconds": 1474449608, "microseconds": 538541}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "reason": "No space left on device", "operation": "write", "action": "stop"}}
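For anyone scripting this check, here is a minimal Python sketch (not part of the original report) that parses the event above and reports whether its "action" field asks for the guest to be stopped. The helper name summarize_block_io_error is hypothetical; only the event text itself comes from this comment.

```python
import json

# Event text copied verbatim from the comment above.
RAW_EVENT = (
    '{"timestamp": {"seconds": 1474449608, "microseconds": 538541}, '
    '"event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", '
    '"nospace": true, "__com.redhat_reason": "enospc", '
    '"reason": "No space left on device", "operation": "write", '
    '"action": "stop"}}'
)


def summarize_block_io_error(raw):
    """Return a short summary of a BLOCK_IO_ERROR event, or None for other messages."""
    msg = json.loads(raw)
    if msg.get("event") != "BLOCK_IO_ERROR":
        return None
    data = msg.get("data", {})
    return {
        "device": data.get("device"),
        "operation": data.get("operation"),
        "nospace": bool(data.get("nospace")),
        # True when the event reports "stop" as the action taken for this error.
        "will_pause": data.get("action") == "stop",
    }


summary = summarize_block_io_error(RAW_EVENT)
print(summary)
```

This only inspects the BLOCK_IO_ERROR event itself; whether a separate "STOP" event also arrived would have to be checked in the same QMP stream.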

> 
> (7) The crash appears to happen immediately after the disk becomes FULL with
> no further interaction from the user.
--yeap
> 
> Would it be correct to say that the only difference you can observe is the
> different backing storage technique?
--Yes, the backing storage; also, I didn't run "system_reset".


Thanks
Jing Zhao

Comment 6 John Snow 2016-09-21 18:00:58 UTC

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11792798

http://download-node-02.eng.bos.redhat.com/brewroot/work/tasks/2802/11792802/

This is the fix for #1299876 applied on top of qemu-kvm-rhev-2.6.0-26.el7.

I've re-hosted the files at http://file.bos.redhat.com/jhuston/11792802/ this time so they don't disappear on us.

Comment 7 jingzhao 2016-09-22 06:18:21 UTC

(In reply to John Snow from comment #6)
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11792798
> 
> http://download-node-02.eng.bos.redhat.com/brewroot/work/tasks/2802/11792802/
> 
> This is the fix for #1299876 applied on top of qemu-kvm-rhev-2.6.0-26.el7.
> 
> I've re-hosted the files at http://file.bos.redhat.com/jhuston/11792802/
> this time so they don't disappear on us.

Hi John

1. I also reproduced the core dump issue with the test build (http://file.bos.redhat.com/jhuston/11792802/).

2. I also tried the bz1299876 scenario with the test build, following https://bugzilla.redhat.com/show_bug.cgi?id=1299876#c3, and didn't hit the core dump issue.

Thanks
Jing Zhao

Comment 8 John Snow 2016-09-22 19:52:02 UTC

Aaaaaaah ... !!

Please try http://file.bos.redhat.com/jhuston/11801934/ instead, the fix for AHCI was incomplete.

Sorry for the inconvenience.

Comment 9 jingzhao 2016-09-23 02:59:49 UTC

(In reply to John Snow from comment #8)
> Aaaaaaah ... !!
> 
> Please try http://file.bos.redhat.com/jhuston/11801934/ instead, the fix for
> AHCI was incomplete.
> 
> Sorry for the inconvenience.

Hi John

Tested with http://file.bos.redhat.com/jhuston/11801934/

The bz did not reproduce.

1. Boot the guest with the iSCSI backend
2. In the guest, dd if=/dev/zero of=/dev/sda bs=1M count=8192
3. Check the status through hmp
(qemu) info status
VM status: paused (io-error)
4. In hmp
(qemu) c
(qemu) info status
VM status: running
(qemu) system_reset
and the guest responded.
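
The same status check could also be driven over the QMP socket instead of hmp. Below is a rough Python sketch, assuming the -qmp tcp:0:6666 endpoint from the command line in this comment is reachable on 127.0.0.1. The helper names are hypothetical, and only the standard QMP commands qmp_capabilities, query-status and cont are used. The canned stream at the end exercises the reply parser without a running QEMU.

```python
import io
import json
import socket


def qmp_command(name, arguments=None):
    """Encode one QMP command as a newline-terminated JSON line."""
    msg = {"execute": name}
    if arguments:
        msg["arguments"] = arguments
    return (json.dumps(msg) + "\r\n").encode("utf-8")


def read_reply(stream):
    """Read lines until a command reply ("return" or "error"), skipping async events."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)
        if "return" in msg or "error" in msg:
            return msg


def check_pause_and_resume(host="127.0.0.1", port=6666):
    """Hypothetical helper: read the run state and resume if paused on io-error."""
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rwb")
        json.loads(stream.readline())  # discard the QMP greeting banner
        stream.write(qmp_command("qmp_capabilities"))
        stream.flush()
        read_reply(stream)
        stream.write(qmp_command("query-status"))
        stream.flush()
        status = read_reply(stream)["return"]
        if status.get("status") == "io-error":
            # Equivalent of typing "c" at the (qemu) prompt.
            stream.write(qmp_command("cont"))
            stream.flush()
            read_reply(stream)
        return status


# Offline check of the reply parser with a canned stream (no QEMU needed).
canned = io.BytesIO(
    b'{"event": "BLOCK_IO_ERROR", "data": {"action": "stop"}}\n'
    b'{"return": {"status": "io-error", "running": false}}\n'
)
demo_reply = read_reply(canned)
print(demo_reply["return"]["status"])
```

check_pause_and_resume() is deliberately not called here, since it needs a live guest started with the QMP option shown in this comment.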

Additional info:
/usr/libexec/qemu-kvm \
-M pc \
-cpu SandyBridge \
-nodefaults -rtc base=utc \
-m 4G \
-smp 2,sockets=2,cores=1,threads=1 \
-enable-kvm \
-name rhel7.3 \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-smbios type=1,manufacturer='Red Hat',product='RHEV Hypervisor',version=el6,serial=koTUXQrb,uuid=feebc8fd-f8b0-4e75-abc3-e63fcdb67170 \
-k en-us \
-nodefaults \
-serial unix:/tmp/serial0,server,nowait \
-boot menu=on \
-bios /usr/share/seabios/bios.bin \
-chardev file,path=/home/seabios.log,id=seabios \
-device isa-debugcon,chardev=seabios,iobase=0x402 \
-qmp tcp:0:6666,server,nowait \
-device VGA,id=video \
-vnc :2 \
-drive file=/home/bug/rhel73.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-device virtio-net-pci,netdev=tap10,mac=9a:6a:6b:6c:6d:6e -netdev tap,id=tap10 \
-device ahci,id=ahci0 \
-drive file=/mnt/test.qcow2,if=none,id=drive-virtio-disk1,format=qcow2,werror=stop,rerror=stop \
-device ide-hd,drive=drive-virtio-disk1,id=virtio-disk1,bus=ahci0.0 \
-monitor stdio


Is this enough? Please tell me if more testing is needed.

Thanks
Jing Zhao

Comment 10 John Snow 2016-09-23 04:20:40 UTC

If the guest didn't report any IO errors and everything appears to have worked correctly, I'll submit my patches downstream and move the bug into POST.

Thanks for your patience!

Comment 12 Ademar Reis 2016-09-28 01:49:33 UTC

For reference, this is the cluster of BZs related to this issue: bug 1281713, bug 1299876, bug 1299875, bug 1361487, bug 1361490, bug 1361488, bug 1375520

Comment 17 Xueqiang Wei 2017-06-07 06:32:46 UTC

Following https://bugzilla.redhat.com/show_bug.cgi?id=887844#c11, reproduced this bug on:
host kernel:3.10.0-496.el7.x86_64
qemu-kvm-rhev-2.6.0-22.el7.x86_64


Retested on the latest RHEL 7.3.z; did not hit this issue:
host kernel: 3.10.0-514.25.2.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.10


After "dd" in guest:
(qemu) info status 
VM status: paused (io-error)
(qemu) system_reset
(qemu) info status 
VM status: paused (prelaunch)
(qemu) info status 
VM status: running

So marking this bug as verified.

Comment 18 Xueqiang Wei 2017-06-07 07:21:17 UTC

Retested on the latest RHEL 7.4; did not hit this issue:
host kernel: 3.10.0-679.el7.x86_64
qemu-kvm-rhev-2.9.0-8.el7

Comment 20 errata-xmlrpc 2017-08-01 23:34:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
