Bug 1584914

Summary: SATA emulator lags and hangs
Product: Red Hat Enterprise Linux 7
Reporter: John Snow <jsnow>
Component: qemu-kvm-rhev
Assignee: John Snow <jsnow>
Status: CLOSED ERRATA
QA Contact: Xueqiang Wei <xuwei>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.6
CC: aliang, chayang, coli, jinzhao, juzhang, lolyu, michen, ngu, virt-maint, xuwei
Target Milestone: rc
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.12.0-8.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-01 11:10:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description John Snow 2018-05-31 23:02:04 UTC
Description of problem:

The SATA emulator (ide-hd, ide-cd or ide-drive when used with the AHCI host bus adapter) can occasionally cause a guest to hang because of a race condition in the completion code in hw/ide/ahci.c.

Version-Release number of selected component (if applicable):
All versions of qemu since v0.14.0.
I do not presently know when SATA was first considered "supported" in a Red Hat product, but it was no earlier than 2.4.0.


How reproducible:

2.11 and prior: ~0%
2.12 and later: ~0% - 100%, depending on threads, timing, and guest operating system.


Steps to Reproduce:
1. Boot a guest, such as Windows 10, using -M q35 and an AHCI disk.
2. Observe that the spinning dots loading animation will freeze for several seconds (around 10 to 12 seconds) before reaching the login screen.


Actual results:

- SATA performance is marred by occasional freezes, characterized by guest-driver errors. Linux may emit warnings in dmesg. Windows may freeze for 10-12 seconds at a time before attempting to reset the device.


Expected results:

- The SATA emulator, while slow, should not freeze or cause error messages or hangs in guest operating systems.


Additional info:

This is caused by a race condition where the PxCI register was not cleared prior to raising an IRQ upon AHCI command completion. Prior to v2.12.0, the timing for this was apparently not an issue, but changes in the locking primitives in 2.12.0 made the bug more likely to hit.
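
The ordering problem can be illustrated with a small toy model. This is only a sketch and is not the code in hw/ide/ahci.c: the struct and every `toy_` name are hypothetical, and the "guest interrupt handler" is run synchronously inside `toy_raise_irq` so that the worst-case interleaving of the race becomes deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one AHCI port; not QEMU's real data structures. */
struct toy_port {
    uint32_t pxci;      /* Command Issue: one bit per in-flight slot */
    uint32_t pxci_seen; /* PxCI value the "guest" read when the IRQ fired */
    bool irq_raised;
};

/* Stand-in for the guest interrupt handler: it samples PxCI at once. */
static void toy_raise_irq(struct toy_port *p)
{
    p->irq_raised = true;
    p->pxci_seen = p->pxci;
}

/* Problematic ordering: raise the IRQ first, clear the slot bit second. */
static void toy_complete_buggy(struct toy_port *p, unsigned slot)
{
    assert(slot < 32);
    toy_raise_irq(p);
    p->pxci &= ~(UINT32_C(1) << slot);
}

/* Safe ordering: clear the slot bit first, then raise the IRQ. */
static void toy_complete_fixed(struct toy_port *p, unsigned slot)
{
    assert(slot < 32);
    p->pxci &= ~(UINT32_C(1) << slot);
    toy_raise_irq(p);
}

/* Issue one command in `slot`, complete it, and return the PxCI value
 * that the handler observed. */
static uint32_t toy_run(bool fixed, unsigned slot)
{
    assert(slot < 32);
    struct toy_port p = { .pxci = UINT32_C(1) << slot };

    if (fixed) {
        toy_complete_fixed(&p, slot);
    } else {
        toy_complete_buggy(&p, slot);
    }
    return p.pxci_seen;
}
```

With the problematic ordering the handler still sees the slot bit set even though the command has finished; with the safe ordering it sees the bit cleared.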

For a guest operating system to see the bug, the guest SATA driver must read the PxCI register in its interrupt handler and find that the bit for the completed command has not yet been cleared. It may then opt to take corrective action, such as resetting the device.
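
As a rough illustration of the guest-side check, a driver can compare the slots it has issued against the current PxCI value. This is a hedged sketch only: the function names are hypothetical, it is not taken from any real driver, and real drivers consult other registers as well.

```c
#include <stdint.h>

/* Slots that the driver issued and that the device now reports as
 * finished, i.e. whose bit is cleared in PxCI. */
static uint32_t toy_completed_slots(uint32_t issued, uint32_t pxci)
{
    return issued & ~pxci;
}

/* Returns 1 when an interrupt arrived while commands were outstanding
 * but PxCI shows that none of them completed. A driver that sees this
 * may decide the device is misbehaving. */
static int toy_irq_without_progress(uint32_t issued, uint32_t pxci)
{
    return issued != 0 && toy_completed_slots(issued, pxci) == 0;
}
```

For example, with one outstanding command in slot 0, a PxCI value of 0x1 at interrupt time means no progress is visible, while 0x0 means the command completed as expected.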

See the launchpad for more information.

Comment 2 John Snow 2018-06-12 23:20:59 UTC
Fixed upstream for QEMU 3.0; 5694c7eacce6b263ad7497cc1bb76aad746cfd4e

requires backport.

Comment 4 Xueqiang Wei 2018-06-13 08:28:18 UTC
I tried 20 times and hit the issue once.

1. Test with qemu-kvm-rhev-2.12.0-1.el7.
2. Boot a Windows 10 guest using -M q35 and an AHCI disk.
3. Observe that the spinning dots loading animation freezes for several seconds (around 10 to 12 seconds) before reaching the login screen.
4. The guest hangs for around 10 to 12 seconds. I can move the mouse, but everything needing disk access is unresponsive.

Comment 5 John Snow 2018-06-13 17:06:54 UTC
Thank you for testing and reproducing this. Sorry I was not able to give better reproduction instructions.

I tried personally with `./x86_64-softmmu/qemu-system-x86_64 -m 4096 -cpu host -M q35 -enable-kvm -smp 4 -drive id=sda,if=none,file=/home/bos/jhuston/windows_10.qcow -device ide-hd,drive=sda -qmp tcp::4444,server,nowait -snapshot`

and was able to reproduce it fairly often on my T460S laptop:
- Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
- MemTotal:       20423644 kB

I saw it most frequently on 2.12.0 upstream and did not test with our downstream product, but the underlying bug has existed for all versions of the AHCI emulator, so it may be more or less likely to trigger on various versions for various reasons.

Comment 10 Miroslav Rezanina 2018-07-24 14:22:29 UTC
Fix included in qemu-kvm-rhev-2.12.0-8.el7

Comment 12 Xueqiang Wei 2018-07-30 07:33:01 UTC
Following the steps in Comment 4, I tried 30 times and did not hit the issue, so I am marking this bug as verified.

1. Test with qemu-kvm-rhev-2.12.0-8.el7.
2. Boot a Windows 10 guest using -M q35 and an AHCI disk.

Comment 13 errata-xmlrpc 2018-11-01 11:10:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443