Bug 1569614

Summary: IOERROR pause code lost after resuming a VM while I/O error is still present
Product: Red Hat Enterprise Linux 7 Reporter: Markus Armbruster <armbru>
Component: libvirt Assignee: Jiri Denemark <jdenemar>
Status: CLOSED NOTABUG QA Contact: Yanqiu Zhang <yanqzhan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.5 CC: aliang, chayang, chhu, coli, dyuan, fjin, jdenemar, jherrman, jiyan, juzhang, knoel, lmen, michal.skrivanek, michen, mzamazal, ngu, rbalakri, virt-maint, xuwei, xuzhang, yanqzhan, yhong
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Under certain circumstances, resuming a paused guest generated redundant "VIR_DOMAIN_PAUSED_UNKNOWN" error messages in the libvirt log. This update corrects the event sending order when resuming guests, which prevents the errors being logged.
Story Points: ---
Clone Of: 1566153 Environment:
Last Closed: 2018-05-29 21:51:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1526025    

Comment 2 Markus Armbruster 2018-04-19 15:18:55 UTC
When a VM is paused due to a storage error, libvirt emits a
corresponding life cycle event with the VIR_DOMAIN_PAUSED_IOERROR reason
and then a VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON event. The state of
the VM is also set appropriately:

  # virsh -r domstate 2 --reason
  paused (I/O error)

When the VM is then resumed manually while the I/O error still persists, it gets paused again immediately. However, in that case the life cycle event carries the VIR_DOMAIN_PAUSED_UNKNOWN reason, and the I/O error is no longer reported when querying the VM state:

  # virsh -r domstate 2 --reason
  paused (unknown)

Additionally, the events arrive in an unexpected order:

- IO_ERROR event
- RESUME event
- PAUSED event

That means the real pause reason is lost.

This happens because libvirt gets confused by the QMP events it receives from qemu-kvm after the resume: first BLOCK_IO_ERROR, then RESUME, then STOP.

I guess libvirt would be fine if qemu-kvm sent them in the more natural order RESUME, BLOCK_IO_ERROR, STOP.
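
To illustrate why the ordering matters, here is a minimal Python sketch of a toy event consumer. It is not libvirt's code: the class ToyMonitor, the function replay, and the rule that a RESUME discards any remembered I/O error are all assumptions made for illustration. Only the event names and the two orderings come from this report.

```python
from enum import Enum


class PauseReason(Enum):
    UNKNOWN = "unknown"
    IOERROR = "I/O error"


class ToyMonitor:
    """Hypothetical consumer of QMP events; NOT libvirt's implementation."""

    def __init__(self):
        # Start where the report starts: VM already paused on an I/O error.
        self.state = "paused"
        self.reason = PauseReason.IOERROR
        self._pending = None  # error seen since the last state change

    def handle(self, event):
        if event == "BLOCK_IO_ERROR":
            # Remember the error so a following STOP can be attributed to it.
            self._pending = PauseReason.IOERROR
        elif event == "RESUME":
            # Assumption: resuming wipes any remembered error.
            self.state = "running"
            self.reason = None
            self._pending = None
        elif event == "STOP":
            self.state = "paused"
            self.reason = self._pending or PauseReason.UNKNOWN
            self._pending = None
        else:
            raise ValueError(f"unexpected event: {event}")


def replay(events):
    """Feed events to a fresh ToyMonitor and format the result like domstate."""
    monitor = ToyMonitor()
    for event in events:
        monitor.handle(event)
    reason = monitor.reason.value if monitor.reason else "none"
    return f"{monitor.state} ({reason})"


if __name__ == "__main__":
    # Order observed from qemu-kvm in this report.
    print(replay(["BLOCK_IO_ERROR", "RESUME", "STOP"]))
    # The "more natural" order suggested above.
    print(replay(["RESUME", "BLOCK_IO_ERROR", "STOP"]))
```

Under these assumptions, the observed order ends in "paused (unknown)" because the error arrives before the RESUME that clears it, while the natural order ends in "paused (I/O error)".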

Perhaps we can fix qemu-kvm to do that (bug 1566153), and perhaps making libvirt cope with the current order won't be necessary then. This bug tracks possible libvirt work in case we can't fix qemu-kvm, or libvirt needs to cope with unfixed versions of qemu-kvm.

For detailed reproducers see bug 1566153.

Comment 3 Jiri Denemark 2018-05-29 21:51:27 UTC
QEMU fixed the order of emitted events and no additional libvirt work is needed.