Bug 1566153
Summary: IOERROR pause code lost after resuming a VM while I/O error is still present

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Milan Zamazal <mzamazal>
Assignee: Markus Armbruster <armbru>
QA Contact: CongLi <coli>
CC: aliang, chayang, chhu, coli, dyuan, jdenemar, jherrman, jiyan, jsuchane, juzhang, knoel, lmen, michal.skrivanek, michen, mtessun, ngu, virt-maint, xuwei, xuzhang, yanqzhan, yhong
Target Milestone: rc
Target Release: ---
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-rhev-2.12.0-2.el7
Doc Type: Bug Fix
Doc Text: Under certain circumstances, resuming a paused guest generated redundant "VIR_DOMAIN_PAUSED_UNKNOWN" error messages in the libvirt log. This update corrects the event sending order when resuming guests, which prevents the errors being logged.
Clones: 1569614, 1582122 (view as bug list)
Last Closed: 2018-11-01 11:07:33 UTC
Type: Bug
Regression: ---
Bug Blocks: 1526025, 1582122
This issue can be reproduced:

pkg version:
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64

Steps:
1. Run a VM with a disk attached.
2. Cause an I/O error by overwriting on the disk:
   # dd if=/dev/zero of=/mnt/test.img bs=1M count=50
3. Wait until the VM gets paused.
4. Resume the VM.
5. Check events and libvirtd.log (refer to attachment "QeLogs_BugIoerror").

Results:

1. Issue reproduces:

# grep ' pause' libvirtd-ioerror.log
2018-04-12 07:20:47.791+0000: 6083: debug : qemuProcessHandleIOError:971 : Transitioned guest V to paused state due to IO error
2018-04-12 07:21:48.272+0000: 6083: debug : qemuProcessHandleResume:808 : Transitioned guest V out of paused into resumed state
2018-04-12 07:21:48.272+0000: 6083: debug : qemuProcessHandleStop:756 : Transitioned guest V to paused state, reason unknown
2018-04-12 07:25:32.687+0000: 6083: debug : qemuProcessHandleResume:808 : Transitioned guest V out of paused into resumed state

2. From the following content, the events in order are: BLOCK_IO_ERROR - RESUME - STOP - SHUTDOWN - RESUME. QEMU did not send the I/O error message again for the second pause (STOP), so it seems libvirt cannot do anything to highlight the I/O error or its reason.

# grep QEMU_MONITOR_RECV_EVENT libvirtd-ioerror.log
2018-04-12 07:20:47.791+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9928000fb0 event={"timestamp": {"seconds": 1523517647, "microseconds": 791396}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block368", "reason": "No space left on device", "operation": "write", "action": "stop"}}
...
2018-04-12 07:21:48.272+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9928000fb0 event={"timestamp": {"seconds": 1523517708, "microseconds": 250598}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block368", "reason": "No space left on device", "operation": "write", "action": "stop"}}
2018-04-12 07:21:48.272+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9928000fb0 event={"timestamp": {"seconds": 1523517708, "microseconds": 255772}, "event": "RESUME"}
2018-04-12 07:21:48.272+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9928000fb0 event={"timestamp": {"seconds": 1523517708, "microseconds": 259194}, "event": "STOP"}
2018-04-12 07:25:29.118+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9928000fb0 event={"timestamp": {"seconds": 1523517929, "microseconds": 118290}, "event": "SHUTDOWN", "data": {"guest": false}}
2018-04-12 07:25:32.687+0000: 6083: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f997c0162a0 event={"timestamp": {"seconds": 1523517932, "microseconds": 687012}, "event": "RESUME"}

Created attachment 1420763 [details]
QeLogs_BugIoerror
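The grep-based check above can be automated. A minimal sketch (not libvirt tooling; the line format is taken from the log excerpts above, and the function name is hypothetical):

```python
import re

# libvirtd debug lines for domain state transitions look like:
# "... qemuProcessHandleStop:756 : Transitioned guest V to paused state, reason unknown"
TRANSITION = re.compile(r"qemuProcessHandle(\w+):\d+ : (Transitioned guest \S+ .+)")

def pause_transitions(lines):
    """Extract (handler, message) pairs from libvirtd state-transition log lines."""
    out = []
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

# Sample lines copied from the log excerpt above.
sample = [
    "2018-04-12 07:20:47.791+0000: 6083: debug : qemuProcessHandleIOError:971 : Transitioned guest V to paused state due to IO error",
    "2018-04-12 07:21:48.272+0000: 6083: debug : qemuProcessHandleResume:808 : Transitioned guest V out of paused into resumed state",
    "2018-04-12 07:21:48.272+0000: 6083: debug : qemuProcessHandleStop:756 : Transitioned guest V to paused state, reason unknown",
]

transitions = pause_transitions(sample)
# The bug signature: the pause after the resume carries no I/O error reason.
assert transitions[-1][0] == "Stop" and "reason unknown" in transitions[-1][1]
```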
Right, we call "cont" and get several BLOCK_IO_ERROR events followed by a RESUME event, a successful return from "cont", and a STOP event.

2018-04-11 14:52:43.918+0000: 1628: info : qemuMonitorIOWrite:543 : QEMU_MONITOR_IO_WRITE: mon=0x7fc76401a810 buf={"execute":"cont","id":"libvirt-2482"}
2018-04-11 14:52:43.922+0000: 1628: info : qemuMonitorIOProcess:438 : QEMU_MONITOR_IO_PROCESS: mon=0x7fc76401a810 buf={"timestamp": {"seconds": 1523458363, "microseconds": 922090}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-ua-29ea81dc-a9a1-4f41-83ad-7715422e1079", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block109", "reason": "Input/output error", "operation": "read", "action": "stop"}}
2018-04-11 14:52:43.923+0000: 1628: info : qemuMonitorIOProcess:438 : QEMU_MONITOR_IO_PROCESS: mon=0x7fc76401a810 buf={"timestamp": {"seconds": 1523458363, "microseconds": 922346}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-ua-29ea81dc-a9a1-4f41-83ad-7715422e1079", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block109", "reason": "Input/output error", "operation": "write", "action": "stop"}}
2018-04-11 14:52:43.928+0000: 1628: info : qemuMonitorIOProcess:438 : QEMU_MONITOR_IO_PROCESS: mon=0x7fc76401a810 buf={"timestamp": {"seconds": 1523458363, "microseconds": 928242}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-ua-29ea81dc-a9a1-4f41-83ad-7715422e1079", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block109", "reason": "Input/output error", "operation": "write", "action": "stop"}}
2018-04-11 14:52:43.942+0000: 1628: info : qemuMonitorIOProcess:438 : QEMU_MONITOR_IO_PROCESS: mon=0x7fc76401a810 buf={"timestamp": {"seconds": 1523458363, "microseconds": 942444}, "event": "RESUME"}
{"return": {}, "id": "libvirt-2482"}
2018-04-11 14:52:43.951+0000: 1628: info : qemuMonitorIOProcess:438 : QEMU_MONITOR_IO_PROCESS: mon=0x7fc76401a810 buf={"timestamp": {"seconds": 1523458363, "microseconds": 951301}, "event": "STOP"}

I'm moving this to QEMU for further investigation.

I asked yanqzhan for a more detailed reproducer, and got this in a pastebin:

[io-error] Deliberately leading to domain I/O errors

Setup: Prepare a small partition for this testing, for example /dev/sda3 (11M):
# mkfs.ext3 /dev/sda3

1. Define and start a pool:
# cat fs-pool.xml
<pool type='fs'>
  <name>mypool</name>
  <source>
    <device path='/dev/sda3'/>
    <format type='auto'/>
  </source>
  <target>
    <path>/var/lib/libvirt/images/mypool</path>
  </target>
</pool>
# virsh pool-define fs-pool.xml
# virsh pool-build mypool
# virsh pool-start mypool
Check the pool is working fine:
# df -h

2. Create a volume in the pool:
# cat vol-disk-template.xml
<volume>
  <name>disk1.img</name>
  <capacity unit='M'>100</capacity>
  <allocation unit='M'>0</allocation>
  <target>
    <path>/var/lib/libvirt/images/mypool/disk1.img</path>
    <format type='raw'/>
  </target>
</volume>
# virsh vol-create mypool vol-disk-template.xml

3. Add the volume to a guest as a secondary disk, then start the guest:
# virsh edit testvm
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/var/lib/libvirt/images/mypool/disk1.img'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>
# virsh start testvm

4. In the guest, try to write some data (bigger than 50M) to the secondary disk:
# fdisk /dev/vdb
In the fdisk prompt, enter "n, p, 1, enter, enter, w" in order.
(It will be vda if the first disk does not use the virtio driver; use "# fdisk -l" to determine the 2nd virtual disk of the guest.)
# mkfs.ext3 /dev/vdb1
# mount /dev/vdb1 /mnt
# dd if=/dev/zero of=/mnt/test.img bs=1M count=50

5. On the host, check the partition status: 100% used.
# df -h | grep sda3
(Note: you can delete the volume and then reuse this partition.)

Simplified reproducer with the upper layers peeled off:

1. Create a scratch image:
   $ dd if=/dev/zero of=scratch.img bs=1M count=100

   Size doesn't actually matter.

2. Prepare blkdebug configuration:

   $ cat >blkdebug.conf <<EOF
   [inject-error]
   event = "write_aio"
   errno = "5"
   EOF

   Note that errno 5 is EIO.

3. Run a guest with an additional scratch disk, i.e. with additional arguments
   -drive if=none,id=scratch-drive,format=raw,werror=stop,file=blkdebug:blkdebug.conf:scratch.img
   -device virtio-blk-pci,id=scratch,drive=scratch-drive

   The blkdebug part makes all writes to the scratch drive fail with EIO. The werror=stop pauses the guest on write errors.

4. Connect to the QMP socket e.g. like this:
   $ socat UNIX:/your/qmp/socket READLINE,history=$HOME/.qmp_history,prompt='QMP> '

   Issue QMP command 'qmp_capabilities':
   QMP> { "execute": "qmp_capabilities" }

5. Boot the guest.

6. In the guest, write to the scratch disk, e.g. like this:

   # dd if=/dev/zero of=/dev/vdb count=1

   Do double-check the device specified with of= is actually the scratch device!

7. Issue QMP command 'cont':
   QMP> { "execute": "cont" }

Actual results for upstream QEMU 2.12.0-rc3:

* After step 6, I get these QMP events:

{"timestamp": {"seconds": 1523950690, "microseconds": 13256}, "event": "BLOCK_IO_ERROR", "data": {"device": "scratch-drive", "nospace": false, "node-name": "#block634", "reason": "Input/output error", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1523950690, "microseconds": 14614}, "event": "STOP"}

* After step 7, I get:

{"timestamp": {"seconds": 1523950730, "microseconds": 551845}, "event": "BLOCK_IO_ERROR", "data": {"device": "scratch-drive", "nospace": false, "node-name": "#block634", "reason": "Input/output error", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1523950730, "microseconds": 553629}, "event": "RESUME"}
{"return": {}}
{"timestamp": {"seconds": 1523950730, "microseconds": 555970}, "event": "STOP"}

Events BLOCK_IO_ERROR and RESUME arrive before the reply to "cont", event STOP afterwards. This is consistent with what I see in the libvirt logs from comment#3, and the log excerpt in comment#4.
The BLOCK_IO_ERROR event "overtakes" the RESUME event here. This matches the order of libvirt events observed by the reporter: IO_ERROR, RESUME, PAUSED.

The reporter further observed that virsh domstate --reason reports "unknown" after resume. Note that QEMU's BLOCK_IO_ERROR event still carries the correct "reason" then. It looks like the reason gets lost on the way to virsh domstate, possibly because libvirt is not prepared for this order of QEMU events.

I think we have two possible bugs here:

#1 qemu-kvm bug: event ordering. I'm not saying this is a bug in need of fixing; I am saying we should find out whether it is one.

#2 libvirt bug: failure to cope with said event ordering. Even if #1 is deemed a genuine bug, libvirt may want to cope with unfixed QEMU versions.

Jiri, what do you think?

#1 The ordering is pretty weird; IMHO it would be more logical to fail the "cont" command, but I'm afraid this won't be possible because it's completely asynchronous.

#2 Libvirt reacts to each event separately. When it sees BLOCK_IO_ERROR and the configuration says it should stop guest CPUs in such a case, the domain is marked as paused because of an I/O error. Then it sees RESUME and marks the domain as running. Once STOP arrives, the domain is switched to paused with an unknown reason.

We could theoretically remember the last I/O error and consider it valid later when we see STOP, but that sounds too hacky and fragile. Or we could just call query-status when we see the STOP event, because it is able to give us the details we need to set the appropriate stop reason. This is a bit better, until we start thinking about it more: libvirt would report paused/unknown for the time frame between the STOP event and the reply from query-status, as we'd have to schedule this command to be processed asynchronously, because another synchronous command may be running at the time. So ideally, we'd just get the reason with the STOP event itself, so that we don't have to query it explicitly.
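To make the "remember the last I/O error" idea discussed above concrete, here is a toy sketch of an event-driven state tracker. This is not libvirt code; all names are hypothetical, and it only illustrates why caching the last BLOCK_IO_ERROR would recover the pause reason for the event order seen in the logs:

```python
# Toy model of a libvirt-like domain state tracker. The cache of the last
# BLOCK_IO_ERROR lets a later bare STOP event be attributed to the I/O error.
PAUSED_IOERROR = "paused (I/O error)"
PAUSED_UNKNOWN = "paused (unknown)"
RUNNING = "running"

class DomainStateTracker:
    def __init__(self):
        self.state = RUNNING
        self.last_io_error = None  # reason string from the last BLOCK_IO_ERROR

    def handle_event(self, event, data=None):
        if event == "BLOCK_IO_ERROR":
            self.last_io_error = data.get("reason")
            if data.get("action") == "stop":
                self.state = PAUSED_IOERROR
        elif event == "RESUME":
            self.state = RUNNING
        elif event == "STOP":
            # Without the cache this would have to be PAUSED_UNKNOWN,
            # which is exactly the reported symptom.
            self.state = PAUSED_IOERROR if self.last_io_error else PAUSED_UNKNOWN

# Event order observed in the logs: BLOCK_IO_ERROR, RESUME, STOP.
t = DomainStateTracker()
t.handle_event("BLOCK_IO_ERROR", {"reason": "Input/output error", "action": "stop"})
t.handle_event("RESUME")
t.handle_event("STOP")
assert t.state == PAUSED_IOERROR
```

The fragility pointed out above is visible here too: the tracker has no way to know when the cached error has gone stale, which is why fixing the event order in QEMU (or querying query-status) is the preferable direction.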
Unless it's easy to fix the order of events, of course.

I'm going to clone this bug for tracking possible libvirt work (comment#8 item #2). This bug will be for tracking qemu-kvm work (item #1).

Reproduced with latest upstream. The culprit is vm_prepare_start():

    /* Ensure that a STOP/RESUME pair of events is emitted if a
     * vmstop request was pending.  The BLOCK_IO_ERROR event, for
     * example, according to documentation is always followed by
     * the STOP event.
     */
    if (runstate_is_running()) {
        qapi_event_send_stop(&error_abort);
        res = -1;
    } else {
        replay_enable_events();
        cpu_enable_ticks();
        runstate_set(RUN_STATE_RUNNING);
        vm_state_notify(1, RUN_STATE_RUNNING);
    }

    /* We are sending this now, but the CPUs will be resumed shortly later */
    qapi_event_send_resume(&error_abort);

    return res;

When resuming a stopped guest, we take the else branch before we get to sending RESUME. vm_state_notify() runs virtio_vmstate_change(), among other things. This restarts I/O, triggering the BLOCK_IO_ERROR event.

Proposed upstream patch:
Message-Id: <20180423084518.2426-1-armbru>
https://lists.gnu.org/archive/html/qemu-devel/2018-04/msg03639.html

Fix included in qemu-kvm-rhev-2.12.0-2.el7

(In reply to Markus Armbruster from comment #8)
> Simplified reproducer with the upper layers peeled off:
>
> 1. Create a scratch image
>    $ dd if=/dev/zero of=scratch.img bs=1M count=100
>
>    Size doesn't actually matter.
>
> 2. Prepare blkdebug configuration:
>
>    $ cat >blkdebug.conf <<EOF
>    [inject-error]
>    event = "write_aio"
>    errno = "5"
>    EOF
>
>    Note that errno 5 is EIO.
>
> 3. Run a guest with an additional scratch disk, i.e. with additional
>    arguments
>    -drive if=none,id=scratch-drive,format=raw,werror=stop,file=blkdebug:blkdebug.conf:scratch.img
>    -device virtio-blk-pci,id=scratch,drive=scratch-drive
>
>    The blkdebug part makes all writes to the scratch drive fail with
>    EIO. The werror=stop pauses the guest on write errors.
>
> 4. Connect to the QMP socket e.g. like this:
>    $ socat UNIX:/your/qmp/socket READLINE,history=$HOME/.qmp_history,prompt='QMP> '
>
>    Issue QMP command 'qmp_capabilities':
>    QMP> { "execute": "qmp_capabilities" }
>
> 5. Boot the guest.
>
> 6. In the guest, write to the scratch disk, e.g. like this:
>
>    # dd if=/dev/zero of=/dev/vdb count=1
>
>    Do double-check the device specified with of= is actually the
>    scratch device!
>
> 7. Issue QMP command 'cont':
>    QMP> { "execute": "cont" }

Verified this bug on qemu-kvm-rhev-2.12.0-2.el7.x86_64:

After step 6:
{"timestamp": {"seconds": 1526610722, "microseconds": 326202}, "event": "BLOCK_IO_ERROR", "data": {"device": "scratch-drive", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block421", "reason": "Input/output error", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1526610722, "microseconds": 335677}, "event": "STOP"}

After step 7:
{"timestamp": {"seconds": 1526610729, "microseconds": 156841}, "event": "RESUME"}
{"timestamp": {"seconds": 1526610729, "microseconds": 157809}, "event": "BLOCK_IO_ERROR", "data": {"device": "scratch-drive", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block421", "reason": "Input/output error", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1526610729, "microseconds": 158925}, "event": "STOP"}

The RESUME event is now triggered before the BLOCK_IO_ERROR event; the event ordering RESUME -> BLOCK_IO_ERROR -> STOP occurs as expected. Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443
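The before/after event ordering described above can be modeled with a toy sketch. This is not QEMU code: the deferred STOP (the vmstop request raised while I/O restarts, processed later by the main loop) is only mimicked, to show why emitting RESUME before restarting I/O yields the verified ordering:

```python
def prepare_start(fixed):
    """Toy model of event emission around QEMU's vm_prepare_start().

    vm_state_notify() restarting I/O immediately re-hits the injected write
    error: BLOCK_IO_ERROR is emitted right away, while the resulting stop
    request is only processed later (modeled as a deferred STOP).
    """
    events = []
    pending = []

    def vm_state_notify():
        events.append("BLOCK_IO_ERROR")  # restarted I/O fails again
        pending.append("STOP")           # stop request, handled later

    if fixed:
        # Fixed ordering: announce RESUME before I/O is restarted.
        events.append("RESUME")
        vm_state_notify()
    else:
        # Buggy ordering: I/O restarts (and fails) before RESUME is sent.
        vm_state_notify()
        events.append("RESUME")

    events.extend(pending)  # main loop processes the pending stop request
    return events

assert prepare_start(fixed=False) == ["BLOCK_IO_ERROR", "RESUME", "STOP"]
assert prepare_start(fixed=True) == ["RESUME", "BLOCK_IO_ERROR", "STOP"]
```

The two assertions reproduce the orders seen in the bug: the buggy sequence matches the libvirt logs from comment#3 (where the late bare STOP downgrades the pause reason to "unknown"), and the fixed sequence matches the verification run on qemu-kvm-rhev-2.12.0-2.el7.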
Created attachment 1420445 [details]
libvirt log

Description of problem:

When a VM is paused due to erroneous storage, libvirt emits a corresponding life cycle event with the VIR_DOMAIN_PAUSED_IOERROR reason and then a VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON event. The state of the VM is also set appropriately:

# virsh -r domstate 2 --reason
paused (I/O error)

When the VM is then resumed manually while the I/O error still persists, it gets paused again immediately. However, in that case the life cycle event contains the VIR_DOMAIN_PAUSED_UNKNOWN reason. The I/O error is also no longer reported when asking for the VM state:

# virsh -r domstate 2 --reason
paused (unknown)

Additionally, the order of incoming events is weird, as follows:
- IO_ERROR event
- RESUME event
- PAUSED event

That means the real pause reason is lost.

Version-Release number of selected component (if applicable):
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
kernel-3.10.0-862.el7.x86_64

How reproducible:
100% reproducible on some setups. Not reproducible on other setups.

Steps to Reproduce:
1. Run a VM with a disk attached.
2. Cause an I/O error on the disk, such as blocking an iSCSI disk with iptables.
3. Wait until the VM gets paused.
4. Resume the VM.

Actual results:
The VM gets paused with the UNKNOWN pause reason. The order of events is IOERROR - RESUME - PAUSED.

Expected results:
The VM gets paused with the IOERROR pause reason. The order of events is RESUME - PAUSED - IOERROR or RESUME - IOERROR - PAUSED.

Additional info:
libvirtd.log is attached.