Bug 1860377 - qemu-kvm-rhev does not survive lun handover in between 2 NimbleStorage devices that operate in synchronous replication mode
Keywords:
Status: CLOSED DUPLICATE of bug 1855868
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Nobody
QA Contact: Avihai
URL:
Whiteboard:
Depends On: 1855868
Blocks:
 
Reported: 2020-07-24 11:46 UTC by Marian Jankular
Modified: 2022-07-27 14:11 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-30 06:58:19 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
System tap script to see if/why multipath is failing IO (2.01 KB, text/x-matlab) - 2020-07-30 03:57 UTC, Ben Marzinski
systemtap script to check for ioctl error to device-mapper devices (1001 bytes, text/x-matlab) - 2020-08-27 01:25 UTC, Ben Marzinski

Description Marian Jankular 2020-07-24 11:46:18 UTC
Description of problem:
qemu-kvm-rhev does not survive lun handover in between 2 NimbleStorage devices that operate in synchronous replication mode

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-44.el7_8.2.x86_64

How reproducible:
20% of lun handovers in customer environment 

Steps to Reproduce:
1. Start an HA VM
2. Conduct a LUN handover

Actual results:
The VM is paused. If the resume behavior is "kill", the VM is killed and restarted on another host; if the resume behavior is "auto-resume", the VM remains paused.

Expected results:
LUN handover should be transparent to qemu, since dd to the multipath device does survive the handover even with "oflag=direct".

Additional info:
https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/ypq1584111570400.html

https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/egp1534956790827.html

https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/udz1584387105907.html

Comment 11 Ademar Reis 2020-07-27 16:47:05 UTC
While watching the video, I saw this in the UI, where the user performs the actual handover:

----------------------
"Handover" is a controlled way to change the direction of replication for a
synchronous replicated volume collection with no loss of data. This process is
transparent to the host(s) and does not cause any outage.

** Note that I/O will be paused for a very short duration during this operation. **

If an array in the upstream pool is not reachable, ensure that no host is
connected to any upstream volume before using the option to override the
upstream down condition. Otherwise, it might cause data corruption
----------------------
(emphasis mine)

Given that it fails in ~20% of the cases, my question here would be how "very short" this pause in I/O is, and what they mean by "pause". What would be the expected behavior from QEMU here?

Comment 12 Ben Marzinski 2020-07-27 18:15:13 UTC
Looking at the multipath log messages, there doesn't appear to be any time when the paths are down, so IO should never be failing back, and it doesn't appear to be. It's possible that an ioctl could fail if it comes in while the paths are transitioning. There will unavoidably be a time when a path's ALUA access state has changed, and the path should be listed as a ghost path, but multipath won't have reacted to that yet. It seems possible that an ioctl would fail here. That being said, I have no evidence that an ioctl DID fail. Also, if the issue were related to the ioctl coming in and picking the wrong path, the gap in timing here seems unlikely:

Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated
Jul 22 11:37:20 hpb-ws02-11 kernel: VirtualMachines: port 2(vnet0) entered disabled state
There is over a minute after multipath's state has completely transitioned before the "entered disabled state" message. But again, I don't really know enough about kvm to know if a delay like this would be expected.

Comment 15 John Ferlan 2020-07-28 15:54:47 UTC
(In reply to Ben Marzinski from comment #12)
> Looking at the multipath log messages, there doesn't appear to be any time
> when the paths are down, so IO should never be failing back, and it doesn't
> appear to be. It's possible that an ioctl could fail, if it comes in while
> the paths are transitioning. There will unavoidably be a time when a paths
> ALUA access state has changed, and the path should be listed as a ghost
> path, but multipath won't have reacted to that yet. It seems possible that
> an ioctl would fail here. That being said, I have no evidence that an ioctl
> DID fail. Also if the issue was related to the ioctl coming it and picking
> the wrong
> path, the gap in timing here seems unlikely:
> 
> Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A
> non-preferred supports tolusna
> Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A
> non-preferred supports tolusna
> Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1:
> sdf - tur checker reports path is in standby state
> Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
> Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1:
> sdb - tur checker reports path is in standby state
> Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
> Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1:
> sdd - tur checker reports path is up
> Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated
> Jul 22 11:37:20 hpb-ws02-11 kernel: VirtualMachines: port 2(vnet0) entered
> disabled state
> There is over a minute after multipath's state has completely transitioned
> before the "entered disabled state" message. But again, I don't really know
> enough about kvm to know if a delay like this would be expected.


Hmmm... matching this up to the provided iothread enabled vdsm log shows:

2020-07-22 11:35:46,615+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
2020-07-22 11:35:46,615+0200 INFO  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
2020-07-22 11:35:46,617+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/block/sdb')) (udev:218)
2020-07-22 11:35:46,620+0200 WARN  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)
2020-07-22 11:35:46,624+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
2020-07-22 11:35:46,999+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host4/session2/target4:0:0/4:0:0:0/block/sdd')) (udev:218)
2020-07-22 11:35:47,002+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host5/session3/target5:0:0/5:0:0:0/block/sdf')) (udev:218)
2020-07-22 11:35:47,008+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host6/session4/target6:0:0/6:0:0:0/block/sdh')) (udev:

There's a similar pattern in the provided iothread disabled vdsm log:

2020-07-22 11:51:31,610+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
2020-07-22 11:51:31,610+0200 INFO  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
2020-07-22 11:51:31,613+0200 WARN  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)
2020-07-22 11:51:31,625+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
2020-07-22 11:51:31,990+0200 DEBUG (check/loop) [storage.check] START check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (delay=0.01) (check:289)
2020-07-22 11:51:32,005+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.02) (check:359)

Both have only 1 such message and it seems both have failures right around those times. Just before in each there's a "CPU stopped: onIOError (vm:6100)".
I'm only pattern matching here, so it could be a false positive.


Still, as has been noted, being able to get more logs from qemu would help - though of course doing so could throw off timing as well. Furthermore, the code is based on qemu 2.12, which by this time is rather old; if this is a qemu issue, it could already be resolved in some later version.

Comment 16 Ben Marzinski 2020-07-28 17:19:13 UTC
(In reply to John Ferlan from comment #15)
>
> 
> Hmmm... matching this up to the provided iothread enabled vdsm log shows:
> 
> 2020-07-22 11:35:46,615+0200 DEBUG (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2
> opaque None (vm:5464)
> 2020-07-22 11:35:46,615+0200 INFO  (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend
> (vm:6100)
> 2020-07-22 11:35:46,617+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/
> block/sdb')) (udev:218)
> 2020-07-22 11:35:46,620+0200 WARN  (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error
> (vm:4005)

Huh. sda isn't part of the multipath device listed here at all. Or is that sda on the guest, referring to the multipath device on the host? I'm not good at reading vdsm logs.

> 2020-07-22 11:35:46,624+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
> 2020-07-22 11:35:46,999+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/platform/host4/session2/target4:0:0/4:0:0:0/
> block/sdd')) (udev:218)
> 2020-07-22 11:35:47,002+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/platform/host5/session3/target5:0:0/5:0:0:0/
> block/sdf')) (udev:218)
> 2020-07-22 11:35:47,008+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/platform/host6/session4/target6:0:0/6:0:0:0/
> block/sdh')) (udev:
> 
> There's a similar pattern in the provided iothread disabled vdsm log:
> 
> 2020-07-22 11:51:31,610+0200 DEBUG (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2
> opaque None (vm:5464)
> 2020-07-22 11:51:31,610+0200 INFO  (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend
> (vm:6100)
> 2020-07-22 11:51:31,613+0200 WARN  (libvirt/events) [virt.vm]
> (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error
> (vm:4005)
> 2020-07-22 11:51:31,625+0200 DEBUG (mpathlistener) [storage.udev] Received
> udev event (action=change,
> device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
> 2020-07-22 11:51:31,990+0200 DEBUG (check/loop) [storage.check] START check
> '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (delay=0.01) (check:289)
> 2020-07-22 11:51:32,005+0200 DEBUG (check/loop) [storage.check] FINISH check
> '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.02)
> (check:359)
> 
> Both have only 1 such message and it seems both have failures right around
> those times. Just before in each there's a "CPU stopped: onIOError
> (vm:6100)".
> I'm only pattern matching here, so it could be a false positive.
> 
> 
> Still like has been noted being able to get more logs from qemu would help -
> of course doing so could throw off timing as well.  Furthermore, code is
> based on qemu 2.12, which by this time is rather old and if a qemu issue
> could already be resolved in some future version.

Comment 17 Nir Soffer 2020-07-28 17:27:34 UTC
The vdsm log (attachment 1702343 [details]) shows that the failover was not transparent
on the host side.

$ grep Stopped vdsm.log
2020-07-22 11:53:01,980+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Stopped detail 1 opaque None (vm:5464)

By default vdsm configures multipath with no_path_retry = 4, which means that with
the default path checker interval (5 seconds), I/O may be queued for up to about
20 seconds (4 x 5s) if all paths for a LUN have failed.

If the failover is slower than that, qemu will get an I/O error and the VM will
pause, as we see in the vdsm log.

Vdsm will resume the VM if it detected an issue with storage, and then
detected that storage became OK again. But vdsm monitors only reads, so it is
possible that vdsm will not detect any issue with storage, and therefore will
not resume the paused VM.
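
For illustration, a minimal sketch of a read-only liveness check along these lines
(an approximation, not vdsm's actual code; the metadata path is the one that
appears in the check logs below):

import subprocess

def read_check(path, timeout=10):
    # Read one 4 KiB block with O_DIRECT, bypassing the page cache.
    # A successful read only proves the device is readable; it says nothing
    # about whether writes (such as the guest's) would fail.
    try:
        result = subprocess.run(
            ["dd", "if=" + path, "of=/dev/null", "bs=4096", "count=1",
             "iflag=direct"],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

print(read_check("/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata"))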

Looking for monitoring events in the log:

$ grep 'FINISH check' vdsm.log
2020-07-22 11:49:51,998+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)

...
2020-07-22 11:52:51,995+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)

Storage checks were successful before the pause...

2020-07-22 11:53:01,999+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)
...
2020-07-22 11:59:51,997+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)

And they were still OK 6 minutes after the pause, so vdsm did not resume
the VM.

Looking for multipath health reports:

$ grep 'FINISH multipath_health' vdsm-noiothread.log
2020-07-22 04:01:11,985+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH multipath_health return={u'29842180f3f697a816c9ce900ddb3faa1': {'valid_paths': 0, 'failed_paths': [u'sdj', u
'sdk']}} from=::ffff:10.22.0.247,43916, task_id=9b827991-4803-4d78-87ee-05c6dba9eb51 (api:54)
...
2020-07-22 11:59:52,029+0200 INFO  (jsonrpc/4) [vdsm.api] FINISH multipath_health return={u'29842180f3f697a816c9ce900ddb3faa1': {'valid_paths': 0, 'failed_paths': [u'sdj', u'sdk']}} from=::ffff:10.22.0.247,43916, task_id=8d2d77ce-9167-4f9b-9c41-43ab091b5bdd (api:54)

One LUN had no paths during the entire time.


Looking in the other log (attachment 1702341 [details]):

$ grep Stopped vdsm-iothread.log 
2020-07-22 11:37:21,478+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Stopped detail 1 opaque None (vm:5464)

$ grep 'FINISH check' vdsm-iothread.log 
2020-07-22 11:31:02,647+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.05) (check:359)
...
2020-07-22 11:58:05,066+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.03) (check:359)

No error after the VM got paused; storage was fine from vdsm's point of view.

Looking for multipath health reports, which are not used for resuming
paused VMs:

$ grep 'FINISH multipath_health' vdsm-iothread.log
2020-07-22 11:37:07,328+0200 INFO  (jsonrpc/5) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=844683d5-5e9e-431a-a60b-729292d11452 (api:54)
2020-07-22 11:37:22,341+0200 INFO  (jsonrpc/0) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=ba9f1ca9-8773-43a7-9456-efe0fd913d68 (api:54)
2020-07-22 11:37:37,364+0200 INFO  (jsonrpc/7) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=e0f9b5c3-3206-471e-be65-ba72edc63b57 (api:54)
2020-07-22 11:37:52,433+0200 INFO  (jsonrpc/4) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=fc31d591-3534-4749-930e-39a156d11375 (api:54)

multipath did not report any issue with any of the paths.


RHV does not promise that a VM will be resumed after it was paused in all cases.
This may be what customers expect, but there is no code trying to do this in
RHV. We promise to resume VMs only after we detect an issue with storage and
then see the storage recover. Because we check only reads, we cannot detect all
issues.

We could change monitoring to also detect write failures by adding a special
volume for monitoring; maybe it could help in a case like this (see the sketch below).
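
A sketch of what such a write-based check could look like, assuming a dedicated
volume reserved for monitoring (the "monitoring" path below is hypothetical, not
an existing vdsm path):

import subprocess

def write_check(path, timeout=10):
    # Write one 4 KiB block with O_DIRECT to a volume reserved for monitoring
    # (never to user data). A failure here would catch write-path problems
    # that a read-only check cannot see.
    try:
        result = subprocess.run(
            ["dd", "if=/dev/zero", "of=" + path, "bs=4096", "count=1",
             "oflag=direct"],
            capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Hypothetical per-domain monitoring volume.
print(write_check("/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/monitoring"))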

We could change the resume behavior to resume the VM after some delay if storage
seems to be accessible, but this may end up as endless failing resume attempts.

With the current version, if the VM was not resumed automatically, the admin can resume it manually.

We don't know why qemu was paused; maybe this was a failing ioctl, as suggested
by Ben in comment 12. We do know why it was not resumed by vdsm.

I think we can move this bug to RHV, since the resume issue looks like a missing
feature in RHV.

Comment 18 Nir Soffer 2020-07-28 17:41:23 UTC
(In reply to Ben Marzinski from comment #16)
> (In reply to John Ferlan from comment #15)
...
> > 2020-07-22 11:35:46,620+0200 WARN  (libvirt/events) [virt.vm]
> > (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error
> > (vm:4005)
> 
> Huh. sda isn't part of the multipath device listed here at all. Or is that
> sda on the guest,

Kind of. More precisely, it is the device name used in the libvirt domain XML.

Following the VM id (db6db1f6-cb41-40cd-b850-0026e07861a6) shows:

        <disk device="lun" sgio="filtered" snapshot="no" type="block">
            <target bus="scsi" dev="sda" />
            <source dev="/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1">
                <seclabel model="dac" relabel="no" type="none" />
            </source>
            <driver cache="none" error_policy="stop" io="native" name="qemu" type="raw" />
            <alias name="ua-e28e932f-242f-465f-9d18-272ecbc8636e" />
            <address bus="0" controller="0" target="0" type="drive" unit="0" />
            <boot order="1" />
        </disk>

The relevant device is:
/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1
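
For reference, a minimal sketch of how the libvirt target name maps to the host
device, assuming the domain XML (or the disk element above) is at hand; this is
illustration only, not vdsm code:

import xml.etree.ElementTree as ET

def lun_mapping(domain_xml):
    # Map each guest-visible target dev (e.g. "sda") to the host block
    # device backing it (e.g. "/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1").
    root = ET.fromstring(domain_xml)
    mapping = {}
    for disk in root.iter("disk"):
        target = disk.find("target")
        source = disk.find("source")
        if target is not None and source is not None and source.get("dev"):
            mapping[target.get("dev")] = source.get("dev")
    return mapping

Given the disk element above, this returns
{'sda': '/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1'}, which is how "sda" in
the I/O error message maps back to the multipath device.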

I don't know why we are using error_policy="stop" for LUNs (with "stop", qemu
pauses the guest on I/O errors instead of reporting them to it). We have these bugs
about propagating errors to the guest for LUNs, which would avoid the pause:
- bug 1669367
- bug 1314160

Comment 20 Nir Soffer 2020-07-28 20:45:42 UTC
Actually this is much simpler: RHV does not have any monitoring for
direct LUNs. The monitoring logs mentioned in comment 17 are for the
storage domain:

    3e706ec6-90a8-4426-88e1-7749924cdab0

But the failing LUN is:

   /dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1

So if the failover was done at the LUN level, vdsm cannot detect
the issue, since it does not monitor LUNs.

Even if the entire server were down, and vdsm were detecting the issue
with the storage domain 3e706ec6-90a8-4426-88e1-7749924cdab0, it would
not resume a VM that is not using this storage domain, since storage
monitoring is done only per storage domain.

So the issue is not improving storage monitoring for storage domains,
but a completely missing feature - monitoring direct LUNs.

This came up a few years ago, but I cannot find the bug now.

Looking at multipathd show output, we have interesting info:

# multipathd show maps format "%n %t %x %0 %4"
name                              dm-st  failures path_faults q_timeouts
360014057a53a893c2ed4d8a85b5f3368 active 0        0           0         
36001405f40916efb8724270b117540fd active 0        0           0         
36001405bc00838f1ab744e88e65f1ecc active 0        0           0         
3600140569bf6f34c30a467c89a087d08 active 0        0           0         
360014054dce1e8538004223acc63c4ad active 0        0           0         
360014055aedb3cbe464471cba9df3fc6 active 0        0           0         
3600140571477c50c715418d8dd6495be active 0        0           0         
360014058657c2a1941841348f19c1a50 active 0        0           0         
36001405a9ac1f1638e9439295979c00a active 0        0           0         
36001405bb847e939b364565b42ecf80b active 0        0           0         

# multipathd show paths format "%d %T %m %x %0"
dev chk_st multipath                          failures
sdc ready  360014057a53a893c2ed4d8a85b5f3368  0       
sdb ready  36001405f40916efb8724270b117540fd  0       
sdd ready  36001405bc00838f1ab744e88e65f1ecc  0       
sdf ready  3600140569bf6f34c30a467c89a087d08  0       
sde ready  360014054dce1e8538004223acc63c4ad  0       
sdg ready  360014055aedb3cbe464471cba9df3fc6  0       
sdh ready  3600140571477c50c715418d8dd6495be  0       
sdj ready  360014058657c2a1941841348f19c1a50  0       
sda ready  36001405a9ac1f1638e9439295979c00a  0       
sdi ready  36001405bb847e939b364565b42ecf80b  0       

Can we use the failures count (or another counter) for detecting errors?

For example, vdsm could do something like the following (a rough sketch in code appears after the list):

- On each monitoring cycle, fetch failure count for all maps
- If failure count changed in this cycle, mark the map as "failed"
- Wait for the next multipath check.
- If multipath reports some active paths, clear the "failed" status,
  and resume paused VMs using this map.
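
A minimal sketch of this loop, using the same "multipathd show maps format"
specifiers shown above in this comment (%n = name, %x = failures); the interval
and the resume step are placeholders, not vdsm code:

import subprocess
import time

def map_failure_counts():
    # Needs root; same command as above, limited to name and failure count.
    out = subprocess.run(
        ["multipathd", "show", "maps", "format", "%n %x"],
        capture_output=True, text=True, check=True).stdout.splitlines()
    counts = {}
    for line in out[1:]:              # out[0] is the header row
        name, failures = line.split()
        counts[name] = int(failures)
    return counts

previous = map_failure_counts()
while True:
    time.sleep(10)                    # monitoring interval, placeholder value
    current = map_failure_counts()
    for name, failures in current.items():
        if failures > previous.get(name, failures):
            # The failure counter grew since the last cycle: the map went
            # through a period with no usable paths. Mark it failed here;
            # once multipathd reports active paths again, clear the mark
            # and resume paused VMs using this map (placeholder).
            print("map %s had a failure window" % name)
    previous = current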

Ben, do you think this can detect a very short outage that is not detectable
by looking at failed paths?

Comment 21 Ben Marzinski 2020-07-28 22:40:55 UTC
Looking at these different counters:

path failures (show paths %0):
------------------------------
Incremented by the kernel whenever it fails a path, either because IO to it failed or because multipathd told it to fail the path (because the path checker failed).

map failures (show maps %x):
-----------------------------
Incremented whenever a map enters a state with no active paths and queueing disabled. It is possible for multipathd to miss an instance of this if no_path_retry is set to "fail". In that case, the kernel could run out of paths and fail back an IO error while multipathd was in the process of checking a path which had just become active. In this case, multipathd would never see the map with no active paths. However, if no_path_retry is set to queue for some number of checks, multipathd won't start the timer until it sees that all the paths are down, so there should be no possibility of missing a failure.

map path_faults (show maps %0):
---------------------------------
Incremented when multipathd moves a path to the failed state. Usually close to the path's failure count, but may be different, due to differences between multipathd and kernel path states.

map q_timeouts (show maps %4):
------------------------------
Incremented whenever multipathd loses all of its paths while no_path_retry is set to a positive number, so multipathd will queue for a limited time.

So yes, this should allow you to see if a failure has happened, even if it's not currently happening. You can also reset the map stats (but not the path failures stats, which are controlled by the kernel)
by running

# multipathd reset maps stats

or

# multipathd reset map <multipath_device> stats

However, looking at the messages for this bug, there's no evidence that the multipath devices ever lost all of their paths. So I don't see how IO would have failed here.

Comment 22 Nir Soffer 2020-07-29 00:33:35 UTC
We know that qemu had an error on disk sda, which is
/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1 (see comment 18).

We had 2 events:

Event 1:

VM paused:

2020-07-22 11:35:46,620+0200 WARN  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)

Relevant log from comment 2:

Jul 22 11:35:05 hpb-ws02-11 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 3:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 4:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is in standby state
Jul 22 11:35:06 hpb-ws02-11 multipathd: 8:48: reinstated
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdh: prio changed from 50 to 1
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdd: prio changed from 50 to 1
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdf: prio changed from 1 to 50
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdb: prio changed from 1 to 50
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:80 1 8:16 1 service-time 0 2 1 8:112 1 8:48 1]
Jul 22 11:35:06 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:06 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:16 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is in standby state
Jul 22 11:35:16 hpb-ws02-11 multipathd: 8:112: reinstated
Jul 22 11:35:17 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is up
Jul 22 11:35:17 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:19 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is up
Jul 22 11:35:19 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 3:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 4:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state S non-preferred supports tolusna

--- VM paused here ---

Jul 22 11:35:47 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is up
Jul 22 11:35:51 hpb-ws02-11 multipathd: 8:112: reinstated
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdf: prio changed from 50 to 1
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdb: prio changed from 50 to 1
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdh: prio changed from 1 to 50
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdd: prio changed from 1 to 50
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:112 1 8:48 1 service-time 0 2 1 8:80 1 8:16 1]
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated


Event 2:

VM started:

2020-07-22 11:46:51,244+0200 INFO  (vm/db6db1f6) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6'

VM paused:

2020-07-22 11:51:31,613+0200 WARN  (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)

Relevant log from comment 4:


Jul 22 11:50:01 hpb-ms02-16 systemd: Started Session 2012 of user root.
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 3:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:50:43 hpb-ms02-16 multipathd: 8:16: reinstated
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdh: prio changed from 50 to 1
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdb: prio changed from 50 to 1
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdd: prio changed from 1 to 50
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdf: prio changed from 1 to 50
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:48 1 8:80 1 service-time 0 2 1 8:112 1 8:16 1]
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is in standby state
Jul 22 11:50:43 hpb-ms02-16 multipathd: 8:112: reinstated
Jul 22 11:50:43 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:43 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:51 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:50:51 hpb-ms02-16 multipathd: 8:48: reinstated
Jul 22 11:50:52 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is up
Jul 22 11:50:52 hpb-ms02-16 multipathd: 8:80: reinstated
Jul 22 11:51:05 hpb-ms02-16 systemd: Started Session c7339 of user root.
Jul 22 11:51:05 hpb-ms02-16 systemd: Started Session c7340 of user root.
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 3:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 5:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna

--- VM paused here ---

Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is up
Jul 22 11:51:38 hpb-ms02-16 multipathd: 8:16: reinstated
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdd: prio changed from 50 to 1
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdf: prio changed from 50 to 1
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdh: prio changed from 1 to 50
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdb: prio changed from 1 to 50
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:112 1 8:16 1 service-time 0 2 1 8:48 1 8:80 1]
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is up
Jul 22 11:51:38 hpb-ms02-16 multipathd: 8:112: reinstated
Jul 22 11:51:38 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:38 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:46 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is in standby state
Jul 22 11:51:46 hpb-ms02-16 multipathd: 8:48: reinstated
Jul 22 11:51:47 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:51:47 hpb-ms02-16 multipathd: 8:80: reinstated


I don't fully understand all the messages, but the pattern I see is:

11:35:06 load table
11:35:46 VM pause with I/O error
11:35:51 load table

11:50:43 load table
11:51:31 VM pause with I/O error
11:51:38 load table

I don't have any idea why the map is reloaded; maybe the handover is not
really transparent to the host, and the kernel triggers a reload of the map?

Maybe reloading the map in this case causes an I/O error while the paths are
being reinstated after the reload?

Ben, what do you think?

Comment 23 Ben Marzinski 2020-07-30 03:45:39 UTC
(In reply to Nir Soffer from comment #22)

> I don't fully understand all the messages but the pattern I see is:
> 
> 11:35:06 load table
> 11:35:46 VM pause with I/O error
> 11:35:51 load table
> 
> 11:50:43 load table
> 11:51:31 VM pause with I/O error
> 11:51:38 load table
> 
> I don't have any idea why the map is reloaded, maybe the handover is not 
> really transparent to the host and the kernel triggers a reload of the map?

These reloads are actually unnecessary in this case. The issue is that multipathd sees that a bunch of paths have changed priority, and it is not smart enough to realize that they all changed priority in such a way that they still belong to the same pathgroups, so it does not actually need a reload to update the pathgroups.

> Maybe reloading the map in this case cause I/O error while the paths are
> being
> reinstated after the reload?

The reinstated messages are a red herring. Multipath is printing them because the path has changed from being ready to being a ghost, or vice versa. Whenever the path changes to an active state, multipathd notifies the kernel that the path is usable. However, as far as the kernel is concerned, both of those states are the same (both mean that the path is active).

When the kernel reloads a multipath device, it first waits for all outstanding IO to the underlying devices to complete. Then it quiesces the device so no new IO can be started to an underlying device; next, it switches the table (which in this case simply switched the order of the pathgroups, which was unnecessary); finally, it reissues any queued IO and returns to its normal state. After loading a new table, the kernel always assumes that all the paths are active. Multipathd needs to explicitly tell the kernel if that's not the case, and it always logs messages when that happens.
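
For illustration only (multipathd drives this through libdevmapper itself; do not
run this against a live map), the userspace steps that correspond to that sequence
look roughly like the following, with the table string taken from the log above:

import subprocess

DEV = "2e1d2f345f5f3e3b56c9ce900ddb3faa1"
TABLE = ("0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 "
         "service-time 0 2 1 8:112 1 8:16 1 service-time 0 2 1 8:48 1 8:80 1")

# Suspend: the kernel waits for outstanding I/O and quiesces the device.
subprocess.run(["dmsetup", "suspend", DEV], check=True)
# Load the new table into the device's inactive slot.
subprocess.run(["dmsetup", "load", DEV, "--table", TABLE], check=True)
# Resume: the kernel switches to the new table and reissues any queued I/O.
subprocess.run(["dmsetup", "resume", DEV], check=True)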

> Ben, what do you think?

The biggest issue I have with this bug is that there is no evidence that IO ever failed from multipath's point of view. Personally, that's the first thing I would like to understand.

Comment 24 Ben Marzinski 2020-07-30 03:57:30 UTC
Created attachment 1702886 [details]
System tap script to see if/why multipath is failing IO

Are you able to run a systemtap script on the machine while reproducing this? If so, can you please try this one out? You need to run it with

# stap -g check_for_error.stp

since the script has embedded C code to grab the name of the device failing the IO. 

If you don't know how to set up systemtap, you should just need to run

# yum install systemtap
# stap-prep

to install all the necessary packages. Then run the script as shown above, wait until it prints "starting", and then reproduce the issue. When you are done, you can stop the script with ctrl-c. Please post the script output to the bugzilla.

This script will print output when a multipath request ends in an IO error, both for the individual paths, and for the entire device.

Comment 25 Nir Soffer 2020-07-30 10:54:43 UTC
Marian, can you check comment 24?

Comment 27 Ben Marzinski 2020-08-10 21:02:29 UTC
Given that dd to the device works during the LUN handover, and that when this is reproduced, it doesn't appear that multipath ever failed an IO request, I don't think that a multipath IO failure is the root cause of this issue. So, why does qemu see an error on this device?

Comment 33 Nir Soffer 2020-08-24 15:42:44 UTC
Marian, can you answer Ben's question about the "ending" line? See comment 29.

Comment 35 Nir Soffer 2020-08-24 18:04:17 UTC
Ben, how can qemu get EIO (comment 34) if multipath did not return an error and
had valid paths?

Comment 36 Nir Soffer 2020-08-24 18:13:19 UTC
Marian, is there a good reason to use "Enable SCSI Pass-Through"
in this setup?

    <disk device="lun" sgio="filtered" snapshot="no" type="block">

If there is no such reason, is this reproducible after unchecking
"Enable SCSI Pass-Through"?

Is it reproducible with virtio-blk (Interface: virtio)?

Comment 39 Ben Marzinski 2020-08-26 14:10:20 UTC
(In reply to Nir Soffer from comment #35)
> Ben, how can qemu get EIO (comment 34) if multipath did not return an error
> and
> have valid paths?

If the IO was done with an SG_IO ioctl, my systemtap script won't catch that. I'll see if I can write a systemtap script to catch failing ioctls.

Comment 40 Ben Marzinski 2020-08-27 01:25:53 UTC
Created attachment 1712761 [details]
systemtap script to check for ioctl error to device-mapper devices

This script should hopefully be able to tell us whether the error that was detected came from a SCSI ioctl rather than from regular IO. You run it the same way as the previous systemtap script. Can you run

# stap -g check_ioctl.stp

Wait for it to print "starting", and then reproduce the issue?

Comment 41 Eyal Shenitzky 2020-08-31 14:10:49 UTC
Nir, anything else to do here?

Comment 42 Nir Soffer 2020-08-31 14:30:04 UTC
(In reply to Eyal Shenitzky from comment #41)
> Nir, anything else to do here?

Yes, Ben provided a new test script, and this needs to be tested in
the customer setup.

We are also waiting for a reply to comment 38.

Comment 43 Nir Soffer 2020-08-31 14:33:51 UTC
Marian, please see comment 38 and comment 40.

Also see comment 34 from Kevin. If you know how to use the scsi_*
tracepoints, it would be a good idea to collect more info in the next run.

