Description of problem:
qemu-kvm-rhev does not survive a LUN handover between 2 NimbleStorage devices that operate in synchronous replication mode.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-44.el7_8.2.x86_64

How reproducible:
~20% of LUN handovers in the customer environment

Steps to Reproduce:
1. Start an HA VM.
2. Conduct a LUN handover.

Actual results:
The VM is paused. If the resume behavior is "kill", it is killed and restarted on another host; if the resume behavior is "auto-resume", the VM remains paused.

Expected results:
The LUN handover should be transparent to qemu, as the device does survive a dd to the multipath device even with "oflag=direct".

Additional info:
https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/ypq1584111570400.html
https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/egp1534956790827.html
https://infosight.hpe.com/InfoSight/media/cms/active/public/pubs_GUI_Administration_Guide_NOS_52x.whz/udz1584387105907.html
While watching the video, I saw this in the UI, where the user performs the actual handover:

----------------------
"Handover" is a controlled way to change the direction of replication for a synchronous replicated volume collection with no loss of data. This process is transparent to the host(s) and does not cause any outage.

** Note that I/O will be paused for a very short duration during this operation. **

If an array in the upstream pool is not reachable, ensure that no host is connected to any upstream volume before using the option to override the upstream down condition. Otherwise, it might cause data corruption.
----------------------
(emphasis mine)

Given that it fails in ~20% of the cases, my question here would be how "very short" this pause in I/O is, and what they mean by "pause". What would be the expected behavior from QEMU here?
Looking at the multipath log messages, there doesn't appear to be any time when the paths are down, so IO should never be failing back, and it doesn't appear to be. It's possible that an ioctl could fail if it comes in while the paths are transitioning. There will unavoidably be a time when a path's ALUA access state has changed, and the path should be listed as a ghost path, but multipath won't have reacted to that yet. It seems possible that an ioctl would fail here. That being said, I have no evidence that an ioctl DID fail. Also, if the issue was related to the ioctl coming in and picking the wrong path, the gap in timing here seems unlikely:

Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated
Jul 22 11:37:20 hpb-ws02-11 kernel: VirtualMachines: port 2(vnet0) entered disabled state

There is over a minute after multipath's state has completely transitioned before the "entered disabled state" message. But again, I don't really know enough about kvm to know if a delay like this would be expected.
(In reply to Ben Marzinski from comment #12)
> Looking at the multipath log messages, there doesn't appear to be any time
> when the paths are down, so IO should never be failing back, and it doesn't
> appear to be. It's possible that an ioctl could fail, if it comes in while
> the paths are transitioning. There will unavoidably be a time when a paths
> ALUA access state has changed, and the path should be listed as a ghost
> path, but multipath won't have reacted to that yet. It seems possible that
> an ioctl would fail here. That being said, I have no evidence that an ioctl
> DID fail. Also if the issue was related to the ioctl coming it and picking
> the wrong path, the gap in timing here seems unlikely:
>
> Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
> Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
> Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
> Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
> Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
> Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
> Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
> Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated
> Jul 22 11:37:20 hpb-ws02-11 kernel: VirtualMachines: port 2(vnet0) entered disabled state
>
> There is over a minute after multipath's state has completely transitioned
> before the "entered disabled state" message. But again, I don't really know
> enough about kvm to know if a delay like this would be expected.

Hmmm... matching this up to the provided iothread-enabled vdsm log shows:

2020-07-22 11:35:46,615+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
2020-07-22 11:35:46,615+0200 INFO (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
2020-07-22 11:35:46,617+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/block/sdb')) (udev:218)
2020-07-22 11:35:46,620+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)
2020-07-22 11:35:46,624+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
2020-07-22 11:35:46,999+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host4/session2/target4:0:0/4:0:0:0/block/sdd')) (udev:218)
2020-07-22 11:35:47,002+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host5/session3/target5:0:0/5:0:0:0/block/sdf')) (udev:218)
2020-07-22 11:35:47,008+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host6/session4/target6:0:0/6:0:0:0/block/sdh')) (udev:

There's a similar pattern in the provided iothread-disabled vdsm log:

2020-07-22 11:51:31,610+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
2020-07-22 11:51:31,610+0200 INFO (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
2020-07-22 11:51:31,613+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)
2020-07-22 11:51:31,625+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
2020-07-22 11:51:31,990+0200 DEBUG (check/loop) [storage.check] START check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (delay=0.01) (check:289)
2020-07-22 11:51:32,005+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.02) (check:359)

Both have only 1 such message, and it seems both have failures right around those times. Just before in each there's a "CPU stopped: onIOError (vm:6100)". I'm only pattern matching here, so it could be a false positive.

Still, as has been noted, being able to get more logs from qemu would help - of course doing so could throw off timing as well. Furthermore, the code is based on qemu 2.12, which by now is rather old; a qemu issue could already be resolved in some later version.
(In reply to John Ferlan from comment #15)
> Hmmm... matching this up to the provided iothread enabled vdsm log shows:
>
> 2020-07-22 11:35:46,615+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
> 2020-07-22 11:35:46,615+0200 INFO (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
> 2020-07-22 11:35:46,617+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host3/session1/target3:0:0/3:0:0:0/block/sdb')) (udev:218)
> 2020-07-22 11:35:46,620+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)

Huh. sda isn't part of the multipath device listed here at all. Or is that sda on the guest, referring to the multipath device on the host? I'm not good at reading vdsm logs.

> 2020-07-22 11:35:46,624+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
> 2020-07-22 11:35:46,999+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host4/session2/target4:0:0/4:0:0:0/block/sdd')) (udev:218)
> 2020-07-22 11:35:47,002+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host5/session3/target5:0:0/5:0:0:0/block/sdf')) (udev:218)
> 2020-07-22 11:35:47,008+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/platform/host6/session4/target6:0:0/6:0:0:0/block/sdh')) (udev:
>
> There's a similar pattern in the provided iothread disabled vdsm log:
>
> 2020-07-22 11:51:31,610+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Suspended detail 2 opaque None (vm:5464)
> 2020-07-22 11:51:31,610+0200 INFO (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') CPU stopped: onSuspend (vm:6100)
> 2020-07-22 11:51:31,613+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)
> 2020-07-22 11:51:31,625+0200 DEBUG (mpathlistener) [storage.udev] Received udev event (action=change, device=Device(u'/sys/devices/virtual/block/dm-20')) (udev:218)
> 2020-07-22 11:51:31,990+0200 DEBUG (check/loop) [storage.check] START check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (delay=0.01) (check:289)
> 2020-07-22 11:51:32,005+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.02) (check:359)
>
> Both have only 1 such message and it seems both have failures right around
> those times. Just before in each there's a "CPU stopped: onIOError (vm:6100)".
> I'm only pattern matching here, so it could be a false positive.
>
> Still like has been noted being able to get more logs from qemu would help -
> of course doing so could throw off timing as well. Furthermore, code is
> based on qemu 2.12, which by this time is rather old and if a qemu issue
> could already be resolved in some future version.
The vdsm log (attachment 1702343 [details]) shows that the failover was not transparent on the host side.

$ grep Stopped vdsm.log
2020-07-22 11:53:01,980+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Stopped detail 1 opaque None (vm:5464)

By default vdsm configures multipath with no_path_retry = 4, which means, with the default I/O timeout, 20 seconds during which I/O may be queued if all paths to a LUN have failed. If the failover was slower, qemu will get an I/O error and the VM will pause, as we see in the vdsm log.

Vdsm will resume the VM if it detected an issue with storage, and then detected that storage became OK again. But vdsm monitors only reads, so it is possible that vdsm will not detect any issue with storage, and thus will not resume the paused VM.

Looking for monitoring events in the log:

$ grep 'FINISH check' vdsm.log
2020-07-22 11:49:51,998+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)
...
2020-07-22 11:52:51,995+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)

Storage checks were successful before the pause...

2020-07-22 11:53:01,999+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)
...
2020-07-22 11:59:51,997+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.01) (check:359)

And they were still OK 6 minutes after the pause, so vdsm did not resume the VM.
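For reference, the behavior described above corresponds to a multipath.conf fragment along these lines. This is an illustrative sketch, not the actual file from this environment; the 20 seconds come from no_path_retry checker intervals at multipathd's default 5-second polling_interval.

```
# Illustrative multipath.conf fragment (not the customer's actual file).
# no_path_retry 4: if all paths to a LUN fail, queue I/O for up to
# 4 path-checker intervals (4 x 5s polling_interval = 20 seconds)
# before failing I/O back to the caller (here, qemu).
defaults {
    polling_interval 5
    no_path_retry    4
}
```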
Looking for multipath health reports:

$ grep 'FINISH multipath_health' vdsm-noiothread.log
2020-07-22 04:01:11,985+0200 INFO (jsonrpc/1) [vdsm.api] FINISH multipath_health return={u'29842180f3f697a816c9ce900ddb3faa1': {'valid_paths': 0, 'failed_paths': [u'sdj', u'sdk']}} from=::ffff:10.22.0.247,43916, task_id=9b827991-4803-4d78-87ee-05c6dba9eb51 (api:54)
...
2020-07-22 11:59:52,029+0200 INFO (jsonrpc/4) [vdsm.api] FINISH multipath_health return={u'29842180f3f697a816c9ce900ddb3faa1': {'valid_paths': 0, 'failed_paths': [u'sdj', u'sdk']}} from=::ffff:10.22.0.247,43916, task_id=8d2d77ce-9167-4f9b-9c41-43ab091b5bdd (api:54)

One LUN had no paths during the entire time.

Looking in the other log (attachment 1702341 [details]):

$ grep Stopped vdsm-iothread.log
2020-07-22 11:37:21,478+0200 DEBUG (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') event Stopped detail 1 opaque None (vm:5464)

$ grep 'FINISH check' vdsm-iothread.log
2020-07-22 11:31:02,647+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.05) (check:359)
...
2020-07-22 11:58:05,066+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/3e706ec6-90a8-4426-88e1-7749924cdab0/metadata' (rc=0, elapsed=0.03) (check:359)

No errors after the VM got paused. Storage was fine from vdsm's point of view.
Looking for multipath health reports, which are not used for resuming paused VMs:

$ grep 'FINISH multipath_health' vdsm-iothread.log
2020-07-22 11:37:07,328+0200 INFO (jsonrpc/5) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=844683d5-5e9e-431a-a60b-729292d11452 (api:54)
2020-07-22 11:37:22,341+0200 INFO (jsonrpc/0) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=ba9f1ca9-8773-43a7-9456-efe0fd913d68 (api:54)
2020-07-22 11:37:37,364+0200 INFO (jsonrpc/7) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=e0f9b5c3-3206-471e-be65-ba72edc63b57 (api:54)
2020-07-22 11:37:52,433+0200 INFO (jsonrpc/4) [vdsm.api] FINISH multipath_health return={} from=::ffff:10.22.0.247,45028, task_id=fc31d591-3534-4749-930e-39a156d11375 (api:54)

multipathd did not report any issue with any of the paths.

RHV does not promise that a VM will be resumed in all cases after it was paused. This may be what customers expect, but there is no code trying to do this in RHV. We promise to resume VMs only after we detect an issue with storage and then detect that the storage became OK again. Because we check only reads, we cannot detect all issues.

We could change monitoring to also detect write issues by adding a special volume for monitoring; maybe that could help in such a case. We could change the resume behavior to resume the VM after some delay if storage seems to be accessible, but this may end in endless failing resume attempts. With the current version, if the VM was not resumed, the admin can resume it manually.

We don't know why qemu was paused; maybe this was the failing ioctl suggested by Ben in comment 12. We do know why it was not resumed by vdsm.

I think we can move this bug to RHV, since the resume issue looks like a missing feature in RHV.
(In reply to Ben Marzinski from comment #16)
> (In reply to John Ferlan from comment #15)
...
> > 2020-07-22 11:35:46,620+0200 WARN (libvirt/events) [virt.vm]
> > (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error
> > (vm:4005)
>
> Huh. sda isn't part of the multipath device listed here at all. Or is that
> sda on the guest,

Kind of; more correctly, it is the device name in libvirt. Following the vm id (db6db1f6-cb41-40cd-b850-0026e07861a6) shows:

<disk device="lun" sgio="filtered" snapshot="no" type="block">
    <target bus="scsi" dev="sda" />
    <source dev="/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1">
        <seclabel model="dac" relabel="no" type="none" />
    </source>
    <driver cache="none" error_policy="stop" io="native" name="qemu" type="raw" />
    <alias name="ua-e28e932f-242f-465f-9d18-272ecbc8636e" />
    <address bus="0" controller="0" target="0" type="drive" unit="0" />
    <boot order="1" />
</disk>

The relevant device is: /dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1

I don't know why we are using error_policy="stop" for LUNs. We have these bugs about propagating errors to the guest for LUNs, which would avoid the pause:
- bug 1669367
- bug 1314160
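For comparison, libvirt's disk element supports other error policies; a hypothetical variant of the disk above using error_policy="report" (which propagates the I/O error to the guest instead of pausing the VM, along the lines those bugs propose) would look like this. This is a sketch for illustration, not what vdsm generates today.

```xml
<!-- Hypothetical variant: error_policy="report" surfaces I/O errors
     to the guest OS instead of stopping (pausing) the VM -->
<disk device="lun" sgio="filtered" snapshot="no" type="block">
    <target bus="scsi" dev="sda" />
    <source dev="/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1" />
    <driver cache="none" error_policy="report" io="native" name="qemu" type="raw" />
</disk>
```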
Actually this is much simpler: RHV does not have any monitoring for direct LUNs.

The monitoring logs mentioned in comment 17 are for the storage domain:
3e706ec6-90a8-4426-88e1-7749924cdab0

But the failing LUN is:
/dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1

So if the failover was done at the LUN level, vdsm cannot detect the issue, since it does not monitor LUNs. Even if the entire server was down, and vdsm was detecting the issue with the storage domain 3e706ec6-90a8-4426-88e1-7749924cdab0, it would not resume a VM not using this storage domain, since storage monitoring is done only per storage domain.

So the issue is not improving storage monitoring for storage domains, but a completely missing feature - monitoring direct LUNs. This came up a few years ago, but I cannot find the bug now.

Looking at multipathd show, we have interesting info:

# multipathd show maps format "%n %t %x %0 %4"
name dm-st failures path_faults q_timeouts
360014057a53a893c2ed4d8a85b5f3368 active 0 0 0
36001405f40916efb8724270b117540fd active 0 0 0
36001405bc00838f1ab744e88e65f1ecc active 0 0 0
3600140569bf6f34c30a467c89a087d08 active 0 0 0
360014054dce1e8538004223acc63c4ad active 0 0 0
360014055aedb3cbe464471cba9df3fc6 active 0 0 0
3600140571477c50c715418d8dd6495be active 0 0 0
360014058657c2a1941841348f19c1a50 active 0 0 0
36001405a9ac1f1638e9439295979c00a active 0 0 0
36001405bb847e939b364565b42ecf80b active 0 0 0

# multipathd show paths format "%d %T %m %x %0"
dev chk_st multipath failures
sdc ready 360014057a53a893c2ed4d8a85b5f3368 0
sdb ready 36001405f40916efb8724270b117540fd 0
sdd ready 36001405bc00838f1ab744e88e65f1ecc 0
sdf ready 3600140569bf6f34c30a467c89a087d08 0
sde ready 360014054dce1e8538004223acc63c4ad 0
sdg ready 360014055aedb3cbe464471cba9df3fc6 0
sdh ready 3600140571477c50c715418d8dd6495be 0
sdj ready 360014058657c2a1941841348f19c1a50 0
sda ready 36001405a9ac1f1638e9439295979c00a 0
sdi ready 36001405bb847e939b364565b42ecf80b 0

Can we use the failures count (or another counter) for detecting errors? For example, vdsm can do something like:

- On each monitoring cycle, fetch the failure count for all maps.
- If the failure count changed in this cycle, mark the map as "failed".
- Wait for the next multipath check.
- If multipath reports some active paths, clear the "failed" status, and resume paused VMs using this map.

Ben, do you think this can detect a very short outage that is not detectable by looking at failed paths?
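The scheme above could be sketched roughly like this. This is hypothetical helper code, not actual vdsm; it assumes only that `multipathd show maps format "%n %x"` prints a header line followed by one `name failures` pair per map, and tracks counter deltas between monitoring cycles so that a short outage is noticed even after all paths have recovered.

```python
def parse_map_failures(output):
    """Parse 'multipathd show maps format "%n %x"' output into
    {map_name: failure_count}, skipping the header line."""
    counts = {}
    for line in output.splitlines()[1:]:
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts


class MapMonitor:
    """Flag maps whose failure counter increased since the last cycle."""

    def __init__(self):
        self._last = {}
        self.failed = set()

    def update(self, counts):
        # The first cycle only records a baseline; later cycles mark
        # any map whose counter grew as "failed".
        for name, count in counts.items():
            if count > self._last.get(name, count):
                self.failed.add(name)
        self._last = counts

    def clear(self, name):
        # Called once multipathd reports active paths again; the caller
        # would then resume paused VMs using this map.
        self.failed.discard(name)
```

The point of comparing counters rather than path states is that the counter keeps the evidence of a transient failure around until the next monitoring cycle reads it.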
Looking at these different counters:

path failures (show paths %0):
------------------------------
Incremented by the kernel whenever it fails a path, either because IO to it failed, or because multipathd told it to fail the path (because the path checker failed).

map failures (show maps %x):
----------------------------
Incremented whenever the map enters a state with no active paths and queueing disabled. It is possible for multipathd to miss an instance of this if no_path_retry is set to "fail". In that case, the kernel could run out of paths and fail back an IO error while multipathd was in the process of checking a path which had just become active. In this case, multipath would never see the map with no active paths. However, if no_path_retry is set to queue for some number of checks, multipath won't start the timer until it sees that all the paths are down, so there should be no possibility of missing a failure.

map path_faults (show maps %0):
-------------------------------
Incremented when multipathd moves a path to the failed state. Usually close to the paths' failure count, but may be different, due to differences between multipathd and kernel path states.

map q_timeouts (show maps %4):
------------------------------
Incremented whenever multipathd loses all of its paths while no_path_retry is set to a positive number, so multipathd will queue for a limited time.

So yes, this should allow you to see if a failure has happened, even if it's not currently happening. You can also reset the map stats (but not the path failures stats, which are controlled by the kernel) by running

# multipathd reset maps stats

or

# multipathd reset map <multipath_device> stats

However, looking at the messages for this bug, there's no evidence that the multipath devices ever lost all of their paths. So I don't see how IO would have failed here.
We know that qemu had an error on disk sda, which is /dev/mapper/2e1d2f345f5f3e3b56c9ce900ddb3faa1 (see comment 18). We had 2 events:

Event 1: VM paused:

2020-07-22 11:35:46,620+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)

Relevant log from comment 2:

Jul 22 11:35:05 hpb-ws02-11 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 3:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 4:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:05 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is in standby state
Jul 22 11:35:06 hpb-ws02-11 multipathd: 8:48: reinstated
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdh: prio changed from 50 to 1
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdd: prio changed from 50 to 1
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdf: prio changed from 1 to 50
Jul 22 11:35:06 hpb-ws02-11 multipathd: sdb: prio changed from 1 to 50
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:35:06 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:80 1 8:16 1 service-time 0 2 1 8:112 1 8:48 1]
Jul 22 11:35:06 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:06 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:35:16 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is in standby state
Jul 22 11:35:16 hpb-ws02-11 multipathd: 8:112: reinstated
Jul 22 11:35:17 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is up
Jul 22 11:35:17 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:19 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is up
Jul 22 11:35:19 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 3:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 4:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:35:46 hpb-ws02-11 kernel: sd 5:0:0:0: alua: port group 01 state S non-preferred supports tolusna

--- VM paused here ---

Jul 22 11:35:47 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is up
Jul 22 11:35:51 hpb-ws02-11 multipathd: 8:112: reinstated
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdf: prio changed from 50 to 1
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdb: prio changed from 50 to 1
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdh: prio changed from 1 to 50
Jul 22 11:35:51 hpb-ws02-11 multipathd: sdd: prio changed from 1 to 50
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:35:51 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:112 1 8:48 1 service-time 0 2 1 8:80 1 8:16 1]
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:51 hpb-ws02-11 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:35:52 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:35:52 hpb-ws02-11 multipathd: 8:80: reinstated
Jul 22 11:35:54 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:35:54 hpb-ws02-11 multipathd: 8:16: reinstated
Jul 22 11:36:01 hpb-ws02-11 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:36:01 hpb-ws02-11 multipathd: 8:48: reinstated

Event 2: VM started:

2020-07-22 11:46:51,244+0200 INFO (vm/db6db1f6) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6'

VM paused:

2020-07-22 11:51:31,613+0200 WARN (libvirt/events) [virt.vm] (vmId='db6db1f6-cb41-40cd-b850-0026e07861a6') device sda reported I/O error (vm:4005)

Relevant log from comment 4:

Jul 22 11:50:01 hpb-ms02-16 systemd: Started Session 2012 of user root.
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 3:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 5:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:35 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state S non-preferred supports tolusna
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is in standby state
Jul 22 11:50:43 hpb-ms02-16 multipathd: 8:16: reinstated
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdh: prio changed from 50 to 1
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdb: prio changed from 50 to 1
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdd: prio changed from 1 to 50
Jul 22 11:50:43 hpb-ms02-16 multipathd: sdf: prio changed from 1 to 50
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:48 1 8:80 1 service-time 0 2 1 8:112 1 8:16 1]
Jul 22 11:50:43 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is in standby state
Jul 22 11:50:43 hpb-ms02-16 multipathd: 8:112: reinstated
Jul 22 11:50:43 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:43 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state A non-preferred supports tolusna
Jul 22 11:50:51 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is up
Jul 22 11:50:51 hpb-ms02-16 multipathd: 8:48: reinstated
Jul 22 11:50:52 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is up
Jul 22 11:50:52 hpb-ms02-16 multipathd: 8:80: reinstated
Jul 22 11:51:05 hpb-ms02-16 systemd: Started Session c7339 of user root.
Jul 22 11:51:05 hpb-ms02-16 systemd: Started Session c7340 of user root.
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 3:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 3:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 4:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 4:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 5:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 6:0:0:0: Asymmetric access state changed
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 5:0:0:0: alua: port group 01 state S non-preferred supports tolusna
Jul 22 11:51:29 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna

--- VM paused here ---

Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdb - tur checker reports path is up
Jul 22 11:51:38 hpb-ms02-16 multipathd: 8:16: reinstated
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdd: prio changed from 50 to 1
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdf: prio changed from 50 to 1
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdh: prio changed from 1 to 50
Jul 22 11:51:38 hpb-ms02-16 multipathd: sdb: prio changed from 1 to 50
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: updating path groups 0
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: load table [0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:112 1 8:16 1 service-time 0 2 1 8:48 1 8:80 1]
Jul 22 11:51:38 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdh - tur checker reports path is up
Jul 22 11:51:38 hpb-ms02-16 multipathd: 8:112: reinstated
Jul 22 11:51:38 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:38 hpb-ms02-16 kernel: sd 6:0:0:0: alua: port group 02 state A non-preferred supports tolusna
Jul 22 11:51:46 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdd - tur checker reports path is in standby state
Jul 22 11:51:46 hpb-ms02-16 multipathd: 8:48: reinstated
Jul 22 11:51:47 hpb-ms02-16 multipathd: 2e1d2f345f5f3e3b56c9ce900ddb3faa1: sdf - tur checker reports path is in standby state
Jul 22 11:51:47 hpb-ms02-16 multipathd: 8:80: reinstated

I don't fully understand all the messages, but the pattern I see is:

11:35:06 load table
11:35:46 VM pause with I/O error
11:35:51 load table

11:50:43 load table
11:51:31 VM pause with I/O error
11:51:38 load table

I don't have any idea why the map is reloaded; maybe the handover is not really transparent to the host, and the kernel triggers a reload of the map? Maybe reloading the map in this case causes an I/O error while the paths are being reinstated after the reload?

Ben, what do you think?
(In reply to Nir Soffer from comment #22)
> I don't fully understand all the messages but the pattern I see is:
>
> 11:35:06 load table
> 11:35:46 VM pause with I/O error
> 11:35:51 load table
>
> 11:50:43 load table
> 11:51:31 VM pause with I/O error
> 11:51:38 load table
>
> I don't have any idea why the map is reloaded, maybe the handover is not
> really transparent to the host and the kernel triggers a reload of the map?

These reloads are actually unnecessary in this case. The issue is that multipathd sees that a bunch of paths have changed priority, and it is not smart enough to realize that they all changed priority in such a way that they still belong to the same pathgroups, so it does not actually need to reload the table to update the pathgroups.

> Maybe reloading the map in this case cause I/O error while the paths are
> being reinstated after the reload?

The reinstated messages are a red herring. Multipathd is printing them because the path has changed from being ready to being a ghost, or vice versa. Whenever a path changes to an active state, multipathd notifies the kernel that the path is usable. However, as far as the kernel is concerned, both of those states are the same (both mean that the path is active).

When the kernel reloads a multipath device, it first waits for all outstanding IO to the underlying devices to complete. Then it quiesces the device, so no new IO can be started to an underlying device. Next, it switches in the new table (which in this case simply switched the order of the pathgroups, and was unnecessary). Finally, it reissues any queued IO and returns to its normal state. After loading a new table, the kernel always assumes that all the paths are active. Multipathd needs to explicitly tell the kernel if that is not the case, and it always logs messages when that happens.

> Ben, what do you think?

The biggest issue I have with this bug is that there is no evidence that IO ever failed from multipath's point of view.
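For anyone reading along, the `load table` line in the log encodes exactly the pathgroup layout being discussed. A minimal sketch of how to read it (this is an illustrative parser written against this one table string, following my understanding of the dm-multipath table syntax, not a general dm-table parser):

```python
def parse_multipath_table(table: str):
    """Decode a dm-multipath table line like the one multipathd logged.

    Assumed layout: <start> <len> multipath <#features> <features...>
    <#hwhandler words> <handler...> <#pathgroups> <first group>, then per
    group: <selector> <#selector args> <#paths> <#args per path>, followed
    by <major:minor> plus its args for each path.
    """
    t = table.split()
    assert t[2] == "multipath"
    i = 3
    nfeat = int(t[i]); i += 1
    features = t[i:i + nfeat]; i += nfeat
    nhw = int(t[i]); i += 1
    handler = t[i:i + nhw]; i += nhw
    ngroups = int(t[i]); first = int(t[i + 1]); i += 2
    groups = []
    for _ in range(ngroups):
        selector = t[i]; i += 1
        nsel = int(t[i]); i += 1 + nsel          # skip selector args
        npaths = int(t[i]); nargs = int(t[i + 1]); i += 2
        paths = []
        for _ in range(npaths):
            paths.append(t[i]); i += 1 + nargs   # skip per-path args
        groups.append({"selector": selector, "paths": paths})
    return {"features": features, "handler": handler,
            "first_group": first, "groups": groups}

# The table string from the 11:51:38 load table message above.
table = ("0 41943040 multipath 1 queue_if_no_path 1 alua 2 1 "
         "service-time 0 2 1 8:112 1 8:16 1 "
         "service-time 0 2 1 8:48 1 8:80 1")
info = parse_multipath_table(table)
```

Reading the logged table this way shows `queue_if_no_path` enabled and two service-time pathgroups, with the group containing 8:112 and 8:16 listed first after the reload, which matches the point that the reload merely switched the order of the pathgroups.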
Personally, that's the first thing I would like to understand.
Created attachment 1702886 [details]
System tap script to see if/why multipath is failing IO

Are you able to run a systemtap script on the machine while reproducing this? If so, can you please try this one out. You need to run it with

# stap -g check_for_error.stp

since the script has embedded C code to grab the name of the device failing the IO. If you don't know how to set up systemtap, you should just need to run

# yum install systemtap
# stap-prep

to install all the necessary packages. Then run the script as shown above, wait until it prints "starting", and then reproduce the issue. When you are done, you can stop the script with ctrl-c. Please post the script output to the bugzilla. This script will print output when a multipath request ends in an IO error, both for the individual paths and for the entire device.
Marian, can you check comment 24?
Given that dd to the device works during the LUN handover, and that multipath does not appear to have failed any IO request when this was reproduced, I don't think a multipath IO failure is the root cause of this issue. So why does qemu see an error on this device?
Marian, can you answer Ben's question about the "ending" line? See comment 29.
Ben, how can qemu get EIO (comment 34) if multipath did not return an error and has valid paths?
Marian, is there a good reason to use "Enable SCSI Pass-Through" in this setup?

<disk device="lun" sgio="filtered" snapshot="no" type="block">

If there is no such reason, is this reproducible with "Enable SCSI Pass-Through" unchecked? Is it reproducible with virtio-blk (Interface: virtio)?
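For reference, unchecking "Enable SCSI Pass-Through" roughly corresponds to a libvirt disk definition like the following (a hypothetical sketch, not taken from this environment; the source path is a placeholder, and the driver options are illustrative):

```xml
<disk type="block" device="disk" snapshot="no">
  <driver name="qemu" type="raw" cache="none" io="native"/>
  <!-- placeholder; the real multipath device node differs per host -->
  <source dev="/dev/mapper/EXAMPLE-WWID"/>
  <target dev="vda" bus="virtio"/>
</disk>
```

With device="disk" and bus="virtio", qemu presents the LUN as a plain virtio-blk disk and submits regular read/write requests instead of forwarding SCSI commands via SG_IO ioctls, so comparing the two configurations would help isolate whether the EIO comes from the ioctl path.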
(In reply to Nir Soffer from comment #35)
> Ben, how can qemu get EIO (comment 34) if multipath did not return an error
> and have valid paths?

If the IO was done with an SG_IO ioctl, my systemtap script won't catch that. I'll see if I can write a systemtap script to catch failing ioctls.
Created attachment 1712761 [details]
systemtap script to check for ioctl errors to device-mapper devices

This script should hopefully be able to tell us whether the error that was detected came from a SCSI ioctl instead of from regular IO. You run it the same way as the previous systemtap script. Can you run

# stap -g check_ioctl.stp

wait for it to print "starting", and then reproduce the issue?
Nir, anything else to do here?
(In reply to Eyal Shenitzky from comment #41)
> Nir, anything else to do here?

Yes, Ben provided a new test script, and it needs to be tested in the customer setup. We are also waiting for a reply to comment 38.
Marian, please see comment 38 and comment 40. Also see comment 34 from Kevin. If you know how to use the scsi_* tracepoints, it would be a good idea to collect more info in the next run.