Bug 2000720 - Previously detached FC Storage Domain not displayed in Admin Portal 'Import Domain' window the first time
Summary: Previously detached FC Storage Domain not displayed in Admin Portal 'Import Domain' window the first time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.9
Target Release: 4.4.9
Assignee: Nir Soffer
QA Contact: Amit Sharir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-02 18:35 UTC by Gordon Watson
Modified: 2023-01-20 10:31 UTC
CC: 11 users

Fixed In Version: vdsm-4.40.90.1
Doc Type: Bug Fix
Doc Text:
Previously, VDSM reported partial results to the engine, which resulted in a failure to import the storage domain using new LUNs. This happened because VDSM did not wait for the creation of new multipath devices after the discovery of new LUNs. With this release, VDSM waits until multipathd reports that it is ready, and the storage domain is now detected.
Clone Of:
Environment:
Last Closed: 2021-11-16 15:12:47 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
asharir: testing_plan_complete+




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43368 0 None None None 2021-09-02 18:36:52 UTC
Red Hat Knowledge Base (Solution) 6304851 0 None None None 2021-09-02 20:55:25 UTC
Red Hat Product Errata RHBA-2021:4704 0 None None None 2021-11-16 15:12:59 UTC
oVirt gerrit 116467 0 master MERGED multipath: Wait until multipathd is ready 2021-09-10 16:45:35 UTC
oVirt gerrit 116650 0 ovirt-4.4.z MERGED multipath: Wait until multipathd is ready 2021-09-13 07:27:17 UTC

Description Gordon Watson 2021-09-02 18:35:17 UTC
Description of problem:

FC LUNs comprising a previously detached storage domain were re-presented to the hosts. However, when going into 'Import Domain' in the Admin Portal, the storage domain was not displayed in the list.

On the SPM host, all LUNs are recognised by 'multipath' and the volume group is recognised by LVM.

Upon exiting and re-entering the 'Import Domain' window, the storage domain was displayed the second time.


Version-Release number of selected component (if applicable):

RHVM 4.4.5
RHVH 4.4.7
  vdsm-4.40.70.6-1.el8ev.x86_64 


How reproducible:

100% in customer's environment.


Steps to Reproduce:
1. Create a storage domain on 4 Fibre Channel LUNs
2. Remove the storage domain from RHV
3. Unzone the LUNs in the SAN
4. Reboot the host(s)
5. Present the LUNs to the host(s) again
6. Check that 'multipath' sees the LUNs and the paths are active
7. Check that LVM sees the VG and LVs
8. In the RHVM Admin Portal, go into 'Import Domain' and select FC storage
9. The storage domain is not displayed
10. Exit and re-enter the 'Import Domain' window
11. The storage domain is displayed



Actual results:

Storage domain not displayed the first time.


Expected results:

Storage domain to be displayed the first time.


Additional info:

Comment 9 Nir Soffer 2021-09-03 19:45:28 UTC
The attached patch should mitigate this issue, but I don't have an FC setup
to test it.

This build can be used for testing:
https://jenkins.ovirt.org/job/vdsm_standard-check-patch/29461/artifact/build-artifacts.build-py3.el8stream.x86_64/

Comment 11 Amit Sharir 2021-09-05 12:40:15 UTC
I tried doing the flow mentioned in the bug description and these are my findings. 

Version: ovirt-engine-4.4.8.4-0.7.el8ev.noarch

I used an FC-enabled environment for my testing:

1. I used an FC storage domain that was already configured in my environment (the domain was named differently in the NetApp System Manager and in the web UI - it's important to validate the LUN using its ID, which can be found with the command "multipath -ll" in the host terminal).
2. Via the web console UI: remove the storage domain from RHV (put the relevant data center in maintenance mode, detach the storage domain, remove it).
3. Via the NetApp "ONTAP System Manager" UI: put the relevant LUN in "offline mode" (remember the name difference mentioned in step 1).
4. Via the host terminal: check that the relevant LUN is still present using - [root@<host> ~]# multipath -ll | grep <lun id>.
5. Via host terminal: [root@<host> ~]# reboot
6. Wait for the host to come back up + re-enter the web console of RHV.

Time-critical flow:
7. Via the NetApp "ONTAP System Manager" UI: put the relevant LUN in "online mode".
8. Straight away, check that the LUN is visible using - [root@<host> ~]# multipath -ll | grep <lun id>.
9. In addition to step 8, open the import domain window via the web console.
10. About 15 seconds after the LUN with the relevant <lun id> became visible in the host terminal, the window from step 9 refreshed (by itself) and I was able to select the relevant FC storage domain for import.

Please note that in general, when trying to import an FC storage domain in my environment, after entering all the relevant parameters in the import window of the RHV UI, I needed to wait about 15 seconds before available storage domains were shown for import. The window refreshes automatically; I just needed to wait. The 15-second wait applied both to the FC domain I changed using the steps above and to the other FC domains that I did not alter.

Comment 14 Nir Soffer 2021-09-05 14:22:38 UTC
(In reply to Amit Sharir from comment #11)
Reproducing with a storage domain that has a single LUN is unlikely to succeed.

The root cause of this bug is that after the SCSI scan completes, multipathd is busy
handling the new LUNs for ~2 seconds (based on the logs in this case). Vdsm waits on
"udevadm settle", which returns ~2 seconds before multipathd has handled the new
devices, so vdsm does not find the PV with the metadata and the storage
domain is not found during import.

On the second try to import the domain, multipathd already knows the LUNs, so
vdsm finds all the PVs for the storage domain and the import succeeds.
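
For illustration only, below is a minimal, hypothetical sketch of the kind of readiness
wait the fix adds: it polls "multipathd show status" and counts consecutive successful
replies. The actual vdsm patch (gerrit 116467) may use a different command and a
different readiness criterion.

    # Hypothetical sketch only - not the actual vdsm implementation.
    import subprocess
    import time

    def wait_for_multipathd(timeout=5.0, required_ready=2, interval=1.0):
        """Consider multipathd ready after `required_ready` consecutive
        successful "multipathd show status" replies, or give up on timeout."""
        deadline = time.monotonic() + timeout
        ready = 0
        while time.monotonic() < deadline:
            res = subprocess.run(
                ["multipathd", "show", "status"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            # Reset the counter on any failed reply.
            ready = ready + 1 if res.returncode == 0 else 0
            if ready >= required_ready:
                return True
            time.sleep(interval)
        return False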

How to reproduce:


## Setup

1. Create 4 LUNs (size does not matter) on FC storage

2. Map the LUNs to all hosts (a system with one host is good enough)

3. Create new storage domain from the 4 LUNs

4. Create a few empty disks on this storage domain

5. Set vdsm log level to DEBUG

Change the level of these loggers to DEBUG in /etc/vdsm/logger.conf:

- logger_root
- logger_vds
- logger_storage

Restart vdsmd to apply the log change.

Alternatively, change the log level temporarily before each test:

   vdsm-client Host setLogLevel name=root level=DEBUG
   vdsm-client Host setLogLevel name=vds level=DEBUG
   vdsm-client Host setLogLevel name=storage level=DEBUG


## Before test

1. Deactivate, detach and remove (without formatting!) the storage domain

2. On the storage, unmap the LUNs from all hosts

3. Remove the LUNs from all hosts as follows:

a. Find all paths to remove (e.g. sda, sdb, ...):

    multipath -ll <multipath-device>

b. Remove the map:

    multipath -f <multipath-device>

c. Remove the SCSI devices found in step a:

    echo 1 > /sys/block/<device-name>/device/delete

Check with lsblk that the multipath devices and LUNs were removed from
the system.

You may want to automate this step with a script (a rough sketch follows below), since
it needs to be done multiple times as cleanup after every test iteration.

An alternative way to remove the devices is to reboot the host.
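
A rough, hypothetical sketch of such a cleanup script, following the manual steps
above (the WWID list is a placeholder and the path parsing is simplistic; verify
with lsblk afterwards):

    #!/usr/bin/env python3
    # Hypothetical cleanup helper: flush each multipath map and delete its
    # underlying SCSI path devices. Run as root on the host.
    import subprocess

    # Placeholders - replace with the WWIDs of the test LUNs.
    WWIDS = [
        "REPLACE_WITH_LUN_WWID_1",
        "REPLACE_WITH_LUN_WWID_2",
    ]

    def scsi_paths(wwid):
        # Crude parse of "multipath -ll <wwid>" output for sdX device names.
        out = subprocess.run(["multipath", "-ll", wwid],
                             stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        return sorted({w for w in out.split() if w.startswith("sd")})

    for wwid in WWIDS:
        paths = scsi_paths(wwid)
        # Remove the multipath map first.
        subprocess.run(["multipath", "-f", wwid])
        # Then delete each SCSI path device found above.
        for dev in paths:
            with open("/sys/block/%s/device/delete" % dev, "w") as f:
                f.write("1\n")

    # Check that the multipath devices and LUNs were removed.
    subprocess.run(["lsblk"])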

## Test

1. Check with lsblk that the multipath devices and LUNs are *not*
   available in the system.

2. Add log markers

   echo "======== $(date) - start import domain =======" >> /var/log/vdsm/vdsm.log
   logger "======== $(date) - start import domain ======="

3. On the FC storage server, map the LUNs to the hosts

4. On the engine side, open the "Import domain" dialog, select "FC" storage,
   and wait until the import dialog shows the found domains

5. Add log markers

   echo "======== $(date) - end import domain =======" >> /var/log/vdsm/vdsm.log
   logger "======== $(date) - end import domain ======="

6. Collect logs

- /var/log/vdsm/vdsm.log
- /var/log/messages

Expected results (with fix):
The storage domain is found and listed in the import dialog.

Actual results (before fix):
The storage domain is not found.

Repeat the test several times; the issue is likely to be hard to reproduce.
You must run the cleanup step before every test.

Comment 17 Vojtech Juranek 2021-09-07 08:25:55 UTC
I managed to reproduce the issue with an FC storage domain with 4 LUNs. The WWIDs of the LUNs are:

* 3600a09803830447a4f244c4657616f76
* 3600a098038304479363f4c4870455162
* 3600a098038304479363f4c4870455163
* 3600a09803830447a4f244c4657616f77

I was able to reproduce it only with the vdsm log level set to INFO. Out of 5 attempts, 4 reproduced the issue.
Trying the same with the vdsm log level set to DEBUG didn't succeed. Out of 3 attempts, none reproduced the issue.

I was also able to verify Nir's proposed patch (https://gerrit.ovirt.org/c/vdsm/+/116467/2), using the build from https://jenkins.ovirt.org/job/vdsm_standard-check-patch/29478/
Out of 7 attempts, all 7 were successful and the issue never appeared.

Grepping the vdsm logs, there are about 200 records of waiting for multipathd (logging added in gerrit #116467), and the average waiting time was 2.216 seconds.
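
For reference, a hypothetical helper for computing that average from vdsm.log,
assuming each wait logs a line of the form "Waited 2.06 seconds for multipathd
(ready=2)":

    # Hypothetical helper - averages the multipathd wait times logged by the
    # patched vdsm, assuming "Waited N.NN seconds for multipathd" log lines.
    import re
    import sys

    PATTERN = re.compile(r"Waited ([\d.]+) seconds for multipathd")

    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/vdsm/vdsm.log"
    waits = []
    with open(log_path) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                waits.append(float(match.group(1)))

    if waits:
        print("%d waits, average %.3f seconds"
              % (len(waits), sum(waits) / len(waits)))
    else:
        print("no multipathd wait records found")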

Comment 20 Nir Soffer 2021-09-09 13:36:10 UTC
Looking at the reproduction attempt logs from attachment 1821092:

## BZ2000720_vdsm.log

1. Test started

======== Mon Sep  6 17:40:18 IDT 2021 - start import domain =======

2. Getting device list

2021-09-06 17:40:40,145+0300 INFO  (jsonrpc/5) [vdsm.api] START getDeviceList(storageType=2, guids=(), checkStatus=False, refresh=True) from=::ffff:10.46.12.130,53554, flow_id=d5cd880f-346e-4b75-8dca-45b1bf8afdb2, task_id=913c7118-4111-4eeb-b72a-ceaae53fb222 (api:48)

2021-09-06 17:40:40,145+0300 INFO  (jsonrpc/5) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)

3. Scanning FC devices

2021-09-06 17:40:40,588+0300 INFO  (jsonrpc/5) [storage.HBA] Scanning FC devices (hba:60)
2021-09-06 17:40:42,913+0300 INFO  (jsonrpc/5) [storage.HBA] Scanning FC devices: 2.33 seconds (utils:390)

4. Resizing multipath devices

2021-09-06 17:40:44,203+0300 INFO  (jsonrpc/5) [storage.Multipath] Resizing multipath devices (multipath:104)
2021-09-06 17:40:44,216+0300 ERROR (jsonrpc/5) [storage.Multipath] Could not resize device 3600a09803830447a4f244c4657616f77 (multipath:111)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 54, in getDmId
    devStat = os.stat(devlinkPath)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/mapper/3600a09803830447a4f244c4657616f77'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 109, in resize_devices
    _resize_if_needed(guid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 115, in _resize_if_needed
    name = devicemapper.getDmId(guid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 57, in getDmId
    deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `3600a09803830447a4f244c4657616f77`

device 3600a09803830447a4f244c4657616f77 is a new LUN (see comment 17).

Clearly multipathd did not finish handling the new devices.

5. Refresh storage completed

2021-09-06 17:40:44,227+0300 INFO  (jsonrpc/5) [storage.Multipath] Resizing multipath devices: 0.02 seconds (utils:390)
2021-09-06 17:40:44,227+0300 INFO  (jsonrpc/5) [storage.StorageDomainCache] Refreshing storage domain cache: 4.08 seconds (utils:390)

6. Looking for PVs

2021-09-06 17:40:44,427+0300 WARN  (jsonrpc/5) [storage.LVM] Command ['/sbin/lvm', 'pvs', '--config', 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/36001405194892563377425285018012e$|^/dev/mapper/360014053e51335e27c24270a339155d4$|^/dev/mapper/3600140567811936bde246aea5ec3c878$|^/dev/mapper/360014058a90b491aed74b3da520d977f$|^/dev/mapper/36001405992bcec685da4dd99ca7fb58a$|^/dev/mapper/36001405d56aeeed10f84ca49ab027b9b$|^/dev/mapper/36001405f309c88532b44172a5831031a$|^/dev/mapper/3600a098038304479363f4c487045512f$|^/dev/mapper/3600a098038304479363f4c4870455157$|^/dev/mapper/3600a098038304479363f4c4870455158$|^/dev/mapper/3600a098038304479363f4c4870455159$|^/dev/mapper/3600a098038304479363f4c487045515a$|^/dev/mapper/3600a098038304479363f4c4870455161$|^/dev/mapper/3600a098038304479363f4c4870455162$|^/dev/mapper/3600a098038304479363f4c4870455163$|^/dev/mapper/3600a098038304479363f4c4870455164$|^/dev/mapper/3600a09803830447a4f244c4657616f72$|^/dev/mapper/3600a09803830447a4f244c4657616f73$|^/dev/mapper/3600a09803830447a4f244c4657616f74$|^/dev/mapper/3600a09803830447a4f244c4657616f76$|^/dev/mapper/3600a09803830447a4f244c4657616f77$|^/dev/mapper/3600a09803830447a4f244c4657616f78$|^/dev/mapper/3600a09803830447a4f244c4657617469$|^/dev/mapper/3600a09803830447a4f244c465761746a$|^/dev/mapper/3600a09803830447a4f244c465761746b$|^/dev/mapper/3600a09803830447a4f244c465761746e$|^/dev/mapper/3600a09803830447a4f244c4657617672$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count'] succeeded with warnings: ["  WARNING: Couldn't find device with uuid uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv.", '  WARNING: VG 1c548258-c92a-4839-ab05-871f73662571 is missing PV uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv (last written to /dev/mapper/3600a098038304479363f4c4870455163).'] (lvm:355)

2021-09-06 17:40:44,428+0300 ERROR (jsonrpc/5) [storage.LVM] Missing pv: uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv in vg: 1c548258-c92a-4839-ab05-871f73662571 (lvm:614)

A PV in the VG is missing, because vdsm did not wait until multipathd
is ready.

7. Returning partial device list

2021-09-06 17:40:44,762+0300 INFO  (jsonrpc/5) [vdsm.api] FINISH getDeviceList return=...

$ grep 'FINISH getDeviceList return' BZ2000720_vdsm.log | tr ' ' '\n' | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77'
'3600a09803830447a4f244c4657616f77',
'3600a098038304479363f4c4870455162',
'3600a098038304479363f4c4870455163',
'3600a09803830447a4f244c4657616f76',

All the devices are reported:

    {
      "GUID": "3600a09803830447a4f244c4657616f76",
      "capacity": "80530636800",
      "devtype": "FCP",
      "fwrev": "9700",
      "logicalblocksize": "512",
      "pathlist": [],
      "pathstatus": [
        {
          "physdev": "sdc",
          "state": "active",
          "capacity": "80530636800",
          "lun": "2",
          "type": "FCP"
        }
      ],
      "physicalblocksize": "4096",
      "productID": "LUN C-Mode",
      "pvUUID": "ITTaXK-m63S-CVU5-iqhX-DErq-z0p1-i38WE1",
      "pvsize": "80127983616",
      "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaov",
      "vendorID": "NETAPP",
      "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
      "discard_max_bytes": 0,
      "discard_zeroes_data": 0,
      "status": "unknown"
    },

    {
      "GUID": "3600a098038304479363f4c4870455162",
      "capacity": "53687091200",
      "devtype": "FCP",
      "fwrev": "9700",
      "logicalblocksize": "512",
      "pathlist": [],
      "pathstatus": [
        {
          "physdev": "sdd",
          "state": "active",
          "capacity": "53687091200",
          "lun": "3",
          "type": "FCP"
        }
      ],
      "physicalblocksize": "4096",
      "productID": "LUN C-Mode",
      "pvUUID": "HGfG0T-OMrP-CDqw-FDBv-mzhD-tpKw-yit6zd",
      "pvsize": "53284438016",
      "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQb",
      "vendorID": "NETAPP",
      "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
      "discard_max_bytes": 0,
      "discard_zeroes_data": 0,
      "status": "unknown"
    },

    {
      "GUID": "3600a098038304479363f4c4870455163",
      "capacity": "53687091200",
      "devtype": "FCP",
      "fwrev": "9700",
      "logicalblocksize": "512",
      "pathlist": [],
      "pathstatus": [
        {
          "physdev": "sde",
          "state": "active",
          "capacity": "53687091200",
          "lun": "4",
          "type": "FCP"
        }
      ],
      "physicalblocksize": "4096",
      "productID": "LUN C-Mode",
      "pvUUID": "",
      "pvsize": "",
      "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQc",
      "vendorID": "NETAPP",
      "vgUUID": "",
      "discard_max_bytes": 0,
      "discard_zeroes_data": 0,
      "status": "unknown"
    },

    {
      "GUID": "3600a09803830447a4f244c4657616f77",
      "capacity": "53687091200",
      "devtype": "FCP",
      "fwrev": "9700",
      "logicalblocksize": "512",
      "pathlist": [],
      "pathstatus": [
        {
          "physdev": "sdf",
          "state": "active",
          "capacity": "53687091200",
          "lun": "5",
          "type": "FCP"
        }
      ],
      "physicalblocksize": "4096",
      "productID": "LUN C-Mode",
      "pvUUID": "at6vrb-PcSs-HeJz-o3UX-3LpH-Yu6J-EgUwF9",
      "pvsize": "53284438016",
      "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaow",
      "vendorID": "NETAPP",
      "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
      "discard_max_bytes": 0,
      "discard_zeroes_data": 0,
      "status": "unknown"
    },

But LUN 3600a098038304479363f4c4870455163 is not reported as part of the VG:

      "pvUUID": "",
      "pvsize": "",
      "vgUUID": "",


8. Getting VG info fails

2021-09-06 17:40:45,712+0300 INFO  (jsonrpc/6) [vdsm.api] FINISH getVGInfo error=Multipath cannot access physical device(s): "devices=('/dev/mapper/3600a098038304479363f4c4870455163',)" from=::ffff:10.46.12.130,53554, task_id=c0b96937-e813-4b17-b9ad-894bbb8370df (api:52)
2021-09-06 17:40:45,713+0300 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='c0b96937-e813-4b17-b9ad-894bbb8370df') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-139>", line 2, in getVGInfo
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2918, in getVGInfo
    return dict(info=self.__getVGsInfo([vgUUID])[0])
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2865, in __getVGsInfo
    self.__processVGInfos(vgInfos, pathDict, getGuid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2884, in __processVGInfos
    pvInfo = lvm.getPV(pv)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1264, in getPV
    raise se.InaccessiblePhysDev((pvName,))
vdsm.storage.exception.InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=('/dev/mapper/3600a098038304479363f4c4870455163',)"

LUN 3600a098038304479363f4c4870455163 is still not accessible.

9. Test ended:

======== Mon Sep  6 17:40:59 IDT 2021 - end import domain =======


## BZ2000720_messages

1. Test started

Sep  6 17:40:24 oncilla03 root[1135809]: ======== Mon Sep  6 17:40:24 IDT 2021 - start import domain =======

2. SCSI scan

Sep  6 17:40:40 oncilla03 systemd[1]: Started Session c6167 of user root.
Sep  6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: supports implicit TPGS
Sep  6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: device t10.NETAPP   LUN 80Dy6?LHpEQW        port group 3e9 rel port 6
Sep  6 17:40:40 oncilla03 kernel: sd 17:0:0:0: alua: supports implicit TPGS
Sep  6 17:40:40 oncilla03 kernel: sd 18:0:0:0: alua: supports implicit TPGS
Sep  6 17:40:40 oncilla03 kernel: sd 20:0:0:1: alua: supports implicit TPGS
Sep  6 17:40:40 oncilla03 kernel: sd 20:0:0:1: alua: device t10.NETAPP   LUN 80Dy6?LHpEQX        port group 3e9 rel port 6
Sep  6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep  6 17:40:40 oncilla03 kernel: sd 19:0:0:0: alua: supports implicit TPGS
Sep  6 17:40:40 oncilla03 kernel: sd 17:0:0:0: alua: device t10.LIO-ORG 67811936-bde2-46ae-a5ec-3c87822b429c\0 port group 3 rel port 2
...
Sep  6 17:40:42 oncilla03 kernel: sd 5:0:0:11: alua: port group 3e8 state A non-preferred supports TolUsNA
Sep  6 17:40:42 oncilla03 kernel: sd 5:0:2:11: alua: port group 3e9 state N non-preferred supports TolUsNA
Sep  6 17:40:42 oncilla03 kernel: sd 5:0:1:4: alua: port group 3e8 state N non-preferred supports TolUsNA
Sep  6 17:40:42 oncilla03 kernel: sd 5:0:3:5: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep  6 17:40:44 oncilla03 kernel: sd 5:0:3:5: alua: port group 3e9 state A non-preferred supports TolUsNA

3. multipathd processing new devices

$ grep multipathd BZ2000720_messages | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77'
Sep  6 17:40:44 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:48 1]
Sep  6 17:40:44 oncilla03 multipathd[1980]: sdd [8:48]: path added to devmap 3600a098038304479363f4c4870455162
Sep  6 17:40:44 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:32 1]
Sep  6 17:40:44 oncilla03 multipathd[1980]: sdc [8:32]: path added to devmap 3600a09803830447a4f244c4657616f76
Sep  6 17:40:44 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:80 1]
Sep  6 17:40:44 oncilla03 multipathd[1980]: sdf [8:80]: path added to devmap 3600a09803830447a4f244c4657616f77
Sep  6 17:40:44 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:64 1]
Sep  6 17:40:44 oncilla03 multipathd[1980]: sde [8:64]: path added to devmap 3600a098038304479363f4c4870455163
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: performing delayed actions
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:144 1 66:64 1 service-time 0 2 1 8:48 1 8:224 1]
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: performing delayed actions
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:32 1 8:208 1 service-time 0 2 1 65:128 1 66:48 1]
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: performing delayed actions
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:176 1 66:96 1 service-time 0 2 1 8:80 1 65:0 1]
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: performing delayed actions
Sep  6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 66:80 1 65:160 1 service-time 0 2 1 8:64 1 8:240 1]


multipathd seems to be ready at this point. We can see all the devices,
and all of them have 4 paths to storage; for example, 3600a098038304479363f4c4870455163
has:

    66:80, 65:160, 8:64, 8:240

4. Test ended

Sep  6 17:41:08 oncilla03 root[1136734]: ======== Mon Sep  6 17:41:08 IDT 2021 - end import domain =======


So we did reproduce the issue:
- vdsm returned the device list at 17:40:44,762
- multipathd finished processing the devices at 17:40:45

Comment 21 Nir Soffer 2021-09-09 14:16:52 UTC
Looking at the reproduction verification logs from attachment 1821092:

## BZ2000720_verification_vdsm.log


1. Test started

======== Tue Sep  7 01:26:31 IDT 2021 - start import domain =======

2. Getting device list

2021-09-07 01:26:52,593+0300 INFO  (jsonrpc/0) [vdsm.api] START getDeviceList(storageType=2, guids=(), checkStatus=False, refresh=True) from=::ffff:10.46.12.130,34246, flow_id=bae1dfc2-ebcd-4663-bddf-5d53209b58dc, task_id=e7ffc1cd-6468-4151-a90f-7b298fd20fca (api:48)

3. Refreshing storage

2021-09-07 01:26:52,594+0300 INFO  (jsonrpc/0) [storage.storagedomaincache] Refreshing storage domain cache (resize=True) (sdc:80)

4. Scanning FC devices

2021-09-07 01:26:53,696+0300 INFO  (jsonrpc/0) [storage.hba] Scanning FC devices (hba:60)
2021-09-07 01:26:56,570+0300 INFO  (jsonrpc/0) [storage.hba] Scanning FC devices: 2.87 seconds (utils:390)
2021-09-07 01:26:56,571+0300 INFO  (jsonrpc/0) [storage.multipath] Waiting until multipathd is ready (multipath:112)
2021-09-07 01:26:58,632+0300 INFO  (jsonrpc/0) [storage.multipath] Waited 2.06 seconds for multipathd (ready=2) (multipath:134)

5. Resizing multipath devices

2021-09-07 01:26:58,632+0300 INFO  (jsonrpc/0) [storage.multipath] Resizing multipath devices (multipath:187)
2021-09-07 01:26:58,650+0300 INFO  (jsonrpc/0) [storage.multipath] Resizing multipath devices: 0.02 seconds (utils:390)

6. Refresh finished

2021-09-07 01:26:58,650+0300 INFO  (jsonrpc/0) [storage.storagedomaincache] Refreshing storage domain cache: 6.06 seconds (utils:390)

7. Returning device list

2021-09-07 01:26:59,178+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH getDeviceList return=...

$ grep 'FINISH getDeviceList return=' BZ2000720_verification_vdsm.log | tr ' ' '\n' | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77'
'3600a09803830447a4f244c4657616f77',
'3600a098038304479363f4c4870455163',
'3600a09803830447a4f244c4657616f76',
'3600a098038304479363f4c4870455162',

All the devices are included; checking the device details:

  {
    "GUID": "3600a09803830447a4f244c4657616f77",
    "capacity": "53687091200",
    "devtype": "FCP",
    "fwrev": "9700",
    "logicalblocksize": "512",
    "pathlist": [],
    "pathstatus": [
      {
        "physdev": "sdaq",
        "state": "active",
        "capacity": "53687091200",
        "lun": "5",
        "type": "FCP"
      },
      {
        "physdev": "sds",
        "state": "active",
        "capacity": "53687091200",
        "lun": "5",
        "type": "FCP"
      },
      {
        "physdev": "sdae",
        "state": "active",
        "capacity": "53687091200",
        "lun": "5",
        "type": "FCP"
      },
      {
        "physdev": "sdg",
        "state": "active",
        "capacity": "53687091200",
        "lun": "5",
        "type": "FCP"
      }
    ],
    "physicalblocksize": "4096",
    "productID": "LUN C-Mode",
    "pvUUID": "at6vrb-PcSs-HeJz-o3UX-3LpH-Yu6J-EgUwF9",
    "pvsize": "53284438016",
    "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaow",
    "vendorID": "NETAPP",
    "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
    "discard_max_bytes": 0,
    "discard_zeroes_data": 0,
    "status": "unknown"
  },

  {
    "GUID": "3600a098038304479363f4c4870455163",
    "capacity": "53687091200",
    "devtype": "FCP",
    "fwrev": "9700",
    "logicalblocksize": "512",
    "pathlist": [],
    "pathstatus": [
      {
        "physdev": "sdf",
        "state": "active",
        "capacity": "53687091200",
        "lun": "4",
        "type": "FCP"
      },
      {
        "physdev": "sdap",
        "state": "active",
        "capacity": "53687091200",
        "lun": "4",
        "type": "FCP"
      },
      {
        "physdev": "sdr",
        "state": "active",
        "capacity": "53687091200",
        "lun": "4",
        "type": "FCP"
      },
      {
        "physdev": "sdad",
        "state": "active",
        "capacity": "53687091200",
        "lun": "4",
        "type": "FCP"
      }
    ],
    "physicalblocksize": "4096",
    "productID": "LUN C-Mode",
    "pvUUID": "uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv",
    "pvsize": "53284438016",
    "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQc",
    "vendorID": "NETAPP",
    "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
    "discard_max_bytes": 0,
    "discard_zeroes_data": 0,
    "status": "unknown"
  },

  {
    "GUID": "3600a09803830447a4f244c4657616f76",
    "capacity": "80530636800",
    "devtype": "FCP",
    "fwrev": "9700",
    "logicalblocksize": "512",
    "pathlist": [],
    "pathstatus": [
      {
        "physdev": "sdab",
        "state": "active",
        "capacity": "80530636800",
        "lun": "2",
        "type": "FCP"
      },
      {
        "physdev": "sdd",
        "state": "active",
        "capacity": "80530636800",
        "lun": "2",
        "type": "FCP"
      },
      {
        "physdev": "sdan",
        "state": "active",
        "capacity": "80530636800",
        "lun": "2",
        "type": "FCP"
      },
      {
        "physdev": "sdp",
        "state": "active",
        "capacity": "80530636800",
        "lun": "2",
        "type": "FCP"
      }
    ],
    "physicalblocksize": "4096",
    "productID": "LUN C-Mode",
    "pvUUID": "ITTaXK-m63S-CVU5-iqhX-DErq-z0p1-i38WE1",
    "pvsize": "80127983616",
    "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaov",
    "vendorID": "NETAPP",
    "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
    "discard_max_bytes": 0,
    "discard_zeroes_data": 0,
    "status": "unknown"
  },

  {
    "GUID": "3600a098038304479363f4c4870455162",
    "capacity": "53687091200",
    "devtype": "FCP",
    "fwrev": "9700",
    "logicalblocksize": "512",
    "pathlist": [],
    "pathstatus": [
      {
        "physdev": "sdq",
        "state": "active",
        "capacity": "53687091200",
        "lun": "3",
        "type": "FCP"
      },
      {
        "physdev": "sdac",
        "state": "active",
        "capacity": "53687091200",
        "lun": "3",
        "type": "FCP"
      },
      {
        "physdev": "sde",
        "state": "active",
        "capacity": "53687091200",
        "lun": "3",
        "type": "FCP"
      },
      {
        "physdev": "sdao",
        "state": "active",
        "capacity": "53687091200",
        "lun": "3",
        "type": "FCP"
      }
    ],
    "physicalblocksize": "4096",
    "productID": "LUN C-Mode",
    "pvUUID": "HGfG0T-OMrP-CDqw-FDBv-mzhD-tpKw-yit6zd",
    "pvsize": "53284438016",
    "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQb",
    "vendorID": "NETAPP",
    "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP",
    "discard_max_bytes": 0,
    "discard_zeroes_data": 0,
    "status": "unknown"
  }

All devices are reported with all paths and LVM metadata.

8. Test ended

======== Tue Sep  7 01:27:12 IDT 2021 - end import domain =======


## BZ2000720_verification_messages

1. SCSI scan

Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: Direct-Access     NETAPP   LUN C-Mode       9700 PQ: 0 ANSI: 5
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: alua: supports implicit TPGS
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: alua: device t10.NETAPP   LUN 80DzO$LFWaov        port group 3e8 rel port 8
Sep  7 01:26:53 oncilla02 kernel: sd 5:0:0:2: Attached scsi generic sg4 type 0
Sep  7 01:26:53 oncilla02 kernel: sd 5:0:0:2: [sdd] 157286400 512-byte logical blocks: (80.5 GB/75.0 GiB)
Sep  7 01:26:53 oncilla02 kernel: sd 5:0:0:2: alua: transition timeout set to 120 seconds
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: Direct-Access     NETAPP   LUN C-Mode       9700 PQ: 0 ANSI: 5
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: alua: supports implicit TPGS
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: alua: device t10.NETAPP   LUN 80Dy6?LHpEQb        port group 3e8 rel port 8
Sep  7 01:26:53 oncilla02 kernel: sd 5:0:0:3: Attached scsi generic sg5 type 0
Sep  7 01:26:53 oncilla02 kernel: scsi 5:0:0:4: Direct-Access     NETAPP   LUN C-Mode       9700 PQ: 0 ANSI: 5
...
Sep  7 01:26:54 oncilla02 kernel: sd 5:0:3:5: [sdaq] Attached SCSI disk
Sep  7 01:26:54 oncilla02 kernel: sd 5:0:3:4: [sdap] Attached SCSI disk
Sep  7 01:26:54 oncilla02 kernel: sd 5:0:1:3: [sdq] Attached SCSI disk
Sep  7 01:26:54 oncilla02 kernel: sd 5:0:2:3: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep  7 01:26:54 oncilla02 kernel: sd 5:0:3:2: [sdan] Attached SCSI disk
Sep  7 01:26:55 oncilla02 systemd[1]: Started Session c886 of user root.
Sep  7 01:26:56 oncilla02 kernel: sd 5:0:2:4: alua: port group 3e9 state A non-preferred supports TolUsNA
Sep  7 01:26:56 oncilla02 kernel: sd 5:0:3:3: [sdao] Attached SCSI disk
Sep  7 01:26:56 oncilla02 kernel: sd 5:0:0:2: [sdd] Attached SCSI disk
Sep  7 01:26:56 oncilla02 kernel: sd 5:0:2:4: [sdad] Attached SCSI disk

2. multipathd updating new devices

$ grep multipathd BZ2000720_verification_messages | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77'
Sep  7 01:26:56 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:64 1]
Sep  7 01:26:56 oncilla02 multipathd[2187]: sde [8:64]: path added to devmap 3600a098038304479363f4c4870455162
Sep  7 01:26:56 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 65:16 1]
Sep  7 01:26:56 oncilla02 multipathd[2187]: sdr [65:16]: path added to devmap 3600a098038304479363f4c4870455163
Sep  7 01:26:56 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 65:176 1]
Sep  7 01:26:56 oncilla02 multipathd[2187]: sdab [65:176]: path added to devmap 3600a09803830447a4f244c4657616f76
Sep  7 01:26:56 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:96 1]
Sep  7 01:26:56 oncilla02 multipathd[2187]: sdg [8:96]: path added to devmap 3600a09803830447a4f244c4657616f77
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: performing delayed actions
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:192 1 65:0 1 service-time 0 2 1 8:64 1 66:128 1]
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: performing delayed actions
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:16 1 65:208 1 service-time 0 2 1 8:80 1 66:144 1]
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: performing delayed actions
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 66:112 1 8:48 1 service-time 0 2 1 65:176 1 8:240 1]
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: performing delayed actions
Sep  7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:224 1 65:32 1 service-time 0 2 1 8:96 1 66:160 1]


Events summary:

2021-09-07 01:26:53,696 Starting FC scan
2021-09-07 01:26:56,570 Scan ended
2021-09-07 01:26:56,571 Waiting until multipathd is ready
2021-09-07 01:26:58,632 Wait ended
2021-09-07 01:26:58,632 Resizing multipath devices
2021-09-07 01:26:58,650 Resize ended
2021-09-07 01:26:59,178 Returning device list

Comment 22 Nir Soffer 2021-09-10 17:09:54 UTC
Fix merged for 4.5 and backport posted for 4.4.z.

Comment 27 Amit Sharir 2021-10-07 15:15:46 UTC
Version: 
vdsm-4.40.90.1-1.el8ev.x86_64
ovirt-engine-4.4.9-0.6.el8ev.noarch

Verification steps: 
I did the full flow described in https://bugzilla.redhat.com/show_bug.cgi?id=2000720#c14 with a few minor adjustments:

#In the setup stage:
4. Create a few empty disks on this storage domain -> I used a storage domain with multiple disks (some empty and some with data).
5. Set vdsm log level to DEBUG -> following Vojtech's recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=2000720#c26, I didn't do this step.

#In the before test stage
3. Remove the LUNs from all hosts as explained -> at this stage I was required to remove the LUNs using the "multipath -f" command. The command succeeded for 3 of the 5 relevant LUNs. For the other 2 LUNs I got the message "<multipath-device_id>: map in use". After consulting with Vojtech, I understood this shouldn't affect the verification of this bug.

Verification conclusions:
The full flow was completed with no errors. After completing it, I was able to enter the 'Import Domain' window and see the storage domains displayed (it worked the first time I entered the window).
Bug verified.

Thank you @vjuranek for helping me out with the bug verification flow and the consultations along the way.

Comment 31 errata-xmlrpc 2021-11-16 15:12:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.9]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4704

