Description of problem:

FC LUNs comprising a previously-detached storage domain were re-presented to the hosts. However, when opening 'Import Domain' in the Admin Portal, the storage domain was not displayed in the list. On the SPM host, all LUNs are recognised by 'multipath' and the volume group is recognised by LVM. Upon exiting and re-entering the 'Import Domain' window, the storage domain was displayed the second time.

Version-Release number of selected component (if applicable):
RHVM 4.4.5
RHVH 4.4.7
vdsm-4.40.70.6-1.el8ev.x86_64

How reproducible:
100% in customer's environment.

Steps to Reproduce:
1. Create a storage domain on 4 FibreChannel LUNs
2. Remove the storage domain from RHV
3. Unzone the LUNs in the SAN
4. Reboot the host(s)
5. Present the LUNs to the host(s) again
6. Check that 'multipath' sees the LUNs and the paths are active
7. Check that LVM sees the VG and LVs
8. In the RHVM Admin Portal, open 'Import Domain' and select FC storage
9. The storage domain is not displayed
10. Exit and re-enter the 'Import Domain' window
11. The storage domain is displayed

Actual results:
Storage domain not displayed the first time.

Expected results:
Storage domain displayed the first time.

Additional info:
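As a quick way to script the checks in steps 6-7, here is a minimal sketch that extracts the multipath map WWIDs from `multipath -ll` output and compares them against the expected LUN WWIDs. The regex assumes NAA type-3 WWIDs (33 hex characters starting with "3"), as used by the LUNs in this report; adjust it for other storage.

```python
import re
import subprocess

def wwids_in_multipath(output):
    """Return the set of multipath map WWIDs found in `multipath -ll` output.

    Map lines start with the WWID, e.g.:
    3600a09803830447a4f244c4657616f76 dm-3 NETAPP,LUN C-Mode
    """
    return set(re.findall(r"^(3[0-9a-f]{32})", output, re.MULTILINE))

# On a host, something like:
# out = subprocess.run(["multipath", "-ll"], capture_output=True, text=True).stdout
# missing = set(expected_wwids) - wwids_in_multipath(out)
```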
The attached patch should mitigate this issue, but I don't have an FC setup to test it. This build can be used for testing: https://jenkins.ovirt.org/job/vdsm_standard-check-patch/29461/artifact/build-artifacts.build-py3.el8stream.x86_64/
I tried the flow mentioned in the bug description; these are my findings.

Version: ovirt-engine-4.4.8.4-0.7.el8ev.noarch

I used an FC-enabled environment in my testing:
1. I used an FC storage domain that was already configured in my environment. The domain was named differently in the NetApp System Manager and in the web UI, so it is important to identify the LUN by its ID, which can be found with "multipath -ll" on the host.
2. Via the web console UI: remove the storage domain from RHV (put the relevant data center in maintenance mode, detach the storage domain, remove it).
3. Via the NetApp "ONTAP System Manager" UI: put the relevant LUN in "offline mode" (remember the name differences mentioned in step 1).
4. On the host, check that the relevant LUN is still present: [root@<host> ~]# multipath -ll | grep <lun id>
5. On the host: [root@<host> ~]# reboot
6. Wait for the host to come back up and re-enter the RHV web console.

Time-critical flow:
7. Via the NetApp "ONTAP System Manager" UI: put the relevant LUN in "online mode".
8. Straight away, check that the LUN is visible: [root@<host> ~]# multipath -ll | grep <lun id>
9. In addition to step 8, open the import domain window via the web console.
10. About 15 seconds after the storage domain of the relevant <lun id> became visible on the host, the window from step 9 refreshed by itself and I was able to select the relevant FC storage domain for import.

Please note that in general, when trying to import an FC storage domain in my environment, after entering all the relevant parameters in the import window I needed to wait about 15 seconds before available storage domains were shown. The window refreshes automatically; I just needed to wait. The 15-second wait applied both to the FC domain I altered in the steps above and to the other FC domains that I did not touch.
(In reply to Amit Sharir from comment #11)

Reproducing with a storage domain that has a single LUN is unlikely to succeed.

The root cause of this bug is that after the SCSI scan completes, multipathd is busy handling the new LUNs for ~2 seconds (based on the logs in this case). Vdsm waits on "udevadm settle", which returns ~2 seconds before multipathd has handled the new devices, so vdsm does not find the PV with the metadata and the storage domain is not found during import. On the second try to import the domain, multipathd already knows the LUNs, so vdsm finds all the PVs for the storage domain and the import succeeds.

How to reproduce:

## Setup

1. Create 4 LUNs (size does not matter) on FC storage
2. Map the LUNs to all hosts (a system with one host is good enough)
3. Create a new storage domain from the 4 LUNs
4. Create a few empty disks on this storage domain
5. Set the vdsm log level to DEBUG

Change these loggers' level to DEBUG in /etc/vdsm/logger.conf:
- logger_root
- logger_vds
- logger_storage

Restart vdsmd to apply the log change.

Alternatively, change the log level temporarily before each test:

    vdsm-client Host setLogLevel name=root level=DEBUG
    vdsm-client Host setLogLevel name=vds level=DEBUG
    vdsm-client Host setLogLevel name=storage level=DEBUG

## Before test

1. Deactivate, detach and remove (without formatting!) the storage domain
2. On the storage, unmap the LUNs from all hosts
3. Remove the LUNs from all hosts as explained here:
   a. Find all paths to remove (e.g. sda, sdb, ...):
      multipath -ll <multipath-device>
   b. Remove the map:
      multipath -f <multipath-device>
   c. Remove the SCSI devices found in step a:
      echo 1 > /sys/block/<device-name>/device/delete
4. Check with lsblk that the multipath devices and LUNs were removed from the system.

You may want to automate this step with a script, since we will need to run it multiple times as cleanup after every test iteration. An alternative way to remove the devices is to reboot the host.

## Test

1.
Check with lsblk that the multipath devices and LUNs are *not* available in the system.
2. Add log markers:
   echo "======== $(date) - start import domain =======" >> /var/log/vdsm/vdsm.log
   logger "======== $(date) - start import domain ======="
3. On the FC storage server, map the LUNs to the hosts
4. On the engine side, open the "Import domain" dialog, select "FC" storage, and wait until the import dialog shows the found domains
5. Add log markers:
   echo "======== $(date) - end import domain =======" >> /var/log/vdsm/vdsm.log
   logger "======== $(date) - end import domain ======="
6. Collect logs:
   - /var/log/vdsm/vdsm.log
   - /var/log/messages

Expected results (with fix):
The storage domain is found and listed in the import dialog.

Actual results (before fix):
The storage domain is not found.

Repeat the test several times; the issue may be hard to reproduce. You must run the cleanup step before every test.
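The cleanup in the "Before test" section is easy to get wrong by hand; here is a minimal sketch that builds the removal commands for one multipath device. The device names are illustrative; run the generated commands as root and verify the result with lsblk afterwards.

```python
def cleanup_commands(wwid, path_devices):
    """Build the commands that remove a multipath map and its SCSI path devices.

    wwid         - the multipath map name, e.g. "3600a09803830447a4f244c4657616f76"
    path_devices - kernel names of the paths taken from `multipath -ll <wwid>`,
                   e.g. ["sdc", "sdp", "sdab", "sdan"]
    """
    cmds = [f"multipath -f {wwid}"]  # remove the map first
    # then delete each underlying SCSI device
    cmds += [f"echo 1 > /sys/block/{dev}/device/delete" for dev in path_devices]
    return cmds
```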
I managed to reproduce the issue with an FC storage domain with 4 LUNs. The WWIDs of the LUNs are:
* 3600a09803830447a4f244c4657616f76
* 3600a098038304479363f4c4870455162
* 3600a098038304479363f4c4870455163
* 3600a09803830447a4f244c4657616f77

I was able to reproduce the issue only with the vdsm log level on INFO: out of 5 attempts, 4 reproduced it. Trying the same with the vdsm log level on DEBUG didn't succeed: out of 3 attempts, none reproduced the issue.

I was also able to verify Nir's proposed patch (https://gerrit.ovirt.org/c/vdsm/+/116467/2), using the build from https://jenkins.ovirt.org/job/vdsm_standard-check-patch/29478/ - out of 7 attempts, all 7 were successful and the issue never appeared.

Grepping the vdsm logs, there are about 200 records about waiting for multipath (added in gerrit #116467) and the average waiting time was 2.216 seconds.
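The average wait time above can be computed directly from the vdsm log; a throwaway sketch, assuming log lines of the form "Waited 2.06 seconds for multipathd (ready=2)" as emitted by the patched build (the exact message format may differ between builds):

```python
import re

def multipath_wait_times(log_text):
    """Extract the wait durations (in seconds) from vdsm log lines such as:
    ... [storage.multipath] Waited 2.06 seconds for multipathd (ready=2) ...
    """
    return [float(s) for s in
            re.findall(r"Waited ([0-9.]+) seconds for multipathd", log_text)]

# times = multipath_wait_times(open("/var/log/vdsm/vdsm.log").read())
# avg = sum(times) / len(times)
```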
Looking in the reproduction attempt logs from attachment 1821092 [details]:

## BZ2000720_vdsm.log

1. Test started

======== Mon Sep 6 17:40:18 IDT 2021 - start import domain =======

2. Getting device list

2021-09-06 17:40:40,145+0300 INFO (jsonrpc/5) [vdsm.api] START getDeviceList(storageType=2, guids=(), checkStatus=False, refresh=True) from=::ffff:10.46.12.130,53554, flow_id=d5cd880f-346e-4b75-8dca-45b1bf8afdb2, task_id=913c7118-4111-4eeb-b72a-ceaae53fb222 (api:48)
2021-09-06 17:40:40,145+0300 INFO (jsonrpc/5) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)

3. Scanning FC devices

2021-09-06 17:40:40,588+0300 INFO (jsonrpc/5) [storage.HBA] Scanning FC devices (hba:60)
2021-09-06 17:40:42,913+0300 INFO (jsonrpc/5) [storage.HBA] Scanning FC devices: 2.33 seconds (utils:390)

4. Resizing multipath devices

2021-09-06 17:40:44,203+0300 INFO (jsonrpc/5) [storage.Multipath] Resizing multipath devices (multipath:104)
2021-09-06 17:40:44,216+0300 ERROR (jsonrpc/5) [storage.Multipath] Could not resize device 3600a09803830447a4f244c4657616f77 (multipath:111)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 54, in getDmId
    devStat = os.stat(devlinkPath)
FileNotFoundError: [Errno 2] No such file or directory: '/dev/mapper/3600a09803830447a4f244c4657616f77'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 109, in resize_devices
    _resize_if_needed(guid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/multipath.py", line 115, in _resize_if_needed
    name = devicemapper.getDmId(guid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 57, in getDmId
    deviceMultipathName)
OSError: [Errno 19] Could not find dm device named `3600a09803830447a4f244c4657616f77`

Device 3600a09803830447a4f244c4657616f77 is a new LUN (see comment 17).
Clearly multipathd did not finish handling the new devices. 5. Refresh storage completed 2021-09-06 17:40:44,227+0300 INFO (jsonrpc/5) [storage.Multipath] Resizing multipath devices: 0.02 seconds (utils:390) 2021-09-06 17:40:44,227+0300 INFO (jsonrpc/5) [storage.StorageDomainCache] Refreshing storage domain cache: 4.08 seconds (uti 6. Looking for PVs 2021-09-06 17:40:44,427+0300 WARN (jsonrpc/5) [storage.LVM] Command ['/sbin/lvm', 'pvs', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/36001405194892563377425285018012e$|^/dev/mapper/360014053e51335e27c24270a339155d4$|^/dev/mapper/3600140567811936bde246aea5ec3c878$|^/dev/mapper/360014058a90b491aed74b3da520d977f$|^/dev/mapper/36001405992bcec685da4dd99ca7fb58a$|^/dev/mapper/36001405d56aeeed10f84ca49ab027b9b$|^/dev/mapper/36001405f309c88532b44172a5831031a$|^/dev/mapper/3600a098038304479363f4c487045512f$|^/dev/mapper/3600a098038304479363f4c4870455157$|^/dev/mapper/3600a098038304479363f4c4870455158$|^/dev/mapper/3600a098038304479363f4c4870455159$|^/dev/mapper/3600a098038304479363f4c487045515a$|^/dev/mapper/3600a098038304479363f4c4870455161$|^/dev/mapper/3600a098038304479363f4c4870455162$|^/dev/mapper/3600a098038304479363f4c4870455163$|^/dev/mapper/3600a098038304479363f4c4870455164$|^/dev/mapper/3600a09803830447a4f244c4657616f72$|^/dev/mapper/3600a09803830447a4f244c4657616f73$|^/dev/mapper/3600a09803830447a4f244c4657616f74$|^/dev/mapper/3600a09803830447a4f244c4657616f76$|^/dev/mapper/3600a09803830447a4f244c4657616f77$|^/dev/mapper/3600a09803830447a4f244c4657616f78$|^/dev/mapper/3600a09803830447a4f244c4657617469$|^/dev/mapper/3600a09803830447a4f244c465761746a$|^/dev/mapper/3600a09803830447a4f244c465761746b$|^/dev/mapper/3600a09803830447a4f244c465761746e$|^/dev/mapper/3600a09803830447a4f244c4657617672$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 
wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }', '--noheadings', '--units', 'b', '--nosuffix', '--separator', '|', '--ignoreskippedcluster', '-o', 'uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size,mda_used_count'] succeeded with warnings: [" WARNING: Couldn't find device with uuid uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv.", ' WARNING: VG 1c548258-c92a-4839-ab05-871f73662571 is missing PV uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv (last written to /dev/mapper/3600a098038304479363f4c4870455163).'] (lvm:355) 2021-09-06 17:40:44,428+0300 ERROR (jsonrpc/5) [storage.LVM] Missing pv: uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv in vg: 1c548258-c92a-4839-ab05-871f73662571 (lvm:614) A PV in the VG is missing, because vdsm did not wait until multipathd is ready. 7. Returning partial device list 2021-09-06 17:40:44,762+0300 INFO (jsonrpc/5) [vdsm.api] FINISH getDeviceList return=... $ grep 'FINISH getDeviceList return' BZ2000720_vdsm.log | tr ' ' '\n' | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77' '3600a09803830447a4f244c4657616f77', '3600a098038304479363f4c4870455162', '3600a098038304479363f4c4870455163', '3600a09803830447a4f244c4657616f76', All the device are reported: { "GUID": "3600a09803830447a4f244c4657616f76", "capacity": "80530636800", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdc", "state": "active", "capacity": "80530636800", "lun": "2", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "ITTaXK-m63S-CVU5-iqhX-DErq-z0p1-i38WE1", "pvsize": "80127983616", "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaov", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a098038304479363f4c4870455162", "capacity": 
"53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdd", "state": "active", "capacity": "53687091200", "lun": "3", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "HGfG0T-OMrP-CDqw-FDBv-mzhD-tpKw-yit6zd", "pvsize": "53284438016", "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQb", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a098038304479363f4c4870455163", "capacity": "53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sde", "state": "active", "capacity": "53687091200", "lun": "4", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "", "pvsize": "", "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQc", "vendorID": "NETAPP", "vgUUID": "", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a09803830447a4f244c4657616f77", "capacity": "53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdf", "state": "active", "capacity": "53687091200", "lun": "5", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "at6vrb-PcSs-HeJz-o3UX-3LpH-Yu6J-EgUwF9", "pvsize": "53284438016", "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaow", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, But LUN 3600a098038304479363f4c4870455162 is not reported as part of the VG: "pvUUID": "", "pvsize": "", "vgUUID": "", 8. 
Getting VG info fails 2021-09-06 17:40:45,712+0300 INFO (jsonrpc/6) [vdsm.api] FINISH getVGInfo error=Multipath cannot access physical device(s): "devices=('/dev/mapper/3600a098038304479363f4c4870455163',)" from=::ffff:10.46.12.130,53554, task_id=c0b96937-e813-4b17-b9ad-894bbb8370df (api:52) 2021-09-06 17:40:45,713+0300 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='c0b96937-e813-4b17-b9ad-894bbb8370df') Unexpected error (task:877) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run return fn(*args, **kargs) File "<decorator-gen-139>", line 2, in getVGInfo File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method ret = func(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2918, in getVGInfo return dict(info=self.__getVGsInfo([vgUUID])[0]) File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2865, in __getVGsInfo self.__processVGInfos(vgInfos, pathDict, getGuid) File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2884, in __processVGInfos pvInfo = lvm.getPV(pv) File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1264, in getPV raise se.InaccessiblePhysDev((pvName,)) vdsm.storage.exception.InaccessiblePhysDev: Multipath cannot access physical device(s): "devices=('/dev/mapper/3600a098038304479363f4c4870455163',)" LUN 3600a098038304479363f4c4870455163 is still not accessible. 9. Test ended: ======== Mon Sep 6 17:40:59 IDT 2021 - end import domain ======= ## BZ2000720_messages 1. Test started Sep 6 17:40:24 oncilla03 root[1135809]: ======== Mon Sep 6 17:40:24 IDT 2021 - start import domain ======= 2. SCSI scan Sep 6 17:40:40 oncilla03 systemd[1]: Started Session c6167 of user root. 
Sep 6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: supports implicit TPGS Sep 6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: device t10.NETAPP LUN 80Dy6?LHpEQW port group 3e9 rel port 6 Sep 6 17:40:40 oncilla03 kernel: sd 17:0:0:0: alua: supports implicit TPGS Sep 6 17:40:40 oncilla03 kernel: sd 18:0:0:0: alua: supports implicit TPGS Sep 6 17:40:40 oncilla03 kernel: sd 20:0:0:1: alua: supports implicit TPGS Sep 6 17:40:40 oncilla03 kernel: sd 20:0:0:1: alua: device t10.NETAPP LUN 80Dy6?LHpEQX port group 3e9 rel port 6 Sep 6 17:40:40 oncilla03 kernel: sd 20:0:0:0: alua: port group 3e9 state A non-preferred supports TolUsNA Sep 6 17:40:40 oncilla03 kernel: sd 19:0:0:0: alua: supports implicit TPGS Sep 6 17:40:40 oncilla03 kernel: sd 17:0:0:0: alua: device t10.LIO-ORG 67811936-bde2-46ae-a5ec-3c87822b429c\0 port group 3 rel port 2 ... Sep 6 17:40:42 oncilla03 kernel: sd 5:0:0:11: alua: port group 3e8 state A non-preferred supports TolUsNA Sep 6 17:40:42 oncilla03 kernel: sd 5:0:2:11: alua: port group 3e9 state N non-preferred supports TolUsNA Sep 6 17:40:42 oncilla03 kernel: sd 5:0:1:4: alua: port group 3e8 state N non-preferred supports TolUsNA Sep 6 17:40:42 oncilla03 kernel: sd 5:0:3:5: alua: port group 3e9 state A non-preferred supports TolUsNA Sep 6 17:40:44 oncilla03 kernel: sd 5:0:3:5: alua: port group 3e9 state A non-preferred supports TolUsNA 3. 
Multipathd processing new devices

$ grep multipathd BZ2000720_messages | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77'

Sep 6 17:40:44 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:48 1]
Sep 6 17:40:44 oncilla03 multipathd[1980]: sdd [8:48]: path added to devmap 3600a098038304479363f4c4870455162
Sep 6 17:40:44 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:32 1]
Sep 6 17:40:44 oncilla03 multipathd[1980]: sdc [8:32]: path added to devmap 3600a09803830447a4f244c4657616f76
Sep 6 17:40:44 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:80 1]
Sep 6 17:40:44 oncilla03 multipathd[1980]: sdf [8:80]: path added to devmap 3600a09803830447a4f244c4657616f77
Sep 6 17:40:44 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:64 1]
Sep 6 17:40:44 oncilla03 multipathd[1980]: sde [8:64]: path added to devmap 3600a098038304479363f4c4870455163
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: performing delayed actions
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:144 1 66:64 1 service-time 0 2 1 8:48 1 8:224 1]
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: performing delayed actions
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50
queue_if_no_path 1 alua 2 1 service-time 0 2 1 8:32 1 8:208 1 service-time 0 2 1 65:128 1 66:48 1]
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: performing delayed actions
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:176 1 66:96 1 service-time 0 2 1 8:80 1 65:0 1]
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: performing delayed actions
Sep 6 17:40:45 oncilla03 multipathd[1980]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 66:80 1 65:160 1 service-time 0 2 1 8:64 1 8:240 1]

multipathd seems to be ready at this point. We can see all the devices, and all of them have 4 paths to storage; for example, 3600a098038304479363f4c4870455163 has: 66:80, 65:160, 8:64, 8:240.

4. Test ended

Sep 6 17:41:08 oncilla03 root[1136734]: ======== Mon Sep 6 17:41:08 IDT 2021 - end import domain =======

So we did reproduce the issue:
- vdsm returns the device list at 17:40:44,762
- multipathd finished processing the devices at 17:40:45
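Counting paths per map, as done by eye above, can be scripted by parsing the major:minor entries inside each multipathd "load table" line (the line layout is assumed from the /var/log/messages excerpts in this comment; later lines for the same WWID override earlier ones, so the final table wins):

```python
import re

def paths_per_map(messages_text):
    """Count the path devices (major:minor entries) in each multipathd
    'load table' line from /var/log/messages. Returns {wwid: path_count}
    using the last table seen for each WWID."""
    counts = {}
    for wwid, table in re.findall(
            r"multipathd\[\d+\]: (\S+): load table \[([^\]]*)\]", messages_text):
        counts[wwid] = len(re.findall(r"\b\d+:\d+\b", table))
    return counts
```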
Looking in the reproduction verification logs from attachment 1821092 [details]:

## BZ2000720_verification_vdsm.log

1. Test started

======== Tue Sep 7 01:26:31 IDT 2021 - start import domain =======

2. Getting device list

2021-09-07 01:26:52,593+0300 INFO (jsonrpc/0) [vdsm.api] START getDeviceList(storageType=2, guids=(), checkStatus=False, refresh=True) from=::ffff:10.46.12.130,34246, flow_id=bae1dfc2-ebcd-4663-bddf-5d53209b58dc, task_id=e7ffc1cd-6468-4151-a90f-7b298fd20fca (api:48)

3. Refreshing storage

2021-09-07 01:26:52,594+0300 INFO (jsonrpc/0) [storage.storagedomaincache] Refreshing storage domain cache (resize=True) (sdc:80)

4. Scanning FC devices

2021-09-07 01:26:53,696+0300 INFO (jsonrpc/0) [storage.hba] Scanning FC devices (hba:60)
2021-09-07 01:26:56,570+0300 INFO (jsonrpc/0) [storage.hba] Scanning FC devices: 2.87 seconds (utils:390)

5. Waiting for multipathd

2021-09-07 01:26:56,571+0300 INFO (jsonrpc/0) [storage.multipath] Waiting until multipathd is ready (multipath:112)
2021-09-07 01:26:58,632+0300 INFO (jsonrpc/0) [storage.multipath] Waited 2.06 seconds for multipathd (ready=2) (multipath:134)

6. Resizing multipath devices

2021-09-07 01:26:58,632+0300 INFO (jsonrpc/0) [storage.multipath] Resizing multipath devices (multipath:187)
2021-09-07 01:26:58,650+0300 INFO (jsonrpc/0) [storage.multipath] Resizing multipath devices: 0.02 seconds (utils:390)

7. Refresh finished

2021-09-07 01:26:58,650+0300 INFO (jsonrpc/0) [storage.storagedomaincache] Refreshing storage domain cache: 6.06 seconds (utils:390)

8. Returning device list

2021-09-07 01:26:59,178+0300 INFO (jsonrpc/0) [vdsm.api] FINISH getDeviceList return=...
$ grep 'FINISH getDeviceList return=' BZ2000720_verification_vdsm.log | tr ' ' '\n' | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77' '3600a09803830447a4f244c4657616f77', '3600a098038304479363f4c4870455163', '3600a09803830447a4f244c4657616f76', '3600a098038304479363f4c4870455162', All the device included, checking device details: { "GUID": "3600a09803830447a4f244c4657616f77", "capacity": "53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdaq", "state": "active", "capacity": "53687091200", "lun": "5", "type": "FCP" }, { "physdev": "sds", "state": "active", "capacity": "53687091200", "lun": "5", "type": "FCP" }, { "physdev": "sdae", "state": "active", "capacity": "53687091200", "lun": "5", "type": "FCP" }, { "physdev": "sdg", "state": "active", "capacity": "53687091200", "lun": "5", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "at6vrb-PcSs-HeJz-o3UX-3LpH-Yu6J-EgUwF9", "pvsize": "53284438016", "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaow", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a098038304479363f4c4870455163", "capacity": "53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdf", "state": "active", "capacity": "53687091200", "lun": "4", "type": "FCP" }, { "physdev": "sdap", "state": "active", "capacity": "53687091200", "lun": "4", "type": "FCP" }, { "physdev": "sdr", "state": "active", "capacity": "53687091200", "lun": "4", "type": "FCP" }, { "physdev": "sdad", "state": "active", "capacity": "53687091200", "lun": "4", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "uEN4mu-QKj9-7C5M-Tfvj-C2pi-dJIe-pmzANv", "pvsize": "53284438016", "serial": 
"SNETAPP_LUN_C-Mode_80Dy6_LHpEQc", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a09803830447a4f244c4657616f76", "capacity": "80530636800", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdab", "state": "active", "capacity": "80530636800", "lun": "2", "type": "FCP" }, { "physdev": "sdd", "state": "active", "capacity": "80530636800", "lun": "2", "type": "FCP" }, { "physdev": "sdan", "state": "active", "capacity": "80530636800", "lun": "2", "type": "FCP" }, { "physdev": "sdp", "state": "active", "capacity": "80530636800", "lun": "2", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "ITTaXK-m63S-CVU5-iqhX-DErq-z0p1-i38WE1", "pvsize": "80127983616", "serial": "SNETAPP_LUN_C-Mode_80DzO_LFWaov", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" }, { "GUID": "3600a098038304479363f4c4870455162", "capacity": "53687091200", "devtype": "FCP", "fwrev": "9700", "logicalblocksize": "512", "pathlist": [], "pathstatus": [ { "physdev": "sdq", "state": "active", "capacity": "53687091200", "lun": "3", "type": "FCP" }, { "physdev": "sdac", "state": "active", "capacity": "53687091200", "lun": "3", "type": "FCP" }, { "physdev": "sde", "state": "active", "capacity": "53687091200", "lun": "3", "type": "FCP" }, { "physdev": "sdao", "state": "active", "capacity": "53687091200", "lun": "3", "type": "FCP" } ], "physicalblocksize": "4096", "productID": "LUN C-Mode", "pvUUID": "HGfG0T-OMrP-CDqw-FDBv-mzhD-tpKw-yit6zd", "pvsize": "53284438016", "serial": "SNETAPP_LUN_C-Mode_80Dy6_LHpEQb", "vendorID": "NETAPP", "vgUUID": "IHB192-YlSp-wF2b-4Wrx-nGwD-sc6J-McWxLP", "discard_max_bytes": 0, "discard_zeroes_data": 0, "status": "unknown" } All devices reported with all paths and LVM metadata. 9. 
Test ended ======== Tue Sep 7 01:27:12 IDT 2021 - end import domain ======= ## BZ2000720_verification_messages 1. SCSI scan Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: Direct-Access NETAPP LUN C-Mode 9700 PQ: 0 ANSI: 5 Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: alua: supports implicit TPGS Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:2: alua: device t10.NETAPP LUN 80DzO$LFWaov port group 3e8 rel port 8 Sep 7 01:26:53 oncilla02 kernel: sd 5:0:0:2: Attached scsi generic sg4 type 0 Sep 7 01:26:53 oncilla02 kernel: sd 5:0:0:2: [sdd] 157286400 512-byte logical blocks: (80.5 GB/75.0 GiB) Sep 7 01:26:53 oncilla02 kernel: sd 5:0:0:2: alua: transition timeout set to 120 seconds Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: Direct-Access NETAPP LUN C-Mode 9700 PQ: 0 ANSI: 5 Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: alua: supports implicit TPGS Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:3: alua: device t10.NETAPP LUN 80Dy6?LHpEQb port group 3e8 rel port 8 Sep 7 01:26:53 oncilla02 kernel: sd 5:0:0:3: Attached scsi generic sg5 type 0 Sep 7 01:26:53 oncilla02 kernel: scsi 5:0:0:4: Direct-Access NETAPP LUN C-Mode 9700 PQ: 0 ANSI: 5 ... Sep 7 01:26:54 oncilla02 kernel: sd 5:0:3:5: [sdaq] Attached SCSI disk Sep 7 01:26:54 oncilla02 kernel: sd 5:0:3:4: [sdap] Attached SCSI disk Sep 7 01:26:54 oncilla02 kernel: sd 5:0:1:3: [sdq] Attached SCSI disk Sep 7 01:26:54 oncilla02 kernel: sd 5:0:2:3: alua: port group 3e9 state A non-preferred supports TolUsNA Sep 7 01:26:54 oncilla02 kernel: sd 5:0:3:2: [sdan] Attached SCSI disk Sep 7 01:26:55 oncilla02 systemd[1]: Started Session c886 of user root. Sep 7 01:26:56 oncilla02 kernel: sd 5:0:2:4: alua: port group 3e9 state A non-preferred supports TolUsNA Sep 7 01:26:56 oncilla02 kernel: sd 5:0:3:3: [sdao] Attached SCSI disk Sep 7 01:26:56 oncilla02 kernel: sd 5:0:0:2: [sdd] Attached SCSI disk Sep 7 01:26:56 oncilla02 kernel: sd 5:0:2:4: [sdad] Attached SCSI disk 2. 
multipathd updating new devices $ grep multipathd BZ2000720_verification_messages | egrep '3600a09803830447a4f244c4657616f76|3600a098038304479363f4c4870455162|3600a098038304479363f4c4870455163|3600a09803830447a4f244c4657616f77' Sep 7 01:26:56 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:64 1] Sep 7 01:26:56 oncilla02 multipathd[2187]: sde [8:64]: path added to devmap 3600a098038304479363f4c4870455162 Sep 7 01:26:56 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 65:16 1] Sep 7 01:26:56 oncilla02 multipathd[2187]: sdr [65:16]: path added to devmap 3600a098038304479363f4c4870455163 Sep 7 01:26:56 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 65:176 1] Sep 7 01:26:56 oncilla02 multipathd[2187]: sdab [65:176]: path added to devmap 3600a09803830447a4f244c4657616f76 Sep 7 01:26:56 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 1 1 service-time 0 1 1 8:96 1] Sep 7 01:26:56 oncilla02 multipathd[2187]: sdg [8:96]: path added to devmap 3600a09803830447a4f244c4657616f77 Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: performing delayed actions Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455162: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:192 1 65:0 1 service-time 0 2 1 8:64 1 66:128 1] Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: performing delayed actions Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a098038304479363f4c4870455163: load table [0 104857600 multipath 3 pg_init_retries 
50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:16 1 65:208 1 service-time 0 2 1 8:80 1 66:144 1]
Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: performing delayed actions
Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f76: load table [0 157286400 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 66:112 1 8:48 1 service-time 0 2 1 65:176 1 8:240 1]
Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: performing delayed actions
Sep 7 01:26:58 oncilla02 multipathd[2187]: 3600a09803830447a4f244c4657616f77: load table [0 104857600 multipath 3 pg_init_retries 50 queue_if_no_path 1 alua 2 1 service-time 0 2 1 65:224 1 65:32 1 service-time 0 2 1 8:96 1 66:160 1]

Events summary:

2021-09-07 01:26:53,696  Starting FC scan
2021-09-07 01:26:56,570  Scan ended
2021-09-07 01:26:56,571  Waiting until multipathd is ready
2021-09-07 01:26:58,632  Wait ended
2021-09-07 01:26:58,632  Resizing multipath devices
2021-09-07 01:26:58,650  Resize ended
2021-09-07 01:26:59,178  Returning device list
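The "Waiting until multipathd is ready" step in the events summary is what the fix adds: not trusting "udevadm settle" alone, but waiting until the multipath maps are actually usable. A simplified, hypothetical sketch of that idea, polling until /dev/mapper/<wwid> exists for every expected WWID (this is not vdsm's actual implementation; see gerrit 116467 for the real change):

```python
import os
import time

def wait_for_maps(wwids, timeout=10.0, interval=0.5,
                  exists=os.path.exists, clock=time.monotonic, sleep=time.sleep):
    """Wait until /dev/mapper/<wwid> exists for every WWID, or raise TimeoutError.

    exists/clock/sleep are injectable so the loop can be tested without real
    devices; on a host the defaults are used.
    """
    deadline = clock() + timeout
    pending = set(wwids)
    while True:
        pending = {w for w in pending if not exists("/dev/mapper/" + w)}
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError("multipath devices not ready: %s" % sorted(pending))
        sleep(interval)
```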
Fix merged for 4.5 and backport posted for 4.4.z.
Version:
vdsm-4.40.90.1-1.el8ev.x86_64
ovirt-engine-4.4.9-0.6.el8ev.noarch

Verification steps:
I performed the full flow described in https://bugzilla.redhat.com/show_bug.cgi?id=2000720#c14 with a few minor adjustments:

# In the setup stage:
4. Create a few empty disks on this storage domain -> I used a storage domain with multiple disks (some empty and some with data).
5. Set the vdsm log level to DEBUG -> following Vojtech's recommendation in https://bugzilla.redhat.com/show_bug.cgi?id=2000720#c26, I skipped this step.

# In the before-test stage:
3. Remove the LUNs from all hosts as explained -> in this stage I had to remove the LUNs using the "multipath -f" command. The command succeeded for 3 of the 5 relevant LUNs; for the other 2 I got the message "<multipath-device_id>: map in use". After consulting with Vojtech, I understood this should not affect the verification of this bug.

Verification conclusions:
The full flow was completed with no errors. Afterwards, I was able to enter the 'Import Domain' window and see the storage domains displayed, on the first time I entered the window.

Bug verified. Thank you @vjuranek for helping me out with the bug verification flow and the consultations along the way.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.4.9]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4704