Created attachment 1696743 [details]
Engine log

Description of problem:
Data Centers become 'Non Responsive' (2 DCs; one is the Hosted-Engine DC), Storage Domains become 'Inactive' (1st DC with 1 SD, 2nd DC with 11 SDs), and the hosts that belong to these DCs have no SPM.
Each SD is 5TB over an FC connection. While running 'multipath -ll', all the LUNs are running and active.

It happens during mass VM creation (~2300 VMs)
*** and while adding another host to the 'Management' DC (maybe that was the trigger) ***
*** The mass VM creation runs over other DCs with iSCSI Storage Domains ***

Version-Release number of selected component (if applicable):
redhat-release-eula-8.2-1.0.el8.x86_64
rhv-release-4.4.1-1-001.noarch
vdsm-4.40.17-1.el8ev.x86_64

How reproducible:

Steps to Reproduce:
1. Create ~250 VMs on the 'L1_Group' DC (20 in parallel)
2. Create ~1050 VMs on the 'L0_Group_1' DC (50 in parallel)
3. Create ~960 VMs on the 'L0_Group_2' DC (50 in parallel) - stopped during creation
4.
Add another host to the 'Management' DC

Actual results:
DCs are in 'Non Responsive' status

Expected results:
No errors; DCs are 'Up'

Additional info:
(Displayed every 5 sec)

2020-06-11 05:20:00,132-04 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-97) [1a4c6cdc] Error during ValidateFailure.: java.lang.NullPointerException
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.isSystemSuperUser(CommandBase.java:1161)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.validate(StorageHandlingCommandBase.java:756)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.internalValidateInTransaction(CommandBase.java:819)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:140)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:157)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:798)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:412)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:442)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:424)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:369)
	at jdk.internal.reflect.GeneratedMethodAccessor163.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.jboss.as.ee.0.GA-redhat-00004//org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
	at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
Created attachment 1696870 [details]
vdsm.log showing exception during Master Domain reconstruction attempt
Changing severity as scale-out testing is blocked by this issue. Pasting a snip of the exception; see the attachment for more details.

f3-447f-8b79-96aef9b0f174 backend from type NoneType instance 0x7f58cadd69d0 to type StoragePoolDiskBackend instance 0x7f58506ea188 (sp:156)
2020-06-11 12:12:08,700-0400 INFO (jsonrpc/0) [storage.StorageDomainCache] Looking up domain 2e7d7e21-917b-4202-b4bc-ba5e6fffa558 (sdc:171)
2020-06-11 12:12:08,703-0400 DEBUG (jsonrpc/0) [common.commands] /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"] hints="none" } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name --select 'vg_name = 2e7d7e21-917b-4202-b4bc-ba5e6fffa558' (cwd None) (commands:153)
2020-06-11 12:12:08,813-0400 DEBUG (jsonrpc/0) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:185)
2020-06-11 12:12:08,819-0400 INFO (jsonrpc/0) [storage.StorageDomainCache] Looking up domain 2e7d7e21-917b-4202-b4bc-ba5e6fffa558: 0.12 seconds (utils:390)
2020-06-11 12:12:08,819-0400 INFO (jsonrpc/0) [vdsm.api] FINISH reconstructMaster error=Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) from=::ffff:10.1.41.200,46196, flow_id=3bdb5fe4, task_id=cf16330f-89fe-4b18-9579-aec9ed40a5e1 (api:52)
2020-06-11 12:12:08,820-0400 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='cf16330f-89fe-4b18-9579-aec9ed40a5e1') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-87>", line 2, in reconstructMaster
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1924, in reconstructMaster
    self.validateSdUUID(masterDom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 309, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',)
2020-06-11 12:12:08,820-0400 INFO (jsonrpc/0) [storage.TaskManager.Task] (Task='cf16330f-89fe-4b18-9579-aec9ed40a5e1') aborting: Task is aborted: "value=Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) abortedcode=358" (task:1190)
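The traceback above follows vdsm's storage domain cache lookup path: a cache miss falls through to the unfetched-domain lookup, and when no backend can locate the VG the code raises StorageDomainDoesNotExist (error 358). The sketch below is a hypothetical, heavily simplified Python model of that flow (names echo sdc.py, but this is not the real implementation):

```python
class StorageDomainDoesNotExist(Exception):
    """Stands in for vdsm.storage.exception.StorageDomainDoesNotExist."""


class StorageDomainCache:
    """Toy model of the sdc.py produce()/find flow seen in the traceback."""

    def __init__(self, find_methods):
        self._cache = {}                    # sdUUID -> domain object
        self._find_methods = find_methods   # e.g. block (LVM VG scan), file

    def produce(self, sd_uuid):
        # Cache hit: return the cached domain immediately.
        if sd_uuid in self._cache:
            return self._cache[sd_uuid]
        # Cache miss: ask each find method in turn to locate the domain.
        for find in self._find_methods:
            domain = find(sd_uuid)
            if domain is not None:
                self._cache[sd_uuid] = domain
                return domain
        # No backend can see the domain -> the error 358 in the logs.
        raise StorageDomainDoesNotExist(sd_uuid)
```

In this model, a find method that sees nothing (for example because LVM's device filter rejected every LUN) reproduces the "Storage domain does not exist" failure even though the multipath devices themselves are healthy.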
Looking at the NPE, here is the flow:
1. AutorecoveryManager is executed periodically as a cron job, which means there is no logged-in user for that flow.
2. AutorecoveryManager executes ConnectDomainToStorageCommand.
3. ConnectDomainToStorageCommand inherits the validate() method from StorageHandlingCommandBase, where we check whether the current user is a super user -> but, as mentioned above, for scheduled jobs there is no user logged in.

So my suggestion is to change the condition in https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/StorageHandlingCommandBase.java#L756 to:

if (sd != null && sd.isHostedEngineStorage() && getCurrentUser() != null && !isSystemSuperUser()) {
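The key point of the proposed fix is ordering: the `getCurrentUser() != null` guard must come before `isSystemSuperUser()`, so that short-circuit evaluation skips the super-user check entirely when a scheduled job runs with no session. A hypothetical Python sketch of the same guard (my own class and field names, not the engine's):

```python
class Command:
    """Toy model of the proposed validate() guard (hypothetical names)."""

    def __init__(self, sd, current_user):
        self.sd = sd                      # dict or None, like the domain object
        self.current_user = current_user  # None for scheduled/cron flows

    def is_system_super_user(self):
        # In the real Java code this dereferences the session user and
        # throws the NPE when no user is logged in; modeled as an exception.
        if self.current_user is None:
            raise AttributeError("no logged-in user")  # the NPE analogue
        return self.current_user.get("super", False)

    def validate(self):
        sd = self.sd
        # Proposed ordering: `current_user is not None` short-circuits
        # before is_system_super_user() can blow up on scheduled flows.
        if (sd is not None and sd.get("hosted_engine")
                and self.current_user is not None
                and not self.is_system_super_user()):
            return False  # non-admin touching the HE domain: reject
        return True
```

With the user-null guard in place, the AutorecoveryManager flow (no user) validates cleanly, while an interactive non-super user is still rejected for the hosted-engine domain.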
> While running 'multipath -ll' all the LUNs are running and active

Could you check whether lvm sees the VGs? E.g. by running:

/sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"] hints="none" } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
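For context on what that `filter=` argument does: LVM applies accept (`a|…|`) and reject (`r|…|`) regexes in order, so only the whitelisted /dev/mapper/WWID paths are scanned and everything else is rejected by the trailing `r|.*|`. A small, hypothetical sketch (my own helper names; a simplified single-alternation variant of the vdsm-generated filter above) of how such a filter behaves:

```python
import re


def build_lvm_filter(wwids):
    """Build a simplified LVM-style filter: accept only the given
    /dev/mapper/<wwid> paths, reject everything else (like "r|.*|")."""
    accept = "^/dev/mapper/(%s)$" % "|".join(map(re.escape, wwids))
    return ["a|%s|" % accept, "r|.*|"]


def filter_accepts(filter_rules, device):
    """Apply accept/reject rules in order, first match wins, as LVM does."""
    for rule in filter_rules:
        kind, pattern = rule[0], rule[2:-1]  # strip "a|...|" / "r|...|"
        if re.search(pattern, device):
            return kind == "a"
    return True  # LVM's default when no rule matches
```

This is why a correct multipath view and an empty `vgs` output can coexist: if the running host's filter does not accept the LUN paths, LVM simply never scans them, and vdsm concludes the storage domain's VG does not exist.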
[root@f01-h31-000-r620 ~]# /sbin/lvm vgs
  VG                  #PV #LV #SN Attr   VSize    VFree
  vg_f01-h31-000-r620   1   2   0 wz--n- <930.50g    0
[root@f01-h31-000-r620 ~]#

(This host has 12 LUNs)
comment #3?
(In reply to David Vaanunu from comment #5)
> [root@f01-h31-000-r620 ~]# /sbin/lvm vgs
>   VG                  #PV #LV #SN Attr   VSize    VFree
>   vg_f01-h31-000-r620   1   2   0 wz--n- <930.50g    0
> [root@f01-h31-000-r620 ~]#
>
> (This host has 12 LUNs)

Is there any LVM filter? Can you run the whole command with the filter setup (comment #4)? Can you also provide the output of `multipath -ll` and `lsblk`?
There is no LVM filter.

[root@f01-h31-000-r620 ~]# /sbin/lvm vgs --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"] hints="none" } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
[root@f01-h31-000-r620 ~]#

lsblk & multipath output: https://drive.google.com/open?id=1eT0sdQy5lHfph_XuPr6bEUCihN9SOG3A
## Updating the bug to provide more details

There are several DCs in this environment. The investigation should focus on the 'Management' DC, specifically host f01-h33, which runs the HE VM and the DWH VM.

Due to size, logs from this time frame are found here: https://drive.google.com/drive/folders/1yoLAiglGJVTgOY41MwCpNWLbZjdkABoN?usp=sharing

Storage Domains which are down are found in both L0_Group_0 (12 FC) and Management (1 FC).

The following focuses on the Management DC host f01-h33, which from the time frame of 06:30 on June 10th failed to access its storage domain.

# /var/log/sanlock.log - sanlock showing an open error:

2020-05-27 10:20:06 279 [1383]: s1:r2 resource a131cd1c-4973-43c7-ac8b-b6987cc0356d:fa50be97-26d5-407f-87ad-bd43a309285b:/dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases:108003328
2020-05-27 10:21:52 386 [1384]: s1:r3 resource a131cd1c-4973-43c7-ac8b-b6987cc0356d:SDM:/dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases:1048576
2020-06-10 06:30:58 1196131 [1385]: open error -1 /dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases
2020-06-10 06:30:58 1196131 [1385]: r3 release async open error -2

# Looking at /var/log/ovirt-hosted-engine-ha/broker.log, we see that the issue started at:

Thread-3::INFO::2020-06-10 06:30:55,092::mem_free::51::mem_free.MemFree::(action) memFree: 91337
Thread-2::INFO::2020-06-10 06:30:55,134::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
StatusStorageThread::ERROR::2020-06-10 06:30:57,453::storage_broker::173::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
StatusStorageThread::ERROR::2020-06-10 06:30:57,456::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 175, in get_raw_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
StatusStorageThread::ERROR::2020-06-10 06:30:57,457::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker

### Checking /var/log/messages from before the error and through the time period of this failure, in order to see whether there are any signs of environment issues related to the FC storage connection: we do not see errors related to removal of disks or underlying LUNs.
We do see the following errors:

###
Jun 10 06:30:56 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore ERROR Unable to extract HEVM OVF
Jun 10 06:30:56 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Started Session c12163 of user root.
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to read metadata from /var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 175, in get_raw_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker
Jun 10 06:30:57 f01-h33-000-r620 platform-python[2985892]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Jun 10 06:30:57 f01-h33-000-r620 systemd[3511231]: Starting D-Bus User Message Bus Socket.
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jun 10 06:30:57 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Unhandled monitoring loop exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 369, in get_volume_path
    volumeID=vol_uuid
  File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 303, in _call
    method, kwargs, resp.error.code, str(resp.error))
vdsm.client.ServerError: Command Image.prepare with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'a131cd1c-4973-43c7-ac8b-b6987cc0356d', 'imageID': '418e700d-6035-4571-bf48-732ef6caa861', 'volumeID': '36d178aa-0cb9-4655-ac84-097d753161f7'} failed:
(code=358, message=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
    self._monitoring_loop()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 500, in _monitoring_loop
    self._config.refresh_vm_conf()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 201, in refresh_vm_conf
    self._vm_config.download()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_ovf.py", line 107, in download
    self.legacy_vm_conf.download()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 31, in download
    content = self._get_file_content_from_shared_storage()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 136, in _get_file_content_from_shared_storage
    config_volume_path = self._get_config_volume_path()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 71, in _get_config_volume_path
    conf_vol_uuid
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 380, in get_volume_path
    err=e)
RuntimeError: Path to volume 36d178aa-0cb9-4655-ac84-097d753161f7 not found in /run/vdsm/storage. Caused by: Command Image.prepare with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'a131cd1c-4973-43c7-ac8b-b6987cc0356d', 'imageID': '418e700d-6035-4571-bf48-732ef6caa861', 'volumeID': '36d178aa-0cb9-4655-ac84-097d753161f7'} failed:
(code=358, message=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',))
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Stopping /run/user/0 mount wrapper...
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Removed slice User Slice of UID 0.
Vdsm log shows:

2020-06-10 06:30:55,663-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Invalidating storage domain cache (sdc:74)
2020-06-10 06:30:55,663-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Clearing storage domain cache (sdc:199)
2020-06-10 06:30:55,664-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)
2020-06-10 06:30:55,664-0400 INFO (jsonrpc/3) [storage.ISCSI] Scanning iSCSI devices (iscsi:442)
2020-06-10 06:30:56,522-0400 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,540-0400 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,543-0400 INFO (jsonrpc/6) [vdsm.api] START prepareImage(sdUUID='a131cd1c-4973-43c7-ac8b-b6987cc0356d', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='418e700d-6035-4571-bf48-732ef6caa861', leafUUID='36d178aa-0cb9-4655-ac84-097d753161f7', allowIllegal=False) from=::ffff:127.0.0.1,59178, task_id=9305e400-47f5-4a37-9eb9-8725af89a77e (api:48)
2020-06-10 06:30:56,559-0400 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,562-0400 INFO (jsonrpc/7) [vdsm.api] START repoStats(domains=['a131cd1c-4973-43c7-ac8b-b6987cc0356d']) from=::ffff:127.0.0.1,59176, task_id=114efcb6-7653-45c5-8a95-4cabb43edd6c (api:48)
2020-06-10 06:30:56,562-0400 INFO (jsonrpc/7) [vdsm.api] FINISH repoStats return={'a131cd1c-4973-43c7-ac8b-b6987cc0356d': {'code': 358, 'lastCheck': '3.8', 'delay': '0.000345154', 'valid': False, 'version': -1, 'acquired': False, 'actual': True}} from=::ffff:127.0.0.1,59176, task_id=114efcb6-7653-45c5-8a95-4cabb43edd6c (api:54)
2020-06-10 06:30:56,562-0400 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,629-0400 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (mailbox:760)
2020-06-10 06:30:56,841-0400 INFO (jsonrpc/3) [storage.ISCSI] Scanning iSCSI devices: 1.18 seconds (utils:390)
2020-06-10 06:30:56,841-0400 INFO (jsonrpc/3) [storage.HBA] Scanning FC devices (hba:60)
2020-06-10 06:30:56,984-0400 INFO (jsonrpc/3) [storage.HBA] Scanning FC devices: 0.14 seconds (utils:390)
2020-06-10 06:30:57,038-0400 INFO (jsonrpc/3) [storage.Multipath] Resizing multipath devices (multipath:104)
2020-06-10 06:30:57,042-0400 INFO (jsonrpc/3) [storage.Multipath] Resizing multipath devices: 0.00 seconds (utils:390)
2020-06-10 06:30:57,043-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Refreshing storage domain cache: 1.38 seconds (utils:390)
2020-06-10 06:30:57,043-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d (sdc:171)
2020-06-10 06:30:57,345-0400 INFO (jsonrpc/3) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d: 0.30 seconds (utils:390)
2020-06-10 06:30:57,345-0400 INFO (jsonrpc/3) [vdsm.api] FINISH connectStoragePool error=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' from=::ffff:10.1.41.200,48602, flow_id=6a6f50b5, task_id=f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef (api:52)
2020-06-10 06:30:57,345-0400 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef') Unexpected error (task:880)
2020-06-10 06:30:57,346-0400 INFO (jsonrpc/3) [storage.TaskManager.Task] (Task='f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef') aborting: Task is aborted: "value=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' abortedcode=304" (task:1190)
2020-06-10 06:30:57,346-0400 ERROR (jsonrpc/3) [storage.Dispatcher] FINISH connectStoragePool error=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' (dispatcher:83)
2020-06-10 06:30:57,346-0400 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call StoragePool.connect failed (error 304) in 1.69 seconds (__init__:312)
2020-06-10 06:30:57,349-0400 INFO (jsonrpc/6) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d (sdc:171)
2020-06-10 06:30:57,699-0400 INFO (jsonrpc/6) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d: 0.35 seconds (utils:390)
2020-06-10 06:30:57,699-0400 INFO (jsonrpc/6) [vdsm.api] FINISH prepareImage error=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) from=::ffff:127.0.0.1,59178, task_id=9305e400-47f5-4a37-9eb9-8725af89a77e (api:52)
2020-06-10 06:30:57,699-0400 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='9305e400-47f5-4a37-9eb9-8725af89a77e') Unexpected error (task:880)
2020-06-10 06:30:57,700-0400 INFO (jsonrpc/6) [storage.TaskManager.Task] (Task='9305e400-47f5-4a37-9eb9-8725af89a77e') aborting: Task is aborted: "value=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) abortedcode=358" (task:1190)
2020-06-10 06:30:57,700-0400 ERROR (jsonrpc/6) [storage.Dispatcher] FINISH prepareImage error=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) (dispatcher:83)
2020-06-10 06:30:57,701-0400 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Image.prepare failed (error 358) in 1.16 seconds (__init__:312)
2020-06-10 06:30:57,786-0400 INFO (jsonrpc/2) [vdsm.api] START getAllTasksStatuses(spUUID=None, options=None) from=::ffff:10.1.41.200,48602, flow_id=70d12005, task_id=d3ef338c-97b6-426d-8652-896e0f75c005 (api:48)
1. Checked NetApp event logs; no storage-related issues were found.

2. Checked other RHV environments using FC and they were fine; no storage-related issues were found.
Log files: https://drive.google.com/open?id=1eT0sdQy5lHfph_XuPr6bEUCihN9SOG3A (engine.log, vdsm.log, messages log)

3. Management cluster: became inactive at 2020-06-10 06:30:46,951-04

2020-06-10 06:30:46,951-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM f01-h33-000-r620.rdu2.scalelab.redhat.com command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
2020-06-10 06:30:46,952-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [] Command 'SpmStatusVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, SpmStatusVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2020-06-10 06:30:47,005-04 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [6a6f50b5] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Invalid status on Data Center Management. Setting Data Center status to Non Responsive (On host f01-h33-000-r620.rdu2.scalelab.redhat.com, Error: Cannot access master storage domain).
2020-06-10 06:30:48,846-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [6a6f50b5] Command 'ConnectStoragePoolVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, ConnectStoragePoolVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', vdsId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07', masterVersion='1'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d'
2020-06-10 06:30:58,886-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe
2020-06-10 06:30:58,908-04 WARN [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4) [70d12005] Could not stop spm of pool 'eaa2be8d-416f-4f64-b30a-1e70f58c5a07' on vds '5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb': VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:30:58,908-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4) [70d12005] FINISH, SpmStopVDSCommand, return: , log id: 58e0d6e5
2020-06-10 06:30:58,909-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [7a9b37c0] Unable to GetStats: VDSNetworkException: VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:30:58,911-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'SpmStatusVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, SpmStatusVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:31:19,559-04 INFO [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-35) [] Could not connect host 'f01-h33-000-r620.rdu2.scalelab.redhat.com' to pool 'Management', as the master domain is in inactive/unknown status - not failing the operation

4. L0_Group_0 cluster: became inactive at 2020-06-10 13:49:44,907-04

2020-06-10 13:49:44,907-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (EE-ManagedThreadFactory-engine-Thread-12) [6097cc9e] START, ReconstructMasterVDSCommand(HostName = f01-h15-000-r620.rdu2.scalelab.redhat.com, ReconstructMasterVDSCommandParameters:{hostId='4b58778a-0638-47ee-ab22-359a4b684708', vdsSpmId='2', storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storagePoolName='L0_Group_0', masterDomainId='cc8dc275-e0af-493d-932e-6af775fe96ff', masterVersion='3', domainsList='[
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='6c2aa70d-678d-4fe9-b372-c6ab605af0e9'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='3fbd08a8-18dc-4624-9f22-f55cce2fb386'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='16dd0382-c10a-4f8c-9d9a-40b4b51536e7'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='a60b8cde-99c5-4516-ad65-8221e8d35802'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='fa9355a4-61d0-4827-9a83-6d00f045459b'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='2e7d7e21-917b-4202-b4bc-ba5e6fffa558'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='d9b502e0-87ab-4062-bba4-483a435a5f8b'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cbee9a1a-8048-4507-b123-6d212d407948'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='feae8e8a-1a3c-4743-b993-5e7c501f3e3f'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cc8dc275-e0af-493d-932e-6af775fe96ff'}', status='Unknown'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='4af7fccf-f1bf-4f0c-9b95-ea972b72afa0'}', status='Inactive'}
]'}), log id: 2b4c019c

2020-06-10 13:49:45,677-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (EE-ManagedThreadFactory-engine-Thread-12) [6097cc9e] Command 'ReconstructMasterVDSCommand(HostName = f01-h15-000-r620.rdu2.scalelab.redhat.com, ReconstructMasterVDSCommandParameters:{hostId='4b58778a-0638-47ee-ab22-359a4b684708', vdsSpmId='2', storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storagePoolName='L0_Group_0', masterDomainId='cc8dc275-e0af-493d-932e-6af775fe96ff', masterVersion='3', domainsList='[
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='6c2aa70d-678d-4fe9-b372-c6ab605af0e9'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='3fbd08a8-18dc-4624-9f22-f55cce2fb386'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='16dd0382-c10a-4f8c-9d9a-40b4b51536e7'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='a60b8cde-99c5-4516-ad65-8221e8d35802'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='fa9355a4-61d0-4827-9a83-6d00f045459b'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='2e7d7e21-917b-4202-b4bc-ba5e6fffa558'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='d9b502e0-87ab-4062-bba4-483a435a5f8b'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cbee9a1a-8048-4507-b123-6d212d407948'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='feae8e8a-1a3c-4743-b993-5e7c501f3e3f'}', status='Inactive'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cc8dc275-e0af-493d-932e-6af775fe96ff'}', status='Unknown'},
  StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='4af7fccf-f1bf-4f0c-9b95-ea972b72afa0'}', status='Inactive'}
]'})' execution failed: VDSGenericException: VDSErrorException: Failed to ReconstructMasterVDS, error = Storage domain does not exist: ('cc8dc275-e0af-493d-932e-6af775fe96ff',), code = 358

L0_Group_0 has 11 Storage Domains.
(VDSM log) Errors start at 2020-06-10 13:15:53,061-0400:

2020-06-10 13:15:51,810-0400 INFO (monitor/d9b502e) [storage.StorageDomainCache] Removing domain d9b502e0-87ab-4062-bba4-483a435a5f8b from storage domain cache (sdc:211)
2020-06-10 13:15:51,856-0400 INFO (monitor/fa9355a) [storage.StorageDomainCache] Removing domain fa9355a4-61d0-4827-9a83-6d00f045459b from storage domain cache (sdc:211)
2020-06-10 13:15:51,869-0400 INFO (monitor/4af7fcc) [storage.StorageDomainCache] Removing domain 4af7fccf-f1bf-4f0c-9b95-ea972b72afa0 from storage domain cache (sdc:211)
2020-06-10 13:15:51,883-0400 INFO (monitor/a60b8cd) [storage.StorageDomainCache] Removing domain a60b8cde-99c5-4516-ad65-8221e8d35802 from storage domain cache (sdc:211)
2020-06-10 13:15:52,765-0400 WARN (monitor/d9b502e) [storage.LVM] Removing stale VG d9b502e0-87ab-4062-bba4-483a435a5f8b (lvm:656)
2020-06-10 13:15:53,061-0400 ERROR (monitor/d9b502e) [storage.Monitor] Error checking domain d9b502e0-87ab-4062-bba4-483a435a5f8b (monitor:423)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 404, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 1843, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 955, in __init__
    manifest = self.manifestClass(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 424, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 346, in __init__
    self._alignment = metadata.get(DMDK_ALIGNMENT, sc.ALIGNMENT_1M)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 119, in get
    return dec(self._dict[key])
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 224, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 176, in _accessWrapper
    self._refresh()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 245, in _refresh
    lines = self._metaRW.readlines()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 364, in readlines
    vg = lvm.getVG(self._vgName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1273, in getVG
    raise se.VolumeGroupDoesNotExist(vgName)
vdsm.storage.exception.VolumeGroupDoesNotExist: Volume Group does not exist: ('d9b502e0-87ab-4062-bba4-483a435a5f8b',)
2020-06-10 13:15:53,061-0400 INFO (monitor/d9b502e) [storage.Monitor] Domain d9b502e0-87ab-4062-bba4-483a435a5f8b became INVALID (monitor:468)

Order in which the Storage Domains were lost:

2020-06-10 13:16:18,546-0400 ('d9b502e0-87ab-4062-bba4-483a435a5f8b',) - L0_Group_0_Extra10 StorageDomain
2020-06-10 13:22:18,149-0400 ('feae8e8a-1a3c-4743-b993-5e7c501f3e3f',) - L0_Group_0_Extra9 StorageDomain
2020-06-10 13:27:18,139-0400 ('cbee9a1a-8048-4507-b123-6d212d407948',) - L0_Group_0_Extra8 StorageDomain
2020-06-10 13:31:18,211-0400 ('fa9355a4-61d0-4827-9a83-6d00f045459b',) - L0_Group_0_Extra7 StorageDomain
2020-06-10 13:32:18,231-0400 ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) - L0_Group_0_Extra6 StorageDomain
2020-06-10 13:36:18,963-0400 ('a60b8cde-99c5-4516-ad65-8221e8d35802',) - L0_Group_0_Extra5 StorageDomain
2020-06-10 13:42:18,148-0400 ('16dd0382-c10a-4f8c-9d9a-40b4b51536e7',) - L0_Group_0_Extra4 StorageDomain
2020-06-10 13:43:17,755-0400 ('3fbd08a8-18dc-4624-9f22-f55cce2fb386',) - L0_Group_0_Extra3 StorageDomain
2020-06-10 13:47:21,672-0400 ('cc8dc275-e0af-493d-932e-6af775fe96ff',) - L0_Group_0_Extra1 StorageDomain
2020-06-10 13:47:22,534-0400 ('6c2aa70d-678d-4fe9-b372-c6ab605af0e9',) - L0_Group_0_Extra2 StorageDomain
2020-06-10 13:48:52,357-0400 ('4af7fccf-f1bf-4f0c-9b95-ea972b72afa0',) - L0_Group_0 StorageDomain

Host F01-H15 /var/log/messages:

Jun 10 13:16:18 f01-h15-000-r620 vdsm[16262]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=c26f2e35-ec7c-481e-ae9c-b151f73bc339 at 0x7f06906b7908> timeout=30.0, duration=1.23 at 0x7f06932f4c88>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 315, in _execute_task
    task()
  File "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 391, in __call__
    self._callable()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 315, in __call__
    self._execute()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 357, in _execute
    self._vm.updateDriveVolume(drive)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3813, in updateDriveVolume
    vmDrive.volumeID)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5571, in _getVolumeSize
    (domainID, volumeID))
vdsm.virt.vm.StorageUnavailableError: Unable to get volume size for domain d9b502e0-87ab-4062-bba4-483a435a5f8b volume 54faeb78-397d-45d8-b987-c5acff85bc36
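The monitor pattern visible in the vdsm traceback above - a periodic check whose storage error is caught and turned into an INVALID domain state rather than crashing the monitor thread - can be sketched roughly as follows. This is an illustrative model only; the class and method names are assumptions and do not reproduce the actual vdsm API.

```python
# Hypothetical sketch of the domain-monitor behavior seen in the log:
# selftest() raises VolumeGroupDoesNotExist and the monitor marks the
# domain INVALID instead of propagating the error.
class VolumeGroupDoesNotExist(Exception):
    pass


class DomainMonitor:
    def __init__(self, domain):
        self.domain = domain
        self.valid = True

    def check(self):
        try:
            # Raises if the domain's VG cannot be produced from the cache.
            self.domain.selftest()
        except VolumeGroupDoesNotExist:
            # Corresponds to "Domain ... became INVALID (monitor:468)".
            self.valid = False
```

A real monitor would also log the traceback and retry on the next cycle, which is why the "became INVALID" messages repeat every monitoring interval.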
The flow was as follows: the monitor checks the domain status and refreshes the domain's lvm cache every 5 minutes - it removes the domain from the cache and loads it again. On 2020-06-10 06:30:52 this reload failed (see vdsm.log.72.xz) - the vgs command probably did not return anything, so vdsm started to consider this VG as stale and began removing the domain remnants (see supervdsm.log.4.xz):

MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,535::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call devicemapper_removeMapping with ('a131cd1c--4973--43c7--ac8b--b6987cc0356d-xleases',) {}
MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,535::commands::153::common.commands::(start) /usr/bin/taskset --cpu-list 0-23 /usr/sbin/dmsetup remove a131cd1c--4973--43c7--ac8b--b6987cc0356d-xleases (cwd None)
MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,558::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return devicemapper_removeMapping with None

Removal of some devices failed because the device was busy, e.g.:

MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,591::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call devicemapper_removeMapping with ('a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids',) {}
MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,591::commands::153::common.commands::(start) /usr/bin/taskset --cpu-list 0-23 /usr/sbin/dmsetup remove a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids (cwd None)
MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,599::commands::98::common.commands::(run) FAILED: <err> = b'device-mapper: remove ioctl on a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids failed: Device or resource busy\nCommand failed.\n'; <rc> = 1
MainProcess|monitor/a131cd1::ERROR::2020-06-10 06:30:52,599::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper) Error in devicemapper_removeMapping
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 140, in removeMapping
    commands.run(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/sbin/dmsetup', 'remove', 'a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids'] failed with rc=1 out=b'' err=b'device-mapper: remove ioctl on a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids failed: Device or resource busy\nCommand failed.\n'

Therefore some devices were removed, and there are failures in other logs, e.g. in the sanlock log:

2020-06-10 06:30:58 1196131 [1385]: open error -1 /dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases

as this device was removed (see above), while some other devices are still mapped via device mapper:

[root@f01-h33-000-r620 ~]# dmsetup table
a131cd1c--4973--43c7--ac8b--b6987cc0356d-fa50be97--26d5--407f--87ad--bd43a309285b: 0 419430400 linear 253:2 10225664
a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids: 0 262144 linear 253:2 526336
3600a098038304437415d4b6a59684967: 0 10737418240 multipath 3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 2 2 8:48 1 1 8:64 1 1 service-time 0 2 2 8:16 1 1 8:32 1 1
a131cd1c--4973--43c7--ac8b--b6987cc0356d-93e65dfd--b5a6--4fe4--bb00--8f31727c63bd: 0 2097152 linear 253:2 429656064
a131cd1c--4973--43c7--ac8b--b6987cc0356d-master: 0 2097152 linear 253:2 7604224
vg_f01--h33--000--r620-lv_swap: 0 16777216 linear 8:2 2048
vg_f01--h33--000--r620-lv_root: 0 1934614528 linear 8:2 16779264
a131cd1c--4973--43c7--ac8b--b6987cc0356d-60c303e3--033a--4d9e--8011--527069fdccc3: 0 33030144 linear 253:2 441976832

So the main question is why the VG reload (https://github.com/oVirt/vdsm/blob/v4.40.17/lib/vdsm/storage/lvm.py#L603) probably did not return any value.
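The devicemapper_removeMapping failures quoted above follow a common pattern: shell out to dmsetup and raise when the exit status is non-zero, so a busy device surfaces as a traceback. Below is a minimal sketch of that pattern; the function and exception names are illustrative (not the real vdsm code), and the command runner is injectable so the example does not require a real device-mapper device.

```python
import subprocess


class CommandError(Exception):
    """Raised when an external command exits with a non-zero status."""

    def __init__(self, cmd, rc, out, err):
        super().__init__(f"Command {cmd} failed with rc={rc} out={out!r} err={err!r}")
        self.rc = rc


def remove_mapping(device_name, run=subprocess.run):
    # Mirrors the logged flow: /usr/sbin/dmsetup remove <name>. A busy device
    # exits with rc=1 and "Device or resource busy" on stderr, which is
    # converted into an exception here.
    cmd = ["/usr/sbin/dmsetup", "remove", device_name]
    p = run(cmd, capture_output=True)
    if p.returncode != 0:
        raise CommandError(cmd, p.returncode, p.stdout, p.stderr)
```

With the default runner this actually invokes dmsetup (root required); in the failure scenario above, the "ids" mapping was busy because sanlock still held it open, so the remove raised while other mappings were torn down.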
(In reply to Vojtech Juranek from comment #11)

I think the issue is removing the VG mapping when the VG fails to reload:

652     for name in staleVGs:
653         if name in self._vgs:
654             log.warning("Removing stale VG %s", name)
655             removeVgMapping(name)
656             del self._vgs[name]

This is completely wrong. The fact that lvm failed to return information about this VG does not mean that the VG does not exist or is unused. Removing the VG mapping should be done only when we deactivate a storage domain, as part of the StorageDomain.cleanup() flow.

The removeVgMapping() call was added in:

commit a69f8a08b5c11f984254d6fd8bf50c028b0ee799
Author: Saggi Mizrahi <smizrahi>
Date:   Sun May 15 10:24:58 2011 +0300

    BZ#701671 - Create PV should account for stale device mapper mappings

    Change-Id: Idc08f4d4f2042513696b8539531758ddee3c87cc
    Reviewed-on: http://gerrit.usersys.redhat.com/355
    Reviewed-by: Dan Kenigsberg <danken>
    Tested-by: Dan Kenigsberg <danken>

The change is included in this repo:

git://gerrit.eng.lab.tlv.redhat.com/vdsm

which contains the vdsm source from before it was open source. I'm not sure how the bug is related to this huge patch. It looks like the author tried to make sure a VG is removed completely, including stale LVs, so he added this call in the reload flow to ensure that non-existing VGs do not leave stale LVs behind.

We need to remove the removeVgMapping() call. We also need to understand why lvm does not return info for this storage domain (or many storage domains), but that is a separate issue, and not a critical one. The system must be able to handle temporary errors when accessing storage.
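To make the proposed fix concrete, here is a minimal sketch contrasting with the quoted snippet. It is a toy model with assumed names (not the actual vdsm lvm module): when lvm fails to report a VG, the reload only evicts it from the in-memory cache and never touches device-mapper mappings, since a temporary lvm error does not mean the VG is gone.

```python
import logging

log = logging.getLogger("storage.LVM")


class VGCache:
    """Toy model of an in-memory VG cache (illustrative only)."""

    def __init__(self):
        self._vgs = {}  # vg name -> cached VG info

    def reload(self, reported_vgs):
        """reported_vgs: {name: info} as parsed from a `vgs` run.

        Fixed behavior: drop stale cache entries only. The buggy code quoted
        above additionally called removeVgMapping(name) here, destroying
        live device-mapper mappings of VGs that still exist on storage.
        """
        stale = set(self._vgs) - set(reported_vgs)
        for name in stale:
            log.warning("Removing stale VG %s from cache", name)
            del self._vgs[name]
        self._vgs.update(reported_vgs)
```

With this shape, a transient empty `vgs` result costs only a cache miss on the next access, instead of tearing down mappings that sanlock and running VMs still hold open.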
This bug moved to post but it has not been targeted yet. Can you please target?
We need to backport this to 4.3; this is an important fix.
Martin, are you sure the merged engine patch is related to this bug?

The root cause of the issue described in comment 0 is user error. While provisioning other hosts in the same network, the LUN used by this setup was wiped. We found an empty partition table in the LUN.

This caused LVM to stop reporting this LUN and the VG using it. This exposed the fact that vdsm removes the mapping of missing VGs during reloading, which should only update vdsm's lvm cache.

The issue in vdsm is fixed by https://gerrit.ovirt.org/109712

But we have another engine patch (seemingly unrelated to this issue): https://gerrit.ovirt.org/109673

The engine patch probably belongs to another bug, or if it is really related to the same bug, it seems that 2 people worked on the same issue without any coordination :-)

I guess we need to move one of the patches to another bug, or create new bugs for the vdsm and engine issues and close this as NOTABUG.
(In reply to Nir Soffer from comment #15)
> Martin, are you sure the merged engine patch is related to this bug?
>
> This root cause of the issue described in comment 0 is user error.
> While provisioning other hosts in the same network, the LUN used
> by this setup was wiped. We found an empty partition table in the
> LUN.
>
> This caused LVM to stop reporting this LUN and the VG using it. This
> exposed the fact that vdsm removes mapping of missing VGs during reloading
> which should update only vdsm lvm cache.
>
> The issue in vdsm is fixed by
> https://gerrit.ovirt.org/109712
>
> But we have another engine patch (seems unrelated to this issue):
> https://gerrit.ovirt.org/109673
>
> The engine patch probably belongs to another bug, or if this is really
> related to the same bug it seems that 2 people worked on the same issue
> without any coordination :-)

The engine patch fixes the NPE on the engine side, but as I already mentioned, fixing the NPE will not resolve the main issue.

> I guess we need to move one of the patches to another bug or create
> new bugs for the vdsm and engine issue and close this as NOTABUG.

Feel free to keep this bug only for the storage-related issue.
The system was re-installed. We ran our scale tests (~10 times, ~150 hours of testing), scanning the engine and vdsm log files. The following errors were not found:

vdsm: "Volume Group does not exist"
engine: "ConnectDomainToStorageCommand"
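The verification scan described above can be approximated with a small script. The log paths below are assumptions (typical locations for these services), and the patterns are the two signatures quoted in this comment.

```python
# Hypothetical helper for the verification pass described above: search each
# log file for its error signature and report matching line numbers.
def scan_log(path, pattern):
    hits = []
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if pattern in line:
                hits.append((lineno, line.rstrip()))
    return hits


# Assumed default log locations; adjust per setup.
PATTERNS = {
    "/var/log/vdsm/vdsm.log": "Volume Group does not exist",
    "/var/log/ovirt-engine/engine.log": "ConnectDomainToStorageCommand",
}

if __name__ == "__main__":
    for path, pattern in PATTERNS.items():
        for lineno, line in scan_log(path, pattern):
            print(f"{path}:{lineno}: {line}")
```

An empty output over all test runs is the "errors are not found" result reported above.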
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
*** Bug 1831009 has been marked as a duplicate of this bug. ***