Bug 1846331 - [scale] DataCenter becomes 'Non Responsive' and hosts have no SPM
Summary: [scale] DataCenter becomes 'Non Responsive' and hosts have no SPM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.17
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.1
Target Release: 4.40.21
Assignee: Vojtech Juranek
QA Contact: David Vaanunu
URL:
Whiteboard:
Duplicates: 1831009 (view as bug list)
Depends On:
Blocks: 1849558
 
Reported: 2020-06-11 12:02 UTC by David Vaanunu
Modified: 2021-03-11 11:09 UTC (History)
11 users (show)

Fixed In Version: ovirt-engine-4.4.1.5 vdsm-4.40.21
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1849558 (view as bug list)
Environment:
Last Closed: 2020-07-08 08:25:25 UTC
oVirt Team: Storage
Embargoed:
mlehrer: needinfo-
sbonazzo: ovirt-4.4?
sbonazzo: planning_ack?
sbonazzo: devel_ack?
sbonazzo: testing_ack?


Attachments (Terms of Use)
Engine log (634.04 KB, application/x-xz)
2020-06-11 12:02 UTC, David Vaanunu
no flags Details
vdsm.log showing exception during Master Domain reconstruction attempt (9.10 MB, application/gzip)
2020-06-11 16:28 UTC, mlehrer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 109712 0 master MERGED blocksd: remove SD mappings upon SD invalidation 2020-11-02 09:06:20 UTC
oVirt gerrit 109803 0 ovirt-4.3 MERGED blocksd: remove SD mappings upon SD invalidation 2020-11-02 09:06:20 UTC

Description David Vaanunu 2020-06-11 12:02:29 UTC
Created attachment 1696743 [details]
Engine log


Description of problem:

DataCenters become 'Non Responsive' (2 DCs - one is the Hosted-Engine DC),
Storage Domains become 'Inactive' (1st DC with 1 SD, 2nd DC with 11 SDs),
and the hosts that belong to these DCs have no SPM.
Each SD is 5TB, with an FC connection.
While running 'multipath -ll', all the LUNs are shown as running and active.

It happened during mass VM creation (~2300 VMs)
*** and while adding another host to the 'Management' DC (maybe that was the trigger) ***

*** The mass VM creation was running over other DCs with 'iscsi' Storage Domains ***


Version-Release number of selected component (if applicable):

redhat-release-eula-8.2-1.0.el8.x86_64
rhv-release-4.4.1-1-001.noarch
vdsm-4.40.17-1.el8ev.x86_64


How reproducible:


Steps to Reproduce:
1. Create ~250 VMs on 'L1_Group' DC (20 in parallel)
2. Create ~1050 VMs on 'L0_Group_1' DC (50 in parallel)
3. Create ~960 VMs on 'L0_Group_2' DC (50 in parallel) - Stopped during creation
4. Add another host to 'Management' DC

Actual results:

DCs are in 'Non Responsive' status

Expected results:

No errors; DCs are 'Up'


Additional info:

(Displayed every 5 seconds)
2020-06-11 05:20:00,132-04 ERROR [org.ovirt.engine.core.bll.storage.connection.ConnectDomainToStorageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-97) [1a4c6cdc] Error during ValidateFailure.: java.lang.NullPointerException
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.isSystemSuperUser(CommandBase.java:1161)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.StorageHandlingCommandBase.validate(StorageHandlingCommandBase.java:756)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.internalValidateInTransaction(CommandBase.java:819)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:140)
	at org.ovirt.engine.core.utils//org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInSuppressed(TransactionSupport.java:157)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:798)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:412)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runAction(Backend.java:442)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:424)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.Backend.runInternalAction(Backend.java:369)
	at jdk.internal.reflect.GeneratedMethodAccessor163.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.jboss.as.ee.0.GA-redhat-00004//org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
	at org.jboss.invocation.2.Final-redhat-00001//org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)

Comment 1 mlehrer 2020-06-11 16:28:51 UTC
Created attachment 1696870 [details]
vdsm.log showing exception during Master Domain reconstruction attempt

Comment 2 mlehrer 2020-06-11 16:32:13 UTC
Changing severity as scaleout testing is blocked by this issue.

Pasting a snippet of the exception; see the attachment for more details.


f3-447f-8b79-96aef9b0f174 backend from type NoneType instance 0x7f58cadd69d0 to type StoragePoolDiskBackend instance 0x7f58506ea188 (sp:156)
2020-06-11 12:12:08,700-0400 INFO  (jsonrpc/0) [storage.StorageDomainCache] Looking up domain 2e7d7e21-917b-4202-b4bc-ba5e6fffa558 (sdc:171)
2020-06-11 12:12:08,703-0400 DEBUG (jsonrpc/0) [common.commands] /usr/bin/taskset --cpu-list 0-23 /usr/bin/sudo -n /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"]  hints="none" } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name --select 'vg_name = 2e7d7e21-917b-4202-b4bc-ba5e6fffa558' (cwd None) (commands:153)
2020-06-11 12:12:08,813-0400 DEBUG (jsonrpc/0) [common.commands] SUCCESS: <err> = b''; <rc> = 0 (commands:185)
2020-06-11 12:12:08,819-0400 INFO  (jsonrpc/0) [storage.StorageDomainCache] Looking up domain 2e7d7e21-917b-4202-b4bc-ba5e6fffa558: 0.12 seconds (utils:390)
2020-06-11 12:12:08,819-0400 INFO  (jsonrpc/0) [vdsm.api] FINISH reconstructMaster error=Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) from=::ffff:10.1.41.200,46196, flow_id=3bdb5fe4, task_id=cf16330f-89fe-4b18-9579-aec9ed40a5e1 (api:52)
2020-06-11 12:12:08,820-0400 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='cf16330f-89fe-4b18-9579-aec9ed40a5e1') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-87>", line 2, in reconstructMaster
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1924, in reconstructMaster
    self.validateSdUUID(masterDom)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 309, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 115, in produce
    domain.getRealDomain()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 186, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
vdsm.storage.exception.StorageDomainDoesNotExist: Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',)
2020-06-11 12:12:08,820-0400 INFO  (jsonrpc/0) [storage.TaskManager.Task] (Task='cf16330f-89fe-4b18-9579-aec9ed40a5e1') aborting: Task is aborted: "value=Storage domain does not exist: ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) abortedcode=358" (task:1190)

Comment 3 Martin Perina 2020-06-15 12:44:27 UTC
Looking at the NPE, here is the flow:

1. AutorecoveryManager is executed periodically as a cron job, which means that there is no logged-in user for that flow
2. AutorecoveryManager executes ConnectDomainToStorageCommand
3. ConnectDomainToStorageCommand inherits the validate() method from StorageHandlingCommandBase, where we check whether the current user is a super user -> but as mentioned above, for scheduled jobs there is no logged-in user

So my suggestion is to change the condition in https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/StorageHandlingCommandBase.java#L756 to

 if (sd != null && sd.isHostedEngineStorage() && getCurrentUser() != null && !isSystemSuperUser()) {

Comment 4 Vojtech Juranek 2020-06-15 14:57:15 UTC
> While running 'multipath -ll' all the LUNs are running and active 

Could you check whether LVM sees the VGs? E.g. by running:

/sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"]  hints="none" } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name

Comment 5 David Vaanunu 2020-06-15 16:16:28 UTC
[root@f01-h31-000-r620 ~]# /sbin/lvm vgs
  VG                  #PV #LV #SN Attr   VSize    VFree
  vg_f01-h31-000-r620   1   2   0 wz--n- <930.50g    0 
[root@f01-h31-000-r620 ~]# 


(This host has 12 LUNs)

Comment 6 Michal Skrivanek 2020-06-16 05:43:07 UTC
comment #3?

Comment 7 Vojtech Juranek 2020-06-16 08:15:27 UTC
(In reply to David Vaanunu from comment #5)
> [root@f01-h31-000-r620 ~]# /sbin/lvm vgs
>   VG                  #PV #LV #SN Attr   VSize    VFree
>   vg_f01-h31-000-r620   1   2   0 wz--n- <930.50g    0 
> [root@f01-h31-000-r620 ~]# 
> 
> 
> (This host have 12 LUNs)

Is there any LVM filter? Can you run the whole command with the filter setup (comment #4)? Can you also provide the output of `multipath -ll` and `lsblk`?

Comment 8 David Vaanunu 2020-06-16 08:39:12 UTC
There is no LVM filter.



[root@f01-h31-000-r620 ~]# /sbin/lvm vgs --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/3600a098038304437415d4b6a59684968$|^/dev/mapper/3600a098038304437415d4b6a59684969$|^/dev/mapper/3600a098038304437415d4b6a5968496a$|^/dev/mapper/3600a098038304437415d4b6a5968496b$|^/dev/mapper/3600a098038304437415d4b6a5968496c$|^/dev/mapper/3600a098038304437415d4b6a5968496d$|^/dev/mapper/3600a098038304437415d4b6a5968496e$|^/dev/mapper/3600a098038304437415d4b6a5968496f$|^/dev/mapper/3600a098038304437415d4b6a59684970$|^/dev/mapper/3600a098038304437415d4b6a59684971$|^/dev/mapper/3600a098038304437415d4b6a59684972$|^/dev/mapper/3600a098038304437415d4b6a59684979$|", "r|.*|"]  hints="none" } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 } backup {  retain_min=50  retain_days=0 }' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name
[root@f01-h31-000-r620 ~]# 




lsblk & multipath output:
https://drive.google.com/open?id=1eT0sdQy5lHfph_XuPr6bEUCihN9SOG3A

Comment 9 mlehrer 2020-06-16 12:33:10 UTC
##Updating the bug to provide more details

There are several DCs in this environment.
The investigation should focus on the 'Management' DC, specifically host f01-h33, which runs the HE VM and the DWH VM.
Due to size, logs from this time frame are found here: https://drive.google.com/drive/folders/1yoLAiglGJVTgOY41MwCpNWLbZjdkABoN?usp=sharing 
Storage Domains which are down are found in both L0_Group_0 (12 FC SDs) and Management (1 FC SD).


The following focuses on the Management DC host f01-h33, which failed to access its storage domain starting around 06:30 on June 10th.

# /var/log/sanlock.log: sanlock showing an open error
2020-05-27 10:20:06 279 [1383]: s1:r2 resource a131cd1c-4973-43c7-ac8b-b6987cc0356d:fa50be97-26d5-407f-87ad-bd43a309285b:/dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases:108003328
2020-05-27 10:21:52 386 [1384]: s1:r3 resource a131cd1c-4973-43c7-ac8b-b6987cc0356d:SDM:/dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases:1048576
2020-06-10 06:30:58 1196131 [1385]: open error -1 /dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases
2020-06-10 06:30:58 1196131 [1385]: r3 release async open error -2


# Looking at /var/log/ovirt-hosted-engine-ha/broker.log, we see that the issue started at:
Thread-3::INFO::2020-06-10 06:30:55,092::mem_free::51::mem_free.MemFree::(action) memFree: 91337
Thread-2::INFO::2020-06-10 06:30:55,134::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
StatusStorageThread::ERROR::2020-06-10 06:30:57,453::storage_broker::173::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
StatusStorageThread::ERROR::2020-06-10 06:30:57,456::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 175, in get_raw_stats
    .format(str(e)))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'
StatusStorageThread::ERROR::2020-06-10 06:30:57,457::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker


###
Checking the /var/log/messages log from before the error and through the time period of this failure, in order to see if there are any signs of environment issues related to the FC storage connection, we do not see errors related to removal of disks or underlying LUNs.
We do see the following errors:
###
Jun 10 06:30:56 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore ERROR Unable to extract HEVM OVF
Jun 10 06:30:56 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf


Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Started Session c12163 of user root. 
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to read metadata from /var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats#012    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)#012FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030' 
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to read state.#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 155, in get_raw_stats#012    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)#012FileNotFoundError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030'#012#012During handling of the above exception, another exception occurred:#012#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run#012    self._storage_broker.get_raw_stats()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 175, in get_raw_stats#012    .format(str(e)))#012ovirt_hosted_engine_ha.lib.exceptions.RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/a131cd1c-4973-43c7-ac8b-b6987cc0356d/c260be65-9acd-4263-af81-43f797b93931/dea0ccf2-a1c0-44b6-ab4c-49e7729ac030' 
Jun 10 06:30:57 f01-h33-000-r620 journal[2985892]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker 
Jun 10 06:30:57 f01-h33-000-r620 platform-python[2985892]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker' 
Jun 10 06:30:57 f01-h33-000-r620 systemd[3511231]: Starting D-Bus User Message Bus Socket.



Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: user-runtime-dir: Unit not needed anymore. Stopping.
Jun 10 06:30:57 f01-h33-000-r620 journal[2986212]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Unhandled monitoring loop exception#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 369, in get_volume_path#012    volumeID=vol_uuid#012  File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 303, in _call#012    method, kwargs, resp.error.code, str(resp.error))#012vdsm.client.ServerError: Command Image.prepare with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'a131cd1c-4973-43c7-ac8b-b6987cc0356d', 'imageID': '418e700d-6035-4571-bf48-732ef6caa861', 'volumeID': '36d178aa-0cb9-4655-ac84-097d753161f7'} failed:#012(code=358, message=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',))#012#012During handling of the above exception, another exception occurred:#012#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring#012    self._monitoring_loop()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 500, in _monitoring_loop#012    self._config.refresh_vm_conf()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config.py", line 201, in refresh_vm_conf#012    self._vm_config.download()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_ovf.py", line 107, in download#012    self.legacy_vm_conf.download()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 31, in download#012    content = self._get_file_content_from_shared_storage()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 136, in _get_file_content_from_shared_storage#012    config_volume_path = self._get_config_volume_path()#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/env/config_shared.py", line 71, in _get_config_volume_path#012    conf_vol_uuid#012  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 380, in get_volume_path#012    err=e)#012RuntimeError: Path to volume 36d178aa-0cb9-4655-ac84-097d753161f7 not found in /run/vdsm/storage. Caused by: Command Image.prepare with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'a131cd1c-4973-43c7-ac8b-b6987cc0356d', 'imageID': '418e700d-6035-4571-bf48-732ef6caa861', 'volumeID': '36d178aa-0cb9-4655-ac84-097d753161f7'} failed:#012(code=358, message=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',))
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Stopping /run/user/0 mount wrapper...
Jun 10 06:30:57 f01-h33-000-r620 systemd[1]: Removed slice User Slice of UID 0.


The vdsm log shows:

2020-06-10 06:30:55,663-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Invalidating storage domain cache (sdc:74)
2020-06-10 06:30:55,663-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Clearing storage domain cache (sdc:199)
2020-06-10 06:30:55,664-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Refreshing storage domain cache (resize=True) (sdc:80)
2020-06-10 06:30:55,664-0400 INFO  (jsonrpc/3) [storage.ISCSI] Scanning iSCSI devices (iscsi:442)
2020-06-10 06:30:56,522-0400 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,540-0400 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,543-0400 INFO  (jsonrpc/6) [vdsm.api] START prepareImage(sdUUID='a131cd1c-4973-43c7-ac8b-b6987cc0356d', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='418e700d-6035-4571-bf48-732ef6caa861', leafUUID='36d178aa-0cb9-4655-ac84-097d753161f7', allowIllegal=False) from=::ffff:127.0.0.1,59178, task_id=9305e400-47f5-4a37-9eb9-8725af89a77e (api:48)
2020-06-10 06:30:56,559-0400 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.ping2 succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,562-0400 INFO  (jsonrpc/7) [vdsm.api] START repoStats(domains=['a131cd1c-4973-43c7-ac8b-b6987cc0356d']) from=::ffff:127.0.0.1,59176, task_id=114efcb6-7653-45c5-8a95-4cabb43edd6c (api:48)
2020-06-10 06:30:56,562-0400 INFO  (jsonrpc/7) [vdsm.api] FINISH repoStats return={'a131cd1c-4973-43c7-ac8b-b6987cc0356d': {'code': 358, 'lastCheck': '3.8', 'delay': '0.000345154', 'valid': False, 'version': -1, 'acquired': False, 'actual': True}} from=::ffff:127.0.0.1,59176, task_id=114efcb6-7653-45c5-8a95-4cabb43edd6c (api:54)
2020-06-10 06:30:56,562-0400 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getStorageRepoStats succeeded in 0.00 seconds (__init__:312)
2020-06-10 06:30:56,629-0400 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (mailbox:760)
2020-06-10 06:30:56,841-0400 INFO  (jsonrpc/3) [storage.ISCSI] Scanning iSCSI devices: 1.18 seconds (utils:390)
2020-06-10 06:30:56,841-0400 INFO  (jsonrpc/3) [storage.HBA] Scanning FC devices (hba:60)
2020-06-10 06:30:56,984-0400 INFO  (jsonrpc/3) [storage.HBA] Scanning FC devices: 0.14 seconds (utils:390)
2020-06-10 06:30:57,038-0400 INFO  (jsonrpc/3) [storage.Multipath] Resizing multipath devices (multipath:104)
2020-06-10 06:30:57,042-0400 INFO  (jsonrpc/3) [storage.Multipath] Resizing multipath devices: 0.00 seconds (utils:390)
2020-06-10 06:30:57,043-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Refreshing storage domain cache: 1.38 seconds (utils:390)
2020-06-10 06:30:57,043-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d (sdc:171)
2020-06-10 06:30:57,345-0400 INFO  (jsonrpc/3) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d: 0.30 seconds (utils:390)
2020-06-10 06:30:57,345-0400 INFO  (jsonrpc/3) [vdsm.api] FINISH connectStoragePool error=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' from=::ffff:10.1.41.200,48602, flow_id=6a6f50b5, task_id=f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef (api:52)
2020-06-10 06:30:57,345-0400 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef') Unexpected error (task:880)
2020-06-10 06:30:57,346-0400 INFO  (jsonrpc/3) [storage.TaskManager.Task] (Task='f6d5b4ce-3be7-4c25-8cc1-d273b1e5dfef') aborting: Task is aborted: "value=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' abortedcode=304" (task:1190)
2020-06-10 06:30:57,346-0400 ERROR (jsonrpc/3) [storage.Dispatcher] FINISH connectStoragePool error=Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d' (dispatcher:83)
2020-06-10 06:30:57,346-0400 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call StoragePool.connect failed (error 304) in 1.69 seconds (__init__:312)
2020-06-10 06:30:57,349-0400 INFO  (jsonrpc/6) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d (sdc:171)
2020-06-10 06:30:57,699-0400 INFO  (jsonrpc/6) [storage.StorageDomainCache] Looking up domain a131cd1c-4973-43c7-ac8b-b6987cc0356d: 0.35 seconds (utils:390)
2020-06-10 06:30:57,699-0400 INFO  (jsonrpc/6) [vdsm.api] FINISH prepareImage error=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) from=::ffff:127.0.0.1,59178, task_id=9305e400-47f5-4a37-9eb9-8725af89a77e (api:52)
2020-06-10 06:30:57,699-0400 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='9305e400-47f5-4a37-9eb9-8725af89a77e') Unexpected error (task:880)
2020-06-10 06:30:57,700-0400 INFO  (jsonrpc/6) [storage.TaskManager.Task] (Task='9305e400-47f5-4a37-9eb9-8725af89a77e') aborting: Task is aborted: "value=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) abortedcode=358" (task:1190)
2020-06-10 06:30:57,700-0400 ERROR (jsonrpc/6) [storage.Dispatcher] FINISH prepareImage error=Storage domain does not exist: ('a131cd1c-4973-43c7-ac8b-b6987cc0356d',) (dispatcher:83)
2020-06-10 06:30:57,701-0400 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Image.prepare failed (error 358) in 1.16 seconds (__init__:312)
2020-06-10 06:30:57,786-0400 INFO  (jsonrpc/2) [vdsm.api] START getAllTasksStatuses(spUUID=None, options=None) from=::ffff:10.1.41.200,48602, flow_id=70d12005, task_id=d3ef338c-97b6-426d-8652-896e0f75c005 (api:48)

Comment 10 David Vaanunu 2020-06-16 15:05:07 UTC
1. Checked the NetApp event logs; no storage-related issues were found
2. Checked other RHV environments using FC and they were fine; no storage-related issues were found


Log Files: https://drive.google.com/open?id=1eT0sdQy5lHfph_XuPr6bEUCihN9SOG3A
(Engine.log, vdsm.log, message log)

3. Management cluster: became inactive at 2020-06-10 06:30:46,951-04

2020-06-10 06:30:46,951-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM f01-h33-000-r620.rdu2.scalelab.redhat.com command SpmStatusVDS failed: Error validating master storage domain: ('MD read error',)
2020-06-10 06:30:46,952-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [] Command 'SpmStatusVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, SpmStatusVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)
2020-06-10 06:30:47,005-04 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [6a6f50b5] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC_WITH_ERROR(987), Invalid status on Data Center Management. Setting Data Center status to Non Responsive (On host f01-h33-000-r620.rdu2.scalelab.redhat.com, Error: Cannot access master storage domain).
2020-06-10 06:30:48,846-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-78) [6a6f50b5] Command 'ConnectStoragePoolVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, ConnectStoragePoolVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', vdsId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07', masterVersion='1'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: 'spUUID=eaa2be8d-416f-4f64-b30a-1e70f58c5a07, msdUUID=a131cd1c-4973-43c7-ac8b-b6987cc0356d'
2020-06-10 06:30:58,886-04 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe
2020-06-10 06:30:58,908-04 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4) [70d12005] Could not stop spm of pool 'eaa2be8d-416f-4f64-b30a-1e70f58c5a07' on vds '5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb': VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:30:58,908-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-4) [70d12005] FINISH, SpmStopVDSCommand, return: , log id: 58e0d6e5
2020-06-10 06:30:58,909-04 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-28) [7a9b37c0] Unable to GetStats: VDSNetworkException: VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:30:58,911-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'SpmStatusVDSCommand(HostName = f01-h33-000-r620.rdu2.scalelab.redhat.com, SpmStatusVDSCommandParameters:{hostId='5a5f1b5a-3c5f-4e69-81e7-dc1b03f251fb', storagePoolId='eaa2be8d-416f-4f64-b30a-1e70f58c5a07'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
2020-06-10 06:31:19,559-04 INFO  [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (EE-ManagedThreadFactory-engine-Thread-35) [] Could not connect host 'f01-h33-000-r620.rdu2.scalelab.redhat.com' to pool 'Managemen
t', as the master domain is in inactive/unknown status - not failing the operation



4. L0_Group_0 cluster: became inactive at 2020-06-10 13:49:44,907-04

2020-06-10 13:49:44,907-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (EE-ManagedThreadFactory-engine-Thread-12) [6097cc9e] START, ReconstructMasterVDSCommand(HostName = f01-h1
5-000-r620.rdu2.scalelab.redhat.com, ReconstructMasterVDSCommandParameters:{hostId='4b58778a-0638-47ee-ab22-359a4b684708', vdsSpmId='2', storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storagePoolName='L0_
Group_0', masterDomainId='cc8dc275-e0af-493d-932e-6af775fe96ff', masterVersion='3', domainsList='[StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='6c2a
a70d-678d-4fe9-b372-c6ab605af0e9'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='3fbd08a8-18dc-4624-9f22-f55cce2fb386'}', stat
us='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='16dd0382-c10a-4f8c-9d9a-40b4b51536e7'}', status='Inactive'}, StoragePoolIsoMap:{id='St
oragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='a60b8cde-99c5-4516-ad65-8221e8d35802'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a7
4-b6f3-447f-8b79-96aef9b0f174', storageId='fa9355a4-61d0-4827-9a83-6d00f045459b'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId
='2e7d7e21-917b-4202-b4bc-ba5e6fffa558'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='d9b502e0-87ab-4062-bba4-483a435a5f8b'}'
, status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cbee9a1a-8048-4507-b123-6d212d407948'}', status='Inactive'}, StoragePoolIsoMap:{
id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='feae8e8a-1a3c-4743-b993-5e7c501f3e3f'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8
c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cc8dc275-e0af-493d-932e-6af775fe96ff'}', status='Unknown'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', stor
ageId='4af7fccf-f1bf-4f0c-9b95-ea972b72afa0'}', status='Inactive'}]'}), log id: 2b4c019c
2020-06-10 13:49:45,677-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ReconstructMasterVDSCommand] (EE-ManagedThreadFactory-engine-Thread-12) [6097cc9e] Command 'ReconstructMasterVDSCommand(HostName = f01-
h15-000-r620.rdu2.scalelab.redhat.com, ReconstructMasterVDSCommandParameters:{hostId='4b58778a-0638-47ee-ab22-359a4b684708', vdsSpmId='2', storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storagePoolName='L
0_Group_0', masterDomainId='cc8dc275-e0af-493d-932e-6af775fe96ff', masterVersion='3', domainsList='[StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='6c
2aa70d-678d-4fe9-b372-c6ab605af0e9'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='3fbd08a8-18dc-4624-9f22-f55cce2fb386'}', st
atus='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='16dd0382-c10a-4f8c-9d9a-40b4b51536e7'}', status='Inactive'}, StoragePoolIsoMap:{id='
StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='a60b8cde-99c5-4516-ad65-8221e8d35802'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8
a74-b6f3-447f-8b79-96aef9b0f174', storageId='fa9355a4-61d0-4827-9a83-6d00f045459b'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storage
Id='2e7d7e21-917b-4202-b4bc-ba5e6fffa558'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='d9b502e0-87ab-4062-bba4-483a435a5f8b'
}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cbee9a1a-8048-4507-b123-6d212d407948'}', status='Inactive'}, StoragePoolIsoMap
:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='feae8e8a-1a3c-4743-b993-5e7c501f3e3f'}', status='Inactive'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId=
'8c1e8a74-b6f3-447f-8b79-96aef9b0f174', storageId='cc8dc275-e0af-493d-932e-6af775fe96ff'}', status='Unknown'}, StoragePoolIsoMap:{id='StoragePoolIsoMapId:{storagePoolId='8c1e8a74-b6f3-447f-8b79-96aef9b0f174', st
orageId='4af7fccf-f1bf-4f0c-9b95-ea972b72afa0'}', status='Inactive'}]'})' execution failed: VDSGenericException: VDSErrorException: Failed to ReconstructMasterVDS, error = Storage domain does not exist: ('cc8dc2
75-e0af-493d-932e-6af775fe96ff',), code = 358


L0_Group_0 has 11 Storage Domains (vdsm log).
Errors start at 2020-06-10 13:15:53,061-0400 (ERROR).

2020-06-10 13:15:51,810-0400 INFO  (monitor/d9b502e) [storage.StorageDomainCache] Removing domain d9b502e0-87ab-4062-bba4-483a435a5f8b from storage domain cache (sdc:211)
2020-06-10 13:15:51,856-0400 INFO  (monitor/fa9355a) [storage.StorageDomainCache] Removing domain fa9355a4-61d0-4827-9a83-6d00f045459b from storage domain cache (sdc:211)
2020-06-10 13:15:51,869-0400 INFO  (monitor/4af7fcc) [storage.StorageDomainCache] Removing domain 4af7fccf-f1bf-4f0c-9b95-ea972b72afa0 from storage domain cache (sdc:211)
2020-06-10 13:15:51,883-0400 INFO  (monitor/a60b8cd) [storage.StorageDomainCache] Removing domain a60b8cde-99c5-4516-ad65-8221e8d35802 from storage domain cache (sdc:211)
2020-06-10 13:15:52,765-0400 WARN  (monitor/d9b502e) [storage.LVM] Removing stale VG d9b502e0-87ab-4062-bba4-483a435a5f8b (lvm:656)
2020-06-10 13:15:53,061-0400 ERROR (monitor/d9b502e) [storage.Monitor] Error checking domain d9b502e0-87ab-4062-bba4-483a435a5f8b (monitor:423)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 404, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 139, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sdc.py", line 156, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 1843, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 955, in __init__
    manifest = self.manifestClass(sdUUID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 424, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 346, in __init__
    self._alignment = metadata.get(DMDK_ALIGNMENT, sc.ALIGNMENT_1M)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 119, in get
    return dec(self._dict[key])
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 224, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 176, in _accessWrapper
    self._refresh()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/persistent.py", line 245, in _refresh
    lines = self._metaRW.readlines()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 364, in readlines
    vg = lvm.getVG(self._vgName)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1273, in getVG
    raise se.VolumeGroupDoesNotExist(vgName)
vdsm.storage.exception.VolumeGroupDoesNotExist: Volume Group does not exist: ('d9b502e0-87ab-4062-bba4-483a435a5f8b',)
2020-06-10 13:15:53,061-0400 INFO  (monitor/d9b502e) [storage.Monitor] Domain d9b502e0-87ab-4062-bba4-483a435a5f8b became INVALID (monitor:468)



Order in which the Storage Domains were lost:
2020-06-10   13:16:18,546-0400   ('d9b502e0-87ab-4062-bba4-483a435a5f8b',) - L0_Group_0_Extra10 StorageDomain
2020-06-10   13:22:18,149-0400   ('feae8e8a-1a3c-4743-b993-5e7c501f3e3f',) - L0_Group_0_Extra9 StorageDomain
2020-06-10   13:27:18,139-0400   ('cbee9a1a-8048-4507-b123-6d212d407948',) - L0_Group_0_Extra8 StorageDomain
2020-06-10   13:31:18,211-0400   ('fa9355a4-61d0-4827-9a83-6d00f045459b',) - L0_Group_0_Extra7 StorageDomain
2020-06-10   13:32:18,231-0400   ('2e7d7e21-917b-4202-b4bc-ba5e6fffa558',) - L0_Group_0_Extra6 StorageDomain
2020-06-10   13:36:18,963-0400   ('a60b8cde-99c5-4516-ad65-8221e8d35802',) - L0_Group_0_Extra5 StorageDomain
2020-06-10   13:42:18,148-0400   ('16dd0382-c10a-4f8c-9d9a-40b4b51536e7',) - L0_Group_0_Extra4 StorageDomain
2020-06-10   13:43:17,755-0400   ('3fbd08a8-18dc-4624-9f22-f55cce2fb386',) - L0_Group_0_Extra3 StorageDomain
2020-06-10   13:47:21,672-0400   ('cc8dc275-e0af-493d-932e-6af775fe96ff',) - L0_Group_0_Extra1 StorageDomain
2020-06-10   13:47:22,534-0400   ('6c2aa70d-678d-4fe9-b372-c6ab605af0e9',) - L0_Group_0_Extra2 StorageDomain
2020-06-10   13:48:52,357-0400   ('4af7fccf-f1bf-4f0c-9b95-ea972b72afa0',) - L0_Group_0 StorageDomain


Host F01-H15 /var/log/messages
Jun 10 13:16:18 f01-h15-000-r620 vdsm[16262]: ERROR Unhandled exception in <Task discardable <UpdateVolumes vm=c26f2e35-ec7c-481e-ae9c-b151f73bc339 at 0x7f06906b7908> timeout=30.0, duration=1.23 at 0x7f06932f4c88>#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 315, in _execute_task#012    task()#012  File "/usr/lib/python3.6/site-packages/vdsm/executor.py", line 391, in __call__#012    self._callable()#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 315, in __call__#012    self._execute()#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 357, in _execute#012    self._vm.updateDriveVolume(drive)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3813, in updateDriveVolume#012    vmDrive.volumeID)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5571, in _getVolumeSize#012    (domainID, volumeID))#012vdsm.virt.vm.StorageUnavailableError: Unable to get volume size for domain d9b502e0-87ab-4062-bba4-483a435a5f8b volume 54faeb78-397d-45d8-b987-c5acff85bc36

Comment 11 Vojtech Juranek 2020-06-16 15:26:10 UTC
The flow was as follows:
The monitor checks the domain status and refreshes the domain LVM cache every 5 minutes - it removes the domain from the cache and loads it again (see the schematic sketch at the end of this comment).
On 2020-06-10 06:30:52 this reload failed (see vdsm.log.72.xz) - the vgs command probably did not return anything, so vdsm started to consider this VG as stale and began removing the domain remnants (see supervdsm.log.4.xz):

    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,535::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call devicemapper_removeMapping with ('a131cd1c--4973--43c7--ac8b--b6987cc0356d-xleases',) {}
    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,535::commands::153::common.commands::(start) /usr/bin/taskset --cpu-list 0-23 /usr/sbin/dmsetup remove a131cd1c--4973--43c7--ac8b--b6987cc0356d-xleases (cwd None)
    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,558::commands::98::common.commands::(run) SUCCESS: <err> = b''; <rc> = 0
    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,558::supervdsm_server::100::SuperVdsm.ServerCallback::(wrapper) return devicemapper_removeMapping with None

Removal of some devices failed because the device was busy, e.g.:

    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,591::supervdsm_server::93::SuperVdsm.ServerCallback::(wrapper) call devicemapper_removeMapping with ('a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids',) {}
    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,591::commands::153::common.commands::(start) /usr/bin/taskset --cpu-list 0-23 /usr/sbin/dmsetup remove a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids (cwd None)
    MainProcess|monitor/a131cd1::DEBUG::2020-06-10 06:30:52,599::commands::98::common.commands::(run) FAILED: <err> = b'device-mapper: remove ioctl on a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids  failed: Device or resource busy\nCommand failed.\n'; <rc> = 1
    MainProcess|monitor/a131cd1::ERROR::2020-06-10 06:30:52,599::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper) Error in devicemapper_removeMapping
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 140, in removeMapping
        commands.run(cmd)
      File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
        raise cmdutils.Error(args, p.returncode, out, err)
    vdsm.common.cmdutils.Error: Command ['/usr/sbin/dmsetup', 'remove', 'a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids'] failed with rc=1 out=b'' err=b'device-mapper: remove ioctl on a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids  failed: Device or resource busy\nCommand failed.\n'


Therefore some devices were removed, and there are failures in other logs, e.g. in the sanlock log:

2020-06-10 06:30:58 1196131 [1385]: open error -1 /dev/a131cd1c-4973-43c7-ac8b-b6987cc0356d/leases

as this device was removed (see above), while some other devices were still mapped via device mapper:

    [root@f01-h33-000-r620 ~]# dmsetup table
    a131cd1c--4973--43c7--ac8b--b6987cc0356d-fa50be97--26d5--407f--87ad--bd43a309285b: 0 419430400 linear 253:2 10225664
    a131cd1c--4973--43c7--ac8b--b6987cc0356d-ids: 0 262144 linear 253:2 526336
    3600a098038304437415d4b6a59684967: 0 10737418240 multipath 3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 2 2 8:48 1 1 8:64 1 1 service-time 0 2 2 8:16 1 1 8:32 1 1 
    a131cd1c--4973--43c7--ac8b--b6987cc0356d-93e65dfd--b5a6--4fe4--bb00--8f31727c63bd: 0 2097152 linear 253:2 429656064
    a131cd1c--4973--43c7--ac8b--b6987cc0356d-master: 0 2097152 linear 253:2 7604224
    vg_f01--h33--000--r620-lv_swap: 0 16777216 linear 8:2 2048
    vg_f01--h33--000--r620-lv_root: 0 1934614528 linear 8:2 16779264
    a131cd1c--4973--43c7--ac8b--b6987cc0356d-60c303e3--033a--4d9e--8011--527069fdccc3: 0 33030144 linear 253:2 441976832
    
So the main question is why the VG reload (https://github.com/oVirt/vdsm/blob/v4.40.17/lib/vdsm/storage/lvm.py#L603) apparently did not return any value.
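
To make the flow easier to follow, here is a rough, simplified sketch (in Python) of the periodic check and cache refresh described above; the names are illustrative only and do not correspond to the real vdsm classes:

    # Rough sketch of the periodic domain check and LVM cache refresh described
    # above. Names and structure are illustrative, not the actual vdsm code.
    import logging
    import time

    log = logging.getLogger("monitor-sketch")

    CHECK_INTERVAL = 10        # seconds between two health checks
    REFRESH_INTERVAL = 300     # LVM cache refresh period (~5 minutes)

    def monitor_domain(sd_uuid, sd_cache, stop_event):
        """Periodically check one storage domain; stop_event is a threading.Event."""
        last_refresh = 0.0
        while not stop_event.wait(CHECK_INTERVAL):
            try:
                if time.monotonic() - last_refresh > REFRESH_INTERVAL:
                    # Drop the domain from the cache; the next produce() reloads it,
                    # which runs 'vgs' under the hood. If 'vgs' reports nothing,
                    # the stale-VG handling in lvm.py treats the VG as gone and
                    # removes its device-mapper mappings - the behaviour at fault.
                    sd_cache.invalidate(sd_uuid)
                    last_refresh = time.monotonic()
                domain = sd_cache.produce(sd_uuid)
                domain.selftest()
            except Exception:
                # The real monitor marks the domain INVALID here, which engine
                # eventually reports as "MD read error".
                log.exception("Error checking domain %s", sd_uuid)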

Comment 12 Nir Soffer 2020-06-16 16:28:27 UTC
(In reply to Vojtech Juranek from comment #11)
I think the issue is removing the VG mapping when the VG fails to reload:

 652             for name in staleVGs:
 653                 if name in self._vgs:
 654                     log.warning("Removing stale VG %s", name)
 655                     removeVgMapping(name)
 656                     del self._vgs[name]

This is completely wrong. The fact that lvm failed to return information
about this VG does not mean that the VG does not exist or is unused.

Removing VG mapping should be done only when we deactivate a storage
domain, as part of StorageDomain.cleanup() flow.
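
As a rough illustration of that direction (a simplified sketch with made-up names, not the actual vdsm code or the patch attached to this bug), the reload path would only touch the in-memory cache, while device-mapper mappings would be removed only from an explicit deactivation/cleanup path:

    # Simplified sketch; names are made up and this is not the real implementation.
    import logging
    import subprocess

    log = logging.getLogger("lvm-cache-sketch")

    class LvmCacheSketch:

        def __init__(self):
            self._vgs = {}   # VG name -> VG info, as reported by 'vgs'

        def reload(self, reported_vgs):
            """Refresh the cache from one 'vgs' run.

            A VG missing from a single report may be a transient error, so it
            is only dropped from the cache; its device-mapper mappings are
            left alone.
            """
            for name in set(self._vgs) - set(reported_vgs):
                log.warning("VG %s not reported by vgs, dropping it from the cache", name)
                del self._vgs[name]
            self._vgs.update(reported_vgs)

        def deactivate_domain(self, vg_name, dm_devices):
            """Remove device-mapper mappings only on explicit domain deactivation/cleanup."""
            for dev in dm_devices:
                # 'dmsetup remove <vg>-<lv>' is what supervdsm runs in the logs above.
                subprocess.run(["dmsetup", "remove", dev], check=False)
            self._vgs.pop(vg_name, None)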

The removeVgMapping() call was added in:

commit a69f8a08b5c11f984254d6fd8bf50c028b0ee799
Author: Saggi Mizrahi <smizrahi>
Date:   Sun May 15 10:24:58 2011 +0300

    BZ#701671 - Create PV should account for stale device mapper mappings
    
    Change-Id: Idc08f4d4f2042513696b8539531758ddee3c87cc
    Reviewed-on: http://gerrit.usersys.redhat.com/355
    Reviewed-by: Dan Kenigsberg <danken>
    Tested-by: Dan Kenigsberg <danken>

The change is included in this repo:
git://gerrit.eng.lab.tlv.redhat.com/vdsm

which includes vdsm source from before it was open source.

I'm not sure how the bug is related to that huge patch. It looks
like the author tried to make sure the VG is removed completely, including
stale LVs, so he added this call in the reload flow to make sure
that non-existing VGs do not leave stale LVs.

We need to remove the removeVgMapping() call.

We need to understand why lvm does not return info for this storage domain
(or many storage domains), but this is a separate issue, and not a critical one.
The system must be able to handle temporary errors when accessing storage.

Comment 13 Sandro Bonazzola 2020-06-19 09:35:40 UTC
This bug moved to post but it has not been targeted yet. Can you please target?

Comment 14 Nir Soffer 2020-06-19 14:22:59 UTC
We need to backport this to 4.3; this is an important fix.

Comment 15 Nir Soffer 2020-06-19 14:36:06 UTC
Martin, are you sure the merged engine patch is related to this bug?

The root cause of the issue described in comment 0 is user error.
While provisioning other hosts in the same network, the LUN used
by this setup was wiped. We found an empty partition table in the
LUN.

This caused LVM to stop reporting this LUN and the VG using it. This
exposed the fact that vdsm removes the mapping of missing VGs during
reloading, which should only update vdsm's LVM cache.

The issue in vdsm is fixed by
https://gerrit.ovirt.org/109712

But we have another engine patch (seems unrelated to this issue):
https://gerrit.ovirt.org/109673

The engine patch probably belongs to another bug, or if this is really
related to the same bug it seems that 2 people worked on the same issue
without any coordination :-)

I guess we need to move one of the patches to another bug or create 
new bugs for the vdsm and engine issue and close this as NOTABUG.

Comment 16 Martin Perina 2020-06-19 15:59:25 UTC
(In reply to Nir Soffer from comment #15)
> Martin, are you sure the merged engine patch is related to this bug?
> 
> The root cause of the issue described in comment 0 is user error.
> While provisioning other hosts in the same network, the LUN used
> by this setup was wiped. We found an empty partition table in the
> LUN.
> 
> This caused LVM to stop reporting this LUN and the VG using it. This
> exposed the fact that vdsm removes the mapping of missing VGs during
> reloading, which should only update vdsm's LVM cache.
> 
> The issue in vdsm is fixed by
> https://gerrit.ovirt.org/109712
> 
> But we have another engine patch (seems unrelated to this issue):
> https://gerrit.ovirt.org/109673
> 
> The engine patch probably belongs to another bug, or if this is really
> related to the same bug it seems that 2 people worked on the same issue
> without any coordination :-)

The engine patch fixes the NPE on the engine side, but as I already mentioned, fixing the NPE will not resolve the main issue.

> 
> I guess we need to move one of the patches to another bug or create 
> new bugs for the vdsm and engine issue and close this as NOTABUG.

Feel free to keep this bug only for the storage-related issue.

Comment 20 David Vaanunu 2020-07-07 10:00:34 UTC
The system was reinstalled.
We ran our scale tests (~10 times, ~150 hours of testing),
scanning the engine & vdsm log files.
The following errors were not found:

vdsm: "Volume Group does not exist"
engine: "ConnectDomainToStorageCommand"

Comment 21 Sandro Bonazzola 2020-07-08 08:25:25 UTC
This bugzilla is included in oVirt 4.4.1 release, published on July 8th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 22 Eyal Shenitzky 2021-03-11 11:05:33 UTC
*** Bug 1831009 has been marked as a duplicate of this bug. ***

