Bug 1620314 - [downstream clone - 4.2.7] SHE disaster recovery is broken in new 4.2 deployments as hosted_storage is master
Summary: [downstream clone - 4.2.7] SHE disaster recovery is broken in new 4.2 deployments as hosted_storage is master
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.2.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.7-1
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1469908 1568841
Blocks: 1420604 1638096 ovirt-hosteded-engine-setup-2.2.32
 
Reported: 2018-08-23 01:40 UTC by Germano Veit Michel
Modified: 2022-03-13 15:26 UTC (History)
10 users

Fixed In Version: ovirt-hosted-engine-setup-2.2.31-1.el7ev.noarch.rpm
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1638096 (view as bug list)
Environment:
Last Closed: 2018-11-20 09:32:41 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
engine logs (79.30 KB, application/x-xz)
2018-08-23 02:06 UTC, Germano Veit Michel


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1576923 0 unspecified CLOSED RFE: Ability to move master role to another domain without putting the domain to maintenance 2024-03-25 15:04:13 UTC
Red Hat Issue Tracker RHV-43589 0 None None None 2021-09-09 15:27:39 UTC
Red Hat Knowledge Base (Solution) 2998291 0 None None None 2018-10-25 04:33:59 UTC
oVirt gerrit 91835 0 None MERGED ansible: support restore procedure 2020-06-11 07:59:17 UTC
oVirt gerrit 94455 0 None MERGED ansible: support restore procedure 2020-06-11 07:59:17 UTC
oVirt gerrit 94477 0 None MERGED move RESTORE_FROM_FILE initialization 2020-06-11 07:59:15 UTC
oVirt gerrit 94479 0 None MERGED move RESTORE_FROM_FILE initialization 2020-06-11 07:59:15 UTC

Internal Links: 1576923

Description Germano Veit Michel 2018-08-23 01:40:51 UTC
Description of problem:

Since 4.2, the ansible SHE deployment creates a Data Center already initialized with hosted_storage as the master SD. Currently, there is no way to move the master role to another SD[1].

The problem is that having hosted_storage as master breaks the Backup/Restore and Move HE SD procedures, because the master SD is wiped from the DB by the --he-remove-storage-vm option passed to engine-backup.

As a result, the Backup+Restore, Disaster Recovery, and Move HE SD procedures are all broken: running any of them leaves an unusable environment, many operations fail due to the missing master domain, and reinitializing the DC fails with an NPE. In more detail:

1) All previous Storage Domains in Down State (no master)

2018-08-23 11:07:26,350+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [4396d107] Command 'SpmStatusVDSCommand(HostName = host2.rhvlab, SpmStatusVDSCommandParameters:{hostId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', storagePoolId='e94f9b5a-9ac4-11e8-8748-52540015c1ff'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

2) New HE SD fails to import (no master)

2018-08-23 11:06:44,922+10 WARN  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-39) [7a221e26] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE

3) Hosts in Non-Operational state (cannot connect to storage pool)

2018-08-23 11:06:31,471+10 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (EE-ManagedThreadFactory-engine-Thread-32) [3a009015] Exception during process of events for pool 'e94f9b5a-9ac4-11e8-8748-52540015c1ff': java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

2018-08-23 11:06:34,708+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Command 'ConnectStoragePoolVDSCommand(HostName = host2.rhvlab, ConnectStoragePoolVDSCommandParameters:{hostId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', vdsId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', storagePoolId='e94f9b5a-9ac4-11e8-8748-52540015c1ff', masterVersion='1'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=e94f9b5a-9ac4-11e8-8748-52540015c1ff, msdUUID=00000000-0000-0000-0000-000000000000'

4) Cannot Reinitialize the DC. "There are no compatible Storage Domains to attach to this Data Center. Please add new Storage from the Storage tab.". Then after adding new storage, it hits a NPE:

2018-08-23 11:28:48,403+10 ERROR [org.ovirt.engine.core.bll.storage.pool.RecoveryStoragePoolCommand] (default task-8) [266ee7fc-73c9-4b4a-bd32-ca69a9b42b8f] Error during ValidateFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.validator.storage.StorageDomainValidator.isInProcess(StorageDomainValidator.java:387) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.pool.RecoveryStoragePoolCommand.validate(RecoveryStoragePoolCommand.java:67) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:779) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.validateOnly(CommandBase.java:368) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.canRunActions(PrevalidatingMultipleActionsRunner.java:113) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.invokeCommands(PrevalidatingMultipleActionsRunner.java:99) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.execute(PrevalidatingMultipleActionsRunner.java:76) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runMultipleActionsImpl(Backend.java:596) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runMultipleActions(Backend.java:566) [bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_181]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)
        at org.jboss.as.weld.interceptors.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:78)

5) Not sure if this one is related, but a command sent via the API from hosted-engine --deploy on the host failed with an NPE on the engine:

[ ERROR ] Cannot automatically set CPU level of cluster Default:
         General command validation failure.

2018-08-23 11:06:29,961+10 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-4) [67d94cdc-8a73-4be8-b314-8b04aaa16b4c] Error during ValidateFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.UpdateClusterCommand.getEmulatedMachineOfHostInCluster(UpdateClusterCommand.java:446) [bll.jar:]
        at org.ovirt.engine.core.bll.UpdateClusterCommand.isSupportedEmulatedMachinesMatchClusterLevel(UpdateClusterCommand.java:769) [bll.jar:]
        at org.ovirt.engine.core.bll.UpdateClusterCommand.validate(UpdateClusterCommand.java:640) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:779) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:393) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:468) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:450) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:403) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source) [:1.8.0_181]
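The failure chain above can be sketched with a toy model (illustrative only, not RHV/vdsm code; the UUIDs and the stand-in function are made up):

```shell
# Toy model: why wiping the master SD from the DB leaves the pool unrecoverable.

ZERO_UUID="00000000-0000-0000-0000-000000000000"

# vdsm's ConnectStoragePool needs a resolvable master SD UUID (msdUUID);
# this stand-in mimics the check behind the log in failure 3 above.
connect_storage_pool() {
    msd_uuid="$1"
    if [ "$msd_uuid" = "$ZERO_UUID" ]; then
        echo "IRSNoMasterDomainException: Cannot find master domain: msdUUID=$msd_uuid"
        return 1
    fi
    echo "connected"
}

# Before restore: hosted_storage holds the master role (4.2 ansible deployments).
master_uuid="c0ffee00-0000-0000-0000-000000000001"
connect_storage_pool "$master_uuid"          # prints "connected"

# engine-backup --he-remove-storage-vm deletes the HE SD rows; since the HE SD
# is also the master, the pool is left pointing at the all-zero UUID.
master_uuid="$ZERO_UUID"
connect_storage_pool "$master_uuid" || true  # IRSNoMasterDomainException
```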

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.5.3-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 4.2 SHE (ansible)
2. Run the Backup+Restore SHE procedure[2] or Move SHE SD procedure[3]
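The restore step that triggers the bug boils down to something like the following sketch (file names are placeholders; see [2] for the full procedure):

```shell
# Sketch of the backup/restore flow from [2]; file names are placeholders.

# On the old environment: take a full engine backup.
engine-backup --mode=backup --file=engine.backup --log=backup.log

# On the freshly reinstalled engine VM: restore the backup before engine-setup,
# stripping the old hosted-engine storage domain and HE VM from the DB.
engine-backup --mode=restore --file=engine.backup --log=restore.log \
    --provision-db --he-remove-storage-vm

# On a 4.2 ansible deployment the removed HE SD is also the master domain,
# so the restored DB ends up with no master and the failures above follow.
```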

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1576923
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1420604#c71
[3] https://access.redhat.com/solutions/2998291

Comment 1 Germano Veit Michel 2018-08-23 02:06:32 UTC
Created attachment 1478022 [details]
engine logs

Comment 2 Michal Skrivanek 2018-08-23 04:30:09 UTC
hosted engine deployment - moving to Integration

Comment 3 Simone Tiraboschi 2018-08-23 07:02:27 UTC
For backup/restore operations we are currently asking the user to run hosted-engine-setup with the --noansible option, to force the old flow where the hosted-engine SD is not the master one and the user can still manually run engine-backup before engine-setup.

These kinds of issues related to the new flow will be handled as part of https://bugzilla.redhat.com/1469908

Comment 4 Germano Veit Michel 2018-08-23 07:05:12 UTC
(In reply to Simone Tiraboschi from comment #3)
> For backup/restore operations we are currently asking the user to run
> hosted-engine-setup with --noansible option to force the old flow where the
> hosted-engine SD is not the master one and the user can still manually run
> engine-backup before engine-setup.

Hi Simone,

This is exactly what was done. It's the old --noansible flow: as per comment #0, --he-remove-storage-vm was used to restore the backup before engine-setup. This works fine if the HE SD is not the master. But on fresh 4.2 deployments the HE SD is always the master and cannot be changed, so the procedure is now broken.

Comment 8 Germano Veit Michel 2018-08-28 00:30:47 UTC
This also breaks SHE > baremetal migration.

Comment 10 RHV bug bot 2018-09-21 11:46:10 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 16 RHV bug bot 2018-10-04 12:07:17 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 24 Raz Tamir 2018-10-10 16:36:34 UTC
Setting blocker according to comment #7

Comment 27 Nikolai Sednev 2018-10-17 09:21:47 UTC
Moving to assigned per https://bugzilla.redhat.com/show_bug.cgi?id=1568841

Comment 28 Germano Veit Michel 2018-10-25 04:33:59 UTC
KCS will get updated once 4.2.7 is out.

Comment 33 Nikolai Sednev 2018-11-12 11:29:49 UTC
Per comments #15 and #32, moving to verified.
Restore/recovery is working fine when using new "hosted-engine --deploy --restore-from-file=/path/backupfilehere" option.
Tested on:
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch

Removing previous dependent bugs, as they've been moved to upcoming version releases.

Comment 34 Daniel Gur 2019-08-28 13:12:46 UTC
sync2jira

Comment 35 Daniel Gur 2019-08-28 13:16:58 UTC
sync2jira

