Bug 1638096 - SHE disaster recovery is broken in new 4.2 deployments as hosted_storage is master
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.2.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1568841 1620314
Blocks:
Reported: 2018-10-10 16:27 UTC by RHV bug bot
Modified: 2021-09-09 15:31 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The self-hosted engine backup and restore flow has been improved, and now works correctly when the self-hosted engine storage domain is defined as the master storage domain.
Clone Of: 1620314
Environment:
Last Closed: 2019-05-08 12:32:03 UTC
oVirt Team: Integration
Target Upstream Version:
lsvaty: testing_plan_complete-


Links
System                 ID         Private  Priority  Status  Summary                               Last Updated
Red Hat Issue Tracker  RHV-43592  0        None      None    None                                  2021-09-09 15:31:12 UTC
oVirt gerrit           91835      0        'None'    MERGED  ansible: support restore procedure    2020-07-31 14:51:50 UTC
oVirt gerrit           94455      0        'None'    MERGED  ansible: support restore procedure    2020-07-31 14:51:47 UTC
oVirt gerrit           94477      0        'None'    MERGED  move RESTORE_FROM_FILE initialization 2020-07-31 14:51:47 UTC
oVirt gerrit           94479      0        'None'    MERGED  move RESTORE_FROM_FILE initialization 2020-07-31 14:51:47 UTC

Description RHV bug bot 2018-10-10 16:27:19 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1620314 +++
======================================================================

Description of problem:

Since 4.2, the ansible SHE deployment creates a Data-Center already initialized with the hosted_storage as master. Currently, there is no way to change the master role to another SD[1].

The problem is that having hosted_storage as master breaks the Backup/Restore and Move HE SD procedures, because the master SD is wiped from the DB by the --he-remove-storage-vm option passed to engine-backup.

As a result, the Backup+Restore, Disaster Recovery, and Move HE SD procedures are broken: running any of them leaves the environment unusable. Many operations fail because the master domain is missing, and reinitializing the DC fails with an NPE. In more detail:

1) All previous Storage Domains in Down State (no master)

2018-08-23 11:07:26,350+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [4396d107] Command 'SpmStatusVDSCommand(HostName = host2.rhvlab, SpmStatusVDSCommandParameters:{hostId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', storagePoolId='e94f9b5a-9ac4-11e8-8748-52540015c1ff'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Error validating master storage domain: ('MD read error',)

2) New HE SD fails to import (no master)

2018-08-23 11:06:44,922+10 WARN  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-39) [7a221e26] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE

3) Hosts in Non-Operational state (cannot connect to storage pool)

2018-08-23 11:06:31,471+10 ERROR [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (EE-ManagedThreadFactory-engine-Thread-32) [3a009015] Exception during process of events for pool 'e94f9b5a-9ac4-11e8-8748-52540015c1ff': java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

2018-08-23 11:06:34,708+10 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] Command 'ConnectStoragePoolVDSCommand(HostName = host2.rhvlab, ConnectStoragePoolVDSCommandParameters:{hostId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', vdsId='28e4cef6-6721-41b3-b15d-9906f46a8a0a', storagePoolId='e94f9b5a-9ac4-11e8-8748-52540015c1ff', masterVersion='1'})' execution failed: IRSGenericException: IRSErrorException: IRSNoMasterDomainException: Cannot find master domain: u'spUUID=e94f9b5a-9ac4-11e8-8748-52540015c1ff, msdUUID=00000000-0000-0000-0000-000000000000'

4) Cannot reinitialize the DC: "There are no compatible Storage Domains to attach to this Data Center. Please add new Storage from the Storage tab." After new storage is added, the operation hits an NPE:

2018-08-23 11:28:48,403+10 ERROR [org.ovirt.engine.core.bll.storage.pool.RecoveryStoragePoolCommand] (default task-8) [266ee7fc-73c9-4b4a-bd32-ca69a9b42b8f] Error during ValidateFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.validator.storage.StorageDomainValidator.isInProcess(StorageDomainValidator.java:387) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.pool.RecoveryStoragePoolCommand.validate(RecoveryStoragePoolCommand.java:67) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:779) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.validateOnly(CommandBase.java:368) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.canRunActions(PrevalidatingMultipleActionsRunner.java:113) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.invokeCommands(PrevalidatingMultipleActionsRunner.java:99) [bll.jar:]
        at org.ovirt.engine.core.bll.PrevalidatingMultipleActionsRunner.execute(PrevalidatingMultipleActionsRunner.java:76) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runMultipleActionsImpl(Backend.java:596) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runMultipleActions(Backend.java:566) [bll.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_181]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_181]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_181]
        at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_181]
        at org.jboss.as.ee.component.ManagedReferenceMethodInterceptor.processInvocation(ManagedReferenceMethodInterceptor.java:52)
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)
        at org.jboss.as.weld.interceptors.Jsr299BindingsInterceptor.delegateInterception(Jsr299BindingsInterceptor.java:78)

5) Possibly unrelated, but a command sent via the API from hosted-engine --deploy on the host failed with an NPE on the engine:

[ ERROR ] Cannot automatically set CPU level of cluster Default:
         General command validation failure.

2018-08-23 11:06:29,961+10 ERROR [org.ovirt.engine.core.bll.UpdateClusterCommand] (default task-4) [67d94cdc-8a73-4be8-b314-8b04aaa16b4c] Error during ValidateFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.UpdateClusterCommand.getEmulatedMachineOfHostInCluster(UpdateClusterCommand.java:446) [bll.jar:]
        at org.ovirt.engine.core.bll.UpdateClusterCommand.isSupportedEmulatedMachinesMatchClusterLevel(UpdateClusterCommand.java:769) [bll.jar:]
        at org.ovirt.engine.core.bll.UpdateClusterCommand.validate(UpdateClusterCommand.java:640) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.internalValidate(CommandBase.java:779) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:393) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:468) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runActionImpl(Backend.java:450) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:403) [bll.jar:]
        at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source) [:1.8.0_181]
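
To check whether an engine.log shows symptoms 1) to 4) above, a small triage helper can scan for the strings that appear in those excerpts. This is only a sketch: the signature strings are copied from the log lines quoted in this report, while the label names and the functions classify and summarize are made up for illustration and are not part of any oVirt tooling.

```python
import re

# Signature strings are taken from the log excerpts in this report.
# The label names on the left are illustrative only.
SIGNATURES = {
    "no_master_domain": re.compile(r"IRSNoMasterDomainException"),
    "he_sd_import_blocked": re.compile(
        r"ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE"
    ),
    "null_master_uuid": re.compile(
        r"msdUUID=00000000-0000-0000-0000-000000000000"
    ),
    "recovery_npe": re.compile(
        r"RecoveryStoragePoolCommand.*NullPointerException"
    ),
}


def classify(line):
    """Return the sorted labels of every signature found in one log line."""
    return sorted(label for label, rx in SIGNATURES.items() if rx.search(line))


def summarize(lines):
    """Count how many lines match each signature."""
    counts = {label: 0 for label in SIGNATURES}
    for line in lines:
        for label in classify(line):
            counts[label] += 1
    return counts
```

Assuming the usual log location, something like summarize(open("/var/log/ovirt-engine/engine.log")) would give a per-symptom count. Non-zero counts only show that the same error strings are present, not that this bug is the cause.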

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.5.3-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 4.2 SHE (ansible)
2. Run the Backup+Restore SHE procedure[2] or Move SHE SD procedure[3]
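
Before step 2, a guard along the following lines could flag the affected configuration. It is a rough sketch under stated assumptions: the list-of-dicts shape for the storage domains (a "name" string and a boolean "master") is a simplification invented here, not the engine's schema, and the helper names are hypothetical. The command line it builds uses only the engine-backup options --mode, --file, --log and --he-remove-storage-vm; a real restore typically needs further options that are left out.

```python
def hosted_storage_is_master(domains, he_sd_name="hosted_storage"):
    """Return True if the domain named he_sd_name carries the master role.

    `domains` is an assumed, simplified shape: a list of dicts with a
    "name" string and a boolean "master" key.
    """
    return any(
        d.get("name") == he_sd_name and d.get("master") is True
        for d in domains
    )


def restore_argv(backup_file, log_file, domains):
    """Build an engine-backup restore command line, refusing the risky case.

    The command is only assembled as a list, never executed.
    """
    if hosted_storage_is_master(domains):
        raise RuntimeError(
            "hosted_storage is master: --he-remove-storage-vm "
            "would remove the master SD from the DB"
        )
    return [
        "engine-backup",
        "--mode=restore",
        f"--file={backup_file}",
        f"--log={log_file}",
        "--he-remove-storage-vm",
    ]
```

With a pre-4.2 style layout, where another data domain is master, restore_argv returns the argument list; with a fresh 4.2 ansible deployment it raises instead of letting the restore wipe the master domain.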

Additional info:
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1576923
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1420604#c71
[3] https://access.redhat.com/solutions/2998291

(Originally by Germano Veit Michel)

Comment 1 RHV bug bot 2018-10-10 16:27:27 UTC
Created attachment 1478022 [details]
engine logs

(Originally by Germano Veit Michel)

Comment 3 RHV bug bot 2018-10-10 16:27:36 UTC
Created attachment 1478022 [details]
engine logs

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-10-10 16:27:40 UTC
hosted engine deployment - moving to Integration

(Originally by michal.skrivanek)

Comment 5 RHV bug bot 2018-10-10 16:27:44 UTC
For backup/restore operations we are currently asking the user to run hosted-engine-setup with --noansible option to force the old flow where the hosted-engine SD is not the master one and the user can still manually run engine-backup before engine-setup.

Issues of this kind related to the new flow will be handled as per https://bugzilla.redhat.com/1469908

(Originally by Simone Tiraboschi)

Comment 6 RHV bug bot 2018-10-10 16:27:49 UTC
(In reply to Simone Tiraboschi from comment #3)
> For backup/restore operations we are currently asking the user to run
> hosted-engine-setup with --noansible option to force the old flow where the
> hosted-engine SD is not the master one and the user can still manually run
> engine-backup before engine-setup.

Hi Simone,

This is exactly what was done. It's the old --noansible flow: as per comment #0, --he-remove-storage-vm was used to restore the backup before engine-setup. This works fine if the HE SD is not master. But on fresh 4.2 deployments the HE SD is always master and cannot be changed, so the procedure is now broken.

(Originally by Germano Veit Michel)

Comment 10 RHV bug bot 2018-10-10 16:28:07 UTC
This also breaks SHE > baremetal migration.

(Originally by Germano Veit Michel)

Comment 12 RHV bug bot 2018-10-10 16:28:16 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com

(Originally by rhv-bugzilla-bot)

Comment 18 RHV bug bot 2018-10-10 16:28:43 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com

INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com

(Originally by rhv-bugzilla-bot)

Comment 25 Raz Tamir 2018-10-10 16:35:29 UTC
Setting blocker according to comment #7

Comment 26 Nikolai Sednev 2018-11-12 11:36:56 UTC
Works for me on these components:
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch

Moving to verified.

Comment 28 errata-xmlrpc 2019-05-08 12:32:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1050

Comment 29 Daniel Gur 2019-08-28 13:12:34 UTC
sync2jira

Comment 30 Daniel Gur 2019-08-28 13:16:47 UTC
sync2jira

