Bug 1469908

Summary: [RFE] - Support managed/automated restore
Product: [oVirt] ovirt-hosted-engine-setup
Reporter: Yedidyah Bar David <didi>
Component: General
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Polina <pagranat>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 2.0.1.4
CC: bugs, irosenzw, pagranat, pdwyer, stirabos
Target Milestone: ovirt-4.2.7-1
Keywords: FutureFeature
Target Release: ---
Flags: rule-engine: ovirt-4.2+
       ylavi: planning_ack+
       sbonazzo: devel_ack+
       rule-engine: testing_ack+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch.rpm
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-13 16:12:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1620314, 1644784

Attachments:
  - ovirt_hosted_engine_setup (flags: none)
  - the whole ovirt-hosted-engine-setup directory attached (flags: none)

Description Yedidyah Bar David 2017-07-12 05:21:58 UTC
Description of problem:

The current backup, and especially restore, processes are delicate and require many manual steps that must be taken in a very specific order. See also bug 1420604.

The hosted-engine-setup project already has a tool to migrate an engine VM from el6/3.6 to el7/4.0 using backup/restore. I think it should not require too much work to reuse the code from this tool to automate the restore process, at least partially, and perhaps also the backup (even though that is less of an issue).
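
For context, the manual flow this RFE would automate centers on engine-backup. A minimal sketch of the two halves (file names are illustrative; the exact restore options vary by setup, see the engine-backup documentation):

  # On the running engine VM: back up configuration files and databases
  engine-backup --mode=backup --file=engine.backup --log=backup.log

  # On a freshly installed engine VM: restore from that file
  # (--provision-db creates the database; adjust options to your setup)
  engine-backup --mode=restore --file=engine.backup --log=restore.log \
                --provision-db --restore-permissions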

Comment 1 Yedidyah Bar David 2017-07-12 05:22:54 UTC
Yaniv, what do you say?

Comment 2 Yaniv Lavi 2017-07-12 11:57:24 UTC
I like the idea, but I don't think it will fit into 4.2.

Comment 5 Ido Rosenzwig 2018-06-05 12:05:29 UTC
*** Bug 1584154 has been marked as a duplicate of this bug. ***

Comment 6 Polina 2018-10-16 15:23:56 UTC
Verification steps on ovirt-engine-4.2.7.3-0.0.master.20181015151121.gitd6e9af9.el7.noarch:

scenario from doc https://docs.google.com/document/d/1Hyg7epVNfwSmPx9N8qaITH5vo2mGQm6Ie1JKYbFYBus/edit?ts=5bbcbe3e: 
Node 0 -> node 0
nfs->nfs 
redeploy on an env where power management is configured and all the hosts can be reached

The 4.2 upstream HE environment has two hosts: host1 (not an HE host) and host2 (an HE host). VM1 is running on the non-HE host1, and VM2 is running on the HE host2. Power management is configured on both hosts.

The steps are (consolidated as a command sketch after the list):
1. Create the backup file on the engine by running <engine-backup --mode=backup --file=backup_compute-he-4 --log=log_compute-he-4_backup4.2>. Copy the backup file aside (to a laptop).
2. Put the environment into global maintenance.
3. Clean up the HE NFS storage domain.
4. Reprovision the HE host. Copy the repos to /etc/yum.repos.d/ and run <yum install ovirt-hosted-engine-setup>.
5. On the HE host, run the restore command
<hosted-engine --deploy --restore-from-file=backup_compute-he-4>.
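
The same steps as the commands run, roughly (host and file names as in this environment):

  # 1. On the engine VM: create the backup, then copy it off the environment
  engine-backup --mode=backup --file=backup_compute-he-4 --log=log_compute-he-4_backup4.2
  # 2. On the HE host: enter global maintenance
  hosted-engine --set-maintenance --mode=global
  # 3. + 4. Clean the old HE NFS storage domain (path is environment-specific),
  #         reprovision the host, copy the repos, then:
  yum install ovirt-hosted-engine-setup
  # 5. Start the automated restore, pointing at the copied backup file
  hosted-engine --deploy --restore-from-file=backup_compute-he-4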

Result (reproduced twice): the deploy starts OK with all the questions and then hangs for a long time (I waited for two hours); then the host disconnects.
The last output lines are:
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Set FQDN]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Force the local VM FQDN to temporary resolve on the natted network address]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Restore sshd reverse DNS lookups]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Generate an answer file for engine-setup]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Include before engine-setup custom tasks files for the engine VM]
[ INFO  ] TASK [include_tasks]
[ INFO  ] ok: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Copy the backup file to the engine VM for restore]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Run engine-backup]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Remove backup file]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [include_tasks]
[ INFO  ] ok: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Find configuration file for SCL PostgreSQL]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Check SCL PostgreSQL value]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Set SCL prefix for PostgreSQL]
[ INFO  ] ok: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Remove previous hosted-engine VM]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Remove dynamic data for VMs on the host used to redeploy]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Remove host used to redeploy]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Remove previuos HE storage domain to avoid name conflicts]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Execute engine-setup]
[ INFO  ] changed: [compute-ge-he-4.scl.lab.tlv.redhat.com]
[ INFO  ] TASK [Include after engine-setup custom tasks files for the engine VM]
[ INFO  ] TASK [Wait for the engine to reach a stable condition]

The strangest thing is that I can ssh to the host (but not ping it), and in Power Management I can see that the "Host is currently off". The host can be started by power control. Still, the restore operation didn't succeed - we have no engine.

Comment 7 Simone Tiraboschi 2018-10-17 01:54:21 UTC
(In reply to Polina from comment #6)
> The most strange thing is that I can ssh to the host (and no ping). And in
> Power Management I can see that the "Host is currently off".  The host could
> be started by power control. Though the restore operation didn't succeed -
> we have no engine.

I'm pretty sure the issue is due to the power management configuration on the source environment.
Can I ask you to retest without configuring power management on the source environment, and if needed to open a specific bug like "Restoring a hosted-engine environment from backup fails if power management was configured for HE hosts"?

In the meantime, I'll try to fix this ASAP by disabling host fencing during the recovery period.
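
(For reference, one way fencing can be disabled at the cluster level is through the engine REST API; a sketch with placeholder values, not necessarily how the eventual fix works:)

  # Disable the fencing policy on a cluster (ENGINE_FQDN, ADMIN_PASSWORD and
  # CLUSTER_ID are placeholders for this environment's values)
  curl -k -u "admin@internal:$ADMIN_PASSWORD" -X PUT \
       -H 'Content-Type: application/xml' \
       -d '<cluster><fencing_policy><enabled>false</enabled></fencing_policy></cluster>' \
       "https://$ENGINE_FQDN/ovirt-engine/api/clusters/$CLUSTER_ID"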

Comment 8 Polina 2018-10-17 11:09:03 UTC
Created attachment 1494784 [details]
ovirt_hosted_engine_setup

Hi Simone,
indeed, the scenario without power management configured on the hosts behaves differently. It results in a setup failure, but the host remains on.
I will file a separate bug for the power management behavior with the suggested subject.
For this repeated scenario with no power management configured, please see the attached setup logs.

Comment 9 Polina 2018-10-17 11:20:46 UTC
Filed Bug 1640097 - Restoring an HE env from backup fails if the power management was configured for HE hosts

Comment 10 Simone Tiraboschi 2018-10-18 00:44:56 UTC
This time the issue seems to be here:

2018-10-17 13:21:53,131+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Add host]
2018-10-17 13:21:57,843+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-10-17 13:22:00,752+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Wait for the host to be up]
2018-10-17 13:35:23,737+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'deprecations': [{u'msg': u"The 'ovirt_hosts_facts' module is being renamed 'ovirt_host_facts'", u'version': 2.8}], u'_ansible_no_log': False, u'changed': False, u'attempts': 120, u'invocation': {u'module_args': {u'all_content': False, u'pattern': u'name=cougar03.scl.lab.tlv.redhat.com', u'fetch_nested': False, u'nested_attributes': []}}, u'ansible_facts': {u'ovirt_hosts': []}}
2018-10-17 13:35:23,838+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"ansible_facts": {"ovirt_hosts": []}, "attempts": 120, "changed": false, "deprecations": [{"msg": "The 'ovirt_hosts_facts' module is being renamed 'ovirt_host_facts'", "version": 2.8}]}
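
(The failing task polls the engine until the host shows up, giving up after 120 attempts as seen above; roughly the equivalent of the following shell loop, with placeholder values:)

  # Poll the engine REST API until the host is reported with status "up"
  # (ENGINE_FQDN, ADMIN_PASSWORD and HOST_NAME are placeholders)
  for attempt in $(seq 1 120); do
      curl -s -k -u "admin@internal:$ADMIN_PASSWORD" \
          "https://$ENGINE_FQDN/ovirt-engine/api/hosts?search=name%3D$HOST_NAME" \
          | grep -q '<status>up</status>' && break
      sleep 10
  done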

Can you please also attach engine.log and the host-deploy logs?

Comment 11 Polina 2018-10-18 06:27:11 UTC
Hi Simone,
engine.log is not attached since the test failed before I had an HE machine.
About host-deploy:
unfortunately, after performing the test I reprovisioned the whole environment to verify another bug, so I don't have host-deploy.log. Please let me know if you would like me to repeat the test to obtain the deploy log.

Comment 12 Simone Tiraboschi 2018-10-18 06:45:12 UTC
(In reply to Polina from comment #11)
> Hi Simone, 
> the engine.log is not attached since the test failed before I have HE
> machine.
> about host-deploy -
> unfortunately, after performing the test I reprovisioned the whole
> environment to verify another bug, so, I don't have the host-deploy.log.
> Please let me know if you would like me to repeat the test to have the
> deploy log.

The engine and host-deploy logs are collected under
/var/log/ovirt-hosted-engine-setup/engine-logs-<timestamp>
also on failures.
Can you please try reproducing and upload the whole content of /var/log/ovirt-hosted-engine-setup?
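For example, to bundle the whole directory for upload:

  tar czf ovirt-hosted-engine-setup-logs.tar.gz /var/log/ovirt-hosted-engine-setup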

I tried reproducing it on my side.
I uploaded a video here: https://asciinema.org/a/207132

Comment 13 Polina 2018-10-18 14:29:03 UTC
Created attachment 1495314 [details]
the whole ovirt-hosted-engine-setup directory attached

Hi, I reproduced the failure and attached the logs from the directory /var/log/ovirt-hosted-engine-setup.

Comment 14 Simone Tiraboschi 2018-10-22 07:52:58 UTC
2018-10-18 16:33:16,142+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-85) [] Command 'GetCapabilitiesAsyncVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='40fbf3d1-4fc5-4e8f-8242-144d0056c3b9', vds='Host[host_mixed_1,40fbf3d1-4fc5-4e8f-8242-144d0056c3b9]'})' execution failed: java.net.NoRouteToHostException: No route to host
2018-10-18 16:33:17,284+03 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to cougar03.scl.lab.tlv.redhat.com/10.35.160.95

Comment 15 Polina 2018-10-28 16:00:14 UTC
Verification is done on ovirt-hosted-engine-setup-2.2.30-1.el7ev.noarch.

The reason for reassigning:
As a result of deploying with restore from the backup file, I got a problematic environment:
  - all VMs that were running on the host where the deployment was performed are removed from the environment.
  - the HE VM is not migratable.

The detailed results are described in comments 12/13 of https://bugzilla.redhat.com/show_bug.cgi?id=1406067

Steps:
1. Pre-condition:
   all three hosts in the environment are configured with PM.
   4 VMs are up:
      vm1_0 and vm1_1 (highly available with a lease) on host 1; SD name nfs_0.
      vm2_0 and vm2_1 (highly available with a lease) on host 2; SD name nfs_0.
      HE VM is running on host1 (SD name hosted_storage).
2. 
[root@compute-ge-he-4 ~]# engine-backup --mode=backup --file=backup_compute-he-4 --log=log_compute-he-4_backup4.2
Backing up:
Notifying engine
- Files
- Engine database 'engine'
- DWH database 'ovirt_engine_history'
Packing into file 'backup_compute-he-4'
Notifying engine
Done.

3. Copy the backup file aside.
4. Put the environment into global maintenance: hosted-engine --set-maintenance --mode=global

5. Clean up the HE NFS storage domain:
rm -Rf /Compute_NFS/pagranat/compute-ge-he-4 on yellow-vdsb.qa.lab.tlv.redhat.com

6. Reprovision the HE host. Copy the repos to /etc/yum.repos.d/, run yum update -y, then
yum install http://download.eng.bos.redhat.com/brewroot/packages/ovirt-hosted-engine-setup/2.2.30/1.el7ev/noarch/ovirt-hosted-engine-setup-2.2.30-1.el7ev.noarch.rpm
7. Copy the backup file to the host and run
   hosted-engine --deploy --restore-from-file=backup_compute-he-4 (giving the NEW storage path yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/pagranat/compute-ge-he-4_restore_pm)

Comment 16 Simone Tiraboschi 2018-10-29 08:04:51 UTC
(In reply to Polina from comment #15)
> Steps:
> 1.Pre-condition: 
>    all three hosts in the environment are configured with PM. 
>    4 VMs are up: 
>       vm1_0 and vm1_1 (high available with lease) on host 1; SD name nfs_0.
>       vm2_0 and vm2_1 (high available with lease) on host 2; SD name nfs_0.
>       HE VM is running on host1 (SD name hosted_storage)

Here I think we are missing a relevant point: as per https://bugzilla.redhat.com/show_bug.cgi?id=1406067, hosted-engine-setup can only deploy hosts into the Default data center and Default cluster, but the user can still move hosts to a different cluster after setup time, and that is probably what happened in the tested scenario.

In the restore process, the host used for the redeployment will be added again to the Default cluster in the Default data center (still as per https://bugzilla.redhat.com/show_bug.cgi?id=1406067), and this also affects other VMs running on that host, which now appear as attached to a different cluster.

Can you please retest the more common scenario, where keeping hosted-engine hosts in the Default cluster is acceptable?
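
(To confirm whether a host was moved out of the Default cluster before the restore, one can look it up through the REST API; placeholders as before. The <cluster href=.../> reference in the response resolves to the cluster the host belongs to:)

  # Look up the host and inspect which cluster it currently belongs to
  curl -s -k -u "admin@internal:$ADMIN_PASSWORD" \
      "https://$ENGINE_FQDN/ovirt-engine/api/hosts?search=name%3D$HOST_NAME"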

Comment 17 Sandro Bonazzola 2018-11-05 09:51:19 UTC
Is this ready for QE?

Comment 18 Simone Tiraboschi 2018-11-05 13:09:20 UTC
(In reply to Sandro Bonazzola from comment #17)
> Is this ready for QE?

Yes, it is.
Maybe we can also evaluate including a fix for https://bugzilla.redhat.com/1645757 in that async release.

Comment 19 Polina 2018-11-12 12:55:34 UTC
The following iterations were successfully performed to verify this bug in version: ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch

1) Node 0 -> node 0, nfs -> nfs, power management configured.
2) Node 0 -> node 0, FC -> iscsi, power management not configured.
3) Vintage (4.2.5) -> node 0 (4.2.7), nfs -> iscsi, non-SPM host.
4) BM -> HE, local BM storage -> iscsi, redeployed SPM host differs from the backup.
5) BM -> HE, local BM storage -> nfs, host with spm_id != 1.

For all iterations:
 - redeploy with VMs running on other, non-HE hosts
 - redeploy with VMs running on HE hosts


The full matrix will be verified in the BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1648889

Comment 20 Sandro Bonazzola 2018-11-13 16:12:48 UTC
This bug is included in the oVirt 4.2.7 Async 1 release, published on November 13th, 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.