Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1811734

Summary: Restore from backup file gets stuck
Product: [oVirt] ovirt-hosted-engine-setup
Reporter: Nikolai Sednev <nsednev>
Component: General
Assignee: Yedidyah Bar David <didi>
Status: CLOSED WORKSFORME
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.4.2
CC: aoconnor, bugs, lsurette, lsvaty, michal.skrivanek, mtessun, stirabos
Target Milestone: ovirt-4.4.1
Keywords: Regression, Triaged
Target Release: ---
Flags: sbonazzo: ovirt-4.4?, aoconnor: blocker+, sbonazzo: planning_ack?, sbonazzo: devel_ack?, sbonazzo: testing_ack?
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-16 19:54:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1686575
Attachments:
Description Flags
sosreport from alma03 host A none

Description Nikolai Sednev 2020-03-09 16:17:58 UTC
Created attachment 1668701
sosreport from alma03 host A

Description of problem:
Restore from backup file gets stuck

Version-Release number of selected component (if applicable):

ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.2-2.el8ev.noarch
Linux 4.18.0-187.el8.x86_64 #1 SMP Sat Mar 7 03:42:33 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux release 8.2 Beta (Ootpa)


How reproducible:
100%

Steps to Reproduce:
1. Deploy vintage HE over NFS on two ha-hosts.
2. Add an iSCSI data storage domain for guest VMs.
3. Create and run 4 guest VMs, 2 on each ha-host.
4. Make sure that the first ha-host, "A", is SPM and that the HE-VM is running on it.
5. Create a non-management logical network that is marked as required and assign it to both ha-hosts.
6. Set global maintenance.
7. Back up the engine and copy the backup file to a safe place.
8. Set the second ha-host, "B", as SPM.
9. Reprovision "A" and restore HE onto a clean NFS storage domain using "hosted-engine --deploy --restore-from-file=file". During the restore, answer "YES" to "Pause the execution after adding this host to the engine?
          You will be able to iteratively connect to the restored engine in order to manually review and remediate its configuration before proceeding with the deployment:
          please ensure that all the datacenter hosts and storage domain are listed as up or in maintenance mode before proceeding.
          This is normally not required when restoring an up to date and coherent backup. (Yes, No)[No]: yes"

Actual results:
Restore gets stuck on 
"[ INFO  ] TASK [ovirt.engine-setup : Run engine-setup with answerfile]"

Expected results:
Restore should continue and let the user log in to the engine and assign the network to both ha-hosts, then proceed with the restore after the user deletes "/tmp/ansible.WGeSW8_he_setup_lock" from the host.

Additional info:
Sosreport from host A (alma03) is attached.

Comment 1 Michal Skrivanek 2020-03-16 07:42:27 UTC
why is this a RHV bug, what's RHV-specific about it?

Comment 2 Nikolai Sednev 2020-03-16 07:51:36 UTC
(In reply to Michal Skrivanek from comment #1)
> why is this a RHV bug, what's RHV-specific about it?

Because the restore was performed on a RHV engine and this is a RHV engine issue; it is also related to the upgrade from RHV 4.3 to 4.4.

Comment 3 Michal Skrivanek 2020-03-18 08:48:33 UTC
(In reply to Nikolai Sednev from comment #2)
> (In reply to Michal Skrivanek from comment #1)
> > why is this a RHV bug, what's RHV-specific about it?
> 
> Because restore was made on RHV engine and its RHV engine related issue, due
> to the fact that its also related to the upgrade of RHV 4.3 to 4.4.

Like most of the bugs, since QE is testing d/s builds only. Please sync with lsvaty on guidelines about filing d/s vs u/s bugs.
Thanks

Also, AFAICT the attached logs do not include logs from the actual engine-setup run that you saw getting stuck.
Please attach.

Comment 4 Nikolai Sednev 2020-03-18 09:26:30 UTC
(In reply to Michal Skrivanek from comment #3)
> (In reply to Nikolai Sednev from comment #2)
> > (In reply to Michal Skrivanek from comment #1)
> > > why is this a RHV bug, what's RHV-specific about it?
> > 
> > Because restore was made on RHV engine and its RHV engine related issue, due
> > to the fact that its also related to the upgrade of RHV 4.3 to 4.4.
> 
> like most of the bugs since QE is testing d/s builds only. Please sync with
> lsvaty on guidelines about filing d/s vs u/s bugs
> Thanks
> 
> Also, AFAICT logs do not include logs from the actual engine setup you've
> seen getting stuck.
> Please attach.

It was not possible to attach logs from the engine; it was not accessible.

Comment 5 Yedidyah Bar David 2020-03-23 10:23:34 UTC
(In reply to Nikolai Sednev from comment #4)
> Not possible to attach logs from the engine, it was not accessible.

It should be accessible at the point where engine-setup is stuck. It will have an IP address that is local to libvirt's default network; search the log for 'local_vm_ip', e.g.:

2020-03-09 16:05:41,055+0200 DEBUG var changed: host "localhost" var "local_vm_ip" type "<class 'dict'>" value: "{
    "attempts": 3,
    "changed": true,
    "cmd": "virsh -r net-dhcp-leases default | grep -i 00:16:3e:7b:b8:53 | awk '{ print $5 }' | cut -f1 -d'/'",
    "delta": "0:00:00.034518",
    "end": "2020-03-09 16:05:40.583720",
    "failed": false,
    "rc": 0,
    "start": "2020-03-09 16:05:40.549202",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "192.168.222.167",
    "stdout_lines": [
        "192.168.222.167"
    ]
}"
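As a rough illustration of that search, here is a minimal Python sketch. It is not part of hosted-engine; the helper name find_local_vm_ips is hypothetical, and it assumes the setup log dumps the variable in the same format as the excerpt above (the sample below is a shortened copy of that excerpt).

```python
import re

# Hypothetical helper for illustration only; not part of ovirt-hosted-engine-setup.
_VAR_HEADER = re.compile(r'var "local_vm_ip"')
_STDOUT_LINE = re.compile(r'"stdout":\s*"([^"]*)"')


def find_local_vm_ips(log_text):
    """Return the non-empty "stdout" value of each 'local_vm_ip' dump, in log order."""
    ips = []
    in_dump = False
    for line in log_text.splitlines():
        if _VAR_HEADER.search(line):
            # Start of a 'var changed' dump for local_vm_ip.
            in_dump = True
            continue
        if in_dump:
            match = _STDOUT_LINE.search(line)
            if match:
                if match.group(1):
                    ips.append(match.group(1))
                in_dump = False
            elif line.startswith('}'):
                # End of the dump without a "stdout" key.
                in_dump = False
    return ips


# Shortened copy of the log excerpt quoted in this comment.
SAMPLE = '''2020-03-09 16:05:41,055+0200 DEBUG var changed: host "localhost" var "local_vm_ip" type "<class 'dict'>" value: "{
    "attempts": 3,
    "rc": 0,
    "stderr": "",
    "stdout": "192.168.222.167",
    "stdout_lines": [
        "192.168.222.167"
    ]
}"
'''

print(find_local_vm_ips(SAMPLE))
```

The address found this way is only reachable from the host running the deployment, since it belongs to libvirt's default network.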

The fact that var/log/ovirt-hosted-engine-setup/engine-logs-2020-03-09T16:01:32Z is empty is a bug in itself. Please open one. I think that hosted-engine deploy could have fetched the logs but failed due to a bug, and not because it was impossible:

2020-03-09 18:01:39,607+0200 DEBUG var changed: host "localhost" var "app_img" type "<class 'dict'>" value: "{
    "changed": false,
    "examined": 0,
    "failed": false,
    "files": [],
    "matched": 0,
    "msg": "/images was skipped as it does not seem to be a valid directory or it cannot be accessed\n"
}"

Comment 6 Nikolai Sednev 2020-03-23 12:50:27 UTC
At the moment the environment is no longer available. I left it online for 3 days so that R&D could access it for any required information.
The reproduction steps do not require vintage HE; it is fine to deploy the latest HE on 4.4 and then proceed with the reproduction steps as described.

Comment 7 Yedidyah Bar David 2020-03-29 10:39:51 UTC
A short update:

It seems like the issue is that, under certain circumstances, engine-setup has to ask the user questions during restore, based on the content of the db, and we provide no means to do that. So hosted-engine waits "forever".

How to handle? Perhaps:

1. Make ansible run engine-setup with an empty stdin, so that if engine-setup tries to read, it will fail immediately.

2. Check the specific issue (perhaps about required networks; this is a guess based on the flow) once we reproduce it.

3. More generally, perhaps allow running engine-setup interactively somehow. engine-setup was not designed to run unattended, and --accept-defaults is only good for when we have defaults. If we didn't supply any, we decided user interaction is mandatory. So we must allow that also on HE restore.
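
The idea in option 1 can be illustrated outside of ansible with a small Python sketch. This is only an assumption about how "empty stdin" behaves in general, not how the ovirt.engine-setup role is or would be implemented. The helper name run_with_empty_stdin is hypothetical, and a short Python child process that prompts for input stands in for engine-setup.

```python
import subprocess
import sys


def run_with_empty_stdin(cmd, timeout=30):
    """Run cmd with stdin attached to the null device.

    Any attempt by the child to read from stdin hits end-of-file at once,
    so a prompting program fails quickly instead of waiting "forever".
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )


# Stand-in for an interactive tool (an assumption; this is NOT engine-setup).
PROMPTING_CHILD = [sys.executable, "-c", "input('Continue? (Yes, No): ')"]

result = run_with_empty_stdin(PROMPTING_CHILD)
print("exit code:", result.returncode)
print("stderr tail:", result.stderr.strip().splitlines()[-1])
```

Here the child exits with a non-zero status and an EOFError in its stderr, which a caller can report as a failure. The timeout is an extra safety net for children that block for other reasons.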

Comment 10 Nikolai Sednev 2020-04-16 19:54:12 UTC
Works for me; I didn't find the initial error, so moving to closed.
Tested on:
rhvm-4.4.0-0.31.master.el8ev.noarch
ovirt-hosted-engine-ha-2.4.2-1.el8ev.noarch
ovirt-hosted-engine-setup-2.4.4-1.el8ev.noarch
rhvm-appliance.x86_64 2:4.4-20200403.0.el8ev
Red Hat Enterprise Linux release 8.2 (Ootpa)
Linux 4.18.0-193.el8.x86_64 #1 SMP Fri Mar 27 14:35:58 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Comment 11 Yedidyah Bar David 2020-04-19 09:02:30 UTC
Thanks for the report.

If this happens again, please also attach the engine-setup logs. These should be found on the engine vm, in /var/log/ovirt-engine/setup.

The deploy process should copy them to the host running the deploy, to /var/log/ovirt-hosted-engine-setup/engine-logs-$TIMESTAMP. If this one is empty, that's probably a bug in the deploy process.

If it's full, and sosreport does not collect it, that's a bug in sos.

In any case, please keep the engine vm image for investigation, even if re-purposing the host. Thanks.
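
A quick way to tell an empty engine-logs directory from a populated one, before deciding which bug to file, could look like the Python sketch below. The helper name empty_engine_log_dirs is hypothetical and is not part of any oVirt tool. The demonstration uses a temporary directory with made-up names rather than the real /var/log/ovirt-hosted-engine-setup.

```python
import os
import tempfile


def empty_engine_log_dirs(base_dir):
    """Return the sorted names of engine-logs-* directories in base_dir that have no entries."""
    empty = []
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if name.startswith("engine-logs-") and os.path.isdir(path):
            if not os.listdir(path):
                empty.append(name)
    return empty


# Demonstration on a throwaway directory tree (names are made up).
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "engine-logs-empty"))
    os.mkdir(os.path.join(base, "engine-logs-full"))
    with open(os.path.join(base, "engine-logs-full", "setup.log"), "w") as f:
        f.write("placeholder\n")
    demo_result = empty_engine_log_dirs(base)

print(demo_result)
```

On a real host the function would be pointed at /var/log/ovirt-hosted-engine-setup, or at the same path inside an unpacked sosreport.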