Bug 1548508 - ansible based setup fails if 'hosted-engine --status --json' produces an incomplete response during ovirt-ha-agent start
Summary: ansible based setup fails if 'hosted-engine --status --json' produces an incomplete response during ovirt-ha-agent start
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 2.2.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.2
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1548806
Depends On:
Blocks: 1458709
 
Reported: 2018-02-23 17:18 UTC by Simone Tiraboschi
Modified: 2018-03-29 11:05 UTC
CC: 6 users

Fixed In Version: ovirt-hosted-engine-setup-2.2.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-03-29 11:05:50 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
oVirt self-hosted engine deploy failure (3.75 MB, image/jpeg), 2018-03-22 19:29 UTC, ekultails


Links
oVirt gerrit 88133 (master, MERGED): ansible: safely check vm status (last updated 2020-09-05 23:38:36 UTC)
oVirt gerrit 88204 (ovirt-hosted-engine-setup-2.2, MERGED): ansible: safely check vm status (last updated 2020-09-05 23:38:37 UTC)

Description Simone Tiraboschi 2018-02-23 17:18:11 UTC
Description of problem:
The ansible playbook checks the engine status using the json output of 'hosted-engine --status --json'.
Under certain circumstances 'hosted-engine --status --json' could produce an incomplete response even though it exits with exit code 0.

In that case the deployment fails with:
03:28:49 [ INFO  ] TASK [Wait for the engine to come up on the target VM]
03:29:11 [ ERROR ] fatal: [localhost]: FAILED! => {"msg": "The conditional check 'health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"' failed. The error was: error while evaluating conditional (health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"): No first item, sequence was empty."}
03:29:11 [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
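
For reference, the wait task boils down to something like the following. This is a sketch reconstructed from the error message above, not a copy of the shipped role; the module choice, delay value and variable names are assumptions. `first` raises exactly this error whenever json_query returns an empty sequence, i.e. when the status JSON has no per-host entries yet.

# Sketch reconstructed from the error message; not the actual task.
- name: Wait for the engine to come up on the target VM
  command: hosted-engine --vm-status --json
  register: health_result
  until: >-
    health_result.rc == 0 and
    health_result.stdout | from_json
      | json_query('*."engine-status"."health"') | first == "good"
  retries: 120
  delay: 10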


Version-Release number of selected component (if applicable):


How reproducible:
Quite low; it's a race on ovirt-ha-agent start.
Seen just once in CI.

Steps to Reproduce:
1. deploy hosted-engine with ansible
2.
3.

Actual results:
We could hit:
03:28:49 [ INFO  ] TASK [Wait for the engine to come up on the target VM]
03:29:11 [ ERROR ] fatal: [localhost]: FAILED! => {"msg": "The conditional check 'health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"' failed. The error was: error while evaluating conditional (health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"): No first item, sequence was empty."}
03:29:11 [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

Expected results:
The deployment should ignore incomplete json responses from 'hosted-engine --status --json' and keep retrying until it gets a complete one.
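
A minimal sketch of a more tolerant check, assuming the task shape reconstructed earlier (an illustration of the idea, not the actual gerrit patch): select the "good" health values and test that the resulting list is non-empty, so an incomplete JSON response simply triggers another retry instead of failing on `first`.

# Sketch only; hypothetical rewrite of the until condition.
- name: Wait for the engine to come up on the target VM
  command: hosted-engine --vm-status --json
  register: health_result
  until: >-
    health_result.rc == 0 and
    health_result.stdout | from_json
      | json_query('*."engine-status"."health"')
      | select('equalto', 'good') | list | length > 0
  retries: 120
  delay: 10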


Additional info:

Comment 1 Yaniv Kaul 2018-02-25 08:43:15 UTC
I'm pretty sure it's because 'hosted-engine --status --json' doesn't produce JSON output on failures...

Comment 2 Simone Tiraboschi 2018-02-26 14:14:43 UTC
(In reply to Yaniv Kaul from comment #1)
> I'm pretty sure it's because 'hosted-engine --status --json' doesn't produce
> JSON output on failures...

We were already checking the exit code, so it wasn't a failure.
I think this is due to the fact that ovirt-ha-broker initializes its monitoring threads asynchronously, so there is probably a time frame where we get a valid json response that still doesn't contain the engine health.
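
To illustrate the race (hypothetical data, not taken from the failing run): right after ovirt-ha-agent starts, the status output can already be valid JSON that contains only the global section; the json_query projection then yields an empty list and `first` fails exactly as in the log above.

# Hypothetical incomplete response: no per-host entry yet, so the
# projection finds no "engine-status" key and returns an empty list.
- name: Show what the conditional sees on an incomplete response
  debug:
    msg: "{{ incomplete_status | json_query('*.\"engine-status\".\"health\"') }}"
  vars:
    incomplete_status:
      global_maintenance: false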

Comment 3 Simone Tiraboschi 2018-02-26 14:17:03 UTC
*** Bug 1548806 has been marked as a duplicate of this bug. ***

Comment 4 Michael 2018-03-05 13:41:21 UTC
Hi guys,

For what it's worth I'm seeing the same error. Info:
Intel I7-7567U CPU, CentOS-7-x86_64-DVD-1708 installed, fully updated and fresh (was formatted prior to the HE deployment).

I do however have an oVirt engine web management running, I see the deployed engine VM running on the one and only host (the machine I ran the deployment scripts on)... so I guess it's ... working? It failed, but it's working?

Can anyone please help confirm how far it went and which, if any, steps are missing from a setup at this point?

Thanks,
Mike

Comment 5 Simone Tiraboschi 2018-03-05 14:33:08 UTC
(In reply to Michael from comment #4)
> Can anyone please help confirm how far it went and which, if any, steps are
> missing from a setup at this point?

It was one of the last steps.
You might still find the bootstrap local VM in the engine; if it is there, you can simply remove it manually.
Everything else should be OK.

Comment 6 Yihui Zhao 2018-03-06 08:47:48 UTC
Hit this issue with ansible deployment on the latest rhvh(rhvh-4.2.1.4-0.20180305.0+1)

Test version:
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch


From the CLI:
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Wait for the engine to come up on the target VM]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.340304", "end": "2018-03-06 16:37:12.968483", "rc": 0, "start": "2018-03-06 16:37:12.628179", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=18685 (Tue Mar  6 16:37:08 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=18686 (Tue Mar  6 16:37:09 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"ibm-x3650m5-05.lab.eng.pek2.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"4e99870f\", \"local_conf_timestamp\": 18686, \"host-ts\": 18685}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=18685 (Tue Mar  6 16:37:08 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=18686 (Tue Mar  6 16:37:09 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"ibm-x3650m5-05.lab.eng.pek2.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"4e99870f\", \"local_conf_timestamp\": 18686, \"host-ts\": 18685}, \"global_maintenance\": false}"]}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO  ] Stage: Clean up

Comment 7 Nikolai Sednev 2018-03-06 11:48:16 UTC
Moving back to ASSIGNED due to comment #6.

Comment 8 Simone Tiraboschi 2018-03-06 12:47:20 UTC
(In reply to Yihui Zhao from comment #6)
> \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\":
> \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\":


In this case the json output was complete; the point is that the engine failed to start for some other reason.

Comment 9 Nikolai Sednev 2018-03-18 14:18:47 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Deployed over iSCSI.

Comment 10 ekultails 2018-03-22 17:57:58 UTC
I am running into the exact same problem.

I first ran into the problem when using an all-in-one self-hosted oVirt Engine deployment I made via Vagrant and Ansible:

https://github.com/ekultails/vagrant-ovirt-self-hosted-engine

I then tried the deploy on bare metal with both an AMD and an Intel server to see if it was a nested virtualization performance issue. It wasn't. It stalled out at the same part, where it verifies the functionality with "hosted-engine --status --json". I am not sure what the exact problem is.

Tested on CentOS 7.4 with oVirt 4.2.1, oVirt 4.2 latest development snapshot, and oVirt master (4.3 from the latest snapshot available today). I am using a local NFS share for storage and temporarily disabled SELinux for testing.

Comment 11 Simone Tiraboschi 2018-03-22 18:15:26 UTC
(In reply to ekultails from comment #10)
> I then tried the deploy on bare-metal with both an AMD and Intel server to
> see if it was a nested virtualization performance issue. It wasn't. It
> stalled out at the same part when it's verifying the functionality with

Did it fail with "No first item, sequence was empty." or something like that while parsing the json output, or did it simply time out with
 "vm": "up"
 "health": "bad"
In that case I'd suggest double checking name resolution on your env.

Comment 12 ekultails 2018-03-22 19:25:59 UTC
I have a screenshot of the full error and will try to upload it as an attachment. I did not see anything similar to "No first item, sequence was empty". In the VM I had DNSMasq set up for DNS resolution, and on the bare-metal machines I only used /etc/hosts for simplicity.

The stderr message included this:

"reason": "failed liveliness check",
"health": "bad",
"vm": "up",
"detail": "Up"

Comment 13 ekultails 2018-03-22 19:29:22 UTC
Created attachment 1411817 [details]
oVirt self-hosted engine deploy failure

Comment 14 Simone Tiraboschi 2018-03-22 21:01:39 UTC
(In reply to ekultails from comment #12)
> "reason": "failed liveliness check",
> "health": "bad",
> "vm": "up",
> "detail": "Up"

So it's not this bug since that json is well formed.

Here it's saying that the VM was up (at libvirt level) but the engine couldn't be reached (this check instead happens through the network).

You can try checking what failed on your VM with
 hosted-engine --console

I suspect your VM got a wrong address: is your entry still in /etc/hosts after the deployment? Did the VM get an address via DHCP? In that case, do you have a valid DHCP reservation for it, and did you pass the right mac address to hosted-engine-setup?

Comment 15 ekultails 2018-03-23 16:08:08 UTC
I am sorry Simone, you are completely right and you have helped solve my problem. Thank you so much! This was an issue with my configuration that led to DNS resolution problems. The entry in /etc/hosts was removed by the installer using the default settings. As soon as I added it back, the health check worked.

The end solution for my problem (which is confirmed to be unrelated to this bug) was, during a manual deploy, to specify static IP addressing (to align with the DNS A record), have a local /etc/hosts file filled out on the VM, and use my DNSMasq server for DNS resolution.

$ sudo hosted-engine --deploy
...
          How should the engine VM network be configured (DHCP, Static)[DHCP]? Static
          Please enter the IP address to be used for the engine VM [192.168.121.2]: 192.168.121.201
[ INFO  ] The engine VM will be configured to use 192.168.121.201/24
          Please provide a comma-separated list (max 3) of IP addresses of domain name servers for the engine VM
          Engine VM DNS (leave it empty to skip) [127.0.0.1]: <VAGRANT_VM_IP>
          Add lines for the appliance itself and for this host to /etc/hosts on the engine VM?
          Note: ensuring that this host could resolve the engine VM hostname is still up to you
          (Yes, No)[No] Yes
...

Comment 16 Sandro Bonazzola 2018-03-29 11:05:50 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

