Bug 1548508

Summary: ansible based setup fails if 'hosted-engine --status --json' produces an incomplete response during ovirt-ha-agent start
Product: [oVirt] ovirt-hosted-engine-setup
Component: General
Reporter: Simone Tiraboschi <stirabos>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Nikolai Sednev <nsednev>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 2.2.6
CC: bugs, didi, ekultails, mcmr, pagranat, yzhao
Keywords: Triaged
Target Milestone: ovirt-4.2.2
Target Release: ---
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-hosted-engine-setup-2.2.12
Doc Type: If docs needed, set a value
Type: Bug
oVirt Team: Integration
Bug Blocks: 1458709
Last Closed: 2018-03-29 11:05:50 UTC
Attachments: oVirt self-hosted engine deploy failure

Description Simone Tiraboschi 2018-02-23 17:18:11 UTC
Description of problem:
The ansible playbook checks engine status using the json output of 'hosted-engine --status --json'.
Under certain circumstances 'hosted-engine --status --json' can produce an incomplete response while still exiting with exit code 0.

In that case the deployment fails with:
03:28:49 [ INFO  ] TASK [Wait for the engine to come up on the target VM]
03:29:11 [ ERROR ] fatal: [localhost]: FAILED! => {"msg": "The conditional check 'health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"' failed. The error was: error while evaluating conditional (health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"): No first item, sequence was empty."}
03:29:11 [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
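
For context, the check that fails is an Ansible task along these lines. This is a sketch reconstructed from the failure message above: the task name, the registered variable and the until conditional are taken from the error, while the command invocation, retries and delay values are assumptions and may differ from the shipped playbook.

  # Sketch of the wait loop, reconstructed from the failure message above;
  # retries/delay and the exact command invocation are illustrative assumptions.
  - name: Wait for the engine to come up on the target VM
    command: hosted-engine --vm-status --json
    register: health_result
    until: >-
      health_result.rc == 0 and
      health_result.stdout|from_json|json_query('*."engine-status"."health"')|first == "good"
    retries: 120
    delay: 10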


Version-Release number of selected component (if applicable):


How reproducible:
Quite low; it's a race on ovirt-ha-agent start.
Seen just once in CI.

Steps to Reproduce:
1. deploy hosted-engine with ansible

Actual results:
We could hit:
03:28:49 [ INFO  ] TASK [Wait for the engine to come up on the target VM]
03:29:11 [ ERROR ] fatal: [localhost]: FAILED! => {"msg": "The conditional check 'health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"' failed. The error was: error while evaluating conditional (health_result.rc == 0 and health_result.stdout|from_json|json_query('*.\"engine-status\".\"health\"')|first==\"good\"): No first item, sequence was empty."}
03:29:11 [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook

Expected results:
Incomplete json responses from 'hosted-engine --status --json' are ignored and the deployment keeps polling until the engine reports a good health status.
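
One way to get that behaviour (a minimal sketch of the idea only, not necessarily the actual change shipped in ovirt-hosted-engine-setup-2.2.12) is to guard the |first filter so that a response without any engine health entry is treated as "engine not up yet" and the task simply keeps retrying instead of aborting:

  # Minimal sketch: a json response with no "engine-status"."health" entry is
  # treated as "not ready yet" instead of letting |first fail on an empty
  # sequence. Jinja2 'and' short-circuits, so |first only runs when the query
  # returned at least one element.
  - name: Wait for the engine to come up on the target VM
    command: hosted-engine --vm-status --json
    register: health_result
    until: >-
      health_result.rc == 0 and
      health_result.stdout|from_json|json_query('*."engine-status"."health"')|length > 0 and
      health_result.stdout|from_json|json_query('*."engine-status"."health"')|first == "good"
    retries: 120
    delay: 10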


Additional info:

Comment 1 Yaniv Kaul 2018-02-25 08:43:15 UTC
I'm pretty sure it's because 'hosted-engine --status --json' doesn't produce JSON output on failures...

Comment 2 Simone Tiraboschi 2018-02-26 14:14:43 UTC
(In reply to Yaniv Kaul from comment #1)
> I'm pretty sure it's because 'hosted-engine --status --json' doesn't produce
> JSON output on failures...

We were already checking the exit code, so it wasn't a failure.
I think this is due to the fact that ovirt-ha-broker initializes the monitoring threads asynchronously, so there is probably a time window where we get a valid json response that still doesn't contain the engine health.
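
To illustrate with a purely hypothetical response (not taken from a real run): during that window the command can return valid json in which the per-host entry has no "engine-status" key yet, for example

  {"1": {"host-id": 1, "hostname": "host1.example.com", "live-data": false,
         "score": 0, "stopped": false, "maintenance": false},
   "global_maintenance": false}

The jmespath query '*."engine-status"."health"' then evaluates to an empty list, so the |first filter raises exactly the "No first item, sequence was empty" error shown in the description.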

Comment 3 Simone Tiraboschi 2018-02-26 14:17:03 UTC
*** Bug 1548806 has been marked as a duplicate of this bug. ***

Comment 4 Michael 2018-03-05 13:41:21 UTC
Hi guys,

For what it's worth I'm seeing the same error. Info:
Intel I7-7567U CPU, CentOS-7-x86_64-DVD-1708 installed, fully updated and fresh (was formatted prior to the HE deployment).

I do, however, have the oVirt Engine web management UI running, and I see the deployed engine VM running on the one and only host (the machine I ran the deployment scripts on)... so I guess it's... working? It failed, but it's working?

Can anyone please help confirm how far it went and which, if any, steps are missing from a setup at this point?

Thanks,
Mike

Comment 5 Simone Tiraboschi 2018-03-05 14:33:08 UTC
(In reply to Michael from comment #4)
> Can anyone please help confirm how far it went and which, if any, steps are
> missing from a setup at this point?

It failed at one of the last steps.
You might still find the bootstrap local VM in the engine; if so, you can simply remove it manually.
Everything else should be OK.

Comment 6 Yihui Zhao 2018-03-06 08:47:48 UTC
Hit this issue with the ansible deployment on the latest rhvh (rhvh-4.2.1.4-0.20180305.0+1).

Test version:
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch


From the CLI:
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Wait for the engine to come up on the target VM]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.340304", "end": "2018-03-06 16:37:12.968483", "rc": 0, "start": "2018-03-06 16:37:12.628179", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=18685 (Tue Mar  6 16:37:08 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=18686 (Tue Mar  6 16:37:09 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"ibm-x3650m5-05.lab.eng.pek2.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"4e99870f\", \"local_conf_timestamp\": 18686, \"host-ts\": 18685}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=18685 (Tue Mar  6 16:37:08 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=18686 (Tue Mar  6 16:37:09 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"ibm-x3650m5-05.lab.eng.pek2.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"4e99870f\", \"local_conf_timestamp\": 18686, \"host-ts\": 18685}, \"global_maintenance\": false}"]}
[ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
[ INFO  ] Stage: Clean up

Comment 7 Nikolai Sednev 2018-03-06 11:48:16 UTC
Moving back to ASSIGNED per comment #6.

Comment 8 Simone Tiraboschi 2018-03-06 12:47:20 UTC
(In reply to Yihui Zhao from comment #6)
> \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\":
> \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\":


In this case the json output was complete; the point is that the engine failed to start for some different reason.

Comment 9 Nikolai Sednev 2018-03-18 14:18:47 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Deployed over iSCSI.

Comment 10 ekultails 2018-03-22 17:57:58 UTC
I am running into the exact same problem.

I first ran into the problem when using an all-in-one self-hosted oVirt Engine deployment I made via Vagrant and Ansible:

https://github.com/ekultails/vagrant-ovirt-self-hosted-engine

I then tried the deployment on bare metal with both an AMD and an Intel server to see if it was a nested virtualization performance issue. It wasn't. It stalled out at the same point, while verifying functionality with "hosted-engine --status --json". I am not sure what the exact problem is.

Tested on CentOS 7.4 with oVirt 4.2.1, oVirt 4.2 latest development snapshot, and oVirt master (4.3 from the latest snapshot available today). I am using a local NFS share for storage and temporarily disabled SELinux for testing.

Comment 11 Simone Tiraboschi 2018-03-22 18:15:26 UTC
(In reply to ekultails from comment #10)
> I then tried the deploy on bare-metal with both an AMD and Intel server to
> see if it was a nested virtualization performance issue. It wasn't. It
> stalled out at the same part when it's verifying the functionality with

Did it fail with "No first item, sequence was empty." or something similar while parsing the json output, or did it simply time out with
 "vm": "up"
 "health": "bad"
In the latter case I'd suggest double-checking name resolution in your environment.

Comment 12 ekultails 2018-03-22 19:25:59 UTC
I have a screenshot of the full error and will try to upload it as an attachment. I did not see anything similar to "No first item, sequence was empty". In the VM I had DNSMasq set up for DNS resolution, and on the bare-metal machines I only used /etc/hosts for simplicity.

The stderr message included this:

"reason": "failed liveliness check",
"health": "bad",
"vm": "up",
"detail": "Up"

Comment 13 ekultails 2018-03-22 19:29:22 UTC
Created attachment 1411817 [details]
oVirt self-hosted engine deploy failure

Comment 14 Simone Tiraboschi 2018-03-22 21:01:39 UTC
 (In reply to ekultails from comment #12)
> "reason": "failed liveliness check",
> "health": "bad",
> "vm": "up",
> "detail": "Up"

So it's not this bug since that json is well formed.

Here it's saying that the VM was up (at libvirt level) but the engine couldn't be reached (this check instead happens through the network).

You can try checking what failed on your VM with
 hosted-engine --console

I suspect your VM got a wrong address: is your entry still in /etc/hosts after the deployment? Did the VM get an address via DHCP? In that case, do you have a valid DHCP reservation for it, and did you pass the right mac address to hosted-engine-setup?
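
A few generic commands for checking that from the host (a rough sketch; <engine-fqdn> is a placeholder for the engine FQDN given to hosted-engine-setup):

  getent hosts <engine-fqdn>     # does the host resolve the engine name?
  grep <engine-fqdn> /etc/hosts  # is a static entry (still) present?
  ping -c 3 <engine-fqdn>        # is the resolved address reachable?
  hosted-engine --console        # inspect the engine VM console directly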

Comment 15 ekultails 2018-03-23 16:08:08 UTC
I am sorry, Simone, you are completely right and you have helped to solve my problem. Thank you so much! This was an issue with my configuration that led to DNS resolution problems. The entry in /etc/hosts was removed by the installer using the default settings. As soon as I added it back, the health check worked.

The end solution for my problem (which is confirmed to be unrelated to this bug) was, during a manual deploy, to specify static IP addressing (to align with the DNS A record), to have a local /etc/hosts file filled out on the VM, and to use my DNSMasq server for DNS resolution.

$ sudo hosted-engine --deploy
...
          How should the engine VM network be configured (DHCP, Static)[DHCP]? Static
          Please enter the IP address to be used for the engine VM [192.168.121.2]: 192.168.121.201
[ INFO  ] The engine VM will be configured to use 192.168.121.201/24
          Please provide a comma-separated list (max 3) of IP addresses of domain name servers for the engine VM
          Engine VM DNS (leave it empty to skip) [127.0.0.1]: <VAGRANT_VM_IP>
          Add lines for the appliance itself and for this host to /etc/hosts on the engine VM?
          Note: ensuring that this host could resolve the engine VM hostname is still up to you
          (Yes, No)[No] Yes
...

Comment 16 Sandro Bonazzola 2018-03-29 11:05:50 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.