Bug 1548508

Summary: ansible based setup fails if 'hosted-engine --status --json' produces an incomplete response during ovirt-ha-agent start

Product: [oVirt] ovirt-hosted-engine-setup
Component: General
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 2.2.6
Target Milestone: ovirt-4.2.2
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-hosted-engine-setup-2.2.12
Reporter: Simone Tiraboschi <stirabos>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Nikolai Sednev <nsednev>
CC: bugs, didi, ekultails, mcmr, pagranat, yzhao
Keywords: Triaged
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
oVirt Team: Integration
Type: Bug
Bug Blocks: 1458709
Last Closed: 2018-03-29 11:05:50 UTC
Description (Simone Tiraboschi, 2018-02-23 17:18:11 UTC)

Comment 1 (Yaniv Kaul):
I'm pretty sure it's because 'hosted-engine --status --json' doesn't produce JSON output on failures...

Comment (Simone Tiraboschi):
(In reply to Yaniv Kaul from comment #1)
> I'm pretty sure it's because 'hosted-engine --status --json' doesn't
> produce JSON output on failures...

We were already checking the exit code, so it wasn't a failure. I think it is due to the fact that ovirt-ha-broker initializes the monitoring threads asynchronously, so there is probably a time frame in which we get a valid JSON response that still doesn't contain the engine health.

*** Bug 1548806 has been marked as a duplicate of this bug. ***

Comment 4 (Michael):
Hi guys,

For what it's worth, I'm seeing the same error.

Info: Intel i7-7567U CPU, CentOS-7-x86_64-DVD-1708 installed, fully updated and fresh (the host was formatted prior to the HE deployment). I do, however, have the oVirt engine web management UI running, and I see the deployed engine VM running on the one and only host (the machine I ran the deployment scripts on)... so I guess it's... working? It failed, but it's working?

Can anyone please confirm how far the setup got and which steps, if any, are still missing at this point?

Thanks,
Mike

Comment (Simone Tiraboschi):
(In reply to Michael from comment #4)
> Can anyone please help confirm how far it went and which, if any, steps are
> missing from a setup at this point?

It failed at one of the last steps. You may still find the bootstrap local VM in the engine; you can simply remove it manually if it is there. Everything else should be OK.

Comment 6 (Yihui Zhao):
Hit this issue with the ansible deployment on the latest RHVH (rhvh-4.2.1.4-0.20180305.0+1).

Tested versions:
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch

From the CLI:

    [ INFO  ] changed: [localhost]
    [ INFO  ] TASK [Wait for the engine to come up on the target VM]
    [ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true,
              "cmd": ["hosted-engine", "--vm-status", "--json"],
              "delta": "0:00:00.340304", "end": "2018-03-06 16:37:12.968483",
              "rc": 0, "start": "2018-03-06 16:37:12.628179",
              "stderr": "", "stderr_lines": [], "stdout": "<see below>",
              "stdout_lines": ["<see below>"]}
    [ ERROR ] Failed to execute stage 'Closing up': Failed executing ansible-playbook
    [ INFO  ] Stage: Clean up

The reported status (the "stdout" value above, pretty-printed; "stdout_lines" repeats the same content) was:

    {
      "1": {
        "conf_on_shared_storage": true,
        "live-data": true,
        "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=18685 (Tue Mar 6 16:37:08 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=18686 (Tue Mar 6 16:37:09 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineStarting\nstopped=False\n",
        "hostname": "ibm-x3650m5-05.lab.eng.pek2.redhat.com",
        "host-id": 1,
        "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"},
        "score": 3400,
        "stopped": false,
        "maintenance": false,
        "crc32": "4e99870f",
        "local_conf_timestamp": 18686,
        "host-ts": 18685
      },
      "global_maintenance": false
    }

Moving back to ASSIGNED; see comment #6.
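The race Simone describes, a valid JSON response that does not yet contain the engine health because ovirt-ha-broker initializes its monitoring threads asynchronously, can be tolerated by treating such a response as "not ready yet" rather than as an error. A minimal sketch of that idea in Python (illustrative only; the function names are mine, not the actual fix shipped in ovirt-hosted-engine-setup-2.2.12):

```python
import json
import subprocess
import time


def engine_is_up(status):
    """Given the parsed output of 'hosted-engine --vm-status --json'
    (top-level keys are host ids plus 'global_maintenance'), return True
    only when some host already reports engine health 'good'.  A response
    that is valid JSON but lacks 'engine-status' counts as not up yet."""
    hosts = (v for k, v in status.items() if k != "global_maintenance")
    return any(isinstance(h, dict)
               and h.get("engine-status", {}).get("health") == "good"
               for h in hosts)


def wait_for_engine(timeout=600, interval=5):
    """Poll until the engine reports healthy, tolerating both non-JSON
    output and JSON that is still missing the engine health section."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(["hosted-engine", "--vm-status", "--json"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            try:
                if engine_is_up(json.loads(result.stdout)):
                    return True
            except ValueError:  # stdout was not valid JSON (yet)
                pass
        time.sleep(interval)
    return False
```

The key point is that `engine_is_up` distinguishes "no health data yet" from "health is bad": both keep the loop retrying instead of assuming that every rc=0 response is complete.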
Comment (Simone Tiraboschi):
(In reply to Yihui Zhao from comment #6)
> "engine-status": {"reason": "failed liveliness check", "health":
> "bad", "vm": "up", "detail": "Up"}, "score": 3400, "stopped":

In this case the JSON output was complete; the point is that the engine failed to start for some different reason.

Works for me on these components:

ovirt-hosted-engine-ha-2.2.7-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.13-1.el7ev.noarch
rhvm-appliance-4.2-20180202.0.el7.noarch
Linux 3.10.0-861.el7.x86_64 #1 SMP Wed Mar 14 10:21:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Deployed over iSCSI.

Comment 10 (ekultails):
I am running into the exact same problem. I first hit it when using an all-in-one self-hosted oVirt Engine deployment I made via Vagrant and Ansible: https://github.com/ekultails/vagrant-ovirt-self-hosted-engine

I then tried the deployment on bare metal with both an AMD and an Intel server to see whether it was a nested-virtualization performance issue. It wasn't. It stalled out at the same point, while verifying functionality with "hosted-engine --status --json". I am not sure what the exact problem is.

Tested on CentOS 7.4 with oVirt 4.2.1, the latest oVirt 4.2 development snapshot, and oVirt master (4.3, from the latest snapshot available today). I am using a local NFS share for storage and temporarily disabled SELinux for testing.

Comment (Simone Tiraboschi):
(In reply to ekultails from comment #10)
> I then tried the deploy on bare-metal with both an AMD and Intel server to
> see if it was a nested virtualization performance issue. It wasn't. It
> stalled out at the same part when it's verifying the functionality with

Did it fail with "No first item, sequence was empty." or something similar while parsing the JSON output, or did it simply time out with "vm": "up" and "health": "bad"? In the latter case I'd suggest double-checking name resolution in your environment.

Comment 12 (ekultails):
I have a screenshot of the full error and will try to upload it as an attachment.
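The "No first item, sequence was empty." message Simone asks about is the standard Jinja2 error raised when the `first` filter is applied to an empty sequence and the resulting undefined value is then used, which is exactly what happens if a playbook condition picks the first engine-status entry out of a status dump that does not contain one yet. A hypothetical reproduction (the template expression here is illustrative, not the actual one used by the setup playbooks):

```python
from jinja2 import Environment
from jinja2.exceptions import UndefinedError

env = Environment()
# Take the first engine-status entry from a list of host status dicts,
# the way an Ansible 'until' condition might.
tmpl = env.from_string("{{ (hosts | map(attribute='engine-status') "
                       "| list | first).health }}")

complete = [{"engine-status": {"health": "good", "vm": "up"}}]
print(tmpl.render(hosts=complete))  # -> good

try:
    tmpl.render(hosts=[])  # broker not ready: no host entries yet
except UndefinedError as err:
    print(err)  # -> No first item, sequence was empty.
```

So the error does not mean the command failed; it means the JSON parsed fine but the expected entry was not there yet, matching Simone's explanation of the broker's asynchronous startup.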
I did not see anything similar to "No first item, sequence was empty". In the VM I had dnsmasq set up for DNS resolution, and on the bare-metal machines I only used /etc/hosts for simplicity. The error message included this:

    "reason": "failed liveliness check",
    "health": "bad",
    "vm": "up",
    "detail": "Up"

Created attachment 1411817 [details]
oVirt self-hosted engine deploy failure
Comment (Simone Tiraboschi):
(In reply to ekultails from comment #12)
> "reason": "failed liveliness check",
> "health": "bad",
> "vm": "up",
> "detail": "Up"

So it's not this bug, since that JSON is well formed. It is saying that the VM was up (at the libvirt level) but the engine couldn't be reached (that check instead happens over the network). You can try checking what failed on your VM with:

    hosted-engine --console

I suspect your VM got a wrong address: is your entry still in /etc/hosts after the deployment? Did the VM get an address via DHCP? If so, do you have a valid DHCP reservation for it, and did you pass the right MAC address to hosted-engine-setup?

Comment (ekultails):
I am sorry Simone, you are completely right, and you have helped solve my problem. Thank you so much! This was an issue with my configuration that led to DNS resolution problems. The entry in /etc/hosts had been removed by the installer when using the default settings. As soon as I added it back, the health check worked.

The end solution for my problem (which is confirmed to be unrelated to this bug) was, during a manual deploy, to specify static IP addressing (to align with the DNS A record), have a local /etc/hosts file filled out on the VM, and use my dnsmasq server for DNS resolution:

    $ sudo hosted-engine --deploy
    ...
    How should the engine VM network be configured (DHCP, Static)[DHCP]? Static
    Please enter the IP address to be used for the engine VM [192.168.121.2]: 192.168.121.201
    [ INFO  ] The engine VM will be configured to use 192.168.121.201/24
    Please provide a comma-separated list (max 3) of IP addresses of domain name servers for the engine VM
    Engine VM DNS (leave it empty to skip) [127.0.0.1]: <VAGRANT_VM_IP>
    Add lines for the appliance itself and for this host to /etc/hosts on the engine VM?
    Note: ensuring that this host could resolve the engine VM hostname is still up to you (Yes, No)[No] Yes
    ...

This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.
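Simone's diagnosis, that the VM is up at the libvirt level but the engine cannot be reached over the network, can be checked by hand with a small name-resolution-plus-HTTP probe. A sketch, assuming the engine's usual health-status servlet path (`/ovirt-engine/services/health`); the function name and trace format are mine:

```python
import socket
import urllib.request


def check_engine_liveliness(fqdn, timeout=5):
    """Probe an engine FQDN in two steps, returning a list of
    (step, ok, detail) tuples so you can see whether DNS resolution
    or the HTTP health check is the part that fails."""
    trace = []
    try:
        addr = socket.gethostbyname(fqdn)
        trace.append(("dns", True, addr))
    except socket.gaierror as err:
        trace.append(("dns", False, str(err)))
        return trace  # no point trying HTTP without an address
    try:
        with urllib.request.urlopen(
                "http://%s/ovirt-engine/services/health" % fqdn,
                timeout=timeout) as resp:
            trace.append(("http", resp.status == 200,
                          resp.read(80).decode("utf-8", "replace")))
    except OSError as err:  # URLError, HTTPError, and timeouts
        trace.append(("http", False, str(err)))
    return trace
```

With a stale or missing /etc/hosts entry the probe fails at the "dns" step, which matches the "failed liveliness check" seen here while the VM itself was up.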
Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.