Bug 1615943

Summary: [RFE] Engine health check should report an error in readable and informative way.
Product: [oVirt] ovirt-hosted-engine-setup
Reporter: Nikolai Sednev <nsednev>
Component: RFEs
Assignee: Asaf Rachmani <arachman>
Status: CLOSED WONTFIX
QA Contact: Nikolai Sednev <nsednev>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.2.24
CC: arachman, bugs, sanja, stirabos
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-11-04 15:18:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  logs from alma03 (flags: none)

Description Nikolai Sednev 2018-08-14 14:46:09 UTC
Created attachment 1475889 [details]
logs from alma03

Description of problem:
[RFE] Errors should be clear and informative when deployment fails because the engine is not reachable by its FQDN.

When deploying Hosted-engine with a DHCP configuration, if the engine VM receives an IP address which doesn't resolve to its FQDN (due to a wrong user configuration), the deployment fails with an unreadable and uninformative error such as: "[ INFO  ] TASK [Start ovirt-ha-agent service on the host]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.388280", "end": "2018-08-13 15:34:51.947498", "rc": 0, "start": "2018-08-13 15:34:51.559218", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}"]}
[ INFO  ] TASK [Check VM status at virt level]
"

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.25-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
rhvm-appliance-4.2-20180813.0.el7.noarch
Linux 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE with a wrong MAC address, so the engine's IP address won't resolve to its FQDN.

Actual results:
An unreadable and meaningless error is printed.

Expected results:
Errors should appear in an informative and readable form.

Additional info:
Logs from the deployment attached.

Comment 1 Nikolai Sednev 2018-08-14 14:55:37 UTC
The real issue here is the error reported by the engine health check.
Please correct the error so that the customer can understand its meaning.

Comment 2 Sandro Bonazzola 2019-11-20 12:52:59 UTC
missed 4.4 feature freeze, moving to 4.5

Comment 3 Asaf Rachmani 2021-10-31 07:38:51 UTC
Actually, we already added an informative error message for this issue. From the "ovirt-hosted-engine-setup" log file:

2018-08-13 15:35:25,849+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration"}


"Engine health check" helps us to indicate if the engine is up, in case of failure during HE deployment we check the reason and show an informative error message right after:

2018-08-13 15:23:15,661+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Check engine VM health]
2018-08-13 15:34:52,135+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [], u'cmd': [u'hosted-engine', u'--vm-status', u'--json'], u'end': u'2018-08-13 15:34:51.947498', u'_ansible_no_log': False, u'stdout': u'{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n", "hostname": "alma03.qa.lab.tlv.redhat.com", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "5eb81e6e", "local_conf_timestamp": 3281, "host-ts": 3281}, "global_maintenance": false}', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u'hosted-engine --vm-status --json', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-08-13 15:34:51.559218', u'attempts': 120, u'stderr': u'', u'rc': 0, u'delta': u'0:00:00.388280', u'stdout_lines': [u'{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n", "hostname": "alma03.qa.lab.tlv.redhat.com", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "5eb81e6e", "local_conf_timestamp": 3281, "host-ts": 3281}, "global_maintenance": false}']}
2018-08-13 15:34:52,236+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.388280", "end": "2018-08-13 15:34:51.947498", "rc": 0, "start": "2018-08-13 15:34:51.559218", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}"]}
2018-08-13 15:34:53,942+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Check VM status at virt level]
2018-08-13 15:34:56,146+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:34:58,050+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 TASK [debug]
2018-08-13 15:34:59,753+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 vm_status_virsh: {'stderr_lines': [], u'changed': True, u'end': u'2018-08-13 15:34:55.988876', u'stdout': u' 2     HostedEngine                   running', u'cmd': u'virsh -r list | grep HostedEngine | grep running', 'failed': False, u'delta': u'0:00:00.126384', u'stderr': u'', u'rc': 0, 'stdout_lines': [u' 2     HostedEngine                   running'], u'start': u'2018-08-13 15:34:55.862492'}
2018-08-13 15:35:01,456+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Fail if engine VM is not running]
2018-08-13 15:35:03,289+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 skipping: [localhost]
2018-08-13 15:35:05,093+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get target engine VM IPv4 address]
2018-08-13 15:35:07,200+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:35:09,013+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get VDSM's target engine VM stats]
2018-08-13 15:35:11,619+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:35:13,523+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Convert stats to JSON format]
2018-08-13 15:35:15,227+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 ok: [localhost]
2018-08-13 15:35:17,031+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get target engine VM IPv4 address from VDSM stats]
2018-08-13 15:35:18,735+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 ok: [localhost]
2018-08-13 15:35:20,538+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 TASK [debug]
2018-08-13 15:35:22,241+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 engine_vm_ip_vdsm: 10.35.92.55
2018-08-13 15:35:24,044+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Fail if Engine IP is different from engine's FQDN resolved IP]
2018-08-13 15:35:25,748+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'msg': u"Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration", u'changed': False, u'_ansible_no_log': False}
2018-08-13 15:35:25,849+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration"}
2018-08-13 15:35:27,452+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 PLAY RECAP [localhost] : ok: 82 changed: 31 unreachable: 0 skipped: 5 failed: 2
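The "Fail if Engine IP is different from engine's FQDN resolved IP" task above boils down to resolving the FQDN and comparing the result to the IP the VM actually got. A minimal sketch of that check, assuming nothing beyond what the log shows (the `check_engine_fqdn` helper and its injectable `resolve` parameter are hypothetical, added here so the DNS lookup can be stubbed out):

```python
import socket

def check_engine_fqdn(fqdn, vm_ip, resolve=socket.gethostbyname):
    """Raise a readable error if the engine VM's IP does not match what
    its FQDN resolves to, mirroring the message seen in the log above."""
    resolved = resolve(fqdn)  # socket.gethostbyname may raise socket.gaierror
    if resolved != vm_ip:
        raise RuntimeError(
            "Engine VM IP address is {} while the engine's FQDN {} "
            "resolves to {}. If you are using DHCP, check your DHCP "
            "reservation configuration".format(vm_ip, fqdn, resolved))

# Reproducing the mismatch from this bug with a stubbed resolver:
try:
    check_engine_fqdn("nsednev-he-1.qa.lab.tlv.redhat.com", "10.35.92.55",
                      resolve=lambda name: "10.35.92.51")
except RuntimeError as err:
    print(err)
```

With a correct DHCP reservation the resolver returns the VM's own address and the function returns silently; on a mismatch it raises with the same human-readable message the setup prints at 15:35:25.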