Bug 1615943 - [RFE] Engine health check should report an error in a readable and informative way.
Summary: [RFE] Engine health check should report an error in a readable and informative way.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: RFEs
Version: 2.2.24
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Asaf Rachmani
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-14 14:46 UTC by Nikolai Sednev
Modified: 2022-02-23 14:04 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-04 15:18:51 UTC
oVirt Team: Integration
Embargoed:


Attachments
logs from alma03 (165.85 KB, application/x-gzip)
2018-08-14 14:46 UTC, Nikolai Sednev


Links
oVirt gerrit 117298 (master, NEW): Add stdout of liveliness script to failure reason (last updated 2021-10-25 16:18:49 UTC)

Description Nikolai Sednev 2018-08-14 14:46:09 UTC
Created attachment 1475889 [details]
logs from alma03

Description of problem:
[RFE] Errors should be clear and informative when the deployment fails because the engine is not reachable by its FQDN.

When deploying Hosted Engine with a DHCP configuration, if the engine VM receives an IP address that does not resolve to its FQDN (due to a wrong user configuration), the deployment fails with an unreadable and uninformative error such as: "[ INFO  ] TASK [Start ovirt-ha-agent service on the host]
[ INFO  ] changed: [localhost]
[ INFO  ] TASK [Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.388280", "end": "2018-08-13 15:34:51.947498", "rc": 0, "start": "2018-08-13 15:34:51.559218", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}"]}
[ INFO  ] TASK [Check VM status at virt level]
"

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.25-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7ev.noarch
rhvm-appliance-4.2-20180813.0.el7.noarch
Linux 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

How reproducible:
100%

Steps to Reproduce:
1. Deploy SHE with a wrong MAC address, so that the IP address the engine VM receives does not resolve to its FQDN.

Actual results:
An unreadable and meaningless error is printed.

Expected results:
Errors should appear in an informative and readable form.

Additional info:
Logs from the deployment attached.
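
To illustrate the kind of readable summary this RFE asks for: the raw "hosted-engine --vm-status --json" dump quoted in the description buries the only relevant fact ("reason": "failed liveliness check") inside a wall of JSON. Below is a minimal, hypothetical Python sketch that condenses that JSON into one line per host; the helper name and output format are made up for illustration and are not part of ovirt-hosted-engine-setup, and it assumes the JSON shape shown in the failure output above.

# Hypothetical helper -- not part of ovirt-hosted-engine-setup.
# Condenses "hosted-engine --vm-status --json" into a short, readable summary,
# assuming the JSON shape shown in the failure output above (engine-status is
# a nested object with "vm", "health" and "reason" keys).
import json
import subprocess

def engine_health_summary():
    raw = subprocess.check_output(["hosted-engine", "--vm-status", "--json"])
    status = json.loads(raw)
    lines = []
    for host_id, host in status.items():
        if not isinstance(host, dict):  # skip top-level flags such as "global_maintenance"
            continue
        engine = host.get("engine-status", {})
        lines.append("host %s (%s): vm=%s, health=%s, reason=%s" % (
            host_id, host.get("hostname", "?"), engine.get("vm", "?"),
            engine.get("health", "?"), engine.get("reason", "none")))
    return "\n".join(lines)

if __name__ == "__main__":
    # For the failure above this would print something like:
    # host 1 (alma03.qa.lab.tlv.redhat.com): vm=up, health=bad, reason=failed liveliness check
    print(engine_health_summary())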

Comment 1 Nikolai Sednev 2018-08-14 14:55:37 UTC
The real issue here is the error reported by the engine health check.
Please make the error clearer, so that customers will be able to understand its meaning.
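
For context on what that reason means: roughly speaking, "failed liveliness check" means the HA agent's probe of the engine's health page did not succeed, and the underlying cause (name resolution failure, connection refused, bad HTTP status, ...) is exactly what a customer would want to see in the error. A hedged sketch of such a probe is below; the endpoint path and message formatting are assumptions for illustration, not taken from the ovirt-hosted-engine-ha code.

# Illustrative sketch only -- the health endpoint path and error formatting are
# assumptions, not the actual ovirt-hosted-engine-ha liveliness code.
import sys
import urllib.request

def engine_liveliness(fqdn, timeout=10):
    url = "http://%s/ovirt-engine/services/health" % fqdn
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.read().decode("utf-8", "replace")
    except Exception as exc:
        # A readable reason keeps both the probed URL and the underlying error.
        return False, "failed liveliness check: %s: %s" % (url, exc)

if __name__ == "__main__":
    ok, detail = engine_liveliness(sys.argv[1])
    print(("good: " if ok else "bad: ") + detail)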

Comment 2 Sandro Bonazzola 2019-11-20 12:52:59 UTC
missed 4.4 feature freeze, moving to 4.5

Comment 3 Asaf Rachmani 2021-10-31 07:38:51 UTC
Actually, we already added an informative error message for this issue. From the "ovirt-hosted-engine-setup" log file:

2018-08-13 15:35:25,849+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration"}


"Engine health check" helps us to indicate if the engine is up, in case of failure during HE deployment we check the reason and show an informative error message right after:

2018-08-13 15:23:15,661+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Check engine VM health]
2018-08-13 15:34:52,135+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'_ansible_parsed': True, u'stderr_lines': [], u'cmd': [u'hosted-engine', u'--vm-status', u'--json'], u'end': u'2018-08-13 15:34:51.947498', u'_ansible_no_log': False, u'stdout': u'{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n", "hostname": "alma03.qa.lab.tlv.redhat.com", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "5eb81e6e", "local_conf_timestamp": 3281, "host-ts": 3281}, "global_maintenance": false}', u'changed': True, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': False, u'_raw_params': u'hosted-engine --vm-status --json', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2018-08-13 15:34:51.559218', u'attempts': 120, u'stderr': u'', u'rc': 0, u'delta': u'0:00:00.388280', u'stdout_lines': [u'{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n", "hostname": "alma03.qa.lab.tlv.redhat.com", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "5eb81e6e", "local_conf_timestamp": 3281, "host-ts": 3281}, "global_maintenance": false}']}
2018-08-13 15:34:52,236+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.388280", "end": "2018-08-13 15:34:51.947498", "rc": 0, "start": "2018-08-13 15:34:51.559218", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=3281 (Mon Aug 13 15:34:40 2018)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=3281 (Mon Aug 13 15:34:40 2018)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"alma03.qa.lab.tlv.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"failed liveliness check\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Up\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"5eb81e6e\", \"local_conf_timestamp\": 3281, \"host-ts\": 3281}, \"global_maintenance\": false}"]}
2018-08-13 15:34:53,942+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Check VM status at virt level]
2018-08-13 15:34:56,146+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:34:58,050+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 TASK [debug]
2018-08-13 15:34:59,753+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 vm_status_virsh: {'stderr_lines': [], u'changed': True, u'end': u'2018-08-13 15:34:55.988876', u'stdout': u' 2     HostedEngine                   running', u'cmd': u'virsh -r list | grep HostedEngine | grep running', 'failed': False, u'delta': u'0:00:00.126384', u'stderr': u'', u'rc': 0, 'stdout_lines': [u' 2     HostedEngine                   running'], u'start': u'2018-08-13 15:34:55.862492'}
2018-08-13 15:35:01,456+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Fail if engine VM is not running]
2018-08-13 15:35:03,289+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 skipping: [localhost]
2018-08-13 15:35:05,093+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get target engine VM IPv4 address]
2018-08-13 15:35:07,200+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:35:09,013+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get VDSM's target engine VM stats]
2018-08-13 15:35:11,619+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 changed: [localhost]
2018-08-13 15:35:13,523+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Convert stats to JSON format]
2018-08-13 15:35:15,227+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 ok: [localhost]
2018-08-13 15:35:17,031+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Get target engine VM IPv4 address from VDSM stats]
2018-08-13 15:35:18,735+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 ok: [localhost]
2018-08-13 15:35:20,538+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 TASK [debug]
2018-08-13 15:35:22,241+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 engine_vm_ip_vdsm: 10.35.92.55
2018-08-13 15:35:24,044+0300 INFO otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:100 TASK [Fail if Engine IP is different from engine's FQDN resolved IP]
2018-08-13 15:35:25,748+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 {u'msg': u"Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration", u'changed': False, u'_ansible_no_log': False}
2018-08-13 15:35:25,849+0300 ERROR otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:98 fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM IP address is 10.35.92.55 while the engine's FQDN nsednev-he-1.qa.lab.tlv.redhat.com resolves to 10.35.92.51. If you are using DHCP, check your DHCP reservation configuration"}
2018-08-13 15:35:27,452+0300 DEBUG otopi.ovirt_hosted_engine_setup.ansible_utils ansible_utils._process_output:94 PLAY RECAP [localhost] : ok: 82 changed: 31 unreachable: 0 skipped: 5 failed: 2
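
In other words, the "Fail if Engine IP is different from engine's FQDN resolved IP" task boils down to comparing the address the engine's FQDN resolves to with the engine VM IP reported by VDSM, and failing with a readable message on mismatch. A rough, illustrative Python sketch of that comparison follows; this is not the actual role code, and the names and structure are only an approximation.

# Illustrative only -- not the actual ovirt-hosted-engine-setup task code.
# Compares the engine VM IP (as reported by VDSM) with the address the engine
# FQDN resolves to, failing with the kind of message shown in the log above.
import socket
import sys

def check_engine_fqdn(engine_fqdn, engine_vm_ip):
    try:
        resolved_ip = socket.gethostbyname(engine_fqdn)
    except socket.gaierror as exc:
        sys.exit("The engine's FQDN %s does not resolve at all: %s" % (engine_fqdn, exc))
    if resolved_ip != engine_vm_ip:
        sys.exit("Engine VM IP address is %s while the engine's FQDN %s resolves "
                 "to %s. If you are using DHCP, check your DHCP reservation "
                 "configuration" % (engine_vm_ip, engine_fqdn, resolved_ip))

if __name__ == "__main__":
    # Values taken from the log excerpt above:
    check_engine_fqdn("nsednev-he-1.qa.lab.tlv.redhat.com", "10.35.92.55")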

