Bug 1460982 - [downstream clone - 4.1.5] [TEXT] Error message is confusing when hosted-engine Storage Domain can't be mounted
Summary: [downstream clone - 4.1.5] [TEXT] Error message is confusing when hosted-engi...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 4.0.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.5
Target Release: ---
Assignee: Jenny Tokar
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1434209
Blocks: 1455341
 
Reported: 2017-06-13 10:08 UTC by rhev-integ
Modified: 2020-05-14 16:02 UTC (History)

Fixed In Version: ovirt-hosted-engine-setup-2.1.3.1-1.el7ev
Doc Type: No Doc Update
Doc Text:
Clone Of: 1434209
Environment:
Last Closed: 2017-08-20 10:50:39 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Screenshot from 2017-07-05 11-43-24.png (160.11 KB, image/png)
2017-07-05 08:44 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2973011 0 None None None 2017-06-13 10:09:16 UTC
Red Hat Product Errata RHBA-2017:2525 0 normal SHIPPED_LIVE ovirt-hosted-engine-setup bug fix update for RHV 4.1.5 2017-08-22 21:41:00 UTC
oVirt gerrit 77959 0 master MERGED Provide clearer error when failing to retrieve vm.conf from storage. 2020-11-23 13:32:39 UTC
oVirt gerrit 78131 0 ovirt-hosted-engine-setup-2.1 MERGED Provide clearer error when failing to retrieve vm.conf from storage. 2020-11-23 13:32:16 UTC
oVirt gerrit 80380 0 ovirt-hosted-engine-setup-2.1 MERGED Provide clearer error when failing to connect to storage domain. 2020-11-23 13:32:15 UTC

Description rhev-integ 2017-06-13 10:08:54 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1434209 +++
======================================================================

Description of problem:
Recently my customer opened a support case regarding an error message in their environment. I have also seen this same error in my lab and I, too, found it confusing at first since it doesn't actually tell you the real reason for the error.

# hosted-engine --vm-status
Unable to read vm.conf, please check ovirt-ha-agent logs

The customer reported that they were not able to start the hosted-engine because "the config file is missing" which, while accurate, is not the real reason. The real reason was that the Hosted Engine Storage Domain could not be mounted. After checking the storage connection and restarting the ovirt-ha-* services, the customer was able to launch the hosted-engine VM.


Version-Release number of selected component (if applicable):
4.0.6

How reproducible:
Very

Steps to Reproduce:
1. Create a hosted-engine environment hosted on NFS storage and cause an outage of some kind where the NFS mount doesn't get properly unmounted or can't be mounted at all.
2. Restart the ovirt-ha-agent and ovirt-ha-broker services.
3. No errors on ovirt-ha-* service restart. Only later in 'systemctl status ovirt-ha-agent ovirt-ha-broker' is the real error seen.
4. Run 'hosted-engine --vm-status'.

Actual results:
Error message is confusing and doesn't indicate a storage domain connection issue.

Expected results:
Error message should be more descriptive and provide users the proper commands to run (or perhaps a link to the attached kbase for more detailed resolution steps).

Additional info:
https://access.redhat.com/solutions/2973011

(Originally by Bryan Yount)
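
The expected behavior described above can be illustrated with a small sketch. This is not the project's code: the function name explain_vm_conf_state and the mount_point parameter are hypothetical, the vm.conf path is the one mentioned later in comment 7, and the message text is the one quoted in comment 15. It only shows the idea of telling "file missing" apart from "storage not mounted" before reporting to the user.

```python
import os

# Path mentioned in comment 7 of this bug.
VM_CONF = "/var/run/ovirt-hosted-engine-ha/vm.conf"

# Message text quoted in comment 15 of this bug.
STORAGE_HINT = (
    "The hosted engine configuration has not been retrieved from shared "
    "storage. Please ensure that ovirt-ha-agent is running and the storage "
    "server is reachable."
)


def explain_vm_conf_state(vm_conf=VM_CONF, mount_point=None):
    """Return None if vm.conf is present, else a user-facing explanation.

    mount_point is a hypothetical parameter: the directory where the
    hosted-engine storage domain is expected to be mounted.
    """
    if os.path.exists(vm_conf):
        return None
    # vm.conf is missing: name the likely root cause, not only the symptom.
    if mount_point is not None and not os.path.ismount(mount_point):
        return STORAGE_HINT + " (%s is not mounted.)" % mount_point
    return STORAGE_HINT
```

A caller would print the returned text and exit with a non-zero status instead of reporting only that vm.conf could not be read.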

Comment 4 Nikolai Sednev 2017-07-02 13:32:46 UTC
I'm getting this error:
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

1) Ran on alma04: "iptables -A OUTPUT -p udp -d <ip address of storage> --dport 2049 -j DROP" and "iptables -A OUTPUT -p tcp -d <ip address of storage> --dport 2049 -j DROP".
2) Ran on alma04: "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent". This left the host unresponsive for several minutes (it appeared to be stuck).
3) Ran "hosted-engine --vm-status" on alma04:
alma04 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 173, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 103, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

For a detailed explanation of NFS and its ports, please refer to
https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html.

Components on hosts:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

I did not see anything like: "The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable."
Could you please provide more information on this?
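
The traceback in this comment reaches the user because nothing between set_storage_domain and the command-line entry point catches the exception. A minimal sketch of handling it at that boundary follows. It makes several assumptions: RequestError here is a local stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError, print_status_safely is a hypothetical helper, and the wording of STORAGE_CONNECT_HINT is invented for illustration.

```python
import sys


class RequestError(Exception):
    """Local stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError."""


# Hypothetical wording, not taken from the product.
STORAGE_CONNECT_HINT = (
    "Failed to connect to the hosted-engine storage domain. "
    "Please check that the storage server is reachable."
)


def print_status_safely(get_all_host_stats, out=None):
    """Call get_all_host_stats(); return True on success, False on failure.

    get_all_host_stats is any callable that returns a dict of host stats
    and may raise RequestError when the broker cannot reach the storage.
    """
    if out is None:
        out = sys.stderr
    try:
        stats = get_all_host_stats()
    except RequestError as exc:
        # Show a readable hint plus the low-level detail, not a traceback.
        out.write("%s\nDetails: %s\n" % (STORAGE_CONNECT_HINT, exc))
        return False
    out.write("%d host(s) reported\n" % len(stats))
    return True
```

Keeping the original exception text in the "Details" line preserves the information that is useful for support cases, such as the "Connection timed out" reason.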

Comment 5 Nikolai Sednev 2017-07-02 13:35:30 UTC
Moving back to ASSIGNED, as the error received is different from the expected one.

Comment 7 Jenny Tokar 2017-07-05 08:09:13 UTC
The tested scenario is a bit different. Here, /var/run/ovirt-hosted-engine-ha/vm.conf exists on the system, so the check that shows the error passes and the error is thrown from another place in the code.

Comment 8 Nikolai Sednev 2017-07-05 08:17:08 UTC
(In reply to Jenny Tokar from comment #7)
> The tested scenario is a bit different, here
> /var/run/ovirt-hosted-engine-ha/vm.conf exists in the system so the check
> that shows the error is passed successfully and the error is thrown from
> another place in the code.

The reproduction was done correctly. I caused a storage outage using iptables, which is an entirely plausible imitation of a storage disconnection, one that could happen either somewhere within the network or by physically disconnecting the storage. The error thrown was not the expected one.
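
The iptables rules used for this reproduction drop outgoing traffic to port 2049 on the storage server. A quick way to confirm that kind of outage from the host is a TCP probe with a timeout, sketched below. The function name nfs_port_reachable is hypothetical. The probe covers TCP only, so it says nothing about the UDP rule, and a DROP rule will typically make it wait for the full timeout rather than fail immediately.

```python
import socket


def nfs_port_reachable(host, port=2049, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, socket.error):
        # Timed out (packets dropped), refused, or host unresolvable.
        return False
    conn.close()
    return True
```

For example, nfs_port_reachable("<ip address of storage>") would be expected to return False while the DROP rules are in place.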

Comment 9 Nikolai Sednev 2017-07-05 08:43:44 UTC
The same errors are thrown if you disable the NIC, then run "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent", and then run "hosted-engine --vm-status" on the host with the HE-VM.

Comment 10 Nikolai Sednev 2017-07-05 08:44:46 UTC
Created attachment 1294501 [details]
Screenshot from 2017-07-05 11-43-24.png

Comment 11 Jenny Tokar 2017-07-05 09:36:57 UTC
I didn't say it was incorrect, just that it causes a different reaction. 

This will be fixed by a different error message.

Comment 13 Nikolai Sednev 2017-08-16 15:30:44 UTC
I'm still getting the same error:
# iptables -A OUTPUT -p udp -d 10.35.80.5 --dport 2049 -j DROP
# iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP
# hosted-engine --vm-statussystemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent && 
> 
> 
> 
> 
> ^C
[root@puma18 ~]# hosted-engine --vm-status && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent && hosted-engine --vm-status 
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out

Comment 14 Nikolai Sednev 2017-08-16 15:31:45 UTC
Version-Release number of selected component:
ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.5-1.el7ev.noarch
vdsm-4.19.27-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.2.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.2.x86_64

Comment 15 Nikolai Sednev 2017-08-17 08:00:14 UTC
I've just tested this on ovirt-hosted-engine-setup-2.2.0-0.0.master.20170814052558.git066c94c.el7.centos.noarch and there it appears to be working fine, although with a huge delay:

[root@alma03 ~]# iptables -A OUTPUT -p udp -d <ip of SHE's storage server here> --dport 2049 -j DROP && iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent
.
Some delay of several minutes happens here...
.
[root@alma03 ~]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

I've also noticed these error messages in /var/log/messages:
Aug 17 10:42:01 alma03 kernel: watchdog watchdog0: watchdog did not stop!
Aug 17 10:42:01 alma03 wdmd[698]: /dev/watchdog0 closed unclean

Please see my reproduction on the latest upstream build.
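
Regarding the delay of several minutes seen above: one general way to keep a status command responsive is to bound the wait on the blocking call and report a failure once the limit is reached. The sketch below is only an illustration of that idea and is not how hosted-engine behaves. The helper call_with_timeout is hypothetical, and the worker thread keeps running in the background after the timeout.

```python
import threading


def call_with_timeout(func, timeout):
    """Run func() in a daemon thread and wait at most timeout seconds.

    Returns (True, result) if func finished in time, (False, None) if not.
    An exception raised by func is re-raised in the caller.
    """
    box = {}

    def worker():
        try:
            box["result"] = func()
        except Exception as exc:
            box["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # do not block interpreter exit on a stuck call
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box.get("result")
```

A caller could print a storage-related hint when the first element of the returned tuple is False, instead of leaving the user at a silent prompt.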

Comment 17 Doron Fediuck 2017-08-20 10:50:39 UTC
I do not see the point in investing more time in an error message that mostly works.

Comment 18 Nikolai Sednev 2017-08-20 13:23:47 UTC
Doron, is this working as expected if I'm getting this on the latest 4.1.5?
Wasn't this bug about fixing the error message received after running the "hosted-engine --vm-status" command on a host without connectivity to its hosted-engine storage domain?
The error message I'm getting is still the same as when this bug was opened:

[root@puma18 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out

Comment 19 Doron Fediuck 2017-09-03 10:17:57 UTC
(In reply to Nikolai Sednev from comment #18)
> Doron, is this working as expected if I'm getting this on latest 4.1.5?
> Wasn't this bug on fixing for the error message received after running
> "hosted-engine --vm-status" command on host without connectivity to its
> hosted-engine storage domain?
> The error message which I'm getting is still the same as when this bug was
> opened:

I didn't say it works as expected. I said we have invested too much time in it and we can live with the message below, so it works for me:

> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage
> domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid':
> '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out

