Bug 1460982

Summary: [downstream clone - 4.1.5] [TEXT] Error message is confusing when hosted-engine Storage Domain can't be mounted
Product: Red Hat Enterprise Virtualization Manager Reporter: rhev-integ
Component: ovirt-hosted-engine-setup Assignee: Jenny Tokar <jtokar>
Status: CLOSED WORKSFORME QA Contact: Nikolai Sednev <nsednev>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0.6 CC: dfediuck, gveitmic, jtokar, lsurette, mavital, mgoldboi, rmcswain, ykaul, ylavi
Target Milestone: ovirt-4.1.5 Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-hosted-engine-setup-2.1.3.1-1.el7ev Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1434209 Environment:
Last Closed: 2017-08-20 10:50:39 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1434209    
Bug Blocks: 1455341    
Attachments:
Description Flags
Screenshot from 2017-07-05 11-43-24.png none

Description rhev-integ 2017-06-13 10:08:54 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1434209 +++
======================================================================

Description of problem:
Recently my customer opened a support case regarding an error message in their environment. I have also seen this same error in my lab and I, too, found it confusing at first since it doesn't actually tell you the real reason for the error.

# hosted-engine --vm-status
Unable to read vm.conf, please check ovirt-ha-agent logs

The customer reported that they were not able to start the hosted-engine because "the config file is missing" which, while accurate, is not the real reason. The real reason was that the Hosted Engine Storage Domain could not be mounted. After checking the storage connection and restarting the ovirt-ha-* services, the customer was able to launch the hosted-engine VM.


Version-Release number of selected component (if applicable):
4.0.6

How reproducible:
Very

Steps to Reproduce:
1. Create a hosted-engine environment hosted on NFS storage and cause an outage of some kind where the NFS mount doesn't get properly unmounted or can't be mounted at all.
2. Restart the ovirt-ha-agent and ovirt-ha-broker services.
3. No errors on ovirt-ha-* service restart. Only later in 'systemctl status ovirt-ha-agent ovirt-ha-broker' is the real error seen.
4. Run 'hosted-engine --vm-status'.

Actual results:
Error message is confusing and doesn't indicate a storage domain connection issue.

Expected results:
Error message should be more descriptive and provide users the proper commands to run (or perhaps a link to the attached kbase for more detailed resolution steps).
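
A minimal sketch of the kind of check being requested, assuming the status command can tell whether the storage domain mount point is present. The function name `describe_vm_conf_problem` and both of its parameters are hypothetical and are not taken from ovirt-hosted-engine-setup. The storage-related wording is the message reported later in this bug on the 2.2 master build.

```python
import os


def describe_vm_conf_problem(vm_conf_path, storage_mount_point):
    """Return a user-facing message, or None when vm.conf is readable.

    Both arguments are hypothetical inputs for this sketch; the real
    tool may locate the file and the mount point differently.
    """
    if os.path.isfile(vm_conf_path) and os.access(vm_conf_path, os.R_OK):
        return None
    if not os.path.ismount(storage_mount_point):
        # vm.conf is unreadable and the storage domain is not mounted,
        # so point the user at storage rather than at the config file.
        return (
            "The hosted engine configuration has not been retrieved from "
            "shared storage. Please ensure that ovirt-ha-agent is running "
            "and the storage server is reachable."
        )
    # Storage is mounted, so the cause lies elsewhere; keep the old hint.
    return "Unable to read vm.conf, please check ovirt-ha-agent logs"
```

The point of the sketch is only the ordering: the mount state is checked before falling back to the generic vm.conf message.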

Additional info:
https://access.redhat.com/solutions/2973011

(Originally by Bryan Yount)

Comment 4 Nikolai Sednev 2017-07-02 13:32:46 UTC
I'm getting this error:
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

1) Ran on alma04: "iptables -A OUTPUT -p udp -d <ip address of storage> --dport 2049 -j DROP" and "iptables -A OUTPUT -p tcp -d <ip address of storage> --dport 2049 -j DROP".
2) Ran on alma04: "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent". The host then became unresponsive for several minutes (it appeared to be stuck).
3) Ran on alma04:
alma04 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 173, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 103, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

For a detailed explanation of NFS and the ports it uses, please refer to
https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html.
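
The two blocking rules from step 1 of this comment can be generated for any storage address with a small helper. This is only a sketch: the function name `nfs_block_rules` is hypothetical, the address in the usage lines is a documentation placeholder, and nothing is executed against iptables.

```python
def nfs_block_rules(storage_ip, port=2049):
    """Build iptables argv lists that drop outgoing NFS traffic.

    Mirrors the two rules quoted in step 1 (one for UDP, one for TCP).
    The lists are only built here, never run.
    """
    return [
        ["iptables", "-A", "OUTPUT", "-p", proto,
         "-d", storage_ip, "--dport", str(port), "-j", "DROP"]
        for proto in ("udp", "tcp")
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address, not the storage server
    # used in this bug.
    for rule in nfs_block_rules("192.0.2.10"):
        print(" ".join(rule))
```

Each returned list could be passed to `subprocess.check_call` as root to apply the rule; replacing `-A` with `-D` in the same list would remove it again.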

Components on hosts:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

I did not see anything like: "The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable."
Could you please provide more information on this?

Comment 5 Nikolai Sednev 2017-07-02 13:35:30 UTC
Moving back to ASSIGNED, as the error received differs from the expected one.

Comment 7 Jenny Tokar 2017-07-05 08:09:13 UTC
The tested scenario is a bit different: here /var/run/ovirt-hosted-engine-ha/vm.conf exists on the system, so the check that shows the error passes, and the error is thrown from another place in the code.
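
One way to cover this second code path is to catch the broker error where the host statistics are fetched and turn it into the same user-facing message. The following is a sketch under stated assumptions, not the actual fix: `RequestError` is a local stand-in for `ovirt_hosted_engine_ha.lib.exceptions.RequestError`, and `status_or_message` and its callable argument are hypothetical names.

```python
class RequestError(Exception):
    """Local stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError."""


STORAGE_HINT = (
    "The hosted engine configuration has not been retrieved from "
    "shared storage. Please ensure that ovirt-ha-agent is running "
    "and the storage server is reachable."
)


def status_or_message(get_all_host_stats):
    """Return (True, stats) on success or (False, message) on broker failure.

    `get_all_host_stats` is any zero-argument callable; in this sketch it
    stands in for the client call that appears in the traceback.
    """
    try:
        return True, get_all_host_stats()
    except RequestError as exc:
        # Keep the low-level reason, but lead with an actionable hint
        # instead of letting the traceback reach the user.
        return False, "{0}\nDetails: {1}".format(STORAGE_HINT, exc)
```

With this shape, the vm.conf check and the broker call would both end in the same message when storage is unreachable.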

Comment 8 Nikolai Sednev 2017-07-05 08:17:08 UTC
(In reply to Jenny Tokar from comment #7)
> The tested scenario is a bit different, here
> /var/run/ovirt-hosted-engine-ha/vm.conf exists in the system so the check
> that shows the error is passed successfully and the error is thrown from
> another place in the code.

The reproduction was done correctly. I caused a storage outage using iptables, which is a perfectly valid imitation of a storage disconnection that could happen either somewhere in the network or by physically disconnecting the storage, and the error thrown was not the expected one.

Comment 9 Nikolai Sednev 2017-07-05 08:43:44 UTC
The same error appears if you disable the NIC, run "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent", and then run "hosted-engine --vm-status" on the host with the HE-VM.

Comment 10 Nikolai Sednev 2017-07-05 08:44:46 UTC
Created attachment 1294501 [details]
Screenshot from 2017-07-05 11-43-24.png

Comment 11 Jenny Tokar 2017-07-05 09:36:57 UTC
I didn't say it was incorrect, just that it causes a different reaction. 

This will be fixed by a different error message.

Comment 13 Nikolai Sednev 2017-08-16 15:30:44 UTC
I'm still getting the same error:
# iptables -A OUTPUT -p udp -d 10.35.80.5 --dport 2049 -j DROP
# iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP
# hosted-engine --vm-statussystemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent && 
> 
> 
> 
> 
> ^C
[root@puma18 ~]# hosted-engine --vm-status && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent && hosted-engine --vm-status 
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out

Comment 14 Nikolai Sednev 2017-08-16 15:31:45 UTC
Version-Release number of selected component:
ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.5-1.el7ev.noarch
vdsm-4.19.27-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.2.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.2.x86_64

Comment 15 Nikolai Sednev 2017-08-17 08:00:14 UTC
I've just tested this on ovirt-hosted-engine-setup-2.2.0-0.0.master.20170814052558.git066c94c.el7.centos.noarch, and there it appears to work fine, although with a long delay:

[root@alma03 ~]# iptables -A OUTPUT -p udp -d <ip of SHE's storage server here> --dport 2049 -j DROP && iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent
.
Some delay of several minutes happens here...
.
[root@alma03 ~]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

I've also noticed these error messages in /var/log/messages:
Aug 17 10:42:01 alma03 kernel: watchdog watchdog0: watchdog did not stop!
Aug 17 10:42:01 alma03 wdmd[698]: /dev/watchdog0 closed unclean

Please see my reproduction on the latest upstream build.
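
A quick TCP probe with a short timeout could, in principle, detect the blocked NFS port within seconds instead of after the delay described above. This is a sketch only, written for Python 3: the function name `nfs_port_reachable` and the default timeout are assumptions, and it checks TCP only, so it says nothing about the UDP rule.

```python
import socket


def nfs_port_reachable(host, port=2049, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    A DROP rule makes the connect attempt time out, and a closed port
    refuses it; both cases surface as OSError in Python 3.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    conn.close()
    return True
```

For example, `nfs_port_reachable("<ip of SHE's storage server here>")` could be called before the status query to decide whether to show the storage message immediately.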

Comment 17 Doron Fediuck 2017-08-20 10:50:39 UTC
I do not see the point in investing more time in an error message that mostly works.

Comment 18 Nikolai Sednev 2017-08-20 13:23:47 UTC
Doron, is this working as expected if I'm getting this on the latest 4.1.5?
Wasn't this bug about fixing the error message received after running the "hosted-engine --vm-status" command on a host without connectivity to its hosted-engine storage domain?
The error message I'm getting is still the same as when this bug was opened:

[root@puma18 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out

Comment 19 Doron Fediuck 2017-09-03 10:17:57 UTC
(In reply to Nikolai Sednev from comment #18)
> Doron, is this working as expected if I'm getting this on latest 4.1.5?
> Wasn't this bug about fixing the error message received after running
> "hosted-engine --vm-status" command on host without connectivity to its
> hosted-engine storage domain?
> The error message which I'm getting is still the same as when this bug was
> opened:

I didn't say it works as expected. I said we have invested too much time in it and we can live with the message below, so it works for me:

> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage
> domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid':
> '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out