+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1434209 +++
======================================================================

Description of problem:
Recently my customer opened a support case regarding an error message in their environment. I have also seen this same error in my lab and I, too, found it confusing at first since it doesn't actually tell you the real reason for the error.

# hosted-engine --vm-status
Unable to read vm.conf, please check ovirt-ha-agent logs

The customer reported that they were not able to start the hosted-engine because "the config file is missing" which, while accurate, is not the real reason. The real reason was that the Hosted Engine Storage Domain could not be mounted. After checking the storage connection and restarting the ovirt-ha-* services, the customer was able to launch the hosted-engine VM.

Version-Release number of selected component (if applicable):
4.0.6

How reproducible:
Very

Steps to Reproduce:
1. Create a hosted-engine environment hosted on NFS storage and cause an outage of some kind where the NFS mount doesn't get properly unmounted or can't be mounted at all.
2. Restart the ovirt-ha-agent and ovirt-ha-broker services.
3. No errors on ovirt-ha-* service restart. Only later in 'systemctl status ovirt-ha-agent ovirt-ha-broker' is the real error seen.
4. Run 'hosted-engine --vm-status'.

Actual results:
Error message is confusing and doesn't indicate a storage domain connection issue.

Expected results:
Error message should be more descriptive and provide users the proper commands to run (or perhaps a link to the attached kbase for more detailed resolution steps).

Additional info:
https://access.redhat.com/solutions/2973011

(Originally by Bryan Yount)
I'm getting this error:
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

1) Ran on alma04:
iptables -A OUTPUT -p udp -d <ip address of storage> --dport 2049 -j DROP
iptables -A OUTPUT -p tcp -d <ip address of storage> --dport 2049 -j DROP

2) Ran on alma04 "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent", which left the host unresponsive for several minutes (it looked like it was stuck).

3) alma04 ~]# hosted-engine --vm-status
alma04 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 173, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 103, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '2e359244-de14-454c-b4aa-5f289da01e92'}: Connection timed out

For a detailed explanation of NFS and its ports, please refer to https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html.

Components on hosts:
qemu-kvm-rhev-2.9.0-14.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
mom-0.5.9-1.el7ev.noarch
ovirt-imageio-daemon-1.0.0-0.el7ev.noarch
ovirt-setup-lib-1.1.3-1.el7ev.noarch
ovirt-imageio-common-1.0.0-0.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.1.4-1.el7ev.noarch
libvirt-client-3.2.0-14.el7.x86_64
ovirt-hosted-engine-setup-2.1.3.2-1.el7ev.noarch
sanlock-3.5.0-1.el7.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
Linux version 3.10.0-663.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-14) (GCC) ) #1 SMP Tue May 2 16:00:29 EDT 2017
Linux 3.10.0-663.el7.x86_64 #1 SMP Tue May 2 16:00:29 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.4 (Maipo)

I did not see anything like:
"The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable."

Could you please provide more information on this?
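
To confirm that the iptables rules in step 1 really cut the host off from the NFS server, TCP port 2049 can be probed with a short timeout before running hosted-engine --vm-status. The following is only a minimal sketch using the Python standard library. The helper name nfs_port_reachable is hypothetical and not part of oVirt, and it covers only the TCP rule, not the UDP one.

```python
import socket


def nfs_port_reachable(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical diagnostic helper, not part of any oVirt package.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. a DROP rule) and
        # unreachable hosts alike.
        return False
```

With the DROP rules in place the call should return False after roughly the timeout, which matches the "Connection timed out" reported by the broker.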
Moving back to ASSIGNED, as the error received differs from the expected one.
The tested scenario is a bit different: here /var/run/ovirt-hosted-engine-ha/vm.conf exists on the system, so the check that shows the error passes and the error is thrown from another place in the code.
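
The two failure paths described here can be illustrated with a short sketch. This is not the actual ovirt-hosted-engine-setup code: vm_status and fetch_stats are hypothetical names, and RequestError is a local stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError. It only shows how a missing vm.conf and a broker-side storage failure could both end in the same descriptive message rather than a traceback.

```python
import os

# Path quoted in this comment.
VM_CONF = "/var/run/ovirt-hosted-engine-ha/vm.conf"

STORAGE_HINT = (
    "The hosted engine configuration has not been retrieved from shared "
    "storage. Please ensure that ovirt-ha-agent is running and the storage "
    "server is reachable."
)


class RequestError(Exception):
    """Local stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError."""


def vm_status(fetch_stats, vm_conf=VM_CONF):
    """Return (ok, text). fetch_stats is a hypothetical callable that asks
    the broker for host statistics."""
    # Path 1: vm.conf is missing, the case covered by the existing check.
    if not os.path.exists(vm_conf):
        return False, STORAGE_HINT
    # Path 2: vm.conf exists, but the broker cannot reach the storage domain.
    try:
        return True, fetch_stats()
    except RequestError as exc:
        return False, "{} ({})".format(STORAGE_HINT, exc)
```

In the scenario tested above, the second path is the one taken: the file check passes and the RequestError is raised later, which is why the traceback appears.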
(In reply to Jenny Tokar from comment #7)
> The tested scenario is a bit different, here
> /var/run/ovirt-hosted-engine-ha/vm.conf exists in the system so the check
> that shows the error is passed successfully and the error is thrown from
> another place in the code.

The reproduction was done correctly. I caused a storage outage using iptables, which is a perfectly valid imitation of a storage disconnection that could happen either somewhere within the network or by physically disconnecting the storage, and the error thrown was not the expected one.
The same error appears if you disable the NIC, then run "systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent" and then "hosted-engine --vm-status" on the host with the HE-VM.
Created attachment 1294501 [details] Screenshot from 2017-07-05 11-43-24.png
I didn't say it was incorrect, just that it causes a different reaction. This will be fixed by a different error message.
I'm still getting the same error:

# iptables -A OUTPUT -p udp -d 10.35.80.5 --dport 2049 -j DROP
# iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP
# hosted-engine --vm-statussystemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent &&
>
>
>
>
> ^C
[root@puma18 ~]# hosted-engine --vm-status && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent && hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out
Version-Release number of selected component:
ovirt-hosted-engine-setup-2.1.3.6-1.el7ev.noarch
ovirt-hosted-engine-ha-2.1.5-1.el7ev.noarch
vdsm-4.19.27-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64
ovirt-host-deploy-1.6.6-1.el7ev.noarch
libvirt-client-3.2.0-14.el7_4.2.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.2.x86_64
I've just tested this on ovirt-hosted-engine-setup-2.2.0-0.0.master.20170814052558.git066c94c.el7.centos.noarch and there it appears to be working fine, although with a huge delay:

[root@alma03 ~]# iptables -A OUTPUT -p udp -d <ip of SHE's storage server here> --dport 2049 -j DROP && iptables -A OUTPUT -p tcp -d 10.35.80.5 --dport 2049 -j DROP && systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent
.
A delay of several minutes happens here...
.
[root@alma03 ~]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

I also noticed these error messages in /var/log/messages:
Aug 17 10:42:01 alma03 kernel: watchdog watchdog0: watchdog did not stop!
Aug 17 10:42:01 alma03 wdmd[698]: /dev/watchdog0 closed unclean

Please see my reproduction on the latest upstream build.
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88aWRkcXBKa0RuOHM/view?usp=sharing
I do not see the point in investing more time in an error message that mostly works.
Doron, is this working as expected if I'm getting this on the latest 4.1.5?
Wasn't this bug about fixing the error message received after running the "hosted-engine --vm-status" command on a host without connectivity to its hosted-engine storage domain?
The error message I'm getting is still the same as when this bug was opened:

[root@puma18 ~]# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 180, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 104, in print_status
    all_host_stats = self._get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 73, in _get_all_host_stats
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 177, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out
(In reply to Nikolai Sednev from comment #18)
> Doron, is this working as expected if I'm getting this on latest 4.1.5?
> Wasn't this bug on fixing for the error message received after running
> "hosted-engine --vm-status" command on host without connectivity to its
> hosted-engine storage domain?
> The error message which I'm getting is still the same as when this bug was
> opened:

I didn't say it works as expected. I said we have invested too much time in it and we can live with the message below, so it works for me:

> ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage
> domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid':
> '22488d8d-6f11-4c9d-b129-9a6e492d7a16'}: Connection timed out