Description of problem: I noticed this in my test environment: When I set up the RHEV manager I allowed it to create an ISO domain on the VM for my environment. The ISO domain was attached and active in the datacenter. I put the environment in global maintenance, then took the VM offline. In an attempt to check on things, I ran 'hosted-engine --vm-status', which froze. After some inspection it appeared that the ovirt-ha-agent service was stuck in D state, and restarting it, the broker, or vdsm did not help.

Noticed in /var/log/messages during this time (FQDNs removed; the identical traceback was logged again at 14:19:12):

Sep 11 14:19:00 rhevh-11 vdsm scanDomains WARNING Metadata collection for domain path FQDN:_var_lib_exports_iso timedout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 662, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 150, in _recvAll
    raise Timeout()
Timeout

The 'hosted-engine' command would successfully boot the VM, at which point there was no longer an issue with '--vm-status'.
If I put the ISO domain in maintenance mode before taking the VM offline, this problem did not occur.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.1.2-5.el6ev.noarch
vdsm-4.14.11-5.el6ev.x86_64

How reproducible:
Very

Steps to Reproduce:
1. Set up a hosted-engine VM containing an ISO domain NFS share
2. Take the VM offline while the ISO domain is attached and active in the Data Center
3. Try running 'hosted-engine --vm-status' and watch it freeze
4. Run 'ps aux | grep hosted' to see that the ovirt-ha-agent service is in D state

Actual results:
The ovirt-ha-agent service locks up and is unresponsive until the VM (really, the ISO domain) comes back online.

Expected results:
There should be some mechanism in place to check whether the ISO domain's IP address matches the IP of the HE VM, and if so, mark the domain as offline when the VM is not online.

Additional info: I'm not sure the 'Expected results' above is the proper way to handle this, but the fact that 'hosted-engine --vm-status' is completely unresponsive should be taken into consideration. During a maintenance window, if an admin needed to check on the VM's status it would be impossible to do so (with the hosted-engine script, that is).
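The check suggested under 'Expected results' could be sketched roughly as follows. This is only an illustration, not vdsm or agent code; the function name and the "server:/export" connection-string format are assumptions based on how the domain path appears in the log above.

```python
import socket

def iso_domain_hosted_on_he_vm(iso_domain_path, he_vm_fqdn):
    """Hypothetical helper: return True if the ISO domain's NFS server
    resolves to the same address as the hosted-engine VM."""
    # Connection strings in the log look like "server:/export/path"
    server = iso_domain_path.split(":", 1)[0]
    try:
        iso_addrs = {ai[4][0] for ai in socket.getaddrinfo(server, None)}
        vm_addrs = {ai[4][0] for ai in socket.getaddrinfo(he_vm_fqdn, None)}
    except socket.gaierror:
        # Name resolution failed; we cannot claim the domain lives on the VM
        return False
    return bool(iso_addrs & vm_addrs)
```

If this returned True while the VM is down, the agent could skip (or immediately fail) the metadata scan for that domain instead of blocking on the dead NFS mount.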
That's interesting; the inaccessible ISO domain should have no impact on the hosted-engine agent or on 'hosted-engine --vm-status'. Can you please check whether the hosted-engine.* files are accessible? They're in /rhev/mnt/<mount point>/<uuid>/ha_agent/
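One caveat when checking those files: on a wedged NFS mount a plain stat() can itself block in D state, which is exactly the symptom reported here. A hedged sketch of a probe that cannot hang the caller (the helper name is illustrative, not part of any oVirt tool):

```python
import os
import threading

def probe_path(path, timeout=5.0):
    """Return True if path is accessible, False if not, or None if the
    stat() call is still blocked after the timeout (e.g. a dead NFS mount)."""
    result = []

    def worker():
        try:
            os.stat(path)
            result.append(True)
        except OSError:
            result.append(False)

    # Daemon thread: if stat() wedges in D state we abandon it rather
    # than block the caller; the thread cannot be forcibly killed.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if not result:
        return None
    return result[0]
```

A None result would indicate the ha_agent directory is on an unresponsive mount, which would explain the agent's D state.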
I believe we can consider this a dupe of bug #1085523. I really can't imagine a better solution than just timing out and showing an error message like: "Storage is not accessible, please check the connection to storage"
The patch has been merged only on upstream master; a backport to the 1.2 branch is still missing. Moving back to POST.
Verified on ovirt-hosted-engine-ha-1.2.1-1.el6ev.noarch. I had an engine VM with an ISO domain; after the VM was killed, 'hosted-engine --vm-status' works fine, and so does df.
Also, 'ps aux | grep hosted':

root     16015  0.0  0.0 103252   848 pts/0    S+   14:56   0:00 grep hosted
vdsm     19322  0.1  0.0 244080 15216 ?        S    13:39   0:07 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
vdsm     19362  0.4  0.1 915456 16512 ?        Sl   13:39   0:18 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

No process is in D state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0194.html