Bug 1140824 - ovirt-ha-agent goes into D state when the RHEV-M VM is hosting the ISO domain and goes offline
Summary: ovirt-ha-agent goes into D state when the RHEV-M VM is hosting the ISO domain...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Doron Fediuck
QA Contact: Artyom
URL:
Whiteboard: sla
Depends On:
Blocks:
 
Reported: 2014-09-11 18:42 UTC by wdaniel
Modified: 2016-02-10 20:14 UTC
CC: 9 users

Fixed In Version: ovirt-hosted-engine-ha-1.2.2-1.el6ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-02-11 21:09:10 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:




Links
- Red Hat Product Errata RHBA-2015:0194 (normal, SHIPPED_LIVE): ovirt-hosted-engine-ha bug fix and enhancement update; last updated 2015-02-12 01:35:33 UTC
- oVirt gerrit 33725 (master, MERGED): use vdsm to get the storage domain path; last updated 2020-11-12 04:57:21 UTC
- oVirt gerrit 33755 (ovirt-hosted-engine-ha-1.2, MERGED): use vdsm to get the storage domain path; last updated 2020-11-12 04:57:01 UTC

Description wdaniel 2014-09-11 18:42:55 UTC
Description of problem:

I noticed this in my test environment: When I set up the RHEV manager I allowed it to create an ISO domain on the VM for my environment. The ISO domain was attached and active in the datacenter. I put the environment in global maintenance, then took the VM offline. 

In an attempt to check on things, I ran 'hosted-engine --vm-status', which froze. After some inspection it appeared that the ovirt-ha-agent service was stuck in D state, and restarting it, the broker, or vdsm did not help.

Noticed in /var/log/messages during this time (FQDNs removed):

Sep 11 14:19:00 rhevh-11 vdsm scanDomains WARNING Metadata collection for domain path FQDN:_var_lib_exports_iso timedout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 662, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 184, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 150, in _recvAll
    raise Timeout()
Timeout

(The identical warning repeats at 14:19:12.)
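
The traceback comes from vdsm's remote file handler, which performs file operations in helper processes precisely because I/O on a dead NFS mount can sleep uninterruptibly (D state) in the kernel and cannot be killed. A minimal sketch of that pattern, using a hypothetical mount point rather than anything from this bug:

import multiprocessing
import os

def _stat_path(path):
    os.stat(path)  # this call can block forever (in D state) on a dead NFS mount

def path_is_alive(path, timeout=5):
    # Probe the path from a child process so the caller never blocks.
    child = multiprocessing.Process(target=_stat_path, args=(path,))
    child.start()
    child.join(timeout)
    if child.is_alive():
        child.terminate()  # SIGTERM; a truly D-state child only dies once the I/O returns
        child.join(1)      # don't wait forever for it
        return False
    return child.exitcode == 0

if __name__ == "__main__":
    # hypothetical mount point, matching the layout from the log above
    print(path_is_alive("/rhev/data-center/mnt/FQDN:_var_lib_exports_iso"))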

The 'hosted-engine' command would still successfully boot the VM, at which point there was no longer an issue with '--vm-status'. If I put the ISO domain in maintenance mode before taking the VM offline, this problem did not occur.

Version-Release number of selected component (if applicable):

ovirt-hosted-engine-ha-1.1.2-5.el6ev.noarch
vdsm-4.14.11-5.el6ev.x86_64

How reproducible:

Very

Steps to Reproduce:
1. Set up hosted-engine VM containing an ISO domain NFS share
2. Take VM offline while the ISO domain is attached and active in the Data Center
3. Try running 'hosted-engine --vm-status' and watch it freeze
4. Run 'ps aux | grep hosted' to see that the ovirt-ha-agent service is in D state (a sketch for spotting D-state processes follows below)
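
For step 4, D-state processes can also be spotted without ps. A minimal sketch that scans /proc on Linux (the naive parsing is noted in the comments):

import os

def d_state_processes():
    hung = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                fields = f.read().split()
        except (IOError, OSError):  # process exited mid-scan
            continue
        # Field 3 is the state; "D" means uninterruptible sleep.
        # (Naive split: a comm containing spaces would shift the fields.)
        if fields[2] == "D":
            hung.append((int(pid), fields[1].strip("()")))
    return hung

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)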

Actual results:

The ovirt-ha-agent service locks up and is unresponsive until the VM (or rather, the ISO domain) comes back online.

Expected results:

There should be some mechanism in place to check whether the ISO domain's IP address matches that of the hosted-engine VM, and if so, mark the domain as offline while the VM is down (see the sketch below).
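
A minimal sketch of the check proposed above, with hypothetical placeholder hostnames; a real implementation would also need to handle multi-homed hosts and DNS aliases:

import socket

def iso_domain_on_engine_vm(iso_export, engine_fqdn):
    nfs_host = iso_export.split(":", 1)[0]  # "host:/path" -> "host"
    try:
        return socket.gethostbyname(nfs_host) == socket.gethostbyname(engine_fqdn)
    except socket.gaierror:
        return False  # unresolvable; treat as not co-hosted

if __name__ == "__main__":
    # If this returns True while the engine VM is down, the ISO domain should
    # be treated as offline rather than scanned.
    print(iso_domain_on_engine_vm("engine.example.com:/var/lib/exports/iso",
                                  "engine.example.com"))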

Additional info:

I'm not sure the 'Expected results' suggestion is the proper way to handle this, but the fact that 'hosted-engine --vm-status' is completely unresponsive should be taken into consideration. During a maintenance window, if an admin needed to check on the VM's status, it would be impossible to do so (with the hosted-engine script, that is).

Comment 1 Jiri Moskovcak 2014-09-24 14:06:19 UTC
That's interesting; an inaccessible ISO domain should have no impact on the hosted-engine agent or on 'hosted-engine --vm-status'. Can you please check whether the hosted-engine.* files are accessible? They're in /rhev/mnt/<mount point>/<uuid>/ha_agent/
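
Not part of the original comment: a minimal sketch of how that accessibility check can be made hang-proof, by running the listing in a child process and abandoning it on timeout. The path is a hypothetical example of the layout described above:

import subprocess
import time

def ha_files_accessible(ha_agent_dir, timeout=5.0):
    proc = subprocess.Popen(["ls", ha_agent_dir],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    deadline = time.time() + timeout
    while proc.poll() is None:
        if time.time() > deadline:
            proc.kill()   # even SIGKILL is deferred while the child is in D state
            return False  # abandon it rather than wait() and hang ourselves
        time.sleep(0.1)
    if proc.returncode != 0:
        return False
    return b"hosted-engine" in proc.stdout.read()

if __name__ == "__main__":
    # hypothetical path following the layout given in the comment above
    print(ha_files_accessible("/rhev/mnt/example.com:_export/1234-uuid/ha_agent"))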

Comment 4 Jiri Moskovcak 2014-10-01 08:51:09 UTC
I believe we can consider this a dupe of bug #1085523. I really can't imagine a better solution than a timeout plus an error message like: "Storage is not accessible, please check the connection to storage"
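
A minimal sketch of that suggested behaviour; get_vm_status is a hypothetical stand-in for whatever the client side of --vm-status actually calls:

import signal

class StorageTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise StorageTimeout()

def vm_status_with_timeout(get_vm_status, timeout=30):
    # This works when the hang is an interruptible wait (e.g. a socket read
    # to the broker); a true D-state hang inside this same process could not
    # be interrupted by a signal at all.
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        return get_vm_status()
    except StorageTimeout:
        raise SystemExit("Storage is not accessible, "
                         "please check the connection to storage")
    finally:
        signal.alarm(0)  # always cancel any pending alarm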

Comment 5 Sandro Bonazzola 2014-10-03 07:15:38 UTC
The patch has been merged only on upstream master; a backport to the 1.2 branch is missing. Moving back to POST.
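
Judging only by the gerrit subjects linked above, the fix has the agent ask vdsm for the storage domain path instead of scanning the mounts itself. A hedged sketch of that idea, assuming vdsm's classic xmlrpc binding (vdsm.vdscli) and its getStorageDomainInfo verb; the reply's field names are an assumption to verify against your vdsm version:

from vdsm import vdscli  # classic xmlrpc binding shipped with vdsm 4.x

def domain_path(sd_uuid):
    server = vdscli.connect()  # local vdsmd over its default transport
    resp = server.getStorageDomainInfo(sd_uuid)
    if resp["status"]["code"] != 0:
        raise RuntimeError(resp["status"]["message"])
    # For file domains the reply should carry the remote export path; the
    # exact field name here is an assumption, not confirmed by this bug.
    return resp["info"].get("remotePath")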

Comment 7 Artyom 2014-10-06 12:02:13 UTC
Verified on ovirt-hosted-engine-ha-1.2.1-1.el6ev.noarch
Have an engine VM hosting the ISO domain; after the VM was killed, 'hosted-engine --vm-status' works fine, and so does 'df'.

Comment 8 Artyom 2014-10-06 12:02:50 UTC
Also, 'ps aux | grep hosted':
root     16015  0.0  0.0 103252   848 pts/0    S+   14:56   0:00 grep hosted
vdsm     19322  0.1  0.0 244080 15216 ?        S    13:39   0:07 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
vdsm     19362  0.4  0.1 915456 16512 ?        Sl   13:39   0:18 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
Neither the agent nor the broker is in D state.

Comment 11 errata-xmlrpc 2015-02-11 21:09:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html

