Bug 1286750

Summary: ovirt-hosted-engine-ha-agent stopped due to timeout during domain acquisition
Product: [oVirt] ovirt-hosted-engine-ha Reporter: SATHEESARAN <sasundar>
Component: General Assignee: Martin Sivák <msivak>
Status: CLOSED WORKSFORME QA Contact: Ilanit Stein <istein>
Severity: urgent Docs Contact:
Priority: medium    
Version: 1.3.0 CC: bugs, dfediuck, gklein, lsurette, nsoffer, rgolan, sasundar, ykaul
Target Milestone: ovirt-4.0.0-alpha Flags: dfediuck: ovirt-4.0.0?
rule-engine: planning_ack?
sasundar: devel_ack?
rule-engine: testing_ack?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-10 17:39:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1258386    
Attachments:
Description Flags
agent.log from host-3 none

Description SATHEESARAN 2015-11-30 16:18:34 UTC
Description of problem:
-----------------------
Hosted-engine was deployed on a RHEL 7.2 server, and two additional hosts were then added to the hosted-engine setup. ovirt-ha-agent was not running on one of the nodes in the cluster, and there were errors due to a timeout during domain acquisition. The same error was also seen on the other nodes, but ovirt-ha-agent was still running on those nodes.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6 Beta1

How reproducible:
------------------
Never tried to reproduce

Steps to Reproduce:
-------------------
1. Deploy hosted-engine on a RHEL 7.2 server (with gluster as the backend)
2. Add 2 more hosts to the setup using the hosted-engine deploy script
3. After all three nodes are deployed, check the status of ovirt-ha-agent
(i.e., service ovirt-ha-agent status)

Actual results:
---------------
ovirt-ha-agent was not running

Expected results:
-----------------
ovirt-ha-agent should be running

Additional info:
-----------------
Below is a snippet of the errors seen in agent.log:

<snip>
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::INFO::2015-11-30 14:51:49,869::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
</snip>
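
For anyone triaging similar logs, here is a minimal sketch of a parser for these records. The field layout (thread, level, timestamp, module, line number, logger, function, message, separated by "::") is inferred from the snippet above rather than taken from the agent's logging configuration, and the names parse_agent_line, LINE_RE and DOMAIN_RE are made up for this example.

```python
import re
from datetime import datetime

# Layout inferred from the snippet above (an assumption, not a documented format):
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<function>[^)]+)\) (?P<message>.*)$"
)
# Storage domain UUID and host id as they appear in the error message.
DOMAIN_RE = re.compile(r"sd_uuid=(?P<sd_uuid>[0-9a-f-]{36}), host_id=(?P<host_id>\d+)")


def parse_agent_line(line):
    """Return a dict of fields, or None for lines that are not log records
    (for example the traceback lines between records)."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    record["lineno"] = int(record["lineno"])
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    domain = DOMAIN_RE.search(record["message"])
    if domain:
        record["sd_uuid"] = domain.group("sd_uuid")
        record["host_id"] = int(domain.group("host_id"))
    return record


sample = (
    "MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::790::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
    "(_initialize_domain_monitor) Failed to start monitoring domain "
    "(sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): "
    "timeout during domain acquisition"
)
record = parse_agent_line(sample)
print(record["level"], record["function"], record["sd_uuid"], record["host_id"])
```

Filtering the parsed records on level and sd_uuid makes it easier to see how many acquisition timeouts preceded the shutdown on each host.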

Comment 1 SATHEESARAN 2015-12-03 08:40:25 UTC
In my recent testing, I saw the same errors in agent.log, but the agent was still running.

Below is a snippet of those errors:

<snip>
MainThread::ERROR::2015-12-03 13:52:53,623::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition

</snip>

Comment 2 SATHEESARAN 2015-12-03 08:42:00 UTC
Created attachment 1101707 [details]
agent.log from host-3

This log file contains the domain acquisition timeout errors.

Comment 3 Martin Sivák 2016-02-10 09:58:37 UTC
Nir, how much time can VDSM take to acquire the domain lock? Our current timeout is 4 minutes and we try that 5 times. Plus systemd will restart the agent anyway so we will try until it succeeds.

SATHEESARAN: Do you still by any chance have the vdsm.log from the machine? Or can you describe the setup? How many domains and what type (nfs, iscsi, ..) did you have?
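
To put the numbers above in perspective, here is a rough sketch of a bounded retry loop using the figures quoted in this comment (a 4 minute timeout, tried 5 times). It is not the agent's actual code: acquire_with_retries, worst_case_wait and the try_acquire callback are hypothetical names used only for illustration, and any delay between attempts is ignored.

```python
from typing import Callable

ACQUIRE_TIMEOUT_S = 4 * 60  # per-attempt timeout quoted in the comment
MAX_ATTEMPTS = 5            # number of attempts quoted in the comment


def worst_case_wait(timeout_s: float = ACQUIRE_TIMEOUT_S,
                    attempts: int = MAX_ATTEMPTS) -> float:
    """Upper bound on time spent waiting if every attempt times out."""
    return timeout_s * attempts


def acquire_with_retries(try_acquire: Callable[[float], bool],
                         timeout_s: float = ACQUIRE_TIMEOUT_S,
                         attempts: int = MAX_ATTEMPTS) -> int:
    """Call try_acquire(timeout_s) up to `attempts` times.

    try_acquire is a hypothetical callback that blocks for at most
    timeout_s seconds and returns True once the domain is acquired.
    Returns the 1-based number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        if try_acquire(timeout_s):
            return attempt
    raise TimeoutError(
        "domain not acquired after %d attempts of %d seconds"
        % (attempts, timeout_s)
    )


print(worst_case_wait() / 60, "minutes in the worst case")
```

With these figures a single agent run gives up after at most 20 minutes of waiting, after which the systemd restart mentioned above would start the cycle again.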

Comment 4 Red Hat Bugzilla Rules Engine 2016-02-11 08:09:49 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 5 Nir Soffer 2016-02-20 20:15:53 UTC
(In reply to Martin Sivák from comment #3)
> Nir, how much time can VDSM take to acquire the domain lock? 

Do you mean acquire the host id after starting a domain monitor?

There is no limit in vdsm; if storage is not accessible, this can take
forever. The monitor starts an async acquire and checks the status with
sanlock every 10 seconds. When sanlock reports that the host id is acquired,
we report back acquired: True in repoStat.

> Our current
> timeout is 4 minutes and we try that 5 times. Plus systemd will restart the
> agent anyway so we will try until it succeeds.

We need vdsm log and sanlock log to understand what happened.

Comment 6 SATHEESARAN 2016-03-10 17:39:00 UTC
I have not seen this issue since beta3.
Closing this bug now; I will reopen it if the issue is encountered again.