Bug 1286750 - ovirt-hosted-engine-ha-agent stopped due to timeout during domain acquisition
Status: CLOSED WORKSFORME
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
Version: 1.3.0
Hardware: x86_64 Linux
Priority: medium
Severity: urgent
Target Milestone: ovirt-4.0.0-alpha
Target Release: ---
Assigned To: Martin Sivák
QA Contact: Ilanit Stein
Depends On:
Blocks: Gluster-HC-1
Reported: 2015-11-30 11:18 EST by SATHEESARAN
Modified: 2016-03-10 12:39 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-10 12:39:00 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
dfediuck: ovirt‑4.0.0?
rule-engine: planning_ack?
sasundar: devel_ack?
rule-engine: testing_ack?


Attachments
agent.log from host-3 (7.74 MB, text/plain)
2015-12-03 03:42 EST, SATHEESARAN

Description SATHEESARAN 2015-11-30 11:18:34 EST
Description of problem:
-----------------------
Hosted-engine was deployed on a RHEL 7.2 server, and two additional hosts were then added to the hosted-engine setup. ovirt-ha-agent was not running on one of the nodes in the cluster, and its log showed errors caused by a timeout during domain acquisition. The same error was also seen on the other nodes, but ovirt-ha-agent was still running on those nodes.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6 Beta1

How reproducible:
------------------
Never tried to reproduce

Steps to Reproduce:
-------------------
1. Deploy hosted-engine on a RHEL 7.2 server (with gluster as the backend)
2. Add 2 more hosts to the setup using the hosted-engine deploy script
3. After all three nodes are deployed, check the status of ovirt-ha-agent
(i.e., service ovirt-ha-agent status)

Actual results:
---------------
ovirt-ha-agent was not running

Expected results:
-----------------
ovirt-ha-agent should be running

Additional info:
-----------------
Below is the snippet of errors seen in agent.log

<snip>
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::INFO::2015-11-30 14:51:49,869::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
</snip>
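
For triage, the records in the snippet above can be tallied with a short script. This is only a sketch: it assumes every record follows the `::`-separated layout visible in the snippet (thread, level, timestamp, module, line number, logger, then `(function) message`), and the names `parse_line` and `count_timeouts` are hypothetical helpers, not part of ovirt-hosted-engine-ha.

```python
import re
from collections import Counter

# Matches the sd_uuid/host_id pair of an acquisition-timeout message.
TIMEOUT_RE = re.compile(
    r"sd_uuid=(?P<sd_uuid>[0-9a-f-]+), host_id=(?P<host_id>\d+)\): "
    r"timeout during domain acquisition"
)


def parse_line(line):
    """Split one agent.log record into fields.

    Returns None for continuation lines (traceback frames, exception
    text) that do not follow the assumed '::'-separated layout.
    """
    parts = line.rstrip("\n").split("::", 6)
    if len(parts) != 7 or not parts[4].isdigit():
        return None
    thread, level, timestamp, module, lineno, logger, rest = parts
    # The last field looks like "(function_name) free-form message".
    m = re.match(r"\((\w+)\) (.*)", rest, re.S)
    function, message = (m.group(1), m.group(2)) if m else (None, rest)
    return {
        "thread": thread,
        "level": level,
        "timestamp": timestamp,
        "module": module,
        "lineno": int(lineno),
        "logger": logger,
        "function": function,
        "message": message,
    }


def count_timeouts(lines):
    """Count ERROR-level acquisition timeouts per (sd_uuid, host_id)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["level"] != "ERROR":
            continue
        m = TIMEOUT_RE.search(rec["message"])
        if m:
            counts[(m["sd_uuid"], int(m["host_id"]))] += 1
    return counts
```

Passing an open file object for agent.log to `count_timeouts` yields one counter entry per storage domain and host id. WARNING records that repeat the same message are skipped on purpose so each failure is counted once.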
Comment 1 SATHEESARAN 2015-12-03 03:40:25 EST
In my recent testing, I still see these errors in agent.log, but the agent keeps running.

Below is a snippet of those errors:

<snip>
MainThread::ERROR::2015-12-03 13:52:53,623::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition

</snip>
Comment 2 SATHEESARAN 2015-12-03 03:42 EST
Created attachment 1101707
agent.log from host-3

This log file contains the domain acquisition timeout errors.
Comment 3 Martin Sivák 2016-02-10 04:58:37 EST
Nir, how much time can VDSM take to acquire the domain lock? Our current timeout is 4 minutes and we try that 5 times. Plus systemd will restart the agent anyway so we will try until it succeeds.

SATHEESARAN: Do you still by any chance have the vdsm.log from the machine? Or can you describe the setup? How many domains and what type (nfs, iscsi, ..) did you have?
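
To make the retry budget mentioned above concrete, here is a minimal sketch of a bounded retry loop using the figures from this comment (a 4-minute timeout, tried 5 times). The function `start_with_retries` and the `try_acquire` callable are hypothetical and do not reflect the agent's actual code. Note that the log in the description reports a shutdown after 3 failures in a row, so the real counters may differ.

```python
def start_with_retries(try_acquire, attempts=5, timeout_s=240):
    """Call try_acquire(timeout_s) up to `attempts` times.

    try_acquire is a caller-supplied callable (hypothetical here) that
    blocks for at most timeout_s seconds and returns True on success.
    Returns the 1-based number of the attempt that succeeded, or raises
    RuntimeError once every attempt has timed out.
    """
    for attempt in range(1, attempts + 1):
        if try_acquire(timeout_s):
            return attempt
    raise RuntimeError(
        "domain not acquired after %d attempts of %d s each (%d s total)"
        % (attempts, timeout_s, attempts * timeout_s)
    )
```

With these defaults the worst case is 5 x 240 s = 1200 s (20 minutes) before the loop gives up, after which a supervisor such as systemd would start the whole cycle again.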
Comment 4 Red Hat Bugzilla Rules Engine 2016-02-11 03:09:49 EST
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
Comment 5 Nir Soffer 2016-02-20 15:15:53 EST
(In reply to Martin Sivák from comment #3)
> Nir, how much time can VDSM take to acquire the domain lock? 

Do you mean acquiring the host id after starting a domain monitor?

There is no limit in vdsm; if storage is not accessible, this can take
forever. The monitor starts an async acquire and checks the status with
sanlock every 10 seconds. When sanlock reports that the host id is acquired,
we report back acquired: True in repoStat.

> Our current
> timeout is 4 minutes and we try that 5 times. Plus systemd will restart the
> agent anyway so we will try until it succeeds.

We need the vdsm log and the sanlock log to understand what happened.
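
The interaction described in this comment (vdsm polls sanlock every 10 seconds with no upper bound, while the agent gives up after its own timeout) can be sketched as a polling loop with a caller-side deadline. This is an illustration only: `wait_for_host_id` and the `is_acquired` callable are hypothetical stand-ins for reading the acquired flag from repoStat, and the 240-second default is the agent timeout quoted above.

```python
import time


def wait_for_host_id(is_acquired, timeout_s=240.0, poll_interval_s=10.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_acquired() until it returns True or timeout_s elapses.

    is_acquired is a caller-supplied callable (hypothetical here).
    clock and sleep are injectable so the loop can be exercised
    without real waiting. Returns True if the host id was acquired
    in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if is_acquired():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

Because the underlying acquire has no limit of its own, a False result here only means the caller stopped waiting; the acquire may still complete later once storage becomes accessible.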
Comment 6 SATHEESARAN 2016-03-10 12:39:00 EST
I have not seen this issue since beta3.
Closing this bug now; I will reopen it if the issue is encountered again.
