Bug 1286750
| Summary: | ovirt-hosted-engine-ha-agent stopped due to timeout during domain acquisition | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | SATHEESARAN <sasundar> | ||||
| Component: | General | Assignee: | Martin Sivák <msivak> | ||||
| Status: | CLOSED WORKSFORME | QA Contact: | Ilanit Stein <istein> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 1.3.0 | CC: | bugs, dfediuck, gklein, lsurette, nsoffer, rgolan, sasundar, ykaul | ||||
| Target Milestone: | ovirt-4.0.0-alpha | Flags: | dfediuck: ovirt-4.0.0? rule-engine: planning_ack? sasundar: devel_ack? rule-engine: testing_ack? | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-03-10 17:39:00 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1258386 | ||||||
| Attachments: | |||||||
In my recent testing, I am still seeing these errors in agent.log, but the agent kept running.
Below is a snippet of those errors:
<snip>
MainThread::ERROR::2015-12-03 13:52:53,623::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
self._initialize_domain_monitor()
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
self._initialize_domain_monitor()
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
</snip>
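For anyone triaging a longer agent.log, a small parser can help separate the record lines from the traceback lines. This is only a sketch: the record layout (thread, level, timestamp, module, line number, logger, function, message, separated by `::`) is inferred from the excerpt above and is not taken from the ovirt-hosted-engine-ha sources. The names `RECORD`, `parse_record` and `count_by_level` are made up for illustration, and each record is assumed to sit on a single line.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the excerpt in this bug:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
RECORD = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::"
    r"(?P<logger>[^:]+)::\((?P<func>[^)]+)\) (?P<message>.*)$"
)


def parse_record(line):
    """Return the fields of one log record as a dict, or None when the
    line is not a record (for example a traceback line)."""
    match = RECORD.match(line.strip())
    return match.groupdict() if match else None


def count_by_level(lines):
    """Count parsed records per log level; non-record lines are skipped."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record is not None:
            counts[record["level"]] += 1
    return counts
```

Feeding the lines of a log file to `count_by_level` gives a quick view of how many ERROR and WARNING records a host produced, and `parse_record(line)["message"]` can be matched against "timeout during domain acquisition" to pick out this particular failure.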
Created attachment 1101707
agent.log from host-3
This log file contains the domain acquisition timeout errors
Nir, how much time can VDSM take to acquire the domain lock? Our current timeout is 4 minutes and we try that 5 times. Plus systemd will restart the agent anyway so we will try until it succeeds.

SATHEESARAN: Do you still by any chance have the vdsm.log from the machine? Or can you describe the setup? How many domains and what type (nfs, iscsi, ..) did you have?

Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

(In reply to Martin Sivák from comment #3)
> Nir, how much time can VDSM take to acquire the domain lock?

Do you mean acquire the host id after starting a domain monitor? There is no limit in vdsm; if storage is not accessible, this can take forever. The monitor starts an async acquire and checks the status with sanlock every 10 seconds. When sanlock reports that the host id is acquired, we report back acquired: True in repoStat.

> Our current
> timeout is 4 minutes and we try that 5 times. Plus systemd will restart the
> agent anyway so we will try until it succeeds.

We need vdsm log and sanlock log to understand what happened.

I am not seeing this issue from beta3. Closing this bug now, and will re-open the issue if encountered again.
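
The waiting scheme discussed in the comments above (a bounded wait for `acquired: True`, repeated a fixed number of times) can be sketched roughly as follows. This is not the agent's or vdsm's actual code. The 4 minute timeout and the 5 attempts come from the comment above. The 10 second interval is borrowed from vdsm's own sanlock check and is only an assumption for the caller's polling rate. `is_acquired` is a hypothetical callable standing in for whatever reads the `acquired` flag from repoStat.

```python
import time


def wait_for_acquired(is_acquired, timeout=240.0, interval=10.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_acquired() until it returns True or `timeout` seconds pass.

    is_acquired is a hypothetical callable; clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_acquired():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def acquire_with_retries(is_acquired, attempts=5, **kwargs):
    """Repeat the bounded wait up to `attempts` times."""
    for _ in range(attempts):
        if wait_for_acquired(is_acquired, **kwargs):
            return True
    return False
```

With these defaults the caller gives up after roughly 5 x 4 = 20 minutes, whereas vdsm itself sets no limit. If storage stays inaccessible, every attempt times out, no matter how many retries are configured.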
Description of problem:
-----------------------
hosted-engine deployment was made with RHEL 7.2 server. Two additional hosts were then added to the hosted-engine setup. ovirt-ha-agent was not running on one of the nodes in the cluster, and there were errors due to timeout during domain acquisition. The same error was also seen on the other nodes, but ovirt-ha-agent was running on those nodes.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEV 3.6 Beta1

How reproducible:
------------------
Never tried to reproduce

Steps to Reproduce:
-------------------
1. Deploy hosted-engine on a RHEL 7.2 server (with gluster as backend)
2. Add 2 more hosts to the setup using the hosted-engine deploy script
3. After all three nodes are deployed, check the ovirt-ha-agent status (i.e. service ovirt-ha-agent status)

Actual results:
---------------
ovirt-ha-agent was not running

Expected results:
-----------------
ovirt-ha-agent should be running

Additional info:
-----------------
Below is the snippet of errors seen in agent.log
<snip>
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
self._initialize_domain_monitor()
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-11-30 14:51:46,855::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
self._initialize_domain_monitor()
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=e71eefc3-a720-435c-a79a-efb11987a198, host_id=2): timeout during domain acquisition
MainThread::ERROR::2015-11-30 14:51:46,855::hosted_engine::486::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::INFO::2015-11-30 14:51:49,869::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
</snip>
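
The last ERROR record in the log ("Shutting down the agent because of 3 failures in a row!") suggests that the monitoring loop counts consecutive failures and exits once a limit is reached. A minimal sketch of that pattern is shown below. It is an illustration only, not the code from hosted_engine.py. `step`, `run_with_failure_limit`, `TooManyFailures` and `max_iterations` are hypothetical names, and the reset of the counter after a successful pass is an assumption based on the wording "in a row".

```python
class TooManyFailures(Exception):
    """Raised when the loop hits the consecutive-failure limit."""


def run_with_failure_limit(step, max_failures=3, max_iterations=None):
    """Call step() repeatedly and stop after `max_failures` consecutive
    exceptions. A successful pass resets the counter (assumed behaviour).

    max_iterations only exists to bound the loop in examples and tests.
    Returns the number of iterations that were run.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            step()
        except Exception as exc:
            failures += 1
            if failures >= max_failures:
                raise TooManyFailures(
                    "shutting down after %d failures in a row" % failures
                ) from exc
        else:
            failures = 0
    return iterations
```

In the scenario reported here, `step` would be the monitoring pass that raises "timeout during domain acquisition". Once the exception propagates, the process exits and the service manager decides whether to restart it.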