Bug 1286750
Summary: | ovirt-hosted-engine-ha-agent stopped due to timeout during domain acquisition | | |
---|---|---|---|
Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | SATHEESARAN <sasundar> |
Component: | General | Assignee: | Martin Sivák <msivak> |
Status: | CLOSED WORKSFORME | QA Contact: | Ilanit Stein <istein> |
Severity: | urgent | Docs Contact: | |
Priority: | medium | | |
Version: | 1.3.0 | CC: | bugs, dfediuck, gklein, lsurette, nsoffer, rgolan, sasundar, ykaul |
Target Milestone: | ovirt-4.0.0-alpha | Flags: | dfediuck: ovirt-4.0.0?, rule-engine: planning_ack?, sasundar: devel_ack?, rule-engine: testing_ack? |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-03-10 17:39:00 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1258386 | | |
Attachments: | | | |
Description
SATHEESARAN
2015-11-30 16:18:34 UTC
In my recent testing, I am seeing errors in agent.log but agent was still running. Below is the snip of those errors

<snip>
MainThread::ERROR::2015-12-03 13:52:53,623::hosted_engine::790::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::470::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
MainThread::WARNING::2015-12-03 13:52:53,623::hosted_engine::473::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 447, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=0ffa2caf-93ba-448a-8e8e-a1b3f38f3b99, host_id=2): timeout during domain acquisition
</snip>

Created attachment 1101707
agent.log from host-3
This logfile has errors of domain acquisition timeout
Nir, how much time can VDSM take to acquire the domain lock? Our current timeout is 4 minutes and we try that 5 times. Plus systemd will restart the agent anyway, so we will try until it succeeds.

SATHEESARAN: Do you still by any chance have the vdsm.log from the machine? Or can you describe the setup? How many domains and what type (nfs, iscsi, ...) did you have?

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.

(In reply to Martin Sivák from comment #3)
> Nir, how much time can VDSM take to acquire the domain lock?

Do you mean acquire the host id after starting a domain monitor? There is no limit in vdsm; if storage is not accessible, this can take forever. The monitor starts an async acquire and checks the status with sanlock every 10 seconds. When sanlock reports that the host id is acquired, we report back acquired: True in repoStat.

> Our current
> timeout is 4 minutes and we try that 5 times. Plus systemd will restart the
> agent anyway so we will try until it succeeds.

We need vdsm log and sanlock log to understand what happened.

I am not seeing this issue from beta3. Closing this bug now, and will re-open the issue if encountered again.
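
For readers following the thread, the timing discussed in the comments (a 4-minute timeout tried 5 times, with the acquisition status checked every 10 seconds) can be illustrated with a rough sketch. This is not the agent's or VDSM's actual code: the function names, the `is_acquired` callback, and the way the two intervals are combined into one loop are all assumptions made for illustration only.

```python
import time


def wait_for_acquisition(is_acquired, timeout=240.0, poll_interval=10.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll is_acquired() until it returns True or `timeout` seconds elapse.

    is_acquired is a hypothetical callback standing in for "ask whether the
    host id has been acquired"; it is not a real VDSM or sanlock API.
    """
    deadline = clock() + timeout
    while True:
        if is_acquired():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def start_monitoring_with_retries(is_acquired, attempts=5, **kwargs):
    """Try wait_for_acquisition() up to `attempts` times.

    Returns the 1-based attempt number that succeeded, or raises once every
    attempt has timed out (after which a service manager could restart us).
    """
    for attempt in range(1, attempts + 1):
        if wait_for_acquisition(is_acquired, **kwargs):
            return attempt
    raise RuntimeError("timeout during domain acquisition")
```

With these assumed defaults, a caller gives up after 5 x 240 seconds, that is 20 minutes, unless the acquisition succeeds earlier. Since the acquisition itself has no upper bound when storage is not accessible, any fixed limit like this can still be exceeded.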