Created attachment 1089809 [details]
vdsm and agent log

Description of problem:
A hosted-engine host failed to acquire the host id for its storage domain after an upgrade from 3.5 to 3.6.

In vdsm.log, VDSM claims it successfully acquired the host id:

Thread-30::DEBUG::2015-11-04 19:09:43,551::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)

but it still responds with 'acquired': False:

Thread-553::DEBUG::2015-11-04 19:09:44,374::task::1191::Storage.TaskManager.Task::(prepare) Task=`7aeee01f-44ea-4530-aad2-516599f3d635`::finished: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.000299854', 'lastCheck': '0.8', 'valid': True}}

Version-Release number of selected component (if applicable):
upgraded from vdsm-4.16.28-1.el7ev.x86_64 to vdsm-4.17.10.1-0.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted-engine 3.5 on two hosts
2. Set global maintenance
3. Upgrade the engine from 3.5 to 3.6
4. Put the host without the engine VM into maintenance
5. Upgrade the hosts from 3.5 to 3.6

Actual results:
ovirt-ha-agent stops with a traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 190, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 63, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 411, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=66454524-8626-4511-9c19-9b6036854fc5, host_id=2): timeout during domain acquisition

Expected results:
ovirt-ha-agent runs without errors

Additional info:
The issue seems to be here: VDSM successfully acquires the domain lock but then keeps reporting 'acquired': False. ovirt-ha-agent calls startMonitoringDomain and then polls for the result; it fails with a timeout because VDSM indefinitely responds 'acquired': False even when the lock was apparently acquired correctly:

Thread-30::INFO::2015-11-04 19:12:44,532::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 66454524-8626-4511-9c19-9b6036854fc5 (id: 2)
Thread-30::DEBUG::2015-11-04 19:12:44,532::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)
...
Thread-611::INFO::2015-11-04 19:12:47,307::logUtils::51::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.00030614', 'lastCheck': '2.8', 'valid': True}}
...
Thread-613::INFO::2015-11-04 19:12:52,399::logUtils::51::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.00030614', 'lastCheck': '7.9', 'valid': True}}
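The poll-until-acquired behavior described above can be sketched as follows. This is a minimal illustration, not the actual ovirt-ha-agent code; `repo_stats`, the timeout, and the interval are hypothetical stand-ins:

```python
import time

def wait_for_domain_acquired(repo_stats, sd_uuid, timeout=180.0, interval=5.0):
    """Poll repoStats until the domain reports 'acquired': True.

    repo_stats is a callable returning a dict keyed by domain UUID,
    mimicking the VDSM repoStats responses shown in the logs above.
    Raises an exception on timeout, similar to what ovirt-ha-agent does.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        stats = repo_stats().get(sd_uuid, {})
        if stats.get('acquired'):
            return stats
        time.sleep(interval)
    raise Exception(
        "Failed to start monitoring domain (sd_uuid=%s): "
        "timeout during domain acquisition" % sd_uuid)
```

If VDSM keeps answering 'acquired': False, as in the log excerpts above, a loop like this can only end in the timeout exception, which matches the traceback in the description.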
Created attachment 1089997 [details]
/var/log/messages and /var/log/sanlock.log
It seems that sanlock is failing to acquire the lockspace:

Nov 5 12:19:46 alma05 sanlock[20723]: 2015-11-05 12:19:46+0200 68574 [324]: s6304 wdmd_connect failed -111
Nov 5 12:19:46 alma05 sanlock[20723]: 2015-11-05 12:19:46+0200 68574 [324]: s6304 connect_watchdog failed -1
Nov 5 12:19:47 alma05 sanlock[20723]: 2015-11-05 12:19:47+0200 68575 [20727]: s6304 add_lockspace fail result -203

So the 'Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)' message is not coherent at all, but the issue is not in repoStats as we initially thought; it is probably in the sanlock interaction.
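A note on reading these numbers: the -111 in "wdmd_connect failed -111" is a negated errno value (ECONNREFUSED under Linux numbering, i.e. wdmd's socket is not accepting connections), which is consistent with wdmd being down. A quick decoding helper, purely for illustration (not sanlock code; sanlock-internal codes such as -203 are not errno values and pass through unchanged):

```python
import errno
import os

def decode_sanlock_rv(rv):
    """Translate a negative return value to a readable errno name
    when it matches one (Linux errno numbering assumed); otherwise
    return it as-is."""
    if rv < 0 and -rv in errno.errorcode:
        return "%s (%s)" % (errno.errorcode[-rv], os.strerror(-rv))
    return str(rv)
```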
We have 2 issues:

1. In the Vdsm log, we have a bogus message:

Thread-30::DEBUG::2015-11-04 19:09:43,551::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)

This log was correct before we added the async acquire option. Now we are using async acquire, and when the sanlock operation finishes successfully we do *not* have a host id; we only know that we started an asynchronous acquire operation. Vdsm checks every 10 seconds whether the host id was acquired, and since in this case it was not, it tries to acquire again, as we can see in the vdsm log. Every 10 seconds we get a new "successful" attempt to acquire. The log must be fixed to reflect reality.

2. Sanlock cannot acquire the host id, since it cannot connect to wdmd:

2015-11-05 12:52:22+0200 70530 [20729]: s6498 add_lockspace fail result -203
2015-11-05 12:52:31+0200 70539 [20729]: s6499 lockspace 66454524-8626-4511-9c19-9b6036854fc5:2:/rhev/data-center/mnt/10.35.160.108:_Virt_alukiano__hosted__engine__upgrade/66454524-8626-4511-9c19-9b6036854fc5/dom_md/ids:0
2015-11-05 12:52:31+0200 70539 [2153]: s6499 wdmd_connect failed -111
2015-11-05 12:52:31+0200 70539 [2153]: s6499 connect_watchdog failed -1
2015-11-05 12:52:32+0200 70540 [20729]: s6499 add_lockspace fail result -203

When we check the wdmd status, we see:

[root@alma05 dom_md]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2015-11-04 18:37:30 IST; 18h ago
 Main PID: 635 (code=killed, signal=KILL)

Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd[1]: Starting Watchdog Multiplexing Daemon...
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd-wdmd[20473]: Starting wdmd: [  OK  ]
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com wdmd[20480]: wdmd started S0 H1 G179
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com wdmd[20480]: other wdmd not cleanly stopped, shm_open error 17
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd[1]: Started Watchdog Multiplexing Daemon.

We can see many wdmd-related errors in /var/log/messages (see attachment 1089997 [details]).

I'm cloning this bug to sanlock, so David can look at this.
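Issue 1 above can be sketched to show why the message is bogus: with async acquire, a "successful" return only means the request was submitted, and the 10-second check loop re-submits (and re-logs) on every cycle while the host id is still not held. This is a minimal sketch, not the actual clusterlock code; `start_async_acquire`, `has_host_id`, and `ticks` are hypothetical stand-ins:

```python
import logging

logging.basicConfig(level=logging.DEBUG, format='%(message)s')
log = logging.getLogger('Storage.SANLock')

def acquire_host_id_async(sd_uuid, host_id, start_async_acquire):
    """Start an asynchronous sanlock acquire.

    A successful return here only means the request was submitted,
    NOT that the host id is held -- which is exactly why the old
    "successfully acquired" log line was misleading.
    """
    start_async_acquire(sd_uuid, host_id)
    # A truthful message (issue 1 above asks for something like this
    # instead of "successfully acquired"):
    log.debug("Host id acquire request for domain %s started (id: %s)",
              sd_uuid, host_id)

def monitor_loop(sd_uuid, host_id, has_host_id, start_async_acquire, ticks):
    """Every tick (~10 seconds in Vdsm), check whether the host id is
    held; if not, re-submit the acquire -- producing a new "successful"
    log line each cycle, as seen in vdsm.log."""
    attempts = 0
    for _ in range(ticks):
        if not has_host_id(sd_uuid):
            acquire_host_id_async(sd_uuid, host_id, start_async_acquire)
            attempts += 1
    return attempts
```

With wdmd down, `has_host_id` never turns true, so every cycle submits another acquire: the repeated "successfully acquired" lines in the log reflect repeated submissions, not repeated successes.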
Lowering urgency, since this is a log issue.
Service ovirt-ha-agent failed to start after ten attempts
(In reply to meital avital from comment #6)
> Service ovirt-ha-agent failed to start after ten attempts

It was just to say that on the vdsm side we have a wrong-logging issue (the sanlock lock hasn't been acquired as the log states), but the real issue is here: https://bugzilla.redhat.com/show_bug.cgi?id=1278369
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset. Please set the correct milestone or drop the z stream flag.
Yaniv, how does this affect the upgrade to 3.6 flow? It's targeted to 4.0.
(In reply to Sandro Bonazzola from comment #9)
> Yaniv, how does this affect the upgrade to 3.6 flow? It's targeted to 4.0.

It's a logging issue, with no functional impact.
(In reply to Yaniv Dary from comment #10)
> (In reply to Sandro Bonazzola from comment #9)
> > Yaniv, how does this affect the upgrade to 3.6 flow? It's targeted to 4.0.
>
> It's a logging issue, with no functional impact.

Removing the blocker flag based on this statement.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Not blocking bug #1285700 (upgrade to 3.6) since this has been targeted to 4.0.
Closing old bugs. If you feel this is critical, explain why and reopen.