Bug 1278130

Summary: false log error acquire host id for domain after upgrade from 3.5 to 3.6
Product: [oVirt] vdsm
Reporter: Artyom <alukiano>
Component: General
Assignee: Amit Aviram <aaviram>
Status: CLOSED WONTFIX
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: medium
Priority: medium
Version: 4.17.10
CC: acanan, amureini, bugs, cshao, huiwa, huzhao, mavital, msivak, nsoffer, sbonazzo, stirabos, tnisan, ycui, ylavi
Target Milestone: ---
Target Release: ---
Keywords: Regression, TestOnly
Flags: ylavi: blocker-
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Clones: 1278369 (view as bug list)
Last Closed: 2016-03-28 15:09:23 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 1276650, 1276740

Attachments:
vdsm and agent log (flags: none)
/var/log/messages and /var/log/sanlock.log (flags: none)

Description Artyom 2015-11-04 17:37:29 UTC
Created attachment 1089809 [details]
vdsm and agent log

Description of problem:
Hosted-engine host failed to acquire the host id for its storage domain after upgrading from 3.5 to 3.6.
In the vdsm log, VDSM says that it successfully acquired the host id:
Thread-30::DEBUG::2015-11-04 19:09:43,551::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)
but then it still responds with 'acquired': False:
Thread-553::DEBUG::2015-11-04 19:09:44,374::task::1191::Storage.TaskManager.Task::(prepare) Task=`7aeee01f-44ea-4530-aad2-516599f3d635`::finished: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.000299854', 'lastCheck': '0.8', 'valid': True}}
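For clarity, here is a minimal sketch of what a monitoring client sees when it inspects that repoStats response. The dict literal is copied from the log line above; the check itself is illustrative, not actual agent code:

SD_UUID = "66454524-8626-4511-9c19-9b6036854fc5"

# Per-domain stats copied from the repoStats response logged above.
repo_stats = {
    SD_UUID: {'code': 0, 'actual': True, 'version': 3,
              'acquired': False,  # host id lock NOT held, despite the log
              'delay': '0.000299854', 'lastCheck': '0.8', 'valid': True},
}

# A client only considers the domain monitor ready once 'acquired' is True,
# so it keeps waiting even though the acquire was logged as successful.
if not repo_stats[SD_UUID]['acquired']:
    print("host id for domain %s not acquired yet" % SD_UUID)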

Version-Release number of selected component (if applicable):
from 
vdsm-4.16.28-1.el7ev.x86_64
to
vdsm-4.17.10.1-0.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy hosted-engine 3.5 on two hosts
2. Set global maintenance
3. Upgrade the engine from 3.5 to 3.6
4. Put the host that is not running the engine VM into maintenance
5. Upgrade the hosts from 3.5 to 3.6

Actual results:
ovirt-ha-agent stops with a traceback:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 190, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 63, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 411, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 791, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=66454524-8626-4511-9c19-9b6036854fc5, host_id=2): timeout during domain acquisition

Expected results:
ovirt-ha-agent runs without any errors

Additional info:

Comment 1 Simone Tiraboschi 2015-11-04 17:45:56 UTC
The issue seems to be here:
VDSM successfully acquires the domain lock, but then still reports 'acquired': False.
ha-agent calls startMonitoringDomain and then polls for the result.
ha-agent fails with a timeout because VDSM keeps responding with 'acquired': False indefinitely, even though the lock was reportedly acquired:

Thread-30::INFO::2015-11-04 19:12:44,532::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 66454524-8626-4511-9c19-9b6036854fc5 (id: 2)
Thread-30::DEBUG::2015-11-04 19:12:44,532::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)
...
Thread-611::INFO::2015-11-04 19:12:47,307::logUtils::51::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.00030614', 'lastCheck': '2.8', 'valid': True}}
...
Thread-613::INFO::2015-11-04 19:12:52,399::logUtils::51::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'66454524-8626-4511-9c19-9b6036854fc5': {'code': 0, 'actual': True, 'version': 3, 'acquired': False, 'delay': '0.00030614', 'lastCheck': '7.9', 'valid': True}}
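
The agent-side flow can be sketched as follows. This is a minimal illustration assuming a generic vdsm_client handle whose startMonitoringDomain and repoStats calls stand in for the VDSM verbs named above; it is not the actual ovirt-hosted-engine-ha code:

import time

def wait_for_domain_monitor(vdsm_client, sd_uuid, host_id, timeout=180):
    # Ask VDSM to start monitoring the domain; this also triggers the
    # asynchronous host id acquisition.
    vdsm_client.startMonitoringDomain(sd_uuid, host_id)
    deadline = time.time() + timeout
    while time.time() < deadline:
        stats = vdsm_client.repoStats()
        # 'acquired' reflects whether the sanlock host id is really held.
        if stats.get(sd_uuid, {}).get('acquired'):
            return  # monitoring is up and the lock is held
        time.sleep(5)
    # In this bug 'acquired' never turns True, so the loop always ends here,
    # matching the traceback in the description.
    raise Exception("Failed to start monitoring domain (sd_uuid=%s, "
                    "host_id=%s): timeout during domain acquisition"
                    % (sd_uuid, host_id))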

Comment 2 Simone Tiraboschi 2015-11-05 10:11:12 UTC
Created attachment 1089997 [details]
/var/log/messages and /var/log/sanlock.log

Comment 3 Simone Tiraboschi 2015-11-05 10:21:19 UTC
It seems that sanlock is failing to acquire the lockspace:
Nov  5 12:19:46 alma05 sanlock[20723]: 2015-11-05 12:19:46+0200 68574 [324]: s6304 wdmd_connect failed -111
Nov  5 12:19:46 alma05 sanlock[20723]: 2015-11-05 12:19:46+0200 68574 [324]: s6304 connect_watchdog failed -1
Nov  5 12:19:47 alma05 sanlock[20723]: 2015-11-05 12:19:47+0200 68575 [20727]: s6304 add_lockspace fail result -203

So the 'Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)' message is not coherent at all; the issue is not in repoStats as we initially thought, but probably in the sanlock interaction.

Comment 4 Nir Soffer 2015-11-05 10:56:12 UTC
We have 2 issues:

1. In the Vdsm log, we have a bogus message:

Thread-30::DEBUG::2015-11-04 19:09:43,551::clusterlock::237::Storage.SANLock::(acquireHostId) Host id for domain 66454524-8626-4511-9c19-9b6036854fc5 successfully acquired (id: 2)

This log message was correct before we added the async acquire option. Now we
are using async acquire, and when the sanlock operation finishes successfully
we do *not* yet have a host id; we only know that we started an asynchronous
acquire operation.

Vdsm checks every 10 seconds whether the host id was acquired, and since in
this case it was not, it tries to acquire again, as we can see in the vdsm log.
Every 10 seconds we get a new "successful" attempt to acquire.

The log must be fixed to reflect reality.
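
A minimal sketch of the intended behavior, assuming the sanlock Python bindings of that era (add_lockspace() with the async flag submits the lockspace join without waiting; inq_lockspace() reports whether the host id is actually held). The function names and messages are simplified illustrations, not vdsm's actual clusterlock code:

import sanlock  # sanlock Python bindings; the era's 'async' kwarg is passed
                # via a dict here because 'async' is a keyword in newer Python

def acquire_host_id(sd_uuid, host_id, ids_path):
    # Submit an asynchronous lockspace join. add_lockspace() returning only
    # means the request was started, NOT that the host id is held, so the
    # log message must say "acquiring", not "acquired".
    sanlock.add_lockspace(sd_uuid, host_id, ids_path, **{"async": True})
    print("Acquiring host id for domain %s (id: %s, async)"
          % (sd_uuid, host_id))

def has_host_id(sd_uuid, host_id, ids_path):
    # The periodic check (every ~10s per this comment): True only when
    # sanlock reports the host id as really held; if it stays False,
    # vdsm retries the acquire.
    return sanlock.inq_lockspace(sd_uuid, host_id, ids_path)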

2. Sanlock cannot acquire the host id, since it cannot connect to wdmd:

2015-11-05 12:52:22+0200 70530 [20729]: s6498 add_lockspace fail result -203
2015-11-05 12:52:31+0200 70539 [20729]: s6499 lockspace 66454524-8626-4511-9c19-9b6036854fc5:2:/rhev/data-center/mnt/10.35.160.108:_Virt_alukiano__hosted__engine__upgrade/66454524-8626-4511-9c19-9b6036854fc5/dom_md/ids:0
2015-11-05 12:52:31+0200 70539 [2153]: s6499 wdmd_connect failed -111
2015-11-05 12:52:31+0200 70539 [2153]: s6499 connect_watchdog failed -1
2015-11-05 12:52:32+0200 70540 [20729]: s6499 add_lockspace fail result -203

When we check wdmd status, we see:

[root@alma05 dom_md]# systemctl status wdmd
● wdmd.service - Watchdog Multiplexing Daemon
   Loaded: loaded (/usr/lib/systemd/system/wdmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2015-11-04 18:37:30 IST; 18h ago
 Main PID: 635 (code=killed, signal=KILL)

Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd[1]: Starting Watchdog Multiplexing Daemon...
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd-wdmd[20473]: Starting wdmd: [  OK  ]
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com wdmd[20480]: wdmd started S0 H1 G179
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com wdmd[20480]: other wdmd not cleanly stopped, shm_open error 17
Nov 04 18:37:30 alma05.qa.lab.tlv.redhat.com systemd[1]: Started Watchdog Multiplexing Daemon.

We can see many wdmd related errors in /var/log/messages (see attachment 1089997 [details]).

I'm cloning this bug to sanlock, so David can look at this.

Comment 5 Nir Soffer 2015-11-05 10:57:13 UTC
Lowering urgency, since this is a log issue.

Comment 6 meital avital 2015-11-05 11:43:46 UTC
Service ovirt-ha-agent failed to start after ten attempts

Comment 7 Simone Tiraboschi 2015-11-05 12:31:09 UTC
(In reply to meital avital from comment #6)
> Service ovirt-ha-agent failed to start after ten attempts

I was just pointing out that in vdsm we have a wrong logging issue (the sanlock lock hasn't been acquired as the log states), but the real issue is tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1278369

Comment 8 Red Hat Bugzilla Rules Engine 2015-11-05 13:20:11 UTC
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset.
Please set the correct milestone or drop the z stream flag.

Comment 9 Sandro Bonazzola 2016-01-14 07:28:51 UTC
Yaniv, how does this affect the upgrade-to-3.6 flow? It's targeted to 4.0.

Comment 10 Yaniv Lavi 2016-01-14 08:54:14 UTC
(In reply to Sandro Bonazzola from comment #9)
> Yaniv, how does this affect the upgrade-to-3.6 flow? It's targeted to 4.0.

It's a logging issue with no functional impact.

Comment 11 Allon Mureinik 2016-01-14 13:45:31 UTC
(In reply to Yaniv Dary from comment #10)
> (In reply to Sandro Bonazzola from comment #9)
> > Yaniv, how does this affect the upgrade-to-3.6 flow? It's targeted to 4.0.
> 
> It's a logging issue with no functional impact.
Removing the blocker flag based on this statement.

Comment 12 Red Hat Bugzilla Rules Engine 2016-01-14 13:45:36 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 13 Sandro Bonazzola 2016-02-23 09:11:54 UTC
Not blocking bug #1285700 (upgrade to 3.6) since this has been targeted to 4.0.

Comment 14 Allon Mureinik 2016-03-28 15:09:23 UTC
Closing old bugs.
If you feel this is critical, explain why and reopen.