Bug 1100566
| Summary: | [hosted engine] - vdsm needs HA agent configuration before deployment | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Moskovcak <jmoskovc> |
| Component: | vdsm | Assignee: | Federico Simoncelli <fsimonce> |
| Status: | CLOSED WORKSFORME | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.3.0 | CC: | acanan, agk, alukiano, amureini, bazulay, cluster-maint, dfediuck, eedri, fsimonce, gchaplik, gpadgett, iheim, jmoskovc, lpeer, mavital, pablo.iranzo, pstehlik, sbonazzo, sherold, teigland, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.5.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | storage | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1045053 | Environment: | |
| Last Closed: | 2014-07-30 13:39:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1045053, 1142923, 1156165 | ||
Comment 2
David Teigland
2014-05-23 14:35:27 UTC
My assumption about the problem with sanlock was based on this:
Thread-49::DEBUG::2013-12-19 13:25:44,912::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 4eea45f1-0be1-4c5c-9ec3-1460a16de055
Traceback (most recent call last):
File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
self.domain.acquireHostId(self.hostId, async=True)
File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
self._clusterLock.acquireHostId(hostId, async)
File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('4eea45f1-0be1-4c5c-9ec3-1460a16de055', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
If you don't think it's a problem with sanlock, then please reassign it to whatever component you think is causing the problem.
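Editorial note: the error code 22 carried by the SanlockException in the traceback above is the Linux errno for EINVAL. A minimal, self-contained Python sketch of how such an error could be classified; the `SanlockError` class below is a hypothetical stand-in mirroring the logged exception shape, not the real sanlock binding's type:

```python
import errno


class SanlockError(Exception):
    """Hypothetical stand-in for the exception shape seen in the log:
    SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')."""
    def __init__(self, err, msg, strerror):
        super().__init__(err, msg, strerror)
        self.errno = err


def is_lockspace_einval(exc):
    # errno 22 is EINVAL on Linux, the code reported in this bug's traceback.
    return isinstance(exc, SanlockError) and exc.errno == errno.EINVAL


# Values taken verbatim from the log line above.
exc = SanlockError(22, 'Sanlock lockspace add failure', 'Invalid argument')
print(is_lockspace_einval(exc))  # True
```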
The most likely cause for -EINVAL from add_lockspace is that a lockspace with the same name has already been added. In the next version I have included a log message when this happens. sanlock cannot do anything about this. vdsm will need to handle this situation.

I see that the host id was released successfully 100ms earlier:
Thread-31::INFO::2013-12-19 10:34:34,882::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-31::DEBUG::2013-12-19 10:34:34,882::clusterlock::207::SANLock::(releaseHostId) Host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 released successfully (id: 1)
...(no other sanlock operation)...
Thread-54::INFO::2013-12-19 10:34:34,975::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-54::DEBUG::2013-12-19 10:34:34,976::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 6c57fd8e-d77b-4833-adff-0050415ac789
Traceback (most recent call last):
File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
self.domain.acquireHostId(self.hostId, async=True)
File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
self._clusterLock.acquireHostId(hostId, async)
File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6c57fd8e-d77b-4833-adff-0050415ac789', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
I know we can't do much without sanlock logs, but do you think there could be another reason for EINVAL, or maybe a race between release/acquire?
Thanks.
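Editorial illustration (not part of the original comments): if the EINVAL really comes from re-adding a lockspace whose previous instance has not fully gone away yet, one way a caller such as vdsm could cope is a short bounded retry. The `acquire_host_id_with_retry` helper and `AcquireHostIdError` below are hypothetical placeholders, not the actual vdsm or sanlock APIs; this is a sketch of the idea only.

```python
import errno
import time


class AcquireHostIdError(Exception):
    """Hypothetical error carrying the errno from a failed lockspace add."""
    def __init__(self, err):
        super().__init__(err)
        self.errno = err


def acquire_host_id_with_retry(acquire, retries=5, delay=0.5):
    """Call acquire(), retrying briefly when it fails with EINVAL, which
    (per the discussion above) can mean the previous lockspace instance
    has not been fully removed yet."""
    for attempt in range(retries):
        try:
            return acquire()
        except AcquireHostIdError as e:
            if e.errno != errno.EINVAL or attempt == retries - 1:
                raise
            time.sleep(delay)  # give the in-progress removal time to finish


# Usage sketch with a fake acquire() that fails twice and then succeeds:
state = {"failures": 2}


def fake_acquire():
    if state["failures"] > 0:
        state["failures"] -= 1
        raise AcquireHostIdError(errno.EINVAL)
    return "host id acquired"


print(acquire_host_id_with_retry(fake_acquire))
```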
Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the following sanlock_add_lockspace() would likely see the previous instance which is not yet gone. In that case, add_lockspace returns EINVAL: https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

(In reply to David Teigland from comment #7)
> Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the
> following sanlock_add_lockspace() would likely see the previous instance
> which is not yet gone. In that case, add_lockspace returns EINVAL:
> https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

No, we never use async for release (only for acquire). I double-checked the code as well.

Jiri, are you still hitting this issue?

Moving needinfo to pstehlik, because he reported this bug.

I see that the bug was opened for 3.3; for which version do you prefer reproduction, 3.3 or 3.4?

I checked it on ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch. I have an HE environment without changing any modes:
1) yum erase ovirt-host* -y
2) rm -rf /etc/ovirt-hosted*
3) yum install ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch -y
4) The first hosted-engine --deploy failed because a VM was already running; destroyed it with: vdsClient -s 0 destroy vm_id
5) hosted-engine --deploy on clean storage ran OK, without any problems, and also no problems during:
...
[ INFO ] Initializing sanlock metadata
[ INFO ] Creating VM Image
[ INFO ] Disconnecting Storage Pool
[ INFO ] Start monitoring domain
[ INFO ] Configuring VM
[ INFO ] Updating hosted-engine configuration
...
Please inform me if you need to check it for 3.3.

(In reply to Artyom from comment #13)
> 5) hosted-engine --deploy on clean storage ran OK, without any problems, and
> also no problems during:

Fede, based on this statement, can we close this BZ?

(In reply to Allon Mureinik from comment #14)
> (In reply to Artyom from comment #13)
> > 5) hosted-engine --deploy on clean storage ran OK, without any problems, and
> > also no problems during:
> Fede, based on this statement, can we close this BZ?

Closing. If this was incorrect, please reopen with the relevant details.
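Editorial illustration (appended, not part of the bug's comments): the thread above turns on whether the lockspace removal is synchronous or asynchronous. The toy model below is pure Python with no real sanlock involved, and every name in it is hypothetical; it only shows why an immediate re-add after an asynchronous removal can still see the old instance and fail with EINVAL, while a synchronous removal cannot.

```python
import errno
import threading
import time


class ToyLockspaceRegistry:
    """Toy stand-in for a daemon's table of active lockspaces."""
    def __init__(self):
        self._spaces = set()
        self._lock = threading.Lock()

    def add(self, name):
        with self._lock:
            if name in self._spaces:
                # Mirrors an add that fails with EINVAL when an instance
                # with the same name is still present.
                raise OSError(errno.EINVAL, "lockspace already exists")
            self._spaces.add(name)

    def remove(self, name, wait=True):
        def _do_remove():
            time.sleep(0.2)       # pretend teardown takes a little while
            with self._lock:
                self._spaces.discard(name)
        t = threading.Thread(target=_do_remove)
        t.start()
        if wait:                  # synchronous removal: return only when gone
            t.join()


registry = ToyLockspaceRegistry()
registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")

# Asynchronous removal followed by an immediate re-add hits EINVAL:
registry.remove("6c57fd8e-d77b-4833-adff-0050415ac789", wait=False)
try:
    registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")
except OSError as e:
    print("re-add failed:", e)

time.sleep(0.3)                   # let the background removal finish
registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")   # now it succeeds
print("re-add succeeded after the removal completed")
```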