Bug 1100566 - [hosted engine] - vdsm needs HA agent configuration before deployment
Summary: [hosted engine] - vdsm needs HA agent configuration before deployment
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Federico Simoncelli
QA Contact: Nikolai Sednev
URL:
Whiteboard: storage
Depends On:
Blocks: 1045053 rhev3.5beta 1156165
 
Reported: 2014-05-23 06:13 UTC by Jiri Moskovcak
Modified: 2019-04-28 14:23 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1045053
Environment:
Last Closed: 2014-07-30 13:39:40 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:



Comment 2 David Teigland 2014-05-23 14:35:27 UTC
I don't see any reference to sanlock changes in the comments above; was this misassigned?

Comment 3 Jiri Moskovcak 2014-05-26 06:46:15 UTC
My assumption about the problem with sanlock was based on this:

Thread-49::DEBUG::2013-12-19 13:25:44,912::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 4eea45f1-0be1-4c5c-9ec3-1460a16de055
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('4eea45f1-0be1-4c5c-9ec3-1460a16de055', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

If you don't think it's a problem with sanlock, then please reassign it to whatever component you think is causing the problem.

Comment 4 David Teigland 2014-05-27 14:18:34 UTC
The most likely cause for -EINVAL from add_lockspace is that a lockspace with the same name has already been added.  In the next version I have included a log message when this happens.

sanlock cannot do anything about this. vdsm will need to handle this situation.
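A minimal sketch of how a caller such as vdsm might handle this, assuming the sanlock Python binding of that era (add_lockspace/inq_lockspace taking the lockspace name, host id, and ids path, and SanlockException carrying an errno attribute); the helper name is illustrative, not vdsm's actual code:

import errno

import sanlock  # Python binding shipped with the sanlock daemon


def add_lockspace_guarded(sd_uuid, host_id, ids_path):
    """Join a lockspace, treating EINVAL for an already-present
    lockspace as a condition the caller can recover from."""
    try:
        sanlock.add_lockspace(sd_uuid, host_id, ids_path)
    except sanlock.SanlockException as e:
        if e.errno != errno.EINVAL:
            raise
        # Per comment 4, EINVAL most likely means a lockspace with the
        # same name was already added; confirm before giving up.
        if sanlock.inq_lockspace(sd_uuid, host_id, ids_path):
            return  # already joined; treat as success
        raise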

Comment 6 Federico Simoncelli 2014-07-14 11:36:15 UTC
I see that the host id was released successfully 100ms earlier:

Thread-31::INFO::2013-12-19 10:34:34,882::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-31::DEBUG::2013-12-19 10:34:34,882::clusterlock::207::SANLock::(releaseHostId) Host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 released successfully (id: 1)
...(no other sanlock operation)...
Thread-54::INFO::2013-12-19 10:34:34,975::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-54::DEBUG::2013-12-19 10:34:34,976::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 6c57fd8e-d77b-4833-adff-0050415ac789
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6c57fd8e-d77b-4833-adff-0050415ac789', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

I know we can't do much without sanlock logs but do you think there could be another reason for EINVAL or maybe a race between release/acquire?
Thanks.

Comment 7 David Teigland 2014-07-14 15:10:19 UTC
Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)?  If so, then the following sanlock_add_lockspace() would likely see the previous instance which is not yet gone.  In that case, add_lockspace returns EINVAL:
https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642
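The race David describes, and the caller-side way around it, sketched against the same assumed binding (rem_lockspace blocking by default); retry counts and delays are illustrative:

import errno
import time

import sanlock

# Racy pattern (what comment 7 warns about): an asynchronous removal
# returns before the lockspace is actually gone, so an immediate re-add
# can still see the dying instance and fail with EINVAL:
#
#     sanlock.rem_lockspace(sd_uuid, host_id, ids_path, async=True)
#     sanlock.add_lockspace(sd_uuid, host_id, ids_path)   # -> EINVAL
#
# Caller-side fix: release synchronously, and retry the add briefly in
# case the previous instance is still being torn down.

def readd_lockspace(sd_uuid, host_id, ids_path, retries=10, delay=1.0):
    sanlock.rem_lockspace(sd_uuid, host_id, ids_path)  # blocks until removed
    last_exc = None
    for _ in range(retries):
        try:
            sanlock.add_lockspace(sd_uuid, host_id, ids_path)
            return
        except sanlock.SanlockException as e:
            if e.errno != errno.EINVAL:
                raise
            last_exc = e
            time.sleep(delay)  # previous instance may not be gone yet
    raise last_exc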

Comment 8 Federico Simoncelli 2014-07-14 15:49:40 UTC
(In reply to David Teigland from comment #7)
> Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)?  If so, then the
> following sanlock_add_lockspace() would likely see the previous instance
> which is not yet gone.  In that case, add_lockspace returns EINVAL:
> https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

No, we never use async for release (only for acquire). I double-checked the code as well.
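For reference, the pattern described above (synchronous release, asynchronous acquire whose completion the domain monitor polls separately) looks roughly like the sketch below. It is written against the assumed 2014-era binding, where async was an ordinary keyword argument; it is not vdsm's actual clusterlock.py.

import sanlock

# "async" became a reserved word in Python 3.7, so pass the era's
# keyword argument through a dict to keep this snippet importable.
ASYNC = {"async": True}


def release_host_id(sd_uuid, host_id, ids_path):
    # Synchronous: returns only once the lockspace is really removed.
    sanlock.rem_lockspace(sd_uuid, host_id, ids_path)


def acquire_host_id(sd_uuid, host_id, ids_path):
    # Asynchronous: the request is queued; completion is observed later
    # by polling inq_lockspace() from the domain monitor.
    sanlock.add_lockspace(sd_uuid, host_id, ids_path, **ASYNC)


def has_host_id(sd_uuid, host_id, ids_path):
    return sanlock.inq_lockspace(sd_uuid, host_id, ids_path)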

Jiri are you still hitting this issue?

Comment 9 Jiri Moskovcak 2014-07-15 06:32:55 UTC
Moving needinfo to pstehlik, since he reported this bug.

Comment 12 Artyom 2014-07-17 11:11:38 UTC
I see that the bug was opened against 3.3. For which version would you prefer reproduction, 3.3 or 3.4?

Comment 13 Artyom 2014-07-17 14:56:33 UTC
I checked it with ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch on an existing HE environment, without changing any modes:
1) yum erase ovirt-host* -y
2) rm -rf /etc/ovirt-hosted*
3) yum install ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch -y
4) The first hosted-engine --deploy attempt failed because a VM was still running, so I destroyed it:
   vdsClient -s 0 destroy vm_id
5) hosted-engine --deploy on clean storage ran OK, with no problems at all, including:
...
[ INFO  ] Initializing sanlock metadata
[ INFO  ] Creating VM Image
[ INFO  ] Disconnecting Storage Pool
[ INFO  ] Start monitoring domain
[ INFO  ] Configuring VM
[ INFO  ] Updating hosted-engine configuration
...
Please let me know if you need me to check it on 3.3.

Comment 14 Allon Mureinik 2014-07-23 08:44:10 UTC
(In reply to Artyom from comment #13)
> 5) hosted-engine --deploy on clean storage ran OK, with no problems at all,
> including:
Fede, based on this statement, can we close this BZ?

Comment 15 Allon Mureinik 2014-07-30 13:39:40 UTC
(In reply to Allon Mureinik from comment #14)
> (In reply to Artyom from comment #13)
> > 5) hosted-engine --deploy on clean storage ran OK, with no problems at all,
> > including:
> Fede, based on this statement, can we close this BZ?
Closing.
If this was incorrect, please reopen with the relevant details.

