I don't see any reference to sanlock changes in the comments above; was this misassigned?
My assumption about the problem with sanlock was based on this:

Thread-49::DEBUG::2013-12-19 13:25:44,912::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 4eea45f1-0be1-4c5c-9ec3-1460a16de055
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('4eea45f1-0be1-4c5c-9ec3-1460a16de055', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

If you don't think it's a problem with sanlock, please reassign it to whichever component you think is causing the problem.
The most likely cause of -EINVAL from add_lockspace is that a lockspace with the same name has already been added. In the next version I have included a log message when this happens. sanlock cannot do anything about this; vdsm will need to handle this situation.
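As an illustration only (this is not vdsm code; the function names and the retry policy are hypothetical), a caller could treat EINVAL from add_lockspace as "a lockspace with the same name may still be registered" and retry a few times before giving up:

```python
import errno
import time


class SanlockException(Exception):
    """Simplified stand-in for the real sanlock binding's exception."""
    def __init__(self, errnum, msg):
        super().__init__(errnum, msg)
        self.errno = errnum


def acquire_host_id_with_retry(add_lockspace, retries=3, delay=1.0):
    """Call add_lockspace, retrying on EINVAL.

    EINVAL can mean a lockspace with the same name is still registered,
    e.g. a previous instance whose removal has not completed yet, so a
    short wait-and-retry may let the old instance disappear.
    """
    for attempt in range(retries):
        try:
            add_lockspace()
            return True
        except SanlockException as e:
            # Re-raise anything that is not EINVAL, or the final failure.
            if e.errno != errno.EINVAL or attempt == retries - 1:
                raise
            time.sleep(delay)
    return False
```

Whether retrying is the right policy (versus surfacing the error to the monitoring loop, as vdsm's domainMonitor does) is a design decision for vdsm; the sketch only shows that the error is recoverable from the caller's side.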
I see that the host id was released successfully 100ms earlier:

Thread-31::INFO::2013-12-19 10:34:34,882::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-31::DEBUG::2013-12-19 10:34:34,882::clusterlock::207::SANLock::(releaseHostId) Host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 released successfully (id: 1)
...(no other sanlock operation)...
Thread-54::INFO::2013-12-19 10:34:34,975::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-54::DEBUG::2013-12-19 10:34:34,976::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 6c57fd8e-d77b-4833-adff-0050415ac789
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6c57fd8e-d77b-4833-adff-0050415ac789', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

I know we can't do much without sanlock logs, but do you think there could be another reason for EINVAL, or maybe a race between release/acquire? Thanks.
Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the following sanlock_add_lockspace() would likely see the previous instance, which is not yet gone. In that case, add_lockspace returns EINVAL: https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642
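To make the suspected race concrete, here is a toy model (not sanlock code; the class and behavior are a simplified assumption based on the description above) of a daemon-side lockspace list where an asynchronous removal returns before the old instance is actually gone, so an immediate re-add on the same name fails with -EINVAL:

```python
import errno
import threading
import time


class LockspaceRegistry:
    """Toy model of a daemon's in-memory lockspace list.

    Illustrates why an async removal can race with a subsequent add:
    rem_lockspace(async_=True) returns immediately while teardown
    finishes in the background, so the old instance is still listed
    when add_lockspace() runs next, and the add fails with -EINVAL.
    """

    def __init__(self):
        self._spaces = {}
        self._lock = threading.Lock()

    def add_lockspace(self, name):
        with self._lock:
            if name in self._spaces:
                # An existing (possibly still-being-removed) lockspace
                # with the same name makes the add fail.
                return -errno.EINVAL
            self._spaces[name] = "active"
            return 0

    def rem_lockspace(self, name, async_=False):
        def finish():
            time.sleep(0.05)  # simulated teardown work
            with self._lock:
                self._spaces.pop(name, None)

        if async_:
            # Return immediately; the lockspace disappears later.
            threading.Thread(target=finish).start()
            return 0
        finish()  # synchronous: the lockspace is gone before we return
        return 0
```

With async_=True, an immediate add on the same name returns -EINVAL; with synchronous removal it returns 0, which is consistent with vdsm's statement below that they never use async for release.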
(In reply to David Teigland from comment #7)
> Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the
> following sanlock_add_lockspace() would likely see the previous instance
> which is not yet gone. In that case, add_lockspace returns EINVAL:
> https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

No, we never use async for release (only for acquire). I double-checked the code as well.

Jiri, are you still hitting this issue?
Moving needinfo to pstehlik, since he reported this bug.
I see this bug was opened against 3.3. On which version would you prefer the reproduction, 3.3 or 3.4?
I checked it on ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch. I have an HE environment without changing any modes:

1) yum erase ovirt-host* -y
2) rm -rf /etc/ovirt-hosted*
3) yum install ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch -y
4) The first hosted-engine --deploy failed because we had a running vm: vdsClient -s 0 destroy vm_id
5) hosted-engine --deploy on clean storage ran ok, without any problems, including at:

...
[ INFO ] Initializing sanlock metadata
[ INFO ] Creating VM Image
[ INFO ] Disconnecting Storage Pool
[ INFO ] Start monitoring domain
[ INFO ] Configuring VM
[ INFO ] Updating hosted-engine configuration
...

Please let me know if you need me to check it on 3.3.
(In reply to Artyom from comment #13)
> 5) hosted-engine --deploy on clean storage, run ok, without any problems and
> also no any problems on:

Fede, based on this statement, can we close this BZ?
(In reply to Allon Mureinik from comment #14)
> (In reply to Artyom from comment #13)
> > 5) hosted-engine --deploy on clean storage, run ok, without any problems and
> > also no any problems on:
> Fede, based on this statement, can we close this BZ?

Closing. If this was incorrect, please reopen with the relevant details.