Bug 1100566
| Summary: | [hosted engine] - vdsm needs HA agent configuration before deployment | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Moskovcak <jmoskovc> |
| Component: | vdsm | Assignee: | Federico Simoncelli <fsimonce> |
| Status: | CLOSED WORKSFORME | QA Contact: | Nikolai Sednev <nsednev> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 3.3.0 | CC: | acanan, agk, alukiano, amureini, bazulay, cluster-maint, dfediuck, eedri, fsimonce, gchaplik, gpadgett, iheim, jmoskovc, lpeer, mavital, pablo.iranzo, pstehlik, sbonazzo, sherold, teigland, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.5.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | storage | ||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1045053 | Environment: | |
| Last Closed: | 2014-07-30 13:39:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1045053, 1142923, 1156165 | ||
Comment 2
David Teigland
2014-05-23 14:35:27 UTC
My assumption about the problem with sanlock was based on this:
Thread-49::DEBUG::2013-12-19 13:25:44,912::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 4eea45f1-0be1-4c5c-9ec3-1460a16de055
Traceback (most recent call last):
File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
self.domain.acquireHostId(self.hostId, async=True)
File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
self._clusterLock.acquireHostId(hostId, async)
File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('4eea45f1-0be1-4c5c-9ec3-1460a16de055', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
If you don't think it's a problem with sanlock, then please reassign it to whatever component you think is causing the problem.
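Editorial note: the error code 22 carried by the SanlockException in the traceback above is the Linux errno for EINVAL. A minimal, self-contained Python sketch of how such an error could be classified; the `SanlockError` class below is a hypothetical stand-in mirroring the logged exception shape, not the real sanlock binding's type:

```python
import errno


class SanlockError(Exception):
    """Hypothetical stand-in for the exception shape seen in the log:
    SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')."""
    def __init__(self, err, msg, strerror):
        super().__init__(err, msg, strerror)
        self.errno = err


def is_lockspace_einval(exc):
    # errno 22 is EINVAL on Linux, the code reported in this bug's traceback.
    return isinstance(exc, SanlockError) and exc.errno == errno.EINVAL


# Values taken verbatim from the log line above.
exc = SanlockError(22, 'Sanlock lockspace add failure', 'Invalid argument')
print(is_lockspace_einval(exc))  # True
```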
The most likely cause for -EINVAL from add_lockspace is that a lockspace with the same name has already been added. In the next version I have included a log message when this happens. sanlock cannot do anything about this. vdsm will need to handle this situation.

I see that the host id was released successfully 100ms earlier:
Thread-31::INFO::2013-12-19 10:34:34,882::clusterlock::197::SANLock::(releaseHostId) Releasing host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-31::DEBUG::2013-12-19 10:34:34,882::clusterlock::207::SANLock::(releaseHostId) Host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 released successfully (id: 1)
...(no other sanlock operation)...
Thread-54::INFO::2013-12-19 10:34:34,975::clusterlock::174::SANLock::(acquireHostId) Acquiring host id for domain 6c57fd8e-d77b-4833-adff-0050415ac789 (id: 1)
Thread-54::DEBUG::2013-12-19 10:34:34,976::domainMonitor::263::Storage.DomainMonitorThread::(_monitorDomain) Unable to issue the acquire host id 1 request for domain 6c57fd8e-d77b-4833-adff-0050415ac789
Traceback (most recent call last):
File "/usr/share/vdsm/storage/domainMonitor.py", line 259, in _monitorDomain
self.domain.acquireHostId(self.hostId, async=True)
File "/usr/share/vdsm/storage/sd.py", line 458, in acquireHostId
self._clusterLock.acquireHostId(hostId, async)
File "/usr/share/vdsm/storage/clusterlock.py", line 189, in acquireHostId
raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('6c57fd8e-d77b-4833-adff-0050415ac789', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
I know we can't do much without sanlock logs, but do you think there could be another reason for EINVAL, or maybe a race between release/acquire?
Thanks.
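Editorial illustration (not part of the original comments): if the EINVAL really comes from re-adding a lockspace whose previous instance has not fully gone away yet, one way a caller such as vdsm could cope is a short bounded retry. The `acquire_host_id_with_retry` helper and `AcquireHostIdError` below are hypothetical placeholders, not the actual vdsm or sanlock APIs; this is a sketch of the idea only.

```python
import errno
import time


class AcquireHostIdError(Exception):
    """Hypothetical error carrying the errno from a failed lockspace add."""
    def __init__(self, err):
        super().__init__(err)
        self.errno = err


def acquire_host_id_with_retry(acquire, retries=5, delay=0.5):
    """Call acquire(), retrying briefly when it fails with EINVAL, which
    (per the discussion above) can mean the previous lockspace instance
    has not been fully removed yet."""
    for attempt in range(retries):
        try:
            return acquire()
        except AcquireHostIdError as e:
            if e.errno != errno.EINVAL or attempt == retries - 1:
                raise
            time.sleep(delay)  # give the in-progress removal time to finish


# Usage sketch with a fake acquire() that fails twice and then succeeds:
state = {"failures": 2}


def fake_acquire():
    if state["failures"] > 0:
        state["failures"] -= 1
        raise AcquireHostIdError(errno.EINVAL)
    return "host id acquired"


print(acquire_host_id_with_retry(fake_acquire))
```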
Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the following sanlock_add_lockspace() would likely see the previous instance which is not yet gone. In that case, add_lockspace returns EINVAL: https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

(In reply to David Teigland from comment #7)
> Is it doing sanlock_rem_lockspace(SANLK_REM_ASYNC)? If so, then the
> following sanlock_add_lockspace() would likely see the previous instance
> which is not yet gone. In that case, add_lockspace returns EINVAL:
> https://git.fedorahosted.org/cgit/sanlock.git/tree/src/lockspace.c?id=8163bc7b56be4cbe747dc1f3ad9a6f3bca368eb5#n642

No, we never use async for release (only for acquire). I double-checked the code as well.

Jiri, are you still hitting this issue?

Moving needinfo to pstehlik, because he reported this bug.

I see that the bug was opened for 3.3; for which version do you prefer reproduction, 3.3 or 3.4?

I checked it on ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch. I have an HE environment without changing any modes:
1) yum erase ovirt-host* -y
2) rm -rf /etc/ovirt-hosted*
3) yum install ovirt-hosted-engine-setup-1.1.3-2.el6ev.noarch -y
4) The first hosted-engine --deploy failed because a VM was already running; destroyed it with: vdsClient -s 0 destroy vm_id
5) hosted-engine --deploy on clean storage ran OK, without any problems, and also no problems during:
...
[ INFO ] Initializing sanlock metadata
[ INFO ] Creating VM Image
[ INFO ] Disconnecting Storage Pool
[ INFO ] Start monitoring domain
[ INFO ] Configuring VM
[ INFO ] Updating hosted-engine configuration
...
Please inform me if you need to check it for 3.3.

(In reply to Artyom from comment #13)
> 5) hosted-engine --deploy on clean storage ran OK, without any problems, and
> also no problems during:

Fede, based on this statement, can we close this BZ?

(In reply to Allon Mureinik from comment #14)
> (In reply to Artyom from comment #13)
> > 5) hosted-engine --deploy on clean storage ran OK, without any problems, and
> > also no problems during:
> Fede, based on this statement, can we close this BZ?

Closing. If this was incorrect, please reopen with the relevant details.
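Editorial illustration (appended, not part of the bug's comments): the thread above turns on whether the lockspace removal is synchronous or asynchronous. The toy model below is pure Python with no real sanlock involved, and every name in it is hypothetical; it only shows why an immediate re-add after an asynchronous removal can still see the old instance and fail with EINVAL, while a synchronous removal cannot.

```python
import errno
import threading
import time


class ToyLockspaceRegistry:
    """Toy stand-in for a daemon's table of active lockspaces."""
    def __init__(self):
        self._spaces = set()
        self._lock = threading.Lock()

    def add(self, name):
        with self._lock:
            if name in self._spaces:
                # Mirrors an add that fails with EINVAL when an instance
                # with the same name is still present.
                raise OSError(errno.EINVAL, "lockspace already exists")
            self._spaces.add(name)

    def remove(self, name, wait=True):
        def _do_remove():
            time.sleep(0.2)       # pretend teardown takes a little while
            with self._lock:
                self._spaces.discard(name)
        t = threading.Thread(target=_do_remove)
        t.start()
        if wait:                  # synchronous removal: return only when gone
            t.join()


registry = ToyLockspaceRegistry()
registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")

# Asynchronous removal followed by an immediate re-add hits EINVAL:
registry.remove("6c57fd8e-d77b-4833-adff-0050415ac789", wait=False)
try:
    registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")
except OSError as e:
    print("re-add failed:", e)

time.sleep(0.3)                   # let the background removal finish
registry.add("6c57fd8e-d77b-4833-adff-0050415ac789")   # now it succeeds
print("re-add succeeded after the removal completed")
```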