Bug 1408602
Summary: Failed to get auto-import for hosted_storage after 3.5.8-0.1 to 3.6.10.1-0.1 upgrade.

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-engine |
| Status | CLOSED DUPLICATE |
| Severity | urgent |
| Priority | unspecified |
| Version | 3.6.9 |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Keywords | Regression |
| Reporter | Nikolai Sednev <nsednev> |
| Assignee | Simone Tiraboschi <stirabos> |
| QA Contact | meital avital <mavital> |
| CC | dfediuck, gklein, lsurette, msivak, nsednev, nsoffer, rbalakri, Rhev-m-bugs, srevivo, ykaul, yquinn |
| oVirt Team | SLA |
| Type | Bug |
| Last Closed | 2017-01-02 11:09:38 UTC |
Description
Nikolai Sednev
2016-12-25 17:32:56 UTC
Created attachment 1235281 [details]: engine's sosreport
Created attachment 1235282 [details]: alma03's sosreport
Created attachment 1235283 [details]: alma04's sosreport
In case anyone is interested in why there is a time shift between the logs from the host (RHEV-H) and the engine, the answer is here: https://access.redhat.com/solutions/440343. "All RHEV Hypervisors use UTC as the system time and this can not be changed. Any guests that have the incorrect time need to have their time zones changed to reflect the proper time." I had around a 2-hour time shift between the engine and RHEV-H: the engine was first deployed on RHEL 6.7 with the correct time zone, but when the engine was migrated to RHEV-H, the time zone changed and was no longer synchronized with what it should have been.

Comment (Nir Soffer)

Nikolai, can you point me to the vdsm log containing this error?

```
Dec 25, 2016 6:38:49 PM
VDSM command failed: Cannot acquire host id: (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
```

Also, /var/log/sanlock.log from the same host can be useful.

Comment

I have found a possible cause for that VDSM failure in vdsm.log.1 of host alma04:

```
Thread-16::ERROR::2016-12-25 15:56:27,693::sdc::145::Storage.StorageDomainCache::(_findDomain) domain 81af2dad-ac74-4e6b-9f56-d69c5811686a not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 173, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('81af2dad-ac74-4e6b-9f56-d69c5811686a',)
Thread-16::ERROR::2016-12-25 15:56:27,697::task::866::Storage.TaskManager.Task::(_setError) Task=`c22ca9a6-9869-40aa-9ede-6ae723f293d1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 277, in validateSdUUID
  ...
```

I'm not sure whether these logs correlate; this needs to be verified. This issue was raised in the past: https://bugzilla.redhat.com/show_bug.cgi?id=1269768#c31

Comment (Nikolai Sednev)

(In reply to Nir Soffer from comment #6)
> Nikolai, can you point me to the vdsm log containing this error?
>
> Dec 25, 2016 6:38:49 PM
> VDSM command failed: Cannot acquire host id:
> (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock
> lockspace add failure', 'Invalid argument'))"

The time gap should be +2 hours between UTC and IST, so the RHEV-H host always lags 2 hours behind the engine's IST. You should correlate with the engine log by subtracting 2 hours from its events and then looking at the vdsm log from the RHEV-H host.

Comment (Simone Tiraboschi)

This is exactly a duplicate of rhbz#1322849: the engine was trying to import the hosted-engine storage domain using the host alma03. On alma03, /etc/ovirt-hosted-engine/hosted-engine.conf contains host_id=3, so ovirt-ha-agent is locking with host_id=3. On the other side, the engine thinks that alma03 has host_id=1 and so it tries to acquire the lock again with host_id=1, which fails. Indeed, in /var/log/sanlock.log we see:

```
2016-12-25 16:38:49+0000 4219 [16357]: add_lockspace 81af2dad-ac74-4e6b-9f56-d69c5811686a:1:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_nsednev__he__1/81af2dad-ac74-4e6b-9f56-d69c5811686a/dom_md/ids:0 conflicts with name of list1 s4 81af2dad-ac74-4e6b-9f56-d69c5811686a:3:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_nsednev__he__1/81af2dad-ac74-4e6b-9f56-d69c5811686a/dom_md/ids:0
```

where the relevant information is the conflict between 81af2dad-ac74-4e6b-9f56-d69c5811686a:1 and 81af2dad-ac74-4e6b-9f56-d69c5811686a:3.

In the vdsm logs we see:

```
jsonrpc.Executor/1::DEBUG::2016-12-25 16:38:49,810::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'StorageDomain.detach' in bridge with {u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'force': True, u'storagedomainID': u'81af2dad-ac74-4e6b-9f56-d69c5811686a'}
jsonrpc.Executor/1::DEBUG::2016-12-25 16:38:49,811::task::595::Storage.TaskManager.Task::(_updateState) Task=`b95270fd-34e8-4b7f-adfd-6f587bab67cf`::moving from state init -> state preparing
jsonrpc.Executor/1::INFO::2016-12-25 16:38:49,811::logUtils::48::dispatcher::(wrapper) Run and protect: forcedDetachStorageDomain(sdUUID=u'81af2dad-ac74-4e6b-9f56-d69c5811686a', spUUID=u'00000000-0000-0000-0000-000000000000', options=None)
jsonrpc.Executor/1::INFO::2016-12-25 16:38:49,811::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 81af2dad-ac74-4e6b-9f56-d69c5811686a (id: 1)
jsonrpc.Executor/1::ERROR::2016-12-25 16:38:49,813::task::866::Storage.TaskManager.Task::(_setError) Task=`b95270fd-34e8-4b7f-adfd-6f587bab67cf`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 805, in forcedDetachStorageDomain
    self._deatchStorageDomainFromOldPools(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 781, in _deatchStorageDomainFromOldPools
    dom.acquireHostId(pool.id)
  File "/usr/share/vdsm/storage/sd.py", line 533, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 234, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
```

and here too the relevant information is the conflict between (id: 1) from the engine and host_id=3 in /etc/ovirt-hosted-engine/hosted-engine.conf.

As in https://bugzilla.redhat.com/show_bug.cgi?id=1322849#c18, the workaround is to manually sync the host_id value used by ovirt-ha-agent with the value used by the engine. To do that:

1. Run on the engine VM: `sudo -u postgres psql engine -c "select vds_name, vds_spm_id from vds ORDER BY vds_spm_id"`
2. Check the value of host_id in /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts. If it doesn't match the value from step 1:
3. Put the host into maintenance mode from the engine, fix the value in /etc/ovirt-hosted-engine/hosted-engine.conf, and reboot the host.
4. When the host is up again, reactivate it from the engine.

*** This bug has been marked as a duplicate of bug 1322849 ***

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1409771 to get this issue covered there.
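The mismatch check at the heart of the workaround (engine's vds_spm_id vs. the agent's host_id) can be sketched in Python. This is an illustrative helper, not part of oVirt: `parse_host_id` and `find_mismatches` are hypothetical names, and the engine-side id mapping is assumed to come from the psql query in step 1 above.

```python
import re

def parse_host_id(conf_text: str) -> int:
    """Extract host_id from hosted-engine.conf-style text
    (e.g. the contents of /etc/ovirt-hosted-engine/hosted-engine.conf)."""
    m = re.search(r"^host_id=(\d+)\s*$", conf_text, re.MULTILINE)
    if not m:
        raise ValueError("host_id not found in hosted-engine.conf")
    return int(m.group(1))

def find_mismatches(engine_ids: dict, host_confs: dict) -> dict:
    """Return {host: (engine_id, agent_id)} for every host whose
    hosted-engine.conf host_id disagrees with the engine's vds_spm_id."""
    return {
        host: (engine_ids[host], parse_host_id(conf))
        for host, conf in host_confs.items()
        if parse_host_id(conf) != engine_ids[host]
    }

# Values mirroring this report: the engine thinks alma03 has id 1
# (from the vds_spm_id query), but the host's config says host_id=3.
engine_ids = {"alma03": 1, "alma04": 2}
host_confs = {
    "alma03": "fqdn=alma03\nhost_id=3\n",  # hypothetical conf contents
    "alma04": "fqdn=alma04\nhost_id=2\n",
}
print(find_mismatches(engine_ids, host_confs))  # only alma03 conflicts
```

A host reported here would then go through steps 3 and 4 of the workaround (maintenance mode, fix the file, reboot, reactivate).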
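The 2-hour engine/host clock offset discussed earlier makes log correlation error-prone by hand. A minimal sketch of shifting engine-side timestamps back to UTC before grepping the host's vdsm/sanlock logs, assuming the fixed +2:00 offset stated in this report (adjust if DST applies):

```python
from datetime import datetime, timedelta

# Offset between the engine's local clock (IST, +2 hours per this report)
# and the hypervisor's UTC clock.
ENGINE_UTC_OFFSET = timedelta(hours=2)

def engine_to_utc(ts: str) -> str:
    """Convert an engine-log timestamp (engine local time) to UTC
    for matching against vdsm/sanlock log lines on the host."""
    local = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return (local - ENGINE_UTC_OFFSET).strftime("%Y-%m-%d %H:%M:%S")

# The engine event at 18:38:49 local time corresponds to the sanlock
# add_lockspace failure logged at 16:38:49+0000 on the host.
print(engine_to_utc("2016-12-25 18:38:49"))  # -> 2016-12-25 16:38:49
```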