Description of problem:
Failed to get auto-import of hosted_storage after upgrading from 3.5.8-0.1 to 3.6.10.1-0.1.

My 3.5 environment consisted of two hosts:

alma03 - 3.5's RHEL6.7:
ovirt-host-deploy-1.3.2-1.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.479.el6_7.5.x86_64
vdsm-4.16.38-1.el6ev.x86_64
sanlock-2.8-2.el6_5.x86_64
mom-0.4.1-4.el6ev.noarch
ovirt-hosted-engine-ha-1.2.10-1.el6ev.noarch
libvirt-client-0.10.2-54.el6_7.6.x86_64
ovirt-hosted-engine-setup-1.2.6.1-1.el6ev.noarch
Linux version 2.6.32-431.76.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Tue Nov 1 14:24:00 EDT 2016
Linux 2.6.32-431.76.1.el6.x86_64 #1 SMP Tue Nov 1 14:24:00 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.7 (Santiago)

alma04 - 3.5's RHEVH7.2 hypervisor for RHEV 3.5 (RHEV Hypervisor - 7.2 - 20160219.0.el7ev):
qemu-kvm-rhev-2.3.0-31.el7_2.7.x86_64
vdsm-4.16.35-2.el7ev.x86_64
ovirt-node-plugin-hosted-engine-0.2.0-18.0.el7ev.noarch
ovirt-node-branding-rhev-3.2.3-31.el7.noarch
rhevm-sdk-python-3.5.6.0-1.el7ev.noarch
mom-0.4.1-4.el7ev.noarch
ovirt-node-selinux-3.2.3-31.el7.noarch
ovirt-host-deploy-offline-1.3.0-3.el7ev.x86_64
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.2.0-26.el7ev.noarch
ovirt-node-plugin-cim-3.2.3-31.el7.noarch
ovirt-node-plugin-snmp-3.2.3-31.el7.noarch
sanlock-3.2.4-1.el7.x86_64
libvirt-client-1.2.17-13.el7_2.3.x86_64
ovirt-node-plugin-rhn-3.2.3-31.el7.noarch
ovirt-node-3.2.3-31.el7.noarch
ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
ovirt-host-deploy-1.3.2-1.el7ev.noarch
Linux version 3.10.0-327.10.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Sat Jan 23 04:54:55 EST 2016
Linux 3.10.0-327.10.1.el7.x86_64 #1 SMP Sat Jan 23 04:54:55 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160219.0.el7ev)

I followed https://access.redhat.com/solutions/2351141 and upgraded the engine from Red Hat Enterprise Virtualization Manager 3.5.8-0.1.el6ev to 3.6.10.1-0.1.el6.noarch (initially the engine was running on alma03, which was the RHEL6.7 host on 3.5).
3.6.10.1-0.1 engine's components:
rhevm-sdk-python-3.6.9.1-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
rhevm-restapi-3.5.8-0.1.el6ev.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.10.1-0.1.el6.noarch
rhevm-websocket-proxy-3.6.10.1-0.1.el6.noarch
rhevm-guest-agent-common-1.0.11-6.el6ev.noarch
rhevm-image-uploader-3.6.1-2.el6ev.noarch
rhevm-spice-client-x64-msi-3.6-7.el6.noarch
rhevm-setup-base-3.6.10.1-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.10.1-0.1.el6.noarch
rhevm-dbscripts-3.5.8-0.1.el6ev.noarch
rhevm-backend-3.5.8-0.1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.10.1-0.1.el6.noarch
rhevm-reports-setup-3.6.5.1-1.el6ev.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
rhevm-spice-client-x86-msi-3.6-7.el6.noarch
rhevm-branding-rhev-3.6.0-10.el6ev.noarch
rhevm-lib-3.6.10.1-0.1.el6.noarch
rhevm-3.5.8-0.1.el6ev.noarch
rhevm-dwh-3.5.5-1.el6ev.noarch
rhev-release-3.6.10-2-001.noarch
rhevm-setup-plugins-3.6.5-1.el6ev.noarch
rhevm-dwh-setup-3.6.8-1.el6ev.noarch
rhevm-spice-client-x64-cab-3.6-7.el6.noarch
rhevm-tools-3.5.8-0.1.el6ev.noarch
rhevm-setup-3.6.10.1-0.1.el6.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-spice-client-x86-cab-3.6-7.el6.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.7-2.el6ev.noarch
rhevm-userportal-3.5.8-0.1.el6ev.noarch
rhevm-webadmin-portal-3.5.8-0.1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.10.1-0.1.el6.noarch
rhevm-doc-3.6.10-1.el6ev.noarch
rhevm-dependencies-3.6.1-1.el6ev.noarch
rhevm-reports-3.5.8-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.10.1-0.1.el6.noarch
rhevm-cli-3.6.9.0-1.el6ev.noarch
rhev-guest-tools-iso-3.6-6.el6ev.noarch
Linux version 2.6.32-642.11.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) ) #1 SMP Wed Oct 26 10:25:23 EDT 2016
Linux 2.6.32-642.11.1.el6.x86_64 #1 SMP Wed Oct 26 10:25:23 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.8 (Santiago)

I reprovisioned the RHEL host to RHEV Hypervisor - 7.2 - 20160219.0.el7ev (3.5), then set it to maintenance, upgraded it to Hypervisor - 7.3 - 20161028.1.el7ev (3.6), and migrated the HE-VM back to it, into the default host cluster, from the temporary host cluster. I then upgraded the remaining RHEV Hypervisor - 7.2 - 20160219.0.el7ev (3.5) host in the temporary host cluster to Hypervisor - 7.3 - 20161028.1.el7ev (3.6) and moved it to the default host cluster. I upgraded the engine from 3.5.8-0.1 to 3.6.10.1-0.1.el6. I bumped the default host cluster, now holding the two RHEVH 3.6 (20161028.1.el7ev) hosts, from 3.5 to 3.6 compatibility mode and received the message:

"Change Cluster Compatibility Version
There are running VMs in the cluster. All of them have to be rebooted if the Cluster Compatibility Version is changed.
You are about to change the Cluster Compatibility Version. Are you sure you want to continue?"

I then bumped the Data Center from 3.5 to 3.6. While the data center was still on 3.5, I already saw that hosted_storage had started the auto-import; it appeared in the storage tab with a locked symbol. After the data center was bumped to 3.6, hosted_storage was still in Locked status in the WebUI and the auto-import did not finish successfully. In the WebUI I saw:

"Dec 25, 2016 6:40:42 PM Data Center Default was updated by admin@internal
Dec 25, 2016 6:38:57 PM The Hosted Engine Storage Domain isn't Active.
Dec 25, 2016 6:38:49 PM Failed to import the Hosted Engine Storage Domain
Dec 25, 2016 6:38:49 PM Failed to attach Storage Domain hosted_storage to Data Center Default. (User: SYSTEM)
Dec 25, 2016 6:38:49 PM VDSM command failed: Cannot acquire host id: (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))"

I tried setting alma04 to global maintenance and then ran the following on the engine:
engine-config --list | grep -i ovf
engine-config -s OvfUpdateIntervalInMinutes=1
service ovirt-engine restart
Afterwards I set maintenance back with "hosted-engine --set-maintenance --mode=none" on alma04.

Version-Release number of selected component (if applicable):
3.6.10

RHEVH's components on both hosts after the upgrade:
ovirt-node-lib-3.6.1-34.0.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.1-34.0.el7ev.noarch
qemu-kvm-rhev-2.6.0-27.el7.x86_64
ovirt-node-plugin-hosted-engine-0.3.0-8.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-11.el7ev.noarch
ovirt-node-plugin-cim-3.6.1-34.0.el7ev.noarch
ovirt-node-branding-rhev-3.6.1-34.0.el7ev.noarch
rhevm-sdk-python-3.6.9.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
libvirt-client-2.0.0-10.el7.x86_64
ovirt-node-lib-legacy-3.6.1-34.0.el7ev.noarch
ovirt-node-selinux-3.6.1-34.0.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.1-34.0.el7ev.noarch
mom-0.5.6-1.el7ev.noarch
vdsm-4.17.35-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.3-1.el7ev.noarch
ovirt-node-plugin-rhn-3.6.1-34.0.el7ev.noarch
ovirt-node-plugin-snmp-3.6.1-34.0.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
ovirt-node-lib-config-3.6.1-34.0.el7ev.noarch
ovirt-node-3.6.1-34.0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.8-1.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
Linux version 3.10.0-514.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Wed Oct 19 11:24:13 EDT 2016
Linux 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Virtualization Hypervisor release 7.3 (20161028.1.el7ev)

Components on the engine after the upgrade:
ovirt-vmconsole-proxy-1.0.4-1.el6ev.noarch
rhevm-webadmin-portal-3.6.10.1-0.1.el6.noarch
rhevm-sdk-python-3.6.9.1-1.el6ev.noarch
ovirt-setup-lib-1.0.1-1.el6ev.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.10.1-0.1.el6.noarch
rhevm-tools-3.6.10.1-0.1.el6.noarch
rhevm-dwh-3.6.8-1.el6ev.noarch
rhevm-websocket-proxy-3.6.10.1-0.1.el6.noarch
rhevm-guest-agent-common-1.0.11-6.el6ev.noarch
rhevm-image-uploader-3.6.1-2.el6ev.noarch
rhevm-spice-client-x64-msi-3.6-7.el6.noarch
rhevm-dbscripts-3.6.10.1-0.1.el6.noarch
rhevm-setup-base-3.6.10.1-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.10.1-0.1.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.10.1-0.1.el6.noarch
rhevm-reports-setup-3.6.5.1-1.el6ev.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
rhevm-spice-client-x86-msi-3.6-7.el6.noarch
ovirt-vmconsole-1.0.4-1.el6ev.noarch
rhevm-tools-backup-3.6.10.1-0.1.el6.noarch
rhevm-userportal-3.6.10.1-0.1.el6.noarch
rhevm-branding-rhev-3.6.0-10.el6ev.noarch
rhevm-lib-3.6.10.1-0.1.el6.noarch
rhev-release-3.6.10-2-001.noarch
rhevm-setup-plugins-3.6.5-1.el6ev.noarch
rhevm-dwh-setup-3.6.8-1.el6ev.noarch
rhevm-spice-client-x64-cab-3.6-7.el6.noarch
rhev-hypervisor7-7.3-20161028.1.el6ev.noarch
rhevm-restapi-3.6.10.1-0.1.el6.noarch
rhevm-reports-3.6.5.1-1.el6ev.noarch
rhevm-setup-3.6.10.1-0.1.el6.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-spice-client-x86-cab-3.6-7.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.10.1-0.1.el6.noarch
rhevm-backend-3.6.10.1-0.1.el6.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.7-2.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.10.1-0.1.el6.noarch
rhevm-doc-3.6.10-1.el6ev.noarch
rhevm-dependencies-3.6.1-1.el6ev.noarch
rhevm-3.6.10.1-0.1.el6.noarch
rhevm-setup-plugin-websocket-proxy-3.6.10.1-0.1.el6.noarch
rhevm-cli-3.6.9.0-1.el6ev.noarch
rhev-guest-tools-iso-3.6-6.el6ev.noarch
Linux version 2.6.32-642.11.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) ) #1 SMP Wed Oct 26 10:25:23 EDT 2016
Linux 2.6.32-642.11.1.el6.x86_64 #1 SMP Wed Oct 26 10:25:23 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.8 (Santiago)

How reproducible:
100%

Steps to Reproduce:
1. Deploy a clean 3.5 HE environment on one RHEL6.7 host over NFS and add a pair of data storage domains to it.
2. Create a temporary host cluster and name it "test".
3. Add an additional 3.5 RHEVH7.2 host as the second hosted-engine host to host cluster "test".
4. Migrate the HE-VM via the WebUI to the 3.5 RHEVH7.2 host in host cluster "test".
5. Remove the RHEL6.7 host from the default host cluster, reprovision it to 3.5 RHEVH7.2, then add it back to the default host cluster as a hosted-engine host.
6. Run "yum install rhev-hypervisor7" on the engine and upgrade the 3.5 RHEVH7.2 host in the default host cluster to 3.6 RHEVH7.3.
7. Migrate the HE-VM to the 3.6 RHEVH7.3 host in the default host cluster.
8. Upgrade the remaining host in the "test" host cluster from 3.5 RHEVH7.2 to 3.6 RHEVH7.3 and then add it to the default host cluster.
9. Remove the "test" host cluster.
10. Upgrade the engine from 3.5 to 3.6.10 while the hosts are in global maintenance (see the maintenance command sketch below).
11. Cancel global maintenance on the hosts once the engine upgrade is finished.
12. Bump the default host cluster to 3.6 compatibility mode.
13. Bump the data center to 3.6 compatibility mode.

Actual results:
hosted_storage is stuck in Locked status.

Expected results:
Import of hosted_storage should succeed once the host cluster compatibility mode is changed from 3.5 to 3.6.

Additional info:
See sosreports from both hosts and the engine.
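For reference, the global maintenance in steps 10-11 is toggled with the hosted-engine tool; a minimal sketch (the --mode=none form also appears earlier in this report, and since the global flag lives in the shared HE metadata, setting it from one hosted-engine host should be enough):

# Before upgrading the engine (step 10), from one of the hosted-engine hosts:
hosted-engine --set-maintenance --mode=global

# After the engine upgrade is finished (step 11):
hosted-engine --set-maintenance --mode=none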
Created attachment 1235281 [details] engine's sosreport
Created attachment 1235282 [details] alma03's sosreport
Created attachment 1235283 [details] alma04's sosreport
In case someone is interested in why there is a time shift between the logs from the host (RHEVH) and the engine, the answer is here: https://access.redhat.com/solutions/440343.
"All RHEV Hypervisors use UTC as the system time and this can not be changed. Any guests that have the incorrect time need to have their time zones changed to reflect the proper time."
I had around a 2-hour time shift between the engine and RHEVH, as the engine was first deployed on a RHEL6.7 host with the correct time zone; once the engine migrated to RHEVH, the time zone changed and was no longer synchronized with what it should have been.
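To see the offset directly, one can simply compare the clocks on both sides (a minimal illustration, not taken from the sosreports):

# On the RHEVH host, which always runs on UTC:
date -u
# On the engine VM, which reports its configured time zone (IST in this setup):
date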
Nikolai, can you point me to the vdsm log containing this error?

"Dec 25, 2016 6:38:49 PM VDSM command failed: Cannot acquire host id: (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))"
Also, /var/log/sanlock.log from the same host can be useful.
I have found a possible cause for that VDSM failure in vdsm.log.1 of host alma04:

Thread-16::ERROR::2016-12-25 15:56:27,693::sdc::145::Storage.StorageDomainCache::(_findDomain) domain 81af2dad-ac74-4e6b-9f56-d69c5811686a not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 173, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('81af2dad-ac74-4e6b-9f56-d69c5811686a',)
Thread-16::ERROR::2016-12-25 15:56:27,697::task::866::Storage.TaskManager.Task::(_setError) Task=`c22ca9a6-9869-40aa-9ede-6ae723f293d1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 277, in validateSdUUID
  ...

I'm not sure if the logs correlate; this needs to be verified. This issue was raised in the past: https://bugzilla.redhat.com/show_bug.cgi?id=1269768#c31
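To check whether these vdsm errors correlate with the sanlock failure reported in the events, one could grep both logs on alma04 for the storage domain UUID (a sketch, assuming the default log locations):

grep 81af2dad-ac74-4e6b-9f56-d69c5811686a /var/log/vdsm/vdsm.log* /var/log/sanlock.log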
(In reply to Nir Soffer from comment #6)
> Nikolai, can you point me to the vdsm log containing this error?
>
> Dec 25, 2016 6:38:49 PM
> VDSM command failed: Cannot acquire host id:
> (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock
> lockspace add failure', 'Invalid argument'))

The time gap between UTC and IST is +2 hours, so the RHEVH host always lags 2 hours behind the engine's IST. You should correlate with the engine log by subtracting 2 hours from its events and then looking at the vdsm log from the RHEVH host.
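As a concrete example of that correlation (illustration only, using GNU date and the +2 hour IST offset mentioned above), the engine event at 6:38:49 PM IST maps to 16:38:49 UTC on the host:

date -u -d '2016-12-25 18:38:49 +0200'
# Sun Dec 25 16:38:49 UTC 2016 - matches the vdsm/sanlock timestamps on alma04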
This is exactly a duplicate of rhbz#1322849:
The engine was trying to import the hosted-engine storage domain using the host alma03. On alma03, under /etc/ovirt-hosted-engine/hosted-engine.conf, we have host_id=3, so ovirt-ha-agent is locking with host_id=3. On the other side, the engine thinks that alma03 has host_id=1 and so it tries to acquire the lock again with host_id=1, but this fails. Indeed, in /var/log/sanlock.log we see:

2016-12-25 16:38:49+0000 4219 [16357]: add_lockspace 81af2dad-ac74-4e6b-9f56-d69c5811686a:1:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_nsednev__he__1/81af2dad-ac74-4e6b-9f56-d69c5811686a/dom_md/ids:0 conflicts with name of list1 s4 81af2dad-ac74-4e6b-9f56-d69c5811686a:3:/rhev/data-center/mnt/10.35.110.11:_Compute__NFS_nsednev__he__1/81af2dad-ac74-4e6b-9f56-d69c5811686a/dom_md/ids:0

where the relevant information is the conflict between 81af2dad-ac74-4e6b-9f56-d69c5811686a:1 and 81af2dad-ac74-4e6b-9f56-d69c5811686a:3.

In the vdsm logs we see:

jsonrpc.Executor/1::DEBUG::2016-12-25 16:38:49,810::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'StorageDomain.detach' in bridge with {u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'force': True, u'storagedomainID': u'81af2dad-ac74-4e6b-9f56-d69c5811686a'}
jsonrpc.Executor/1::DEBUG::2016-12-25 16:38:49,811::task::595::Storage.TaskManager.Task::(_updateState) Task=`b95270fd-34e8-4b7f-adfd-6f587bab67cf`::moving from state init -> state preparing
jsonrpc.Executor/1::INFO::2016-12-25 16:38:49,811::logUtils::48::dispatcher::(wrapper) Run and protect: forcedDetachStorageDomain(sdUUID=u'81af2dad-ac74-4e6b-9f56-d69c5811686a', spUUID=u'00000000-0000-0000-0000-000000000000', options=None)
jsonrpc.Executor/1::INFO::2016-12-25 16:38:49,811::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 81af2dad-ac74-4e6b-9f56-d69c5811686a (id: 1)
jsonrpc.Executor/1::ERROR::2016-12-25 16:38:49,813::task::866::Storage.TaskManager.Task::(_setError) Task=`b95270fd-34e8-4b7f-adfd-6f587bab67cf`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 805, in forcedDetachStorageDomain
    self._deatchStorageDomainFromOldPools(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 781, in _deatchStorageDomainFromOldPools
    dom.acquireHostId(pool.id)
  File "/usr/share/vdsm/storage/sd.py", line 533, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 234, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'81af2dad-ac74-4e6b-9f56-d69c5811686a', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

Here too the relevant information is the conflict between (id: 1) from the engine and host_id=3 under /etc/ovirt-hosted-engine/hosted-engine.conf.

As for https://bugzilla.redhat.com/show_bug.cgi?id=1322849#c18, the workaround is manually syncing the host_id value used by ovirt-ha-agent with the value used by the engine. To do that:
1. Run on the engine VM: sudo -u postgres psql engine -c "select vds_name, vds_spm_id from vds ORDER BY vds_spm_id"
2. Check the value of host_id under /etc/ovirt-hosted-engine/hosted-engine.conf on all the involved hosts. If it doesn't match the value from point 1:
3. Put the host into maintenance mode from the engine, fix the value under /etc/ovirt-hosted-engine/hosted-engine.conf and reboot the host.
4. When the host is up again, reactivate it from the engine.
(See the command sketch below.)

*** This bug has been marked as a duplicate of bug 1322849 ***
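For convenience, a minimal command sketch of the check described in the workaround above (the paths and the SQL query are taken from this comment; nothing else is assumed):

# On the engine VM: list the host ids (vds_spm_id) the engine has assigned
sudo -u postgres psql engine -c "select vds_name, vds_spm_id from vds ORDER BY vds_spm_id"

# On each hosted-engine host: check the id ovirt-ha-agent is using
grep host_id /etc/ovirt-hosted-engine/hosted-engine.conf

# If the values differ: put the host into maintenance from the engine, edit
# host_id in hosted-engine.conf to match vds_spm_id, reboot the host, and
# reactivate it from the engine once it is back up.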
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1409771 to get this issue covered there.