Description of problem:
HE-SD mounted twice on host and ovirt-ha-agent.service dead.

# systemctl status ovirt-ha-agent.service -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2016-01-21 16:18:08 IST; 1h 20min ago
  Process: 120555 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon (code=exited, status=0/SUCCESS)
 Main PID: 120555 (code=exited, status=0/SUCCESS)

Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: Traceback (most recent call last):
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: self._initialize_domain_monitor()
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: raise Exception(msg)
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Shutting down the agent because of 3 failures in a row!
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Failed to stop monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=04cae086-8504-443f-a8bd-9a747d19bcee'
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

# ls -l /rhev/data-center/mnt/
total 16
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
drwxrwxrwx. 4 root root 4096 Jan 20 20:07 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2
drwxr-xr-x. 2 vdsm kvm 4096 Jan 20 20:10 _var_lib_ovirt-hosted-engine-setup_tmpC04sVa

Version-Release number of selected component (if applicable):
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
mom-0.5.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.6.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.18-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
Linux version 3.10.0-327.8.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jan 11 05:03:18 EST 2016

How reproducible:
100%

Steps to Reproduce:
1. Deploy HE on 2 clean servers over an NFS SD, then add an additional NFS SD for data so that auto-import brings the HE-SD into the WebUI.
2. Set the host running the HE-VM into maintenance mode via the WebUI and wait until the HE-VM migrates successfully to the second host.
3. Activate the first host again via the WebUI.
4. Check whether ovirt-ha-agent is running on the first host.

Actual results:
ovirt-ha-agent is not running on the first host and the host cannot provide HA; the HE-SD is seen mounted twice from the first host, which is not running the HE-VM.

Expected results:
Once the host is activated again, HA should be available on it, ha-agent should run without errors, and the HE-SD should be seen only once from each host.

Additional info:
sosreports attached from both hosts and the engine.
Created attachment 1116970 [details]
sosreport from HE-host alma02
Created attachment 1116971 [details]
sosreport from the engine
Adding the sosreport from HE-host alma04 (the agent died on this host) as an external link:
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88aXVIYlM4VHJtWms/view?usp=sharing
SD of HE-VM seen twice from both hosts:

ls -l /rhev/data-center/mnt/
total 20
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
drwxrwxrwx. 4 root root 4096 Jan 20 20:07
MainThread::ERROR::2016-01-21 13:41:26,668::hosted_engine::832::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
MainThread::WARNING::2016-01-21 13:41:26,669::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
MainThread::WARNING::2016-01-21 13:41:26,669::hosted_engine::483::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
MainThread::ERROR::2016-01-21 13:41:26,669::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::INFO::2016-01-21 13:41:29,701::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::INFO::2016-01-21 13:41:29,719::hosted_engine::788::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Failed to stop monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=04cae086-8504-443f-a8bd-9a747d19bcee'
MainThread::INFO::2016-01-21 13:41:29,720::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
Vdsm is reporting it twice:

[root@alma04 ~]# vdsClient -s 0 getStorageDomainsList
04cae086-8504-443f-a8bd-9a747d19bcee
04cae086-8504-443f-a8bd-9a747d19bcee
f57d8bc4-cbb5-4d85-a88c-bf163e79e88b

Thread-37117::INFO::2016-01-21 16:51:02,936::logUtils::51::dispatcher::(wrapper) Run and protect: getStorageDomainsList, Return response: {'domlist': [u'04cae086-8504-443f-a8bd-9a747d19bcee', u'04cae086-8504-443f-a8bd-9a747d19bcee', u'f57d8bc4-cbb5-4d85-a88c-bf163e79e88b']}
Thread-37117::DEBUG::2016-01-21 16:51:02,936::task::1191::Storage.TaskManager.Task::(prepare) Task=`5f9f7f50-9f12-4e0e-8eb3-b2b9f9464251`::finished: {'domlist': [u'04cae086-8504-443f-a8bd-9a747d19bcee', u'04cae086-8504-443f-a8bd-9a747d19bcee', u'f57d8bc4-cbb5-4d85-a88c-bf163e79e88b']}
Thread-37117::DEBUG::2016-01-21 16:51:02,937::task::595::Storage.TaskManager.Task::(_updateState) Task=`5f9f7f50-9f12-4e0e-8eb3-b2b9f9464251`::moving from state preparing -> state finished

The failure reported here happens when ovirt-ha-agent calls startMonitoringDomain on the hosted-engine storage domain; the call fails because it fails in VDSM:

Thread-33573::INFO::2016-01-21 14:01:36,060::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 04cae086-8504-443f-a8bd-9a747d19bcee (id: 2)
Thread-33573::ERROR::2016-01-21 14:01:36,061::monitor::387::Storage.Monitor::(_acquireHostId) Error acquiring host id 2 for domain 04cae086-8504-443f-a8bd-9a747d19bcee
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 384, in _acquireHostId
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 533, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 234, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'04cae086-8504-443f-a8bd-9a747d19bcee', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

It fails in VDSM because the two mount points point to the same storage domain and sanlock is already holding the lockspace via the other mount point:

Jan 21 16:37:34 alma04 sanlock[887]: 2016-01-21 16:37:34+0200 24005 [894]: add_lockspace 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0 conflicts with name of list1 s4 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0

[root@alma04 logs]# sanlock client status
daemon 96533e8b-143b-49c8-a9fd-1fdf717d605c.alma04.qa.
p -1 helper
p -1 listener
p -1 status
p 2851
s f57d8bc4-cbb5-4d85-a88c-bf163e79e88b:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2/f57d8bc4-cbb5-4d85-a88c-bf163e79e88b/dom_md/ids:0
s 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0

So the issue in ovirt-ha-agent comes down to the fact that vdsm, for some strange reason, duplicated the hosted-engine storage domain.
Looks more like a vdsm storage issue then.
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1269768 ?
Probably the duplication happened here:

jsonrpc.Executor/5::INFO::2016-01-21 13:26:41,387::logUtils::48::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID=u'00000001-0001-0001-0001-0000000003d7', conList=[{u'id': u'5333df6d-11c1-4cd3-96ea-b76e04c8f569', u'connection': u'10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2_SD1_for_guests_nsednev_he_2', u'iqn': u'', u'user': u'', u'protocol_version': u'3', u'tpgt': u'1', u'password': '********', u'port': u''}, {u'id': u'8a0053ae-7645-427d-aa05-6f4b4fea856c', u'connection': u'10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}], options=None)
jsonrpc.Executor/5::DEBUG::2016-01-21 13:26:41,388::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/umount -f -l /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2 (cwd None)
jsonrpc.Executor/5::DEBUG::2016-01-21 13:26:41,418::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/umount -f -l /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2 (cwd None)
jsonrpc.Executor/5::ERROR::2016-01-21 13:26:41,438::hsm::2557::Storage.HSM::(disconnectStorageServer) Could not disconnect from storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2553, in disconnectStorageServer
    conObj.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 447, in disconnect
    return self._mountCon.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 256, in disconnect
    self._mount.umount(True, True)
  File "/usr/share/vdsm/storage/mount.py", line 256, in umount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';umount: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2: not mounted\n')

Thread-33541::DEBUG::2016-01-21 13:26:50,544::task::595::Storage.TaskManager.Task::(_updateState) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::moving from state init -> state preparing
Thread-33541::INFO::2016-01-21 13:26:50,544::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 3, 'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/', 'user': 'kvm', 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}], options=None)
Thread-33541::DEBUG::2016-01-21 13:26:50,547::hsm::2413::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
Thread-33541::DEBUG::2016-01-21 13:26:50,548::hsm::2437::Storage.HSM::(__prefetchDomains) Found SD uuids: (u'04cae086-8504-443f-a8bd-9a747d19bcee',)
Thread-33541::DEBUG::2016-01-21 13:26:50,548::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {f57d8bc4-cbb5-4d85-a88c-bf163e79e88b: storage.nfsSD.findDomain, 04cae086-8504-443f-a8bd-9a747d19bcee: storage.nfsSD.findDomain}
Thread-33541::INFO::2016-01-21 13:26:50,549::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}]}
Thread-33541::DEBUG::2016-01-21 13:26:50,549::task::1191::Storage.TaskManager.Task::(prepare) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::finished: {'statuslist': [{'status': 0, 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}]}
Thread-33541::DEBUG::2016-01-21 13:26:50,549::task::595::Storage.TaskManager.Task::(_updateState) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::moving from state preparing -> state finished
Seems like the problem is in the auto-import flow. Daniel, please look into this with the owner of the flow.
(In reply to Tal Nisan from comment #10)
> Seems like the problem is in the auto-import flow. Daniel, please look
> into this with the owner of the flow.

Daniel, Tal - the auto-import takes an unattached, but monitored, connected domain, adds it into the DC and activates it. The scenario brought here is after all that is done, when a new host is trying to take over.

I don't know how it ended up twice in vdsm's internal structures, but I do notice the subtle difference between the mount points:

drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_

Note the extra "_" after the final 2 in the second mount point.

I think Nir is the right person to tell what could go wrong here and what this behaviour with the extra "_" means.
Seen also with a GlusterFS storage domain.

[root@kvm2 ~]# vdsClient -s 0 getStorageDomainsList
2d0b57a7-8fe7-445d-af41-0c66c99ba962
0b4c94b1-aa8e-4415-b6f2-79c12ee9c4ed
6d50e33a-8f7c-4d75-a64b-07db36507b37
7149b31d-430b-4d91-a8a3-ae1024dfd0a9
7149b31d-430b-4d91-a8a3-ae1024dfd0a9

[root@kvm2 ~]# ls -l /rhev/data-center/mnt/glusterSD/
total 6
drwxr-xr-x 2 vdsm kvm 6 25 gen 02.06 kvm1.brightsquid.com:_data
drwxr-xr-x 6 vdsm kvm 120 22 gen 21.38 kvm1.brightsquid.com:_engine
drwxr-xr-x 6 vdsm kvm 120 22 gen 21.38 kvm1:_engine
drwxr-xr-x 6 vdsm kvm 4096 25 gen 22.57 kvm1:_local-kvm1-vol1raid1
drwxr-xr-x 6 vdsm kvm 120 25 gen 22.13 kvm1:_vmos1
drwxr-xr-x 2 vdsm kvm 6 22 gen 22.35 kvm2.brightsquid.com:_engine
drwxr-xr-x 2 vdsm kvm 6 19 gen 19.38 kvm2:_engine
drwxr-xr-x 6 vdsm kvm 120 25 gen 22.14 kvm2:_vmos2

This case is probably a bit more tricky: the user was trying to use hosted-engine on a Gluster SD in a hyper-converged fashion and, in order to avoid a SPOF, deployed hosted-engine identifying the mount point as kvm1:/engine on the first host and kvm2:/engine on the second. VDSM didn't recognize that they are the same SD when the auto-import procedure mounted it the other way around.
(In reply to Simone Tiraboschi from comment #12)
> Seen also with a GlusterFS storage domain.
> [root@kvm2 ~]# ls -l /rhev/data-center/mnt/glusterSD/
> total 6
> drwxr-xr-x 2 vdsm kvm 6 25 gen 02.06 kvm1.brightsquid.com:_data
> drwxr-xr-x 6 vdsm kvm 120 22 gen 21.38 kvm1.brightsquid.com:_engine
> drwxr-xr-x 6 vdsm kvm 120 22 gen 21.38 kvm1:_engine
> drwxr-xr-x 6 vdsm kvm 4096 25 gen 22.57 kvm1:_local-kvm1-vol1raid1
> drwxr-xr-x 6 vdsm kvm 120 25 gen 22.13 kvm1:_vmos1
> drwxr-xr-x 2 vdsm kvm 6 22 gen 22.35 kvm2.brightsquid.com:_engine
> drwxr-xr-x 2 vdsm kvm 6 19 gen 19.38 kvm2:_engine
> drwxr-xr-x 6 vdsm kvm 120 25 gen 22.14 kvm2:_vmos2
>
> This case is probably a bit more tricky: the user was trying to use
> hosted-engine on a Gluster SD in a hyper-converged fashion and, in order
> to avoid a SPOF, deployed hosted-engine identifying the mount point as
> kvm1:/engine on the first host and kvm2:/engine on the second. VDSM
> didn't recognize that they are the same SD when the auto-import procedure
> mounted it the other way around.

Of course vdsm did not recognize that they are the same: from vdsm's point of view, the mounts are different if the remote path is different. This means it is a hosted-engine or user issue; vdsm cannot help you in any way here.
(In reply to Nir Soffer from comment #13)
> Of course vdsm did not recognize that they are the same: from vdsm's point
> of view, the mounts are different if the remote path is different. This
> means it is a hosted-engine or user issue; vdsm cannot help you in any way
> here.

Not even if we see the same SD UUID?

If I just call getStorageDomainInfo(7149b31d-430b-4d91-a8a3-ae1024dfd0a9), which of the two will I get info about?
(In reply to Simone Tiraboschi from comment #14)
> Not even if we see the same SD UUID?
>
> If I just call getStorageDomainInfo(7149b31d-430b-4d91-a8a3-ae1024dfd0a9),
> which of the two will I get info about?

You will probably get the first one found when the domain is searched.

We don't support multiple server/remote_path values for the same domain on the same host. The mount details are set in the engine and used to mount the nfs/gluster storage on the hosts. To modify a domain's server/remote_path, you have to put that domain into maintenance, modify it, and activate it again, so there is no way to get the same domain mounted with a different server/remote_path when using the engine.

If hosted-engine setup allows this, it should be fixed to disallow it. As usual, the setup should use the vdsm APIs as they are used by the engine.
(In reply to Nikolai Sednev from comment #4)
> SD of HE-VM seen twice from both hosts:
> ls -l /rhev/data-center/mnt/
> total 20
> drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
> drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
> drwxrwxrwx. 4 root root 4096 Jan 20 20:07

Looking at the vdsm logs, I find only connections for:

vdsm.log.3.xz:Thread-10856::INFO::2016-01-21 13:00:17,992::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 3, 'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/', 'user': 'kvm', 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}], options=None)

This connection will be mounted at:
/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_

It looks like hosted-engine setup or the engine is using
'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
instead of:
'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2'

Note the trailing slash, which is escaped to "_" by the mount code when creating the local mountpoint directory. Vdsm treats "path/to/directory" and "path/to/directory/" as different paths, and so should the client calling connectStorageServer.

Simone, please check where '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/' is coming from - if this is user input, you must normalize it like the engine does, because vdsm will not normalize it for you.
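[Editorial note] A minimal Python sketch of this escaping, inferred only from the directory names in this report (the helper name is hypothetical; vdsm's actual mount code may differ):

import os

MNT_ROOT = '/rhev/data-center/mnt'

def local_mount_dir(remote_path):
    # Inferred escaping: pre-existing '_' are doubled, then every '/'
    # (including a trailing one) becomes a single '_'.
    escaped = remote_path.replace('_', '__').replace('/', '_')
    return os.path.join(MNT_ROOT, escaped)

# Without the trailing slash:
# .../mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
print(local_mount_dir('10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2'))

# With the trailing slash -- note the extra trailing '_':
# .../mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
print(local_mount_dir('10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'))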
(In reply to Nir Soffer from comment #16)
> Simone, please check where
> '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
> is coming from - if this is user input, you must normalize it like the
> engine does, because vdsm will not normalize it for you.

You are right. In /etc/ovirt-hosted-engine/answer.conf I see:
OVEHOSTED_STORAGE/storageDomainConnection=str:10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/

and hosted-engine-setup accepted it; indeed, under /etc/ovirt-hosted-engine/hosted-engine.conf we have:
storage=10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/

and ovirt-ha-agent simply uses it without removing the trailing slash. The issue is that the engine then imports the hosted-engine storage domain and (probably) normalizes the string, so VDSM mounts it twice under different paths.

Normalizing on new deployments is obviously easy; it's not that easy to fix for upgrades, since in the past we were accepting it and the upgrade procedure doesn't disconnect the connected storage servers.
(In reply to Simone Tiraboschi from comment #17)
> You are right. In /etc/ovirt-hosted-engine/answer.conf I see:
> OVEHOSTED_STORAGE/storageDomainConnection=str:10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
>
> and hosted-engine-setup accepted it; indeed, under
> /etc/ovirt-hosted-engine/hosted-engine.conf we have:
> storage=10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
>
> and ovirt-ha-agent simply uses it without removing the trailing slash. The
> issue is that the engine then imports the hosted-engine storage domain and
> (probably) normalizes the string, so VDSM mounts it twice under different
> paths.
>
> Normalizing on new deployments is obviously easy; it's not that easy to
> fix for upgrades, since in the past we were accepting it and the upgrade
> procedure doesn't disconnect the connected storage servers.

So before the upgrade, the setup was using "/path/" (with a trailing slash), and after the domain was imported, the engine tried to connect to "/path" (without the trailing slash).

We should check why the engine is using the path without the slash. When importing a domain, the engine should use exactly the same path reported by vdsm.
@Nir Soffer,
Regarding comment #18: there was no upgrade at all; this was a clean HE deployment over an NFS SD.
(In reply to Nikolai Sednev from comment #19)
> @Nir Soffer,
> Regarding comment #18: there was no upgrade at all; this was a clean HE
> deployment over an NFS SD.

I assumed an upgrade since you called your NFS export:
/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/

The issue is that hosted-engine-setup accepts the path with the trailing slash while the engine probably removes it on auto-import, and VDSM treats the two differently. This happens on new installs and on upgrades. The extra problem on upgrades is that we were accepting the path with the trailing '/' in the past as well, and ovirt-ha-agent does not disconnect that storage server during the upgrade.
Ok, let's try to fix it in hosted-engine-setup and the oVirt HA agent:

1. hosted-engine-setup, for new deployments: remove the trailing slash and alert the user.
2. ovirt-ha-agent: fix the code that calls connectStorageServer to ignore the trailing slash. This is probably not enough by itself, since simply restarting ovirt-ha-agent currently doesn't call disconnectStorageServer, so the storage server stays connected on the old path as well and we are going to hit the bug described here. Simply calling disconnectStorageServer is not enough either, since sanlock can hold a lock there and VDSM could already be monitoring it.

The simpler, but really ugly, option: if we discover that the NFS path has a trailing slash and the storage domain is already mounted on a path with a trailing '_', we refuse to start and ask the user to reboot (see the sketch below).

Another option is to call stopMonitoringDomain if we are monitoring, remove the sanlock lock, disconnect the storage domain, and reconnect it.

Nir, what do you suggest?
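[Editorial note] A rough sketch of the detection step in the "simpler, but really ugly" option, assuming the escaping behaviour inferred earlier in this bug; all names here are hypothetical, not actual ovirt-ha-agent code:

import os

MNT_ROOT = '/rhev/data-center/mnt'

def stale_trailing_slash_mount(remote_path):
    """Return the old mountpoint (ending with '_') if the configured path
    has a trailing slash and the domain is still mounted there."""
    if not remote_path.endswith('/'):
        return None
    # same escaping as inferred above: '_' doubled, '/' -> '_';
    # the trailing '/' becomes the trailing '_'
    stale = os.path.join(MNT_ROOT,
                         remote_path.replace('_', '__').replace('/', '_'))
    with open('/proc/mounts') as f:
        mountpoints = set(line.split()[1] for line in f)
    return stale if stale in mountpoints else None

# If this returns a path, the agent would refuse to start and ask the
# user to reboot the host.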
Created attachment 1118808 [details]
Engine blocking invalid path with trailing slash

Simone, the engine blocks paths with a trailing slash, so you can use the same behavior, or normalize the path automatically. Whatever solution you choose, the mount path must never contain a trailing slash.
Reproduced.

I deployed over '192.168.1.115:/Virtual/exthe7/' and the mount path is indeed '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/':

[root@c72het20160127h1 ~]# ls /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/
aec5376e-18dd-4628-bb3d-dcfd52f1b7b0  __DIRECT_IO_TEST__

but VDSM reports back just '192.168.1.115:/Virtual/exthe7':

[root@c72het20160127h1 ~]# vdsClient -s 0 getStorageDomainInfo aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
	uuid = aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
	version = 3
	role = Regular
	remotePath = 192.168.1.115:/Virtual/exthe7
	type = NFS
	class = Data
	pool = ['a8aecb62-6f9a-43a6-9080-0adbdaea9b77']
	name = hosted_storage

Probably this has to be fixed too.
Thread-25::DEBUG::2016-01-27 19:17:40,374::task::595::Storage.TaskManager.Task::(_updateState) Task=`92756df2-d3fe-40c8-8a73-436ad7a5c32c`::moving from state init -> state preparing
Thread-25::INFO::2016-01-27 19:17:40,374::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 4, 'connection': '192.168.1.115:/Virtual/exthe7/', 'user': 'kvm', 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}], options=None)
Thread-25::DEBUG::2016-01-27 19:17:40,376::fileUtils::143::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_ mode: None
Thread-25::DEBUG::2016-01-27 19:17:40,376::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=4 192.168.1.115:/Virtual/exthe7 /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_ (cwd None)
Thread-25::DEBUG::2016-01-27 19:17:40,405::__init__::320::IOProcessClient::(_run) Starting IOProcess...
Thread-25::DEBUG::2016-01-27 19:17:40,418::hsm::2413::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_
Thread-25::DEBUG::2016-01-27 19:17:40,419::hsm::2437::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
Thread-25::DEBUG::2016-01-27 19:17:40,419::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {}
Thread-25::INFO::2016-01-27 19:17:40,419::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}]}
Thread-25::DEBUG::2016-01-27 19:17:40,420::task::1191::Storage.TaskManager.Task::(prepare) Task=`92756df2-d3fe-40c8-8a73-436ad7a5c32c`::finished: {'statuslist': [{'status': 0, 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}]}
(In reply to Simone Tiraboschi from comment #23)
> Reproduced.
>
> I deployed over '192.168.1.115:/Virtual/exthe7/' and the mount path is
> indeed '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/':
>
> [root@c72het20160127h1 ~]# ls /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/
> aec5376e-18dd-4628-bb3d-dcfd52f1b7b0  __DIRECT_IO_TEST__
>
> but VDSM reports back just '192.168.1.115:/Virtual/exthe7':
>
> [root@c72het20160127h1 ~]# vdsClient -s 0 getStorageDomainInfo aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
> 	uuid = aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
> 	version = 3
> 	role = Regular
> 	remotePath = 192.168.1.115:/Virtual/exthe7
> 	type = NFS
> 	class = Data
> 	pool = ['a8aecb62-6f9a-43a6-9080-0adbdaea9b77']
> 	name = hosted_storage
>
> Probably this has to be fixed too.

I don't want to try to fix VDSM; that is too risky, as it may have an impact on something else. Moreover, since the engine relies on the output of the above command, it seems that a trailing '/' is something the system forbids (or drops). The engine won't let you add a storage domain from the UI with a trailing '/' either, so that is consistent.

As agreed yesterday, let's see if we are covered by:
1. Fresh install - sanitize the input path parameter - remove the trailing '/'.
2. Upgrade - while in maintenance, update the config file to remove the slash.
2.1 On ovirt-ha-agent service stop, disconnect from the SD; the next start will connect anyway.
2.2 Alternative to 2.1 - reboot the host; the mount points will be regenerated without the trailing '_'.
2.1 can be risky since we need to be sure that VDSM is not monitoring the domain and we have no active sanlock locks there.
Guys, we can fix it on the hosted-engine side. But just because you always get the path with no slash from the engine does not mean VDSM should skip input data validation. That is a bug by itself. VDSM should be fixed too, to either refuse a value like that or normalize the path by itself. Any other option is just a workaround and a new issue waiting to happen.
(In reply to Martin Sivák from comment #27)

Vdsm returning:
    remotePath = 192.168.1.115:/Virtual/exthe7
when the mount path was:
    192.168.1.115:/Virtual/exthe7/
is not a bug. Vdsm normalizes the mount path but does not keep this info. When checking if a remote path is mounted, we compare the normalized path with the mount info from /proc/mounts and /etc/mtab, so we cannot restore the original unnormalized path.

We could fix vdsm to normalize the local mount path before escaping slashes, but this would break old hosted-engine installations if vdsm is upgraded without disconnecting the hosted-engine domain, since the domain would be considered not mounted. When checking if a storage domain is mounted, we compare the remote and local paths with the information from /proc/mounts and /etc/mtab. If the local path is different (because we normalized before escaping), the compare will fail, leading to the same issue as in this bug: 2 mounts for the same remote path.

The only way to avoid this issue is to disconnect the domain and connect it again with a normalized path. So we cannot avoid fixing hosted engine. It should use os.path.normpath to normalize user input.

I sent a fix for vdsm; it would be nice if you test it with hosted engine.
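[Editorial note] A minimal sketch of the os.path.normpath normalization suggested here, applied to a hosted-engine 'server:/path' connection string (the helper name is hypothetical):

import os.path

def normalize_connection(connection):
    # Split 'server:/remote/path' on the first colon and normalize only
    # the path part; os.path.normpath drops trailing slashes and
    # collapses '//' and '/./'.
    server, sep, path = connection.partition(':')
    if not sep or not path.startswith('/'):
        return connection  # not a server:/path string, leave untouched
    return server + ':' + os.path.normpath(path)

assert normalize_connection(
    '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
) == '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2'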
Yep, we will deal with the hosted-engine side; Simone already has a patch for that. Thanks for the vdsm update.
It seems that we have another issue with VDSM here.

Now, with https://gerrit.ovirt.org/#/c/52836/ , ovirt-ha-agent correctly fixes the path, disconnects the previously connected storage server and reconnects it under the right path via connectStorageServer. So we have a single mount point, and '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_' becomes '/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe7' (without the trailing underscore!) as we need.

[root@c72het20160127h1 ~]# mount | grep rhev
192.168.1.115:/Virtual/exthe7 on /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7 type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=192.168.1.118,local_lock=none,addr=192.168.1.115)
[root@c72het20160127h1 ~]#

The issue is that the links re-created by prepareImage under /var/run/vdsm/storage/ still point to '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_' (with the trailing underscore!), which is not mounted anymore.

[root@c72het20160127h1 ~]# ls -l /rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/ha_agent/hosted-engine.metadata
lrwxrwxrwx. 1 vdsm kvm 132  1 feb 16.03 /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee
[root@c72het20160127h1 ~]# ls -l /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
lrwxrwxrwx. 1 vdsm kvm 133  1 feb 15.47 /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5 -> /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
[root@c72het20160127h1 ~]# ls -l /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
ls: cannot access /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5: No such file or directory

Here are the prepareImage logs:

Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::595::Storage.TaskManager.Task::(_updateState) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::moving from state init -> state preparing
Thread-2006::INFO::2016-02-01 16:06:04,076::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5', leafUUID='5924bae3-ee7d-4493-9081-c91f42b6c3ee')
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0`ReqID=`ae5a6689-6e5d-4199-81bc-651c9f84f87c`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3213' at 'prepareImage'
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0' for lock type 'shared'
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::601::Storage.ResourceManager::(registerResource) Resource 'Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0' is free. Now locking as 'shared' (1 active user)
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0`ReqID=`ae5a6689-6e5d-4199-81bc-651c9f84f87c`::Granted request
Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::_resourcesAcquired: Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 (shared)
Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::993::Storage.TaskManager.Task::(_decref) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::ref 1 aborting False
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::538::Storage.StorageDomain::(activateVolumes) Fixing permissions on /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileUtils::143::Storage.fileUtils::(createdir) Creating directory: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 mode: None
Thread-2006::WARNING::2016-02-01 16:06:04,078::fileUtils::152::Storage.fileUtils::(createdir) Dir /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 already exists
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::513::Storage.StorageDomain::(createImageLinks) Creating symlink from /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5 to /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::518::Storage.StorageDomain::(createImageLinks) img run dir already exists: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
Thread-2006::DEBUG::2016-02-01 16:06:04,079::fileVolume::535::Storage.Volume::(validateVolumePath) validate path for 5924bae3-ee7d-4493-9081-c91f42b6c3ee
Thread-2006::INFO::2016-02-01 16:06:04,080::logUtils::51::dispatcher::(wrapper) Run and protect: prepareImage, Return response: {'info': {'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}, 'path': u'/var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'imgVolumesInfo': [{'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}]}
Thread-2006::DEBUG::2016-02-01 16:06:04,080::task::1191::Storage.TaskManager.Task::(prepare) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::finished: {'info': {'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}, 'path': u'/var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'imgVolumesInfo': [{'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}]}
Thread-2006::DEBUG::2016-02-01 16:06:04,080::task::595::Storage.TaskManager.Task::(_updateState) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::moving from state preparing -> state finished

The issue could be with:

Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::518::Storage.StorageDomain::(createImageLinks) img run dir already exists: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5

Nir, is createImageLinks going to recreate the symlinks if they are still there from a previous attempt? Do we need to do something special to clean up the links before calling prepareImage? A host reboot cleans up /var/run and the link gets recreated pointing to the right path.
OK, the issue really seems to be in createImageLinks: if the link already exists, it doesn't recreate it and doesn't check whether it points to the required path.
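[Editorial note] A minimal sketch of the kind of refresh the fix would need - validate an existing link instead of trusting it (hypothetical helper, not the actual vdsm patch):

import os

def ensure_image_link(target, link_path):
    # If a symlink is already there, keep it only when it points to the
    # expected target; otherwise drop the stale link and recreate it.
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return
        os.unlink(link_path)  # stale link, e.g. to the old '_' mountpoint
    os.symlink(target, link_path)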
Simone, is a reboot still required after all the patches are merged?
(In reply to Sandro Bonazzola from comment #32)
> Simone, is a reboot still required after all the patches are merged?

The issue is with VDSM's createImageLinks, which doesn't refresh the symlinks under /var/run/vdsm/storage if they already exist. Patch http://gerrit.ovirt.org/52937 tries to address that; if we are not going to merge it, we need to ask the user to reboot the host (/var/run is on tmpfs, so it is cleared on reboot).
Nir, please provide directions on how to solve this.
Nir, can you please perform the cleanup of /var/run/... (which makes sense anyway) in the deactivate verb?
(In reply to Roy Golan from comment #35)
> Nir, can you please perform the cleanup of /var/run/... (which makes sense
> anyway) in the deactivate verb?

I'm testing this: https://gerrit.ovirt.org/#/c/53298/
(In reply to Simone Tiraboschi from comment #36)
> (In reply to Roy Golan from comment #35)
> > Nir, can you please perform the cleanup of /var/run/... (which makes
> > sense anyway) in the deactivate verb?
>
> I'm testing this: https://gerrit.ovirt.org/#/c/53298/

Thanks Nir, this means we have a simple vdsm-supported way of cleaning up using deactivate.
Moving from blocker to exception, since a host reboot will solve it and the issue happens only on NFS, if the path has a trailing '/', in a 3.5 -> 3.6 upgrade scenario.
Removing stale needinfo.
If this is not in by tomorrow, please add release notes on the known issue and workaround ASAP.
So, 52937 is just the latest workaround on the VDSM side. We could consider the bug fixed even without it.
oVirt 3.6.3 has been released, closing UPSTREAM.