Bug 1300749 - HE-SD mounted twice on host and ovirt-ha-agent.service dead
Summary: HE-SD mounted twice on host and ovirt-ha-agent.service dead
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: Plugins.NFS
Version: 1.3.2.3
Hardware: x86_64
OS: Linux
unspecified
high vote
Target Milestone: ovirt-3.6.4
: 1.3.3.4
Assignee: Nir Soffer
QA Contact: Nikolai Sednev
URL: https://drive.google.com/a/redhat.com...
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-01-21 15:45 UTC by Nikolai Sednev
Modified: 2016-08-09 08:20 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-09 21:41:42 UTC
oVirt Team: Storage
rule-engine: ovirt-3.6.z+
rule-engine: exception+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments (Terms of Use)
sosreport from HE-host alma02 (12.94 MB, application/x-xz)
2016-01-21 15:48 UTC, Nikolai Sednev
no flags Details
sosreport from the engine (7.46 MB, application/x-xz)
2016-01-21 15:52 UTC, Nikolai Sednev
no flags Details
Engine blocking invalid path with trailing slash (28.68 KB, image/png)
2016-01-27 14:30 UTC, Nir Soffer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 52820 0 master MERGED packaging: setup: validating export paths as VDSM does 2020-07-23 06:30:17 UTC
oVirt gerrit 52836 0 master MERGED storage: validating NFS/gluster paths 2020-07-23 06:30:17 UTC
oVirt gerrit 52873 0 master MERGED storageServer: Normalize local path before escaping 2020-07-23 06:30:17 UTC
oVirt gerrit 52937 0 master ABANDONED createImageLinks: recreate if the link exists but it's broken 2020-07-23 06:30:17 UTC
oVirt gerrit 53269 0 ovirt-hosted-engine-ha-1.3 MERGED storage: validating NFS/gluster paths 2020-07-23 06:30:17 UTC
oVirt gerrit 53270 0 ovirt-hosted-engine-setup-1.3 MERGED packaging: setup: validating export paths as VDSM does 2020-07-23 06:30:16 UTC
oVirt gerrit 53276 0 ovirt-3.6 MERGED storageServer: Normalize local path before escaping 2020-07-23 06:30:16 UTC
oVirt gerrit 53277 0 ovirt-3.6.3 MERGED storageServer: Normalize local path before escaping 2020-07-23 06:30:16 UTC
oVirt gerrit 53298 0 master MERGED hsm: Remove image run links in teardownImage 2020-07-23 06:30:16 UTC
oVirt gerrit 53357 0 master MERGED storage: validating NFS/gluster paths 2020-07-23 06:30:16 UTC
oVirt gerrit 53559 0 ovirt-3.6 MERGED hsm: Remove image run links in teardownImage 2020-07-23 06:30:15 UTC
oVirt gerrit 53560 0 ovirt-3.6.3 MERGED hsm: Remove image run links in teardownImage 2020-07-23 06:30:16 UTC
oVirt gerrit 53842 0 ovirt-hosted-engine-ha-1.3 MERGED storage: validating NFS/gluster paths 2020-07-23 06:30:15 UTC

Description Nikolai Sednev 2016-01-21 15:45:42 UTC
Description of problem:
HE-SD mounted twice on host and ovirt-ha-agent.service dead.

# systemctl status ovirt-ha-agent.service -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2016-01-21 16:18:08 IST; 1h 20min ago
  Process: 120555 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon (code=exited, status=0/SUCCESS)
 Main PID: 120555 (code=exited, status=0/SUCCESS)

Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: Traceback (most recent call last):
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: self._initialize_domain_monitor()
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: raise Exception(msg)
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
Jan 21 16:18:06 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Shutting down the agent because of 3 failures in a row!
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:VDSM domain monitor status: PENDING
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Failed to stop monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=04cae086-8504-443f-a8bd-9a747d19bcee'
Jan 21 16:18:08 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[120555]: INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

# ls -l /rhev/data-center/mnt/
total 16
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
drwxrwxrwx. 4 root root 4096 Jan 20 20:07 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2
drwxr-xr-x. 2 vdsm kvm  4096 Jan 20 20:10 _var_lib_ovirt-hosted-engine-setup_tmpC04sVa


Version-Release number of selected component (if applicable):
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.7-1.el7ev.noarch
mom-0.5.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.6.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.18-0.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
Linux version 3.10.0-327.8.1.el7.x86_64 (mockbuild@x86-034.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Jan 11 05:03:18 EST 2016


How reproducible:
100%

Steps to Reproduce:
1.Deploy HE on 2 clean servers over NFS SD, add additional NFS SD for data, so auto-import will import the HE-SD inside the WEBUI. 
2.Set the host with running HE-VM in to the maintenance mode via WEBUI and wait until HE-VM migrated successfully to second host.
3.Activate back via WEBUI the first host.
4.Check that ovirt-ha-agent is not running on first host.

Actual results:
ovirt-ha-agent is not running on first host and host can't provide HA, HE-SD seen twice mounted from first host, which is not running HE-VM.

Expected results:
Once host activated back, HA should be running on it and ha-agent should be running without errors, HE-SD should be seen only once from each host.

Additional info:
sosreports attached from both hosts and the engine.

Comment 1 Nikolai Sednev 2016-01-21 15:48:20 UTC
Created attachment 1116970 [details]
sosreport from HE-host alma02

Comment 2 Nikolai Sednev 2016-01-21 15:52:02 UTC
Created attachment 1116971 [details]
sosreport from the engine

Comment 3 Nikolai Sednev 2016-01-21 15:55:27 UTC
Adding the sosreport from HE-host alma04 (agent died on this host) as external link. https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88aXVIYlM4VHJtWms/view?usp=sharing

Comment 4 Nikolai Sednev 2016-01-21 15:56:45 UTC
SD of HE-VM seen twice from both hosts:
ls -l /rhev/data-center/mnt/
total 20
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
drwxrwxrwx. 4 root root 4096 Jan 20 20:07

Comment 5 Nikolai Sednev 2016-01-21 15:58:57 UTC
MainThread::ERROR::2016-01-21 13:41:26,668::hosted_engine::832::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monit
or) Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
MainThread::WARNING::2016-01-21 13:41:26,669::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Erro
r while monitoring engine: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acqu
isition
MainThread::WARNING::2016-01-21 13:41:26,669::hosted_engine::483::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unex
pected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 454, in start_monitoring
    self._initialize_domain_monitor()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 833, in _initialize_domain_monitor
    raise Exception(msg)
Exception: Failed to start monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee, host_id=2): timeout during domain acquisition
MainThread::ERROR::2016-01-21 13:41:26,669::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Shutting down the agent because of 3 failures in a row!
MainThread::INFO::2016-01-21 13:41:29,701::hosted_engine::862::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: PENDING
MainThread::INFO::2016-01-21 13:41:29,719::hosted_engine::788::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_stop_domain_monitor) Failed to stop monitoring domain (sd_uuid=04cae086-8504-443f-a8bd-9a747d19bcee): Error 900 from stopMonitoringDomain: Storage domain is member of pool: 'domain=04cae086-8504-443f-a8bd-9a747d19bcee'
MainThread::INFO::2016-01-21 13:41:29,720::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

Comment 6 Simone Tiraboschi 2016-01-21 16:09:48 UTC
Vdsm is reporting it twice:

[root@alma04 ~]# vdsClient -s 0 getStorageDomainsList
04cae086-8504-443f-a8bd-9a747d19bcee
04cae086-8504-443f-a8bd-9a747d19bcee
f57d8bc4-cbb5-4d85-a88c-bf163e79e88b

Thread-37117::INFO::2016-01-21 16:51:02,936::logUtils::51::dispatcher::(wrapper) Run and protect: getStorageDomainsList, Return response: {'domlist': [u'04cae086-8504-443f-a8bd-9a747d19bcee', u'04cae086-8504-443f-a8bd-9a747d19bcee', u'f57d8bc4-cbb5-4d85-a88c-bf163e79e88b']}
Thread-37117::DEBUG::2016-01-21 16:51:02,936::task::1191::Storage.TaskManager.Task::(prepare) Task=`5f9f7f50-9f12-4e0e-8eb3-b2b9f9464251`::finished: {'domlist': [u'04cae086-8504-443f-a8bd-9a747d19bcee', u'04cae086-8504-443f-a8bd-9a747d19bcee', u'f57d8bc4-cbb5-4d85-a88c-bf163e79e88b']}
Thread-37117::DEBUG::2016-01-21 16:51:02,937::task::595::Storage.TaskManager.Task::(_updateState) Task=`5f9f7f50-9f12-4e0e-8eb3-b2b9f9464251`::moving from state preparing -> state finished



The issue reported here happens when ovirt-ha-agents call start monitoring domain on the hosted-engine-storage domain and fails cause it's failing in VDSM:
Thread-33573::INFO::2016-01-21 14:01:36,060::clusterlock::219::Storage.SANLock::(acquireHostId) Acquiring host id for domain 04cae086-8504-443f-a8bd-9a747d19bcee (id: 2)
Thread-33573::ERROR::2016-01-21 14:01:36,061::monitor::387::Storage.Monitor::(_acquireHostId) Error acquiring host id 2 for domain 04cae086-8504-443f-a8bd-9a747d19bcee
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 384, in _acquireHostId
    self.domain.acquireHostId(self.hostId, async=True)
  File "/usr/share/vdsm/storage/sd.py", line 533, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 234, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'04cae086-8504-443f-a8bd-9a747d19bcee', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))


but it's failing in VDSM since the two mount points are pointing to the same storage domain and sanlock it's already locking on the other mount point:
Jan 21 16:37:34 alma04 sanlock[887]: 2016-01-21 16:37:34+0200 24005 [894]: add_lockspace 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0 conflicts with name of list1 s4 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0


[root@alma04 logs]# sanlock client status
daemon 96533e8b-143b-49c8-a9fd-1fdf717d605c.alma04.qa.
p -1 helper
p -1 listener
p -1 status
p 2851 
s f57d8bc4-cbb5-4d85-a88c-bf163e79e88b:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2/f57d8bc4-cbb5-4d85-a88c-bf163e79e88b/dom_md/ids:0
s 04cae086-8504-443f-a8bd-9a747d19bcee:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_/04cae086-8504-443f-a8bd-9a747d19bcee/dom_md/ids:0


So the issue on ovirt-ha-agent is related to the fact that vdsm for a strange reason duplicated the hosted-engine storage domain.

Comment 7 Sandro Bonazzola 2016-01-21 16:12:25 UTC
Looks more vdsm storage then.

Comment 8 Nikolai Sednev 2016-01-21 16:19:51 UTC
Might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1269768 ?

Comment 9 Simone Tiraboschi 2016-01-21 16:25:43 UTC
Probably the duplication happened here:

jsonrpc.Executor/5::INFO::2016-01-21 13:26:41,387::logUtils::48::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=1, spUUID=u'00000001-0001-0001-0001-0000000003d7', conList=[{u'id': u'5333df6d-11c1-4cd3-96ea-b76e04c8f569', u'connection': u'10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2_SD1_for_guests_nsednev_he_2', u'iqn': u'', u'user': u'', u'protocol_version': u'3', u'tpgt': u'1', u'password': '********', u'port': u''}, {u'id': u'8a0053ae-7645-427d-aa05-6f4b4fea856c', u'connection': u'10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': '********', u'port': u''}], options=None)
jsonrpc.Executor/5::DEBUG::2016-01-21 13:26:41,388::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/umount -f -l /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2__SD1__for__guests__nsednev__he__2 (cwd None)
jsonrpc.Executor/5::DEBUG::2016-01-21 13:26:41,418::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/umount -f -l /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2 (cwd None)
jsonrpc.Executor/5::ERROR::2016-01-21 13:26:41,438::hsm::2557::Storage.HSM::(disconnectStorageServer) Could not disconnect from storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2553, in disconnectStorageServer
    conObj.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 447, in disconnect
    return self._mountCon.disconnect()
  File "/usr/share/vdsm/storage/storageServer.py", line 256, in disconnect
    self._mount.umount(True, True)
  File "/usr/share/vdsm/storage/mount.py", line 256, in umount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';umount: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2: not mounted\n')



Thread-33541::DEBUG::2016-01-21 13:26:50,544::task::595::Storage.TaskManager.Task::(_updateState) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::moving from state init -> state preparing
Thread-33541::INFO::2016-01-21 13:26:50,544::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 3, 'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/', 'user': 'kvm', 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}], options=None)
Thread-33541::DEBUG::2016-01-21 13:26:50,547::hsm::2413::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
Thread-33541::DEBUG::2016-01-21 13:26:50,548::hsm::2437::Storage.HSM::(__prefetchDomains) Found SD uuids: (u'04cae086-8504-443f-a8bd-9a747d19bcee',)
Thread-33541::DEBUG::2016-01-21 13:26:50,548::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {f57d8bc4-cbb5-4d85-a88c-bf163e79e88b: storage.nfsSD.findDomain, 04cae086-8504-443f-a8bd-9a747d19bcee: storage.nfsSD.findDomain}
Thread-33541::INFO::2016-01-21 13:26:50,549::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}]}
Thread-33541::DEBUG::2016-01-21 13:26:50,549::task::1191::Storage.TaskManager.Task::(prepare) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::finished: {'statuslist': [{'status': 0, 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}]}
Thread-33541::DEBUG::2016-01-21 13:26:50,549::task::595::Storage.TaskManager.Task::(_updateState) Task=`e7eb6901-0989-40c2-95b8-cd39a97c5ede`::moving from state preparing -> state finished

Comment 10 Tal Nisan 2016-01-25 11:56:54 UTC
Seems like the problem is in the auto import flow, Daniel please look into this with the owner of the flow

Comment 11 Roy Golan 2016-01-26 12:53:44 UTC
(In reply to Tal Nisan from comment #10)
> Seems like the problem is in the auto import flow, Daniel please look into
> this with the owner of the flow

Daniel, Tal - 
The auto import will take an unattached, but monitored, connected domain, and add it into the DC + Activate it.

The scenario brought here is after all is done and a new host is trying to take over.

I don't know how it ended up being twice in vdms internal structures but do I notice the delicate difference between the links:

_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_

drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
drwxrwxrwx. 4 root root 4096 Jan 20 18:56 10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
drwxrwxrwx. 4 root root 4096 Jan 20 20:07
                                                                             ^^^
                                                               extra "_" after 2

I think Nir is the right person to tell what could go wrong here and what is this behaviour with the extra "_"?

Comment 12 Simone Tiraboschi 2016-01-26 14:55:17 UTC
Seen also with GlusterFS storage domain.

[root@kvm2 ~]# vdsClient -s 0 getStorageDomainsList
2d0b57a7-8fe7-445d-af41-0c66c99ba962
0b4c94b1-aa8e-4415-b6f2-79c12ee9c4ed
6d50e33a-8f7c-4d75-a64b-07db36507b37
7149b31d-430b-4d91-a8a3-ae1024dfd0a9
7149b31d-430b-4d91-a8a3-ae1024dfd0a9

[root@kvm2 ~]# ls -l /rhev/data-center/mnt/glusterSD/
total 6
drwxr-xr-x 2 vdsm kvm    6 25 gen 02.06 kvm1.brightsquid.com:_data
drwxr-xr-x 6 vdsm kvm  120 22 gen 21.38 kvm1.brightsquid.com:_engine
drwxr-xr-x 6 vdsm kvm  120 22 gen 21.38 kvm1:_engine
drwxr-xr-x 6 vdsm kvm 4096 25 gen 22.57 kvm1:_local-kvm1-vol1raid1
drwxr-xr-x 6 vdsm kvm  120 25 gen 22.13 kvm1:_vmos1
drwxr-xr-x 2 vdsm kvm    6 22 gen 22.35 kvm2.brightsquid.com:_engine
drwxr-xr-x 2 vdsm kvm    6 19 gen 19.38 kvm2:_engine
drwxr-xr-x 6 vdsm kvm  120 25 gen 22.14 kvm2:_vmos2

Here it's probably a bit more tricky since the user was trying to use hosted-engine on GlusterSD in a hyper-converged fashion and in order to avoid a SPOF he deployed hosted-engine identifying the mount point as kvm1:/engine on the first host and kvm2:/engine on the second and VDSM didn't recognized that they are the same SD when the auto-import procedure mounted it on the other way.

Comment 13 Nir Soffer 2016-01-26 15:01:01 UTC
(In reply to Simone Tiraboschi from comment #12)
> Seen also with GlusterFS storage domain.
> [root@kvm2 ~]# ls -l /rhev/data-center/mnt/glusterSD/
> total 6
> drwxr-xr-x 2 vdsm kvm    6 25 gen 02.06 kvm1.brightsquid.com:_data
> drwxr-xr-x 6 vdsm kvm  120 22 gen 21.38 kvm1.brightsquid.com:_engine
> drwxr-xr-x 6 vdsm kvm  120 22 gen 21.38 kvm1:_engine
> drwxr-xr-x 6 vdsm kvm 4096 25 gen 22.57 kvm1:_local-kvm1-vol1raid1
> drwxr-xr-x 6 vdsm kvm  120 25 gen 22.13 kvm1:_vmos1
> drwxr-xr-x 2 vdsm kvm    6 22 gen 22.35 kvm2.brightsquid.com:_engine
> drwxr-xr-x 2 vdsm kvm    6 19 gen 19.38 kvm2:_engine
> drwxr-xr-x 6 vdsm kvm  120 25 gen 22.14 kvm2:_vmos2
> 
> Here it's probably a bit more tricky since the user was trying to use
> hosted-engine on GlusterSD in a hyper-converged fashion and in order to
> avoid a SPOF he deployed hosted-engine identifying the mount point as
> kvm1:/engine on the first host and kvm2:/engine on the second and VDSM
> didn't recognized that they are the same SD when the auto-import procedure
> mounted it on the other way.

Of course vdsm did not recognize they are the same, from vdsm point of view, 
the mounts are different if the remote path is different, which means that this
is a hosted engine or user issue, vdsm cannot help you in any way here.

Comment 14 Simone Tiraboschi 2016-01-26 15:15:02 UTC
(In reply to Nir Soffer from comment #13)
> Of course vdsm did not recognize they are the same, from vdsm point of view, 
> the mounts are different if the remote path is different, which means that
> this
> is a hosted engine or user issue, vdsm cannot help you in any way here.

Neither if we see the same SD UUID?

If I just call getStorageDomainInfo(7149b31d-430b-4d91-a8a3-ae1024dfd0a9) about which of the two I'll get info about?

Comment 15 Nir Soffer 2016-01-26 17:11:15 UTC
(In reply to Simone Tiraboschi from comment #14)
> (In reply to Nir Soffer from comment #13)
> > Of course vdsm did not recognize they are the same, from vdsm point of view, 
> > the mounts are different if the remote path is different, which means that
> > this
> > is a hosted engine or user issue, vdsm cannot help you in any way here.
> 
> Neither if we see the same SD UUID?
> 
> If I just call getStorageDomainInfo(7149b31d-430b-4d91-a8a3-ae1024dfd0a9)
> about which of the two I'll get info about?

You will probably get the first one found when domain is searched. We don't 
support multiple server/remote_path for same domain on same host.

The mount details are set in the engine, and used to mount the the nfs/gluster
on the hosts. To modify domain server/remote_path, you have put that domain
into maintenance, modify and activate it again, so there is no way to get 
same domain mounted with different server/remote_path when using engine.

If hosted-engine setup allow this, it should be fixed to disallow this. As usual,
the setup should use vdsm apis as they are used by engine.

Comment 16 Nir Soffer 2016-01-26 17:49:48 UTC
(In reply to Nikolai Sednev from comment #4)
> SD of HE-VM seen twice from both hosts:
> ls -l /rhev/data-center/mnt/
> total 20
> drwxrwxrwx. 4 root root 4096 Jan 20 18:56
> 10.35.64.11:
> _vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2
> drwxrwxrwx. 4 root root 4096 Jan 20 18:56
> 10.35.64.11:
> _vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_
> drwxrwxrwx. 4 root root 4096 Jan 20 20:07

Looking at the vdsm logs, I find only connections for:

vdsm.log.3.xz:Thread-10856::INFO::2016-01-21 13:00:17,992::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 3, 'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/', 'user': 'kvm', 'id': '64caaf06-2e83-4acc-9f10-e948fd4ef565'}], options=None)

This connection will be mounted at:

    /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__upgrade__he__3__5__6__to__3__5__7__el__7__2_

It looks like hosted engine setup or engine is using

    'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'

Instead of:

    'connection': '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2'

Note the trailing slash, which is escaped to "_" by the mount code, creating the local
mountpoint directory.

Vdsm treats "path/to/directory" and "path/to/directory/" as different paths, and so 
should be the client calling connnectStorageServer.

Simone, please check where '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
is coming from - if this is user input, you must normalize it like engine does,
because vdsm will not normalize it for you.

Comment 17 Simone Tiraboschi 2016-01-27 00:18:34 UTC
(In reply to Nir Soffer from comment #16)
> Simone, please check where
> '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
> is coming from - if this is user input, you must normalize it like engine
> does,
> because vdsm will not normalize it for you.

You are right, from /etc/ovirt-hosted-engine/answer.conf I see:
OVEHOSTED_STORAGE/storageDomainConnection=str:10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
and hosted-engine-setup accepted it, indeed under /etc/ovirt-hosted-engine/hosted-engine.conf we have storage=10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
and ovirt-ha-agent simply uses it without removing the trailing slash.
The issue is that now the engine imports the hosted-engine storage domain, (probably) normalizes the string and so VDSM mounts it twice under different paths.

Normalizing on new deployment is obviously easy; it's not that easy to fix for upgrades since in the past we were accepting it and the upgrade procedure doesn't disconnect the connected storage servers.

Comment 18 Nir Soffer 2016-01-27 00:35:18 UTC
(In reply to Simone Tiraboschi from comment #17)
> (In reply to Nir Soffer from comment #16)
> > Simone, please check where
> > '10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/'
> > is coming from - if this is user input, you must normalize it like engine
> > does,
> > because vdsm will not normalize it for you.
> 
> You are right, from /etc/ovirt-hosted-engine/answer.conf I see:
> OVEHOSTED_STORAGE/storageDomainConnection=str:10.35.64.11:/vol/RHEV/Virt/
> nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
> and hosted-engine-setup accepted it, indeed under
> /etc/ovirt-hosted-engine/hosted-engine.conf we have
> storage=10.35.64.11:/vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/
> and ovirt-ha-agent simply uses it without removing the trailing slash.
> The issue is that now the engine imports the hosted-engine storage domain,
> (probably) normalizes the string and so VDSM mounts it twice under different
> paths.
> 
> Normalizing on new deployment is obviously easy; it's not that easy to fix
> for upgrades since in the past we were accepting it and the upgrade
> procedure doesn't disconnect the connected storage servers.

So before the upgrade, the setup was using "/path/", (with trailing slash) and
after the domain was imported, engine try to connect to "/path" (without trailing
slash).

We should check why engine is using the path without the slash. When importing
a domain, engine should use the exactly the same path reported by vdsm.

Comment 19 Nikolai Sednev 2016-01-27 09:25:18 UTC
@Nir Soffer,
Regarding the comment #18, there was no upgrade at all, there was a clean HE deployment over NFS SD.

Comment 20 Simone Tiraboschi 2016-01-27 09:31:21 UTC
(In reply to Nikolai Sednev from comment #19)
> @Nir Soffer,
> Regarding the comment #18, there was no upgrade at all, there was a clean HE
> deployment over NFS SD.

I assumed that since you called your NFS export: /vol/RHEV/Virt/nsednev_upgrade_he_3_5_6_to_3_5_7_el_7_2/

The issue is that hosted-engine-setup accepts it with the trailing slash while the engine probably removes it on auto-import and VDSM treats them differently.
This happens on new install and on upgrades.

The issue in the upgrade is that we were accepting the path with the trailing / also in the past and ovirt-ha-agent is not disconnecting that storageServer in the upgrade.

Comment 21 Simone Tiraboschi 2016-01-27 13:52:01 UTC
Ok, let's try to fix it in hosted-engine-setup and oVirt HA agent:

1. hosted-engine-setup for new deployment: remove the trailing slash and alert the user
2. ovirt-ha-agent: fix the piece of code that calls connectStorageserver to ignore the trailing slash; this is probably not enough by itself since a simply restart of ovirt-ha-agent currently doesn't call disconnectStorageServer so the storage server is still connected also on the old path and we are going to hit the bug here.
Simply calling disconnectStorageServer is not enough since sanlock can have a lock here and VDSM could be already monitoring it.
The simpler, but really ugly option, is that is we discover that the NFS path has a trailing slash and the storage domain is already mounted on a path with a trailing '_' we refuse to start and we ask the user to reboot.
Another option is to stopMonitoringDomain if monitoring, remove sanlock lock, disconnect the storage domain, reconnect the storage domain.

Nir, what do you suggest?

Comment 22 Nir Soffer 2016-01-27 14:30:16 UTC
Created attachment 1118808 [details]
Engine blocking invalid path with trailing slash

Simone, engine blocks paths with trailing slash, so you can use the same behavior, or normalize the path automatically. Whatever solution you choose, the mount path must never contain a trailing slash.

Comment 23 Simone Tiraboschi 2016-01-27 18:23:48 UTC
Reproduced.

I deployed over '192.168.1.115:/Virtual/exthe7/' and the mount path is indeed 
'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/'

[root@c72het20160127h1 ~]# ls /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/
aec5376e-18dd-4628-bb3d-dcfd52f1b7b0  __DIRECT_IO_TEST__

but VDSM reports back just '192.168.1.115:/Virtual/exthe7'

[root@c72het20160127h1 ~]# vdsClient -s 0 getStorageDomainInfo aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
	uuid = aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
	version = 3
	role = Regular
	remotePath = 192.168.1.115:/Virtual/exthe7
	type = NFS
	class = Data
	pool = ['a8aecb62-6f9a-43a6-9080-0adbdaea9b77']
	name = hosted_storage

Probably this has to be fixed to.

Comment 24 Simone Tiraboschi 2016-01-27 18:42:00 UTC
Thread-25::DEBUG::2016-01-27 19:17:40,374::task::595::Storage.TaskManager.Task::(_updateState) Task=`92756df2-d3fe-40c8-8a73-436ad7a5c32c`::moving from state init -> state preparing
Thread-25::INFO::2016-01-27 19:17:40,374::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 4, 'connection': '192.168.1.115:/Virtual/exthe7/', 'user': 'kvm', 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}], options=None)
Thread-25::DEBUG::2016-01-27 19:17:40,376::fileUtils::143::Storage.fileUtils::(createdir) Creating directory: /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_ mode: None
Thread-25::DEBUG::2016-01-27 19:17:40,376::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=4 192.168.1.115:/Virtual/exthe7 /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_ (cwd None)
Thread-25::DEBUG::2016-01-27 19:17:40,405::__init__::320::IOProcessClient::(_run) Starting IOProcess...
Thread-25::DEBUG::2016-01-27 19:17:40,418::hsm::2413::Storage.HSM::(__prefetchDomains) nfs local path: /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_
Thread-25::DEBUG::2016-01-27 19:17:40,419::hsm::2437::Storage.HSM::(__prefetchDomains) Found SD uuids: ()
Thread-25::DEBUG::2016-01-27 19:17:40,419::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {}
Thread-25::INFO::2016-01-27 19:17:40,419::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}]}
Thread-25::DEBUG::2016-01-27 19:17:40,420::task::1191::Storage.TaskManager.Task::(prepare) Task=`92756df2-d3fe-40c8-8a73-436ad7a5c32c`::finished: {'statuslist': [{'status': 0, 'id': '157bc82b-eeb6-4dc8-987e-3cdc20e6181e'}]}

Comment 25 Roy Golan 2016-01-28 09:03:30 UTC
(In reply to Simone Tiraboschi from comment #23)
> Reproduced.
> 
> I deployed over '192.168.1.115:/Virtual/exthe7/' and the mount path is
> indeed 
> '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/'
> 
> [root@c72het20160127h1 ~]# ls
> /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/
> aec5376e-18dd-4628-bb3d-dcfd52f1b7b0  __DIRECT_IO_TEST__
> 
> but VDSM reports back just '192.168.1.115:/Virtual/exthe7'
> 
> [root@c72het20160127h1 ~]# vdsClient -s 0 getStorageDomainInfo
> aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
> 	uuid = aec5376e-18dd-4628-bb3d-dcfd52f1b7b0
> 	version = 3
> 	role = Regular
> 	remotePath = 192.168.1.115:/Virtual/exthe7
> 	type = NFS
> 	class = Data
> 	pool = ['a8aecb62-6f9a-43a6-9080-0adbdaea9b77']
> 	name = hosted_storage
> 
> Probably this has to be fixed to.

I don't want to try to fix VDSM. That is too risky as that may have impact on something else.

Moreover, since the engine is relying on the output of the above command, it seems that a trailing '/' is something the system forbids (or drops). The engine won't let you add a storage domain from the UI with a trailing '/' as well. So that is consistent.

Like agreed yesterday, lets see if we are covered by:

1. Fresh install - sanitize input param of path - remove trailing '/'
2. Upgrade - while in maintenance, update the config file to remove the slash.
2.1 on ovirt-ha-agent service stop, perform disconnect from sd . The next start will connect anyway
2.2 alternative to 2.1 - reboot host, mount points will be regenerated without '_"

Comment 26 Simone Tiraboschi 2016-01-28 09:27:50 UTC
2.1 can be risky since we need to be sure that VDSM is not monitoring the domain and we have no active sanlock locks there.

Comment 27 Martin Sivák 2016-01-28 15:43:49 UTC
Guys, we can fix it on hosted engine side..

But just because you always get the path with no slash from the engine does not mean VDSM should ignore input data validation. That is a bug by itself. VDSM should be fixed too to either refuse value like that or normalize the path by itself. Any other option is just workaround and a new issue waiting to happen.

Comment 28 Nir Soffer 2016-01-28 22:20:19 UTC
(In reply to Martin Sivák from comment #27)

Vdsm returning:

    remotePath = 192.168.1.115:/Virtual/exthe7

When the mount path was:

    192.168.1.115:/Virtual/exthe7/

Is not a bug. Vdsm normalizes the mount path, but does not keep this info. When
checking if a remote path is mounted, we compare the normalized path with the
mount info from /proc/mounts and /etc/mtab, so we cannot restore the original
unnormalized path.

We can fix vdsm to normalize the local mount path before escaping slashes, but 
this will break old hosted engine installation if vdsm is upgraded without 
disconnecting the hosted engine domain, since the domain will be considered as
not mounted.

When checking if a storage domain is mounted, we compare the remote and local
paths with the information from /proc/mounts and /etc/mtab. If the local path
is different (since we normalize before escaping), the compare will fail, leading
to the same issue in this bug, having 2 mounts for the same remote patch.

The only way to avoid this issue, is to disconnect the domain and connect it
again with normalized path.

So we cannot avoid fixing hosted engine. It should use os.path.normpath to 
normalize user input.

I sent fix for vdsm, it would nice if you test it with hosted engine.

Comment 29 Martin Sivák 2016-01-29 22:24:18 UTC
Yep we will deal with the hosted engine side. Simone already has a patch for that. Thanks for the vdsm update.

Comment 30 Simone Tiraboschi 2016-02-01 15:24:27 UTC
It seams that we have another issue with VDSM here:
Now, with https://gerrit.ovirt.org/#/c/52836/ , ovirt-ha-agent correctly fixes the path, disconnects the previously connected storage server and reconnect it under the right path via connectStorageServer
So we have a single mount point and '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_' becomes '/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe7' (without trailing underscore!) as we need.

[root@c72het20160127h1 ~]# mount | grep rhev
192.168.1.115:/Virtual/exthe7 on /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7 type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,port=0,timeo=600,retrans=6,sec=sys,clientaddr=192.168.1.118,local_lock=none,addr=192.168.1.115)
[root@c72het20160127h1 ~]# 


The issue is that now the links re-created by prepareImage under /var/run/vdsm/storage/ still points to '/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_' (with the trailing underscore!) that is not mounted anymore.


[root@c72het20160127h1 ~]# ls -l /rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/ha_agent/hosted-engine.metadata 
lrwxrwxrwx. 1 vdsm kvm 132  1 feb 16.03 /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee
[root@c72het20160127h1 ~]# ls -l /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
lrwxrwxrwx. 1 vdsm kvm 133  1 feb 15.47 /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5 -> /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
[root@c72het20160127h1 ~]# ls -l /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
ls: cannot access /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7_/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5: No such file or directory

Here prepareImage logs:
Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::595::Storage.TaskManager.Task::(_updateState) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::moving from state init -> state preparing
Thread-2006::INFO::2016-02-01 16:06:04,076::logUtils::48::dispatcher::(wrapper) Run and protect: prepareImage(sdUUID='aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', spUUID='00000000-0000-0000-0000-000000000000', imgUUID='8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5', leafUUID='5924bae3-ee7d-4493-9081-c91f42b6c3ee')
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0`ReqID=`ae5a6689-6e5d-4199-81bc-651c9f84f87c`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '3213' at 'prepareImage'
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0' for lock type 'shared'
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::601::Storage.ResourceManager::(registerResource) Resource 'Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0' is free. Now locking as 'shared' (1 active user)
Thread-2006::DEBUG::2016-02-01 16:06:04,076::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0`ReqID=`ae5a6689-6e5d-4199-81bc-651c9f84f87c`::Granted request
Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::827::Storage.TaskManager.Task::(resourceAcquired) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::_resourcesAcquired: Storage.aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 (shared)
Thread-2006::DEBUG::2016-02-01 16:06:04,076::task::993::Storage.TaskManager.Task::(_decref) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::ref 1 aborting False
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::538::Storage.StorageDomain::(activateVolumes) Fixing permissions on /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileUtils::143::Storage.fileUtils::(createdir) Creating directory: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 mode: None
Thread-2006::WARNING::2016-02-01 16:06:04,078::fileUtils::152::Storage.fileUtils::(createdir) Dir /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0 already exists
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::513::Storage.StorageDomain::(createImageLinks) Creating symlink from /rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5 to /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::518::Storage.StorageDomain::(createImageLinks) img run dir already exists: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5
Thread-2006::DEBUG::2016-02-01 16:06:04,079::fileVolume::535::Storage.Volume::(validateVolumePath) validate path for 5924bae3-ee7d-4493-9081-c91f42b6c3ee
Thread-2006::INFO::2016-02-01 16:06:04,080::logUtils::51::dispatcher::(wrapper) Run and protect: prepareImage, Return response: {'info': {'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}, 'path': u'/var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'imgVolumesInfo': [{'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}]}
Thread-2006::DEBUG::2016-02-01 16:06:04,080::task::1191::Storage.TaskManager.Task::(prepare) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::finished: {'info': {'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}, 'path': u'/var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'imgVolumesInfo': [{'domainID': 'aec5376e-18dd-4628-bb3d-dcfd52f1b7b0', 'volType': 'path', 'leaseOffset': 0, 'path': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'volumeID': u'5924bae3-ee7d-4493-9081-c91f42b6c3ee', 'leasePath': u'/rhev/data-center/mnt/192.168.1.115:_Virtual_exthe7/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/images/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5/5924bae3-ee7d-4493-9081-c91f42b6c3ee.lease', 'imageID': '8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5'}]}
Thread-2006::DEBUG::2016-02-01 16:06:04,080::task::595::Storage.TaskManager.Task::(_updateState) Task=`9b1bd78b-36cc-475c-9ca4-619c21015fe0`::moving from state preparing -> state finished


The issue could be with:
Thread-2006::DEBUG::2016-02-01 16:06:04,078::fileSD::518::Storage.StorageDomain::(createImageLinks) img run dir already exists: /var/run/vdsm/storage/aec5376e-18dd-4628-bb3d-dcfd52f1b7b0/8b01ffc3-126e-4a7c-8c8c-3d309dc65ac5

Nir, is createImageLinks going to recreate the symlinks if they are still there from the previous attempt? Do we need to do something special to cleanup the links before calling prepareImage?

An host reboot cleanups /var/run and the link get recreated pointing to the right path.

Comment 31 Simone Tiraboschi 2016-02-01 16:12:11 UTC
OK, the issue seams really in createImageLinks that, if the link already exists, doesn't recreate and doesn't check if it points to the required path.

Comment 32 Sandro Bonazzola 2016-02-03 08:44:11 UTC
Simone, is a reboot still required after all the patches will be merged?

Comment 33 Simone Tiraboschi 2016-02-03 08:50:20 UTC
(In reply to Sandro Bonazzola from comment #32)
> Simone, is a reboot still required after all the patches will be merged?

The issue is with VDSM createImageLinks that doesn't refresh the symlinks under /var/run/vdsm/storage if they already exist.
Patch http://gerrit.ovirt.org/52937 tries to address that, it we are not going to merge it we need to ask the user to reboot the host (/var/run is on tmpfs and so on reboot...)

Comment 34 Sandro Bonazzola 2016-02-09 12:20:13 UTC
Nir, please provide directions on how to solve this.

Comment 35 Roy Golan 2016-02-12 10:07:06 UTC
Nir can you please perform the cleanup for /var/run/... (which anyway makes sense) in deactivate verb?

Comment 36 Simone Tiraboschi 2016-02-12 10:11:46 UTC
(In reply to Roy Golan from comment #35)
> Nir can you please perform the cleanup for /var/run/... (which anyway makes
> sense) in deactivate verb?

I'm testing this: https://gerrit.ovirt.org/#/c/53298/

Comment 37 Roy Golan 2016-02-15 13:24:47 UTC
(In reply to Simone Tiraboschi from comment #36)
> (In reply to Roy Golan from comment #35)
> > Nir can you please perform the cleanup for /var/run/... (which anyway makes
> > sense) in deactivate verb?
> 
> I'm testing this: https://gerrit.ovirt.org/#/c/53298/

Thanks Nir, this means we have a simple vdsm-supported way of cleaning up using deactivate.

Comment 38 Sandro Bonazzola 2016-02-16 10:50:00 UTC
Moving from blocker to exception since a host reboot will solve and the issue happens only on NFS if path has trailing / on a 3.5 -> 3.6 upgrade scenario.

Comment 39 Nir Soffer 2016-02-20 20:03:50 UTC
Removing stale needinfo.

Comment 40 Yaniv Lavi 2016-02-22 13:35:01 UTC
If this is not in tomorrow please add release notes on known issue and workaround asap.

Comment 41 Allon Mureinik 2016-02-28 14:04:43 UTC
So

Comment 42 Simone Tiraboschi 2016-02-29 14:30:24 UTC
52937 is just the latest workaround on VDSM side.
We could consider the bug as fixed also without that.

Comment 43 Allon Mureinik 2016-03-09 21:41:42 UTC
oVirt 3.6.3 has been released, closing UPSTREAM.


Note You need to log in before you can comment on or make changes to this bug.