Bug 1258465

Summary: Different behavior of connectStorageServer and prepareImage between iSCSI and NFS
Product: Red Hat Enterprise Virtualization Manager
Reporter: Simone Tiraboschi <stirabos>
Component: ovirt-hosted-engine-ha
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.6.0
CC: acanan, ahino, amarchuk, amureini, bazulay, dfediuck, fdeutsch, gklein, lsurette, mgoldboi, nsoffer, sbonazzo, sherold, stirabos, tnisan, ycui, yeylon, ykaul, ylavi
Target Milestone: ovirt-3.6.1
Keywords: Regression
Target Release: 3.6.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-hosted-engine-ha-1.3.3.5-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 07:32:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1247942, 1251752
Attachments:
  connectStorageServer and prepareImage on NFS and on iSCSI (flags: none)
  VDSM logs from 3.5 on iSCSI after a reboot (flags: none)
  HE 1.2.6.1 + 34881 with VDSM 4.17.6 (flags: none)

Description Simone Tiraboschi 2015-08-31 13:00:33 UTC
Description of problem:
connectStorageServer behaves differently between NFS and iSCSI:
on NFS, connectStorageServer makes /rhev/data-center/mnt/ ready, while on iSCSI it does not.

On NFS:
# NFSv4
storageType=1
protocol_version=4
spUUID=00000000-0000-0000-0000-000000000000
storage=192.168.1.115:/Virtual/ext35u36
connectionUUID=a3ee261b-5aea-4754-b80f-b1f3644a7462
vdsClient -s 0 connectStorageServer \
"${storageType}" \
"${spUUID}" \
"connection=${storage},iqn=None,portal=0,user=kvm,password=kvm,id=${connectionUUID},port=,protocol_version=${protocol_version}"

and then /rhev/data-center/mnt/ is ready:
[root@c71heis201508272 ~]# tree /rhev/data-center/mnt/
/rhev/data-center/mnt/
└── 192.168.1.115:_Virtual_ext35u36
    ├── 4ef702d9-71dc-49c1-b2d7-7a995afc26fe
    │   ├── dom_md
    │   │   ├── ids
...


So prepareImage successfully completes:
[root@c71heis201508272 ~]# vdsClient -s 0  prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}"  "${vm_disk_vol_id}"
{'domainID': '4ef702d9-71dc-49c1-b2d7-7a995afc26fe',
 'imageID': '16a05e61-566e-4eaf-8447-871bd4bf3374',
 'leaseOffset': 0,
 'leasePath': '/rhev/data-center/mnt/192.168.1.115:_Virtual_ext35u36/4ef702d9-71dc-49c1-b2d7-7a995afc26fe/images/16a05e61-566e-4eaf-8447-871bd4bf3374/10b1cadf-d49b-4d6d-85b7-c0ec50702c21.lease',
 'path': '/rhev/data-center/mnt/192.168.1.115:_Virtual_ext35u36/4ef702d9-71dc-49c1-b2d7-7a995afc26fe/images/16a05e61-566e-4eaf-8447-871bd4bf3374/10b1cadf-d49b-4d6d-85b7-c0ec50702c21',
 'volType': 'path',
 'volumeID': '10b1cadf-d49b-4d6d-85b7-c0ec50702c21'}

[root@c71heis201508272 ~]# tree /var/run/vdsm/storage
/var/run/vdsm/storage
└── 4ef702d9-71dc-49c1-b2d7-7a995afc26fe
    └── 16a05e61-566e-4eaf-8447-871bd4bf3374 -> /rhev/data-center/mnt/192.168.1.115:_Virtual_ext35u36/4ef702d9-71dc-49c1-b2d7-7a995afc26fe/images/16a05e61-566e-4eaf-8447-871bd4bf3374

2 directories, 0 files


The same no longer works on iSCSI:
# iSCSI
storageType=3
spUUID=00000000-0000-0000-0000-000000000000
connectionUUID=6961c798-c143-4d51-afcb-49a20a2b1ac9
portal=1
user=user
password=password
port=3260
iqn=iqn.2015-03.com.redhat:simone1
storage=192.168.1.125
vdsClient -s 0 connectStorageServer \
"${storageType}" \
"${spUUID}" \
"connection=${storage},iqn=${iqn},portal=${portal},user=${user},password=${password},id=${connectionUUID},port=${port}"

connectStorageServer reports no error:
Thread-40::DEBUG::2015-08-27 17:05:23,777::task::595::Storage.TaskManager.Task::(_updateState) Task=`6de846d8-881a-4c6c-b005-3ba03559ee98`::moving from state init -> state preparing
Thread-40::INFO::2015-08-27 17:05:23,777::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=3, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '6961c798-c143-4d51-afcb-49a20a2b1ac9', 'connection': '192.168.1.125', 'iqn': 'iqn.2015-03.com.redhat:simone1', 'portal': '1', 'user': 'user', 'password': '********', 'port': '3260'}], options=None)
...
Thread-40::INFO::2015-08-27 17:05:25,621::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': '6961c798-c143-4d51-afcb-49a20a2b1ac9'}]}
Thread-40::DEBUG::2015-08-27 17:05:25,621::task::1191::Storage.TaskManager.Task::(prepare) Task=`6de846d8-881a-4c6c-b005-3ba03559ee98`::finished: {'statuslist': [{'status': 0, 'id': '6961c798-c143-4d51-afcb-49a20a2b1ac9'}]}
Thread-40::DEBUG::2015-08-27 17:05:25,621::task::595::Storage.TaskManager.Task::(_updateState) Task=`6de846d8-881a-4c6c-b005-3ba03559ee98`::moving from state preparing -> state finished

but /rhev/data-center/mnt/ does not get populated, and so
sdUUID=60bded7c-0a8a-4baa-a97b-6b4b7fb6132f
vm_disk_id=8a549ab0-6b31-4735-a0a3-969177c7422e
vm_disk_vol_id=b114161e-dfad-4ea0-865d-3fb8beeeb94f
vdsClient -s 0  prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}"  "${vm_disk_vol_id}"

fails with
[root@c71heis201508272 ~]# vdsClient -s 0  prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}"  "${vm_disk_vol_id}"
Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/60bded7c-0a8a-4baa-a97b-6b4b7fb6132f/images/8a549ab0-6b31-4735-a0a3-969177c7422e',)

getStorageDomainInfo says that the iSCSI storage domain is OK:
[root@c71heis201508272 ~]# vdsClient -s 0 getStorageDomainInfo 60bded7c-0a8a-4baa-a97b-6b4b7fb6132f
	uuid = 60bded7c-0a8a-4baa-a97b-6b4b7fb6132f
	vguuid = 2T1p7H-pL3P-erKp-oVr1-GZpw-amjC-KDsX2L
	state = OK
	version = 3
	role = Regular
	type = ISCSI
	class = Data
	pool = []
	name = hosted_storage


Version-Release number of selected component (if applicable):
VDSM 4.17 from the third beta


How reproducible:
100%

Steps to Reproduce:
1. connectStorageServer and prepareImage for an iSCSI domain

Actual results:
It fails with:
Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/60bded7c-0a8a-4baa-a97b-6b4b7fb6132f/images/8a549ab0-6b31-4735-a0a3-969177c7422e',)

Expected results:
It works as it does for NFS


Additional info:
It happens on hosted-engine, where we are not using the SPM.

Comment 1 Simone Tiraboschi 2015-08-31 13:02:07 UTC
Created attachment 1068640 [details]
connectStorageServer and prepareImage on NFS and on iSCSI

connectStorageServer and prepareImage on NFS and on iSCSI

Comment 2 Tal Nisan 2015-09-01 12:48:07 UTC
By design prepareImage should not work when the pool is not up. The only bug here is that prepareImage does work when the pool is not up, and this behavior should be blocked. Reducing severity.

Comment 3 Simone Tiraboschi 2015-09-01 12:58:03 UTC
hosted-engine is not using the storage pool nor the SPM for its own storage domain, which it monitors directly.
But it still needs to be able to call prepareImage.

Comment 4 Sandro Bonazzola 2015-09-04 11:25:15 UTC
(In reply to Tal Nisan from comment #2)
> By design prepareImage should not work when the pool is not up

Then we need another verb that can be called with just a monitored domain.
It worked so far in 3.5, and in 3.6 until a couple of weeks ago.
Raising severity again, since this is a blocker for oVirt 3.6.0 GA

Comment 5 Allon Mureinik 2015-09-06 10:57:15 UTC
(In reply to Sandro Bonazzola from comment #4)
> (In reply to Tal Nisan from comment #2)
> > By design prepareImage should not work when the pool is not up
> 
> Then we need another verb that can be called with just a monitored domain.
> It worked so far in 3.5, and in 3.6 until a couple of weeks ago.
> Raising severity again, since this is a blocker for oVirt 3.6.0 GA
If a pooled verb worked on an inactive pool, that's the bug, not anything else.
In what version did it seem to work? Can you add a log of a successful run?

Comment 6 Simone Tiraboschi 2015-09-07 15:11:25 UTC
Here is the situation on 3.5 with iSCSI when ovirt-ha-agent brings up the system after a reboot:

[root@c7120150907he35is ~]# vdsClient -s 0 getStorageDomainsList
c80e2ec1-a9c2-4952-949d-9a101c200539

[root@c7120150907he35is ~]# vdsClient -s 0 getStorageDomainInfo c80e2ec1-a9c2-4952-949d-9a101c200539
	uuid = c80e2ec1-a9c2-4952-949d-9a101c200539
	vguuid = KVuNlI-Fs35-g7bi-Bdnh-C6bZ-FIsU-1F3f0C
	state = OK
	version = 3
	role = Master
	type = ISCSI
	class = Data
	pool = ['b9208baa-7c5d-4eea-962b-a6c9f188238c']
	name = hosted_storage

On 3.5 the HE storage domain was still attached to a storage pool, but we don't connect the pool:

[root@c7120150907he35is ~]# vdsClient -s 0 getStoragePoolInfo b9208baa-7c5d-4eea-962b-a6c9f188238c
Unknown pool id, pool not connected: ('b9208baa-7c5d-4eea-962b-a6c9f188238c',)

But /rhev/data-center/ got correctly populated:

[root@c7120150907he35is ~]# tree /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/
/rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/
├── dom_md
│   ├── ids -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/ids
│   ├── inbox -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/inbox
│   ├── leases -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/leases
│   ├── master -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/master
│   ├── metadata -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/metadata
│   └── outbox -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/outbox
├── ha_agent
│   ├── hosted-engine.lockspace -> /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/images/b343b1fc-8a9f-40d2-9035-1bf8a3c8cce2/c23d7f7f-b068-4272-ac6f-8d703ad5506f
│   └── hosted-engine.metadata -> /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/images/002aa7c0-ab4d-4a09-9e3b-549961e45a30/c8125c0d-ea55-4fac-b1f4-88b085c18bc8
├── images
│   ├── 002aa7c0-ab4d-4a09-9e3b-549961e45a30
│   │   └── c8125c0d-ea55-4fac-b1f4-88b085c18bc8 -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/c8125c0d-ea55-4fac-b1f4-88b085c18bc8
│   ├── 73c0134f-4fd4-4c4d-8b36-3a7e85c01fea
│   │   └── 20a13701-077a-444c-b09a-400aa319e5d6 -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/20a13701-077a-444c-b09a-400aa319e5d6
│   └── b343b1fc-8a9f-40d2-9035-1bf8a3c8cce2
│       └── c23d7f7f-b068-4272-ac6f-8d703ad5506f -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/c23d7f7f-b068-4272-ac6f-8d703ad5506f
└── master

7 directories, 11 files


I'm attaching the VDSM log of what happens after a reboot.

Comment 7 Simone Tiraboschi 2015-09-07 15:15:15 UTC
Created attachment 1071055 [details]
VDSM logs from 3.5 on iSCSI after a reboot

VDSM logs from 3.5 on iSCSI after a reboot

Comment 8 Allon Mureinik 2015-09-09 16:50:19 UTC
Ala/Nir - frankly, this (the HE) flow doesn't make any sense to me whatsoever, but according to Simone's logs, it didn't seem to change since 3.5.

Please take a look and see if we have something that's easily revertable and that we can live without, at least until we can properly fix the root cause.

Comment 9 Nir Soffer 2015-09-09 17:02:52 UTC
On 3.6 you must use connectStoragePool before calling prepareImage on block
storage. 

This works on nfs since nfs is mounted in /rhev/data-center, but on block
storage there is no mount; the symbolic links to block storage domains
that make block storage look like file storage are created when connecting
the pool.

The correct way to use vdsm APIs is to use exactly the same API calls used
by the engine itself.
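
A minimal sketch of the call sequence described here, reusing the vdsClient invocations from the description; the connectStoragePool arguments (host id, scsi key, master domain, master version) are illustrative assumptions, not values taken from this bug:

# block storage: connecting the server alone does not build /rhev/data-center/mnt/blockSD
vdsClient -s 0 connectStorageServer "${storageType}" "${spUUID}" \
    "connection=${storage},iqn=${iqn},portal=${portal},user=${user},password=${password},id=${connectionUUID},port=${port}"
# connecting the pool is what creates the blockSD symlink tree
vdsClient -s 0 connectStoragePool "${spUUID}" "${hostID}" "${scsiKey}" "${masterSdUUID}" "${masterVersion}"
# only then can prepareImage resolve the image path
vdsClient -s 0 prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}" "${vm_disk_vol_id}"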

Comment 10 Nir Soffer 2015-09-09 17:04:12 UTC
There is no bug here, so there can be no regression.

Comment 11 Simone Tiraboschi 2015-09-09 17:13:43 UTC
On HE in 3.6 we no longer have a dedicated storagePool for the hosted-engine storageDomain.
The aim is to be able to import the hosted-engine storage domain into the engine in order to be able to manage the engine VM from the engine (if the storageDomain is already attached to another storagePool the engine refuses to import it).

Having no storagePool, we cannot call connectStoragePool. In any case we weren't calling it on 3.5 either, where it was correctly working without it, so, at least in that respect, it's a regression. It was also working on 3.6 until about one month ago.

That said, I'm really open to any other solution (a different sequence, a new verb, a command on the host to directly mount it under /rhev/data-center/ as for NFS, ...) to get this working again on iSCSI.

Any ideas?

Comment 12 Nir Soffer 2015-09-09 17:47:08 UTC
(In reply to Simone Tiraboschi from comment #11)
> On HE in 3.6 we no longer have a dedicated storagePool for the hosted-engine
> storageDomain.
> The aim is to be able to import the hosted-engine storage domain into the
> engine in order to be able to manage the engine VM from the engine (if the
> storageDomain is already attached to another storagePool the engine refuses
> to import it).

You should be able to import the storage domain into the engine using the
import domain feature that was introduced in 3.5. If it does not work,
we may need to tweak it so it becomes possible, or find a way to
make it work (see below).

> Having no storagePool, we cannot call connectStoragePool.

This will work only with SDM, when we don't have a pool. This
will not be available in 3.6.0, so you cannot depend on it.

> In any case we weren't calling it on 3.5 either, where it was correctly
> working without it, so, at least in that respect, it's a regression. It was
> also working on 3.6 until about one month ago.

Please test ovirt engine 3.5 with current vdsm version first. If it does
not work now, we will treat it as vdsm regression.

> That said, I'm really open to any other solution (a different sequence, a new
> verb, a command on the host to directly mount it under /rhev/data-center/ as
> for NFS, ...) to get this working again on iSCSI.
> 
> Any ideas?

I think the way to go is to remove the domain - this is tricky since it is hard
to move the last domain (master). But once you have removed it you should be
able to import it into the hosted engine.

On the engine, you must create a new master domain on some other storage
first, before you can import another domain. You can create a bootstrap
storage domain for that on shared storage (e.g. nfs) or on the first host
the engine is running on (nfs, loop device, etc.)

Once you have imported the hosted-engine domain, you can remove the bootstrap
storage domain.
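
The loop-device option mentioned above could look roughly like the following on the host; the backing file path and size are illustrative assumptions, and the resulting filesystem would then back the bootstrap PosixFS storage domain used as temporary master until the hosted-engine domain has been imported:

# create a backing file and expose it as a block device (path and size are examples)
truncate -s 5G /var/tmp/he-bootstrap.img
losetup --find --show /var/tmp/he-bootstrap.img   # prints e.g. /dev/loop0
mkfs.ext4 /dev/loop0
# the filesystem on the loop device then backs the bootstrap (master) storage
# domain; once the hosted-engine domain is imported, the bootstrap domain can go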

Comment 13 Simone Tiraboschi 2015-09-10 08:52:56 UTC
(In reply to Nir Soffer from comment #12)
> You should be able to import the storage domain into the engine using the
> import domain feature that was introduced in 3.5. If it does not work,
> we may need to tweak it so it becomes possible, or find a way to
> make it work (see below).

We are doing that, or at least we are trying to.

> Please test ovirt engine 3.5 with current vdsm version first. If it does
> not work now, we will treat it as vdsm regression.

OK, I'm trying to reproduce there.

> I think the way to go is to remove the domain - this is tricky since it is hard
> to move the last domain (master). But once you have removed it you should be
> able to import it into the hosted engine.
> 
> On the engine, you must create a new master domain on some other storage
> first, before you can import another domain. You can create a bootstrap
> storage domain for that on shared storage (e.g. nfs) or on the first host
> the engine is running on (nfs, loop device, etc.)
> 
> Once you have imported the hosted-engine domain, you can remove the bootstrap
> storage domain.

That's exactly what we are doing:
we create a bootstrap storage pool with a bootstrap PosixFS storage domain on a loopback device. Then we ensure that the bootstrap storage domain is the master one, and we detach the hosted-engine storage domain so it can be imported into the engine.
The only difference is that we are doing it from the HA agent, because we want to ensure that it also works on upgrades from 3.5 without users having to run manual commands. Then, as soon as a 3.6 engine recognizes a hosted-engine host with a score indicating that it is correctly at 3.6, the engine will try to import the hosted-engine storage domain.

The issue we are facing here is on reboot:
the engine VM configuration is now on the shared storage (an additional volume on the hosted-engine storage domain), so the agent should be able to read it in order to start the engine VM. To do that it has to call prepareImage, but prepareImage now fails on iSCSI after the reboot because /rhev/data-center/mnt/blockSD hasn't been populated.

Comment 14 Simone Tiraboschi 2015-09-11 14:07:29 UTC
(In reply to Simone Tiraboschi from comment #13)

> > Please test ovirt engine 3.5 with current vdsm version first. If it does
> > not work now, we will treat it as vdsm regression.
> 
> OK, I'm trying to reproduce there.

I was failing for a different issue (the lack of getVolumePath), please see
https://bugzilla.redhat.com/show_bug.cgi?id=1262359

Comment 15 Nir Soffer 2015-09-11 14:25:38 UTC
(In reply to Simone Tiraboschi from comment #14)
> (In reply to Simone Tiraboschi from comment #13)
> 
> > > Please test ovirt engine 3.5 with current vdsm version first. If it does
> > > not work now, we will treat it as vdsm regression.
> > 
> > OK, I'm trying to reproduce there.
> 
> I was failing for a different issue (the lack of getVolumePath), please see
> https://bugzilla.redhat.com/show_bug.cgi?id=1262359

Can you try to replace getVolumePath with prepareImage in the old version?

We need to understand if this is a regression in hosted engine or vdsm.

Comment 16 Sandro Bonazzola 2015-09-11 15:16:22 UTC
As per https://bugzilla.redhat.com/show_bug.cgi?id=1247942#c6, yes, it worked on the old version and so this is a regression.

Comment 17 Simone Tiraboschi 2015-09-11 15:45:51 UTC
I just backported https://gerrit.ovirt.org/#/c/34881 to ovirt-hosted-engine-ha 1.2.6.1 in order to test it with vdsm 4.17.6, and the issue is there:

[root@c71het20150910 ~]# systemctl status ovirt-ha-agent
ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled)
   Active: active (running) since ven 2015-09-11 17:24:17 CEST; 13min ago
  Process: 1053 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start (code=exited, status=0/SUCCESS)
 Main PID: 1096 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─1096 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

set 11 17:31:33 c71het20150910.localdomain ovirt-ha-agent[1096]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed trying to connect storage:
set 11 17:31:33 c71het20150910.localdomain ovirt-ha-agent[1096]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed trying to connect storage' - trying to restart agent


From VDSM logs:
Thread-17::ERROR::2015-09-11 17:24:46,076::blockVolume::426::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 424, in validateImagePath
    os.mkdir(imageDir, 0o755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/blockSD/5094102c-e7f9-4f34-9362-936a2887faf5/images/cb7c29da-21f5-466e-8f59-b5f9d8f2a463'
Thread-17::ERROR::2015-09-11 17:24:46,081::task::866::Storage.TaskManager.Task::(_setError) Task=`b26ad487-b1c8-4113-ae08-6c0c3868290a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3154, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/sd.py", line 457, in produceVolume
    volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 78, in __init__
    volume.Volume.__init__(self, repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 144, in __init__
    self.validate()
  File "/usr/share/vdsm/storage/blockVolume.py", line 87, in validate
    volume.Volume.validate(self)
  File "/usr/share/vdsm/storage/volume.py", line 156, in validate
    self.validateImagePath()
  File "/usr/share/vdsm/storage/blockVolume.py", line 427, in validateImagePath
    raise se.ImagePathError(imageDir)
ImagePathError: Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/5094102c-e7f9-4f34-9362-936a2887faf5/images/cb7c29da-21f5-466e-8f59-b5f9d8f2a463',)


while on the file system:
[root@c71het20150910 ~]# tree /rhev/data-center/
/rhev/data-center/
└── mnt

1 directory, 0 files


I'm attaching the relevant logs.

Comment 18 Simone Tiraboschi 2015-09-11 15:47:31 UTC
Created attachment 1072597 [details]
HE 1.2.6.1 + 34881 with VDSM 4.17.6

Comment 19 Simone Tiraboschi 2015-09-18 15:01:07 UTC
The fault was here:
MainThread::DEBUG::2015-09-18 11:56:35,879::hsm::427::Storage.HSM::(__cleanStorageRepository) Started cleaning storage repository at '/rhev/data-center'
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::459::Storage.HSM::(__cleanStorageRepository) White list: ['/rhev/data-center/hsm-tasks', '/rhev/data-center/hsm-tasks/*', '/rhev/data-center/mnt']
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::460::Storage.HSM::(__cleanStorageRepository) Mount list: []
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::462::Storage.HSM::(__cleanStorageRepository) Cleaning leftovers
MainThread::DEBUG::2015-09-18 11:56:35,881::hsm::505::Storage.HSM::(__cleanStorageRepository) Finished cleaning storage repository at '/rhev/data-center'

VDSM was completely cleaning up /rhev/data-center/ when we restarted it (host-deploy replaced its certificate, so we have to restart and reconnect) at the end of the deploy process, before starting the HA agent.
Another run of prepareImage (without the bootstrap storage pool) before starting the HA agent seems to be enough to solve it.
The symlinks survive reboots and the agent is able to restart the engine VM, so it seems OK on the first host.
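
A minimal sketch of that extra prepareImage run, assuming the same spUUID/sdUUID/image/volume variables used earlier in this bug:

# host-deploy restarted VDSM, which cleaned /rhev/data-center, so re-run
# prepareImage to recreate the blockSD symlinks before starting the agent
vdsClient -s 0 prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}" "${vm_disk_vol_id}"
systemctl start ovirt-ha-agent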

Now let's check what happens on additional hosts.

Comment 20 Simone Tiraboschi 2015-09-18 15:56:34 UTC
The first host works because, after patch 46343, the symlinks under /rhev/data-center/mnt/blockSD
survive a reboot.

[stirabos@c71het20150917 ~]$ date
ven 18 set 2015, 17.52.38, CEST
[stirabos@c71het20150917 ~]$ uptime
 17:52:42 up 6 min,  1 user,  load average: 0,00, 0,07, 0,05
[stirabos@c71het20150917 ~]$ ls -l /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/5b7dfbbe-606c-421b-825e-ddaaade72d10/aee75fb3-d21c-40ae-a19d-569d1fda16f6 
lrwxrwxrwx. 1 vdsm kvm 78 18 set 17.00 /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/5b7dfbbe-606c-421b-825e-ddaaade72d10/aee75fb3-d21c-40ae-a19d-569d1fda16f6 -> /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/aee75fb3-d21c-40ae-a19d-569d1fda16f6


The second host still fails because nobody created those symlinks:
[root@c71het20150918 ~]# vdsClient -s 0 prepareImage 00000000-0000-0000-0000-000000000000 efd23a0f-ce8c-4ee1-8c88-7069e0be88ce f557457d-d1e4-4be1-b00d-65458e5eb08c 9f94c8ec-be53-4bd1-9077-818bc7f7dec3
Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c',)
[root@c71het20150918 ~]# ls -l /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c
ls: cannot access /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c: No such file or directory
[root@c71het20150918 ~]# tree /rhev/data-center/mnt/
/rhev/data-center/mnt/
0 directories, 0 files

Directly calling startMonitoringDomain is not enough.

Comment 21 Simone Tiraboschi 2015-09-18 16:10:34 UTC
Nir,
is there a verb to explicitly have the links under /rhev/data-center/mnt/blockSD created/refreshed?
If not, can we add one?

Comment 22 Simone Tiraboschi 2015-09-18 16:25:04 UTC
On the first host they were created there and then nothing destroyed them:
[stirabos@c71het20150917 ~]$ grep "symlink" /var/log/vdsm/vdsm.log  | grep rhev
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/metadata to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/metadata
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/leases to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/leases
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/ids to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/ids
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/inbox to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/inbox
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/outbox to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/outbox
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/master to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/master

Comment 23 Nir Soffer 2015-09-18 22:51:49 UTC
According to comment 19 and a discussion with Simone, there is no vdsm bug,
and of course no regression. connectStorageServer behavior is not different
between NFS and iSCSI, and it was not changed in 3.6.

Simone, please move this bug back to the hosted-engine component.

Please do not reuse this bug for new issues or features; open a new
bug for those.

Comment 24 Simone Tiraboschi 2015-09-21 08:59:16 UTC
Calling getStorageDomainStats to ensure /rhev/data-center/... gets populated.
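
A minimal sketch of that sequence, reusing the variables from the description; the expectation (supported by the refresh logs later in this bug) is that getStorageDomainStats refreshes the domain and recreates the links under /rhev/data-center, after which prepareImage can succeed:

# refresh the hosted-engine storage domain, then prepare the engine VM image
vdsClient -s 0 getStorageDomainStats "${sdUUID}"
vdsClient -s 0 prepareImage "${spUUID}" "${sdUUID}" "${vm_disk_id}" "${vm_disk_vol_id}"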

Comment 25 Elad 2015-11-15 12:34:03 UTC
On 3.6, getStorageDomainStats is called 3 times over iSCSI while in 3.5 it is called only once.

Verified using:
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.10.1-0.el7ev.noarch

Comment 26 Simone Tiraboschi 2015-12-17 17:12:45 UTC
It still occurs on FC!

Comment 27 Elad 2016-01-11 09:32:00 UTC
Storage domain refresh is also performed for FC, to sync the symlinks under /rhev/data-center:

ha-agent.log:

MainThread::INFO::2016-01-11 11:23:53,626::storage_server::110::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-01-11 11:23:53,635::storage_server::143::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain



vdsm.log:

Thread-79172::INFO::2016-01-11 11:23:53,654::logUtils::48::dispatcher::(wrapper) Run and protect: getStorageDomainStats(sdUUID='594ea5cf-53ed-4674-8e23-b185565a9b86', options=None)
Thread-79172::DEBUG::2016-01-11 11:23:53,654::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.594ea5cf-53ed-4674-8e23-b185565a9b86`ReqID=`1bebbba1-c756-427b-8d57-e3521cb9760f`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2848' at 'getStorageDomainStats'

Verified using:
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.1-1.el7ev.noarch
vdsm-4.17.15-0.el7ev.noarch