Bug 1258465
| Summary: | Different behavior of connectStorageServer and prepareImage between iSCSI and NFS | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Simone Tiraboschi <stirabos> |
| Component: | ovirt-hosted-engine-ha | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Elad <ebenahar> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.6.0 | CC: | acanan, ahino, amarchuk, amureini, bazulay, dfediuck, fdeutsch, gklein, lsurette, mgoldboi, nsoffer, sbonazzo, sherold, stirabos, tnisan, ycui, yeylon, ykaul, ylavi |
| Target Milestone: | ovirt-3.6.1 | Keywords: | Regression |
| Target Release: | 3.6.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-hosted-engine-ha-1.3.3.5-1 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-03-11 07:32:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1247942, 1251752 | | |
| Attachments: | | | |
Description
Simone Tiraboschi
2015-08-31 13:00:33 UTC
Created attachment 1068640 [details]
connectStorageServer and prepareImage on NFS and on iSCSI
By design prepareImage should not work when the pool is not up; the only bug here is that prepareImage does work when the pool is not up, and that behavior should be blocked. Reducing severity.

hosted-engine is not using the storage pool nor the SPM for its own storage domain, which it is directly monitoring. But it still needs to be able to call prepareImage.

(In reply to Tal Nisan from comment #2)
> By design prepareImage should not work when the pool is not up

Then we need another verb that can be called with just a monitored domain.
It worked so far in 3.5 and in 3.6 until a couple of weeks ago.
Raising severity again, since this is a blocker for oVirt 3.6.0 GA.

(In reply to Sandro Bonazzola from comment #4)
> (In reply to Tal Nisan from comment #2)
> > By design prepareImage should not work when the pool is not up
>
> Then we need another verb that can be called with just a monitored domain.
> It worked so far in 3.5 and in 3.6 until a couple of weeks ago.
> Raising severity again, since this is a blocker for oVirt 3.6.0 GA.

If a pooled verb worked on an inactive pool, that's the bug, not anything else. In what version did it seem to work? Can you add a log of a successful run?

Here is the situation on 3.5 with iSCSI when ovirt-ha-agent brings up the system after a reboot:

[root@c7120150907he35is ~]# vdsClient -s 0 getStorageDomainsList
c80e2ec1-a9c2-4952-949d-9a101c200539

[root@c7120150907he35is ~]# vdsClient -s 0 getStorageDomainInfo c80e2ec1-a9c2-4952-949d-9a101c200539
	uuid = c80e2ec1-a9c2-4952-949d-9a101c200539
	vguuid = KVuNlI-Fs35-g7bi-Bdnh-C6bZ-FIsU-1F3f0C
	state = OK
	version = 3
	role = Master
	type = ISCSI
	class = Data
	pool = ['b9208baa-7c5d-4eea-962b-a6c9f188238c']
	name = hosted_storage

On 3.5 the HE storage domain was still attached to a storage pool, but we don't connect it:

[root@c7120150907he35is ~]# vdsClient -s 0 getStoragePoolInfo b9208baa-7c5d-4eea-962b-a6c9f188238c
Unknown pool id, pool not connected: ('b9208baa-7c5d-4eea-962b-a6c9f188238c',)

But /rhev/data-center/ got correctly populated:

[root@c7120150907he35is ~]# tree /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/
/rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/
├── dom_md
│   ├── ids -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/ids
│   ├── inbox -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/inbox
│   ├── leases -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/leases
│   ├── master -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/master
│   ├── metadata -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/metadata
│   └── outbox -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/outbox
├── ha_agent
│   ├── hosted-engine.lockspace -> /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/images/b343b1fc-8a9f-40d2-9035-1bf8a3c8cce2/c23d7f7f-b068-4272-ac6f-8d703ad5506f
│   └── hosted-engine.metadata -> /rhev/data-center/mnt/blockSD/c80e2ec1-a9c2-4952-949d-9a101c200539/images/002aa7c0-ab4d-4a09-9e3b-549961e45a30/c8125c0d-ea55-4fac-b1f4-88b085c18bc8
├── images
│   ├── 002aa7c0-ab4d-4a09-9e3b-549961e45a30
│   │   └── c8125c0d-ea55-4fac-b1f4-88b085c18bc8 -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/c8125c0d-ea55-4fac-b1f4-88b085c18bc8
│   ├── 73c0134f-4fd4-4c4d-8b36-3a7e85c01fea
│   │   └── 20a13701-077a-444c-b09a-400aa319e5d6 -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/20a13701-077a-444c-b09a-400aa319e5d6
│   └── b343b1fc-8a9f-40d2-9035-1bf8a3c8cce2
│       └── c23d7f7f-b068-4272-ac6f-8d703ad5506f -> /dev/c80e2ec1-a9c2-4952-949d-9a101c200539/c23d7f7f-b068-4272-ac6f-8d703ad5506f
└── master

7 directories, 11 files

I'm attaching the VDSM log of what happens after a reboot.

Created attachment 1071055 [details]
VDSM logs from 3.5 on iSCSI after a reboot
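For context, the call hosted-engine needs to keep working is prepareImage against a monitored domain with no connected pool. A hypothetical invocation built from the 3.5 UUIDs above (the blank pool UUID stands in for the missing storage pool, and the image/volume UUIDs are taken from the hosted-engine.lockspace link in the tree; the same form appears later in this report):

vdsClient -s 0 prepareImage 00000000-0000-0000-0000-000000000000 c80e2ec1-a9c2-4952-949d-9a101c200539 b343b1fc-8a9f-40d2-9035-1bf8a3c8cce2 c23d7f7f-b068-4272-ac6f-8d703ad5506f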
Ala/Nir - frankly, this (the HE) flow doesn't make any sense to me whatsoever, but according to Simone's logs it didn't seem to change since 3.5. Please take a look and see if we have something that's easily revertable and that we can live without, at least until we can properly fix the root cause.

On 3.6 you must use connectStoragePool before calling prepareImage on block storage. This works on NFS since NFS is mounted under /rhev/data-center, but on block storage there is no mount; the symbolic links to block storage domains that make block storage look like file storage are created when connecting the pool. The correct way to use the vdsm APIs is to use exactly the same API calls used by the engine itself. There is no bug here, so there can be no regression.

On HE on 3.6 we no longer have a dedicated storagePool for the hosted-engine storageDomain. The aim is to be able to import the hosted-engine storage domain into the engine in order to be able to manage the engine VM from the engine (if the storageDomain is already attached to another storagePool, the engine refuses to import it). Having no storagePool, we cannot call connectStoragePool. By the way, we weren't calling it on 3.5 either, where it was correctly working without that, so at least on that aspect it's a regression. It was also working on 3.6 until about one month ago.

That said, I'm really open to any other solution (a different sequence, a new verb, a command on the host to directly mount it under /rhev/data-center/ as for NFS, ...) to have it working again on iSCSI. Any ideas?

(In reply to Simone Tiraboschi from comment #11)
> On HE on 3.6 we no longer have a dedicated storagePool for the hosted-engine
> storageDomain.
> The aim is to be able to import the hosted-engine storage domain into the
> engine in order to be able to manage the engine VM from the engine (if the
> storageDomain is already attached to another storagePool, the engine refuses
> to import it).

You should be able to import the storage domain into the engine using the import domain feature that was introduced in 3.5. If it does not work we may need to tweak it so it becomes possible, or find a way to make it work (see below).

> Having no storagePool, we cannot call connectStoragePool.

This will work only with SDM, when we don't have a pool. This will not be available in 3.6.0, so you cannot depend on this.

> By the way, we weren't calling it on 3.5 either, where it was correctly
> working without that, so at least on that aspect it's a regression. It was
> also working on 3.6 until about one month ago.

Please test ovirt engine 3.5 with the current vdsm version first. If it does not work now, we will treat it as a vdsm regression.

> That said, I'm really open to any other solution (a different sequence, a new
> verb, a command on the host to directly mount it under /rhev/data-center/ as
> for NFS, ...) to have it working again on iSCSI.
>
> Any ideas?

I think the way is to remove the domain - this is tricky since it is hard to remove the last (master) domain. But once you removed it you should be able to import it into the hosted engine.

On the engine, you must create a new master domain on some other storage first, before you can import another domain. You can create a bootstrap storage domain for that on shared storage (e.g. NFS) or on the first host the engine is running on (NFS, loop device, etc.)

Once you imported the hosted engine domain, you can remove the bootstrap storage domain.
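To make the difference concrete, here is a rough sketch of the two call sequences discussed above. Verb names are the ones used elsewhere in this report; the argument lists are abbreviated and the UUIDs are placeholders, so take this as an assumption about ordering rather than exact syntax:

# engine flow on block storage: connecting the pool is what creates the
# symlinks under /rhev/data-center/mnt/blockSD
vdsClient -s 0 connectStorageServer <domType> <spUUID> <conList>
vdsClient -s 0 connectStoragePool <spUUID> <hostID> <scsiKey> <msdUUID> <masterVersion>
vdsClient -s 0 prepareImage <spUUID> <sdUUID> <imgUUID> <volUUID>

# hosted-engine flow on 3.6: there is no pool, so the middle step cannot happen
# and, on iSCSI, prepareImage finds no /rhev/data-center/mnt/blockSD tree after a reboot
vdsClient -s 0 connectStorageServer <domType> <spUUID> <conList>
vdsClient -s 0 prepareImage 00000000-0000-0000-0000-000000000000 <sdUUID> <imgUUID> <volUUID>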
(In reply to Nir Soffer from comment #12)
> You should be able to import the storage domain into the engine using the
> import domain feature that was introduced in 3.5. If it does not work
> we may need to tweak it so it becomes possible, or find a way to
> make it work (see below).

We are doing it, or at least we are trying to.

> Please test ovirt engine 3.5 with the current vdsm version first. If it does
> not work now, we will treat it as a vdsm regression.

OK, I'm trying to reproduce there.

> I think the way is to remove the domain - this is tricky since it is hard
> to remove the last (master) domain. But once you removed it you should be
> able to import it into the hosted engine.
>
> On the engine, you must create a new master domain on some other storage
> first, before you can import another domain. You can create a bootstrap
> storage domain for that on shared storage (e.g. NFS) or on the first host
> the engine is running on (NFS, loop device, etc.)
>
> Once you imported the hosted engine domain, you can remove the bootstrap
> storage domain.

That's exactly what we are doing: we create a bootstrap storage pool with a bootstrap PosixFS storage domain on a loopback device. Then we ensure that the bootstrap storage domain is the master one, and we detach the hosted-engine storage domain so it can be imported into the engine. The only difference is that we do it from the HA agent, because we want it to also work on upgrades from 3.5 without users having to run manual commands. Then, as soon as a 3.6 engine recognizes a hosted-engine host with a score indicating that it's correctly at 3.6, the engine will try to import the hosted-engine storage domain.

The issue we are facing here is on reboot: the engine VM configuration is now on the shared storage (an additional volume on the hosted-engine storage domain), so the agent has to be able to read it in order to eventually start the engine VM. To do that it has to call prepareImage, but prepareImage is now failing on iSCSI after the reboot because /rhev/data-center/mnt/blockSD hasn't been populated.

(In reply to Simone Tiraboschi from comment #13)
> > Please test ovirt engine 3.5 with the current vdsm version first. If it does
> > not work now, we will treat it as a vdsm regression.
>
> OK, I'm trying to reproduce there.

I was failing for a different issue (the lack of getVolumePath); please see https://bugzilla.redhat.com/show_bug.cgi?id=1262359

(In reply to Simone Tiraboschi from comment #14)
> (In reply to Simone Tiraboschi from comment #13)
> >
> > > Please test ovirt engine 3.5 with the current vdsm version first. If it
> > > does not work now, we will treat it as a vdsm regression.
> >
> > OK, I'm trying to reproduce there.
>
> I was failing for a different issue (the lack of getVolumePath); please see
> https://bugzilla.redhat.com/show_bug.cgi?id=1262359

Can you try to replace getVolumePath with prepareImage in the old version? We need to understand if this is a regression in hosted engine or in vdsm.

As per https://bugzilla.redhat.com/show_bug.cgi?id=1247942#c6 yes, it worked on the old version, so it is a regression.
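As an aside, the "bootstrap PosixFS storage domain on a loopback device" mentioned above roughly amounts to something like the following. This is only an illustrative sketch: the file name and size are hypothetical, and the actual domain/pool creation goes through the regular vdsm verbs (createStorageDomain, createStoragePool), which are not shown here:

truncate -s 2G /var/tmp/hosted-engine-bootstrap.img          # backing file (hypothetical path/size)
losetup --find --show /var/tmp/hosted-engine-bootstrap.img   # prints the loop device, e.g. /dev/loop0
mkfs.ext3 /dev/loop0                                         # filesystem backing the PosixFS domain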
I just backported https://gerrit.ovirt.org/#/c/34881 to ovirt-hosted-engine-ha 1.2.6.1 in order to be able to test it with vdsm 4.17.6, and the issue is there:

[root@c71het20150910 ~]# systemctl status ovirt-ha-agent
ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled)
   Active: active (running) since ven 2015-09-11 17:24:17 CEST; 13min ago
  Process: 1053 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start (code=exited, status=0/SUCCESS)
 Main PID: 1096 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─1096 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

set 11 17:31:33 c71het20150910.localdomain ovirt-ha-agent[1096]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Failed trying to connect storage:
set 11 17:31:33 c71het20150910.localdomain ovirt-ha-agent[1096]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed trying to connect storage' - trying to restart agent

From VDSM logs:

Thread-17::ERROR::2015-09-11 17:24:46,076::blockVolume::426::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 424, in validateImagePath
    os.mkdir(imageDir, 0o755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/blockSD/5094102c-e7f9-4f34-9362-936a2887faf5/images/cb7c29da-21f5-466e-8f59-b5f9d8f2a463'
Thread-17::ERROR::2015-09-11 17:24:46,081::task::866::Storage.TaskManager.Task::(_setError) Task=`b26ad487-b1c8-4113-ae08-6c0c3868290a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3154, in getVolumeInfo
    volUUID=volUUID).getInfo()
  File "/usr/share/vdsm/storage/sd.py", line 457, in produceVolume
    volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 78, in __init__
    volume.Volume.__init__(self, repoPath, sdUUID, imgUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 144, in __init__
    self.validate()
  File "/usr/share/vdsm/storage/blockVolume.py", line 87, in validate
    volume.Volume.validate(self)
  File "/usr/share/vdsm/storage/volume.py", line 156, in validate
    self.validateImagePath()
  File "/usr/share/vdsm/storage/blockVolume.py", line 427, in validateImagePath
    raise se.ImagePathError(imageDir)
ImagePathError: Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/5094102c-e7f9-4f34-9362-936a2887faf5/images/cb7c29da-21f5-466e-8f59-b5f9d8f2a463',)

while on the file system:

[root@c71het20150910 ~]# tree /rhev/data-center/
/rhev/data-center/
└── mnt

1 directory, 0 files

I'm attaching the relevant logs.

Created attachment 1072597 [details]
HE 1.2.6.1 + 34881 with VDSM 4.17.6
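The ImagePathError above comes from validateImagePath calling os.mkdir() on the image directory: mkdir does not create missing parents, and after the reboot the whole /rhev/data-center/mnt/blockSD/<sdUUID> tree is absent, so the call fails with ENOENT. The same failure mode can be reproduced with plain mkdir on the path from the traceback (illustrative only, not an actual transcript from that host):

mkdir /rhev/data-center/mnt/blockSD/5094102c-e7f9-4f34-9362-936a2887faf5/images/cb7c29da-21f5-466e-8f59-b5f9d8f2a463
# mkdir: cannot create directory '...': No such file or directory  (the parent tree is missing)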
The fault was here:

MainThread::DEBUG::2015-09-18 11:56:35,879::hsm::427::Storage.HSM::(__cleanStorageRepository) Started cleaning storage repository at '/rhev/data-center'
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::459::Storage.HSM::(__cleanStorageRepository) White list: ['/rhev/data-center/hsm-tasks', '/rhev/data-center/hsm-tasks/*', '/rhev/data-center/mnt']
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::460::Storage.HSM::(__cleanStorageRepository) Mount list: []
MainThread::DEBUG::2015-09-18 11:56:35,880::hsm::462::Storage.HSM::(__cleanStorageRepository) Cleaning leftovers
MainThread::DEBUG::2015-09-18 11:56:35,881::hsm::505::Storage.HSM::(__cleanStorageRepository) Finished cleaning storage repository at '/rhev/data-center'

VDSM was completely cleaning up /rhev/data-center/ when we restarted it at the end of the deploy process, before starting the HA agent (host-deploy replaced its cert, so we have to restart and reconnect). Another shot of prepareImage (without the bootstrap storage pool) before starting the HA agent seems to be enough to solve it; see the sketch after this comment. The symlinks survive reboots and the agent is able to restart the engine VM, so it seems OK on the first host. Now let's check what happens on additional hosts.

The first host works because, after patch 46343, the symlink under /rhev/data-center/mnt/blockSD survives a reboot:

[stirabos@c71het20150917 ~]$ date
ven 18 set 2015, 17.52.38, CEST
[stirabos@c71het20150917 ~]$ uptime
 17:52:42 up 6 min, 1 user, load average: 0,00, 0,07, 0,05
[stirabos@c71het20150917 ~]$ ls -l /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/5b7dfbbe-606c-421b-825e-ddaaade72d10/aee75fb3-d21c-40ae-a19d-569d1fda16f6
lrwxrwxrwx. 1 vdsm kvm 78 18 set 17.00 /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/5b7dfbbe-606c-421b-825e-ddaaade72d10/aee75fb3-d21c-40ae-a19d-569d1fda16f6 -> /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/aee75fb3-d21c-40ae-a19d-569d1fda16f6

The second host still fails because nobody created those symlinks:

[root@c71het20150918 ~]# vdsClient -s 0 prepareImage 00000000-0000-0000-0000-000000000000 efd23a0f-ce8c-4ee1-8c88-7069e0be88ce f557457d-d1e4-4be1-b00d-65458e5eb08c 9f94c8ec-be53-4bd1-9077-818bc7f7dec3
Image path does not exist or cannot be accessed/created: ('/rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c',)
[root@c71het20150918 ~]# ls -l /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c
ls: cannot access /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/images/f557457d-d1e4-4be1-b00d-65458e5eb08c: No such file or directory
[root@c71het20150918 ~]# tree /rhev/data-center/mnt/
/rhev/data-center/mnt/

0 directories, 0 files

Directly calling startMonitoringDomain is not enough.

Nir, is there a verb to explicitly have the links under /rhev/data-center/mnt/blockSD created/refreshed? If not, can we add one?
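For reference, the first-host workaround described earlier in this comment amounts to a sequence roughly like the following, run at the end of host-deploy. This is only a sketch; the UUIDs are the first host's, taken from the ls output above:

systemctl restart vdsmd
vdsClient -s 0 prepareImage 00000000-0000-0000-0000-000000000000 efd23a0f-ce8c-4ee1-8c88-7069e0be88ce 5b7dfbbe-606c-421b-825e-ddaaade72d10 aee75fb3-d21c-40ae-a19d-569d1fda16f6
systemctl start ovirt-ha-agent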
On the first host they were created there, and then nothing destroyed them:

[stirabos@c71het20150917 ~]$ grep "symlink" /var/log/vdsm/vdsm.log | grep rhev
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/metadata to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/metadata
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/leases to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/leases
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/ids to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/ids
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/inbox to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/inbox
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/outbox to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/outbox
Thread-34::DEBUG::2015-09-18 16:59:06,173::blockSD::1334::Storage.StorageDomain::(refreshDirTree) Creating symlink from /dev/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/master to /rhev/data-center/mnt/blockSD/efd23a0f-ce8c-4ee1-8c88-7069e0be88ce/dom_md/master

According to comment 19 and the discussion with Simone, there is no vdsm bug, and of course no regression. connectStorageServer behavior is not different between NFS and iSCSI, and it was not changed in 3.6. Simone, please take this bug back to the hosted-engine component. Please do not reuse this bug for new issues or features; open a new bug for those.

Calling getStorageDomainStats to ensure /rhev/data-center/... gets populated.

On 3.6, getStorageDomainStats is called 3 times over iSCSI, while in 3.5 it is called only once.

Verified using:
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.10.1-0.el7ev.noarch

It still occurs on FC!

Storage domain refresh is performed also for FC, for syncing the symlinks under /rhev/data-center:

ha-agent.log:
MainThread::INFO::2016-01-11 11:23:53,626::storage_server::110::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-01-11 11:23:53,635::storage_server::143::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain

vdsm.log:
Thread-79172::INFO::2016-01-11 11:23:53,654::logUtils::48::dispatcher::(wrapper) Run and protect: getStorageDomainStats(sdUUID='594ea5cf-53ed-4674-8e23-b185565a9b86', options=None)
Thread-79172::DEBUG::2016-01-11 11:23:53,654::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.594ea5cf-53ed-4674-8e23-b185565a9b86`ReqID=`1bebbba1-c756-427b-8d57-e3521cb9760f`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2848' at 'getStorageDomainStats'

Verified using:
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.1-1.el7ev.noarch
vdsm-4.17.15-0.el7ev.noarch
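The verified fix boils down to the agent forcing a storage domain refresh right after connectStorageServer, which recreates the symlinks under /rhev/data-center/mnt/blockSD. The same refresh can be triggered by hand for a quick check, assuming vdsClient exposes the verb under the same name as in the vdsm.log excerpt above (the sdUUID is taken from that excerpt):

vdsClient -s 0 getStorageDomainStats 594ea5cf-53ed-4674-8e23-b185565a9b86
tree /rhev/data-center/mnt/blockSD/594ea5cf-53ed-4674-8e23-b185565a9b86/dom_md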