Created attachment 1499424 [details]
sosreport from host alma04

Description of problem:
The engine does not correctly construct the OVF_STORE disks' content.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.19-0.0.master.20181002122327.20181002122322.gitb449616.el7.noarch
ovirt-hosted-engine-setup-2.2.29-0.0.master.20181002122252.git9ae169e.el7.noarch
ovirt-engine-appliance-4.2-20181013.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deployed node0 over NFS.
2. Attached one ha-host.
3. Added one ISO domain.
4. Added one NFS data storage domain.
5. Created one guest VM on alma03 (SPM host).
6. Migrated the HE VM to alma04 (non-SPM host).
7. Set the environment into global maintenance.
8. Retrieved backup files from the engine.
9. Reprovisioned alma04 (non-SPM).
10. Copied the backup file to the clean alma04.
11. Installed ovirt-hosted-engine-setup on alma04.
12. Tried to restore the engine during deployment using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma04_rhevm_4_2_7_nospm" on alma04, on another NFS mount point.
13. The deployment failed.

Additional info:
Let's focus on the latest engine.log.

The engine VM has been correctly created:

2018-10-29 13:53:58,116+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-27) [] EVENT_ID: USER_ADD_VM(34), VM HostedEngine was created by admin@internal-authz.

and editing it correctly triggered an OVF update:

2018-10-29 13:55:06,456+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] EVENT_ID: USER_UPDATE_VM(35), VM HostedEngine configuration was updated by admin@internal-authz.
2018-10-29 13:55:06,460+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Before acquiring and wait lock 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,461+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Lock-wait acquired to object 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,462+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true.
Entities affected : ID: ebda3e8a-da8d-11e8-8a73-00163e7bb853 Type: StoragePool
2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'
2018-10-29 13:55:06,474+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,543+02 WARN [org.ovirt.engine.core.vdsbroker.builder.vminfo.VmInfoBuildUtils] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] No host NUMA nodes found for vm HostedEngine (bffc5624-14c2-4f66-9854-33e981604ca5)
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update template OVFs in Data Center 'Default'
2018-10-29 13:55:06,594+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated templates OVFs in Data Center 'Default'
2018-10-29 13:55:06,595+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to remove unneeded template/vm OVFs in Data Center 'Default'
2018-10-29 13:55:06,615+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully removed unneeded template/vm OVFs in Data Center 'Default'
2018-10-29 13:55:06,616+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Lock freed to object 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,666+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bb2288b9-0303-483c-888b-2bc4c4c6a351=STORAGE]', sharedLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]'}'
2018-10-29 13:55:06,673+02 WARN [org.ovirt.engine.core.vdsbroker.builder.vminfo.VmInfoBuildUtils] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] No host NUMA nodes found for vm HostedEngine (bffc5624-14c2-4f66-9854-33e981604ca5)
2018-10-29 13:55:06,733+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true.
Entities affected : ID: bb2288b9-0303-483c-888b-2bc4c4c6a351 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2018-10-29 13:55:06,772+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] START, SetVolumeDescriptionVDSCommand( SetVolumeDescriptionVDSCommandParameters:{storagePoolId='ebda3e8a-da8d-11e8-8a73-00163e7bb853', ignoreFailoverLimit='false', storageDomainId='bb2288b9-0303-483c-888b-2bc4c4c6a351', imageGroupId='aede6f87-ed75-4f5b-92f8-776cc62bdf3e', imageId='994c1f99-77bc-4df9-a686-8c2ba197211f'}), log id: 532068ec
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] -- executeIrsBrokerCommand: calling 'setVolumeDescription', parameters:
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ spUUID=ebda3e8a-da8d-11e8-8a73-00163e7bb853
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ sdUUID=bb2288b9-0303-483c-888b-2bc4c4c6a351
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ imageGroupGUID=aede6f87-ed75-4f5b-92f8-776cc62bdf3e
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ volUUID=994c1f99-77bc-4df9-a686-8c2ba197211f
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ description={"Updated":false,"Last Updated":"Sun Oct 28 12:12:24 IST 2018","Storage Domains":[{"uuid":"bb2288b9-0303-483c-888b-2bc4c4c6a351"}],"Disk Description":"OVF_STORE"}
2018-10-29 13:55:06,816+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] FINISH, SetVolumeDescriptionVDSCommand, log id: 532068ec

and the OVF_STORE content has been uploaded through alma03, which was the SPM host (the user was redeploying on alma04):

2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] START, UploadStreamVDSCommand(HostName = alma03.qa.lab.tlv.redhat.com, UploadStreamVDSCommandParameters:{hostId='f2897fa0-0372-4fdb-9088-475002bb2a91'}), log id: 478e0b1e
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] -- executeVdsBrokerCommand, parameters:
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ spUUID=ebda3e8a-da8d-11e8-8a73-00163e7bb853
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ sdUUID=bb2288b9-0303-483c-888b-2bc4c4c6a351
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ imageGUID=aede6f87-ed75-4f5b-92f8-776cc62bdf3e
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ volUUID=994c1f99-77bc-4df9-a686-8c2ba197211f
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ size=30720
2018-10-29 13:55:07,215+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] FINISH, UploadStreamVDSCommand, log id: 478e0b1e

Everything went as expected on alma03:

2018-10-29 13:55:44,905+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] Ending command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' successfully.
2018-10-29 13:55:44,913+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedThreadFactory-engine-Thread-262) [5811be50] Ending command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' successfully.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'UploadStream' completed, handling the result.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'UploadStream' succeeded, clearing tasks.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '920dafb3-98eb-43d8-9683-81c59117064c'

but then, if we check the content of the OVF_STORE volume on the host side:

[root@alma04 ~]# vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -xvf -
20+0 records in
20+0 records out
10240 bytes (10 kB) copied, 0,000141984 s, 72,1 MB/s
info.json
metadata.json
[root@alma04 ~]# cat info.json
{ "Last Updated" : "Mon Oct 29 13:55:37 IST 2018", "Storage Domains" : [ { "uuid" : "c1078875-3b6d-45da-8ee5-6f0eb8efc52c" } ] }
[root@alma04 ~]# cat metadata.json
{ "VM Status" : { "f51ec4cd-5ab2-4a58-b13e-15314c8b296d" : 1 } }

This seems an engine bug to me: the engine created an OVF_STORE without the VM .ovf file. I think it's the same issue that sometimes causes https://bugzilla.redhat.com/show_bug.cgi?id=1628836

Please see also https://bugzilla.redhat.com/show_bug.cgi?id=1568841

Logs from the host attached.
http://pastebin.test.redhat.com/664922
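For anyone who wants to repeat this check on another environment, here is the same pipeline with the IDs pulled out into shell variables (just a sketch; the UUID values are the ones from this report and must be replaced with your own):

  SP_ID="ebda3e8a-da8d-11e8-8a73-00163e7bb853"    # Data Center / storage pool UUID (from this report)
  SD_ID="c1078875-3b6d-45da-8ee5-6f0eb8efc52c"    # new hosted-engine storage domain UUID
  IMG_ID="5e3bf7b6-26eb-4411-9635-2a476f899661"   # OVF_STORE image UUID
  VOL_ID="abbe3c72-2f45-46db-8f11-750ae5053aa8"   # OVF_STORE volume UUID

  # Prepare the OVF_STORE volume, dump it and list the tar content
  vdsm-client Image prepare storagepoolID=$SP_ID storagedomainID=$SD_ID \
      imageID=$IMG_ID volumeID=$VOL_ID \
    | grep path | awk '{ print $2 }' \
    | xargs -I{} sudo -u vdsm dd if={} \
    | tar -tvf -

A healthy OVF_STORE should list info.json, metadata.json and one <vm_id>.ovf per VM stored on that domain.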
Created attachment 1499425 [details] additional logs from alma04
Engine log: http://pastebin.test.redhat.com/664922

I'm opening this bug separately, per Simone's request in https://bugzilla.redhat.com/show_bug.cgi?id=1568841#c51.
The workflow here is extremely specific. Just to get this right, we:

1. Deploy HE on one host
2. Migrate it to a second host
3. Put it into global maintenance
4. Wipe the second host
5. Try to restore HE from an engine backup on the second host, using a different NFS share than the one originally used

Is this the recommended path for restoring a HE backup (onto a different NFS share)? Does restoring from backup onto a different share properly inform the engine that there's a new SD?
(In reply to Ryan Barry from comment #3)
> The workflow here is extremely specific.

We got it twice on this flow, but I'm not sure it's really specific to this flow.

> Just to get this right, we:
>
> 1. Deploy HE on one host
> 2. Migrate it to a second host
> 3. Put it into global maintenance
> 4. Wipe the second host
> 5. Try to restore HE from an engine backup on the second host, using a
> different NFS share than the one originally used
>
> Is this the recommended path for restoring a HE backup (onto a different NFS
> share)?

It's performed automatically by an Ansible procedure. The engine backup is restored onto a new VM before engine-setup is executed. Then the old hosted-engine storage domain is renamed and a new one is added according to the user's input.

> Does restoring from backup onto a different share properly inform
> the engine that there's a new SD?

Yes, the engine correctly creates the OVF_STORE disks on the new hosted-engine SD and correctly uploads content there via VDSM. The issue is that at the end, *sometimes*, the OVF_STORE doesn't contain any info about the new hosted-engine VM, which has just one disk on the new hosted-engine SD.
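For context, the flow boils down to roughly the following (a sketch, not the exact Ansible tasks; the backup file name is just an example and engine-backup options may vary between versions):

  # On the old engine VM: take a backup of the engine DB and configuration
  engine-backup --mode=backup --file=engine-backup.tar.gz --log=engine-backup.log

  # On the clean host: deploy hosted-engine and restore the backup into the
  # freshly created engine VM, pointing the deployment at a different NFS export
  hosted-engine --deploy --restore-from-file=/root/engine-backup.tar.gz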
so the HE VM is created in engine as a regular VM, then you update its origin to MANAGED_HOSTED_ENGINE and trigger OvfUpdate. I see in the ansible log that it's failing? Either way, it could be that you are seeing the previous OVF from the time before the VM was a MANAGED_HOSTED_ENGINE (before the db update at 13:55:06)
(In reply to Michal Skrivanek from comment #5) > so the HE VM is created in engine as a regular VM, then you update its > origin to MANAGED_HOSTED_ENGINE and trigger OvfUpdate. Correct. > I see in the ansible > log that it's failing? Maybe I'm wrong but everything was fine on that front. > Either way, it could be that you are seeing the > previous OVF from the time before the VM was a MANAGED_HOSTED_ENGINE (before > the db update at 13:55:06) It could be on the first attempt, but we are trying 12 times in 2 minutes before giving up. https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ansible/create_target_vm.yml#L283 and indeed in this case the the 12th attempt failed only at 13:59:48. So I'd expect that at that time the content should be there. 2018-10-29 13:59:48,613+0200 DEBUG var changed: host "localhost" var "ovf_store_content" type "<type 'dict'>" value: "{ "changed": true, "failed": true, "msg": "All items completed", "results": [ { "_ansible_item_label": { "id": "8cf670a2-c573-4a62-9829-07359dd09b84", "image_id": "2abdc4c0-d8d7-40e8-9123-b1eeeb43283b", "name": "OVF_STORE" }, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "attempts": 12, "changed": true, "cmd": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=8cf670a2-c573-4a62-9829-07359dd09b84 volumeID=2abdc4c0-d8d7-40e8-9123-b1eeeb43283b | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "delta": "0:00:00.617159", "end": "2018-10-29 13:57:35.511616", "failed": true, "invocation": { "module_args": { "_raw_params": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=8cf670a2-c573-4a62-9829-07359dd09b84 volumeID=2abdc4c0-d8d7-40e8-9123-b1eeeb43283b | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "item": { "id": "8cf670a2-c573-4a62-9829-07359dd09b84", "image_id": "2abdc4c0-d8d7-40e8-9123-b1eeeb43283b", "name": "OVF_STORE" }, "msg": "non-zero return code", "rc": 2, "start": "2018-10-29 13:57:34.894457", "stderr": "20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000131806 s, 77.7 MB/s\ntar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors", "stderr_lines": [ "20+0 records in", "20+0 records out", "10240 bytes (10 kB) copied, 0.000131806 s, 77.7 MB/s", "tar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive", "tar: Exiting with failure status due to previous errors" ], "stdout": "", "stdout_lines": [] }, { "_ansible_item_label": { "id": "5e3bf7b6-26eb-4411-9635-2a476f899661", "image_id": "abbe3c72-2f45-46db-8f11-750ae5053aa8", "name": "OVF_STORE" }, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "attempts": 12, "changed": true, "cmd": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "delta": "0:00:00.617772", 
"end": "2018-10-29 13:59:47.536466", "failed": true, "invocation": { "module_args": { "_raw_params": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "item": { "id": "5e3bf7b6-26eb-4411-9635-2a476f899661", "image_id": "abbe3c72-2f45-46db-8f11-750ae5053aa8", "name": "OVF_STORE" }, "msg": "non-zero return code", "rc": 2, "start": "2018-10-29 13:59:46.918694", "stderr": "20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000111758 s, 91.6 MB/s\ntar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors", "stderr_lines": [ "20+0 records in", "20+0 records out", "10240 bytes (10 kB) copied, 0.000111758 s, 91.6 MB/s", "tar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive", "tar: Exiting with failure status due to previous errors" ], "stdout": "", "stdout_lines": [] } ] }"
If anyone is wrong, it's most likely me :)

But I see "Not found in archive" in each attempt to check the ovf_store - is that expected?

Does the (wrong) OVF have the expected origin field? I.e. is it from before or after the update?
This sounds race-y if it's only "sometimes". What's the mtime on the files in the SD? Are we sure it's not still being written?

I wonder if 12 tries with a delay of 10 (2 minutes, more or less, unless I've misread the playbook) is not enough, especially if it's still using a raw image at full size instead of a sparse qcow.
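If someone has the environment at hand, the mtime question can be answered with something like this (a sketch; it reuses the same prepare pipeline as above with plain coreutils stat, and the UUIDs are the ones from this report):

  # Show the modification time and path of the prepared OVF_STORE volume
  vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 \
      storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c \
      imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 \
      volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 \
    | grep path | awk '{ print $2 }' \
    | xargs -I{} stat -c '%y %n' {}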
(In reply to Michal Skrivanek from comment #7)
> If anyone is wrong, it's most likely me :)
> But I see "Not found in archive" in each attempt to check the ovf_store - is that expected?

No, that is exactly the issue. The hosted-engine VM got correctly updated:

2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'

and please note that it immediately triggered an OVF_STORE update, since the engine recognised it as the engine VM:

2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update template OVFs in Data Center 'Default'
2018-10-29 13:55:06,594+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated templates OVFs in Data Center 'Default'
2018-10-29 13:55:06,595+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to remove unneeded template/vm OVFs in Data Center 'Default'

We then kept checking the OVF_STORE content with 'tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf' until 13:59:48, but bffc5624-14c2-4f66-9854-33e981604ca5.ovf never landed there (note that bffc5624-14c2-4f66-9854-33e981604ca5.ovf is the expected file name): the tar archive never contained a .ovf file for the new engine VM. By the way, I manually checked the OVF_STORE content after 30 or 40 minutes and it was still without any .ovf files inside.

> Does the (wrong) OVF have the expected origin field? I.e. is it from before
> or after the update?

?
(In reply to Ryan Barry from comment #8)
> This sounds race-y if it's only "sometimes".

Yes, I think so. I fear that the engine could start uploading the OVF_STORE content before computing all the .ovf files, or something like that.

> What's the mtime on the files in the SD? Are we sure it's not still being
> written? I wonder if 12 tries with a delay of 10 (2 minutes, more or less,
> unless I've misread the playbook) is not enough, especially if it's still
> using a raw image at full size instead of a sparse qcow.

The OVF_STORE volumes are 1 GB in size, but the tar archive inside is just a few KB (a single VM at that time!). By the way, I manually checked the OVF_STORE content after 30 or 40 minutes and the result was just:

[root@alma04 ~]# vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -xvf -
20+0 records in
20+0 records out
10240 bytes (10 kB) copied, 0,000141984 s, 72,1 MB/s
info.json
metadata.json
[root@alma04 ~]# cat info.json
{ "Last Updated" : "Mon Oct 29 13:55:37 IST 2018", "Storage Domains" : [ { "uuid" : "c1078875-3b6d-45da-8ee5-6f0eb8efc52c" } ] }
[root@alma04 ~]# cat metadata.json
{ "VM Status" : { "f51ec4cd-5ab2-4a58-b13e-15314c8b296d" : 1 } }

without any .ovf file inside. So the engine correctly sent a well-formed tar archive to VDSM, and VDSM (on another host in this case) correctly wrote it, but the tar archive was simply lacking the .ovf file for our VM.
Ok, well at least we know it's not a network issue...

Nikolai, do we have an environment with a reproducer we can set up, instead of engineering going through 10+ steps? It may be helpful to see if the file handle is still open, etc., and hopefully make writing a patch much faster.

We've discussed QE and engineering working hand-in-hand with Meital, and this looks like a good candidate (or at least get me/Shmuel to the final step) due to the number of steps necessary to get to a "sometimes" reproducer.
(In reply to Ryan Barry from comment #11)
> Ok, well at least we know it's not a network issue...
>
> Nikolai, do we have an environment with a reproducer we can set up, instead
> of engineering going through 10+ steps? It may be helpful to see if the file
> handle is still open, etc., and hopefully make writing a patch much faster.
>
> We've discussed QE and engineering working hand-in-hand with Meital, and
> this looks like a good candidate (or at least get me/Shmuel to the final
> step) due to the number of steps necessary to get to a "sometimes" reproducer.

An environment with the reproducer is not available at the moment. I will try to reproduce it and, if successful, I will send an email.

You may be able to get the reproducer by running a restore and, during the restore, wiping out the old hosted-storage, e.g. "rm -rf /mnt/nsednev_he_1/*". That's how I got the issue reproduced last time.
Can you reconsider targeting to 4.2.8?
Looking at the locking sequence of that VM update, it seems the OVF is being written while the UpdateVmCommand still holds the lock on the VM. ProcessOvfUpdateForStoragePoolCommand locks only the pool, not the VM, so it's most likely just reading the old data prior to the update - that explains why isHostedEngine() returns true in UpdateVmCommand (which triggers the immediate OVF update) but returns false in LibvirtVmXmlBuilder when writing the OVF.

Try to wait 60 minutes and check afterwards, or lower the OvfUpdateIntervalInMinutes parameter and check after the next cycle.
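For the second option, roughly (assuming the usual engine-config behaviour: the value is in minutes and ovirt-engine typically needs a restart to pick up the change):

  # Check the current OVF update interval
  engine-config -g OvfUpdateIntervalInMinutes

  # Lower it to 1 minute for the test, then restart the engine
  engine-config -s OvfUpdateIntervalInMinutes=1
  systemctl restart ovirt-engine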
Andrej, it seems to be your code originally, can you think of a better solution with Simone?
(In reply to Michal Skrivanek from comment #14)
> Try to wait 60 minutes and check afterwards, or lower the
> OvfUpdateIntervalInMinutes parameter and check after the next cycle.

Adding up to 60 more minutes to the hosted-engine deployment just for this is definitely not an option. We can instead evaluate lowering OvfUpdateIntervalInMinutes during hosted-engine deployment, although I'd definitely prefer a proper fix that avoids the race condition altogether through proper locking.

I suspect that this can also *sometimes* cause https://bugzilla.redhat.com/1628836
It is likely causing that one as well. This is specific to HE and not really part of Virt. Looking at the code, using "1" minute would work quite trivially for the first attempt (instead of the current 0).

Adding back the needinfo on Andrej.
Looking at the logs, the UpdateVmCommand releases the VM lock before ProcessOvfUpdateForStoragePoolCommand tries to read the VM from the DB:

2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'
2018-10-29 13:55:06,474+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update VM OVFs in Data Center 'Default'

So it seems the VM is not changed in the middle of writing the OVF. Looking at the code, the OVF is not written to the storage if the VM generation number changes in the middle of ProcessOvfUpdateForStoragePoolCommand, which does not seem to happen in this case. I will look into it more...
It's not freed five lines above; before the log lines "Attempting to update VMs/Templates Ovf." and "Attempting to update ovfs in domain in Data Center Default", the lock is held and there's a performOvfUpdateForStoragePool() invocation. So I'm not sure...
I meant *between* those two log lines
The ProcessOvfUpdateForStoragePoolCommand does not read the VM from DB before outputting the log line: 'Attempting to update VM OVFs in Data Center ...', so there should not be a race. But it looks like a near race. I'm in the middle of reproducing it, so we can debug this.
The pull request on GitHub is here: https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/51
*** Bug 1662912 has been marked as a duplicate of this bug. ***
Tested it on these components and it works just fine:
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
rhvm-appliance-4.2-20181212.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Also tested on the following components and it works just fine:
ovirt-engine-setup-4.2.8.2-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
*** Bug 1668000 has been marked as a duplicate of this bug. ***
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hit this bug again with:
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Also look at https://bugzilla.redhat.com/show_bug.cgi?id=1668000, which was closed as a duplicate of this bug.
In my opinion, this is not a blocker. We don't have a reliable root cause, and it's been open for months without a ticket.

Nikolai, how often does this happen?

Simone, is there a workaround?
(In reply to Ryan Barry from comment #30)
> In my opinion, this is not a blocker.

I tend to agree.

> We don't have a reliable root cause, and it's been open for months without a
> ticket.
>
> Nikolai, how often does this happen?

Really seldom. We got it reported 4 or 5 times across CI, QE and upstream users over hundreds and hundreds of executions.

> Simone, is there a workaround?

We already implemented a workaround with https://gerrit.ovirt.org/96648/, reducing OvfUpdateIntervalInMinutes to 1 minute during the deployment. hosted-engine-setup waits for two minutes, so the engine can now try twice, but it seems that's still not enough. Maybe we can increase the waiting period in hosted-engine-setup to 5 minutes or something like that, but if the engine missed it twice in a row, the issue is probably somewhere else.
(In reply to Ryan Barry from comment #30)
> In my opinion, this is not a blocker.
>
> We don't have a reliable root cause, and it's been open for months without a
> ticket.
>
> Nikolai, how often does this happen?
>
> Simone, is there a workaround?

Happened 4 times in a row on different servers.
Quoting from previous comments:

"how often does this happen?
Really seldom.
We got it reported 4 or 5 times across CI, QE and upstream users over hundreds and hundreds of executions."

"Happened 4 times in a row on different servers."

and now Simone said "We already implemented a workaround with https://gerrit.ovirt.org/96648/" - was any of those 4 cases reproduced on a build with that patch already in (it's been merged for 7 weeks)?
(In reply to Michal Skrivanek from comment #34)
> Quoting from previous comments:
>
> "how often does this happen?
> Really seldom.
> We got it reported 4 or 5 times across CI, QE and upstream users over
> hundreds and hundreds of executions."
>
> "Happened 4 times in a row on different servers."
>
> and now Simone said "We already implemented a workaround with
> https://gerrit.ovirt.org/96648/" - was any of those 4 cases reproduced on a
> build with that patch already in (it's been merged for 7 weeks)?

Yesterday it worked for me just fine and the restore passed, and a few days ago too. We may close this as WORKSFORME for the moment, until we get it reproduced again, if you wish. It happens randomly and inconsistently.
sure. thanks