Created attachment 1499424 [details]
sosreport from host alma04

Description of problem:
The engine does not correctly construct the OVF_STORE disks' content.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.19-0.0.master.20181002122327.20181002122322.gitb449616.el7.noarch
ovirt-hosted-engine-setup-2.2.29-0.0.master.20181002122252.git9ae169e.el7.noarch
ovirt-engine-appliance-4.2-20181013.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deployed node0 over NFS.
2. Attached one ha-host.
3. Added one ISO domain.
4. Added one NFS data storage domain.
5. Created one guest VM on alma03 (SPM host).
6. Migrated the HE VM to alma04 (non-SPM host).
7. Set the environment into global maintenance.
8. Retrieved backup files from the engine.
9. Reprovisioned alma04 (non-SPM).
10. Copied the backup file to the clean alma04.
11. Installed ovirt-hosted-engine-setup on alma04.
12. Tried to restore the engine during deployment using "hosted-engine --deploy --restore-from-file=/root/nsednev_from_alma04_rhevm_4_2_7_nospm" on alma04, on another NFS mount point.
13. The deployment failed.

Additional info:
Let's focus on the latest engine.log.

The engine VM has been correctly created:

2018-10-29 13:53:58,116+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-27) [] EVENT_ID: USER_ADD_VM(34), VM HostedEngine was created by admin@internal-authz.

and editing it correctly triggered an OVF update:

2018-10-29 13:55:06,456+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] EVENT_ID: USER_UPDATE_VM(35), VM HostedEngine configuration was updated by admin@internal-authz.
2018-10-29 13:55:06,460+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Before acquiring and wait lock 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,461+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Lock-wait acquired to object 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,462+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true.
Entities affected : ID: ebda3e8a-da8d-11e8-8a73-00163e7bb853 Type: StoragePool
2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'
2018-10-29 13:55:06,474+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,543+02 WARN [org.ovirt.engine.core.vdsbroker.builder.vminfo.VmInfoBuildUtils] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] No host NUMA nodes found for vm HostedEngine (bffc5624-14c2-4f66-9854-33e981604ca5)
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update template OVFs in Data Center 'Default'
2018-10-29 13:55:06,594+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated templates OVFs in Data Center 'Default'
2018-10-29 13:55:06,595+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to remove unneeded template/vm OVFs in Data Center 'Default'
2018-10-29 13:55:06,615+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully removed unneeded template/vm OVFs in Data Center 'Default'
2018-10-29 13:55:06,616+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Lock freed to object 'EngineLock:{exclusiveLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]', sharedLocks=''}'
2018-10-29 13:55:06,666+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] Lock Acquired to object 'EngineLock:{exclusiveLocks='[bb2288b9-0303-483c-888b-2bc4c4c6a351=STORAGE]', sharedLocks='[ebda3e8a-da8d-11e8-8a73-00163e7bb853=OVF_UPDATE]'}'
2018-10-29 13:55:06,673+02 WARN [org.ovirt.engine.core.vdsbroker.builder.vminfo.VmInfoBuildUtils] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] No host NUMA nodes found for vm HostedEngine (bffc5624-14c2-4f66-9854-33e981604ca5)
2018-10-29 13:55:06,733+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true.
Entities affected : ID: bb2288b9-0303-483c-888b-2bc4c4c6a351 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2018-10-29 13:55:06,772+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] START, SetVolumeDescriptionVDSCommand( SetVolumeDescriptionVDSCommandParameters:{storagePoolId='ebda3e8a-da8d-11e8-8a73-00163e7bb853', ignoreFailoverLimit='false', storageDomainId='bb2288b9-0303-483c-888b-2bc4c4c6a351', imageGroupId='aede6f87-ed75-4f5b-92f8-776cc62bdf3e', imageId='994c1f99-77bc-4df9-a686-8c2ba197211f'}), log id: 532068ec
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] -- executeIrsBrokerCommand: calling 'setVolumeDescription', parameters:
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ spUUID=ebda3e8a-da8d-11e8-8a73-00163e7bb853
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ sdUUID=bb2288b9-0303-483c-888b-2bc4c4c6a351
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ imageGroupGUID=aede6f87-ed75-4f5b-92f8-776cc62bdf3e
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ volUUID=994c1f99-77bc-4df9-a686-8c2ba197211f
2018-10-29 13:55:06,773+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ description={"Updated":false,"Last Updated":"Sun Oct 28 12:12:24 IST 2018","Storage Domains":[{"uuid":"bb2288b9-0303-483c-888b-2bc4c4c6a351"}],"Disk Description":"OVF_STORE"}
2018-10-29 13:55:06,816+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SetVolumeDescriptionVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] FINISH, SetVolumeDescriptionVDSCommand, log id: 532068ec

and the OVF_STORE content has been uploaded through alma03, which was the SPM host (the user was redeploying on alma04):

2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] START, UploadStreamVDSCommand(HostName = alma03.qa.lab.tlv.redhat.com, UploadStreamVDSCommandParameters:{hostId='f2897fa0-0372-4fdb-9088-475002bb2a91'}), log id: 478e0b1e
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] -- executeVdsBrokerCommand, parameters:
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ spUUID=ebda3e8a-da8d-11e8-8a73-00163e7bb853
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ sdUUID=bb2288b9-0303-483c-888b-2bc4c4c6a351
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ imageGUID=aede6f87-ed75-4f5b-92f8-776cc62bdf3e
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ volUUID=994c1f99-77bc-4df9-a686-8c2ba197211f
2018-10-29 13:55:06,958+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] ++ size=30720
2018-10-29 13:55:07,215+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [609dcd76] FINISH, UploadStreamVDSCommand, log id: 478e0b1e

Everything went as expected on alma03:

2018-10-29 13:55:44,905+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] Ending command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' successfully.
2018-10-29 13:55:44,913+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedThreadFactory-engine-Thread-262) [5811be50] Ending command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' successfully.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'UploadStream' completed, handling the result.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] CommandAsyncTask::HandleEndActionResult [within thread]: endAction for action type 'UploadStream' succeeded, clearing tasks.
2018-10-29 13:55:44,924+02 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedThreadFactory-engine-Thread-261) [5811be50] SPMAsyncTask::ClearAsyncTask: Attempting to clear task '920dafb3-98eb-43d8-9683-81c59117064c'

but then, if we check the content of the OVF_STORE volume on the host side:

[root@alma04 ~]# vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -xvf -
20+0 records in
20+0 records out
10240 bytes (10 kB) copied, 0,000141984 s, 72,1 MB/s
info.json
metadata.json
[root@alma04 ~]# cat info.json
{ "Last Updated" : "Mon Oct 29 13:55:37 IST 2018", "Storage Domains" : [ { "uuid" : "c1078875-3b6d-45da-8ee5-6f0eb8efc52c" } ] }
[root@alma04 ~]# cat metadata.json
{ "VM Status" : { "f51ec4cd-5ab2-4a58-b13e-15314c8b296d" : 1 } }

This seems an engine bug to me: the engine created an OVF_STORE without the VM .ovf file. I think it's the same issue that sometimes causes https://bugzilla.redhat.com/show_bug.cgi?id=1628836

Please see also https://bugzilla.redhat.com/show_bug.cgi?id=1568841

Logs from the host attached.
http://pastebin.test.redhat.com/664922
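For anyone who wants to repeat this check on another environment, here is the same pipeline with the IDs pulled out into shell variables (just a sketch; the UUID values are the ones from this report and must be replaced with your own):

  SP_ID="ebda3e8a-da8d-11e8-8a73-00163e7bb853"    # Data Center / storage pool UUID (from this report)
  SD_ID="c1078875-3b6d-45da-8ee5-6f0eb8efc52c"    # new hosted-engine storage domain UUID
  IMG_ID="5e3bf7b6-26eb-4411-9635-2a476f899661"   # OVF_STORE image UUID
  VOL_ID="abbe3c72-2f45-46db-8f11-750ae5053aa8"   # OVF_STORE volume UUID

  # Prepare the OVF_STORE volume, dump it and list the tar content
  vdsm-client Image prepare storagepoolID=$SP_ID storagedomainID=$SD_ID \
      imageID=$IMG_ID volumeID=$VOL_ID \
    | grep path | awk '{ print $2 }' \
    | xargs -I{} sudo -u vdsm dd if={} \
    | tar -tvf -

A healthy OVF_STORE should list info.json, metadata.json and one <vm_id>.ovf per VM stored on that domain.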
Created attachment 1499425 [details] additional logs from alma04
Engine log: http://pastebin.test.redhat.com/664922

I'm opening this bug separately, per Simone's request in https://bugzilla.redhat.com/show_bug.cgi?id=1568841#c51.
The workflow here is extremely specific. Just to get this right, we:

1. Deploy HE on one host
2. Migrate it to a second host
3. Put it into global maintenance
4. Wipe the second host
5. Try to restore HE from an engine backup on the second host, using a different NFS share than the one originally used

Is this the recommended path for restoring a HE backup (onto a different NFS share)? Does restoring from backup onto a different share properly inform the engine that there's a new SD?
(In reply to Ryan Barry from comment #3)
> The workflow here is extremely specific.

We got it twice on this flow, but I'm not sure it's really specific to this flow.

> Just to get this right, we:
>
> 1. Deploy HE on one host
> 2. Migrate it to a second host
> 3. Put it into global maintenance
> 4. Wipe the second host
> 5. Try to restore HE from an engine backup on the second host, using a
> different NFS share than the one originally used
>
> Is this the recommended path for restoring a HE backup (onto a different NFS
> share)?

It's performed automatically by an Ansible procedure. The engine backup is restored onto a new VM before engine-setup is executed. Then the old hosted-engine storage domain is renamed and a new one is added according to the user's input.

> Does restoring from backup onto a different share properly inform
> the engine that there's a new SD?

Yes, the engine correctly creates the OVF_STORE disks on the new hosted-engine SD and correctly uploads content there via VDSM. The issue is that at the end, *sometimes*, the OVF_STORE doesn't contain any info about the new hosted-engine VM, which has just one disk on the new hosted-engine SD.
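For context, the flow boils down to roughly the following (a sketch, not the exact Ansible tasks; the backup file name is just an example and engine-backup options may vary between versions):

  # On the old engine VM: take a backup of the engine DB and configuration
  engine-backup --mode=backup --file=engine-backup.tar.gz --log=engine-backup.log

  # On the clean host: deploy hosted-engine and restore the backup into the
  # freshly created engine VM, pointing the deployment at a different NFS export
  hosted-engine --deploy --restore-from-file=/root/engine-backup.tar.gz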
so the HE VM is created in engine as a regular VM, then you update its origin to MANAGED_HOSTED_ENGINE and trigger OvfUpdate. I see in the ansible log that it's failing? Either way, it could be that you are seeing the previous OVF from the time before the VM was a MANAGED_HOSTED_ENGINE (before the db update at 13:55:06)
(In reply to Michal Skrivanek from comment #5) > so the HE VM is created in engine as a regular VM, then you update its > origin to MANAGED_HOSTED_ENGINE and trigger OvfUpdate. Correct. > I see in the ansible > log that it's failing? Maybe I'm wrong but everything was fine on that front. > Either way, it could be that you are seeing the > previous OVF from the time before the VM was a MANAGED_HOSTED_ENGINE (before > the db update at 13:55:06) It could be on the first attempt, but we are trying 12 times in 2 minutes before giving up. https://github.com/oVirt/ovirt-hosted-engine-setup/blob/master/src/ansible/create_target_vm.yml#L283 and indeed in this case the the 12th attempt failed only at 13:59:48. So I'd expect that at that time the content should be there. 2018-10-29 13:59:48,613+0200 DEBUG var changed: host "localhost" var "ovf_store_content" type "<type 'dict'>" value: "{ "changed": true, "failed": true, "msg": "All items completed", "results": [ { "_ansible_item_label": { "id": "8cf670a2-c573-4a62-9829-07359dd09b84", "image_id": "2abdc4c0-d8d7-40e8-9123-b1eeeb43283b", "name": "OVF_STORE" }, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "attempts": 12, "changed": true, "cmd": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=8cf670a2-c573-4a62-9829-07359dd09b84 volumeID=2abdc4c0-d8d7-40e8-9123-b1eeeb43283b | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "delta": "0:00:00.617159", "end": "2018-10-29 13:57:35.511616", "failed": true, "invocation": { "module_args": { "_raw_params": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=8cf670a2-c573-4a62-9829-07359dd09b84 volumeID=2abdc4c0-d8d7-40e8-9123-b1eeeb43283b | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "item": { "id": "8cf670a2-c573-4a62-9829-07359dd09b84", "image_id": "2abdc4c0-d8d7-40e8-9123-b1eeeb43283b", "name": "OVF_STORE" }, "msg": "non-zero return code", "rc": 2, "start": "2018-10-29 13:57:34.894457", "stderr": "20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000131806 s, 77.7 MB/s\ntar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors", "stderr_lines": [ "20+0 records in", "20+0 records out", "10240 bytes (10 kB) copied, 0.000131806 s, 77.7 MB/s", "tar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive", "tar: Exiting with failure status due to previous errors" ], "stdout": "", "stdout_lines": [] }, { "_ansible_item_label": { "id": "5e3bf7b6-26eb-4411-9635-2a476f899661", "image_id": "abbe3c72-2f45-46db-8f11-750ae5053aa8", "name": "OVF_STORE" }, "_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "attempts": 12, "changed": true, "cmd": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "delta": "0:00:00.617772", 
"end": "2018-10-29 13:59:47.536466", "failed": true, "invocation": { "module_args": { "_raw_params": "vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true } }, "item": { "id": "5e3bf7b6-26eb-4411-9635-2a476f899661", "image_id": "abbe3c72-2f45-46db-8f11-750ae5053aa8", "name": "OVF_STORE" }, "msg": "non-zero return code", "rc": 2, "start": "2018-10-29 13:59:46.918694", "stderr": "20+0 records in\n20+0 records out\n10240 bytes (10 kB) copied, 0.000111758 s, 91.6 MB/s\ntar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive\ntar: Exiting with failure status due to previous errors", "stderr_lines": [ "20+0 records in", "20+0 records out", "10240 bytes (10 kB) copied, 0.000111758 s, 91.6 MB/s", "tar: bffc5624-14c2-4f66-9854-33e981604ca5.ovf: Not found in archive", "tar: Exiting with failure status due to previous errors" ], "stdout": "", "stdout_lines": [] } ] }"
If anyone is wrong, it's most likely me :)

But I see "Not found in archive" in each attempt to check the ovf_store - is that expected?

Does the (wrong) OVF have the expected origin field? I.e. is it from before or after the update?
This sounds race-y if it's only "sometimes". What's the mtime on the files in the SD? Are we sure it's not still being written?

I wonder if 12 tries with a delay of 10 (2 minutes, more or less, unless I've misread the playbook) is not enough, especially if it's still using a raw image at full size instead of a sparse qcow.
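If someone has the environment at hand, the mtime question can be answered with something like this (a sketch; it reuses the same prepare pipeline as above with plain coreutils stat, and the UUIDs are the ones from this report):

  # Show the modification time and path of the prepared OVF_STORE volume
  vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 \
      storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c \
      imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 \
      volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 \
    | grep path | awk '{ print $2 }' \
    | xargs -I{} stat -c '%y %n' {}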
(In reply to Michal Skrivanek from comment #7)
> If anyone is wrong, it's most likely me :)
> But I see "Not found in archive" in each attempt to check the ovf_store - is that expected?

No, that is exactly the issue. The hosted-engine VM got correctly updated:

2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'

and please note that it immediately triggered an OVF_STORE update, since the engine recognised it as the engine VM:

2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated VM OVFs in Data Center 'Default'
2018-10-29 13:55:06,590+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update template OVFs in Data Center 'Default'
2018-10-29 13:55:06,594+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Successfully updated templates OVFs in Data Center 'Default'
2018-10-29 13:55:06,595+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to remove unneeded template/vm OVFs in Data Center 'Default'

We then kept checking the OVF_STORE content with 'tar -tvf - bffc5624-14c2-4f66-9854-33e981604ca5.ovf' until 13:59:48, but bffc5624-14c2-4f66-9854-33e981604ca5.ovf never landed there (note that bffc5624-14c2-4f66-9854-33e981604ca5.ovf is the expected file name): the tar archive never contained a .ovf file for the new engine VM. By the way, I manually checked the OVF_STORE content after 30 or 40 minutes and it was still without any .ovf files inside.

> Does the (wrong) OVF have the expected origin field? I.e. is it from before
> or after the update?

?
(In reply to Ryan Barry from comment #8)
> This sounds race-y if it's only "sometimes".

Yes, I think so. I fear that the engine could start uploading the OVF_STORE content before computing all the .ovf files, or something like that.

> What's the mtime on the files in the SD? Are we sure it's not still being
> written? I wonder if 12 tries with a delay of 10 (2 minutes, more or less,
> unless I've misread the playbook) is not enough, especially if it's still
> using a raw image at full size instead of a sparse qcow.

The OVF_STORE volumes are 1 GB in size, but the tar archive inside is just a few KB (a single VM at that time!). By the way, I manually checked the OVF_STORE content after 30 or 40 minutes and the result was just:

[root@alma04 ~]# vdsm-client Image prepare storagepoolID=ebda3e8a-da8d-11e8-8a73-00163e7bb853 storagedomainID=c1078875-3b6d-45da-8ee5-6f0eb8efc52c imageID=5e3bf7b6-26eb-4411-9635-2a476f899661 volumeID=abbe3c72-2f45-46db-8f11-750ae5053aa8 | grep path | awk '{ print $2 }' | xargs -I{} sudo -u vdsm dd if={} | tar -xvf -
20+0 records in
20+0 records out
10240 bytes (10 kB) copied, 0,000141984 s, 72,1 MB/s
info.json
metadata.json
[root@alma04 ~]# cat info.json
{ "Last Updated" : "Mon Oct 29 13:55:37 IST 2018", "Storage Domains" : [ { "uuid" : "c1078875-3b6d-45da-8ee5-6f0eb8efc52c" } ] }
[root@alma04 ~]# cat metadata.json
{ "VM Status" : { "f51ec4cd-5ab2-4a58-b13e-15314c8b296d" : 1 } }

without any .ovf file inside. So the engine correctly sent a well-formed tar archive to VDSM, and VDSM (on another host in this case) correctly wrote it, but the tar archive was simply lacking the .ovf file for our VM.
Ok, well at least we know it's not a network issue...

Nikolai, do we have an environment with a reproducer we can set up, instead of engineering going through 10+ steps? It may be helpful to see if the file handle is still open, etc., and hopefully make writing a patch much faster.

We've discussed QE and engineering working hand-in-hand with Meital, and this looks like a good candidate (or at least get me/Shmuel to the final step) due to the number of steps necessary to get to a "sometimes" reproducer.
(In reply to Ryan Barry from comment #11)
> Ok, well at least we know it's not a network issue...
>
> Nikolai, do we have an environment with a reproducer we can set up, instead
> of engineering going through 10+ steps? It may be helpful to see if the file
> handle is still open, etc., and hopefully make writing a patch much faster.
>
> We've discussed QE and engineering working hand-in-hand with Meital, and
> this looks like a good candidate (or at least get me/Shmuel to the final
> step) due to the number of steps necessary to get to a "sometimes" reproducer.

An environment with the reproducer is not available at the moment. I will try to reproduce it and, if successful, I will send an email.

You may be able to get the reproducer by running a restore and, during the restore, wiping out the old hosted-storage, e.g. "rm -rf /mnt/nsednev_he_1/*". That's how I got the issue reproduced last time.
Can you reconsider targeting to 4.2.8?
Looking at the locking sequence of that VM update, it seems the OVF is being written while the UpdateVmCommand still holds the lock on the VM. ProcessOvfUpdateForStoragePoolCommand locks only the pool, not the VM, so it's most likely just reading the old data prior to the update - that explains why isHostedEngine() returns true in UpdateVmCommand (which triggers the immediate OVF update) but returns false in LibvirtVmXmlBuilder when writing the OVF.

Try to wait 60 minutes and check afterwards, or lower the OvfUpdateIntervalInMinutes parameter and check after the next cycle.
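For the second option, roughly (assuming the usual engine-config behaviour: the value is in minutes and ovirt-engine typically needs a restart to pick up the change):

  # Check the current OVF update interval
  engine-config -g OvfUpdateIntervalInMinutes

  # Lower it to 1 minute for the test, then restart the engine
  engine-config -s OvfUpdateIntervalInMinutes=1
  systemctl restart ovirt-engine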
Andrej, it seems to be your code originally, can you think of a better solution with Simone?
(In reply to Michal Skrivanek from comment #14)
> Try to wait 60 minutes and check afterwards, or lower the
> OvfUpdateIntervalInMinutes parameter and check after the next cycle.

Adding up to 60 more minutes to the hosted-engine deployment just for this is definitely not an option. We can instead evaluate lowering OvfUpdateIntervalInMinutes during hosted-engine deployment, although I'd definitely prefer a proper fix that avoids the race condition altogether through proper locking.

I suspect that this can also *sometimes* cause https://bugzilla.redhat.com/1628836
It is likely causing that one as well. This is specific to HE and not really part of Virt. Looking at the code, using "1" minute would work quite trivially for the first attempt (instead of the current 0).

Adding back the needinfo on Andrej.
Looking at the logs, the UpdateVmCommand releases the VM lock before ProcessOvfUpdateForStoragePoolCommand tries to read the VM from the DB:

2018-10-29 13:55:06,465+02 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (default task-57) [2042de45-2539-4a2a-bd04-f4d4902f2a12] Lock freed to object 'EngineLock:{exclusiveLocks='[HostedEngine=VM_NAME]', sharedLocks='[bffc5624-14c2-4f66-9854-33e981604ca5=VM]'}'
2018-10-29 13:55:06,474+02 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-92) [4bdcdb1] Attempting to update VM OVFs in Data Center 'Default'

So it seems the VM is not changed in the middle of writing the OVF. Looking at the code, the OVF is not written to the storage if the VM generation number changes in the middle of ProcessOvfUpdateForStoragePoolCommand, which does not seem to happen in this case. I will look into it more...
It's not freed five lines above; before the log lines "Attempting to update VMs/Templates Ovf." and "Attempting to update ovfs in domain in Data Center Default", the lock is held and there's a performOvfUpdateForStoragePool() invocation. So I'm not sure...
I meant *between* those two log lines
The ProcessOvfUpdateForStoragePoolCommand does not read the VM from DB before outputting the log line: 'Attempting to update VM OVFs in Data Center ...', so there should not be a race. But it looks like a near race. I'm in the middle of reproducing it, so we can debug this.
The pull request on GitHub is here: https://github.com/oVirt/ovirt-ansible-hosted-engine-setup/pull/51
*** Bug 1662912 has been marked as a duplicate of this bug. ***
Tested it on these components and it works just fine:
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.32-1.el7ev.noarch
rhvm-appliance-4.2-20181212.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Also tested on the following components and it works just fine:
ovirt-engine-setup-4.2.8.2-0.1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)
*** Bug 1668000 has been marked as a duplicate of this bug. ***
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
Hit this bug again with:
ovirt-hosted-engine-ha-2.2.19-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.33-1.el7ev.noarch
rhvm-appliance-4.2-20190108.0.el7.noarch
Linux 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Also look at https://bugzilla.redhat.com/show_bug.cgi?id=1668000, which was closed as a duplicate of this bug.
In my opinion, this is not a blocker. We don't have a reliable root cause, and it's been open for months without a ticket.

Nikolai, how often does this happen?

Simone, is there a workaround?
(In reply to Ryan Barry from comment #30)
> In my opinion, this is not a blocker.

I tend to agree.

> We don't have a reliable root cause, and it's been open for months without a
> ticket.
>
> Nikolai, how often does this happen?

Really seldom. We got it reported 4 or 5 times across CI, QE and upstream users over hundreds and hundreds of executions.

> Simone, is there a workaround?

We already implemented a workaround with https://gerrit.ovirt.org/96648/, reducing OvfUpdateIntervalInMinutes to 1 minute during the deployment. hosted-engine-setup waits for two minutes, so the engine can now try twice, but it seems that's still not enough. Maybe we can increase the waiting period in hosted-engine-setup to 5 minutes or something like that, but if the engine missed it twice in a row, the issue is probably somewhere else.
(In reply to Ryan Barry from comment #30)
> In my opinion, this is not a blocker.
>
> We don't have a reliable root cause, and it's been open for months without a
> ticket.
>
> Nikolai, how often does this happen?
>
> Simone, is there a workaround?

Happened 4 times in a row on different servers.
Quoting from previous comments:

"how often does this happen?
Really seldom.
We got it reported 4 or 5 times across CI, QE and upstream users over hundreds and hundreds of executions."

"Happened 4 times in a row on different servers."

and now Simone said "We already implemented a workaround with https://gerrit.ovirt.org/96648/" - was any of those 4 cases reproduced on a build with that patch already in (it's been merged for 7 weeks)?
(In reply to Michal Skrivanek from comment #34)
> Quoting from previous comments:
>
> "how often does this happen?
> Really seldom.
> We got it reported 4 or 5 times across CI, QE and upstream users over
> hundreds and hundreds of executions."
>
> "Happened 4 times in a row on different servers."
>
> and now Simone said "We already implemented a workaround with
> https://gerrit.ovirt.org/96648/" - was any of those 4 cases reproduced on a
> build with that patch already in (it's been merged for 7 weeks)?

Yesterday it worked for me just fine and the restore passed, and a few days ago too. We may close this as WORKSFORME for the moment, until we get it reproduced again, if you wish. It happens randomly and inconsistently.
sure. thanks