Bug 1639604 - engine fails to import external VMs

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.2.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-4.3.0
Target Release:	---
Assignee:	Ryan Barry
QA Contact:	Liran Rotenberg
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1649615

Reported:	2018-10-16 08:03 UTC by Liran Rotenberg
Modified:	2019-02-13 07:45 UTC
CC List:	5 users
Fixed In Version:	ovirt-engine-4.3.0_rc
Clone Of:
Clones:	1649615
Environment:
Last Closed:	2019-02-13 07:45:11 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	rule-engine: ovirt-4.3+ rule-engine: blocker+

Attachments
engine-sosreport (9.11 MB, application/x-xz) 2018-10-16 08:05 UTC, Liran Rotenberg	no flags
host-sosreport (10.20 MB, application/x-xz) 2018-10-16 08:06 UTC, Liran Rotenberg	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	95433	0	'None'	MERGED	core: Fix importing of unmanaged VMs	2020-08-21 13:40:20 UTC
oVirt gerrit	95578	0	'None'	MERGED	core: Fix importing of unmanaged VMs	2020-08-21 13:40:20 UTC

Description Liran Rotenberg 2018-10-16 08:03:58 UTC

Description of problem:
After a vintage deployment, adding a storage domain does not import the hosted-engine storage domain or the HE VM into the environment.

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.7.5-0.0.master.20180530161905.gitc423dec.el7.noarch
ovirt-engine-appliance-4.2-20181014.1.el7.noarch
ovirt-imageio-common-1.4.5-0.el7.x86_64
ovirt-release42-snapshot-4.2.7-0.2.rc2.20181014014958.gitfb30674.el7.noarch
ovirt-release42-4.2.7-0.2.rc2.20181014014958.gitfb30674.el7.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20180209.gite99bbd1.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.9-2.20181004git4d189a6.el7.x86_64
ovirt-provider-ovn-driver-1.2.17-0.20181003135950.git6aa6b37.el7.noarch
cockpit-ovirt-dashboard-0.11.35-1.el7.noarch
ovirt-setup-lib-1.1.6-0.0.master.20180921125403.git90612e6.el7.noarch
ovirt-vmconsole-host-1.0.6-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-hosted-engine-setup-2.2.29-0.0.master.20181002122252.git9ae169e.el7.noarch
cockpit-machines-ovirt-176-1.el7.noarch
ovirt-imageio-daemon-1.4.5-0.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-vmconsole-1.0.6-1.el7.noarch
ovirt-hosted-engine-ha-2.2.19-0.0.master.20181002122327.20181002122322.gitb449616.el7.noarch
vdsm-4.20.42-4.git43e2555.el7.x86_64
libvirt-4.5.0-10.el7.x86_64
glusterfs-cli-3.12.14-1.el7.x86_64
glusterfs-fuse-3.12.14-1.el7.x86_64
glusterfs-client-xlators-3.12.14-1.el7.x86_64
glusterfs-libs-3.12.14-1.el7.x86_64
glusterfs-3.12.14-1.el7.x86_64
glusterfs-rdma-3.12.14-1.el7.x86_64
glusterfs-api-3.12.14-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7.x86_64

ovirt-engine-4.2.7.3-0.0.master.20181012152958.gitfc1595b.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy an HE environment using the vintage flow (I saw it on Gluster storage).
2. Add storage domain to the environment.

Actual results:
The storage domain is added after step 2. No other storage is added, and the HE VM is not shown in the web UI.
On the engine, the following task appears every ~15 seconds:
"Adding unmanaged VMs running on Host ocelot05.qa.lab.tlv.redhat.com to Cluster Default".

Expected results:
After adding a storage domain, the auto import should activate and succeed. Another storage domain, the storage of the hosted engine, should be added, and the HE VM should be shown in the environment.

Additional info:
It is possible to manually add the HE VM's storage domain (using import), but the HE VM is still not shown in the environment, and the Virtual Machines tab of that storage domain lists no VMs. In that case the engine keeps running the task above ("Adding unmanaged VMs...").

Comment 1 Liran Rotenberg 2018-10-16 08:05:40 UTC

Created attachment 1494278
engine-sosreport

Comment 2 Liran Rotenberg 2018-10-16 08:06:26 UTC

Created attachment 1494279
host-sosreport

Comment 3 Simone Tiraboschi 2018-10-16 08:51:48 UTC

This happens on engine side: the engine continuously scans for the external VMs on the host but it never imports them.

For instance, in engine.log in the attached engine-sosreport we can count 43 instances of "Running command: AddUnmanagedVmsCommand internal: true.", but none of them succeeds or fails with a clear error.
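
A count like this can be reproduced with a short script. The following is only a minimal sketch: the marker string is copied from the comment above, while the function name and the assumption that each log record sits on one line are illustrative.

```python
from typing import Iterable

# Marker text taken from the comment above; everything else here is illustrative.
MARKER = "Running command: AddUnmanagedVmsCommand internal: true."


def count_marker(lines: Iterable[str], marker: str = MARKER) -> int:
    """Return how many of the given log lines contain the marker text."""
    return sum(1 for line in lines if marker in line)
```

For a real log, something like `count_marker(open("engine.log", errors="replace"))` would give the total. The path is a placeholder for wherever engine.log was extracted from the sosreport.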

Comment 4 Sandro Bonazzola 2018-10-17 09:30:15 UTC

Moving to Ryan, as this has been identified as a Virt-related issue.

Comment 5 Simone Tiraboschi 2018-10-17 09:46:06 UTC

I reproduced it also with an external VM that is not related to hosted-engine.

Comment 6 Red Hat Bugzilla Rules Engine 2018-10-17 10:07:59 UTC

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 7 Ryan Barry 2018-11-08 02:54:42 UTC

I've repeatedly tried to reproduce this, and failed.

It is not reproducible on master.

It is not reproducible on 4.2.

I'm currently editing an appliance image with a development build of 4.2, but that's a very slow process.

Simone, can you please provide steps to reproduce with another VM? I've created VMs with virt-install and virsh, and both were imported successfully (on both 4.2 and 4.3), though neither was on a storage domain -- both were on local storage.

I'll continue trying to reproduce with the appliance, but it's a long turnaround.

Comment 8 Simone Tiraboschi 2018-11-08 09:29:42 UTC

(In reply to Ryan Barry from comment #7)
> I've repeatedly tried to reproduce this, and failed.
> 
> It is not reproducible on master.
> 
> It is not reproducible on 4.2.

It's systematically reproduced on master and on 4.2, please check:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-4.2/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-4.2/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-master/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/

> Simone, can you please provide steps to reproduce with another VM? I've
> created with virt-install and virsh, with both successfully imported (on
> both 4.2 and 4.3), though neither was on a storage domain -- both were local
> storage.
> 
> I'll continue trying to reproduce with the appliance, but it's a long
> turnaround?

The engine VM to be imported resides on a VDSM-managed storage domain and was created directly through VDSM.
Maybe it depends on a specific device or something like that.
I'd suggest trying the vintage hosted-engine deployment (deploy with 'hosted-engine --deploy --noansible') and checking what happens on that engine.

Comment 9 Ryan Barry 2018-11-08 12:18:47 UTC

(In reply to Simone Tiraboschi from comment #8)
> It's systematically reproduced on master and on 4.2, please check:
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-4.2/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-4.2/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-master/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/

Right, but all of these involve hosted engine itself. It doesn't seem to be reproducible outside of hosted engine.

> That engine VM to be imported resides on a VDSM managed Storage Domain and
> it has been directly created through VDSM.
> Maybe it depends on a specific device or something like that.
> I'd suggest trying the vintage hosted-engine deployment (deploy with
> 'hosted-engine --deploy --noansible') and check what happens on that engine.

I'll test this way as well.

If it's not reproducible there, I'll test with the appliance, but this is an extremely slow process, because to find a root cause, it involves rebuilding the engine RPM, editing the appliance qcow, deploying HE, waiting for it to fail, and repeating.

That's ok, but I wouldn't expect to find a cause until later this week.

Comment 10 Michal Skrivanek 2018-11-12 13:52:29 UTC

Note that unless someone helps with a reproduction scenario, we'll have to close this. Contrary to comment #5, we're not able to reproduce this with regular VMs.

Comment 11 Simone Tiraboschi 2018-11-12 14:45:41 UTC

Isn't the engine VM in the vintage flow a good example by itself?

Comment 12 Michal Skrivanek 2018-11-12 15:09:43 UTC

Don't know. It did work for me in bug 1626157, so there is probably something else involved here.

Comment 13 Ryan Barry 2018-11-13 00:39:06 UTC

Note:

I've also failed to reproduce this with the vintage flow on both NFS and iSCSI.

Liran reproduced on Gluster, but I don't have a Gluster environment set up. If this is isolated to a deprecated flow on OST/Gluster only, I'm nacking until we see a "real world" report or a more reliable reproducer is found, since it's unlikely that any HC users will select the vintage flow.

Comment 14 Simone Tiraboschi 2018-11-13 09:49:21 UTC

(In reply to Ryan Barry from comment #13)
> Note:
> 
> I've also failed to reproduce this with the vintage flow on both NFS and
> iSCSI.
> 
> Liran reproduced on Gluster, but I don't have a gluster environment set up.
> If this is isolated to a deprecated flow on OST/Gluster only, I'm nacking
> until we see a "real world" report or a more reliable reproducer is found,
> since it's unlikely that any HC users will select the vintage flow

We do have a report from an upstream user at:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DOJENTCWFPFQGDT3IZW542POCTTNAZOW/

The user specifies *nested*, and in OST we are running nested as well.
Maybe the issue is just there?

Comment 15 Ryan Barry 2018-11-13 12:04:37 UTC

Unfortunately, the tests were all run nested.

I'll try to reproduce a couple more times, but it seems extremely reliable in my environment.

Alternatively, if QE can provide an environment where this is reliably reproducible (without actually reproducing it, so I can edit the appliance), that may yield progress.

Comment 16 Ryan Barry 2018-11-14 03:05:32 UTC

Finally got a reproducer, which was actually as trivial as logging into the engine after vintage HE deployment. I'm not sure how it succeeds without the VM registered.

It looks like FullList is not actually returning disks. Patch tomorrow, hopefully

Comment 17 Ryan Barry 2018-11-14 17:36:16 UTC

A patch is up which resolves the engine issue, but this appears to be a partial fix only, and needs some kind of hosted engine changes.

Specifically, the VM can be imported, but it bogs down in hosted-engine-specific code.

The engine (after a HE deployment) has no active storage domains, no active datacenters, and this kills

HostedEngineImporter -> ImportHostedEngineStorageDomain

Since it isn't active, the HE VM is never imported. I have not delved into the HE parts of the engine code before, but I would guess that this should happen from HE setup itself, correct? It adds a SD, adds the host, etc.

2018-11-14 12:19:41,224-05 INFO  [org.ovirt.engine.core.bll.storage.domain.GetExistingStorageDomainListQuery] (EE-ManagedThreadFactory-engine-Thread-24) [] START, GetExistingStorageDomainListQuery(GetExistingStorageDomainListParameters:{refresh='false', filtered='false'}), log id: 4ddcff6a
2018-11-14 12:19:41,229-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] START, HSMGetStorageDomainsListVDSCommand(HostName = ovirthoststable.phresus.priv, HSMGetStorageDomainsListVDSCommandParameters:{hostId='78e0919a-44b4-483d-9447-e45a8e2eb95d', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='null', storageDomainType='Data', path='null'}), log id: 14c7454b
2018-11-14 12:19:41,425-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, HSMGetStorageDomainsListVDSCommand, return: [66d8a735-ccb3-44a2-991c-872a6927a9a2], log id: 14c7454b
2018-11-14 12:19:41,440-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] START, HSMGetStorageDomainInfoVDSCommand(HostName = ovirthoststable.phresus.priv, HSMGetStorageDomainInfoVDSCommandParameters:{hostId='78e0919a-44b4-483d-9447-e45a8e2eb95d', storageDomainId='66d8a735-ccb3-44a2-991c-872a6927a9a2'}), log id: 361ba638
2018-11-14 12:19:41,451-05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='hosted_storage', id='66d8a735-ccb3-44a2-991c-872a6927a9a2'}, null>, log id: 361ba638
2018-11-14 12:19:41,451-05 INFO  [org.ovirt.engine.core.bll.storage.domain.GetExistingStorageDomainListQuery] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, GetExistingStorageDomainListQuery, log id: 4ddcff6a
2018-11-14 12:19:41,454-05 INFO  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Lock Acquired to object 'EngineLock:{exclusiveLocks='[66d8a735-ccb3-44a2-991c-872a6927a9a2=STORAGE]', sharedLocks=''}'
2018-11-14 12:19:41,466-05 WARN  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE
2018-11-14 12:19:41,468-05 INFO  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Lock freed to object 'EngineLock:{exclusiveLocks='[66d8a735-ccb3-44a2-991c-872a6927a9a2=STORAGE]', sharedLocks=''}'
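
The key line in the excerpt above is the WARN record carrying the validation reasons. As a rough sketch, such records could be pulled out of engine.log as follows. It assumes each record is on a single line (wrapped lines would need to be rejoined first). The regular expression is inferred only from the one WARN line shown here, so other engine versions may word the message differently, and the function name is made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Pattern inferred from the single WARN line in the excerpt above (an assumption,
# not a documented log format).
VALIDATION_RE = re.compile(
    r"Validation of action '(?P<action>[^']+)' failed for user (?P<user>\S+)\. "
    r"Reasons: (?P<reasons>\S+)"
)


def parse_validation_failure(line: str) -> Optional[Tuple[str, str, List[str]]]:
    """Return (action, user, reasons) for a validation-failure line, else None."""
    match = VALIDATION_RE.search(line)
    if match is None:
        return None
    return match.group("action"), match.group("user"), match.group("reasons").split(",")
```

Applied to the WARN line above, the last reason in the returned list is ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE, which is the part that explains why the import does not proceed.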

Comment 18 Simone Tiraboschi 2018-11-14 17:56:48 UTC

(In reply to Ryan Barry from comment #17)
> A patch is up which resolves the engine issue, but this appears to be a
> partial fix only, and needs some kind of hosted engine changes.
> 
> Specifically, the VM can be imported, but it bogs down in
> hosted-engine-specific code.
> 
> The engine (after a HE deployment) has no active storage domains, no active
> datacenters, and this kills
> 
> HostedEngineImporter -> ImportHostedEngineStorageDomain
> 
> Since it isn't active, the HE VM is never imported. I have not delved into
> the HE parts of the engine code before, but I would guess that this should
> happen from he setup itself, correct? It adds a SD, adds the host, etc.

Here we are talking about the "vintage" HE flow.
In the vintage HE flow, hosted-engine setup directly creates a storage domain (the hosted-engine one) through VDSM before there is a running engine.
The engine VM is created directly by ovirt-hosted-engine-setup via VDSM on that storage domain.
The host where the user runs hosted-engine setup is then added to the engine, and this is enough to correctly conclude the hosted-engine-setup process.

Then the user was asked to manually add the first data storage domain to the engine.
That storage domain becomes the master storage domain, and the datacenter comes up.

Only when the datacenter was up did the engine import the hosted-engine storage domain and the engine VM stored there.
Now this part loops without ever completing.
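
The ordering described in this comment can be summarised as a toy model. This is not engine code: the class, field, and function names are all hypothetical, and the sketch only encodes the stated precondition that the auto-import runs once the host has been added and a master data domain is active.

```python
from dataclasses import dataclass


@dataclass
class EngineState:
    """Hypothetical snapshot of the engine after a vintage HE deployment."""
    host_added: bool = False            # done by hosted-engine setup
    master_domain_active: bool = False  # true once the user adds the first data domain
    he_domain_imported: bool = False
    he_vm_imported: bool = False


def try_auto_import(state: EngineState) -> bool:
    """One pass of the (modelled) auto-import; returns True if the import ran."""
    if not (state.host_added and state.master_domain_active):
        # Mirrors the validation failure seen in the log: no active master domain yet.
        return False
    state.he_domain_imported = True
    state.he_vm_imported = True
    return True
```

In this model, a freshly deployed engine (`host_added=True` only) keeps returning False until `master_domain_active` is set, which matches the expectation that adding the first data domain is what triggers the import.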

Comment 19 Ryan Barry 2018-11-14 18:06:01 UTC

The loop is resolved.

What is not resolved is that the HE VM is not imported because there is no active master SD, which is the question. Shouldn't HE setup handle this? I looked through the git logs and cannot find any indication that engine ever handled this.

Note that during HE setup, I was not asked for another SD, and I don't remember doing this in the past, but it's been a while.

Is the expected workflow to log into engine to add the master SD? If so, I'll do that to ensure it's imported then, but my memory tells me (from 4.1) that HE setup creates a default DC/cluster and adds itself.

Comment 20 Simone Tiraboschi 2018-11-14 19:45:25 UTC

(In reply to Ryan Barry from comment #19)
> The loop is resolved.
> 
> What is not resolved is that the HE VM is not imported because there is no
> active master SD, which is the question. Shouldn't HE setup handle this? I
> looked through the git logs and cannot find any indication that engine ever
> handled this.
> 
> Note that during HE setup, I was not asked for another SD, and I don't
> remember doing this in the past, but it's been a while.
> 
> Is the expected workflow to log into engine to add the master SD? If so,
> I'll do that to ensure it's imported then, but my memory tells me (from 4.1)
> that HE setup creates a default DC/cluster and adds itself.

The current Ansible code does it automatically. In the vintage flow it was up to the user to create the first data storage domain, and the auto-import process was triggered right after that.

Comment 21 Liran Rotenberg 2018-12-20 15:10:42 UTC

Verified on:
ovirt-engine-4.3.0-0.4.master.20181218200623.gitf1f0e41.el7.noarch

Steps:
1. Add an external VM to a host not connected to the environment.
For example:
# virt-install --name centos7 --ram 1024 --disk path=./centos7.qcow2,size=8 --vcpus 1 --os-type linux --os-variant centos7.0 --network bridge=virbr0 --graphics none --console pty,target_type=serial --location 'http://mirror.i3d.net/pub/centos/7/os/x86_64/' --extra-args 'console=ttyS0,115200n8 serial'

2. Add the host to the environment
3. Check for the VM import.

Results:
After the host was added and activated, the VM was imported successfully into the environment.
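
Step 3 above ("Check for the VM import") was done by hand. If it were scripted, a generic polling helper along these lines could be used. It is only a sketch: how the VM is actually looked up (REST API, SDK, or UI) is deliberately left to a caller-supplied function, and the names and default timings are assumptions rather than anything taken from this report.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call predicate repeatedly until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would pass a function that returns True once the VM (for example the `centos7` guest created in step 1) shows up in the engine's VM list.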

Comment 22 Sandro Bonazzola 2019-02-13 07:45:11 UTC

This bug fix is included in the oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.
