Description of problem:
After vintage deployment, adding a storage domain doesn't add the hosted-engine storage domain and the HE VM to the environment.

Version-Release number of selected component (if applicable):
ovirt-host-deploy-1.7.5-0.0.master.20180530161905.gitc423dec.el7.noarch
ovirt-engine-appliance-4.2-20181014.1.el7.noarch
ovirt-imageio-common-1.4.5-0.el7.x86_64
ovirt-release42-snapshot-4.2.7-0.2.rc2.20181014014958.gitfb30674.el7.noarch
ovirt-release42-4.2.7-0.2.rc2.20181014014958.gitfb30674.el7.noarch
ovirt-engine-sdk-python-3.6.9.2-0.1.20180209.gite99bbd1.el7.centos.noarch
python-ovirt-engine-sdk4-4.2.9-2.20181004git4d189a6.el7.x86_64
ovirt-provider-ovn-driver-1.2.17-0.20181003135950.git6aa6b37.el7.noarch
cockpit-ovirt-dashboard-0.11.35-1.el7.noarch
ovirt-setup-lib-1.1.6-0.0.master.20180921125403.git90612e6.el7.noarch
ovirt-vmconsole-host-1.0.6-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-hosted-engine-setup-2.2.29-0.0.master.20181002122252.git9ae169e.el7.noarch
cockpit-machines-ovirt-176-1.el7.noarch
ovirt-imageio-daemon-1.4.5-0.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
ovirt-vmconsole-1.0.6-1.el7.noarch
ovirt-hosted-engine-ha-2.2.19-0.0.master.20181002122327.20181002122322.gitb449616.el7.noarch
vdsm-4.20.42-4.git43e2555.el7.x86_64
libvirt-4.5.0-10.el7.x86_64
glusterfs-cli-3.12.14-1.el7.x86_64
glusterfs-fuse-3.12.14-1.el7.x86_64
glusterfs-client-xlators-3.12.14-1.el7.x86_64
glusterfs-libs-3.12.14-1.el7.x86_64
glusterfs-3.12.14-1.el7.x86_64
glusterfs-rdma-3.12.14-1.el7.x86_64
glusterfs-api-3.12.14-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7.x86_64
ovirt-engine-4.2.7.3-0.0.master.20181012152958.gitfc1595b.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Vintage-deploy an HE environment (seen on Gluster storage).
2. Add a storage domain to the environment.

Actual results:
The storage domain from step 2 is added, but no other storage domain is added and the HE VM is not visible in the web UI.
On the engine, every ~15 seconds there is a task: "Adding unmanaged VMs running on Host ocelot05.qa.lab.tlv.redhat.com to Cluster Default".

Expected results:
After adding a storage domain, the auto-import should activate and succeed: another storage domain (the storage of the hosted engine) should be added, and the HE VM should be shown in the environment.

Additional info:
It is possible to manually add (import) the storage domain of the HE VM, but the HE VM is still not shown in the environment; the "Virtual Machines" tab of that specific storage domain also shows no VM. In that case the engine keeps running the task above ("Adding unmanaged VMs...").
Created attachment 1494278 [details] engine-sosreport
Created attachment 1494279 [details] host-sosreport
This happens on the engine side: the engine continuously scans for external VMs on the host but never imports them. For instance, in engine.log in the attached engine-sosreport we can count 43 instances of "Running command: AddUnmanagedVmsCommand internal: true.", but none of them either succeeds or fails with a clear error.
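For anyone triaging a similar report, the count above can be reproduced with a plain grep over engine.log from the sosreport. The sample lines below are illustrative stand-ins (timestamps and thread names are invented), only the message text matches the real log:

```shell
# Two illustrative scan iterations, ~15 seconds apart, mimicking the
# engine rescanning without ever importing the VM.
cat > /tmp/engine-excerpt.log <<'EOF'
2018-10-16 10:00:01,000+03 INFO [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Running command: AddUnmanagedVmsCommand internal: true.
2018-10-16 10:00:16,000+03 INFO [org.ovirt.engine.core.bll.AddUnmanagedVmsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-2) [] Running command: AddUnmanagedVmsCommand internal: true.
EOF
# Count the scan iterations; against the real engine.log this returned 43.
grep -c 'Running command: AddUnmanagedVmsCommand internal: true' /tmp/engine-excerpt.log
```

Run the same grep against the real engine.log in the sosreport to get the actual iteration count.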
Moving to Ryan, since this has been identified as a Virt-related issue.
I reproduced it also with an external VM that is not related to hosted-engine.
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
I've repeatedly tried to reproduce this, and failed.

It is not reproducible on master. It is not reproducible on 4.2.

I'm currently editing an appliance image with a development build of 4.2, but that's a very slow process. Simone, can you please provide steps to reproduce with another VM? I've created VMs with virt-install and virsh, and both were successfully imported (on both 4.2 and 4.3), though neither was on a storage domain -- both were on local storage.

I'll continue trying to reproduce with the appliance, but it's a long turnaround.
(In reply to Ryan Barry from comment #7)
> I've repeatedly tried to reproduce this, and failed.
>
> It is not reproducible on master.
>
> It is not reproducible on 4.2.

It's systematically reproduced on master and on 4.2, please check:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-4.2/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-4.2/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-master/
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/

> Simone, can you please provide steps to reproduce with another VM? I've
> created VMs with virt-install and virsh, and both were successfully
> imported (on both 4.2 and 4.3), though neither was on a storage domain --
> both were on local storage.
>
> I'll continue trying to reproduce with the appliance, but it's a long
> turnaround.

The engine VM to be imported resides on a VDSM-managed storage domain and was created directly through VDSM. Maybe it depends on specific devices or something like that. I'd suggest trying the vintage hosted-engine deployment (deploy with 'hosted-engine --deploy --noansible') and checking what happens on that engine.
(In reply to Simone Tiraboschi from comment #8)
> It's systematically reproduced on master and on 4.2, please check:
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-4.2/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-4.2/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-suite-master/
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_he-basic-iscsi-suite-master/

Right, but all of these involve hosted engine itself. It doesn't seem to be reproducible outside of hosted engine.

> The engine VM to be imported resides on a VDSM-managed storage domain and
> was created directly through VDSM.
> Maybe it depends on specific devices or something like that.
> I'd suggest trying the vintage hosted-engine deployment (deploy with
> 'hosted-engine --deploy --noansible') and checking what happens on that
> engine.

I'll test this way as well. If it's not reproducible there, I'll test with the appliance, but this is an extremely slow process: finding a root cause involves rebuilding the engine RPM, editing the appliance qcow, deploying HE, waiting for it to fail, and repeating. That's OK, but I wouldn't expect to find a cause until later this week.
Note that unless someone helps with a reproduction scenario, we'll have to close this. Contrary to comment #5, we're not able to reproduce this with regular VMs.
Isn't the engine VM in the vintage flow a good example by itself?
I don't know. It did work for me in bug 1626157, so there is probably something else involved here.
Note:

I've also failed to reproduce this with the vintage flow on both NFS and iSCSI.

Liram reproduced it on Gluster, but I don't have a Gluster environment set up. If this is isolated to a deprecated flow on OST/Gluster only, I'm nacking until we see a "real world" report or a more reliable reproducer is found, since it's unlikely that any HC users will select the vintage flow.
(In reply to Ryan Barry from comment #13)
> Note:
>
> I've also failed to reproduce this with the vintage flow on both NFS and
> iSCSI.
>
> Liram reproduced it on Gluster, but I don't have a Gluster environment set
> up. If this is isolated to a deprecated flow on OST/Gluster only, I'm
> nacking until we see a "real world" report or a more reliable reproducer is
> found, since it's unlikely that any HC users will select the vintage flow.

We have for sure a report from an upstream user:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/DOJENTCWFPFQGDT3IZW542POCTTNAZOW/

The user specifies *nested*, and in OST we are running nested as well. Maybe the issue is just there?
Unfortunately, my tests were all run nested as well. I'll try to reproduce a couple more times, but the import seems extremely reliable in my environment. Alternatively, if QE can provide an environment where this is reliably reproducible (without actually reproducing it, so I can edit the appliance), that may yield progress.
Finally got a reproducer, which was actually as trivial as logging into the engine after vintage HE deployment. I'm not sure how it succeeds without the VM registered. It looks like FullList is not actually returning disks. Patch tomorrow, hopefully
A patch is up which resolves the engine issue, but this appears to be only a partial fix, and some kind of hosted-engine change is also needed.

Specifically, the VM can be imported, but it bogs down in hosted-engine-specific code. The engine (after an HE deployment) has no active storage domains and no active datacenters, and this kills:

HostedEngineImporter -> ImportHostedEngineStorageDomain

Since the domain isn't active, the HE VM is never imported. I have not delved into the HE parts of the engine code before, but I would guess that this should happen from HE setup itself, correct? It adds an SD, adds the host, etc.

2018-11-14 12:19:41,224-05 INFO [org.ovirt.engine.core.bll.storage.domain.GetExistingStorageDomainListQuery] (EE-ManagedThreadFactory-engine-Thread-24) [] START, GetExistingStorageDomainListQuery(GetExistingStorageDomainListParameters:{refresh='false', filtered='false'}), log id: 4ddcff6a
2018-11-14 12:19:41,229-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] START, HSMGetStorageDomainsListVDSCommand(HostName = ovirthoststable.phresus.priv, HSMGetStorageDomainsListVDSCommandParameters:{hostId='78e0919a-44b4-483d-9447-e45a8e2eb95d', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='null', storageDomainType='Data', path='null'}), log id: 14c7454b
2018-11-14 12:19:41,425-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, HSMGetStorageDomainsListVDSCommand, return: [66d8a735-ccb3-44a2-991c-872a6927a9a2], log id: 14c7454b
2018-11-14 12:19:41,440-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] START, HSMGetStorageDomainInfoVDSCommand(HostName = ovirthoststable.phresus.priv, HSMGetStorageDomainInfoVDSCommandParameters:{hostId='78e0919a-44b4-483d-9447-e45a8e2eb95d', storageDomainId='66d8a735-ccb3-44a2-991c-872a6927a9a2'}), log id: 361ba638
2018-11-14 12:19:41,451-05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='hosted_storage', id='66d8a735-ccb3-44a2-991c-872a6927a9a2'}, null>, log id: 361ba638
2018-11-14 12:19:41,451-05 INFO [org.ovirt.engine.core.bll.storage.domain.GetExistingStorageDomainListQuery] (EE-ManagedThreadFactory-engine-Thread-24) [] FINISH, GetExistingStorageDomainListQuery, log id: 4ddcff6a
2018-11-14 12:19:41,454-05 INFO [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Lock Acquired to object 'EngineLock:{exclusiveLocks='[66d8a735-ccb3-44a2-991c-872a6927a9a2=STORAGE]', sharedLocks=''}'
2018-11-14 12:19:41,466-05 WARN [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE
2018-11-14 12:19:41,468-05 INFO [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Lock freed to object 'EngineLock:{exclusiveLocks='[66d8a735-ccb3-44a2-991c-872a6927a9a2=STORAGE]', sharedLocks=''}'
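The gating validation failure can be pulled out of engine.log with a quick grep on the WARN lines. The demo below writes the single WARN line from the log above into a scratch file so the command is self-contained; against a real deployment, point the grep at /var/log/ovirt-engine/engine.log instead:

```shell
# The WARN line from the log above, saved to a file for the demo.
cat > /tmp/import-warn.log <<'EOF'
2018-11-14 12:19:41,466-05 WARN [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-24) [6ed10acf] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE
EOF
# Extract only the machine-readable failure reason.
grep -o 'ACTION_TYPE_FAILED[A-Z_]*' /tmp/import-warn.log
```

This prints ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE, which is exactly the condition discussed below: no active master storage domain, so the import validation never passes.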
(In reply to Ryan Barry from comment #17)
> A patch is up which resolves the engine issue, but this appears to be only
> a partial fix, and some kind of hosted-engine change is also needed.
>
> Specifically, the VM can be imported, but it bogs down in
> hosted-engine-specific code.
>
> The engine (after an HE deployment) has no active storage domains and no
> active datacenters, and this kills:
>
> HostedEngineImporter -> ImportHostedEngineStorageDomain
>
> Since the domain isn't active, the HE VM is never imported. I have not
> delved into the HE parts of the engine code before, but I would guess that
> this should happen from HE setup itself, correct? It adds an SD, adds the
> host, etc.

Here we are talking about the "vintage" HE flow. In the vintage flow, hosted-engine-setup directly creates the hosted-engine storage domain through VDSM before having a running engine. The engine VM is then created directly by ovirt-hosted-engine-setup, via VDSM, on that storage domain. The host where the user ran hosted-engine-setup is then added to the engine, and this is enough to correctly conclude the hosted-engine-setup process.

After that, the user was asked to manually add their first data storage domain to the engine. That storage domain becomes the master storage domain, and the datacenter goes up. Only when the datacenter is up does the engine import the hosted-engine storage domain and the engine VM stored there. Now this part is looping without ever completing.
The loop is resolved.

What is not resolved: the HE VM is not imported because there is no active master SD, which raises the question -- shouldn't HE setup handle this? I looked through the git logs and cannot find any indication that the engine ever handled this.

Note that during HE setup I was not asked for another SD, and I don't remember doing this in the past, but it's been a while.

Is the expected workflow to log into the engine to add the master SD? If so, I'll do that to ensure it's imported then, but my memory (from 4.1) tells me that HE setup creates a default DC/cluster and adds itself.
(In reply to Ryan Barry from comment #19)
> The loop is resolved.
>
> What is not resolved: the HE VM is not imported because there is no active
> master SD, which raises the question -- shouldn't HE setup handle this? I
> looked through the git logs and cannot find any indication that the engine
> ever handled this.
>
> Note that during HE setup I was not asked for another SD, and I don't
> remember doing this in the past, but it's been a while.
>
> Is the expected workflow to log into the engine to add the master SD? If
> so, I'll do that to ensure it's imported then, but my memory (from 4.1)
> tells me that HE setup creates a default DC/cluster and adds itself.

The current ansible code does it automatically; in the vintage flow it was up to the user to create the first data storage domain, and the auto-import process was triggered just after that.
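To make the expected vintage-flow ordering explicit, the sequence described above can be sketched as follows (the deploy command is interactive, and the angle-bracketed lines are user/UI steps, not literal commands):

```text
hosted-engine --deploy --noansible
    # creates the hosted_storage domain directly via VDSM
    # creates the engine VM on that domain
    # adds the host to the freshly deployed engine
<log into the engine web UI>
<add the first data storage domain>
    # becomes the master storage domain; the datacenter goes up
<engine auto-imports the hosted-engine storage domain, then the HE VM>
```

The bug is that the last step loops forever instead of completing once the master SD is active.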
Verified on:
ovirt-engine-4.3.0-0.4.master.20181218200623.gitf1f0e41.el7.noarch

Steps:
1. Add an external VM to a host not connected to the environment. For example:
# virt-install --name centos7 --ram 1024 --disk path=./centos7.qcow2,size=8 --vcpus 1 --os-type linux --os-variant centos7.0 --network bridge=virbr0 --graphics none --console pty,target_type=serial --location 'http://mirror.i3d.net/pub/centos/7/os/x86_64/' --extra-args 'console=ttyS0,115200n8 serial'
2. Add the host to the environment.
3. Check for the VM import.

Results:
After the host was added and activated, the VM was imported successfully into the environment.
This bugzilla is included in oVirt 4.3.0 release, published on February 4th 2019. Since the problem described in this bug report should be resolved in oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.