Description of problem:

If a VM has a stateless snapshot that contains cloud-init network configuration, the engine fails to (re)start with:

2021-09-02 10:11:43,067+10 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 46) [] Error during initialization: javax.ejb.EJBException: java.lang.IllegalStateException: WFLYEE0042: Failed to construct component instance
    at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:264)
    at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:386)
....
Caused by: java.lang.RuntimeException: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
    at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:609)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
    at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1675)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
    at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
    at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
    at deployment.engine.ear.bll.jar//org.ovirt.engine.core.utils.ovf.OvfManager.importVm(OvfManager.java:110)
    at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.ovfstore.OvfHelper.readVmFromOvf(OvfHelper.java:97)
    at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:607)
    ... 16 more

The exception is raised while reading the stateless snapshot's OVF (SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm in the trace above), which is apparently unable to deserialize the VmInitNetwork entry. However, this configuration alone does not seem to be enough to trigger it: sometimes an identical configuration does not trigger the problem.

# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -A -t -c "select vm_configuration from snapshots where snapshot_id = 'ab0000f8-d265-4054-9eca-3d564f7ef19c'"
[...]
<VmInit ovf:hostname="TEST1" ovf:timeZone="Australia/Brisbane" ovf:authorizedKeys="" ovf:regenerateKeys="false" ovf:dnsSearch="kvm.local" ovf:dnsServers="192.168.100.253 192.168.100.252" ovf:networks="[ [ "org.ovirt.engine.core.common.businessentities.VmInitNetwork", { "startOnBoot" : true, "name" : "eth0", "bootProtocol" : "STATIC_IP", "ip" : "192.168.100.50", "netmask" : "255.255.255.0", "gateway" : "192.168.100.253", "ipv6BootProtocol" : "NONE", "ipv6Address" : null, "ipv6Prefix" : null, "ipv6Gateway" : null, "id" : null, "managed" : true } ] ]" ovf:customScript=""></VmInit>

Version-Release number of selected component (if applicable):
* rhvm-4.4.7.6-0.11.el8ev.noarch

How reproducible:
* Almost always. I had some trouble reproducing it at first: I had to create 3 VMs before it reproduced the first time. The last 2 times I reproduced it exactly as described below. I think it is fairly reliable if the VM is newly created, has never been run, and has no OS installed. Unsure why.

Steps to Reproduce:
1. Create a new VM; just add a new disk and a NIC (no need to install an OS), and set it as stateless
2. Run the VM (a stateless snapshot is created, containing the cloud-init config above)
3. Restart ovirt-engine
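For triage, the ovf:networks value from the snapshot configuration above can be decoded as plain JSON. Each entry appears to be a two-element array: a Java class name (the "type id" the engine fails to resolve) followed by the NIC properties. The sketch below is only an illustration, not engine code. It assumes the attribute value has already been extracted from vm_configuration and XML-unescaped, and the function name summarize_networks is hypothetical.

```python
import json

# ovf:networks value copied from the report. Assumption: it has already been
# pulled out of the VmInit element and XML-unescaped into plain JSON.
NETWORKS_JSON = """[ [ "org.ovirt.engine.core.common.businessentities.VmInitNetwork", {
  "startOnBoot" : true, "name" : "eth0", "bootProtocol" : "STATIC_IP",
  "ip" : "192.168.100.50", "netmask" : "255.255.255.0", "gateway" : "192.168.100.253",
  "ipv6BootProtocol" : "NONE", "ipv6Address" : null, "ipv6Prefix" : null,
  "ipv6Gateway" : null, "id" : null, "managed" : true } ] ]"""


def summarize_networks(payload):
    """Hypothetical helper: return (type_id, nic_name, boot_protocol, ip) per entry."""
    summary = []
    # Each entry is a [type_id, properties] pair.
    for type_id, props in json.loads(payload):
        summary.append(
            (type_id, props.get("name"), props.get("bootProtocol"), props.get("ip"))
        )
    return summary


if __name__ == "__main__":
    for row in summarize_networks(NETWORKS_JSON):
        print(row)
```

The first element of each tuple is the class name that the OVF reader reports as "no such class found", which makes it easy to compare against snapshots that do not trigger the failure.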
I'm under the impression that the issue reproduces if there is no guest agent/OS and the VM is started for the first time when the stateless snapshot is created, as if it's related to the guest agent reporting something back (vNICs?). Still a bit confusing, though.
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
Verified with:
ovirt-engine-4.4.9.3-0.3.el8ev.noarch

Steps:
1. Create one stateless VM with cloud-init network (no OS, one NIC device, one disk), run the VM, restart ovirt-engine.
2. Compare the restart times of ovirt-engine 4.4.9.3 and 4.4.9.1 when there are many running stateless VMs:
   1) create 1000 stateless VMs with the following configurations:
      - 1M mem (because of the resource limitation)
      - 1M disk (because of the resource limitation)
      - no console (because of the port number limitation)
      - no NIC device (because of the MAC resource limitation)
      - no OS
      - cloud-init enabled
      - cloud-init network configured (for 4.4.9.1 not configured, otherwise the engine can't be restarted)
   2) run the 1000 VMs
   3) restart ovirt-engine 10 times, record restart times, calculate the average value.
   4) compare the average restart times of ovirt-engine 4.4.9.3 and 4.4.9.1
3. Run cloud-init automation tests.

Results:
1. ovirt-engine can be restarted when there is one running stateless VM with cloud-init network configured.
2. ovirt-engine can be restarted when there are 1000 running stateless VMs with cloud-init network configured.
3. The average restart time of ovirt-engine 4.4.9.3 is 5 seconds longer than that of ovirt-engine 4.4.9.1 when there are 1000 running stateless VMs. Checked with Arik and Saif, the overhead looks reasonable.
4. No regression issue found in cloud-init automation tests.
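The comparison in step 2 comes down to averaging the recorded restart times for each version and taking the difference. A minimal sketch of that arithmetic follows. The function name restart_overhead is hypothetical, and the sample values are placeholders chosen only to exercise the code, not measurements from this verification.

```python
from statistics import mean


def restart_overhead(baseline_times, candidate_times):
    """Hypothetical helper: average candidate restart time minus average baseline, in seconds."""
    return mean(candidate_times) - mean(baseline_times)


if __name__ == "__main__":
    # Placeholder values only; the real runs recorded 10 restarts per version.
    baseline = [10.0, 12.0, 14.0]   # stands in for the 4.4.9.1 timings
    candidate = [11.0, 13.0, 15.0]  # stands in for the 4.4.9.3 timings
    print(restart_overhead(baseline, candidate))
```

A positive result means the candidate version restarts more slowly on average than the baseline.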
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) security update [ovirt-4.4.9]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4626