Bug 2000364

Summary: Engine fails to start, unable to read cloud-init network config from stateless snapshot configuration.
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine Assignee: Saif Abusaleh <sabusale>
Status: CLOSED ERRATA QA Contact: Qin Yuan <qiyuan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4.7 CC: ahadas, emarcus, jspanko, mavital, mburman, mn.albeschenko, mperina, qiyuan
Target Milestone: ovirt-4.4.9 Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.9.2 Doc Type: Bug Fix
Doc Text:
Previously, on Manager startup, system threads may have been used to retrieve the virtual machine configuration from stateless snapshots, causing the Manager to fail to start. In this release, the Manager retrieves the virtual machine configuration from stateless snapshots using only application threads, never system threads. As a result, the Manager can start when stateless snapshots with cloud-init network properties are defined.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-16 14:46:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2021-09-02 00:24:32 UTC
Description of problem:

If a VM has a stateless snapshot that contains cloud-init network configuration, engine fails to (re)start with:

2021-09-02 10:11:43,067+10 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 46) [] Error during initialization: javax.ejb.EJBException: java.lang.IllegalStateException: WFLYEE0042: Failed to construct component instance
	at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:264)
	at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:386)
        ....
Caused by: java.lang.RuntimeException: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:609)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1675)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.utils.ovf.OvfManager.importVm(OvfManager.java:110)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.ovfstore.OvfHelper.readVmFromOvf(OvfHelper.java:97)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:607)
	... 16 more
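
The trace above passes through java.util.stream task classes (ReduceOps$ReduceTask, AbstractTask.compute) and ends in ForkJoinWorkerThread.run, which suggests the snapshot configuration was read inside a parallel stream. The following is a minimal, standalone sketch (not engine code; the class and method names are hypothetical) that records which threads execute a sequential stream versus a parallel one, together with each thread's context class loader. It assumes, without confirmation from this report, that the "no such class found" error is related to fork/join worker threads not seeing the deployment's class loader.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class StreamThreadDemo {

    // Runs a trivial task over a stream and records which threads executed it.
    static Set<String> threadsUsed(boolean parallel) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream range = IntStream.range(0, 10_000);
        IntStream stream = parallel ? range.parallel() : range;
        stream.forEach(i -> names.add(describeCurrentThread()));
        return names;
    }

    // Thread name plus the context class loader that thread would use for lookups.
    static String describeCurrentThread() {
        Thread t = Thread.currentThread();
        return t.getName() + " [context class loader: " + t.getContextClassLoader() + "]";
    }

    public static void main(String[] args) {
        // A sequential stream stays on the calling thread.
        System.out.println("sequential: " + threadsUsed(false));
        // A parallel stream may also use ForkJoinPool common-pool workers.
        System.out.println("parallel:   " + threadsUsed(true));
    }
}
```

In a plain JVM both class loaders are usually the same, so the sketch only makes the thread difference visible. Which elements land on a worker thread rather than on the caller is not deterministic, which may be one reason the failure is intermittent.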

The error comes from the VmInit element in the snapshot configuration below, which the engine is apparently unable to deserialize. However, this configuration alone does not seem to be enough to trigger it: sometimes an identical configuration does not cause the problem.

# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -A -t -c "select vm_configuration from snapshots where snapshot_id = 'ab0000f8-d265-4054-9eca-3d564f7ef19c'"
[...]
<VmInit ovf:hostname="TEST1" ovf:timeZone="Australia/Brisbane" ovf:authorizedKeys="" ovf:regenerateKeys="false" ovf:dnsSearch="kvm.local" ovf:dnsServers="192.168.100.253 192.168.100.252" ovf:networks="[ [ &quot;org.ovirt.engine.core.common.businessentities.VmInitNetwork&quot;, {
  &quot;startOnBoot&quot; : true,
  &quot;name&quot; : &quot;eth0&quot;,
  &quot;bootProtocol&quot; : &quot;STATIC_IP&quot;,
  &quot;ip&quot; : &quot;192.168.100.50&quot;,
  &quot;netmask&quot; : &quot;255.255.255.0&quot;,
  &quot;gateway&quot; : &quot;192.168.100.253&quot;,
  &quot;ipv6BootProtocol&quot; : &quot;NONE&quot;,
  &quot;ipv6Address&quot; : null,
  &quot;ipv6Prefix&quot; : null,
  &quot;ipv6Gateway&quot; : null,
  &quot;id&quot; : null,
  &quot;managed&quot; : true
} ] ]" ovf:customScript=""></VmInit>
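
The ovf:networks attribute above holds XML-escaped JSON in which each entry is a two-element array: a class name followed by the object's fields. The class name is the "type id" that the error message says could not be resolved. A minimal sketch of how that value can be unescaped and the type id pulled out for inspection, using only the JDK (the class and method names are hypothetical, and the string handling is deliberately naive rather than a real XML or JSON parser):

```java
public class OvfNetworksAttribute {

    // Undo the XML attribute escaping seen in the dump (only &quot; and &amp; are handled).
    static String unescape(String attributeValue) {
        return attributeValue.replace("&quot;", "\"").replace("&amp;", "&");
    }

    // Return the first quoted string, which in this layout is the class name (type id).
    static String typeId(String json) {
        int start = json.indexOf('"');
        int end = json.indexOf('"', start + 1);
        return json.substring(start + 1, end);
    }

    public static void main(String[] args) {
        // Shortened copy of the attribute value from the snapshot configuration.
        String attr = "[ [ &quot;org.ovirt.engine.core.common.businessentities.VmInitNetwork&quot;, "
                + "{ &quot;name&quot; : &quot;eth0&quot; } ] ]";
        String json = unescape(attr);
        System.out.println(json);
        System.out.println("type id: " + typeId(json));
    }
}
```

The type id is a plain class name, so whether it can be resolved depends on the class loader available to the thread doing the deserialization, not on the snapshot data itself.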

Version-Release number of selected component (if applicable):
* rhvm-4.4.7.6-0.11.el8ev.noarch

How reproducible:
* Almost always. 

I had some trouble reproducing it at first: I had to create 3 VMs before it reproduced the first time. The last 2 times, however, I managed to reproduce it with just the steps below. I think it's fairly reliable if the VM is newly created, has never run and has no OS installed. Unsure why.

Steps to Reproduce:
1. Create a new VM with just a new disk and a NIC (no need to install an OS), and set it as stateless
2. Run the VM (a stateless snapshot is created, containing the cloud-init config above)
3. Restart ovirt-engine

Comment 1 Germano Veit Michel 2021-09-02 00:32:00 UTC
I'm under the impression that the issue reproduces if there is no guest agent/OS and the VM is started for the first time when the stateless snapshot is created, as if it's related to the guest agent reporting something back (vNICs?).
Still a bit confusing though.

Comment 10 RHEL Program Management 2021-10-11 11:11:14 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 13 Qin Yuan 2021-10-25 12:23:20 UTC
Verified with:
ovirt-engine-4.4.9.3-0.3.el8ev.noarch

Steps:
1. Create one stateless VM with cloud-init network configured (no OS, one NIC device, one disk), run the VM, and restart ovirt-engine.
2. Compare the restart times of ovirt-engine 4.4.9.3 and 4.4.9.1 when there are many running stateless VMs:
   1) Create 1000 stateless VMs with the following configuration:
      - 1M mem (because of the resource limitation)
      - 1M disk (because of the resource limitation)
      - no console (because of the port number limitation)
      - no NIC device (because of the MAC resource limitation)
      - no OS
      - cloud-init enabled
      - cloud-init network configured (not configured for 4.4.9.1, otherwise the engine can't be restarted)
   2) Run the 1000 VMs.
   3) Restart ovirt-engine 10 times, record the restart times, and calculate the average.
   4) Compare the average restart times of ovirt-engine 4.4.9.3 and 4.4.9.1.
3. Run cloud-init automation tests.

Results:
1. ovirt-engine can be restarted when there is one running stateless VM with cloud-init network configured.
2. ovirt-engine can be restarted when there are 1000 running stateless VMs with cloud-init network configured.
3. The average restart time of ovirt-engine 4.4.9.3 is 5 seconds longer than that of ovirt-engine 4.4.9.1 when there are 1000 running stateless VMs. Checked with Arik and Saif; the overhead looks reasonable.
4. No regression issue found in cloud-init automation tests.

Comment 17 errata-xmlrpc 2021-11-16 14:46:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) security update [ovirt-4.4.9]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4626