Bug 2000364 - Engine fails to start, unable to read cloud-init network config from stateless snapshot configuration.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.9
Target Release: ---
Assignee: Saif Abusaleh
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-02 00:24 UTC by Germano Veit Michel
Modified: 2022-12-07 15:32 UTC (History)

Fixed In Version: ovirt-engine-4.4.9.2
Doc Type: Bug Fix
Doc Text:
Previously, on Manager startup, system threads could be used to retrieve the virtual machine configuration from stateless snapshots, which caused the Manager to fail to start. In this release, the Manager retrieves the virtual machine configuration from stateless snapshots using only application threads, not system threads. As a result, the Manager can start when stateless snapshots with cloud-init network properties are defined.
Clone Of:
Environment:
Last Closed: 2021-11-16 14:46:57 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43358 0 None None None 2021-09-02 00:25:37 UTC
Red Hat Knowledge Base (Solution) 6301611 0 None None None 2021-09-02 00:36:45 UTC
Red Hat Product Errata RHSA-2021:4626 0 None None None 2021-11-16 14:47:07 UTC
oVirt gerrit 117012 0 master MERGED engine: Fix engine crash when there is VM with cloud-init network 2021-10-11 10:54:59 UTC
oVirt gerrit 117049 0 ovirt-engine-4.4 MERGED engine: Fix engine crash when there is VM with cloud-init network 2021-10-11 11:03:06 UTC

Description Germano Veit Michel 2021-09-02 00:24:32 UTC
Description of problem:

If a VM has a stateless snapshot that contains cloud-init network configuration, engine fails to (re)start with:

2021-09-02 10:11:43,067+10 ERROR [org.ovirt.engine.core.bll.Backend] (ServerService Thread Pool -- 46) [] Error during initialization: javax.ejb.EJBException: java.lang.IllegalStateException: WFLYEE0042: Failed to construct component instance
	at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:264)
	at org.jboss.as.ejb3.8.GA-redhat-00001//org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:386)
        ....
Caused by: java.lang.RuntimeException: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:609)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1675)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.ovirt.engine.core.utils.ovf.OvfReaderException: OVF error: TEST1: cannot read 'Domain' with value: Invalid type id 'org.ovirt.engine.core.common.businessentities.VmInitNetwork' (for id type 'Id.class'): no such class found
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.utils.ovf.OvfManager.importVm(OvfManager.java:110)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.ovfstore.OvfHelper.readVmFromOvf(OvfHelper.java:97)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:607)
	... 16 more
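
The ForkJoinPool and ReduceOps$ReduceTask frames in the trace above indicate that the snapshot configurations were read inside a parallel stream, so the work ran on ForkJoinPool worker threads rather than on the thread that started it. The following is a minimal, standalone Java sketch of that behavior only. It is not the ovirt-engine code, and the class and method names (StreamThreadsSketch, sequentialThreads, parallelThreads) are made up for illustration. Whether those worker threads are the "system threads" named in the Doc Text is an assumption.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class StreamThreadsSketch {

    // Names of the threads that process the elements of a sequential stream.
    // A sequential stream is evaluated entirely on the calling thread.
    static Set<String> sequentialThreads(int n) {
        return IntStream.range(0, n).boxed()
                .map(i -> Thread.currentThread().getName())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Names of the threads that process the elements of a parallel stream.
    // Elements may be handled by ForkJoinPool common-pool workers as well as
    // by the calling thread; the exact set depends on the machine.
    static Set<String> parallelThreads(int n) {
        return IntStream.range(0, n).boxed().parallel()
                .map(i -> Thread.currentThread().getName())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println("caller:     " + Thread.currentThread().getName());
        System.out.println("sequential: " + sequentialThreads(1000));
        System.out.println("parallel:   " + parallelThreads(1000));
    }
}
```

On a multi-core machine the parallel line typically lists ForkJoinPool.commonPool-worker-N threads in addition to the caller. Code that depends on the calling thread's context (for example, its class loader) can therefore behave differently once a stream is made parallel.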

The failure comes from deserializing the snapshot configuration shown below. However, this configuration alone does not seem to be enough to trigger it; sometimes an identical configuration does not trigger the problem.

# /usr/share/ovirt-engine/dbscripts/engine-psql.sh -A -t -c "select vm_configuration from snapshots where snapshot_id = 'ab0000f8-d265-4054-9eca-3d564f7ef19c'"
[...]
<VmInit ovf:hostname="TEST1" ovf:timeZone="Australia/Brisbane" ovf:authorizedKeys="" ovf:regenerateKeys="false" ovf:dnsSearch="kvm.local" ovf:dnsServers="192.168.100.253 192.168.100.252" ovf:networks="[ [ &quot;org.ovirt.engine.core.common.businessentities.VmInitNetwork&quot;, {
  &quot;startOnBoot&quot; : true,
  &quot;name&quot; : &quot;eth0&quot;,
  &quot;bootProtocol&quot; : &quot;STATIC_IP&quot;,
  &quot;ip&quot; : &quot;192.168.100.50&quot;,
  &quot;netmask&quot; : &quot;255.255.255.0&quot;,
  &quot;gateway&quot; : &quot;192.168.100.253&quot;,
  &quot;ipv6BootProtocol&quot; : &quot;NONE&quot;,
  &quot;ipv6Address&quot; : null,
  &quot;ipv6Prefix&quot; : null,
  &quot;ipv6Gateway&quot; : null,
  &quot;id&quot; : null,
  &quot;managed&quot; : true
} ] ]" ovf:customScript=""></VmInit>
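
To check whether a given snapshot configuration carries the cloud-init network type id that appears in the exception, the VmInit element can be parsed and its ovf:networks attribute inspected. The sketch below is a rough illustration in Java using only the JDK XML parser. It assumes the input is just the VmInit element (the rest of vm_configuration is elided above), and the class and method names (VmInitNetworksSketch, networksAttribute, hasCloudInitNetwork) are hypothetical, not part of ovirt-engine.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class VmInitNetworksSketch {

    // Type id taken from the exception message in this report.
    static final String TYPE_ID =
            "org.ovirt.engine.core.common.businessentities.VmInitNetwork";

    // Returns the decoded value of ovf:networks, or "" if the attribute is absent.
    static String networksAttribute(String vmInitXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The fragment has no xmlns:ovf declaration, so parse it without
            // namespace processing and look the attribute up by its raw name.
            factory.setNamespaceAware(false);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(vmInitXml)))
                    .getDocumentElement();
            return root.getAttribute("ovf:networks");
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse VmInit fragment", e);
        }
    }

    // True if the attribute contains the VmInitNetwork type id.
    static boolean hasCloudInitNetwork(String vmInitXml) {
        return networksAttribute(vmInitXml).contains(TYPE_ID);
    }

    public static void main(String[] args) {
        // Shortened, illustrative fragment modeled on the snippet above.
        String sample = "<VmInit ovf:hostname=\"TEST1\" ovf:networks=\"[ [ &quot;"
                + TYPE_ID + "&quot;, { &quot;name&quot; : &quot;eth0&quot; } ] ]\"></VmInit>";
        System.out.println(networksAttribute(sample));
        System.out.println(hasCloudInitNetwork(sample));
    }
}
```

The XML parser decodes the &quot; entities, so the attribute value comes back as the JSON text that the engine later has to deserialize.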

Version-Release number of selected component (if applicable):
* rhvm-4.4.7.6-0.11.el8ev.noarch

How reproducible:
* Almost always. 

I had some trouble reproducing it at first: I had to create 3 VMs before it reproduced the first time. The last 2 times, I reproduced it with just the steps below. I think it is somewhat reliable if the VM is newly created, has never run, and has no OS installed. Unsure why.

Steps to Reproduce:
1. Create new VM, just add a new disk and a NIC, no need to install OS, and set it as stateless
2. Run VM (stateless snapshot is created, containing that cloud-init config above)
3. Restart ovirt-engine

Comment 1 Germano Veit Michel 2021-09-02 00:32:00 UTC
I'm under the impression that the issue reproduces if there is no guest agent/OS and the VM is started for the first time when the stateless snapshot is created, as if it's related to the guest agent reporting something back (vNICs?).
Still a bit confusing though.

Comment 10 RHEL Program Management 2021-10-11 11:11:14 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 13 Qin Yuan 2021-10-25 12:23:20 UTC
Verified with:
ovirt-engine-4.4.9.3-0.3.el8ev.noarch

Steps:
1. Create one stateless VM with cloud-init network configured (no OS, one NIC device, one disk), run the VM, restart ovirt-engine.
2. Compare the restart times of ovirt-engine 4.4.9.3 and 4.4.9.1 when there are many running stateless VMs:
   1) create 1000 stateless VMs with the following configurations:
      - 1M mem (because of the resource limitation)
      - 1M disk (because of the resource limitation)
      - no console (because of the port number limitation)
      - no NIC device (because of the MAC resource limitation)
      - no OS
      - cloud-init enabled
      - cloud-init network configured (not configured for 4.4.9.1; otherwise the engine can't be restarted)
   2) run the 1000 VMs
   3) restart ovirt-engine 10 times, record restart times, calculate the average value.
   4) compare the average restart times of ovirt-engine 4.4.9.3 and 4.4.9.1
3. Run cloud-init automation tests.

Results:
1. ovirt-engine can be restarted when there is one running stateless VM with cloud-init network configured.
2. ovirt-engine can be restarted when there are 1000 running stateless VMs with cloud-init network configured.
3. The average restart time of ovirt-engine 4.4.9.3 is 5 seconds longer than that of ovirt-engine 4.4.9.1 when there are 1000 running stateless VMs. Checked with Arik and Saif; the overhead looks reasonable.
4. No regression issue found in cloud-init automation tests.

Comment 17 errata-xmlrpc 2021-11-16 14:46:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) security update [ovirt-4.4.9]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4626

