Bug 1585456 - [downstream clone - 4.2.4] ovirt-engine fails to start when having a large number of stateless snapshots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.2.4
Target Release: ---
Assignee: Roy Golan
QA Contact: mlehrer
URL:
Whiteboard:
Depends On: 1579008
Blocks:
 
Reported: 2018-06-03 07:11 UTC by RHV bug bot
Modified: 2022-07-11 10:08 UTC (History)
14 users

Fixed In Version: ovirt-engine-4.2.4.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1579008
Environment:
Last Closed: 2018-06-27 10:02:42 UTC
oVirt Team: Storage
Target Upstream Version:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47445 0 None None None 2022-07-09 13:52:34 UTC
Red Hat Knowledge Base (Solution) 3722231 0 None None None 2018-11-28 23:54:24 UTC
Red Hat Product Errata RHSA-2018:2071 0 None None None 2018-06-27 10:03:32 UTC
oVirt gerrit 91585 0 master MERGED core: process OVF in parallel to spead up engine startup 2021-01-19 21:16:48 UTC
oVirt gerrit 91700 0 ovirt-engine-4.2 MERGED core: process OVF in parallel to spead up engine startup 2021-01-19 21:16:08 UTC

Description RHV bug bot 2018-06-03 07:11:17 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1579008 +++
======================================================================

Description of problem:

The start of the Engine service in JBoss gets interrupted after 5 minutes by the com.arjuna.ats.arjuna Transaction Reaper. The aborted transaction was processing the OVFs from the stateless snapshots:

(...)
2018-05-16 17:48:25,427+02 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 43) WFLYUT0021: Registered web context: '/ovirt-engine/api' for server 'default-server'
2018-05-16 17:48:26,196+02 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 56) WFLYUT0021: Registered web context: '/ovirt-engine/sso' for server 'default-server'
2018-05-16 17:53:25,338+02 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff939c027c:-60225362:5afc52c0:1b4 in state  RUN
2018-05-16 17:53:25,338+02 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper) ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff939c027c:-60225362:5afc52c0:1b7 in state  RUN
2018-05-16 17:53:25,340+02 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012121: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 0,5,main] successfully canceled TX 0:ffff939c027c:-60225362:5afc52c0:1b4
2018-05-16 17:53:25,340+02 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012095: Abort of action id 0:ffff939c027c:-60225362:5afc52c0:1b7 invoked while multiple threads active within it.
2018-05-16 17:53:25,342+02 WARN  [com.arjuna.ats.arjuna] (Transaction Reaper Worker 0) ARJUNA012381: Action id 0:ffff939c027c:-60225362:5afc52c0:1b7 completed with multiple threads - thread ServerService Thread Pool -- 45 was in progress with com.sun.org.apache.xml.internal.dtm.ref.dom2dtm.DOM2DTM.nextNode(DOM2DTM.java:524)
com.sun.org.apache.xml.internal.dtm.ref.dom2dtm.DOM2DTM.getHandleFromNode(DOM2DTM.java:680)
com.sun.org.apache.xml.internal.dtm.ref.dom2dtm.DOM2DTM.getHandleOfNode(DOM2DTM.java:732)
com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault.getDTMHandleFromNode(DTMManagerDefault.java:575)
com.sun.org.apache.xpath.internal.XPathContext.getDTMHandleFromNode(XPathContext.java:183)
com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:306)
com.sun.org.apache.xpath.internal.jaxp.XPathImpl.eval(XPathImpl.java:206)
com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:271)
org.ovirt.engine.core.utils.ovf.xml.XmlNode.selectSingleNode(XmlNode.java:29)
org.ovirt.engine.core.utils.ovf.OvfReader.selectSingleNode(OvfReader.java:1128)
org.ovirt.engine.core.utils.ovf.OvfReader.readVmDevice(OvfReader.java:269)
org.ovirt.engine.core.utils.ovf.OvfReader.readManagedVmDevice(OvfReader.java:209)
org.ovirt.engine.core.utils.ovf.OvfReader.readCdItem(OvfReader.java:419)
org.ovirt.engine.core.utils.ovf.OvfReader.readHardwareSection(OvfReader.java:367)
org.ovirt.engine.core.utils.ovf.OvfReader.readGeneralData(OvfReader.java:559)
org.ovirt.engine.core.utils.ovf.OvfReader.buildVirtualSystem(OvfReader.java:205)
org.ovirt.engine.core.utils.ovf.IOvfBuilder.build(IOvfBuilder.java:30)
org.ovirt.engine.core.utils.ovf.OvfManager.importVm(OvfManager.java:58)
org.ovirt.engine.core.bll.storage.ovfstore.OvfHelper.readVmFromOvf(OvfHelper.java:71)
org.ovirt.engine.core.bll.snapshots.SnapshotsManager.getVmConfigurationInStatelessSnapshotOfVm(SnapshotsManager.java:680)
org.ovirt.engine.core.bll.network.macpool.MacsUsedAcrossWholeSystem$$Lambda$756/1208388850.apply(Unknown Source)
(...)

There are 1969 stateless snapshots in the DB:

engine=# select count(vm_id) from snapshots where snapshot_type='STATELESS';
 count 
-------
  1969


How reproducible:

Always


Steps to Reproduce:

1. Create a large number of stateless snapshots
2. Restart ovirt-engine

(Originally by Sergio Lopez Pascual)

Comment 1 RHV bug bot 2018-06-03 07:11:27 UTC
With this amount of snapshots (~2000), if each conversion in the code below takes 150 ms, we are already reaching the default TX timeout.

The code fetches and processes each OVF sequentially:
 
 Stream<VM> statelessSnapshotsOfRunningVMs =
   idsOfRunningStatelessVMs.map(snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm)
       .filter(Optional::isPresent)
       .map(Optional::get);


Generally it's a perfect candidate for parallel-stream handling.
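
As an editorial illustration of that suggestion (not the actual patch), the sequential stream above can be parallelized by switching to a parallel stream. Everything here is a hypothetical stand-in: readVmConfiguration mimics snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm, and the VM ids are fake.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelOvfSketch {
    // Stand-in for the per-snapshot OVF parse; the real call is
    // snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm.
    static Optional<String> readVmConfiguration(int vmId) {
        return vmId % 2 == 0 ? Optional.of("vm-" + vmId) : Optional.empty();
    }

    public static void main(String[] args) {
        // Adding .parallel() fans the per-OVF work out over the
        // ForkJoin common pool instead of running it sequentially.
        List<String> vms = IntStream.range(0, 2000)
                .boxed()
                .parallel()
                .map(ParallelOvfSketch::readVmConfiguration)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
        System.out.println(vms.size());
    }
}
```

With ~2000 elements the per-element parse cost is then divided across the worker threads rather than accumulating inside a single transaction.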




Another question: how did they reach this amount of stale snapshots? As an easy fix, I guess you can increase the TX timeout and (hopefully) remove all those unneeded snapshots for now.

To increase the TX timeout, add this:
  <coordinator-environment default-timeout="600"/>

under the element:
  <subsystem xmlns="urn:jboss:domain:transactions:3.0">

in /etc/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in

(Originally by Roy Golan)

Comment 10 RHV bug bot 2018-06-03 07:12:04 UTC
(In reply to Roy Golan from comment #1)
> With this amount of snapshots (~2000), if each conversion in the code
> below takes 150 ms, we are already reaching the default TX timeout.
> 
> The code fetches and processes each OVF sequentially:
> 
>  Stream<VM> statelessSnapshotsOfRunningVMs =
>    idsOfRunningStatelessVMs.map(snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm)
>        .filter(Optional::isPresent)
>        .map(Optional::get);
> 
> 
> Generally it's a perfect candidate for parallel-stream handling.

I basically agree, but there's one problem with this approach: currently all of the multithreaded handling in the engine is done via the thread pool. Parallel streaming will use a number of threads at its own discretion and might exhaust the available threads. What do you suggest then, Roy?

(Originally by Tal Nisan)

Comment 13 RHV bug bot 2018-06-03 07:12:18 UTC
> I basically agree, but there's one problem with this approach: currently
> all of the multithreaded handling in the engine is done via the thread
> pool. Parallel streaming will use a number of threads at its own
> discretion and might exhaust the available threads. What do you suggest
> then, Roy?

There are two reasons this is not a big concern:
1. Parallel streams use the ForkJoin common pool, which defaults to the number of cores the JVM sees (Runtime.getRuntime().availableProcessors()).
2. We are dealing with engine startup here - Wildfly is waiting anyway for the MacPoolPerCluster bean, which is the root of this invocation, to finish its initialization.

As I see it this is the cheapest and safest solution at the moment.
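
The two points above can be sketched in a small, self-contained program (an editorial illustration, not part of the patch): the first print shows the common pool's parallelism, and the dedicated-pool trick shows a well-known ForkJoin idiom for bounding a parallel stream's thread usage if pool exhaustion were still a concern.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class BoundedParallelismSketch {
    public static void main(String[] args) throws Exception {
        // Point 1: parallel streams run on the ForkJoin common pool, whose
        // parallelism is derived from the cores the JVM sees.
        System.out.println("commonPoolParallelism=" + ForkJoinPool.getCommonPoolParallelism());

        // Illustration only: submitting the terminal operation from inside a
        // dedicated, fixed-size ForkJoinPool confines the parallel stream's
        // work to that pool, capping the threads it can consume.
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            long sum = pool.submit(() -> LongStream.rangeClosed(1, 100).parallel().sum()).get();
            System.out.println("sum=" + sum);
        } finally {
            pool.shutdown();
        }
    }
}
```

The dedicated-pool behavior relies on how parallel streams pick their executor, so it is a pragmatic idiom rather than a documented guarantee; for the startup path discussed here, the common pool was judged sufficient.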

(Originally by Roy Golan)

Comment 14 RHV bug bot 2018-06-03 07:12:21 UTC
If it's not a big concern then I'm for it. The patch is straightforward and simple; targeting 4.2.4 so that 4.1.z customers get this fix on upgrade to 4.2.z.

(Originally by Tal Nisan)

Comment 18 Ilan Zuckerman 2018-06-19 13:26:31 UTC
Copying my comment from the upstream BZ https://bugzilla.redhat.com/show_bug.cgi?id=1579008 (verified)

After adding an additional 500 VMs:
Total up VMs amount: 2314
Total amount of stateless snapshots: 2314

AVG restart time of the engine: 54 Sec

restart ovirt-engine:  Mon Jun 18 10:15:22 IDT 2018
finish: 2018-06-18 10:16:15
total: 53 Sec

restart ovirt-engine:  Mon Jun 18 10:18:15 IDT 2018
finish: 2018-06-18 10:19:10
total: 55 Sec

restart ovirt-engine:  Mon Jun 18 10:21:45 IDT 2018
finish: 2018-06-18 10:22:39
total: 54 Sec


Restart with Byteman (as suggested by Roy Golan in the upstream BZ):
Total: 38 Sec

2018-06-18 11:22:36,330+03 INFO  [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - started getMacsForMacPool 1529310156329
2018-06-18 11:23:14,707+03 INFO  [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - ended getMacsForMacPool 1529310194706
2018-06-18 11:23:14,710+03 INFO  [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - started getMacsForMacPool 1529310194710
2018-06-18 11:23:14,711+03 INFO  [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - ended getMacsForMacPool 1529310194711

Comment 20 errata-xmlrpc 2018-06-27 10:02:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2071

Comment 21 Franta Kust 2019-05-16 13:09:42 UTC
BZ<2>Jira Resync

Comment 22 Daniel Gur 2019-08-28 13:15:23 UTC
sync2jira

Comment 23 Daniel Gur 2019-08-28 13:21:10 UTC
sync2jira

