Bug 1585456
| Summary: | [downstream clone - 4.2.4] ovirt-engine fails to start when having a large number of stateless snapshots | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
| Component: | ovirt-engine | Assignee: | Roy Golan <rgolan> |
| Status: | CLOSED ERRATA | QA Contact: | mlehrer |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.1.10 | CC: | dagur, gveitmic, izuckerm, lsurette, lsvaty, lveyde, pvilayat, rbalakri, rgolan, Rhev-m-bugs, slopezpa, srevivo, tnisan, ykaul |
| Target Milestone: | ovirt-4.2.4 | Keywords: | Performance, ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ovirt-engine-4.2.4.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1579008 | Environment: | |
| Last Closed: | 2018-06-27 10:02:42 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1579008 | ||
| Bug Blocks: | |||
Description
RHV bug bot
2018-06-03 07:11:17 UTC
With this many snapshots (~2000), if each conversion in the code below[1] takes 150 ms, we already reach the default TX timeout (2000 × 150 ms = 300 s).
The code is sequentially getting each OVF and processing it:
Stream<VM> statelessSnapshotsOfRunningVMs =
        idsOfRunningStatelessVMs.map(snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm)
                .filter(Optional::isPresent)
                .map(Optional::get);
Generally, it's a perfect candidate for parallel stream handling.
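The suggestion can be illustrated with a minimal, self-contained sketch. It does not use engine code: `loadConfiguration` is a hypothetical stand-in for snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm, the VM ids are plain integers, and the 5 ms sleep is an assumed per-OVF cost rather than a measurement. The only difference between the two runs is the call to `parallel()` on the stream; this is not necessarily how the actual patch was written.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelSnapshotLoad {

    // Hypothetical stand-in for the per-VM OVF lookup and conversion.
    // Returns empty for every tenth id to mimic a VM with no usable configuration.
    static Optional<String> loadConfiguration(int vmId) {
        try {
            Thread.sleep(5); // assumed cost of one conversion, for illustration only
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return vmId % 10 == 0 ? Optional.empty() : Optional.of("vm-" + vmId);
    }

    // Same map/filter/map pipeline as in the report, optionally switched to parallel.
    static List<String> loadAll(List<Integer> vmIds, boolean parallel) {
        Stream<Integer> ids = vmIds.stream();
        if (parallel) {
            ids = ids.parallel(); // runs the pipeline on the ForkJoin common pool
        }
        return ids.map(ParallelSnapshotLoad::loadConfiguration)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.rangeClosed(1, 200).boxed().collect(Collectors.toList());
        for (boolean parallel : new boolean[] {false, true}) {
            long start = System.nanoTime();
            List<String> result = loadAll(ids, parallel);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println((parallel ? "parallel" : "sequential") + ": "
                    + result.size() + " configurations in " + elapsedMs + " ms");
        }
    }
}
```

Because the source list is ordered and the result is collected with `Collectors.toList()`, both variants return the same elements in the same order; only the elapsed time should differ, and by how much depends on the number of cores available.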
Another question: how did they end up with this many stale snapshots? As an easy fix for now, I guess you can just increase the TX timeout and (hopefully) remove all those unneeded snapshots.
To increase the TX timeout, add this:
<coordinator-environment default-timeout="600"/>
under element:
<subsystem xmlns="urn:jboss:domain:transactions:3.0">
in /etc/ovirt-engine/services/ovirt-engine/ovirt-engine.xml.in
(Originally by Roy Golan)
(In reply to Roy Golan from comment #1)
> With this amount of snapshots(~2000), if each conversion of the code
> below[1] takes 150 ms we already are reaching the default TX timeout.
>
> The code is sequentially getting each OVF and processing it:
>
> Stream<VM> statelessSnapshotsOfRunningVMs =
> idsOfRunningStatelessVMs.map(snapshotsManager::getVmConfigurationInStatelessSnapshotOfVm)
> .filter(Optional::isPresent)
> .map(Optional::get);
>
> Generally it's a perfect candidate for parallel stream handling.
I basically agree, but there's one problem with this approach: currently all of the multithreaded handling in Engine is done via the thread pool. Parallel streaming will use a number of threads at its discretion and might exhaust the number of threads. What do you suggest then, Roy?
(Originally by Tal Nisan)
> I basically agree, but there's one problem with this approach: currently
> all of the multithreaded handling in Engine is done via the thread pool.
> Parallel streaming will use a number of threads at its discretion and
> might exhaust the number of threads. What do you suggest then, Roy?
There are 2 reasons this is not a big concern:
1. Parallel streams use the ForkJoin common pool, whose size defaults to roughly the number of cores the JVM sees (Runtime.getRuntime().availableProcessors()).
2. We are dealing with engine startup here. WildFly is waiting anyway for the MacPoolPerCluster bean, which is the root of this invocation, to finish its initialization.
As I see it, this is the cheapest and safest solution at the moment.
(Originally by Roy Golan)
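The first point above can be checked with a small standalone sketch (not engine code; the class and method names are made up for illustration). It records which threads actually execute a parallel stream and compares that with the common pool's configured parallelism. With default settings the common pool's parallelism is `availableProcessors() - 1` (minimum 1), and the thread that starts the stream also takes part in the work, so the total is about one thread per core.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CommonPoolThreads {

    // Runs `tasks` short tasks in a parallel stream and returns the names
    // of the threads that executed them.
    static Set<String> workerThreadNames(int tasks) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, tasks).parallel().forEach(i -> {
            try {
                Thread.sleep(1); // keep each task busy long enough to spread the work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            names.add(Thread.currentThread().getName());
        });
        return names;
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors:    " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + ForkJoinPool.getCommonPoolParallelism());
        System.out.println("threads actually used:   " + workerThreadNames(2000).size());
    }
}
```

The pool size can be changed with the `java.util.concurrent.ForkJoinPool.common.parallelism` system property if the default turns out to be too aggressive for a given deployment.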
If it's not a big concern, I'm for it then. The patch is straightforward and simple. Targeting 4.2.4 so 4.1.z customers can have this fix on upgrade to 4.2.z.
(Originally by Tal Nisan)
Copying my comment from the upstream BZ https://bugzilla.redhat.com/show_bug.cgi?id=1579008 (verified)
After adding an additional 500 VMs:
Total up VMs amount: 2314
Total amount of stateless snapshots: 2314
AVG restart time of the engine: 54 Sec
restart ovirt-engine: Mon Jun 18 10:15:22 IDT 2018
finish: 2018-06-18 10:16:15
total: 53 Sec
restart ovirt-engine: Mon Jun 18 10:18:15 IDT 2018
finish: 2018-06-18 10:19:10
total: 55 Sec
restart ovirt-engine: Mon Jun 18 10:21:45 IDT 2018
finish: 2018-06-18 10:22:39
total: 54 Sec
Restart with byteman (as suggested by Roy Golan in the upstream BZ):
Total: 38 Sec
2018-06-18 11:22:36,330+03 INFO [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - started getMacsForMacPool 1529310156329
2018-06-18 11:23:14,707+03 INFO [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - ended getMacsForMacPool 1529310194706
2018-06-18 11:23:14,710+03 INFO [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - started getMacsForMacPool 1529310194710
2018-06-18 11:23:14,711+03 INFO [stdout] (ServerService Thread Pool -- 59) *** BYTEMAN - ended getMacsForMacPool 1529310194711
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:2071