Bug 1572078 - Starting and stopping VMs fails when executed with concurrency (Jmeter API)
Summary: Starting and stopping VMs fails when executed with concurrency (Jmeter API)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: bugs@ovirt.org
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-26 07:07 UTC by Ilan Zuckerman
Modified: 2020-03-09 09:01 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-21 14:24:23 UTC
oVirt Team: Infra
Embargoed:



Description Ilan Zuckerman 2018-04-26 07:07:46 UTC
Description of problem:
In our environment we have 400 hosts and 4,000 VMs.
We are testing the engine's response to a continuous stress scenario in which it handles roughly 12.8 concurrent stop/start VM API requests per minute (this rate is taken from the Jmeter aggregate report).

The load is generated by a Jmeter script. Each thread queries for VMs with status 'down' in one particular cluster, randomly chooses one VM from the list, sends an API call to start it, validates that it is up (by requesting its status), waits 5 minutes, then stops it, and waits another 5 minutes before querying the VMs again.
This loop runs for 8 hours with 35 concurrent threads.
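
For reference, the per-thread logic is roughly equivalent to the following sketch using the oVirt Python SDK (the engine URL, credentials and cluster name are placeholders; the actual load is driven by Jmeter, not by this script):

# Rough Python-SDK equivalent of one Jmeter thread's loop (illustrative only).
# Engine URL, credentials and cluster name are placeholders.
import random
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,  # test environment only
)
vms_service = connection.system_service().vms_service()

end = time.time() + 8 * 3600  # run the loop for 8 hours
while time.time() < end:
    # Query VMs with status 'down' in one particular cluster.
    down_vms = vms_service.list(search='cluster=cluster_L1 and status=down')
    if not down_vms:
        time.sleep(10)
        continue
    # Randomly choose one VM and start it.
    vm_service = vms_service.vm_service(random.choice(down_vms).id)
    vm_service.start()
    # Validate that it is up by polling its status.
    while vm_service.get().status != types.VmStatus.UP:
        time.sleep(5)
    time.sleep(300)   # wait 5 minutes
    vm_service.stop()
    time.sleep(300)   # wait another 5 minutes before querying again

connection.close()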

We saw that after approximately 2 hours of test execution, errors started to appear both in the WebAdmin 'Events' tab and in engine.log. The errors concern failures to start and stop VMs.
Looking into the libvirt and VDSM logs did not reveal any related errors.

It appears that these errors do not point to actual stop/start failures: as far as I could tell from Jmeter and the UI events, the VMs that appeared in the error log were able to start and stop on every occasion.
This assumption is based on the fact that the Jmeter script validates the VM status by querying it, and a look at the VM 'Events' tab (see screenshot in the GDrive link) shows that the VM did manage to shut down even though an error was thrown.

#Topology
Engine / Engine DB (1 machine)
DWH DB (separate machine)
400 Hosts
RHEL 7.5 3.10.0-858.el7.x86_64
ovirt-engine-4.2.2.2-0.1.el7.noarch

#Shared via drive (see private message)
- engine.log (during the test execution) for the following machines:
b01-h21 (engine)
b01-h31 (one host on which those errors occurred)
b02-h02 (SPM)

- nmon charts for: b02-h02 (SPM), b01-h21 (Engine)


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.2-0.1.el7.noarch

How reproducible:
Reproduced with Jmeter

Steps to Reproduce:
1. Set up Jmeter or any other tool to execute API calls that shut down and start VMs with some level of concurrency (a minimal Python sketch follows below).
2. Observe engine.log and the 'Events' tab in the web admin portal.
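
For reference, a minimal sketch of driving such a start/stop loop from multiple threads without Jmeter, assuming a start_stop_loop() function like the one sketched in the description (the thread count mirrors the 35 Jmeter threads used in the original test):

# Illustrative only: run the start/stop loop from 35 concurrent threads.
from concurrent.futures import ThreadPoolExecutor

def start_stop_loop(thread_id):
    # Placeholder for the per-thread start/stop loop sketched in the description.
    pass

with ThreadPoolExecutor(max_workers=35) as pool:
    futures = [pool.submit(start_stop_loop, i) for i in range(35)]
    for f in futures:
        f.result()  # propagate any exception raised in a worker thread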

Actual results:
After some time you will see errors regarding starting and stopping VMs.

Expected results:
Starting and stopping VMs should complete without errors, even when done with some level of concurrency.

Additional info:
Some of the errors regarding starting/stopping VMs (taken from engine.log):

2018-04-25 20:57:48,669Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-198) [53796e69-4733-473c-a0ae-73acdb72d4c7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM b01-h31-r620.rhev.openstack.engineering.redhat.com command DestroyVDS failed: Virtual machine does not exist: {'vmId': u'f5a64c03-1024-4ea4-af84-652791852ccd'}

2018-04-25 20:57:48,691Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-198) [53796e69-4733-473c-a0ae-73acdb72d4c7] EVENT_ID: USER_FAILED_STOP_VM(56), Failed to power off VM vm_L1_07-03_18 (Host: b01-h31-r620.rhev.openstack.engineering.redhat.com, User: admin@internal-authz).
2018-04-25 20:57:48,699Z ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-198) [] Operation Failed: [Desktop does not exist]

Comment 2 Yaniv Kaul 2018-04-26 16:26:48 UTC
That's it? Only two logs on failing to stop VMs?
What about the server.log, ovirt-request-log? Anything else in engine.log?

Where's the jmeter code? Are you using pipe-lining and multiplexing? (see https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/asynchronous_inventory.py for a Python based example)
Is the connection persistent? (see https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/connection_builder.py )
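
For reference, a persistent (and optionally pipelined) connection with the Python SDK looks roughly like the sketch below. The URL and credentials are placeholders, and the connections/pipeline tuning parameters may vary between SDK versions; see the two linked examples for the authoritative versions.

# Illustrative sketch of a persistent SDK connection; values are placeholders.
# See the connection_builder.py / asynchronous_inventory.py examples linked above.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
    connections=4,  # number of persistent HTTP connections (assumed parameter name)
    pipeline=4,     # pipelined requests per connection (assumed parameter name)
)

# The same connection object is reused for all requests, so the underlying
# HTTP connection(s) stay open instead of being re-established per call.
vms = connection.system_service().vms_service().list()
print(len(vms))
connection.close()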

Comment 3 Ilan Zuckerman 2018-04-29 06:15:42 UTC
There are more than two log entries about stopping VMs;
for example, the "Failed to power off VM" message appears 6 times throughout engine.log.

server.log is inside the sos-report archive. Looking at it around the time the errors appeared on the engine, we see errors such as:

2018-04-18 12:57:29,714Z ERROR [org.jboss.as.ejb3.invocation] (default task-76) WFLYEJB0034: EJB Invocation failed on component Backend for method public abstract org.ovirt.engine.core.common.queries.QueryReturnValue org.ovirt.engine.core.common.interfaces.BackendLocal.runQuery(org.ovirt.engine.core.common.queries.QueryType,org.ovirt.engine.core.common.queries.QueryParametersBase): javax.ejb.EJBException: java.lang.IllegalStateException

2018-04-18 12:57:29,730Z ERROR [io.undertow.servlet] (default task-76) Exception while dispatching incoming RPC call: com.google.gwt.user.server.rpc.UnexpectedException: Service method 'public abstract java.util.ArrayList org.ovirt.engine.ui.frontend.gwtservices.GenericApiGWTService.runMultipleQueries(java.util.ArrayList,java.util.ArrayList)' threw an unexpected exception: javax.ejb.EJBException: java.lang.IllegalStateException

There are also errors on starting VMs.
For example, grep for "Cannot run VM"; you can find one at 2018-04-25 20:51:48.

There is much more in engine.log, but most of it is not related to this issue (for example, "NoRouteToHostException").

I have added the Jmeter script to the "jmeter_run" folder in the shared GDrive folder (see private message for the link). Only the enabled "thread groups" were executed during the test.
As far as I understand, the connection is persistent; otherwise I could not perform API requests fluently.

Comment 4 Yaniv Kaul 2018-04-29 06:49:40 UTC
1. Does it happen only on scaled environment?
2. What build are you using?
ovirt-engine-4.2.2.2-0.1.el7.noarch
 or
ovirt-engine-4.2.3.2-0.1.el7.noarch

?

3. There's a 4.3GB(!) file called '--help' in the sos report. What is it?

4. May I suggest you use follow links? See for example (random):
For a specific VM, 4 calls:
[26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/nics HTTP/1.1" 614
[26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/reporteddevices HTTP/1.1" 359
[26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/snapshots HTTP/1.1" 327
[26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/diskattachments HTTP/1.1" 341

Whereas you could use https://www.ovirt.org/develop/release-management/features/infra/link-following/
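
For reference, link following would let a single request return the VM together with those sub-collections. Below is a hedged sketch with the Python SDK; the follow keyword and the exact link names are assumptions based on the feature page above, and the same effect can be achieved by appending ?follow=... to the raw REST URL.

# Illustrative: fetch a VM together with several sub-collections in one request
# via link following, instead of four separate GETs. The link names below are
# assumptions; consult the link-following feature page for the exact names.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    insecure=True,  # test environment only
)

vm_service = connection.system_service().vms_service().vm_service(
    '6dc92b8c-ee03-4101-b176-cc014b144312')
vm = vm_service.get(follow='nics,snapshots,disk_attachments,reported_devices')
print(len(vm.nics), len(vm.snapshots), len(vm.disk_attachments))
connection.close()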

Comment 5 Ilan Zuckerman 2018-04-29 10:47:28 UTC
(In reply to Yaniv Kaul from comment #4)
> 1. Does it happen only on scaled environment?
Not sure. We tested this only on our scaled environment.

> 2. What build are you using?
> ovirt-engine-4.2.2.2-0.1.el7.noarch
>  or
> ovirt-engine-4.2.3.2-0.1.el7.noarch
We use ovirt-engine-4.2.3.2-0.1.el7.noarch.

> 
> 3. There's a 4.3GB(!) file called '--help' in the sos report. What is it?
We didn't notice this in previous log-collector archives, but thanks for pointing this out. I'll keep an eye on this.
> 
> 4. May I suggest you use follow links? See for example (random):
> For a specific VM, 4 calls:
> [26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384
> "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/nics
> HTTP/1.1" 614
> [26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384
> "GET
> /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/reporteddevices
> HTTP/1.1" 359
> [26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384
> "GET /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/snapshots
> HTTP/1.1" 327
> [26/Apr/2018:05:24:26 +0000] 10.12.69.26 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384
> "GET
> /ovirt-engine/api/vms/6dc92b8c-ee03-4101-b176-cc014b144312/diskattachments
> HTTP/1.1" 341
> 
> Whereas you could use
> https://www.ovirt.org/develop/release-management/features/infra/link-
> following/
I understand that this methodology would avoid the multiple network round trips, but that way we might lose the purpose of multi-threaded testing, where we want to simulate multiple users trying to shut down / power on VMs simultaneously (or with some level of concurrency).

Comment 6 Martin Perina 2018-07-31 05:42:08 UTC
There are lots of unrelated errors in the log, are you able to reproduce it in stable environment?

Comment 7 Ilan Zuckerman 2018-07-31 07:40:48 UTC
The issue described above was reproduced on a stable, scaled-out environment.
Since reproducing it again requires resources and the current environment is being heavily used, I have not been able to prioritize reproducing it again.

Comment 8 Moran Goldboim 2018-08-27 18:23:17 UTC
(In reply to Ilan Zuckerman from comment #7)
> The issue described above was reproduced on a stable, scaled-out environment.
> Since reproducing it again requires resources and the current environment is
> being heavily used, I have not been able to prioritize reproducing it again.

Ilan, have you tried experimenting on a lower-scale environment and increasing the concurrency numbers if needed? That could help us get a cleaner view of the errors you are seeing.

thanks.

Comment 9 Martin Perina 2019-01-21 14:24:23 UTC
We cannot proceed with this bug without direct access to the tested scale environment. Feel free to reopen the bug if the issue is reproduced, and let's schedule a dedicated session to debug it.

