Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1112684

Summary: [scale] Start VM on host is slow (max = 12 min) under load of 400 running vms
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yuri Obshansky <yobshans>
Component: ovirt-engine-restapi
Assignee: Liran Zelkha <lzelkha>
Status: CLOSED CANTFIX
QA Contact: Yuri Obshansky <yobshans>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: bazulay, ecohen, gklein, iheim, lpeer, michal.skrivanek, nsoffer, oourfali, oramraz, rbalakri, Rhev-m-bugs, yeylon, yobshans
Target Milestone: ---
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-04 07:50:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- RHEVM-3.4-UserPortal-Performance-Test-400TH-2014-06-17 report (flags: none)
- JMeter script (flags: none)
- Host 24 vdsm log (flags: none)
- Host 25 vdsm log (flags: none)
- Host 20 vdsm log (flags: none)
- Host 30 vdsm log (flags: none)
- Host 29 vdsm log (flags: none)
- Memory Usage sample data (flags: none)

Description Yuri Obshansky 2014-06-24 13:26:52 UTC
Description of problem:
Response time of the REST API call "start VM" is slow under a load of 400 concurrent threads.
Min - 205 ms ~ 0.003 min
Average - 76208 ms ~ 1.27 min
90% - 266367 ms ~ 4.44 min
Max - 769799 ms ~ 12.83 min
Response time degraded when the load was increased from 300 to 400 threads: 300 threads Max = 4 min -> 400 threads Max = 12 min.
The degradation started after 3 hours of test running (see attached report).
As a result, the number of performed REST API calls decreased compared to the 300-thread test.
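The millisecond-to-minute figures above can be sanity-checked with a trivial helper (illustrative only; the function name is not from the original test harness):

```python
def ms_to_min(ms: float) -> float:
    """Convert milliseconds to minutes."""
    return ms / 60_000

# Reported response times from the 400-thread run
print(round(ms_to_min(76208), 2))   # average, ~1.27 min
print(round(ms_to_min(266367), 2))  # 90th percentile, ~4.44 min
print(round(ms_to_min(769799), 2))  # max, ~12.83 min
```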

Version-Release number of selected component (if applicable):
RHEVM - 3.4.0-0.16.rc.el6ev
OS - RHEL - 6Server - 6.5.0.1.el6
Kernel - 2.6.32 - 431.5.1.el6.x86_64
KVM - 0.12.1.2 - 2.415.el6_5.6
Libvirt -libvirt-0.10.2-29.el6_5.5
VDSM - vdsm-4.14.7-3.el6ev

How reproducible:
100 %

Steps to Reproduce:
1. Run the JMeter load test RHEVM_3_4_PHX_USER_PORTAL_FLOW_400.jmx
2. Analyze the results
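For reference, a JMeter test plan like the one above would typically be run in non-GUI mode; a sketch, where the plan filename comes from the step above and the result/log filenames are illustrative:

```shell
# Run the test plan in non-GUI mode, writing sample results and the JMeter log
# for offline analysis of response times
jmeter -n -t RHEVM_3_4_PHX_USER_PORTAL_FLOW_400.jmx \
       -l results-400th.jtl -j jmeter-400th.log
```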

Actual results:


Expected results:


Additional info:
After a short investigation, we can assume that the problem is in vdsm.
Reference bug -> https://bugzilla.redhat.com/show_bug.cgi?id=861918

Comment 1 Yuri Obshansky 2014-06-24 13:28:12 UTC
Created attachment 911737 [details]
RHEVM-3.4-UserPortal-Performance-Test-400TH-2014-06-17 report

Comment 2 Yuri Obshansky 2014-06-24 13:29:01 UTC
Created attachment 911738 [details]
JMeter script

Comment 3 Yuri Obshansky 2014-06-25 08:44:37 UTC
Created attachment 911974 [details]
Host 24 vdsm log

Comment 4 Yuri Obshansky 2014-06-25 08:45:09 UTC
Created attachment 911975 [details]
Host 25 vdsm log

Comment 5 Yuri Obshansky 2014-06-25 08:46:21 UTC
Created attachment 911977 [details]
Host 20 vdsm log

Comment 6 Yuri Obshansky 2014-06-25 08:48:36 UTC
Created attachment 911978 [details]
Host 30 vdsm log

Comment 7 Yuri Obshansky 2014-06-25 08:49:33 UTC
Created attachment 911979 [details]
Host 29 vdsm log

Comment 8 Nir Soffer 2014-06-29 08:16:33 UTC
We need more info:

- vdsm cpu graph: what is 40? 40% of a single core? 40% of all cores?
- How many cores are in the host?
- What was the CPU usage of other processes? We have 400 qemu processes running, right?
- What is the memory usage of other processes on the host?
- Is the host overloaded and swapping to disk?
- The vdsm memory graph does not make sense. Where is the sample data? How was it collected?
- If I'm reading the TCP graph correctly, we seem to have about 800 connections near the end of the test. Can you explain what each color in the graph means, and how this data was collected?
- What VMs are running? Idle?
- What type of storage is used?
- How many storage domains are used?

Comment 9 Yuri Obshansky 2014-06-29 09:24:41 UTC
All the information can be found in the attached "RHEVM-3.4-UserPortal-Performance-Test-400TH-2014-06-17 report".
Anyway, here it is:
- vdsm cpu graph: the monitoring for vdsm was running on the engine machine, and the CPU usage graph shows the process's CPU usage percentage across all cores
- 24 x Intel(R) Xeon(R) CPU E5-2630 @ 2.00GHz – RAM 64 G
CPU Sockets: 2, CPU Cores per Socket: 6, CPU Threads per Core: 2
- See the attached report, section 1.4, Summary statistics table. I didn't check how many qemu processes were running.
- I didn't monitor the hosts.
- I didn't monitor the hosts.
- It was collected using the JMeter PerfMon plugin (based on the Sigar API). Monitor_MEM.csv with the sample data is attached to the bug.
- BLUE - CLOSE_WAIT, PINK - ESTABLISHED, RED - TIME_WAIT. JMeter PerfMon plugin.
- Physical VMs with kernel and guest-agent installed
- NFS
- 2 SDs

Comment 10 Yuri Obshansky 2014-06-29 09:25:45 UTC
Created attachment 913148 [details]
Memory Usage sample data

Comment 11 Michal Skrivanek 2014-06-30 08:39:27 UTC
Hi Yuri,
how quickly are you starting the VMs? From the logs it looks like it's sequential, with a delay of 80 s between create calls?

Comment 12 Yuri Obshansky 2014-06-30 08:46:05 UTC
It is not a static value. Each thread performs the following calls in a cycle: various getInfo calls, shutdown VM, wait until the VM is down, start VM, wait until it is up. So the timing is quite random; it depends on the response times of the previous calls. What is static is the thread ramp-up delay of 10 sec, i.e. each thread starts 10 sec after the previous one.
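The per-thread cycle described above can be sketched roughly as follows. This is a minimal illustration, not the actual JMeter script: the endpoint paths follow the RHEV 3.x REST API style (/api/vms/{id}, /start, /shutdown), and the `api` client, VM ids, and poll interval are placeholders.

```python
import time

RAMP_UP_DELAY = 10  # seconds between thread starts, as described in the comment


def ramp_up_offset(thread_index: int, delay: int = RAMP_UP_DELAY) -> int:
    """Start-time offset for the n-th thread (0-based) under a fixed ramp-up delay."""
    return thread_index * delay


def user_portal_cycle(api, vm_id: str, poll_interval: float = 5.0) -> None:
    """One iteration of the loop each load-test thread runs.

    `api` is a placeholder client exposing get/post against the RHEV REST API;
    each step's timing depends on the previous call's response time.
    """
    api.get(f"/api/vms/{vm_id}")            # various getInfo calls
    api.post(f"/api/vms/{vm_id}/shutdown")  # shut the VM down
    while api.get(f"/api/vms/{vm_id}")["status"] != "down":
        time.sleep(poll_interval)           # wait until the VM is down
    api.post(f"/api/vms/{vm_id}/start")     # start it again
    while api.get(f"/api/vms/{vm_id}")["status"] != "up":
        time.sleep(poll_interval)           # wait until the VM is up
```

Under this scheme the 400th thread only starts roughly 400 × 10 s ≈ 66 min into the run, which matches the degradation appearing hours into the test rather than immediately.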

Comment 14 Liran Zelkha 2014-09-02 09:11:38 UTC
Yuri - can you provide more info?
1. Enclose the jstack output and logs of the engine
2. Specify how many VMs are running on each host

Comment 15 Liran Zelkha 2014-09-02 09:50:26 UTC
Yuri - can you provide more info?
1. Enclose the jstack output and logs of the engine
2. Specify how many VMs are running on each host

Comment 16 Yuri Obshansky 2014-09-02 10:56:04 UTC
1. I don't have the environment right now.
I'll prepare the jstack output and engine logs when the environment is ready.
2. 400 VMs were running on 6 hosts, so ~66 VMs per host.

Comment 17 Liran Zelkha 2014-09-15 18:39:25 UTC
Hi Yuri - any updates?

Comment 18 Yuri Obshansky 2014-09-16 06:51:40 UTC
The environment is still not ready, since I encountered several new issues/bugs.
I hope the environment will be ready by the end of the week.

Comment 19 Liran Zelkha 2014-10-23 02:39:22 UTC
Yuri - any additional info on this? Otherwise, let's close this bug.

Comment 20 Yuri Obshansky 2014-11-04 07:50:14 UTC
Let's close this bug. We cannot reproduce it on RHEV-M 3.5, because there we detected performance degradation already at the first test iteration (50 threads).
https://bugzilla.redhat.com/show_bug.cgi?id=1155146