Easy fix addressing a significant slowdown in a frequent API call.

+++ This bug was initially created as a clone of Bug #1102147 +++

Description of problem:
Host state is not stable after population of 200 VMs.
Every 4-5 minutes the host switches to Connecting state.

2014-05-28 14:01:55,464 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-32) Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-28 14:01:55,542 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-32) Failed to refresh VDS , vds = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc : HOST-REAL, VDS Network Error, continuing.java.util.concurrent.TimeoutException

occurs in engine.log, and

Thread-886::ERROR::2014-05-28 13:22:28,627::SecureXMLRPCServer::168::root::(handle_error) client ('10.35.3.13', 59944)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 331, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 223, in do_POST
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: bad write retry

occurs in vdsm.log.

Version-Release number of selected component (if applicable):
RHEVM: 3.4.0-0.16.rc.el6ev
OS Version: RHEV Hypervisor - 6.5 - 20140313.1.el6ev
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.2-0.4.el6ev

How reproducible:

Steps to Reproduce:
1. Create 1 DC, 1 Cluster, 1 Host
2. Connect to iSCSI SD
3. Create 200 VMs

Actual results:
See above.

Expected results:
Host is stable.

Additional info:

--- Additional comment from Yuri Obshansky on 2014-05-28 16:36:45 CEST ---

--- Additional comment from Eldad Marciano on 2014-05-28 16:49:11 CEST ---

+1 reproduced on 3.3 too

--- Additional comment from Barak on 2014-05-29 14:28:01 CEST ---

Yuri:
- At what rate were the VMs created?
- In bulks? All together?

--- Additional comment from Barak on 2014-05-29 14:30:46 CEST ---

Danken - does it smell like M2C?

--- Additional comment from Piotr Kliczewski on 2014-05-29 15:15:10 CEST ---

It looks like it is similar to [1]. Can we get the size of the data to be sent as a response to GetAllVmStatsVDSCommand?

[1] http://rt.openssl.org/Ticket/Display.html?id=598

--- Additional comment from Yuri Obshansky on 2014-05-29 15:54:26 CEST ---

Barak - one by one, no bulks.
Piotr - how can I measure the size of the data?

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:10:54 CEST ---

The easiest way to get the response size is to log or print len(response). You can modify [1] and log the length just before the write in the do_POST method.

[1] /usr/lib64/python2.6/site-packages/vdsm/utils.py
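[Editor's note] For illustration, a minimal standalone sketch of that kind of instrumentation: a do_POST handler that logs the response size just before writing it. This is a toy Python 2 BaseHTTPServer handler, not the actual vdsm/utils.py code; the dispatch method and port are made up.

import logging
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("request.size")


class SizeLoggingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.getheader('content-length') or 0)
        body = self.rfile.read(length)
        response = self.dispatch(body)  # placeholder for the real XML-RPC dispatch
        # The suggested instrumentation: record the payload size right
        # before it is written to the (possibly SSL-wrapped) socket.
        log.debug("response size: %d bytes", len(response))
        self.send_response(200)
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

    def dispatch(self, body):
        return body  # echo back; only here to make the example runnable


if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8099), SizeLoggingHandler).serve_forever()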
--- Additional comment from Dan Kenigsberg on 2014-05-29 16:18:26 CEST ---

It might be an m2crypto issue, but I doubt that, as it happens every 4 minutes plus 5 or 10 seconds. It's a bit too periodic. It might be a matter of Engine choking up the connection periodically - can you check if there's a spike in traffic during the time of the errors?

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:28:55 CEST ---

It looks like a Quartz job on the engine tries to get all VM stats every couple of minutes and for some reason never receives an answer. After some time it recovers and tries again. I think we need to understand why writing the response fails on the vdsm side.

--- Additional comment from Piotr Kliczewski on 2014-05-29 16:48:27 CEST ---

There is no source on this host machine, so we can't modify it to see the size. I will try to recreate the bug by playing with the size of the response.

--- Additional comment from Yaniv Bronhaim on 2014-06-02 11:53:08 CEST ---

It's a matter of load. Running "vdsClient getAllVmStats" on that host (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is one issue that must be solved, and it sounds like a regression. From the engine's perspective, 4 minutes will always hit the timeout; we can't do much about it. VDSM needs to respond faster.

--- Additional comment from Yaniv Bronhaim on 2014-06-02 12:12:13 CEST ---

  PID USER  PR  NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
23221 vdsm   0 -20 32.2g 900m   9m S 1066.6  0.2 110:59.49 vdsm
61074 root  20   0 7290m  31m 8112 S  114.9  0.0  12899:21 libvirtd
29402 root  20   0 17092 3356 1004 R   21.0  0.0   0:30.96 top
29582 qemu  20   0 1726m 1.2g 6364 S    8.3  0.3 439:08.29 qemu-kvm
...

Not sure it's supposed to work... have we ever checked 200*3 Python threads at the same time?

--- Additional comment from Piotr Kliczewski on 2014-06-02 12:37:57 CEST ---

The response size is over 370k, so following the issue from the link, it seems that while this huge response is being sent, new messages are received and sent, so the write retry fails. There are similar issues [1][2] that people see with openssl.

[1] http://stackoverflow.com/questions/20715408/solved-pyopenssl-ssl3-write-pendingbad-write-retry-return-self-sslobj-write
[2] http://stackoverflow.com/questions/14695247/in-php-openssl-error-messages-error-1409f07f-ssl-routines-ssl3-write-pendin
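[Editor's note] For background on the "bad write retry" failure: OpenSSL raises SSL3_WRITE_PENDING / bad write retry when a partial or blocked SSL write is retried with a different buffer or length. A minimal sketch of the retry discipline a writer has to follow, using the standard-library ssl module on a modern Python purely for illustration; the endpoint is hypothetical and this is not the M2Crypto/VDSM code path.

import socket
import ssl


def send_all(sock, payload):
    """Write payload over an SSL socket, retrying short writes correctly.

    After a write fails with SSL_ERROR_WANT_WRITE, OpenSSL requires the
    retry to pass the *same* buffer and length (unless
    SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER is set); handing it a different
    buffer is what produces "bad write retry".
    """
    view = memoryview(payload)
    while len(view):
        try:
            sent = sock.send(view)
            view = view[sent:]  # advance only after a successful write
        except ssl.SSLWantWriteError:
            # Not writable yet: loop and retry with the SAME remaining buffer.
            # Do not rebuild, reorder or interleave other pending data here.
            continue


if __name__ == '__main__':
    # Tiny demo against a public HTTPS endpoint (hypothetical choice); with a
    # blocking socket the except branch is rarely hit, but the discipline holds.
    ctx = ssl.create_default_context()
    conn = ctx.wrap_socket(socket.create_connection(('example.org', 443)),
                           server_hostname='example.org')
    send_all(conn, b'HEAD / HTTP/1.0\r\nHost: example.org\r\n\r\n')
    print(conn.recv(200))
    conn.close()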
--- Additional comment from Piotr Kliczewski on 2014-06-02 14:30:00 CEST ---

--- Additional comment from Yaniv Bronhaim on 2014-06-02 15:13:34 CEST ---

The host in question has 400 GB RAM and 32 CPU cores. What limitation do we have on such hosts? Do we allow running 400 VMs, each with 1 GB RAM? Don't we need a limitation based on CPUs as well? We are currently testing what the maximum number of VMs is that this host can run before the response becomes so slow that the engine throws a timeout.

--- Additional comment from Michal Skrivanek on 2014-06-02 15:40:25 CEST ---

Adding Francesco. Could getBalloonInfo be of help? I suppose with 200 VMs eliminating the dom access may help, at least with the responsiveness of the getstats call. Can we try VMs without balloon? Worth a try...
@Yaniv, 32 GB virtual size is just ill, any idea why?

--- Additional comment from Yuri Obshansky on 2014-06-03 11:20:29 CEST ---

We started to reproduce the issue and find the limit. We cleaned up the RHEVM, reinstalled vdsm on the host machine and started populating VMs from 0 until we got the exception. It happened on the 182nd VM.

2014-06-03 12:13:42,876 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-7) [17f97fa8] Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

--- Additional comment from Dan Kenigsberg on 2014-06-03 11:51:00 CEST ---

(In reply to Yaniv Bronhaim from comment #15)
> don't we need to have a limitation based on cpus also?

I don't know if such a limitation exists on the Engine side. In my opinion, it should not, as CPU over-commitment is one of the benefits of virtualization. If the guests do nothing, and Vdsm (and the rest of the management stack) is responsive - we should accept more VMs.

--- Additional comment from Francesco Romani on 2014-06-03 14:33:33 CEST ---

(In reply to Yaniv Bronhaim from comment #11)
> It's a matter of load. Running "vdsClient getAllVmStats" on that host
> (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is
> one issue that must be solved, and it sounds like a regression. From the
> engine's perspective, 4 minutes will always hit the timeout; we can't do
> much about it. VDSM needs to respond faster.

For the response time, this patch may provide some relief: http://gerrit.ovirt.org/#/c/27696/5

I'm all for making things faster, but we'll need some profiling data for this, and this will in turn require VDSM >= 4.15 (yappi support was merged well after 4.14).

Lastly, the amount of memory consumed is indeed worrisome. I'm thinking of adding optional support for memory profiling as well.
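[Editor's note] Since yappi keeps coming up: a minimal, generic sketch of per-thread CPU profiling with yappi. This is not the VDSM integration; the worker function, thread count and output path are made up.

import threading
import time

import yappi  # pip install yappi


def worker(n):
    # Hypothetical stand-in for a stats-gathering thread.
    total = 0
    for i in range(n):
        total += i * i
    time.sleep(0.1)
    return total


if __name__ == '__main__':
    yappi.start()  # profiles all threads, not just the main one

    threads = [threading.Thread(target=worker, args=(200000,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    yappi.stop()
    # Per-function and per-thread views; save in pstat format for offline analysis.
    yappi.get_func_stats().print_all()
    yappi.get_thread_stats().print_all()
    yappi.get_func_stats().save('vdsm-profile.pstat', type='pstat')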
--- Additional comment from Dima Kuznetsov on 2014-06-11 16:05:13 CEST ---

Yuri, I've tried Francesco's patch on my humble i7 quad-core Dell with 125 faqemu VMs, and it got the vdsClient getAllVmStats response time from 6 minutes down to 1 second (!). Can you try it?

--- Additional comment from Yuri Obshansky on 2014-06-11 16:15:05 CEST ---

Yes, sure. I'm preparing the environment for bug verification right now. Send me the patch and a guide on how to install it by e-mail. I think we should first reproduce the bug, and only after that apply the patch and try to reproduce again.

--- Additional comment from Francesco Romani on 2014-06-13 09:59:12 CEST ---

VDSM patch backported here: http://gerrit.ovirt.org/#/c/28686/1

--- Additional comment from Michal Skrivanek on 2014-06-13 15:23:17 CEST ---

I don't think the patch was meant to steal the bug completely.
Also @Yuri, be careful when answering on a bug with multiple needinfos :)

(In reply to Dan Kenigsberg from comment #18)
> to have a limitation based on cpus also?
There's no such limitation. +1 on being able to absorb more.

(In reply to Yaniv Bronhaim from comment #12)
> have we ever checked 200*3 python's threads at same time?
Why *3? Also please see comment #16.

--- Additional comment from Yaniv Bronhaim on 2014-06-15 08:58:41 CEST ---

I don't know the reason for this huge virtual memory footprint, but it still looks like Francesco's patch has a big improvement impact on it. Dima will share profiling output and usage status soon for further investigation.

--- Additional comment from Michal Skrivanek on 2014-06-17 09:23:53 CEST ---

Do we want to do anything else here except the balloon? If not, I'd take it under virt and put Francesco in charge of backports...

--- Additional comment from Yaniv Bronhaim on 2014-06-17 10:12:45 CEST ---

The backport is already pushed but not verified [1]. Dima is working with Yuri to find the limitations and see whether the patch helps in his environment. If we find out that Francesco's improvement is enough to fix the communication issue, I'll forward you the bug to continue.

[1] http://gerrit.ovirt.org/#/c/28686

--- Additional comment from Yaniv Bronhaim on 2014-06-17 14:00:00 CEST ---

Looks like the patch fixes the issue for this number of VMs on the same environment; it brings the getAllVmStats response down to 30 seconds.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:06:01 CEST ---

The applied patch has resolved our problem with 200 VMs. The host is stable, no exceptions.
But: we tried, unsuccessfully, to create an additional 100 VMs (increasing the number of VMs up to 300). The engine (or vdsm) started to lose VMs; every minute a VM changed state from Up to Not Responding and vice versa. I found only this info message in engine.log:

2014-06-18 11:17:47,316 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-46) [a94764c] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from Up --> NotResponding
2014-06-18 11:17:47,324 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-46) [a94764c] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM scale_108 is not responding.
............
2014-06-18 11:18:10,845 INFO [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-41) [2a9e7042] VM scale_108 c03cdb08-bc8c-4649-b034-617492f87b71 moved from NotResponding --> Up

Looks like 200 VMs is the limit.
See attached new log files.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:06:58 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:07:32 CEST ---

--- Additional comment from Francesco Romani on 2014-06-18 14:28:41 CEST ---

Hi, the VDSM log is too tiny, there isn't much to see here (it covers ~18 seconds!). Is it possible to have more VDSM logs?

We'd probably greatly benefit from profiling data as Dan suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c19

--- Additional comment from Yuri Obshansky on 2014-06-18 14:45:21 CEST ---

Yes, sure. Uploaded 5 VDSM log files.

--- Additional comment from Yuri Obshansky on 2014-06-18 14:45:56 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:46:28 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:46:59 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:47:34 CEST ---

--- Additional comment from Yuri Obshansky on 2014-06-18 14:48:03 CEST ---

--- Additional comment from Michal Skrivanek on 2014-06-18 16:41:50 CEST ---

It looks like a nicely overloaded system. Can we get a sar file, or at least a load value? It may just be that we're simply over the limit the host can handle...
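[Editor's note] Where sar data isn't already being collected, a small sampler like the following could record the load average alongside the test run. The interval and output path are made up; this is only a sketch, not part of VDSM.

import os
import time

INTERVAL = 10            # seconds between samples; hypothetical value
OUTPUT = 'loadavg.log'   # hypothetical output path


def sample_load(interval=INTERVAL, output=OUTPUT):
    """Append timestamped 1/5/15-minute load averages, a poor man's `sar -q`."""
    with open(output, 'a') as f:
        while True:
            one, five, fifteen = os.getloadavg()
            f.write('%s %.2f %.2f %.2f\n' % (
                time.strftime('%Y-%m-%d %H:%M:%S'), one, five, fifteen))
            f.flush()
            time.sleep(interval)


if __name__ == '__main__':
    sample_load()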
*** Bug 1102814 has been marked as a duplicate of this bug. ***
Bug verification failed. I ran a simple script which performs the getAllVmStats command and measures the response time. See the results:

getAllVmStats response time w/o patch:
  min     37.319 sec
  average 46.398 sec
  max     52.005 sec

getAllVmStats response time with patch:
  min     33.058 sec
  average 48.268 sec
  max     56.701 sec
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev
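[Editor's note] The measurement script itself isn't attached; a minimal sketch of how such a measurement could be taken, shelling out to vdsClient. The host address and sample count are made up, and a working SSL client setup for vdsClient -s is assumed.

import os
import subprocess
import time

HOST = '10.35.3.13'   # hypothetical hypervisor address
RUNS = 20             # hypothetical number of samples


def time_get_all_vm_stats(host=HOST, runs=RUNS):
    """Time repeated `vdsClient -s <host> getAllVmStats` calls and print min/avg/max."""
    samples = []
    with open(os.devnull, 'w') as devnull:
        for _ in range(runs):
            start = time.time()
            subprocess.check_call(['vdsClient', '-s', host, 'getAllVmStats'],
                                  stdout=devnull)
            samples.append(time.time() - start)
    print('min     %.3f sec' % min(samples))
    print('average %.3f sec' % (sum(samples) / len(samples)))
    print('max     %.3f sec' % max(samples))
    return samples


if __name__ == '__main__':
    time_get_all_vm_stats()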
(In reply to Yuri Obshansky from comment #7)
> Bug verification failed.
> I ran a simple script which performs the getAllVmStats command and
> measures the response time. See the results:
>
> getAllVmStats response time w/o patch:
>   min     37.319 sec
>   average 46.398 sec
>   max     52.005 sec
>
> getAllVmStats response time with patch:
>   min     33.058 sec
>   average 48.268 sec
>   max     56.701 sec

Initial observation suggests that something else is eating the gains obtained with this patch. The difference is likely small enough to get lost in run-to-run variation. The times are too high in any case.
Not failed; it is still under investigation whether it improved things or not, so moving the bug back to the appropriate state. We need to figure out a better verification test or find the source of the high latency in general. It may need an artificial test... but I'm not sure it's worth the effort. There are likely other reasons for the ~45 s average, and that's something we should understand...
Michal, could you please suggest how we should test it, or maybe point us to other info we can provide to help you figure this out?
As I said, perhaps only by an artificial test: run getAllVmStats in a loop as fast as you can, with and without the patch; you *should* see a real difference. Again, it's not that important to prove it, as the benefit is obvious from the flow perspective... it may just be negligible/hidden by other stuff.
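[Editor's note] To judge whether such a loop shows a real difference rather than noise (the concern raised about comment #7), a small helper like this could compare the two timing samples. The per-call timings below are purely hypothetical placeholders.

import math


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    return mean, math.sqrt(var)


def compare(before, after):
    """Print both summaries; a difference well inside one standard deviation
    is indistinguishable from run-to-run variation."""
    mb, sb = summarize(before)
    ma, sa = summarize(after)
    print('without patch: mean %.3f s, stddev %.3f s' % (mb, sb))
    print('with patch:    mean %.3f s, stddev %.3f s' % (ma, sa))
    print('difference:    %.3f s' % (mb - ma))


if __name__ == '__main__':
    # Hypothetical per-call timings collected by the loop described above.
    without_patch = [46.2, 44.9, 48.1, 45.6, 47.3]
    with_patch = [45.8, 47.0, 44.1, 46.9, 45.2]
    compare(without_patch, with_patch)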
Michal, based on comment #7 we already tested your suggested flow, and I can't say we see a real difference. How do you suggest we act on this BZ?
It was not really representative; the conditions are specific, and I suggested an artificial test. But IMHO it's not worth it and you can close the bug.
Closing based on comment #14