Bug 1102147

Summary: [scale] ReconstructMasterDomainCommand issue with 200 VMs
Product: Red Hat Enterprise Virtualization Manager
Reporter: Yuri Obshansky <yobshans>
Component: vdsm
Assignee: Francesco Romani <fromani>
Status: CLOSED NOTABUG
QA Contact: Yuri Obshansky <yobshans>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: aberezin, bazulay, cpelland, danken, dkuznets, dsulliva, emarcian, fromani, gklein, iheim, kabbott, laravot, lpeer, michal.skrivanek, nsoffer, oourfali, pkliczew, pstehlik, sherold, ybronhei, yeylon, yobshans
Target Milestone: ---
Target Release: 3.5.1
Hardware: x86_64
OS: Linux
Whiteboard: virt
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1110835 1167229 (view as bug list)
Environment:
Last Closed: 2014-12-16 10:57:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1113106, 1113116    
Bug Blocks: 1110835, 1110848    
Attachments (all flags: none):
- engine.log
- vdsm.log
- VDSM response to get all vm stats
- Engine log file after patch applied
- VDSM log after patch applied
- VDSM 1
- VDSM 2
- VDSM 3
- VDSM 4
- VDSM 5
- small setup log
- small setup log vdsm
- rhevm ui screenshot
- rhevm ui jsonrpc screenshot
- vdsm log jsonrpc
- engine log jsonrpc
- vdsm log xmlrpc
- engine log xmlrpc
- debug engine log
- Latest engine and vdsm logs with debug

Description Yuri Obshansky 2014-05-28 14:34:56 UTC
Created attachment 900022 [details]
engine.log

Description of problem:
Host state is not stable after populating 200 VMs.
Every 4-5 minutes the host switches to the Connecting state.
2014-05-28 14:01:55,464 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-32) Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2014-05-28 14:01:55,542 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-32) Failed to refresh VDS , vds = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc : HOST-REAL, VDS Network Error, continuing.java.util.concurrent.TimeoutException
occurs in engine.log and

Thread-886::ERROR::2014-05-28 13:22:28,627::SecureXMLRPCServer::168::root::(handle_error) client ('10.35.3.13', 59944)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 331, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 223, in do_POST
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: bad write retry
occurs in vdsm.log

Version-Release number of selected component (if applicable):
RHEVM : 3.4.0-0.16.rc.el6ev
OS Version: RHEV Hypervisor - 6.5 - 20140313.1.el6ev
Kernel Version: 2.6.32 - 431.5.1.el6.x86_64
KVM Version: 0.12.1.2 - 2.415.el6_5.6
LIBVIRT Version: libvirt-0.10.2-29.el6_5.5
VDSM Version: vdsm-4.14.2-0.4.el6ev

How reproducible:


Steps to Reproduce:
1. Create 1 DC, 1 Cluster, 1 Host
2. Connect to iSCSI SD
3. Create 200 VMs

Actual results:
see above

Expected results:
Host is stable

Additional info:

Comment 1 Yuri Obshansky 2014-05-28 14:36:45 UTC
Created attachment 900023 [details]
vdsm.log

Comment 2 Eldad Marciano 2014-05-28 14:49:11 UTC
+1 
reproduced on 3.3 too

Comment 3 Barak 2014-05-29 12:28:01 UTC
Yuri:
- At what rate were the VMs created?
- In bulks? All together?

Comment 4 Barak 2014-05-29 12:30:46 UTC
Danken - does it smell like M2C ?

Comment 5 Piotr Kliczewski 2014-05-29 13:15:10 UTC
It looks like it is similar to [1]. Can we get the size of the data to be sent as a response to GetAllVmStatsVDSCommand?


[1] http://rt.openssl.org/Ticket/Display.html?id=598

Comment 6 Yuri Obshansky 2014-05-29 13:54:26 UTC
Barak
- one by one
- no bulks

Piotr
- how can I measure the size of the data?

Comment 7 Piotr Kliczewski 2014-05-29 14:10:54 UTC
The easiest way to get the response size is to log or print len(response). You can modify [1] and log the length just before the write in the do_POST method.

[1] /usr/lib64/python2.6/site-packages/vdsm/utils.py
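
For illustration, the instrumentation described above amounts to a single logging line before the socket write. The sketch below is an assumed, minimal handler, not the real vdsm utils.py code; it only shows where the len(response) log line would go.

    # Minimal, runnable sketch of the suggested instrumentation (not the real
    # vdsm utils.py handler): log len(response) right before it is written in do_POST.
    import logging
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2, as on RHEL 6

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger('vds')

    class StatsHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            response = b'<fake-xmlrpc-reply/>'                    # placeholder for the real reply body
            log.debug("response size: %d bytes", len(response))   # the added measurement
            self.send_response(200)
            self.send_header('Content-Length', str(len(response)))
            self.end_headers()
            self.wfile.write(response)                            # the write that fails with "bad write retry"

    if __name__ == '__main__':
        HTTPServer(('127.0.0.1', 8099), StatsHandler).serve_forever()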

Comment 8 Dan Kenigsberg 2014-05-29 14:18:26 UTC
It might be an m2crypto issue, but I doubt it, as it happens every 4 minutes plus 5 or 10 seconds - a bit too periodic. It might be a matter of the Engine choking up the connection periodically - can you check whether there's a spike in traffic at the time of the errors?

Comment 9 Piotr Kliczewski 2014-05-29 14:28:55 UTC
It looks like a Quartz job on the engine tries to get all VM stats every couple of minutes and for some reason never receives an answer. After some time it recovers and tries again. I think we need to understand why writing the response fails on the vdsm side.

Comment 10 Piotr Kliczewski 2014-05-29 14:48:27 UTC
There is no source on this host machine so we can't modify it to see the size. I will try to recreate the bug by playing with the size of the response.

Comment 11 Yaniv Bronhaim 2014-06-02 09:53:08 UTC
It's a matter of load. Running "vdsClient getAllVmStats" on that host (which runs 200 VMs) takes 4 minutes to return. The exception in VDSM is one issue that must be solved, and it sounds like a regression. From the engine's perspective, 4 minutes will always hit a timeout; we can't do much about that. VDSM needs to respond faster.

Comment 12 Yaniv Bronhaim 2014-06-02 10:12:13 UTC
  PID USER     PR  NI  VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
23221 vdsm      0 -20 32.2g 900m   9m S 1066.6  0.2 110:59.49 vdsm
61074 root     20   0 7290m  31m 8112 S  114.9  0.0  12899:21 libvirtd
29402 root     20   0 17092 3356 1004 R   21.0  0.0   0:30.96 top
29582 qemu     20   0 1726m 1.2g 6364 S    8.3  0.3 439:08.29 qemu-kvm
...


Not sure it's supposed to work... have we ever checked 200*3 Python threads at the same time?

Comment 13 Piotr Kliczewski 2014-06-02 10:37:57 UTC
The response size is over 370k, so following the issue from the link it seems that while this huge response is being sent, new messages are received and sent, so the write retry fails. There are similar issues [1][2] that people see with OpenSSL.

[1] http://stackoverflow.com/questions/20715408/solved-pyopenssl-ssl3-write-pendingbad-write-retry-return-self-sslobj-write
[2] http://stackoverflow.com/questions/14695247/in-php-openssl-error-messages-error-1409f07f-ssl-routines-ssl3-write-pendin

Comment 14 Piotr Kliczewski 2014-06-02 12:30:00 UTC
Created attachment 901431 [details]
VDSM response to get all vm stats

Comment 15 Yaniv Bronhaim 2014-06-02 13:13:34 UTC
The host in question has 400 GB RAM and 32 CPU cores. What limits do we put on such hosts? Do we allow running 400 VMs, each with 1 GB RAM? Don't we also need a limit based on CPUs?

Currently we're testing the maximum number of VMs this host can run before the response becomes so slow that the engine times out.

Comment 16 Michal Skrivanek 2014-06-02 13:40:25 UTC
Adding Francesco. Could getBalloonInfo be of help? I suppose with 200 VMs, eliminating the dom access may help at least with the responsiveness of the getStats call.
Can we try VMs without balloon? Worth a try...

@Yaniv, a 32 GB virtual size is just ill, any idea why?

Comment 17 Yuri Obshansky 2014-06-03 09:20:29 UTC
We started reproducing the issue to find the limit.
We cleaned up the RHEVM, reinstalled vdsm on the host machine,
and started populating VMs from 0 until we got the exception.
It happened at the 182nd VM.
2014-06-03 12:13:42,876 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler_Worker-7) [17f97fa8] Command GetAllVmStatsVDSCommand(HostName = HOST-REAL, HostId = ee38bc5a-0ad7-4034-80a6-a2af84eac5dc, vds=Host[HOST-REAL,ee38bc5a-0ad7-4034-80a6-a2af84eac5dc]) execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException

Comment 18 Dan Kenigsberg 2014-06-03 09:51:00 UTC
(In reply to Yaniv Bronhaim from comment #15)
> don't we need
> to have a limitation based on cpus also?

I don't know if such a limitation exists on Engine side. In my opinion, it should not, as CPU over-commitment is one of the benefits of virtualization. If the guests do nothing, and Vdsm (and the rest of the management stack) is responsive - we should accept more VMs.

Comment 19 Francesco Romani 2014-06-03 12:33:33 UTC
(In reply to Yaniv Bronhaim from comment #11)
> Its a matter of load. running "vdsClinet getAllVmStats" on that host (which
> runs 200 vms) takes 4 minutes to return. The exception in VDSM is one issue
> that must be solved and it sounds like regression. From engine's
> prospective, 4 minutes must reach a timeout, we can't do much about it .
> Vdsm needs to response faster

For the response time, this patch may provide some relief: http://gerrit.ovirt.org/#/c/27696/5

I'm all in for making things faster, but we'll need some profiling data for this, which in turn requires VDSM >= 4.15 (yappi support was merged well after 4.14).

Lastly, the amount of memory consumed is indeed worrisome. I'm thinking of adding optional support for memory profiling as well.
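
For reference, a bare-bones yappi session looks roughly like the sketch below; how (or whether) vdsm wires it in is not shown here, and the profiled function is a placeholder rather than vdsm code.

    # Rough sketch of a yappi profiling session (the profiled function is a
    # placeholder, not vdsm code). Requires the third-party 'yappi' package.
    import yappi

    def sample_all_vms():
        # stand-in for the code path we actually want to profile
        sum(i * i for i in range(100000))

    yappi.start()
    sample_all_vms()
    yappi.stop()

    # Print per-function and per-thread statistics collected while profiling was active.
    yappi.get_func_stats().print_all()
    yappi.get_thread_stats().print_all()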

Comment 20 Dima Kuznetsov 2014-06-11 14:05:13 UTC
Yuri,

I've tried Francesco's patch on my humble i7 quad-core Dell with 125 faqemu VMs and it brought the vdsClient getAllVmStats response time from 6 minutes down to 1 second (!). Can you try it?

Comment 21 Yuri Obshansky 2014-06-11 14:15:05 UTC
Yes, sure.
I'm preparing the environment for bug verification right now.
Send me the patch and a guide on how to install it by e-mail.
I think we should first reproduce the bug, and only then apply the patch and try to reproduce again.

Comment 22 Francesco Romani 2014-06-13 07:59:12 UTC
VDSM patch backported here: http://gerrit.ovirt.org/#/c/28686/1

Comment 23 Michal Skrivanek 2014-06-13 13:23:17 UTC
I don't think the patch was meant to steal the bug completely.

Also @Yuri, be careful when answering on a bug with multiple needinfos :)


(In reply to Dan Kenigsberg from comment #18)
> to have a limitation based on cpus also?
there's no such limitation. +1 on being able to absorb more

(In reply to Yaniv Bronhaim from comment #12)
> have we ever checked 200*3 python's threads at  same time?
why *3?

also please see comment #16

Comment 24 Yaniv Bronhaim 2014-06-15 06:58:41 UTC
I don't know the reason for this huge virtual memory footprint, but it still looks like Francesco's patch has a huge positive impact on it. Dima will share profiling output and usage status soon for further investigation.

Comment 25 Michal Skrivanek 2014-06-17 07:23:53 UTC
Do we want to do anything else here besides the balloon change? If not, I'd take it under virt and put Francesco in charge of backports…

Comment 26 Yaniv Bronhaim 2014-06-17 08:12:45 UTC
The backport is already pushed but not verified [1]. Dima is working with Yuri to find the limits and see whether the patch helps in his environment. If we find that Francesco's improvement is enough to fix the communication issue, I'll forward you the bug to continue.

[1] http://gerrit.ovirt.org/#/c/28686

Comment 27 Yaniv Bronhaim 2014-06-17 12:00:00 UTC
Looks like the patch fixes the issue for this number of VMs on the same environment; it reduces the getAllVmStats response time to about 30 seconds.

Comment 28 Yuri Obshansky 2014-06-18 12:06:01 UTC
The applied patch has resolved our problem with 200 VMs.
The host is stable. No exceptions.
But:
we then tried, unsuccessfully, to create an additional 100 VMs
(increasing the number of VMs up to 300).
The engine (or vdsm) started to lose VMs:
every minute a VM changed state from Up to Non Responsive
and vice versa.
I found only this info message in engine.log:
2014-06-18 11:17:47,316 INFO
[org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
(DefaultQuartzScheduler_Worker-46) [a94764c] VM scale_108
c03cdb08-bc8c-4649-b034-617492f87b71 moved from Up --> NotResponding
2014-06-18 11:17:47,324 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-46) [a94764c] Correlation ID: null, Call
Stack: null, Custom Event ID: -1, Message: VM scale_108 is not responding.
............
2014-06-18 11:18:10,845 INFO
[org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
(DefaultQuartzScheduler_Worker-41) [2a9e7042] VM scale_108
c03cdb08-bc8c-4649-b034-617492f87b71 moved from NotResponding --> Up

Looks like the limit is 200 VMs.
See the attached new log files.

Comment 29 Yuri Obshansky 2014-06-18 12:06:58 UTC
Created attachment 909986 [details]
Engine log file after patch applied

Comment 30 Yuri Obshansky 2014-06-18 12:07:32 UTC
Created attachment 909987 [details]
VDSM log after patch applied

Comment 31 Francesco Romani 2014-06-18 12:28:41 UTC
Hi, the VDSM log is too tiny, there isn't much to see here (it covers ~18 seconds!). Is it possible to have more VDSM logs?

We'd probably greatly benefit from profiling data as Dan suggested in
https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c19

Comment 32 Yuri Obshansky 2014-06-18 12:45:21 UTC
Yes, sure.
Uploaded 5 VDSM log files.

Comment 33 Yuri Obshansky 2014-06-18 12:45:56 UTC
Created attachment 909992 [details]
VDSM 1

Comment 34 Yuri Obshansky 2014-06-18 12:46:28 UTC
Created attachment 909993 [details]
VDSM 2

Comment 35 Yuri Obshansky 2014-06-18 12:46:59 UTC
Created attachment 909994 [details]
VDSM 3

Comment 36 Yuri Obshansky 2014-06-18 12:47:34 UTC
Created attachment 909995 [details]
VDSM 4

Comment 37 Yuri Obshansky 2014-06-18 12:48:03 UTC
Created attachment 909996 [details]
VDSM 5

Comment 38 Michal Skrivanek 2014-06-18 14:41:50 UTC
It looks like a nicely overloaded system. Can we get a sar file, or at least a load value?
It may be that we're simply over the limit the host can handle...

Comment 39 Yuri Obshansky 2014-06-19 07:31:18 UTC
I'd like to clarify the test conditions.
No load was applied during the test.
We just created a number of VMs using the REST API and JMeter.
Hardware profile:
HOST - 64 CPUs and 400 G RAM
VMs - 4 CPUs and 1 G RAM (kernel and guest agent installed)

Comment 40 Michal Skrivanek 2014-06-19 11:21:44 UTC
I'm asking about the system load indicator, and a System Activity Report, to separate the idle load of the environment from what RHEV causes by querying.

It would be great to get a comparison of running vdsm vs. not running vdsm (with the VMs still running).

Comment 41 Yuri Obshansky 2014-06-19 15:10:16 UTC
Looks like I found the number of VMs that run without problems (i.e., without changing state from Up to Non Responsive and back) on the current hardware and software configuration with the patch applied.
It is 215 VMs.

Comment 42 Eldad Marciano 2014-06-22 11:38:55 UTC
This bug was also reproduced in a small setup (532 hosts, 19 VMs per host).
Any chance the problem is related to the number of hosts?

Comment 43 Eldad Marciano 2014-06-22 11:56:22 UTC
engine version: 3.4.0-0.21.el6ev
vdsm version: vdsm-4.14.7-3.el6ev.x86_64

Comment 44 Yaniv Bronhaim 2014-06-22 12:33:34 UTC
Probably not; the problem was the load on a single host. Each VM produces threads to collect stats, and this really slowed down vdsm - the response to getAllVmStats took 6 minutes.
Sharing logs might help us understand what happened in your environment.
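
For illustration only (not vdsm code), a toy model of the pattern described above - per-VM sampling threads feeding a shared cache that getAllVmStats reads - shows why hundreds of VMs translate into hundreds of Python threads competing for the GIL and the same lock:

    # Toy model only: one sampling thread per VM writing into a shared cache.
    # With 200 VMs (vdsm ran roughly 3 threads per VM at the time), that is
    # hundreds of threads contending for the GIL and for cache_lock.
    import threading
    import time

    stats_cache = {}
    cache_lock = threading.Lock()

    def sample_vm(vm_id):
        for _ in range(3):                       # a real sampler would loop forever
            data = {'cpu': 0.0, 'balloon': 0}    # stand-in for a per-VM libvirt query
            with cache_lock:
                stats_cache[vm_id] = data
            time.sleep(0.1)

    def get_all_vm_stats():
        with cache_lock:
            return dict(stats_cache)

    threads = [threading.Thread(target=sample_vm, args=(i,)) for i in range(200)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("collected stats for %d VMs" % len(get_all_vm_stats()))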

Comment 45 Eldad Marciano 2014-06-22 13:03:22 UTC
Created attachment 911139 [details]
small setup log

Comment 46 Eldad Marciano 2014-06-22 13:07:01 UTC
Created attachment 911141 [details]
small setup log vdsm

Comment 47 Barak 2014-06-22 23:05:57 UTC
Yuri,
AFAIU we are running 215 VMs on a 32 core hypervisor (is it 32 or 16 with hyperthreading) ?

Comment 48 Barak 2014-06-22 23:07:00 UTC
Yaniv,

Other areas to analyze:
- profiling vdsm again for bottlenecks (AFAIU the xml parsing from libvirt is an issue)
- investigate the memory consumption of vdsm

Comment 49 Barak 2014-06-22 23:13:21 UTC
(In reply to Barak from comment #47)
> Yuri,
> AFAIU we are running 215 VMs on a 32 core hypervisor (is it 32 or 16 with
> hyperthreading) ?

I see in comment 39 64 cores (same question about hyperthreading ?)

Comment 50 Piotr Kliczewski 2014-06-23 08:26:32 UTC
Eldad,

I checked the logs and I can still see the engine timing out when running GetAllVmStatsVDSCommand. There is one more issue in the engine log during ActivateVdsCommand: Transaction is not active.

I checked the vdsm log for any occurrences of getAllVmStats but was not able to find any. Can you provide a vdsm log where we can see the timeout issue manifest?

Comment 51 Michal Skrivanek 2014-06-25 11:31:58 UTC
(In reply to Barak from comment #48)
(AFAIU the xml parsing from libvirt is an issue)

there are http://gerrit.ovirt.org/#/c/17694/ and http://gerrit.ovirt.org/#/c/26856/ which should help

Comment 52 Michal Skrivanek 2014-06-25 11:33:20 UTC
any response to comment #40?

@eldad, comment #43 - so you had 10000 VMs?

Comment 53 Yuri Obshansky 2014-06-25 11:38:40 UTC
(In reply to Michal Skrivanek from comment #40)
> I'm asking about system load indicator. And System Activity report to see
> the separate the idle load of the environment vs. what RHEV causes by
> querying
> 
> would be great to get a comparison when running vdsm vs not running vdsm
> (with VMs still running)

We don't have the environment right now.

What kind of load indicator? Please specify.
What is a System Activity Report? What do you mean?

Comment 54 Dima Kuznetsov 2014-06-25 12:29:04 UTC
(In reply to Barak from comment #48)
> Yaniv,
> 
> Other areas to analyze:
> - profiling vdsm again for bottlenecks (AFAIU the xml parsing from libvirt
> is an issue)
> - investigate the memory consumption of vdsm

We've looked into the virtual memory map of vdsm on Yuri's host and found a lot of mappings with 0 RSS and no access permissions, all in the range of 50-64 megabytes. These mappings accounted for about 23 GB of virtual memory.
After closer examination, all of those mappings are heaps managed by malloc; here is a short overview of how it gets so big:
* Heaps are allocated on demand: when a thread tries to allocate, it tries to lock any existing heap, and if no heap is available it creates a new one.
* There is a limit on the maximum number of heaps a process will have; it depends on the number of *cores* on the machine.
  The max value is calculated this way (on a 64-bit machine): number of heaps <= (number of cores) * 8.
* When a new heap is created, a 64-megabyte mapping is reserved with no access permissions; then, whenever the process needs writable memory, it mprotect()s a piece of the heap (splitting the mapping in two) and uses the writable chunk.
* On a 64-core machine this can grow to 32 GB across 512 heaps.

The other major contribution to virtual memory is thread stacks: at 8 megabytes per stack, 300 threads mean a 2.4 GB bump in virtual memory.
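
As a quick cross-check, the arithmetic above can be reproduced in a few lines; the constants (64 MiB arenas, at most 8 arenas per core on 64-bit, 8 MiB default pthread stacks) are the ones quoted in this comment, not values measured on the host.

    # Reproduces the arithmetic from the comment above; constants are the quoted
    # glibc defaults, not measurements.
    CORES = 64
    THREADS = 300

    ARENA_SIZE_MB = 64
    MAX_ARENAS = 8 * CORES            # 512 arenas on a 64-core machine
    STACK_SIZE_MB = 8

    arena_gb = MAX_ARENAS * ARENA_SIZE_MB / 1024.0
    stack_gb = THREADS * STACK_SIZE_MB / 1024.0

    print("worst-case arena reservation: %d x %d MiB = %.1f GiB" % (MAX_ARENAS, ARENA_SIZE_MB, arena_gb))
    print("thread stacks: %d x %d MiB = %.1f GiB" % (THREADS, STACK_SIZE_MB, stack_gb))

If the reservation itself were a concern, glibc's MALLOC_ARENA_MAX environment variable can cap the number of arenas, though these mappings are mostly PROT_NONE and consume no RSS.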

Comment 55 Michal Skrivanek 2014-06-25 14:03:43 UTC
I'd like to find out which thread(s) actually consume so much RSS that it adds up to 900 MB - well, probably not that significant anyway.

Yuri, would you please let me know when you have the setup again just so I can take a look at the host?

Comment 56 Eldad Marciano 2014-08-06 13:40:08 UTC
Martin, I think I have reproduced the problem on 'vdsm-4.14.11-1.el6ev.x86_64'.

Comment 57 Michal Skrivanek 2014-08-21 14:53:20 UTC
*** Bug 1132165 has been marked as a duplicate of this bug. ***

Comment 58 Yuri Obshansky 2014-11-10 07:13:30 UTC
Bug verification failed.
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev

I ran a simple script which performs the getAllVmStats command
and measures the response time.
See the results:
getAllVmStats response time w/o patch
min        37.319 sec  
average 46.398 sec
max       52.005 sec   

getAllVmStats response time with patch
min        33.058 sec
average 48.268 sec
max       56.701 sec
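
The script itself is not attached; an equivalent measurement, assuming vdsClient is invoked as in comment 11 (the "-s 0" SSL/localhost arguments may need adjusting for the actual setup), could look like this:

    # Sketch of an equivalent measurement (the original script is not attached).
    import os
    import subprocess
    import time

    CMD = ["vdsClient", "-s", "0", "getAllVmStats"]
    RUNS = 10

    samples = []
    devnull = open(os.devnull, "w")
    for _ in range(RUNS):
        start = time.time()
        subprocess.call(CMD, stdout=devnull, stderr=devnull)
        samples.append(time.time() - start)

    print("min %.3f sec, average %.3f sec, max %.3f sec"
          % (min(samples), sum(samples) / len(samples), max(samples)))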

Comment 59 Yaniv Bronhaim 2014-11-11 11:30:02 UTC
I suggest opening a new bug. The fix here is virt-related: it changes how VM status is polled and decreases the penalty of doing so - it was merged to 3.5, which this bug is targeted at. Performance issues when running VMs at large scale can lead to many bottlenecks;
this particular bottleneck is solved in this bugzilla (fromani - please change the bug title according to the fix).

Yuri, 5 seconds better is also something meaningful, but we need to understand more deeply what takes so long and where to investigate. Please open a new bug with the latest logs and we'll try to investigate more there, and close this one.

Comment 60 Francesco Romani 2014-11-11 11:40:26 UTC
(In reply to Yaniv Bronhaim from comment #59)
> I suggest to open new bug . the fix here refers to virt by polling the vms
> status and it decreases the penalty of it - this was merged to 3.5 as the
> bug is targeted. although performance issues regarding running large scale
> of vms can lead to many bottle necks
> this bottle neck is solved in the bugzilla (fromani - please change the bug
> title according to the fix)

Done

Comment 64 Nir Soffer 2014-11-12 17:13:40 UTC
(In reply to Yuri Obshansky from comment #58)
> Bug verification failed.
> RHEVM - 3.5.0-0.17.beta.el6ev
> VDSM - vdsm-4.16.7.1-1.el6ev
> 
> I ran simple script which perform command getAllVmStats
> and measure response time
> See results:
> getAllVmStats response time w/o patch
> min        37.319 sec  
> average 46.398 sec
> max       52.005 sec   
> 
> getAllVmStats response time with patch
> min        33.058 sec
> average 48.268 sec
> max       56.701 sec

Please specify which patch you are talking about.

Also, please perform the test with both jsonrpc and xmlrpc and report the results for each.

Comment 65 Barak 2014-11-12 21:13:18 UTC
Nir - I added you in order to check the storage-related issue;
all the info is in this bug.

Yuri has trouble reaching 200 VMs on his system and encountered a DC/SP issue (see comments 61 & 62); the needinfo on you is to check what the problem is.

Comment 66 Barak 2014-11-13 18:10:52 UTC
Yuri, just to be on the safe side,
you mentioned:
1 - comment #39 - using the REST API and JMeter
2 - host status is not stable (bug description)

While in comment #58 you failed verification - you wrote "I ran simple script which perform command getAllVmStats".

Did the host change status? Did you still see "synchronous VM balloon query causes VDSNetworkException and TimeoutException"?

It looks like you failed verification because the getAllVmStats response did not improve enough?

Please be careful and do not clear other needinfos ;-)

Comment 67 Yuri Obshansky 2014-11-16 14:00:41 UTC
I ran a script with getAllVmStats to test Francesco's patch,
which is the fix for bug 1110835 (a clone of the current bug).
Anyway, I reran the test using the original flow
with only one difference - I repeated that flow (create 200 VMs) twice:
1. host created without the jsonrpc protocol (using old xmlrpc)
2. host created with the jsonrpc protocol.

Executive summary:
- The test with the jsonrpc protocol failed.
- The VM creation process got slower and slower after we had created more than 80 VMs.
- The DC and Storage Domain crashed every half hour of running.
- Creation of the last VM (number 111) took 7 minutes.
- The environment was not functional from my point of view.
Exceptions raised in the engine log:
2014-11-16 14:22:45,120 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [69a1a5c] Correlation ID: 69a1a5c, Job ID: b1d0760c-c346-4037-84c5-30bc8b73c58e, Call Stack: null, Custom Event ID: -1, Message: Failed to Reconstruct Master Domain for Data Center DC-REAL.
2014-11-16 14:25:42,260 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-11-16 14:25:42,275 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-60) [409d1fa0] IRS failover failed - cant allocate vds server
2014-11-16 14:25:42,520 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-7) Operation Failed: [Cannot run VM. The relevant Storage Domain's status is Inactive.]
Error raised in vdsm log:
GuestMonitor-scale-107::DEBUG::2014-11-16 14:27:08,414::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`535fb4ef-a64b-4754-b700-87a8e32d31b5`::Domain Metadata is not set
Thread-140622::DEBUG::2014-11-16 14:27:08,419::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Host details:
vdsm-4.16.7.1-1.el6ev
libvirt-0.10.2-46.el6_6.1
32 CPU Cores
400 G RAM

Comment 68 Barak 2014-11-16 16:26:56 UTC
None of the above ERRORS has anything to do with json-rpc.
The only thing that may be relevant is

"""
org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
"""

Yuri - please also re-run the same test with XML-RPC (same versions and hosts),
just to be on the safe side (it still looks like a storage-related issue).

Piotr - please take a look at the timeout above.

Comment 69 Yuri Obshansky 2014-11-17 06:12:04 UTC
Additional info: VM template details:
2 CPU, 1 G RAM
Already done:
Test 1 (xmlrpc) results: Successfully completed
Test 2 (jsonrpc) results: Test failed.

Comment 70 Yuri Obshansky 2014-11-17 06:13:16 UTC
Created attachment 958176 [details]
rhevm ui screenshot

Comment 71 Yuri Obshansky 2014-11-17 06:14:06 UTC
Created attachment 958177 [details]
rhevm ui jsonrpc screenshot

Comment 72 Yuri Obshansky 2014-11-17 06:16:28 UTC
Created attachment 958178 [details]
vdsm log jsonrpc

Comment 73 Yuri Obshansky 2014-11-17 06:17:13 UTC
Created attachment 958179 [details]
engine log jsonrpc

Comment 74 Yuri Obshansky 2014-11-17 06:18:06 UTC
Created attachment 958180 [details]
vdsm log xmlrpc

Comment 75 Yuri Obshansky 2014-11-17 06:18:43 UTC
Created attachment 958181 [details]
engine log xmlrpc

Comment 76 Piotr Kliczewski 2014-11-17 09:34:24 UTC
I checked the jsonrpc logs and I can see that during the test the master domain was reconstructed.

2014-11-16 14:22:45,129 INFO  [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (org.ovirt.thread.pool-7-thread-36) [1b84f6e9] Finished reconstruct for pool 0bc21fb5-d17d-45b7-8225-6ecefabca3f2. Clearing event queue

There was no interaction between the engine and vdsm for a long time (excluding heartbeats). The next log entry is:

2014-11-16 14:25:00,012 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (DefaultQuartzScheduler_Worker-15) [1ad59ae9] Autorecovering 1 storage domains


I am not sure what happened yet, but from the vdsm log I can see that the last message vdsm attempted to send was a response to the request that timed out.

Thread-161511:EBUG::2014-11-16 14:22:42,609::__init__::498::jsonrpc.JsonRpcServer:_serveRequest) Return 'StoragePool.getInfo' in bridge with {'info': {'name': 'No Description', 'isoprefix': '', 'pool_status': 'connected', 'lver': -1, 'domains': u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72:Active', 'master_uuid': u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72', 'version': '3', 'spm_id': -1, 'type': 'NFS', 'master_ver': 1}, 'dominfo': {u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72': {'status': u'Active', 'diskfree': '775436566528', 'isoprefix': '', 'alerts': [], 'disktotal': '805306368000', 'version': 3}}}

I am not sure whether this is the response that we waited for.

I would like to understand what arrived at the other side. Can you please provide engine.log at debug level?

Comment 77 Yuri Obshansky 2014-11-17 09:47:58 UTC
I've already uploaded engine.log -> engine log jsonrpc (123.78 KB, application/x-gzip).
Is it not good enough?

Comment 78 Piotr Kliczewski 2014-11-17 10:03:08 UTC
Yuri,

The log uses info log level. I would need debug log level to understand what is received by the engine.

Comment 79 Yuri Obshansky 2014-11-17 12:47:44 UTC
Created attachment 958229 [details]
debug engine log

Comment 80 Yuri Obshansky 2014-11-17 12:52:19 UTC
I've changed the log level to DEBUG and uploaded a new engine.log.
It contains the following actions:
- activate the Storage Domain, because it was Inactive
- start 2 VMs
- the Storage Domain crashes and changes state to Inactive
"2014-Nov-17, 14:34
Failed to Reconstruct Master Domain for Data Center DC-REAL."

Comment 81 Piotr Kliczewski 2014-11-17 14:24:03 UTC
None of the messages timed out after setting the 

I can see in the logs the following actions taking place:
- SpmStopVDSCommand
- ResetIrsVDSCommand (closes the engine <-> vdsm connection (nullifyInternalProxies))
- ReconstructMasterDomainCommand (fails)

We close the connection and reconnect later (2014-11-17 14:42), so we have 8 minutes without an IRS connection. Is that intended?

Comment 82 Piotr Kliczewski 2014-11-18 09:34:09 UTC
I got additional logs from Yuri and noticed that IRS-related commands timed out because spmStop ends with a disconnection. This is expected behavior during master reconstruction.

Comment 83 Piotr Kliczewski 2014-11-18 11:08:32 UTC
I checked the master reconstruction flow and noticed that the StoragePool.reconstructMaster verb is not run at all.

Here is what I could see in the logs on both ends:

2014-11-18 10:39:51,842 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (org.ovirt.thread.pool-7-thread-42) [68045ae6] FINISH, ResetIrsVDSCommand, log id: 29a77c5a
2014-11-18 10:39:51,850 DEBUG [org.ovirt.engine.core.bll.storage.ReconstructMasterDomainCommand] (org.ovirt.thread.pool-7-thread-42) [68045ae6] Command [id=791a6240-3038-40aa-9988-f382bb7040e9]: No compensation data.

Thread-307701::DEBUG::2014-11-18 10:39:52,256::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.spmStart' in bridge with {u'maxHostID': 250, u'storagepoolID': u'0bc21fb5-d17d-45b7-8225-6ecefabca3f2', u'prevLver': u'-1', u'enableScsiFencing': u'false', u'prevID': -1}

Thread-307701::DEBUG::2014-11-18 10:39:53,005::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.spmStart' in bridge with 2859a77a-6b46-405b-9fac-3baa4ed6a50f


Thread-307708::DEBUG::2014-11-18 10:39:55,438::__init__::467::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'StoragePool.connectStorageServer' in bridge with {u'connectionParams': [{u'id': u'7027805c-7dd8-4c7d-a266-e645d927c103', u'connection': u'netapp.qa.lab.tlv.redhat.com:/vol/vol_rhev_stress/sd_1', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'password': u'', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 1}

Thread-307708::DEBUG::2014-11-18 10:40:17,825::__init__::498::jsonrpc.JsonRpcServer::(_serveRequest) Return 'StoragePool.connectStorageServer' in bridge with [{'status': 0, 'id': u'7027805c-7dd8-4c7d-a266-e645d927c103'}]

Nir, can you help us understand what is failing during reconstruct? I do not see any issue related to communication.

Comment 84 Piotr Kliczewski 2014-11-18 14:26:28 UTC
Created attachment 958609 [details]
Latest engine and vdsm logs with debug

Comment 85 Barak 2014-11-20 18:45:22 UTC
I think Liron can help with the problem mentioned in comment #83

Comment 86 Francesco Romani 2014-11-24 09:07:14 UTC
This bug is becoming too complex to follow.

I don't think the issues found in the ReconstructMasterDomainCommand flow with json-rpc, although of course important, belong in this bug. I suggest opening a new one to track them better.

I also believe it is best to file a new one addressing the performance characteristics of this patch, as concerns were raised in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58

This bug was about avoiding timeouts and exceptions, and verification *did not* prove it faulty (although of course more verification is welcome).

Barak, Michal, what do you think about the above?

Comment 87 Francesco Romani 2014-11-24 09:25:22 UTC
Please let's continue here

https://bugzilla.redhat.com/show_bug.cgi?id=1167229

for the performance concerns as raised in 

https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58

Comment 88 Michal Skrivanek 2014-11-24 09:32:34 UTC
Changing the title to something less misleading. Moving to 3.5.1 so as not to block GA.

Comment 89 Liron Aravot 2014-11-25 16:07:39 UTC
Reconstruct wasn't executed here because there was one domain in the pool that all the hosts had a problem accessing - see the following repoStats result:

Thread-307378::INFO::2014-11-18 10:33:13,394::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {u'56b6b93d-f3f5-4075-98ae-85ce4a2d7b72': {'code': 0, 'version': 3, 'acquired': True, 'delay': '0.000537113', 'lastCheck': '74.3', 'valid': True}}

By default, a report of lastCheck > 30 (that's a config value) marks a domain as problematic; if after 5 minutes the domain is still not reported as healthy by the hosts in the data center, it is moved to Inactive status.

In this case, I can see that the repoStats calls in the vdsm logs come in huge gaps (10-12 minutes) - that's related to the number of running VMs (I assume) - and is related to bug https://bugzilla.redhat.com/show_bug.cgi?id=1099068, as currently VM monitoring and host monitoring aren't separated.

What needs to be done is a separation of the monitoring; regardless, the config value (the lastCheck limit) can be increased under that heavy load.
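
A minimal sketch of that thresholding, with the 30-second and 5-minute values quoted above hard-coded for illustration (the real engine logic is more involved), might look like:

    # Minimal sketch of the domain-monitoring threshold described above; only the
    # two quoted defaults are modeled.
    LAST_CHECK_LIMIT = 30      # seconds: repoStats 'lastCheck' above this => problematic
    GRACE_PERIOD = 5 * 60      # seconds a domain may stay problematic before going Inactive

    def domain_state(last_check, problematic_since, now):
        """Classify a single domain report as 'ok', 'problematic' or 'inactive'."""
        if float(last_check) <= LAST_CHECK_LIMIT:
            return 'ok'
        if problematic_since is not None and now - problematic_since >= GRACE_PERIOD:
            return 'inactive'
        return 'problematic'

    # The repoStats sample above reports lastCheck '74.3', well over the 30s limit.
    print(domain_state('74.3', problematic_since=None, now=0))    # -> 'problematic'
    print(domain_state('74.3', problematic_since=0, now=6 * 60))  # -> 'inactive'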

Comment 90 Barak 2014-11-25 16:28:44 UTC
As detailed above in comment #89, it looks like the fact that it succeeded with XML-RPC and failed with JSON-RPC is purely a matter of luck, unless vdsm returns a wrong "lastCheck" number over json-rpc (long shot).

looking at the code I see:

[bazulay@bazulay vdsm (master)]$ git grep -i lastcheck 
vdsm/rpc/vdsmapi-schema.json:# @lastCheck:  Elapsed time since the Storage Domain was last checked
vdsm/rpc/vdsmapi-schema.json: 'data': {'code': 'int', 'delay': 'float', 'lastCheck': 'float',
vdsm/storage/hsm.py:            lastcheck = '%.1f' % (statsGenTime - domStatus.checkTime)
vdsm/storage/hsm.py:                    'lastCheck': lastcheck, 

Piotr, can you please check that the numbers are reported the same on XML-RPC & JSON-RPC?

Comment 91 Piotr Kliczewski 2014-11-26 07:57:50 UTC
Here are the values that I was able to find in the log:

- xmlrpc - 6.9, 1.3, 1.4, 6.7, 0.7, 9.0, 12.0
- jsonrpc - 68.9, 44.7, 24.6, 68.9, 44.7, 24.6

Comment 92 Barak 2014-11-26 09:24:26 UTC
I mean in the same env, while running the engine with json-rpc - use vdsClient to get it via xml-rpc and compare.

Are those the numbers in comment #91?

Comment 93 Piotr Kliczewski 2014-11-26 10:46:23 UTC
I am not sure what you are looking for, but when running in the same env I can see that the numbers are similar in terms of order of magnitude.

Like 2.7 and 0.4.

Comment 94 Piotr Kliczewski 2014-11-27 09:31:07 UTC
We can see in bug #1168540 that with xmlrpc, master reconstruction fails every hour.

Comment 95 Barak 2014-11-27 16:30:51 UTC
(In reply to Piotr Kliczewski from comment #93)
> I am not sure what are you looking for but qhen running in the same env I
> can see that the numbers are similar in terms of order of magnitude. 
> 
> Like 2.7 and 0.4.

Floats might be reported slightly differently ... and with higher numbers that might be why the delay is reported only through json-rpc (the default engine configuration is that once it passes 30 seconds ... you enter this flow)

Comment 96 Piotr Kliczewski 2014-11-28 07:32:02 UTC
There is no difference in how the response is passed to the marshalling code. As stated in comment #94, the same issue occurs with xmlrpc.

I think we are looking in the wrong direction. We marshal many float values and we would have noticed a difference.

As I understand it, we generate load and the lastCheck values get higher. We can see values above 30, so after 5 minutes the domain is reported as problematic, which triggers master reconstruction.

Comment 97 Michal Skrivanek 2014-12-16 10:39:24 UTC
It seems to me that there is no real difference between the protocols; it's just a matter of luck - once you cross 30s, it gets triggered.
I would close the bug if there is no objection.

Comment 98 Michal Skrivanek 2014-12-16 10:40:46 UTC
Also, see comment #89 for a possible explanation of the problems. Once reporting is delayed that much, anything can happen and things fall apart.

Comment 99 Yuri Obshansky 2014-12-16 10:57:05 UTC
I cannot reproduce it 100% due to the storage issues.
Looks like it's not a bug, so it will be closed.