Bug 1364791 - [scale] - Fix communication issues between engine and VDSM to allow higher number of hosts
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.0.2.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.1-1
Target Release: 4.1.0
Assignee: Piotr Kliczewski
QA Contact: eberman
URL:
Whiteboard:
Depends On: 1411648 1429534
Blocks: 1301587 1402597
 
Reported: 2016-08-07 13:42 UTC by Eldad Marciano
Modified: 2017-04-21 09:30 UTC
CC: 11 users

Fixed In Version: 4.1.0-12 (jsonrpc 1.3.8)
Clone Of:
Environment:
Last Closed: 2017-04-21 09:30:16 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+


Attachments
adding logs and TD see the following lock 0x0000000088f4d3f8 (109.22 KB, application/zip)
2016-09-14 11:23 UTC, Eldad Marciano
no flags Details
engine and vdsmfake thread dumps (37.20 KB, application/zip)
2016-11-14 14:00 UTC, Eldad Marciano
no flags Details
reproducing the problem with 163 nested hosts (67.29 KB, application/zip)
2016-12-19 15:26 UTC, Eldad Marciano
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 60400 0 None None None 2017-01-11 06:21:50 UTC
oVirt gerrit 68899 0 None None None 2017-01-11 06:20:56 UTC
oVirt gerrit 68900 0 None None None 2017-01-11 06:20:13 UTC
oVirt gerrit 69891 0 None None None 2017-01-11 06:19:16 UTC
oVirt gerrit 69960 0 ovirt-engine-4.1 MERGED scheduler: Log error if scheduled task cannot be submitted 2017-01-12 09:07:21 UTC
oVirt gerrit 70464 0 ovirt-engine-4.1 MERGED Version bump 2017-01-16 09:03:08 UTC
oVirt gerrit 70935 0 ovirt-4.1 MERGED process messages one by one 2017-01-23 08:37:18 UTC
oVirt gerrit 70936 0 ovirt-4.1 MERGED Version bump 2017-01-23 08:41:44 UTC
oVirt gerrit 71001 0 ovirt-engine-4.1 MERGED jsonrpc: version bump 2017-01-23 11:52:54 UTC

Comment 3 Oved Ourfali 2016-08-08 05:34:59 UTC
Please also share the logs.

Comment 5 Eldad Marciano 2016-08-08 09:55:18 UTC
Yes.
Piotr, I found the "Unrecognized messages" error again on the vdsmfake side.
It seems the issue reproduces with a bunch of hosts.
It seems we need to resolve it before we can draw any conclusions.

Comment 7 Piotr Kliczewski 2016-08-08 12:55:17 UTC
I checked the latest logs and it looks like fakevdsm is sending many empty lines, which breaks message parsing code. Investigating what could cause those empty lines.

Comment 10 Eldad Marciano 2016-08-16 14:15:06 UTC
(In reply to Piotr Kliczewski from comment #7)
> I checked the latest logs and it looks like fakevdsm is sending many empty
> lines, which breaks message parsing code. Investigating what could cause
> those empty lines.

Piotr any news?

Comment 12 Piotr Kliczewski 2016-08-16 14:18:52 UTC
> Piotr any news?

We know that the empty lines are related to heartbeats, which accumulate over time. They are not related to this locking issue.
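For context, a STOMP heart-beat is just a bare end-of-line character sent between frames, so accumulated heartbeats show up as empty "messages" to a strict parser. A minimal illustrative sketch (not the actual vdsm-jsonrpc-java parser) of reading frames while tolerating heart-beat newlines:

    // Hypothetical sketch: skip heart-beat EOLs between STOMP frames.
    import java.io.BufferedReader;
    import java.io.IOException;

    class FrameReaderSketch {
        // Returns the next frame command (e.g. "SEND"), or null at end of stream.
        static String nextCommand(BufferedReader in) throws IOException {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    return line;   // first non-empty line is the frame command
                }
                // empty line = an accumulated heart-beat; ignore it
            }
            return null;
        }
    }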

Comment 13 Eldad Marciano 2016-09-05 09:37:45 UTC
We need urgent priority, since we are blocked from running multiple hosts.

Comment 14 Oved Ourfali 2016-09-05 09:39:48 UTC
(In reply to Eldad Marciano from comment #13)
> We need urgent priority, since we are blocked from running multiple hosts.

As I've asked before in another thread, and on the current email thread: can you check a smaller number of hosts, and gradually increase to see at what number it happens?
I think we can get valuable scale data also from environments with fewer than 300 hosts. Can you check 50? 100? 150? 200? 250?

Comment 15 Eldad Marciano 2016-09-05 11:34:51 UTC
Currently I reproduce it with 300 hosts.
Either way, we need it working for 500 hosts.

Comment 16 Oved Ourfali 2016-09-05 12:10:27 UTC
(In reply to Eldad Marciano from comment #15)
> Currently I reproduce it with 300 hosts.
> Either way, we need it working for 500 hosts.

Please test with fewer hosts and see what you find.
We keep saying we should test with fewer hosts, and I think the right approach is to see where the issue starts, as you can get a lot of valuable scale results with a smaller number of hosts.

Comment 17 Eldad Marciano 2016-09-05 14:21:34 UTC
(In reply to Oved Ourfali from comment #16)
> (In reply to Eldad Marciano from comment #15)
> > Currently I reproduce it with 300 hosts.
> > Either way, we need it working for 500 hosts.
> 
> Please test with fewer hosts and see what you find.
> We keep saying we should test with fewer hosts, and I think the right
> approach is to see where the issue starts, as you can get a lot of valuable
> scale results with a smaller number of hosts.

I tested it with 200 and 150 hosts.
With 200 hosts the issue still reproduced.
With 150 hosts it works pretty stably.
So it seems we can say the problem starts at around 200 hosts, which is less than the default value of MaxNumberOfHostsInStoragePool.

Comment 18 Eldad Marciano 2016-09-06 14:28:06 UTC
This issue is blocking us from continuing the scale-out regression tests we have run since 3.4, which include 500 hosts.

Comment 20 Eldad Marciano 2016-09-14 11:16:15 UTC
The issue reproduced for me on a hybrid env that mixes 30 real hosts and 150 fake ones.
Both real and fake hosts suffer from this issue, so I'll change the bug title.
Adding logs + TD.
It seems the monitor lock is held for too long, ~15 sec and more.
Please see lock "0x0000000088f4d3f8",
which relates to:
        at org.ovirt.engine.core.bll.VdsEventListener.vdsUpEvent(VdsEventListener.java:308)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:93)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:238)
        - locked <0x0000000088f4d3f8> (a java.lang.Object)
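The dump shows the per-host monitoring timer holding a per-VdsManager lock while refresh() does network-bound work. A simplified sketch of that pattern (the real VdsManager code is more involved):

    // Simplified sketch of the locking pattern seen in the thread dump.
    class VdsManagerSketch {
        private final Object lockObj = new Object(); // <0x0000000088f4d3f8> in the dump

        public void onTimer() {
            synchronized (lockObj) { // "- locked <0x...>" in the stack trace
                refresh();           // monitoring call that can block on the network
            }
        }

        private void refresh() {
            // stand-in for HostMonitoring.refresh(): if this blocks for ~15s on an
            // unresponsive host, every later onTimer() tick queues up behind lockObj
        }
    }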

Comment 21 Eldad Marciano 2016-09-14 11:23:51 UTC
Created attachment 1200790 [details]
adding logs and TD see the following lock 0x0000000088f4d3f8

Adding TD; see the following lock: 0x0000000088f4d3f8

Comment 23 Eldad Marciano 2016-09-15 13:37:29 UTC
This bug was reproduced on master.

The issue appears when working with 150 hosts and 1 SD.

Comment 24 Roy Golan 2016-10-31 13:55:05 UTC
- Please use fedora's http://paste.fedoraproject.org and non-private comments
- Please add thread dump of vdsmfake
- As a workaround you can deploy several fakevdsm instances and use dnsmasq to dynamically resolve the new addresses (see https://github.com/oVirt/ovirt-vdsmfake)
- Please increase the heartbeat to at least 15 (to surpass the 15s wait) and see how it behaves

Comment 25 Eldad Marciano 2016-11-14 13:58:19 UTC
(In reply to Roy Golan from comment #24)
> - Please use fedora's http://paste.fedoraproject.org and non-private comments
> - Please add thread dump of vdsmfake
> - As a workaround you can deploy several fakevdsm instances and use dnsmasq
> to dynamically resolve the new addresses (see
> https://github.com/oVirt/ovirt-vdsmfake)
> - Please increase the heartbeat to at least 15 (to surpass the 15s wait) and
> see how it behaves

This bug reproduced on oVirt master with the following vdsmfake distribution:
5 vdsmfake instances * 25 hosts each = 125 hosts.
I used the latest github repository for the test.

I found the issue reproduces when restarting vdsmfake (instance #1): hosts go non-responsive, then most of them come back to the "up" state, but 2 of them got stuck on "connecting" for a long period before finally coming "up".

Bottom line: it looks like hosts get stuck for a long period.
Adding thread dumps from both the engine and vdsmfake.

In the engine logs there is nothing besides the "Vds timeout occured" error.

Comment 26 Eldad Marciano 2016-11-14 14:00:35 UTC
Created attachment 1220442 [details]
engine and vdsmfake thread dumps

Comment 27 Yaniv Kaul 2016-12-01 13:15:17 UTC
Removing the blocker flag, for 2 reasons:
1. It is how 4.0 behaved.
2. We need re-testing in 4.1 - we believe we have made some good progress in various areas partially related to this as well. Specifically, host monitoring should be significantly improved (ETA - next week).

Comment 32 Eldad Marciano 2016-12-19 15:20:24 UTC
This problem reproduced with 163 nested hosts and no storage.

test profile:
1. Run 163 VMs to act as nested hosts.
2. Assign the VMs as nested hosts in a different datacenter (with no storage).
3. Power off the VMs to simulate a restart, the same way we previously restarted vdsmfake.
4. Observe the hosts become non-responsive.
5. Start the VMs again.
6. Restart the engine.
7. Observe that some of the nested hosts remain non-responsive and some get stuck on "activating".

Comment 33 Oved Ourfali 2016-12-19 15:25:09 UTC
We are also exploring that, but it would be helpful to get a thread dump so that we'll be able to compare. Can you produce it?

Comment 34 Eldad Marciano 2016-12-19 15:26:27 UTC
Created attachment 1233397 [details]
reproducing the problem with 163 nested hosts

Attaching thread dumps.
See the following lock on the latch call:
0x000000008a026d68, or search for DefaultQuartzScheduler71.

Comment 35 Eldad Marciano 2016-12-19 15:28:46 UTC
(In reply to Oved Ourfali from comment #33)
> We are also exploring that, but it would be helpful to get a thread dump so
> that we'll be able to compare. Can you produce it?

I just replied with it. Also, the vdsm-jsonrpc version:
vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch

Comment 36 Roy Golan 2016-12-20 09:51:07 UTC
I see the STOMP reactor looping over all channels to get the connection back. Probably it doesn't find any of the channels open.

I started with 

  $ grep "Connecting to" /var/log/ovirt-engine/engine.log
  # tons of records

Then I went deeper to find the thread that is constantly trying to read the resources.

- First spot the problematic thread

  $ ps -Leo pcpu,pmem,pid,lwp,user| sort -k1 -r | head -4
  %CPU %MEM   PID   LWP USER 
  81.3  4.8 13203 13446 ovirt

- LWP is the thread id. Converting it to hex:

  $ printf "0x%x" 13446
  0x3486

- Find the thread in the stack trace by its hex id (nid)

  $ su ovirt -s /bin/bash -c "jstack $(pgrep -f jboss-modules.jar)" | grep -A 2 3486
  "SSL Stomp Reactor" #226 daemon prio=5 os_prio=0 tid=0x00007fd1b0245800 nid=0x3486 runnable [0x00007fd13faf9000] 
   java.lang.Thread.State: RUNNABLE                                                                              
        at sun.nio.ch.NativeThread.current(Native Method)

- Strace shows this thread is performing 4000 reads per second and can't find anything:

   $ strace -ff -x -o strace.out -t -y -p 13446
   $ grep "09:36:15 read" strace.out.13446 | wc -l
   4002
   
                                                  
Piotr, Martin, don't we have a treatment to simply stop connecting after a while?

Comment 37 Piotr Kliczewski 2016-12-20 10:02:03 UTC
Jsonrpc java is quite dumb: it performs the operations that the engine asks for, like sending. When a channel is closed and we have a send request, we attempt to connect. Usually host monitoring triggers the sends.

Comment 38 Roy Golan 2016-12-20 10:10:06 UTC
(In reply to Piotr Kliczewski from comment #37)
> Jsonrpc java is quite dumb: it performs the operations that the engine asks
> for, like sending. When a channel is closed and we have a send request, we
> attempt to connect. Usually host monitoring triggers the sends.

Right but I would expect to see far less stress here. The engine is trying to connect every 3 seconds right? it does look like that at all. This looks like retries of the reactor.

Comment 39 Roy Golan 2016-12-20 10:10:34 UTC
(In reply to Piotr Kliczewski from comment #37)
> Jsonrpc java is quite dumb: it performs the operations that the engine asks
> for, like sending. When a channel is closed and we have a send request, we
> attempt to connect. Usually host monitoring triggers the sends.

Right, but I would expect to see far less stress here. The engine is trying to connect every 3 seconds, right? It does not look like that at all. This looks like retries by the reactor.

Comment 40 Roy Golan 2016-12-20 10:11:39 UTC
Scratch comment 38. I meant "it does not look like that".

Comment 41 Piotr Kliczewski 2016-12-20 10:35:57 UTC
Correct, I have seen many times that the engine triggers the job a couple of times per second.
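A hedged sketch of the kind of "treatment" comment 36 asks about - gating connect attempts behind an exponential backoff so a dead channel is not retried on every send. The names here are illustrative, not the vdsm-jsonrpc-java API:

    // Hypothetical reconnect gate: deny attempts until the backoff expires.
    class ReconnectBackoff {
        private static final long MAX_DELAY_MS = 30_000;
        private long delayMs = 1_000;       // first retry after 1s
        private long nextAttemptAtMs = 0;

        // Called by the reactor before each connect attempt.
        synchronized boolean mayConnect() {
            long now = System.currentTimeMillis();
            if (now < nextAttemptAtMs) {
                return false;               // still backing off, skip this attempt
            }
            nextAttemptAtMs = now + delayMs;
            delayMs = Math.min(delayMs * 2, MAX_DELAY_MS); // double on each failure
            return true;
        }

        // Reset after a successful connect.
        synchronized void onConnected() {
            delayMs = 1_000;
            nextAttemptAtMs = 0;
        }
    }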

Comment 42 Eldad Marciano 2016-12-20 16:03:50 UTC
It seems we have 2 issues here. One is the lock caused by DefaultQuartzScheduler71 > countdown, which is never released and so holds the lock.

The other is the reactor's intensive processing.
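In miniature, that first issue is a thread blocking forever on a latch whose countDown() never happens, pinning a scheduler thread. A hedged sketch (not the engine's actual code) of how a bounded wait avoids that:

    // Sketch: waiting on a response latch with a timeout.
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    class LatchSketch {
        // latch.await() with no timeout pins the calling thread forever if the
        // response is lost and countDown() is never invoked. A bounded wait
        // returns false on timeout and gives the scheduler thread back.
        static boolean waitForResponse(CountDownLatch latch) throws InterruptedException {
            return latch.await(15, TimeUnit.SECONDS);
        }
    }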

Comment 44 Red Hat Bugzilla Rules Engine 2016-12-26 10:55:49 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 51 Eldad Marciano 2017-01-04 15:30:52 UTC
Summarizing the latest findings:

based on https://bugzilla.redhat.com/show_bug.cgi?id=1364791#c43

A zstream env with bare-metal hosts + patched vdsm-json-rpc (https://gerrit.ovirt.org/#/q/topic:slow_host_start)
works fine in the idle state, but when working with VMs and adding more tasks to the system, we found an issue on the vdsm side; see https://bugzilla.redhat.com/show_bug.cgi?id=1410120

For upstream with vdsmfake we test only the idle state; we should see what happens when generating engine tasks, e.g. starting VMs,
plus an idle system with 500 hosts.

Comment 52 Roy Golan 2017-01-11 10:06:48 UTC
Can we reach higher numbers just by increasing the number of threads and connections (though not a welcome change by itself) and without the fixes proposed here? It's okay to test that with vdsmfake.

Comment 53 Martin Perina 2017-01-11 10:15:30 UTC
(In reply to Roy Golan from comment #52)
> Can we reach higher numbers just by increasing the number of threads and
> connections (though not a welcome change by itself) and without the fixes
> proposed here? It's okay to test that with vdsmfake.

We need to have all the fixes in fakevdsm and vdsm-jsonrpc-java. On the engine side we have just added logging to know when the Quartz thread pool is depleted, and bumped the required vdsm-jsonrpc-java version.

We also optimized the number of threads needed to execute check-for-upgrades in BZ1411648, although I recommend turning off check-for-upgrades completely if you are using fakevdsm.

Comment 54 Piotr Kliczewski 2017-01-11 11:41:27 UTC
We can talk to more hosts if we increase the quartz thread pool size and the number of db connections in the pools (this requires changes in the db config).
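For illustration, these are the standard Quartz knobs behind that - a hedged sketch; how the engine actually wires its DefaultQuartzScheduler may differ, and as noted the db connection pool (and PostgreSQL's max_connections) must grow along with the thread count:

    // Sketch: building a Quartz scheduler with a larger worker pool.
    import java.util.Properties;
    import org.quartz.Scheduler;
    import org.quartz.SchedulerException;
    import org.quartz.impl.StdSchedulerFactory;

    class QuartzPoolSketch {
        static Scheduler build(int threads) throws SchedulerException {
            Properties props = new Properties();
            props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
            props.put("org.quartz.threadPool.threadCount", Integer.toString(threads));
            return new StdSchedulerFactory(props).getScheduler();
        }
    }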

Comment 55 Eldad Marciano 2017-01-15 10:46:37 UTC
The issue reproduced on 4.1.0.3.

setup profile:
1 SD,
23 hosts,
0 VMs running.

Adding dumps.

Comment 58 Oved Ourfali 2017-01-22 09:13:32 UTC
Piotr - Should this move to ON_QA?

Comment 59 Piotr Kliczewski 2017-01-23 08:23:17 UTC
This BZ is targeted to 4.1. I posted patches to have it in this branch, but they have not been acked yet. I will move it once the version bump is merged into the engine.

Comment 60 Martin Perina 2017-01-23 08:36:31 UTC
Moving to 4.1.1, as the latest patch is not critical enough to require an RC rebuild.

Comment 72 Roy Golan 2017-02-01 11:17:36 UTC
mperina, please add a release note about enlarging the quartz thread pool, and also add the KB (if that exists already).

Comment 73 Martin Perina 2017-02-01 12:41:33 UTC
(In reply to Roy Golan from comment #72)
> mperina, please add a release note about enlarging the quartz thread pool,
> and also add the KB (if that exists already).

We don't want to enlarge the quartz pool until it's really necessary. So I hope that when Eldad tries to verify this bug, we will know with confidence whether we need to enlarge the pool for a certain number of hosts or not.

Comment 74 Roy Golan 2017-02-06 15:39:39 UTC
(In reply to Martin Perina from comment #73)
> (In reply to Roy Golan from comment #72)
> > mperina, please add a release note about enlarging the quartz thread pool,
> > and also add the KB (if that exists already).
> 
> We don't want to enlarge the quartz pool until it's really necessary. So I
> hope that when Eldad tries to verify this bug, we will know with confidence
> whether we need to enlarge the pool for a certain number of hosts or not.

Hence the release note - to advise how it will help in the case of a large number of hosts.

Comment 75 Oved Ourfali 2017-02-06 16:03:16 UTC
We will update what needs to be updated once the scale team does some testing.

Comment 81 Eldad Marciano 2017-02-12 12:37:14 UTC
Summarizing the latest ramp-up scenario for 250 hosts with a default installation.

We faced the problem at around 150 hosts and above. The problem occurs only when restarting the engine: some of the tasks are rejected, as we saw previously during the initialization phase,
but eventually the hosts come up and run properly.

We did a ramp-up scenario for host capacity;
see the following capacity vs. engine-restart results.

50 hosts, task rejected rate: 0
100 hosts, task rejected rate: 0
150 hosts, task rejected rate: 24 in 1 sec after restart.
200 hosts, task rejected rate: 126 in 8 sec after restart.
250 hosts, task rejected rate: 612 in 11 sec after restart.
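For a sense of what those rejections correspond to: a bounded pool taking a burst of per-host monitoring tasks at startup rejects whatever it cannot absorb, and the engine-side patch (gerrit 69960) logs such rejections. An illustrative sketch, not engine code, using the 100-thread default cited in comment 89:

    // Sketch: a startup burst of one monitoring task per host against a
    // 100-thread pool; the overflow hits the rejection handler.
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class StartupBurstSketch {
        public static void main(String[] args) {
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                100, 100, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),           // no queue: run now or reject
                (task, executor) ->                 // log rejections instead of throwing
                    System.err.println("monitoring task rejected: pool depleted"));
            for (int host = 0; host < 250; host++) {
                pool.execute(StartupBurstSketch::slowRefresh); // one task per host
            }
            pool.shutdown();
        }

        static void slowRefresh() {
            // stand-in for a refresh that holds a worker thread for a while
            try { Thread.sleep(15_000); } catch (InterruptedException ignored) {}
        }
    }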

Comment 88 Martin Perina 2017-02-13 11:15:22 UTC
(In reply to Eldad Marciano from comment #81)
> Summarizing the latest ramp-up scenario for 250 hosts with a default
> installation.
> 
> We faced the problem at around 150 hosts and above. The problem occurs only
> when restarting the engine: some of the tasks are rejected, as we saw
> previously during the initialization phase,
> but eventually the hosts come up and run properly.

So if I understand correctly, when the engine is up and running we can manage 250 hosts with the default configuration (no thread pool changes) without problems, right?
If so, that's great!

Yes, we know we have an issue during engine startup, when we try to execute HostMonitoring for all existing hosts pretty much at once. But because fencing is disabled for 5 minutes after startup, it shouldn't cause any unwanted host restarts, right? So, are you able to confirm that, even though some host monitoring tasks are rejected, hosts become Up and stable within a few minutes?

Comment 89 Eldad Marciano 2017-02-13 13:56:10 UTC
(In reply to Martin Perina from comment #88)
> (In reply to Eldad Marciano from comment #81)
> > Summarizing the latest ramp-up scenario for 250 hosts with a default
> > installation.
> > 
> > We faced the problem at around 150 hosts and above. The problem occurs only
> > when restarting the engine: some of the tasks are rejected, as we saw
> > previously during the initialization phase,
> > but eventually the hosts come up and run properly.
> 
> So if I understand correctly, when the engine is up and running we can manage
> 250 hosts with the default configuration (no thread pool changes) without
> problems, right?
> If so, that's great!
Yes, we just have some tasks rejected due to the small thread pool (100 threads by default), and then the hosts are monitored again; it only happens when restarting the engine.

It seems the engine tries to monitor all existing hosts at the same time, so we reach the pool's thread limit.

> 
> Yes, we know we have an issue during engine startup, when we try to execute
> HostMonitoring for all existing hosts pretty much at once. But because
> fencing is disabled for 5 minutes after startup, it shouldn't cause any
> unwanted host restarts, right? So, are you able to confirm that, even though
> some host monitoring tasks are rejected, hosts become Up and stable within a
> few minutes?
Yes, I already mentioned it in comment https://bugzilla.redhat.com/show_bug.cgi?id=1364791#c81

Is any further information needed?

Comment 90 eberman 2017-02-15 10:18:10 UTC
(In reply to Roy Golan from comment #72)
> mperina please add release note for enlarging the quartz thread pool and
> also add the kb(if that exist already)

According to this, after manually reconfiguring the pool size we are now at 250 hosts under a nested topology.


What is the next step from our side?

Comment 91 Martin Perina 2017-02-16 09:37:18 UTC
According to Comment 81, Comment 88 and Comment 89, I'd say the bug is verified.

I will create a KCS article on how to increase the Quartz pool size (and the db connection pool + the maximum number of client connections on PostgreSQL, which also needs to be raised if the quartz pool is increased), but I'd not change the defaults, because it's a waste of resources for most customers.

Comment 92 Eldad Marciano 2017-03-07 15:24:42 UTC
(In reply to Martin Perina from comment #91)
> According to Comment 81, Comment 88 and Comment 89, I'd say the bug is
> verified.
> 
> I will create a KCS article on how to increase the Quartz pool size (and the
> db connection pool + the maximum number of client connections on PostgreSQL,
> which also needs to be raised if the quartz pool is increased), but I'd not
> change the defaults, because it's a waste of resources for most customers.

We saw the issue reproduce again with default settings and 250 hosts.
The issue reproduced when switching 50 of the 250 hosts to maintenance.

Moreover, we saw the task rejection issue in another scenario; please see
https://bugzilla.redhat.com/show_bug.cgi?id=1429534

Comment 93 Piotr Kliczewski 2017-03-13 08:17:21 UTC
Eldad,

As Martin mentioned, we documented the setting that helps to mitigate the issue. I agree that we need to improve how we use quartz jobs, which is why you opened BZ #1429534. We have discussed the issue and, as Martin mentioned, we do not want to change the default value. Please tell me what you expect to be done for this BZ.

Comment 94 Eldad Marciano 2017-03-13 22:23:47 UTC
(In reply to Piotr Kliczewski from comment #93)
> Eldad,
> 
> As Martin mentioned, we documented the setting that helps to mitigate the
> issue. I agree that we need to improve how we use quartz jobs, which is why
> you opened BZ #1429534. We have discussed the issue and, as Martin mentioned,
> we do not want to change the default value. Please tell me what you expect to
> be done for this BZ.

Just mentioning the latest findings, and adding a "depends on" BZ.
Based on your and Martin's suggestions, let's handle it in BZ 1429534.

Comment 95 eberman 2017-04-09 11:10:51 UTC
(In reply to Piotr Kliczewski from comment #93)
> Eldad,
> 
> As Martin mentioned, we documented the setting that helps to mitigate the
> issue. I agree that we need to improve how we use quartz jobs, which is why
> you opened BZ #1429534. We have discussed the issue and, as Martin mentioned,
> we do not want to change the default value. Please tell me what you expect to
> be done for this BZ.

Is there another specific reproduction required? Any missing info?

(In reply to Martin Perina from comment #91)
> According to Comment 81, Comment 88 and Comment 89, I'd say the bug is
> verified.
> 
> I will create a KCS article on how to increase the Quartz pool size (and the
> db connection pool + the maximum number of client connections on PostgreSQL,
> which also needs to be raised if the quartz pool is increased), but I'd not
> change the defaults, because it's a waste of resources for most customers.

Can we also see and comment on this doc?

Comment 96 eberman 2017-04-12 08:37:09 UTC
(In reply to Martin Perina from comment #91)
> According to Comment 81, Comment 88 and Comment 89, I'd say the bug is
> verified.
> 
> I will create a KCS article on how to increase the Quartz pool size (and the
> db connection pool + the maximum number of client connections on PostgreSQL,
> which also needs to be raised if the quartz pool is increased), but I'd not
> change the defaults, because it's a waste of resources for most customers.

Where can we review the KCS article so we can confirm from our side?

Comment 97 Gil Klein 2017-04-13 07:42:50 UTC
As this is mainly about documentation, it will be verified once the KCS article is ready.

Moving to 4.1.2.

Comment 98 Martin Perina 2017-04-13 22:38:27 UTC
Because of the fix provided by BZ1429534 we don't need to raise the thread pool size anymore, so I no longer see a reason for such a KCS article. On the contrary, during 4.2 development we will try to lower the number of threads the engine uses even further.

From my point of view this bug should be marked as VERIFIED, because BZ1429534 is already VERIFIED and they both try to solve the same issue. So I don't see a reason to retarget it to 4.1.2.

Comment 99 Gil Klein 2017-04-18 08:28:11 UTC
OK, moving to verified based on comment #98.

