Please also share the logs.
Yes. Piotr, I again found the "Unrecognized messages" error on the vdsmfake side. It seems the issue reproduces with a bunch of hosts, so we need to resolve it before we can draw any conclusions.
I checked the latest logs and it looks like fakevdsm is sending many empty lines, which breaks the message-parsing code. Investigating what could cause those empty lines.
(In reply to Piotr Kliczewski from comment #7)
> I checked the latest logs and it looks like fakevdsm is sending many empty
> lines, which breaks the message-parsing code. Investigating what could cause
> those empty lines.

Piotr, any news?
> Piotr, any news?

We know that the empty lines are related to heartbeats, which accumulate over time. It is not related to this locking issue.
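For context, STOMP heart-beats are sent as bare end-of-line bytes between frames, so on the wire they look exactly like empty lines. The sketch below is illustrative only (class and method names are made up; this is not the actual vdsm-jsonrpc-java or vdsmfake parser): a tolerant reader simply skips blank heart-beat lines instead of letting them break frame parsing.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class HeartbeatSkippingReader {
    // Returns the next frame command, skipping heart-beat newlines; null on EOF.
    static String nextCommand(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.isEmpty()) {
                return line;            // e.g. "MESSAGE", "SEND", "ERROR"
            }
            // An empty line is a heart-beat: ignore it instead of failing the parse.
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new StringReader("\n\n\nMESSAGE\ndestination:example\n"));
        System.out.println(nextCommand(in));   // prints MESSAGE
    }
}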
We need urgent priority, since we are blocked from running multiple hosts.
(In reply to Eldad Marciano from comment #13)
> We need urgent priority, since we are blocked from running multiple hosts.

As I've asked before in another thread, and on the current email thread, can you check a smaller number of hosts and gradually increase it to see at what number it happens? I think we can get valuable scale data also from environments with fewer than 300 hosts. Can you check 50? 100? 150? 200? 250?
Currently I reproduce it with 300 hosts. Either way, we need it working for 500 hosts.
(In reply to Eldad Marciano from comment #15)
> Currently I reproduce it with 300 hosts.
> Either way, we need it working for 500 hosts.

Please test with fewer hosts and see what you find. We keep saying we should test with fewer hosts, and I think the right approach is to see where the issue starts, as you can provide a lot of valuable scale results with a smaller number of hosts.
(In reply to Oved Ourfali from comment #16)
> (In reply to Eldad Marciano from comment #15)
> > Currently I reproduce it with 300 hosts.
> > Either way, we need it working for 500 hosts.
>
> Please test with fewer hosts and see what you find. We keep saying we should
> test with fewer hosts, and I think the right approach is to see where the
> issue starts, as you can provide a lot of valuable scale results with a
> smaller number of hosts.

I tested it with 200 and 150 hosts. With 200 the issue still reproduced; with 150 it is pretty stable. So it seems we can say the problem starts at around 200 hosts, which is less than the default value of MaxNumberOfHostsInStoragePool.
This issue is blocking us from making progress with the scale-out regression tests that we have been running since 3.4, which include 500 hosts.
The issue reproduced for me on a hybrid env mixing 30 real hosts and 150 fake hosts. Both real and fake hosts suffer from this issue, so I'll change the bug title.

Adding logs + TD. It looks like the monitor lock is held for too long, ~15 sec and more. Please see "0x0000000088f4d3f8", which relates to:

        at org.ovirt.engine.core.bll.VdsEventListener.vdsUpEvent(VdsEventListener.java:308)
        at org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring.refresh(HostMonitoring.java:93)
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:238)
        - locked <0x0000000088f4d3f8> (a java.lang.Object)
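To illustrate the pattern visible in that dump (class and field names here are hypothetical, not the real VdsManager code): a timer callback that holds the manager's monitor while a slow refresh path runs keeps every other thread that needs the same monitor waiting for those ~15 seconds.

public class HostManagerSketch {
    private final Object lockObj = new Object();

    public void onTimer() {
        synchronized (lockObj) {          // corresponds to the "locked <0x...>" frame
            refresh();                    // slow path: fires the "vds up" handling etc.
        }
    }

    private void refresh() {
        try {
            Thread.sleep(15_000);         // stands in for the ~15s of event handling
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        new HostManagerSketch().onTimer();  // any concurrent caller would block here
    }
}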
Created attachment 1200790 [details]
adding logs and TD

See the following lock: 0x0000000088f4d3f8
This bug was reproduced on master. The issue appears when working with 150 hosts and 1 SD.
- Please use Fedora's http://paste.fedoraproject.org and non-private comments
- Please add a thread dump of vdsmfake
- As a workaround you can deploy several fakevdsm instances and use dnsmasq to dynamically resolve the new addresses (see https://github.com/oVirt/ovirt-vdsmfake)
- Please increase the heartbeat to at least 15 (to surpass the 15s wait) and see how it behaves
(In reply to Roy Golan from comment #24)
> - Please use Fedora's http://paste.fedoraproject.org and non-private comments
> - Please add a thread dump of vdsmfake
> - As a workaround you can deploy several fakevdsm instances and use dnsmasq
>   to dynamically resolve the new addresses (see
>   https://github.com/oVirt/ovirt-vdsmfake)
> - Please increase the heartbeat to at least 15 (to surpass the 15s wait) and
>   see how it behaves

This bug reproduced on oVirt master with the following vdsmfake distribution: 5 vdsmfake instances * 25 hosts each = 125 hosts. I used the latest GitHub repository for the test.

The issue reproduced when restarting vdsmfake (instance #1): hosts became non-responsive, then most of them went back to the "Up" state, while 2 of them got stuck on "Connecting" for a long period before finally reaching "Up". Bottom line, it looks like hosts get stuck for a long period.

Adding thread dumps from both the engine and vdsmfake. In the engine log there is nothing besides the "Vds timeout occured" error.
Created attachment 1220442 [details] engine and vdsmfake thread dumps
Removing the blocker flag, for 2 reasons:
1. It is how 4.0 behaved.
2. We need re-testing in 4.1 - we believe we have made some good progress in various areas partially related to this as well. Specifically, host monitoring should be significantly improved (ETA - next week).
This problem reproduced with 163 nested hosts without storage. Test profile:
1. Run 163 VMs to serve as nested hosts.
2. Assign the VMs as nested hosts in a different datacenter (with no storage).
3. Power off the VMs in order to simulate a restart, the same way we restarted vdsmfake before.
4. See the hosts become non-responsive.
5. Start the VMs again.
6. Restart the engine.
7. See some of the nested hosts stay non-responsive and some get stuck on "Activating".
We are also exploring that, but it would be helpful to get a thread dump so that we'll be able to compare. Can you produce it?
Created attachment 1233397 [details]
reproducing the problem with 163 nested hosts

Attaching thread dumps. See the following lock on the latch call: 0x000000008a026d68 (or search for DefaultQuartzScheduler71).
(In reply to Oved Ourfali from comment #33)
> We are also exploring that, but it would be helpful to get a thread dump so
> that we'll be able to compare. Can you produce it?

I just replied with it. Also, the vdsm-jsonrpc-java version is vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch.
I see the STOMP reactor looping over all channels trying to get the connection back. It probably doesn't find any of the channels open.

I started with:

$ grep "Connecting to" /var/log/ovirt-engine/engine.log   # tons of records

Then went deeper to see what the thread is constantly trying to read.

- First spot the problematic thread:

$ ps -Leo pcpu,pmem,pid,lwp,user | sort -k1 -r | head -4
%CPU %MEM   PID   LWP USER
81.3  4.8 13203 13446 ovirt

- LWP is the thread id; convert it to hex:

$ printf "0x%x" 13446

- Get the thread from the stack trace:

$ su ovirt -s /bin/bash -c "jstack $(pgrep -f jboss-modules.jar)" | grep -A 2 3486
"SSL Stomp Reactor" #226 daemon prio=5 os_prio=0 tid=0x00007fd1b0245800 nid=0x3486 runnable [0x00007fd13faf9000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.NativeThread.current(Native Method)

- strace shows this thread is performing ~4000 reads per second without finding anything:

$ strace -ff -x -o strace.out -t -y -p 13446
$ grep "09:36:15 read" strace.out.13446 | wc -l
4002

Piotr, Martin, don't we have a mechanism to simply stop connecting after a while?
Jsonrpc java is quite dumb and it performs operations that the engine is asking for, like sending. When a channel is closed and we have a send request, we attempt to connect. Usually host monitoring triggers the sends.
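Roughly, the behaviour described above looks like the sketch below (a hypothetical class, not the real vdsm-jsonrpc-java client): send() reconnects whenever the channel is down and there is no back-off, so while a peer is unreachable every monitoring send turns into another connect attempt, which matches the busy reactor loop shown above.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class ReconnectingSender {
    private final InetSocketAddress address;
    private SocketChannel channel;

    public ReconnectingSender(String host, int port) {
        this.address = new InetSocketAddress(host, port);
    }

    public void send(byte[] message) throws IOException {
        if (channel == null || !channel.isConnected()) {
            // No delay or retry budget here: every send while the peer is down
            // costs a fresh connect attempt.
            channel = SocketChannel.open();
            channel.connect(address);
        }
        channel.write(ByteBuffer.wrap(message));
    }
}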
(In reply to Piotr Kliczewski from comment #37)
> Jsonrpc java is quite dumb and it performs operations that the engine is
> asking for, like sending. When a channel is closed and we have a send
> request, we attempt to connect. Usually host monitoring triggers the sends.

Right, but I would expect to see far less stress here. The engine is trying to connect every 3 seconds, right? It does not look like that at all. This looks like retries of the reactor.
Correct, I saw many times that the engine triggers the job a couple of times per second.
It seems we have 2 issues here: one is the lock caused by DefaultQuartzScheduler71 -> countDown, which is never released and holds the lock; the other is the reactor's intensive processing.
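For the first issue, the hazard is the classic unbounded latch wait: if countDown() is never called, await() parks the Quartz worker forever, and anything synchronized around it stays locked. A minimal sketch of the bounded-wait alternative (names are illustrative, not the engine code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchSketch {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);

        // done.await();                                     // blocks forever if nobody counts down
        boolean finished = done.await(15, TimeUnit.SECONDS);  // bounded wait releases the worker
        if (!finished) {
            System.out.println("gave up waiting; releasing the worker thread");
        }
    }
}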
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Summarizing the latest findings, based on https://bugzilla.redhat.com/show_bug.cgi?id=1364791#c43:

A zstream env with bare-metal hosts + patched vdsm-json-rpc (https://gerrit.ovirt.org/#/q/topic:slow_host_start) works fine in the idle state, but when running VMs and adding more tasks to the system we found an issue on the vdsm side, see https://bugzilla.redhat.com/show_bug.cgi?id=1410120.

For upstream with vdsmfake we test only the idle state; we should see what happens when generating engine tasks (e.g. starting VMs) on top of an idle system with 500 hosts.
Can we reach higher numbers just by increasing the number of threads and connections (though not a welcome change by itself) and without the fixes proposed here? It's okay to test that with vdsmfake.
(In reply to Roy Golan from comment #52)
> Can we reach higher numbers just by increasing the number of threads and
> connections (though not a welcome change by itself) and without the fixes
> proposed here? It's okay to test that with vdsmfake.

We need to have all fixes in fakevdsm and vdsm-jsonrpc-java. On the engine side we have only added logging to know if the Quartz thread pool is depleted and bumped the required vdsm-jsonrpc-java version. We also optimized the number of threads needed to execute the check for upgrades in BZ1411648, although I recommend turning off the check for upgrades completely if you are using fakevdsm.
We can talk to more hosts if we increase the Quartz thread pool size and the number of DB connections in the pools (this requires changes to the DB config).
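For reference, the standard Quartz knob for the worker count is org.quartz.threadPool.threadCount. The snippet below only illustrates that property through plain Quartz APIs; it is an assumption that the engine's own scheduler is tuned the same way (it wires its configuration differently), and the value 500 is just an example, not a recommendation.

import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class BiggerQuartzPool {
    public static void main(String[] args) throws SchedulerException {
        Properties props = new Properties();
        props.setProperty("org.quartz.scheduler.instanceName", "DefaultQuartzScheduler");
        props.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        props.setProperty("org.quartz.threadPool.threadCount", "500"); // example value only

        Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();
        scheduler.start();
        // Note: more workers can mean more concurrent DB connections, so the
        // engine DB pool and PostgreSQL max_connections have to grow with it.
        scheduler.shutdown();
    }
}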
The issue reproduced on 4.1.0.3. Setup profile: 1 SD, 23 hosts, 0 VMs running. Adding dumps.
Piotr - Should this move to ON_QA?
This BZ is targeted to 4.1. I posted patches to have it in this branch, but they were not acked yet. I will move it once the version bump is merged into the engine.
Moving to 4.1.1, as the latest patch is not critical enough to require an RC rebuild.
mperina, please add a release note for enlarging the Quartz thread pool, and also add the KB (if that exists already).
(In reply to Roy Golan from comment #72)
> mperina, please add a release note for enlarging the Quartz thread pool, and
> also add the KB (if that exists already).

We don't want to enlarge the Quartz pool until it's really necessary. So I hope that when Eldad tries to verify this bug, we will know whether we need to enlarge the pool for a certain number of hosts or not.
(In reply to Martin Perina from comment #73)
> (In reply to Roy Golan from comment #72)
> > mperina, please add a release note for enlarging the Quartz thread pool, and
> > also add the KB (if that exists already).
>
> We don't want to enlarge the Quartz pool until it's really necessary. So I
> hope that when Eldad tries to verify this bug, we will know whether we need
> to enlarge the pool for a certain number of hosts or not.

Hence the release note - to advise how it will help in the case of a large number of hosts.
We will update what needs to be updated once the scale team does some testing.
Summarizing the latest ramp-up scenario for 250 hosts with a default installation.

We hit the problem at around 150 hosts and above. The problem occurs only when restarting the engine: some of the tasks are rejected, as we saw previously during the initialization phase, but eventually the hosts come up and run properly.

We did a ramp-up scenario for host capacity; see the following capacity vs. engine-restart results:

50 hosts, rejected tasks: 0
100 hosts, rejected tasks: 0
150 hosts, rejected tasks: 24 within 1 sec after restart.
200 hosts, rejected tasks: 126 within 8 sec after restart.
250 hosts, rejected tasks: 612 within 11 sec after restart.
(In reply to Eldad Marciano from comment #81)
> Summarizing the latest ramp-up scenario for 250 hosts with a default
> installation.
>
> We hit the problem at around 150 hosts and above. The problem occurs only
> when restarting the engine: some of the tasks are rejected, as we saw
> previously during the initialization phase, but eventually the hosts come up
> and run properly.

So if I understand correctly, when the engine is up and running we can manage 250 hosts with the default configuration (no thread pool changes) without problems, right? If so, that's great!

Yes, we know we have an issue during engine startup, when we try to execute HostMonitoring for all existing hosts pretty much at once. But because fencing is disabled for 5 minutes after startup, it shouldn't cause any unwanted host restarts, right? So, are you able to confirm that even though some host monitoring tasks are rejected, hosts become Up and stable within a few minutes?
(In reply to Martin Perina from comment #88)
> So if I understand correctly, when the engine is up and running we can
> manage 250 hosts with the default configuration (no thread pool changes)
> without problems, right? If so, that's great!

Yes, we just have some tasks rejected due to the small thread pool (100 threads by default), and then the hosts get monitored again; it only happens when restarting the engine. It seems the engine tries to monitor all existing hosts at the same time, so we hit the pool limit in terms of threads.

> Yes, we know we have an issue during engine startup, when we try to execute
> HostMonitoring for all existing hosts pretty much at once. But because
> fencing is disabled for 5 minutes after startup, it shouldn't cause any
> unwanted host restarts, right? So, are you able to confirm that even though
> some host monitoring tasks are rejected, hosts become Up and stable within a
> few minutes?

Yes, I already mentioned it in https://bugzilla.redhat.com/show_bug.cgi?id=1364791#c81. Is any further information needed?
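A toy reproduction of that startup burst (all numbers and names here are illustrative, not the engine's actual executor wiring): a fixed pool of 100 workers with a bounded backlog rejects part of a burst of 250 monitoring tasks submitted at once, which is the same shape as the rejection counts reported above.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StartupBurst {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                100, 100, 0, TimeUnit.SECONDS,      // fixed 100 workers, like the default pool size
                new ArrayBlockingQueue<>(100));      // bounded backlog

        int rejected = 0;
        for (int host = 0; host < 250; host++) {
            try {
                pool.execute(() -> {
                    try {
                        Thread.sleep(1_000);         // pretend host monitoring takes a while
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                          // default AbortPolicy rejects the overflow
            }
        }
        System.out.println("rejected monitoring tasks: " + rejected);
        pool.shutdown();
    }
}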
(In reply to Roy Golan from comment #72)
> mperina, please add a release note for enlarging the Quartz thread pool, and
> also add the KB (if that exists already).

According to this, after manually reconfiguring the pool size we are now at 250 hosts under the nested topology. What is the next step from our side?
According to Comment 81, Comment 88 and Comment 89 I'd say the bug is verified.

I will create a KCS article on how to increase the Quartz pool size (and the DB connection pool + the maximum number of client connections on PostgreSQL, which is also needed if the Quartz pool is increased), but I'd not change the defaults, because it's a waste of resources for most customers.
(In reply to Martin Perina from comment #91)
> According to Comment 81, Comment 88 and Comment 89 I'd say the bug is
> verified.
>
> I will create a KCS article on how to increase the Quartz pool size (and the
> DB connection pool + the maximum number of client connections on PostgreSQL,
> which is also needed if the Quartz pool is increased), but I'd not change
> the defaults, because it's a waste of resources for most customers.

We saw the issue reproduced again with default settings and 250 hosts; it reproduced when switching 50 hosts out of 250 to maintenance. Moreover, we saw the task-rejection issue in another scenario; please see https://bugzilla.redhat.com/show_bug.cgi?id=1429534.
Eldad,

As Martin mentioned, we documented the setting which helps to mitigate the issue. I agree that we need to handle how we use Quartz jobs better, so you opened BZ #1429534. We have discussed the issue and, as Martin mentioned, we do not want to change the default value. Please tell me what you expect to be done for this BZ.
(In reply to Piotr Kliczewski from comment #93)
> As Martin mentioned, we documented the setting which helps to mitigate the
> issue. I agree that we need to handle how we use Quartz jobs better, so you
> opened BZ #1429534. We have discussed the issue and, as Martin mentioned, we
> do not want to change the default value. Please tell me what you expect to
> be done for this BZ.

Just mentioning the latest findings and adding a "depends on" BZ. Based on your and Martin's suggestions, let's handle it in BZ 1429534.
(In reply to Piotr Kliczewski from comment #93)
> As Martin mentioned, we documented the setting which helps to mitigate the
> issue. I agree that we need to handle how we use Quartz jobs better, so you
> opened BZ #1429534. We have discussed the issue and, as Martin mentioned, we
> do not want to change the default value. Please tell me what you expect to
> be done for this BZ.

Is another specific reproduction required? Any missing info?

(In reply to Martin Perina from comment #91)
> I will create a KCS article on how to increase the Quartz pool size (and the
> DB connection pool + the maximum number of client connections on PostgreSQL,
> which is also needed if the Quartz pool is increased), but I'd not change
> the defaults, because it's a waste of resources for most customers.

Can we also see and comment on this doc?
(In reply to Martin Perina from comment #91)
> I will create a KCS article on how to increase the Quartz pool size (and the
> DB connection pool + the maximum number of client connections on PostgreSQL,
> which is also needed if the Quartz pool is increased), but I'd not change
> the defaults, because it's a waste of resources for most customers.

Where can we review the KCS article so we can confirm from our side?
As this is mainly about documentation, it will be verified once the KCS article is ready. Moving to 4.1.2.
Because of the fix provided by BZ1429534 we don't need to raise the thread pool size anymore, so I no longer see a reason for such a KCS article. On the contrary, during 4.2 development we will try to lower the number of threads the engine uses even further. From my point of view this bug should be marked as VERIFIED, because BZ1429534 is already VERIFIED and they both try to solve the same issue. So I don't see a reason we should retarget it to 4.1.2.
Ok, moving to verified based on comment #98