Description of problem:
Added 6000 hosts to Satellite (1500 per Capsule, 4 such Capsules). Ran a simple "date" command with ReX. It took 7 hours to complete the run on 6000 nodes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
ReX takes a long time to complete

Expected results:
ReX should take less time

Additional info:
Created attachment 1244593 [details] rex on 1029 hosts (last page & first page) after 5 mins of submitting job
A couple of observations:
1) It is going serially. Even 5 minutes after the package was updated on the first host, the last content hosts had not even prepared.
2) It takes time to report status on the UI even after the package is updated (noticed more than 3 minutes for a few hosts).
See also https://bugzilla.redhat.com/show_bug.cgi?id=1387658
Some observations from a smaller test:
- command "date" invoked
- on 126 hosts
- evenly distributed over 4 Capsules

Results:
- individual sub-tasks started during 11:31:05 - 11:33:17
  - it is known that REX invokes at most 1 job per second (smaller problem)
- dynflow executor started to handle them during 11:33:19 - 11:33:29
- dynflow atoms/tasks RunProxyCommand started during 10:33:19 - 10:38:22
- hosts really executed the command during 10:33:35 - 10:38:26 (output of "date" was in that range)
- dynflow atoms/tasks RunProxyCommand finished during 10:36:15 - 10:38:34
- (foreman) task finished during 11:36:19 - 11:38:35 (most often in 11:38)

Key questions:
- can't we generate job tasks faster than 1 task per second?
- why did dynflow start some jobs with an almost 5-minute delay?
- why did it take almost 3 minutes to detect that the very first completed job was really completed? (cf. 1st "date" output and 1st dynflow task finish)
The "6k hosts: date" test: Results: - individual sub-tasks started during 22:52:38 - 00:51:11 - dynflow executor started to handle them during 22:52:39 - 00:51:13 - dynflow atoms/tasks RunProxyCommand started during 22:52:39 - 07:32:33 (tasks after 04:19:23 were cancelled) - hosts really executed the command during 22:52:46 - 04:19:17 (output of "date" was in that range) - dynflow atoms/tasks RunProxyCommand finished during 22:52:53 - 04:19:19 - (foreman) task finished during 23:53:59 - 07:51:36 (in several evident groups/bulks, like 1 task at 04:50:* but 260 tasks at 04:52:*, latests tasks cancelled) Key questions here: > - can't be generating a job.task faster than 1 task per second? (applies here as well) > - why dynflow started some jobs with almost 5 minutes delay? Why dynflow started some tasks with _hours_ of delays? example: https://gprfc018.sbu.lab.eng.bos.redhat.com/foreman_tasks/dynflow/0c5489b0-9feb-41bd-9f86-ef2636436782 - why it took almost 3 minutes to detect very-first completed job is really completed? (seen but less evident - 1 minute delay only) Anyway, the "dynflow starts tasks too late after handling them" is the key problem here.
TL;DR version: increasing the number of dynflow workers from 5 to 10 helps a lot.

(In reply to Pavel Moravec from comment #6)
> Some observations from a smaller test:
> - command "date" invoked
> - on 126 hosts
> - evenly distributed over 4 Capsules

Pradeep re-ran the test with 10 dynflow workers (manually changed in some .rb file, followed by katello-service restart).

> Results:
> - individual sub-tasks started during 11:31:05 - 11:33:17
>   - it is known that REX invokes at most 1 job per second (smaller problem)
tasks started during 12:32:43 - 12:35:38, i.e. again 1 job per second
> - dynflow executor started to handle them during 11:33:19 - 11:33:29
now during 12:32:45 - 12:35:39
> - dynflow atoms/tasks RunProxyCommand started during 10:33:19 - 10:38:22
now during 12:32:44 - 12:37:46
> - hosts really executed the command during 10:33:35 - 10:38:26 (output of "date" was in that range)
now executed during 12:32:56 - 12:37:43
> - dynflow atoms/tasks RunProxyCommand finished during 10:36:15 - 10:38:34
now during 12:33:47 - 12:37:40
> - (foreman) task finished during 11:36:19 - 11:38:35 (most often in 11:38)
now finished during 12:33:51 - 12:37:47 (most often in 12:37)

Previous total exec time: 08:30
New exec time: 05:04

> Key questions:
> - can't we generate job tasks faster than 1 task per second?
this persists (expected)
> - why did dynflow start some jobs with an almost 5-minute delay?
this main concern improved from 5 minutes to 2 minutes
> - why did it take almost 3 minutes to detect that the very first completed job was really completed?
also here there is an improvement, from 02:40 to 00:51

So the number of dynflow workers is the key scaling parameter/bottleneck here.
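For reference, a minimal sketch of this kind of change. The file is only described above as "some .rb file"; the settings-file path and key names below are assumptions based on how foreman-tasks usually exposes the executor pool size:

  # /etc/foreman/plugins/foreman-tasks.yaml (path and keys are assumptions)
  :foreman-tasks:
    :dynflow:
      :pool_size: 10    # executor worker pool, default 5

  # restart services so the new pool size takes effect
  katello-service restart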
Will run with 6k hosts and post results.
Repeated the test with 6.1k hosts running the "date" command, with 10 dynflow workers and _no_ concurrency level set:

- job ran during 16:51:20 - 22:37:05 (later on, just 30 hung tasks were "active")
- 6k sub-tasks started during 16:51:51 - 18:59:01 (i.e. a bit below 1 task per second)
- dynflow picked the tasks up during 16:51:51 - 18:59:02
- dynflow RunProxyCommand started during 16:51:51 - 22:36:59
- "date" command executed during 16:52:18 - 22:37:01
- dynflow RunProxyCommand finished during 16:52:45 - 22:37:03
- 6k sub-tasks finished during 16:53:31 - 22:37:05

Key blockers here:
- the classical "1 sub-task per second at max"
- why was the first sub-task created 31s after job invocation?
- why did dynflow start some tasks 3.5 hours after picking them up from foreman-tasks?
- why was there a task with RunProxyCommand finished at 16:52:45 that dynflow handed back to foreman only at 18:39:51?
- some minor delays between:
  - dynflow atom start -> "date" executed
  - "date" executed -> dynflow atom finished

I.e. similar issues as before. Couldn't a concurrency setting help here?
(In reply to Pavel Moravec from comment #20)
> Repeated the test with 6.1k hosts running the "date" command, with 10 dynflow
> workers and _no_ concurrency level set:

... and now with concurrency set to 2000 and time span set to 36000:

- job ran during 00:00:36 - 07:50:22 (still running (some hung task), but the latest activity was seen at 07:50)
- 6k sub-tasks started during 00:01:18 - 01:52:21
- dynflow picked the tasks up during 00:52:23 - 07:00:13
- dynflow RunProxyCommand started during 00:52:23 - 07:49:47
- "date" command executed during 00:52:43 - 07:49:57
- dynflow RunProxyCommand finished during 01:41:39 - 07:50:21
- 6k sub-tasks finished during 01:43:19 - 07:50:22

Main delays:
- child tasks started to be created >30s after the job kicked off - why?
- as usual, at most 1 child task created per second
- dynflow started to pick up tasks _after_ _all_ child tasks were created - this hasn't happened before, and caused an almost 2-hour delay (time span effect?)
- dynflow was slow in reacting to completed jobs (some RunProxyCommand finished 50 minutes after "date" was executed)
- dynflow returned tasks back to foreman with some delay (90s)

Critical delays:
- dynflow started to pick up tasks _after_ _all_ child tasks were created :(
- dynflow was slow in reacting to completed jobs
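For reference, a hedged example of how a job with these knobs might be invoked from hammer; the flag names are taken from the remote execution hammer plugin of this era, and the search query is a placeholder:

  hammer job-invocation create \
    --job-template "Run Command - SSH Default" \
    --inputs command="date" \
    --search-query "name ~ example" \
    --concurrency-level 2000 \
    --time-span 36000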
(In reply to Pavel Moravec from comment #21)
> Critical delays:
> - dynflow started to pick up tasks _after_ _all_ child tasks were created :(
> - dynflow was slow in reacting to completed jobs

... and also dynflow was terribly slow in picking up individual jobs from foreman: this step took an unbelievable 6 hours!
(In reply to Pavel Moravec from comment #14)
> TL;DR version: increasing the number of dynflow workers from 5 to 10 helps a lot.
>
> Pradeep re-ran the test with 10 dynflow workers (manually changed in some
> .rb file, followed by katello-service restart).

Now with an extra time span = 3600:

> Previous total exec time: 08:30
> New exec time: 05:04

total duration: 12:12:33 - 12:20:57, so back to the old bad values :(

- tasks created during 12:12:40 - 12:14:55
- dynflow picked them up during 12:14:56 - 12:15:12, i.e. again _after_ all sub-tasks were created - due to the time span being set and bz1370139
- dynflow atom started during 12:14:56 - 12:18:14, i.e. still non-performant in starting to work on a received task
- "date" command executed during 12:15:17 - 12:18:23
- dynflow atom finished during 12:17:15 - 12:20:08, i.e. there is still a delay in reacting to the finished ssh command
- foreman tasks finished during 12:17:21 - 12:20:51

So, setting a time span (due to bz1370139) negatively postpones "dynflow picked up" and "dynflow atom finished".
My findings are:

1. The smart proxies take too long to respond when under heavier load (tens of seconds), which causes the dynflow actions on the main Satellite to keep the workers busy longer than needed.

2. The reason the proxy responses take so long seems to be the storage (sqlite in a file).

3. Switching sqlite to in-memory storage helped a lot. In /etc/smart_proxy_dynflow_core/settings.yml on the capsules, set:

  :database: ""

instead of

  :database: /var/lib/foreman-proxy/dynflow/dynflow.sqlite

and restart the dynflow executor on the capsule:

  service smart_proxy_dynflow_core restart

On the proxy I observed, this reduced response times from tens of seconds to 0.1 seconds, and should make the overall progress much faster. It should also improve the responsiveness to finished tasks.
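Put together, a minimal sketch of the workaround on each Capsule (assuming, per the above, that an empty :database: value makes smart_proxy_dynflow_core fall back to an in-memory sqlite database):

  # /etc/smart_proxy_dynflow_core/settings.yml
  # before (persistent sqlite on disk):
  #:database: /var/lib/foreman-proxy/dynflow/dynflow.sqlite
  # after (empty value = in-memory storage):
  :database: ""

Then apply the change:

  service smart_proxy_dynflow_core restart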
Ivan, what are the next steps then? Is changing to in-memory something we can do in the product without loss of functionality? Or do we need this storage to give us the appropriate level of resiliency?
ReX is running like a rocket after switching sqlite to in-memory storage. The smart proxies are responding very fast now. I haven't increased the dynflow workers to 10; running with the default (5).
I will update you with this info once I'm done with other related investigations
Created attachment 1248512 [details]
More effective way of refreshing the status of commands

I've identified the issue with high CPU on the smart proxy leading to slow response times after reaching a specific number of concurrent hosts. Without this patch, the system behaves similarly to how it did with persisted sqlite, leading to latencies in planning/execution.

To apply (with the support of git), run:

  cd /opt/theforeman/tfm/root/usr/share/gems/gems/smart_proxy_remote_execution_ssh_core-*
  git init; git add -A .; git commit -m origin
  patch -p1 < 0001-more-effective-ticking.patch
  service smart_proxy_dynflow_core restart

With this patch, one should not need the workaround provided in https://bugzilla.redhat.com/show_bug.cgi?id=1417978#c10 anymore.
Another thing I noticed at some point was the number of planned/pending Actions::Katello::Host::UploadPackageProfile actions (around 1000 and growing), which might also have an effect on the throughput of the remote execution job, as they compete for workers (a way to gauge this backlog is sketched below):

  state = planned and label = Actions::Katello::Host::UploadPackageProfile

I recommend looking into increasing the foreman-tasks workers pool_size and testing the behaviour: if we don't hit limits on db connections, I would recommend increasing the number. We might also look into actually running multiple dynflow workers to help with the load; the final piece (except the installer) was merged here: https://github.com/theforeman/foreman-tasks/pull/213
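For reference, a hedged way to check that backlog from the Rails console (the model and column names are as used by foreman-tasks; the exact console invocation may differ by version):

  # on the Satellite
  foreman-rake console
  > ForemanTasks::Task.where(state: 'planned',
      label: 'Actions::Katello::Host::UploadPackageProfile').count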
Here we go https://bugzilla.redhat.com/show_bug.cgi?id=1463506
We can try; there is an additional prerequisite for this, which is cleanup of the in-memory tasks. I'm adding http://projects.theforeman.org/issues/20626/ as a dependency for this BZ, to avoid never-ending growth of the in-memory database. We are fixing it now; ETA a few days.
Upstream bug assigned to aruzicka
Upstream bug assigned to inecas
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/20721 has been resolved.