Bug 1416542 - Persisted sqlite on smart proxy for remote execution is too slow for remote execution at scale
Summary: Persisted sqlite on smart proxy for remote execution is too slow for remote execution at scale
Keywords:
Status: CLOSED DUPLICATE of bug 1519342
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Remote Execution
Version: 6.2.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: Ivan Necas
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1519342
 
Reported: 2017-01-25 18:43 UTC by Pradeep Kumar Surisetty
Modified: 2021-06-10 11:52 UTC (History)
CC List: 14 users

Fixed In Version: tfm-rubygem-smart_proxy_dynflow_core-0.1.8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1519342 (view as bug list)
Environment:
Last Closed: 2018-01-02 19:29:01 UTC
Target Upstream Version:
Embargoed:


Attachments
rex on 1029 hosts (last page & first page) after 5 mins of submitting job (196.95 KB, image/png)
2017-01-26 05:01 UTC, Pradeep Kumar Surisetty
More effective way for refreshing status of commands (4.78 KB, patch)
2017-02-07 23:12 UTC, Ivan Necas


Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 20626 0 Normal Closed Provide a way for automatic cleanup of old execution plans from dynflow (especially useful on smart-proxy) 2020-06-26 14:42:05 UTC
Foreman Issue Tracker 20721 0 Normal Closed Use in-memory sqlite for smart_proxy_dynflow 2020-06-26 14:42:05 UTC

Description Pradeep Kumar Surisetty 2017-01-25 18:43:53 UTC
Description of problem:


Added 6000 hosts to Satellite (1500 per capsule, 4 such capsules).
Ran a simple "date" command with ReX. It took 7 hours to complete the run on the 6000 nodes.


Actual results:
ReX takes too long.

Expected results:
ReX should take less time.

Comment 2 Pradeep Kumar Surisetty 2017-01-26 05:01:50 UTC
Created attachment 1244593 [details]
rex on 1029 hosts (last page & first page) after 5 mins of submitting job

Comment 3 Pradeep Kumar Surisetty 2017-01-26 05:02:38 UTC
A couple of observations:

1) It's going serially. Even 5 minutes after the package was updated on the first host, the last content hosts hadn't even been prepared.
2) It takes time to report the status in the UI even after the package is updated (noticed more than 3 minutes for a few hosts).

Comment 5 Ivan Necas 2017-01-26 12:21:06 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1387658

Comment 6 Pavel Moravec 2017-01-26 12:25:26 UTC
Some observation from smaller test: 
- Command "date" invoked
- on 126 hosts
- evenly distributed over 4 Capsules

Results:
- individual sub-tasks started during 11:31:05 - 11:33:17
  - it is known that REX invokes at most 1 job per second (a smaller problem)

- dynflow executor started to handle them during 11:33:19 - 11:33:29

- dynflow atoms/tasks RunProxyCommand started during 10:33:19 - 10:38:22

- hosts really executed the command during 10:33:35 - 10:38:26 (output of "date" was in that range)

- dynflow atoms/tasks RunProxyCommand finished during 10:36:15 - 10:38:34

- (foreman) task finished during 11:36:19 - 11:38:35 (most often in 11:38)


Key questions:
- can't job tasks be generated faster than 1 task per second?

- why did dynflow start some jobs with almost 5 minutes of delay?

- why did it take almost 3 minutes to detect that the very first completed job had really completed (cf. 1st "date" output and 1st dynflow task finish)?

Comment 9 Pavel Moravec 2017-01-26 13:22:17 UTC
The "6k hosts: date" test:

Results:
- individual sub-tasks started during 22:52:38 - 00:51:11

- dynflow executor started to handle them during 22:52:39 - 00:51:13

- dynflow atoms/tasks RunProxyCommand started during 22:52:39 - 07:32:33 (tasks after 04:19:23 were cancelled)

- hosts really executed the command during 22:52:46 - 04:19:17 (output of "date" was in that range)

- dynflow atoms/tasks RunProxyCommand finished during 22:52:53 - 04:19:19

- (foreman) task finished during 23:53:59 - 07:51:36 (in several evident groups/bulks, like 1 task at 04:50:* but 260 tasks at 04:52:*; the latest tasks were cancelled)


Key questions here:
> - can't job tasks be generated faster than 1 task per second?
(applies here as well)

> - why did dynflow start some jobs with almost 5 minutes of delay?
Why did dynflow start some tasks with _hours_ of delay?

example: https://gprfc018.sbu.lab.eng.bos.redhat.com/foreman_tasks/dynflow/0c5489b0-9feb-41bd-9f86-ef2636436782

- why did it take almost 3 minutes to detect that the very first completed job had really completed?
(seen, but less evident - only a 1-minute delay)


Anyway, "dynflow starts tasks too late after picking them up" is the key problem here.

Comment 14 Pavel Moravec 2017-01-27 17:16:15 UTC
TL;DR version: increasing the number of dynflow workers from 5 to 10 helps a lot.


(In reply to Pavel Moravec from comment #6)
> Some observation from smaller test: 
> - Command "date" invoked
> - on 126 hosts
> - evenly distributed over 4 Capsules

Pradeep re-ran the test with 10 dynflow workers (manually changed in some .rb file, followed by katello-service restart)

> Results:
> - individual sub-tasks started during 11:31:05 - 11:33:17
>   - it is known that REX invokes at most 1 job per second (a smaller problem)

tasks started during 12:32:43 - 12:35:38, i.e. again 1 job per second

> - dynflow executor started to handle them during 11:33:19 - 11:33:29

now during 12:32:45 - 12:35:39

> - dynflow atoms/tasks RunProxyCommand started during 10:33:19 - 10:38:22

now during 12:32:44 - 12:37:46

> - hosts really executed the command during 10:33:35 - 10:38:26 (output of
> "date" was in that range)

now executed during 12:32:56 - 12:37:43

> - dynflow atoms/tasks RunProxyCommand finished during 10:36:15 - 10:38:34

now during 12:33:47 - 12:37:40

> - (foreman) task finished during 11:36:19 - 11:38:35 (most often in 11:38)

now finished during 12:33:51 - 12:37:47 (most often in 12:37)

Previous total exec time: 08:30
New exec time: 05:04

> Key questions:
> - can't job tasks be generated faster than 1 task per second?

this persists (expected)

> - why did dynflow start some jobs with almost 5 minutes of delay?

This main concern improved from 5 minutes to 2 minutes.

> - why did it take almost 3 minutes to detect that the very first completed
> job had really completed? (cf. 1st date output and 1st dynflow task finish)

Here too there is an improvement, from 02:40 to 00:51.

So the number of dynflow workers is the key scaling parameter/bottleneck here.
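
For reference, the tuning applied here boils down to a one-line change of the dynflow worker pool (a hypothetical sketch only; the exact file and setting name vary between Satellite versions, so treat the constant and its location below as assumptions):

    # Sketch of the dynflow worker-pool tuning described above, e.g. in a
    # foreman-tasks dynflow configuration .rb file (hypothetical location).
    # The pool size caps how many dynflow steps can execute concurrently:
    ForemanTasks.dynflow.config.pool_size = 10  # default was 5

Doubling the pool lets twice as many RunProxyCommand steps run in parallel, which matches the roughly halved delays observed above.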

Comment 15 Pradeep Kumar Surisetty 2017-01-27 17:19:47 UTC
will run with 6k hosts and post results

Comment 20 Pavel Moravec 2017-01-29 09:30:42 UTC
Repeated test with 6.1k hosts running "date" command, with 10 dynflow workers, with _no_ concurrency level set:

job run during 16:51:20 - 22:37:05 (later on, just 30 hung tasks were "active")

6k sub-tasks started during 16:51:51 - 18:59:01 (i.e. a bit below 1 task per second)

dynflow picked the tasks up during 16:51:51 - 18:59:02

dynflow RunProxyCommand started during 16:51:51 - 22:36:59

"date" command executed during 16:52:18 - 22:37:01

dynflow RunProxyCommand finished during 16:52:45 - 22:37:03

6k sub-tasks finished during 16:53:31 - 22:37:05

Key blockers here:
- the classical "at most 1 subtask per second"
- why was the first sub-task created as late as 31s after job invocation?
- why did dynflow start some tasks 3.5 hours after picking them up from foreman-tasks?
- why was there a task whose RunProxyCommand finished at 16:52:45, but dynflow handed it back to foreman only at 18:39:51?
- some minor delays between:
  - dynflow atom start -> "date" executed
  - "date" executed -> dynflow atom finished

I.e. similar issues as before.

Couldn't a concurrency setting help here?

Comment 21 Pavel Moravec 2017-01-30 10:04:13 UTC
(In reply to Pavel Moravec from comment #20)
> Repeated test with 6.1k hosts running "date" command, with 10 dynflow
> workers, with _no_ concurrency level set:

.. and now with concurrency set to 2000 and time span set to 36000:


job run during 00:00:36 - 07:50:22 (still running due to some hung tasks, but the latest activity was seen at 07:50)

6k sub-tasks started during 00:01:18 - 01:52:21

dynflow picked the tasks up during 00:52:23 - 07:00:13

dynflow RunProxyCommand started during 00:52:23 - 07:49:47

"date" command executed during 00:52:43 - 07:49:57

dynflow RunProxyCommand finished during 01:41:39 - 07:50:21

6k sub-tasks finished during 01:43:19 - 07:50:22


Main delays:
- child tasks started to be created >30s after job kicked off - why?
- as usual, at most 1 child task per second created
- dynflow started to pick up tasks _after_ _all_ child tasks were created - this hadn't happened before, and caused an almost 2-hour delay (time span effect?)
- dynflow was slow to react to completed jobs (some RunProxyCommand finished 50 mins after "date" was executed)
- dynflow returned tasks back to foreman with some delay (90s)


Critical delays:
- dynflow started to pick up tasks _after_ _all_ child tasks were created :(
- dynflow was slow to react to completed jobs

Comment 22 Pavel Moravec 2017-01-30 10:12:16 UTC
(In reply to Pavel Moravec from comment #21)
> dynflow picked the tasks up during 00:52:23 - 07:00:13
> [...]
> Critical delays:
> - dynflow started to pick up tasks _after_ _all_ child tasks were created :(
> - dynflow was slow to react to completed jobs

.. and dynflow was also terribly slow at picking up individual jobs from foreman - this step alone took an unbelievable 6 hours!

Comment 23 Pavel Moravec 2017-01-30 14:08:47 UTC
(In reply to Pavel Moravec from comment #14)
> TL;DR version: increasing the number of dynflow workers from 5 to 10 helps a lot.
> 
> 
> (In reply to Pavel Moravec from comment #6)
> > Some observation from smaller test: 
> > - Command "date" invoked
> > - on 126 hosts
> > - evenly distributed over 4 Capsules
> 
> Pradeep re-ran the test with 10 dynflow workers (manually changed in some
> .rb file, followed by katello-service restart)

Now with extra time span = 3600:

> Previous total exec time: 08:30
> New exec time: 05:04
total duration: 12:12:33 - 12:20:57, so back to old bad values :(

tasks created during 12:12:40 - 12:14:55

dynflow picked them up during 12:14:56 - 12:15:12, i.e. again _after_ all subtasks were created - due to the time span setting and bz1370139

dynflow atom started during 12:14:56 - 12:18:14, i.e. still slow to start working on received tasks

"date" command executed during 12:15:17 - 12:18:23

dynflow atom finished during 12:17:15 - 12:20:08, i.e. there is still a delay in reacting to the finished ssh command

foreman tasks finished during 12:17:21 - 12:20:51

So the time span setting, due to bz1370139, negatively postpones "dynflow picked up" and "dynflow atom finished".

Comment 24 Ivan Necas 2017-01-30 15:42:37 UTC
My findings are:

1. the smart proxies take too long to respond when under heavier load (tens of seconds), which causes the dynflow actions on the main Satellite to hold the workers longer than needed.

2. the reason the proxy responses take so long seems to be the storage (sqlite in a file).

3. Switching the sqlite to use the in-memory storage helped a lot:

# in /etc/smart_proxy_dynflow_core/settings.yml on capsules set:

    :database: ""

instead of `:database: /var/lib/foreman-proxy/dynflow/dynflow.sqlite`

and restart the dynflow executor on capsule:

    service smart_proxy_dynflow_core restart

On the proxy I observed, this brought response times down from ~10 seconds to 0.1 seconds, and it should make the overall progress much faster. It should also improve responsiveness to finished tasks.
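
For illustration, the reason an empty :database value helps (a sketch, assuming the persistence layer uses Sequel, as dynflow does):

    # Sketch: how an empty :database setting maps to in-memory sqlite.
    require 'sequel'

    db_path = ''  # the :database value from settings.yml
    # Sequel.sqlite without an argument connects to an in-memory database,
    # so persistence writes no longer pay file I/O and fsync costs on every
    # step update:
    db = db_path.empty? ? Sequel.sqlite : Sequel.sqlite(db_path)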

Comment 25 Bryan Kearney 2017-01-30 16:24:03 UTC
Ivan, what are next steps then? Is changing to in-memory something we can do in the product w/o loss of functionality? Or, do we need this storage to give us the appropriate level of resiliency?

Comment 27 Pradeep Kumar Surisetty 2017-01-31 06:32:18 UTC
ReX is running like a rocket after switching sqlite to the in-memory storage.
The smart proxies are responding very fast now. I haven't increased the dynflow workers to 10; running with the default (5).

Comment 35 Ivan Necas 2017-02-07 22:46:27 UTC
I will update you with this info once I'm done with other related investigations

Comment 36 Ivan Necas 2017-02-07 23:12:34 UTC
Created attachment 1248512 [details]
More effective way for refreshing status of commands

I've identified the issue with high CPU on the smart proxy, leading to slow response times after reaching a certain number of concurrent hosts.

Without this patch, the system behaves similarly to how it did with persisted sqlite,
leading to latencies in planning/execution.
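
The gist of the patch (an illustrative sketch only, not the actual diff; the class name and interval below are made up): instead of each running command scheduling its own status-refresh event, one periodic tick refreshes all commands in a single pass, so refresh overhead stays flat as the number of concurrent hosts grows:

    # Illustrative sketch of "more effective ticking" (hypothetical names):
    class CommandRefresher
      REFRESH_INTERVAL = 1 # second (assumed value)

      def initialize
        @commands = {} # command id => state of a running command
      end

      def run
        loop do
          # One pass per tick over all running commands, instead of one
          # scheduled refresh event per command:
          @commands.each_value(&:refresh)
          sleep REFRESH_INTERVAL
        end
      end
    end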

To apply (with the help of git), run:

cd /opt/theforeman/tfm/root/usr/share/gems/gems/smart_proxy_remote_execution_ssh_core-*
git init; git add -A .; git commit -m origin
patch -p1 < 0001-more-effective-ticking.patch
service smart_proxy_dynflow_core restart

With this patch, one should no longer need the workaround provided in https://bugzilla.redhat.com/show_bug.cgi?id=1417978#c10.

Comment 37 Ivan Necas 2017-02-07 23:24:23 UTC
Another thing I noticed at some point was the number of planned/pending
Actions::Katello::Host::UploadPackageProfile actions (around 1000 and growing),
which might also affect the throughput of remote execution jobs, as they compete for workers. They can be found with this foreman-tasks search:

  state = planned and label = Actions::Katello::Host::UploadPackageProfile
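
(The same count from a foreman-rake console would look roughly like this; a sketch assuming the standard foreman-tasks model and the state/label columns implied by the search syntax above:)

    # Count the pending UploadPackageProfile actions competing for workers:
    ForemanTasks::Task.where(
      state: 'planned',
      label: 'Actions::Katello::Host::UploadPackageProfile'
    ).count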

I recommend looking into increasing the foreman-tasks workers pool_size and testing the behaviour: if we don't hit limits on db connections, I would keep the increased number.

We might also look into actually running multiple dynflow workers to help with the load; the final piece (except the installer) was merged here: https://github.com/theforeman/foreman-tasks/pull/213

Comment 45 Ivan Necas 2017-06-21 07:18:42 UTC
Here we go: https://bugzilla.redhat.com/show_bug.cgi?id=1463506

Comment 48 Ivan Necas 2017-08-17 14:20:13 UTC
We can try; there is an additional prerequisite for this, which is the cleanup of the in-memory tasks. I'm adding http://projects.theforeman.org/issues/20626/ as a dependency for this BZ, to avoid never-ending growth of the in-memory database.
We are fixing it now; ETA a few days.
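
The cleanup idea, roughly (a sketch against dynflow's persistence API; the filter keys, max_age, and scheduling are assumptions, not the final implementation):

    # Periodically delete old stopped execution plans so the in-memory
    # database cannot grow without bound:
    loop do
      cutoff = Time.now - max_age
      old = world.persistence.find_execution_plans(filters: { state: ['stopped'] })
                 .select { |plan| plan.ended_at && plan.ended_at < cutoff }
      world.persistence.delete_execution_plans('uuid' => old.map(&:id)) unless old.empty?
      sleep cleanup_interval # e.g. run once per hour (assumed)
    end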

Comment 49 Satellite Program 2017-08-17 16:10:51 UTC
Upstream bug assigned to aruzicka

Comment 51 Satellite Program 2017-08-23 22:10:57 UTC
Upstream bug assigned to inecas

Comment 53 Satellite Program 2017-08-24 14:10:57 UTC
Upstream bug assigned to aruzicka

Comment 55 Satellite Program 2017-08-24 16:10:55 UTC
Upstream bug assigned to inecas

Comment 56 Satellite Program 2017-08-24 16:11:00 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/20721 has been resolved.

