Bug 1517048

Summary: Remote Execution is slow when running at scale
Product: Red Hat Satellite Reporter: sbadhwar
Component: Remote Execution    Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED ERRATA QA Contact:
Severity: medium Docs Contact:
Priority: high    
Version: 6.3.0    CC: aruzicka, bbuckingham, bkearney, cduryee, inecas, jhutar, lzap, mmccune, psuriset, sbadhwar, zhunting
Target Milestone: Unspecified    Keywords: Performance, PrioBumpQA, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: scale_lab
Fixed In Version: tfm-rubygems-dynflow-0.8.34 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-21 16:54:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description sbadhwar 2017-11-24 04:41:08 UTC
Description of problem:
Remote execution seems to be too slow when running at scale. This happens even for the simplest of commands, such as running the date command via ReX.

Here are some data:
ReX running the date command with sqlite as the dynflow database:
Total number of hosts: 29902
Total time taken: 22hrs+

ReX running the date command with an in-memory database for dynflow:
Total number of hosts: 29902
Total time taken: 15hrs
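
As a quick back-of-the-envelope check of the numbers above (computed with bc):

# echo "scale=2; 22 * 3600 / 29902" | bc    # sqlite: ~2.64 s per host (and 22 hrs is a lower bound)
2.64
# echo "scale=2; 15 * 3600 / 29902" | bc    # in-memory: ~1.80 s per host
1.80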

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Schedule a remote execution for a large number of hosts (e.g. 30k)
2. Run date command under remote execution
3.
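
For step 1, a minimal sketch using hammer (the "Run Command - SSH Default" template name and the search query here are only examples; adjust both to the environment):

# hammer job-invocation create \
    --job-template "Run Command - SSH Default" \
    --inputs command="date" \
    --search-query "name ~ example"

Progress can then be watched with hammer job-invocation info --id <ID>.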

Actual results:


Expected results:
The ReX run time should be lower, at least for simple commands such as date.

Additional info:

Comment 1 Ivan Necas 2017-11-24 06:51:34 UTC
Please provide a task export containing the tasks from the invocation for further analysis and investigation.

Comment 2 sbadhwar 2017-11-24 06:59:08 UTC
(In reply to Ivan Necas from comment #1)
> Please provide a task export containing the tasks from the invocation for
> further analysis and investigation

Hello Ivan,

I tried to export the tasks using the following command:
foreman-rake foreman_tasks:export_tasks

But it seems the command is aborting with the following error message:
[root@c10-h17-r730xd-vm1 ~]# foreman-rake foreman_tasks:export_tasks
Gathering 172056 tasks.
rake aborted!
Errno::ENOENT: No such file or directory @ rb_sysopen - /opt/theforeman/tfm/root/usr/share/gems/gems/dynflow-0.8.30/web/assets/vendor/google-code-prettify/run_prettify.js
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake:217:in `block in copy_assets'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake:214:in `each'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake:214:in `copy_assets'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake:251:in `block (3 levels) in <top (required)>'
/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake:250:in `block (2 levels) in <top (required)>'
Tasks: TOP => foreman_tasks:export_tasks
(See full trace by running task with --trace)

Comment 3 Adam Ruzicka 2017-11-24 07:45:44 UTC
(In reply to sbadhwar from comment #2)
Hello, you're most likely hitting https://bugzilla.redhat.com/show_bug.cgi?id=1512562. Until that is resolved, could you please get us a foreman-debug? It should contain a raw dump of dynflow's db.
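
For example (just a sketch; -d points foreman-debug at a specific output directory, and the path here is only illustrative):

# foreman-debug -d /tmp/rex-scale-debug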

Comment 4 Ivan Necas 2017-11-24 08:12:21 UTC
Also note that the fix for #1512562 is actually a one-liner: you could just remove the specific line (see https://github.com/theforeman/foreman-tasks/pull/296) and the export should work.
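
If you want to apply that locally, a sketch (this assumes the line to remove is the one mentioning run_prettify.js, and reuses the gem path from the traceback in comment #2; keep a backup of the file):

# RAKEFILE=/opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.9.6/lib/foreman_tasks/tasks/export_tasks.rake
# cp $RAKEFILE $RAKEFILE.bak
# sed -i '/run_prettify\.js/d' $RAKEFILE
# foreman-rake foreman_tasks:export_tasks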

Comment 5 Ivan Necas 2017-11-24 08:12:34 UTC
But +1 for foreman-debug

Comment 7 Adam Ruzicka 2017-11-29 09:12:33 UTC
*** Bug 1517559 has been marked as a duplicate of this bug. ***

Comment 8 Adam Ruzicka 2017-11-29 09:15:47 UTC
Reposting data from the duplicate BZ here for completeness.

Description of problem:
Using ssh in a loop (serialized) is 3 times faster than the same action with Remote Execution on clients equally distributed over 10 capsules (with the dynflow database set to "in memory" on the Satellite and the capsules).

The Satellite is a VM with 20 cores and 47 GB of RAM. The 10 capsules (again VMs) have 8 CPUs and 16 GB of RAM each. All of them (including the hosts) are on a 10G network.


Version-Release number of selected component (if applicable):
satellite-6.3.0-21.0.beta.el7sat.noarch


How reproducible:
always


Steps to Reproduce:
1. Run ReX job on 30k hosts with command
   `systemctl stop rhsmcertd; systemctl disable rhsmcertd`
2. Try with a subset with simple loop and ssh


Actual results:
The job has now been running for 23 hours, 24 minutes and reports 24800 hosts as done so far, i.e. more than 3 seconds per host.

I have tried to run a simple loop with ssh (it is this complicated only because of the IP ranges we are using):

# time \
    for ip1 in $( seq 0 30 ); do
        for ip2 in 0 1 2 3; do
            ip=$( expr $ip1 \* 8 + $ip2 )
            ssh -o "StrictHostKeyChecking no" \
                -i /root/id_rsa_perf \
                root.$ip.100 \
                "systemctl stop rhsmcertd; systemctl disable rhsmcertd"
        done
    done

This ran the command on 124 hosts and finished in 2m4.953s, i.e. slightly above 1 second per host.


Expected results:
I know the Satellite and capsules are doing much more than just sshing to the clients (e.g. storing results for later auditing), but since the load should be distributed among 10 capsules, and since we have already tuned the database on the Satellite and capsules to be in-memory only, I would expect the speed of this action to be close to what I can achieve with plain ssh, or faster.

Comment 13 Adam Ruzicka 2017-12-04 15:12:13 UTC
(In reply to sbadhwar from comment #10)
This was a new bug, so I created a BZ [1] for it.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1520487

Comment 14 Ivan Necas 2017-12-14 17:21:36 UTC
Created redmine issue http://projects.theforeman.org/issues/21980 from this bug

Comment 15 Ivan Necas 2017-12-14 17:29:28 UTC
We've found a possible regression that could cause the performance degradation in 6.3: see the attached issue. With this improvement, we should get somewhat better numbers. There are certainly more things we could do to improve performance, but I would leave those for after the advanced phase of the 6.3 release. The main goal is to make sure the performance of 6.3 is better than (or at least the same as) 6.2.

Will this change make ReX ultimately fast? Most probably not.
Will it make it faster than it is now? Most probably yes.

Comment 16 Ivan Necas 2017-12-14 17:36:42 UTC
Upstream release here https://github.com/theforeman/foreman-packaging/pull/1983

Comment 17 Satellite Program 2018-02-21 16:54:37 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336