Bug 1827590

| Summary: | running ReX via receptor is very inefficient; Satellite load is at least 90% higher and a ReX job on 500 hosts gets cancelled | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Jan Hutař <jhutar> |
| Component: | Remote Execution | Assignee: | Adam Ruzicka <aruzicka> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Hutař <jhutar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.7.0 | CC: | ahumbe, aruzicka, egolov, inecas, ktordeur, pcreech, smallamp |
| Target Milestone: | 6.9.0 | Keywords: | Performance, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tfm-rubygem-foreman_remote_execution-4.2.0, python-receptor-satellite-1.3.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1965323 (view as bug list) | Environment: | |
| Last Closed: | 2021-04-21 13:14:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1930577 | | |
| Bug Blocks: | 1965323 | | |
Attachments:
Created attachment 1681411 [details]
/var/log/foreman-proxy/proxy.log
Created attachment 1681412 [details]
/var/log/foreman-proxy/smart_proxy_dynflow_core.log
Background on this issue: Satellite historically had an API to get results of the entire job, plus an endpoint that could be used to get the job output for a given host. The initial requirement for the Satellite-receptor integration was to be able to get per-host outputs and status updates. The high-level overview for a single host is:

1) Receptor receives a request from the cloud to execute something on Satellite.
2) It triggers the job through Satellite's API.
3) It sleeps for a preconfigured interval (currently 5 seconds; this value comes from the cloud).
4) It asks for the status of the host.
5) If the host is done, it sends a report back to the cloud.
6) If it is not done, it goes back to 4.
7) Eventually, the job ends.

But that also means that if there isn't a single host but, say, 1000 of them, then step 4 isn't a single request against Satellite but 1000 of them. To make things worse, when Satellite receives the request from step 4 and the job is still running on that host, Satellite doesn't have the latest output from the host and has to ask the capsule. But the capsule itself doesn't know anything either and just "proxies" the request to smart_proxy_dynflow_core. So it isn't actually a single request * 3, but N * 3 requests for N hosts every $polling_interval. (A minimal sketch of this per-host polling loop is included after the comments below.)

Now to the tricky part. Satellite triggers the job on hosts in batches of 100. When Satellite receives the request in step 2, it takes the first 100 hosts and sends a request to the capsule to execute the job on those hosts, then does the same for the next 100, and so on until the job is triggered for all the hosts on that capsule. However, with each batch of 100 being triggered, the Satellite and the capsule start getting hit by $current_batch * 100 requests asking for status. Once enough hosts are already running, the capsule is too busy answering all the status requests, and the request to trigger the next batch may time out because the capsule is effectively being DoSed. That's what jhutar's production.log shows.

tl;dr: The more hosts, the higher the load on Satellite and the higher the chance the job will fail halfway through.

Possible workarounds without having to patch Satellite and friends:

1) Have more capsules. Spreading the load across capsules could help with the job failing; of course this won't help with the high load on Satellite itself.
2) Ask the cloud folks to be more conservative with updates; currently the polling interval is 5 seconds. I would say no one expects immediate responses and low latency in this scenario, so polling once per minute or so may be enough and would help lower the load on Satellite.

Fix proposals:

1) Provide an API both in Satellite and on the capsule to get outputs and status for many hosts at once, and adjust the receptor-satellite plugin to use this new API (a rough sketch of this approach follows the errata note below).
2) Introduce some kind of throttling in the receptor-satellite plugin so it doesn't hit Satellite so hard.
3) Introduce some kind of pub-sub/websocket mechanism where Satellite would push updates and receptor-satellite would just consume them.

@jhutar: When you were running this, did you watch the resource usage of receptor on Satellite? I'd be curious how it handles large volumes of messages getting sent back to the cloud.

Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/31012 has been resolved.

Jan - any update on this?

Hello @smallamp. I'm blocked here by bug 1930577.
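To make the polling pattern from the background comment above concrete, here is a minimal sketch, assuming a hypothetical endpoint path and response fields; this is not the actual receptor-satellite plugin code.

```python
# Illustrative sketch only -- not the actual receptor-satellite plugin code.
# The endpoint path and the response fields used here are assumptions.
import time

import requests

SATELLITE = "https://satellite.example.com"
POLL_INTERVAL = 5  # seconds; this value currently comes from the cloud side


def poll_job(job_id, hosts):
    """Poll Satellite for every host individually until the whole job is done."""
    pending = set(hosts)
    while pending:                          # step 7: eventually the job ends
        time.sleep(POLL_INTERVAL)           # step 3: sleep for the preconfigured interval
        for host in list(pending):          # step 4: one request PER HOST, every interval
            r = requests.get(
                f"{SATELLITE}/api/job_invocations/{job_id}/hosts/{host}",
                verify=False,
            )
            if r.json().get("complete"):    # step 5: done -> report back to the cloud
                pending.discard(host)
                # the per-host report would be sent back to the cloud here
            # step 6: not done -> the host stays in `pending` and is polled again
```

Every status request for a host whose job is still running is additionally forwarded from Satellite to the capsule and from there to smart_proxy_dynflow_core, which is where the N * 3 figure above comes from.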
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313
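Purely as an illustration of fix proposal 1 from the comments above, here is a sketch of batched polling on the plugin side; the bulk endpoint path, request body, and response fields are assumptions made for this example, not necessarily the API that eventually shipped.

```python
# Illustrative sketch of fix proposal 1 only; the bulk endpoint and its
# request/response shapes are assumptions, not a documented Satellite API.
import time

import requests

SATELLITE = "https://satellite.example.com"
POLL_INTERVAL = 5  # could also be raised, as suggested in workaround 2


def poll_job_bulk(job_id, hosts):
    """One request per interval for the whole job instead of one per host."""
    pending = set(hosts)
    while pending:
        time.sleep(POLL_INTERVAL)
        r = requests.post(                  # a single round trip covers all pending hosts
            f"{SATELLITE}/api/job_invocations/{job_id}/outputs",
            json={"hosts": sorted(pending)},
            verify=False,
        )
        for entry in r.json().get("outputs", []):
            if entry.get("complete"):
                pending.discard(entry["host"])
                # the per-host report would be sent back to the cloud here
```

Compared with the per-host loop sketched earlier, the number of status requests per interval no longer grows with the number of hosts, so status polling stops crowding out the requests that trigger the next batch of 100.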
Created attachment 1681410 [details]
hopefully this is the relevant part of /var/log/foreman/production.log

Description of problem:
I have compared a ReX job running on 500 hosts - once run via Receptor and then rerun so that Receptor/FiFi does not know about it. I tracked various system metrics during the job run and then compared them. Plain ReX finished on all 500 hosts; Receptor ReX failed after 200 hosts, so this performance comparison is not perfectly fair, but I think it is good enough.

| | plain ReX | Receptor ReX | percentage increase |
|---|---|---|---|
| duration | 0:39:11.390000 | 0:28:55.379000 | N/A |
| mean load | 7.2 | 13.8 | +91% |
| disk read | 1.9 MB | 3.5 MB | +80% |
| Tomcat RSS mem | 1.3 GB | 1.7 GB | +27% |

There are also some positive changes (with a negative percentage), but I think these are either caused by the large number of failed systems (e.g. PostgreSQL tuple operations) or affect variables that are too jumpy or too small - I do not think these benefits outweigh the issues mentioned above.

Version-Release number of selected component (if applicable):
satellite-6.7.0-7.el7sat.noarch
receptor-0.6.1-1.el7ar.noarch

How reproducible:
1 of 1

Steps to Reproduce:
1. Set up Satellite server measurement
2. Via FiFi, run a playbook that sleeps 1 minute 5 times on 500 hosts
3. When it fails/finishes and the Satellite settles down, rerun the ReX job on the Satellite (so Receptor is not tracking it now)

Actual results:
Load on the Satellite is 90% higher when Receptor is tracking the job execution, and the receptor-tracked job fails.

Expected results:
Both the receptor-tracked and the normal job pass (are not cancelled) and the load is not this much higher on the Satellite.

Additional info:
I think the traceback that caused the Receptor ReX job to be cancelled is in the attachment. I am adding the "Regression" keyword, as the same job runs nicely without Receptor, which is a new feature, but I agree there are other points of view, so feel free to remove it.
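As a rough back-of-the-envelope check (an estimate for illustration, not a measured value), the figures quoted in this report and in the background comment (500 hosts, a 5-second polling interval, and the Satellite, capsule, smart_proxy_dynflow_core fan-out) translate into the following sustained request volume while all hosts are still running:

```python
# Rough estimate derived from the figures quoted in this report; not measured data.
hosts = 500          # hosts targeted by the ReX job
poll_interval = 5    # seconds between per-host status checks (cloud-side default)
hops = 3             # Satellite -> capsule -> smart_proxy_dynflow_core per running host

satellite_rps = hosts / poll_interval
stack_rps = satellite_rps * hops
print(f"~{satellite_rps:.0f} status requests/s hitting Satellite")
print(f"~{stack_rps:.0f} requests/s across the whole stack")
# ~100 status requests/s hitting Satellite
# ~300 requests/s across the whole stack
```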