Bug 1827590 - running ReX via receptor is very inefficient, satellite load is at least 90% higher, ReX job on 500 hosts gets cancelled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Remote Execution
Version: 6.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 6.9.0
Assignee: Adam Ruzicka
QA Contact: Jan Hutař
URL:
Whiteboard:
Depends On: 1930577
Blocks: 1965323
 
Reported: 2020-04-24 09:35 UTC by Jan Hutař
Modified: 2021-05-27 13:00 UTC
CC List: 7 users

Fixed In Version: tfm-rubygem-foreman_remote_execution-4.2.0,python-receptor-satellite-1.3.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1965323 (view as bug list)
Environment:
Last Closed: 2021-04-21 13:14:53 UTC
Target Upstream Version:
Embargoed:


Attachments
hopefully this is the relevant part of /var/log/foreman/production.log (28.19 KB, text/plain) - 2020-04-24 09:35 UTC, Jan Hutař
/var/log/foreman-proxy/proxy.log (1.81 MB, application/gzip) - 2020-04-24 09:49 UTC, Jan Hutař
/var/log/foreman-proxy/smart_proxy_dynflow_core.log (664.75 KB, application/gzip) - 2020-04-24 09:50 UTC, Jan Hutař


Links
Foreman Issue Tracker 31012 (Closed): Provide an API for retrieving outputs for multiple hosts from a job at once - last updated 2021-02-16 14:59:53 UTC
Github project-receptor receptor-satellite pull 6 (closed): Use new bulk output API - last updated 2021-02-16 14:59:53 UTC
Red Hat Product Errata RHSA-2021:1313 - last updated 2021-04-21 13:17:18 UTC

Description Jan Hutař 2020-04-24 09:35:38 UTC
Created attachment 1681410 [details]
hopefully this is the relevant part of /var/log/foreman/production.log

Description of problem:
I have compared a ReX job running on 500 hosts: once run via Receptor and then rerun so that Receptor/FiFi does not know about it. I tracked various system metrics during each run and then compared them:

Plain ReX finished on all 500 hosts, while Receptor ReX failed after 200 hosts, so this performance metrics comparison is not perfectly fair, but I think it is good enough.

                 plain ReX        Receptor ReX     percentage increase
duration         0:39:11.390000   0:28:55.379000   N/A
mean load        7.2              13.8             +91%
disk read        1.9 MB           3.5 MB           +80%
Tomcat RSS mem   1.3 GB           1.7 GB           +27%

There are also some positive changes (with a negative percentage), but I think these are caused by the large number of failed systems (e.g. PostgreSQL tuple operations) or affect metrics that are too noisy or too small - I do not think these benefits outweigh the issues mentioned above.


Version-Release number of selected component (if applicable):
satellite-6.7.0-7.el7sat.noarch
receptor-0.6.1-1.el7ar.noarch


How reproducible:
1 of 1


Steps to Reproduce:
1. Setup Satellite server measurement
2. Via FiFi run a playbook that sleeps 1 minute 5 times on 500 hosts
3. When it fails/finishes and Satellite settles down, rerun the ReX job on Satellite (so Receptor is not tracking it now)


Actual results:
Load on the Satellite is 90% higher when Receptor is tracking the job execution and receptor-tracked job fails


Expected results:
Both receptor-tracked and normal job passes (are not cancelled) and load is not this much higher on the Satellite


Additional info:
I think the actual traceback that caused the Receptor ReX job to be cancelled is in the attachment.

Adding a "Regression" keyword, as same job runs nicely without Receptor which is a new feature, but I agree there are other points of view, so feel free to remove it.

Comment 3 Jan Hutař 2020-04-24 09:49:00 UTC
Created attachment 1681411 [details]
/var/log/foreman-proxy/proxy.log

Comment 4 Jan Hutař 2020-04-24 09:50:04 UTC
Created attachment 1681412 [details]
/var/log/foreman-proxy/smart_proxy_dynflow_core.log

Comment 6 Adam Ruzicka 2020-04-24 10:57:44 UTC
Background on this issue:
Satellite historically had an API to get results of the entire job and then an endpoint which could be used to get job output for a given host.

The initial requirement for Satellite-receptor integration was to be able to get per-host outputs and status updates.

The high-level overview for a single host is:
1) Receptor receives a request from the cloud to execute something on Satellite
2) It triggers the job through Satellite's API
3) It sleeps for a preconfigured interval (currently 5 seconds; this value comes from the cloud)
4) It asks for the status of the host
5) If the host is done, it sends a report back to the cloud
6) If it is not done, it goes back to step 4
7) Eventually, the job ends

But that also means that if there is not a single host but, say, 1000 of them, then in step 4 it does not make a single request against Satellite, but 1000.

To make things worse, when Satellite receives the request from step 4 and the job is still running on that host, Satellite does not have the latest output from the host and has to ask the capsule. But the capsule itself does not know anything either and just "proxies" the request to smart_proxy_dynflow_core. So again, it is not actually a single request * 3, but N * 3 requests for N hosts every $polling_interval.
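
To make the polling pattern above concrete, here is a minimal sketch of what the per-host loop amounts to (illustrative only; names such as get_host_status and send_report are placeholders, not the actual receptor-satellite plugin code):

    import time

    POLL_INTERVAL = 5  # seconds; this value comes from the cloud side

    def track_job(satellite_api, cloud, job_id, hosts):
        """Illustrative per-host polling loop. For N hosts still running, this
        issues N status requests per interval against Satellite, and each of
        those fans out to the capsule and to smart_proxy_dynflow_core, i.e.
        roughly N * 3 requests every POLL_INTERVAL."""
        pending = set(hosts)
        while pending:
            time.sleep(POLL_INTERVAL)
            for host in list(pending):
                status = satellite_api.get_host_status(job_id, host)  # placeholder call
                if status.finished:
                    cloud.send_report(job_id, host, status)  # placeholder call
                    pending.discard(host)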

Now to the tricky part. Satellite triggers the jobs on hosts in batches of 100. When Satellite receives the request in step 2, it takes the first 100 hosts and sends a request to the capsule to execute the job on those hosts, then does the same for the next 100, and so on until the job is triggered for all the hosts on the capsule. However, with each batch of 100 being triggered, Satellite and the capsule start getting hit by $current_batch * 100 requests asking for the status. When a sufficient number of hosts is already running, the capsule is too busy answering all the status requests, and the request to trigger the next batch may time out because the capsule is effectively being DoSed. That is what jhutar's production.log shows.

tl;dr: The more hosts, the higher the load on Satellite and the higher the chance the job will fail halfway through.
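
As a rough back-of-the-envelope illustration of the batching effect (my own arithmetic, assuming the 5-second interval, the 3x fan-out described above, and 500 hosts in 5 batches of 100):

    BATCH_SIZE = 100
    POLL_INTERVAL = 5   # seconds
    FAN_OUT = 3         # Satellite -> capsule -> smart_proxy_dynflow_core

    # Status traffic already hitting the stack while Satellite is still trying
    # to trigger the next batch:
    for already_triggered in range(1, 5):
        hosts_running = already_triggered * BATCH_SIZE
        req_per_s = hosts_running * FAN_OUT / POLL_INTERVAL
        print(f"while triggering batch {already_triggered + 1}: ~{req_per_s:.0f} status requests/s")

    # By the time batch 5 is being triggered, roughly 240 status requests/s are
    # already competing with the trigger request on the capsule.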

Possible workarounds without having to patch Satellite and friends:
1) Have more capsules; spreading the load across capsules could help with the job failing. Of course, this will not help with the high load on Satellite.
2) Ask the cloud folks to be more conservative with updates; currently the polling interval is 5 seconds. I would say no one expects immediate responses and low latency in this scenario, so polling once per minute or so may be enough and would help lower the load on Satellite.
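
To put a rough number on workaround 2 (my own arithmetic, using the 500 hosts and 3x fan-out from this report):

    HOSTS = 500
    FAN_OUT = 3  # Satellite -> capsule -> smart_proxy_dynflow_core

    for interval_s in (5, 60):
        print(f"polling every {interval_s}s: ~{HOSTS * FAN_OUT / interval_s:.0f} status requests/s")
    # polling every 5s:  ~300 status requests/s
    # polling every 60s: ~25 status requests/s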

Fix proposals:
1) Provide an API both in Satellite and on the capsule to get outputs and status for many hosts at once, and adjust the receptor-satellite plugin to use this new API (see the sketch after this list).
2) Introduce some kind of throttling in the receptor-satellite plugin so it does not hit Satellite so hard.
3) Introduce some kind of pub-sub/websocket mechanism where Satellite would push updates and receptor-satellite would just consume them.
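
For fix proposal 1, the change on the plugin side would conceptually look something like the following (a sketch only; get_bulk_outputs and send_report are placeholders, not the actual API that was eventually added upstream):

    import time

    POLL_INTERVAL = 5  # seconds

    def track_job_bulk(satellite_api, cloud, job_id, hosts):
        """With a bulk endpoint, each polling round is a single request covering
        all pending hosts (plus the fan-out behind it), instead of one request
        per host."""
        pending = set(hosts)
        while pending:
            time.sleep(POLL_INTERVAL)
            # One request returns status/output for every pending host at once.
            statuses = satellite_api.get_bulk_outputs(job_id, sorted(pending))  # placeholder call
            for host, status in statuses.items():
                if status.finished:
                    cloud.send_report(job_id, host, status)  # placeholder call
                    pending.discard(host)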

@jhutar: When you were running this, did you watch the resource usage of receptor on Satellite? I'd be curious how it handles large volumes of messages getting sent back to the cloud.

Comment 10 Bryan Kearney 2020-10-26 16:57:20 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/31012 has been resolved.

Comment 14 Sudhir Mallamprabhakara 2021-04-05 02:20:08 UTC
Jan - any update on this?

Comment 15 Jan Hutař 2021-04-06 07:44:19 UTC
Hello @smallamp. I'm blocked here by bug 1930577

Comment 18 errata-xmlrpc 2021-04-21 13:14:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313

