Bug 1827590

| Summary: | running ReX via receptor is very inefficient; Satellite load is at least 90% higher and a ReX job on 500 hosts gets cancelled | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Jan Hutař <jhutar> |
| Component: | Remote Execution | Assignee: | Adam Ruzicka <aruzicka> |
| Status: | CLOSED ERRATA | QA Contact: | Jan Hutař <jhutar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.7.0 | CC: | ahumbe, aruzicka, egolov, inecas, ktordeur, pcreech, smallamp |
| Target Milestone: | 6.9.0 | Keywords: | Performance, Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tfm-rubygem-foreman_remote_execution-4.2.0, python-receptor-satellite-1.3.0 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1965323 (view as bug list) | Environment: | |
| Last Closed: | 2021-04-21 13:14:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1930577 | | |
| Bug Blocks: | 1965323 | | |
Attachments:
Created attachment 1681411 [details]
/var/log/foreman-proxy/proxy.log
Created attachment 1681412 [details]
/var/log/foreman-proxy/smart_proxy_dynflow_core.log
Background on this issue: Satellite historically had an API to get results of the entire job, plus an endpoint that could be used to get the job output for a given host. The initial requirement for the Satellite-receptor integration was to be able to get per-host outputs and status updates. The high-level overview for a single host is:

1) Receptor receives a request from the cloud to execute something on Satellite.
2) It triggers the job through Satellite's API.
3) It sleeps for a preconfigured interval (currently 5 seconds; this value comes from the cloud).
4) It asks for the status of the host.
5) If the host is done, it sends a report back to the cloud.
6) If it is not done, it goes back to 4.
7) Eventually, the job ends.

But that also means that if there isn't a single host but, say, 1000 of them, then step 4 isn't a single request against Satellite but 1000 of them. To make things worse, when Satellite receives the request from step 4 and the job is still running on that host, Satellite doesn't have the latest output from the host and has to ask the capsule. But the capsule itself doesn't know anything either and just "proxies" the request to smart_proxy_dynflow_core. So it isn't actually a single request * 3, but N * 3 requests for N hosts every $polling_interval. (A minimal sketch of this per-host polling loop is included after the comments below.)

Now to the tricky part. Satellite triggers the job on hosts in batches of 100. When Satellite receives the request in step 2, it takes the first 100 hosts and sends a request to the capsule to execute the job on those hosts, then does the same for the next 100, and so on until the job is triggered for all the hosts on that capsule. However, with each batch of 100 being triggered, the Satellite and the capsule start getting hit by $current_batch * 100 requests asking for status. Once enough hosts are already running, the capsule is too busy answering all the status requests, and the request to trigger the next batch may time out because the capsule is effectively being DoSed. That's what jhutar's production.log shows.

tl;dr: The more hosts, the higher the load on Satellite and the higher the chance the job will fail halfway through.

Possible workarounds without having to patch Satellite and friends:

1) Have more capsules. Spreading the load across capsules could help with the job failing; of course this won't help with the high load on Satellite itself.
2) Ask the cloud folks to be more conservative with updates; currently the polling interval is 5 seconds. I would say no one expects immediate responses and low latency in this scenario, so polling once per minute or so may be enough and would help lower the load on Satellite.

Fix proposals:

1) Provide an API both in Satellite and on the capsule to get outputs and status for many hosts at once, and adjust the receptor-satellite plugin to use this new API (a rough sketch of this approach follows the errata note below).
2) Introduce some kind of throttling in the receptor-satellite plugin so it doesn't hit Satellite so hard.
3) Introduce some kind of pub-sub/websocket mechanism where Satellite would push updates and receptor-satellite would just consume them.

@jhutar: When you were running this, did you watch the resource usage of receptor on Satellite? I'd be curious how it handles large volumes of messages getting sent back to the cloud.

Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/31012 has been resolved.

Jan - any update on this?

Hello @smallamp. I'm blocked here by bug 1930577.
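To make the polling pattern from the background comment above concrete, here is a minimal sketch, assuming a hypothetical endpoint path and response fields; this is not the actual receptor-satellite plugin code.

```python
# Illustrative sketch only -- not the actual receptor-satellite plugin code.
# The endpoint path and the response fields used here are assumptions.
import time

import requests

SATELLITE = "https://satellite.example.com"
POLL_INTERVAL = 5  # seconds; this value currently comes from the cloud side


def poll_job(job_id, hosts):
    """Poll Satellite for every host individually until the whole job is done."""
    pending = set(hosts)
    while pending:                          # step 7: eventually the job ends
        time.sleep(POLL_INTERVAL)           # step 3: sleep for the preconfigured interval
        for host in list(pending):          # step 4: one request PER HOST, every interval
            r = requests.get(
                f"{SATELLITE}/api/job_invocations/{job_id}/hosts/{host}",
                verify=False,
            )
            if r.json().get("complete"):    # step 5: done -> report back to the cloud
                pending.discard(host)
                # the per-host report would be sent back to the cloud here
            # step 6: not done -> the host stays in `pending` and is polled again
```

Every status request for a host whose job is still running is additionally forwarded from Satellite to the capsule and from there to smart_proxy_dynflow_core, which is where the N * 3 figure above comes from.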
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Satellite 6.9 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1313
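Purely as an illustration of fix proposal 1 from the comments above, here is a sketch of batched polling on the plugin side; the bulk endpoint path, request body, and response fields are assumptions made for this example, not necessarily the API that eventually shipped.

```python
# Illustrative sketch of fix proposal 1 only; the bulk endpoint and its
# request/response shapes are assumptions, not a documented Satellite API.
import time

import requests

SATELLITE = "https://satellite.example.com"
POLL_INTERVAL = 5  # could also be raised, as suggested in workaround 2


def poll_job_bulk(job_id, hosts):
    """One request per interval for the whole job instead of one per host."""
    pending = set(hosts)
    while pending:
        time.sleep(POLL_INTERVAL)
        r = requests.post(                  # a single round trip covers all pending hosts
            f"{SATELLITE}/api/job_invocations/{job_id}/outputs",
            json={"hosts": sorted(pending)},
            verify=False,
        )
        for entry in r.json().get("outputs", []):
            if entry.get("complete"):
                pending.discard(entry["host"])
                # the per-host report would be sent back to the cloud here
```

Compared with the per-host loop sketched earlier, the number of status requests per interval no longer grows with the number of hosts, so status polling stops crowding out the requests that trigger the next batch of 100.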
Created attachment 1681410 [details]
hopefully this is the relevant part of /var/log/foreman/production.log

Description of problem:
I have compared a ReX job running on 500 hosts - once run via Receptor and then rerun so that Receptor/FiFi does not know about it. I tracked various system metrics during the job run and then compared them. Plain ReX finished on all 500 hosts; Receptor ReX failed after 200 hosts, so this performance comparison is not perfectly fair, but I think it is good enough.

| | plain ReX | Receptor ReX | percentage increase |
|---|---|---|---|
| duration | 0:39:11.390000 | 0:28:55.379000 | N/A |
| mean load | 7.2 | 13.8 | +91% |
| disk read | 1.9 MB | 3.5 MB | +80% |
| Tomcat RSS mem | 1.3 GB | 1.7 GB | +27% |

There are also some positive changes (with a negative percentage), but I think these are either caused by the large number of failed systems (e.g. PostgreSQL tuple operations) or affect variables that are too jumpy or too small - I do not think these benefits outweigh the issues mentioned above.

Version-Release number of selected component (if applicable):
satellite-6.7.0-7.el7sat.noarch
receptor-0.6.1-1.el7ar.noarch

How reproducible:
1 of 1

Steps to Reproduce:
1. Set up Satellite server measurement
2. Via FiFi, run a playbook that sleeps 1 minute 5 times on 500 hosts
3. When it fails/finishes and the Satellite settles down, rerun the ReX job on the Satellite (so Receptor is not tracking it now)

Actual results:
Load on the Satellite is 90% higher when Receptor is tracking the job execution, and the receptor-tracked job fails.

Expected results:
Both the receptor-tracked and the normal job pass (are not cancelled) and the load is not this much higher on the Satellite.

Additional info:
I think the traceback that caused the Receptor ReX job to be cancelled is in the attachment. I am adding the "Regression" keyword, as the same job runs nicely without Receptor, which is a new feature, but I agree there are other points of view, so feel free to remove it.
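As a rough back-of-the-envelope check (an estimate for illustration, not a measured value), the figures quoted in this report and in the background comment (500 hosts, a 5-second polling interval, and the Satellite, capsule, smart_proxy_dynflow_core fan-out) translate into the following sustained request volume while all hosts are still running:

```python
# Rough estimate derived from the figures quoted in this report; not measured data.
hosts = 500          # hosts targeted by the ReX job
poll_interval = 5    # seconds between per-host status checks (cloud-side default)
hops = 3             # Satellite -> capsule -> smart_proxy_dynflow_core per running host

satellite_rps = hosts / poll_interval
stack_rps = satellite_rps * hops
print(f"~{satellite_rps:.0f} status requests/s hitting Satellite")
print(f"~{stack_rps:.0f} requests/s across the whole stack")
# ~100 status requests/s hitting Satellite
# ~300 requests/s across the whole stack
```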