Bug 2139418

Summary:	MQTT ReX mode makes it too easy to DDoS Satellite
Product:	Red Hat Satellite	Reporter:	Jan Hutař <jhutar>
Component:	Remote Execution	Assignee:	satellite6-bugs <satellite6-bugs>
Status:	CLOSED ERRATA	QA Contact:	Peter Ondrejka <pondrejk>
Severity:	high	Docs Contact:
Priority:	high
Version:	6.12.0	CC:	ahumbe, aruzicka, pmendezh
Target Milestone:	6.13.0	Keywords:	Performance, Security, Triaged
Target Release:	Unused
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2023-05-03 13:22:26 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jan Hutař 2022-11-02 13:25:02 UTC

Description of problem:
With Satellite in MQTT ReX mode and 3k clients, I cannot run ReX on all of them in one go, as I could before.


Version-Release number of selected component (if applicable):
satellite-6.12.0-4.el8sat.noarch


How reproducible:
always


Steps to Reproduce:
1. Configure Sat with `satellite-installer --foreman-proxy-plugin-remote-execution-script-mode=pull-mqtt` (in my case I also used "--tuning=medium --foreman-foreman-service-puma-threads-min=16 --foreman-foreman-service-puma-threads-max=16 --foreman-foreman-service-puma-workers=8 --foreman-db-pool=128 --foreman-proxy-plugin-remote-execution-script-install-key=true")
2. Run a "Command" ReX job on 3000 hosts with:
       subscription-manager refresh
       yum -y install insights-client
       insights-client --register


Actual results:
proxy.log contains many SSL errors:

2022-11-02T09:04:25  [E] <OpenSSL::SSL::SSLError> SSL_accept SYSCALL returned=5 errno=0 state=TLSv1.3 early data
        /usr/share/ruby/webrick/server.rb:299:in `accept'
        /usr/share/ruby/webrick/server.rb:299:in `block (2 levels) in start_thread'
        /usr/share/ruby/webrick/utils.rb:263:in `timeout'
        /usr/share/ruby/webrick/server.rb:297:in `block in start_thread'

The client wasn't able to retrieve the job:

Nov 02 09:04:25 f09-h26-b02-5039ms-container100.red.ddns.perf.redhat.com yggdrasild[647]: [yggdrasild] 2022/11/02 09:04:25 cannot get detached message content: cannot download from URL: Get "https://f09-h26-b01-5039ms.rdu2.scalelab.redhat.com:9090/ssh/jobs/88e7dae2-3998-4366-b9e5-aa4216e0030f": net/http: TLS handshake timeout

=> ReX job got stuck after 879 succeeded tasks, with 2121 pending.


Expected results:
We need to either document the maximum number of hosts a single Capsule in MQTT mode can handle, or provide a way to limit concurrency when running a job.

FYI, the default tuning is currently documented as able to handle 5k hosts.
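To illustrate the second option, here is a minimal, hypothetical sketch of a concurrency cap: instead of notifying all 3000 hosts over MQTT at once (so that all of them then open TLS connections to the Capsule at the same moment), a dispatcher keeps at most `rate_limit` hosts "in flight" and notifies the next pending host only when one finishes. The `JobDispatcher` class, its methods, and the "in flight" semantics are assumptions made for illustration; this is not the actual smart-proxy code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class JobDispatcher:
    """Hypothetical dispatcher capping how many hosts work on a job at once."""

    rate_limit: int  # assumed meaning: max hosts notified but not yet finished
    pending: deque = field(default_factory=deque)
    in_flight: set = field(default_factory=set)
    peak: int = 0  # highest number of simultaneously in-flight hosts observed

    def submit(self, hosts):
        """Queue hosts for the job; return the ones notified right away."""
        self.pending.extend(hosts)
        return self._fill()

    def finished(self, host):
        """Mark a host as done; return any newly notified hosts."""
        self.in_flight.discard(host)
        return self._fill()

    def _fill(self):
        notified = []
        while self.pending and len(self.in_flight) < self.rate_limit:
            host = self.pending.popleft()
            self.in_flight.add(host)
            notified.append(host)  # stand-in for publishing an MQTT notification
        self.peak = max(self.peak, len(self.in_flight))
        return notified


def simulate(num_hosts, rate_limit):
    """Run every host to completion; return (completed, peak concurrency)."""
    dispatcher = JobDispatcher(rate_limit)
    ready = dispatcher.submit([f"host{i}" for i in range(num_hosts)])
    completed = 0
    while ready:
        next_ready = []
        for host in ready:
            completed += 1
            next_ready.extend(dispatcher.finished(host))
        ready = next_ready
    return completed, dispatcher.peak


if __name__ == "__main__":
    print(simulate(3000, 100))   # capped: at most 100 hosts in flight
    print(simulate(3000, 3000))  # effectively uncapped, as in this report
```

In this model all 3000 hosts still complete, but with `rate_limit=100` the Capsule never has more than 100 hosts fetching job content at once, whereas the uncapped run puts all 3000 in flight together.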


Additional info:
Thanks aruzicka for investigating this!

When I switched to SSH mode with `satellite-installer --foreman-proxy-plugin-remote-execution-script-mode=ssh` and re-ran the job (after cancelling the sub-tasks of the previous one), 2991 tasks succeeded and only 9 failed.

I am not marking this as a regression because, IIRC, MQTT mode is optional.

Comment 5 Peter Ondrejka 2023-03-01 14:17:42 UTC

Verified on Sat 6.13 snap 12: the mqtt_rate_limit parameter has been added and is applied as expected.

Comment 8 errata-xmlrpc 2023-05-03 13:22:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097