Bug 1614087

Summary: Smart Proxy crashes after being overloaded with a huge number of requests
Product: Red Hat Satellite
Reporter: Hao Chang Yu <hyu>
Component: Foreman Proxy
Assignee: Lukas Zapletal <lzap>
Status: CLOSED WONTFIX
QA Contact: Lukas Pramuk <lpramuk>
Severity: high
Priority: unspecified
Version: 6.3.2
CC: aruzicka, hyu, inecas, jalviso, kupadhya, lzap, pdwyer, phess
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-04-02 12:32:55 UTC
Type: Bug

Description Hao Chang Yu 2018-08-09 01:56:36 UTC
Description of problem:
Smart Proxy crashes and stops responding to any request after being overloaded with a huge number of requests.

I can reproduce this issue by sending 1000 requests to the Foreman proxy at the same time. Sending a higher number of requests makes the reproduction reliable.

The issue only reproduces:
- On Satellite 6.3; I can't reproduce it on Satellite 6.2.15.
- On port 9090; port 8000 is fine.

Steps:
1) Run this in a terminal, expecting many request timeout errors:

foreman-rake console
1000.times { Thread.new { begin; RestClient::Resource.new('https://127.0.0.1:9090/features', verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end } }

2) In another terminal, run the command below to check the connections (see the small helper sketch after these steps):

lsof -i :9090 | wc -l

3) The issue is reproduced when you see a stuck connection:

lsof -i :9090
COMMAND  PID          USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
ruby    1728 foreman-proxy    8u  IPv4    68616      0t0  TCP *:websm (LISTEN)
ruby    1728 foreman-proxy   12u  IPv4 27104675      0t0  TCP localhost:websm->localhost:58542 (ESTABLISHED) <====== This

4) Making another request now hangs forever:
# curl -v -k https://127.0.0.1:9090/features
* About to connect() to 127.0.0.1 port 9090 (#0)
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 9090 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
###### STUCK FOREVER ##########

5) Restarting foreman-proxy fixes the issue.
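
For step 2, the connection count can also be polled with a small Ruby loop like this (a minimal sketch; it just repeats the same lsof command every few seconds):

# watch_9090.rb - print the number of lsof lines for port 9090 every 5 seconds,
# so the stuck ESTABLISHED connection is easy to spot
loop do
  count = `lsof -i :9090 | wc -l`.to_i
  puts "#{Time.now}  lsof lines for :9090 = #{count}"
  sleep 5
end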

Comment 5 Adam Ruzicka 2018-08-16 10:56:49 UTC
Created redmine issue http://projects.theforeman.org/issues/24634 from this bug

Comment 6 Satellite Program 2018-08-16 14:07:33 UTC
Upstream bug assigned to ikanner

Comment 7 Lukas Zapletal 2018-08-17 09:35:36 UTC
Hey,

I am able to reproduce with HTTP endpoint:

slowhttptest -u http://127.0.0.1:8000/features -c 1000 -g -o slow-headers-1000

From the generated graph, it can handle about 200 open connections and then starts refusing them.

Key question: Do you see foreman-proxy not being available only DURING the test, or even AFTER all connections are dropped?

If the latter, can you wait a few minutes and try a normal request to see whether foreman-proxy starts serving requests again? I am unable to reproduce this on my system; it just works again after the benchmark is stopped.

This is important: by default the open timeout is 1 minute, so foreman-proxy should recover from this after a few minutes and respond to requests again.

Comment 8 Adam Ruzicka 2018-08-17 10:03:19 UTC
From what I understood, the issue is that the proxy stops responding to requests even after the attack has stopped, and that was also the case the one time I managed to reproduce it. The proxy wouldn't accept any requests for roughly 10 minutes, until I restarted it.

Comment 9 Hao Chang Yu 2018-08-21 07:09:44 UTC
Actually, the open_timeout is hardcoded to 10 seconds in Satellite. This number is probably too low, or it should be possible to amend it.

api = ProxyAPI::Features.new(:url => SmartProxy.last.url)
=> #<ProxyAPI::Features:0x00000007f82d28 @url="https://hao-satellite63.usersys.redhat.com:9090/features", @connect_params={:timeout=>60, :open_timeout=>10, :headers=>{:accept=>:json}, :user=>nil, :password=>nil, :ssl_client_cert=>#<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name:0x00000007f5a760>, issuer=#<OpenSSL::X509::Name:0x00000007f5a788>, serial=#<OpenSSL::BN:0x00000007f5a7d8>, not_before=2018-06-25 03:46:20 UTC, not_after=2038-01-18 03:46:20 UTC>, :ssl_client_key=>#<OpenSSL::PKey::RSA:0x00000007f5b200>, :ssl_ca_file=>"/etc/foreman/proxy_ca.pem", :verify_ssl=>1}>
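
For illustration only (this is not the actual ProxyAPI code, and the PROXY_OPEN_TIMEOUT name is made up), the same connect params could be built with a configurable open timeout roughly like this:

require 'rest-client'
require 'openssl'

# open_timeout is hardcoded to 10 today; a hypothetical override via environment
open_timeout = Integer(ENV.fetch('PROXY_OPEN_TIMEOUT', 10))

resource = RestClient::Resource.new(
  'https://hao-satellite63.usersys.redhat.com:9090/features',
  timeout:      60,            # read timeout (between data reads)
  open_timeout: open_timeout,  # time allowed to establish the connection
  verify_ssl:   OpenSSL::SSL::VERIFY_PEER
)
resource.get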

Comment 10 Hao Chang Yu 2018-08-21 07:55:54 UTC
(In reply to Lukas Zapletal from comment #7)
> Hey,
> 
> I am able to reproduce with HTTP endpoint:

Now, I can reproduce this on http too from my Satellite server.

> 
> slowhttptest -u http://127.0.0.1:8000/features -c 1000 -g -o
> slow-headers-1000
> 

I can 100% reproduce this with 1200 concurrent requests on several machines.

> From the graph generated, it is able to handle about 200 opened connections
> and then it starts refusing them.
> 
> Key question: Do you see foreman-proxy not being available only DURING the
> test, or even AFTER all connections are dropped?

As I describe in comment #0, the last connection will stay stuck in "ESTABLISHED" forever and foreman-proxy will stop accepting any new connections:

# lsof -i :9090
COMMAND  PID          USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
ruby    1426 foreman-proxy   15u  IPv4 7095960      0t0  TCP *:websm (LISTEN)
ruby    1426 foreman-proxy   16u  IPv4 7110968      0t0  TCP localhost:websm->localhost:44108 (ESTABLISHED) <=== stuck forever until restart

> 
> If the latter, can you wait few minutes and try normal request if
> foreman-proxy starts serving requests again? I am unable to reproduce on my
> system, it just works after benchmark is stopped.
> 
> This is important, by default open timeout is 1 minute, so foreman-proxy
> should recover from this after few minutes and should respond to requests
> again.

Comment 11 Lukas Zapletal 2018-08-21 10:28:14 UTC
Well, open_timeout is hardcoded in Satellite CLIENT code, not server code. While I agree it should be configurable, it's not a big deal and it's irrelevant in this case. Also, the read timeout is not a total timeout: a read timeout of 10 can still produce a connection that takes minutes to process, because it is a timeout *between* data reads, not a timeout for the entire request, for the record.
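
To make the distinction concrete, a minimal sketch (not Satellite code; the host and port are just the local proxy used elsewhere in this bug):

require 'net/http'
require 'openssl'

http = Net::HTTP.new('127.0.0.1', 9090)
http.use_ssl      = true
http.verify_mode  = OpenSSL::SSL::VERIFY_NONE
http.open_timeout = 10  # limit on establishing the connection
http.read_timeout = 10  # limit between two reads, NOT on the whole response
# A server that dribbles one byte every few seconds never trips read_timeout,
# so this call can still run for minutes.
http.get('/features')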

Now, to the point. I don't see this misbehavior on HTTP endpoint:

[root@foreman ~]# time telnet localhost 8000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

real    0m30.037s
user    0m0.003s
sys     0m0.000s

However, I do see it on HTTPS endpoint!

[root@foreman ~]# time telnet localhost 9090
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.


It looks like webrick does not time out correctly on the HTTPS endpoint.

Comment 12 Lukas Zapletal 2018-08-21 10:43:51 UTC
So after some more testing, this looks like a misbehavior of webrick 1.3.1. Version 1.4.2 (the current stable) appears to work correctly: the connection is closed after 30 seconds, so no restart would be needed.

Unfortunately, webrick 1.4.x requires Ruby 2.3 and we are still on RHEL Ruby 2.0. There are discussions about SCLing smart-proxy, which would solve this. Until then, there is no easy solution: according to webrick's git log, there has been a *huge* number of patches with regard to timeouts, concurrency, waits and synchronization.
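
Just to check which webrick a given Ruby loads (a quick sketch):

# run under the same Ruby the smart proxy uses
require 'webrick'
puts WEBrick::VERSION   # 1.3.1 on RHEL 7 Ruby 2.0, 1.4.x from Ruby 2.3 onwards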

Comment 13 Lukas Zapletal 2018-08-21 13:05:25 UTC
According to our packaging devs, a webrick upgrade won't happen soon. So the only workaround is to put something in front of the smart proxy (an HTTP proxy) or to use the rack-attack plugin: https://github.com/kickstarter/rack-attack (it needs a Redis cache, though).
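
A rough sketch of what the rack-attack throttle could look like in front of the proxy (the limit/period numbers are illustrative; the in-memory store below is only to keep the sketch self-contained, in practice a Redis-backed store would be used):

require 'rack/attack'
require 'active_support'
require 'active_support/cache'

# throttle: at most 300 requests per 60 seconds per client IP (numbers are illustrative)
Rack::Attack.cache.store = ActiveSupport::Cache::MemoryStore.new
Rack::Attack.throttle('requests per ip', limit: 300, period: 60) do |req|
  req.ip
end

# config.ru, in front of the smart proxy application:
#   use Rack::Attack
#   run <the proxy's rack application>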

Comment 15 Bryan Kearney 2020-03-04 14:08:13 UTC
The Satellite Team is attempting to provide an accurate backlog of bugzilla requests which we feel will be resolved in the next few releases. We do not believe this bugzilla will meet that criteria, and have plans to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product. If you have any concerns about this, feel free to contact Red Hat Technical Support or your account team. If we do not hear from you, we will close this bug out. Thank you.

Comment 16 Bryan Kearney 2020-04-02 12:32:55 UTC
Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect this to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen. Instead, feel free to contact Red Hat Technical Support. Thank you.

Comment 17 Lukas Zapletal 2020-04-03 11:33:18 UTC
For the record, we are almost done migrating Smart Proxy to a new web stack (Puma), which should resolve this bug.

Comment 18 Lukas Zapletal 2020-09-30 13:00:54 UTC
For the record, this should be fixed. It was caused by a bug in webrick 1.3, which we used because we were on Ruby 2.0 from the RHEL 7 base system. It did not properly close slow connections, so things like security scanners easily brought it down. This was fixed after we migrated to SCL Ruby 2.5, which ships webrick 1.4 and fixes this particular problem. So this should not be a problem for Satellite 6.4 or above.

Comment 19 Lukas Zapletal 2020-09-30 13:06:56 UTC
Correction: the fixed webrick is in Satellite 6.5. I haven't checked, but to verify, this file must exist:

/opt/rh/rh-ruby25/root/usr/share/gems/specifications/default/webrick-*.gemspec

Comment 20 Lukas Zapletal 2021-01-05 12:04:49 UTC
Correction to comment 19: Satellite 6.8 is the first release that contains SCL Ruby 2.5+, which ships the later version of the webrick app server that Smart Proxy uses. The only solution to this problem is to upgrade to 6.8; backporting is not possible due to the huge version difference (webrick in Ruby 2.0 vs 2.5 contains too many changes in this regard).