| Summary: | Smart Proxy crashes after being overloaded with a huge number of requests | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Hao Chang Yu <hyu> |
| Component: | Capsule | Assignee: | Lukas Zapletal <lzap> |
| Status: | CLOSED WONTFIX | QA Contact: | Lukas Pramuk <lpramuk> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.3.2 | CC: | aruzicka, hyu, inecas, jalviso, kupadhya, lzap, pdwyer, phess |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-02 12:32:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Hao Chang Yu
2018-08-09 01:56:36 UTC
Created redmine issue http://projects.theforeman.org/issues/24634 from this bug. Upstream bug assigned to ikanner.

Hey, I am able to reproduce with the HTTP endpoint:

    slowhttptest -u http://127.0.0.1:8000/features -c 1000 -g -o slow-headers-1000

From the generated graph, it is able to handle about 200 open connections and then starts refusing them.

Key question: do you see foreman-proxy being unavailable only DURING the test, or even AFTER all connections are dropped? If the latter, can you wait a few minutes and try a normal request to see whether foreman-proxy starts serving requests again? I am unable to reproduce on my system; it just works after the benchmark is stopped. This is important: by default the open timeout is 1 minute, so foreman-proxy should recover after a few minutes and respond to requests again.

From what I understood, the issue is that the proxy stops responding to requests even after the attack has stopped, and that was also the case the one time I managed to reproduce it. The proxy wouldn't accept any requests for roughly 10 minutes until I restarted it.

Actually, the open_timeout is hardcoded to 10 seconds in Satellite. This value is probably too low, or it should be configurable.
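To illustrate the distinction between the two client timeouts being discussed, here is a minimal sketch using Ruby's stdlib Net::HTTP rather than Satellite's RestClient-based ProxyAPI client; the ENV variable name and host are hypothetical, and no connection is actually opened:

```ruby
require 'net/http'

# Hypothetical sketch: making the hardcoded 10-second open_timeout
# configurable via an environment variable (the variable name is an
# assumption, not a real Satellite setting). No request is made here.
open_timeout = Integer(ENV.fetch('PROXY_OPEN_TIMEOUT', '10'))

http = Net::HTTP.new('capsule.example.com', 9090)
http.use_ssl      = true
http.open_timeout = open_timeout # limit on establishing the TCP/TLS connection
http.read_timeout = 60           # limit *between* data reads, not on the whole request

puts http.open_timeout
```

The two values mirror the `:open_timeout => 10, :timeout => 60` pair visible in the `@connect_params` dump in the thread.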
    api = ProxyAPI::Features.new(:url => SmartProxy.last.url)
    => #<ProxyAPI::Features:0x00000007f82d28 @url="https://hao-satellite63.usersys.redhat.com:9090/features", @connect_params={:timeout=>60, :open_timeout=>10, :headers=>{:accept=>:json}, :user=>nil, :password=>nil, :ssl_client_cert=>#<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name:0x00000007f5a760>, issuer=#<OpenSSL::X509::Name:0x00000007f5a788>, serial=#<OpenSSL::BN:0x00000007f5a7d8>, not_before=2018-06-25 03:46:20 UTC, not_after=2038-01-18 03:46:20 UTC>, :ssl_client_key=>#<OpenSSL::PKey::RSA:0x00000007f5b200>, :ssl_ca_file=>"/etc/foreman/proxy_ca.pem", :verify_ssl=>1}>

(In reply to Lukas Zapletal from comment #7)

> Hey,
>
> I am able to reproduce with HTTP endpoint:

Now I can reproduce this over HTTP too from my Satellite server.

> slowhttptest -u http://127.0.0.1:8000/features -c 1000 -g -o
> slow-headers-1000

I can reproduce this 100% of the time with 1200 concurrent requests on several machines.

> From the graph generated, it is able to handle about 200 opened connections
> and then it starts refusing them.
>
> Key question: Do you see foreman-proxy not being available only DURING the
> test, or even AFTER all connections are dropped?

As I described in comment #0, the last connection remains stuck in "established" forever and foreman-proxy stops accepting any new connections:

    # lsof -i :9090
    COMMAND  PID          USER  FD   TYPE  DEVICE  SIZE/OFF NODE NAME
    ruby    1426 foreman-proxy 15u  IPv4 7095960       0t0  TCP *:websm (LISTEN)
    ruby    1426 foreman-proxy 16u  IPv4 7110968       0t0  TCP localhost:websm->localhost:44108 (ESTABLISHED)  <=== stuck forever until restart

> If the latter, can you wait few minutes and try normal request if
> foreman-proxy starts serving requests again? I am unable to reproduce on my
> system, it just works after benchmark is stopped.
>
> This is important, by default open timeout is 1 minute, so foreman-proxy
> should recover from this after few minutes and should respond to requests
> again.
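The stuck-connection symptom can be measured from a small script instead of eyeballing lsof: open a TCP connection to the proxy, send nothing, and time how long the server takes to drop it. This is a hedged sketch (the helper name is invented, not Satellite code); a healthy server closes the idle connection after its request timeout, while the buggy behaviour described here is that the HTTPS listener never does:

```ruby
require 'socket'

# Hypothetical probe (not Satellite code): connect, send no data, and
# measure how long the server keeps the idle connection open.
def idle_connection_lifetime(host, port)
  start = Time.now
  sock  = TCPSocket.new(host, port)
  sock.read(1)            # blocks until the server sends a byte or closes (EOF)
  Time.now - start
rescue Errno::ECONNRESET
  Time.now - start        # an abrupt reset also counts as the server dropping us
ensure
  sock&.close
end
```

Against the broken HTTPS endpoint this call would block indefinitely; against a correctly behaving server it returns after roughly the configured request timeout.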
Well, open_timeout is hardcoded in the Satellite CLIENT code, not the server code. While I agree it should be configurable, it is not a big deal and is irrelevant in this case. Also, the read timeout is not a total timeout: a read timeout of 10 seconds can still admit and process a connection that is minutes slow, because it is a timeout *between* data reads, not a limit on the entire request, for the record.

Now, to the point. I don't see this misbehavior on the HTTP endpoint:

    [root@foreman ~]# time telnet localhost 8000
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Connection closed by foreign host.

    real 0m30.037s
    user 0m0.003s
    sys  0m0.000s

However, I do see it on the HTTPS endpoint!

    [root@foreman ~]# time telnet localhost 9090
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.

It looks like WEBrick does not correctly time out on the HTTPS endpoint.

After some more testing, this looks like a misbehavior of WEBrick 1.3.1. Version 1.4.2 (current stable) does work correctly: the connection is closed after 30 seconds, so no restart would be needed. Unfortunately, WEBrick 1.4.x requires Ruby 2.3 and we are still on RHEL Ruby 2.0. There are discussions about SCLing smart-proxy, which would solve this. Until then, there is no easy solution; according to the git log, WEBrick has received a *huge* number of patches regarding timeouts, concurrency, waits, and synchronization.

According to our packaging devs, a WEBrick upgrade won't happen soon. So the only workaround is to put something (an HTTP proxy) in front of the smart proxy, or to use the rack-attack plugin: https://github.com/kickstarter/rack-attack (it needs a Redis cache, though).

The Satellite Team is attempting to provide an accurate backlog of Bugzilla requests which we feel will be resolved in the next few releases. We do not believe this Bugzilla will meet that criteria, and we plan to close it out in 1 month. This is not a reflection on the validity of the request, but a reflection of the many priorities for the product.
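The rack-attack workaround mentioned above boils down to counting requests per client over a time window and rejecting clients over a budget (rack-attack itself keeps the counters in a shared cache such as Redis). A dependency-free sketch of that idea, purely illustrative and not Satellite or rack-attack code:

```ruby
# Minimal per-IP request throttle in the spirit of rack-attack
# (hypothetical sketch; rack-attack uses a shared cache store instead
# of an in-process Hash, so it works across worker processes).
class SimpleThrottle
  def initialize(limit:, period:)
    @limit  = limit          # max requests per client per window
    @period = period         # window length in seconds
    @counts = Hash.new(0)
    @window_start = Time.now
  end

  # Returns true while the client is under its request budget.
  def allowed?(client_ip)
    now = Time.now
    if now - @window_start >= @period
      @counts.clear          # start a fresh window
      @window_start = now
    end
    (@counts[client_ip] += 1) <= @limit
  end
end

throttle = SimpleThrottle.new(limit: 200, period: 60)
puts throttle.allowed?('192.0.2.1') # first request from this IP is allowed
```

An HTTP proxy in front (e.g. a reverse proxy with connection limits) achieves the same protection one layer lower, without code changes in the smart proxy.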
If you have any concerns about this, feel free to contact Red Hat Technical Support or your account team. If we do not hear from you, we will close this bug out. Thank you.

Thank you for your interest in Satellite 6. We have evaluated this request, and while we recognize that it is a valid request, we do not expect it to be implemented in the product in the foreseeable future. This is due to other priorities for the product, and not a reflection on the request itself. We are therefore closing this out as WONTFIX. If you have any concerns about this, please do not reopen. Instead, feel free to contact Red Hat Technical Support. Thank you.

For the record, we are almost done migrating Smart Proxy to a new web stack (Puma), which should resolve this bug.

For the record, this should be fixed. It was caused by a bug in WEBrick 1.3, which we used because we depended on Ruby 2.0 from the RHEL 7 base system. It did not properly close slow connections, so things like security scanners brought it down easily. This was fixed after we migrated to SCL Ruby 2.5, which ships WEBrick 1.4 and fixes this particular problem. So this should not be a problem for Satellite 6.4 or above.

Correcting: the fixed WEBrick is in Satellite 6.5. I haven't checked, but to verify, this file must exist:

    /opt/rh/rh-ruby25/root/usr/share/gems/specifications/default/webrick-*.gemspec

Correction to comment 19: Satellite 6.8 is the first release that contains SCL Ruby 2.5+, which ships the later version of the WEBrick app server that Smart Proxy uses. The only solution to this problem is upgrading to 6.8; backporting is not possible due to the huge version difference (WEBrick in Ruby 2.0 vs. 2.5 contains too many changes in this regard).