Bug 2022240
| Summary: | Intermittent HTTP 502 Bad Gateway caused by connection reset by peer when the Satellite server is under high HTTP load | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Hao Chang Yu <hyu> |
| Component: | Installer | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | NEW --- | QA Contact: | Gaurav Talreja <gtalreja> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.9.6 | CC: | ahumbe, alex.bron, alsouza, dleroux, egolov, ehelms, ekohlvan, gtalreja, jhutar, jjansky, jpasqual, ktordeur, mharbi, mjia, momran, ngalvin, pdwyer, pmendezh, pmoravec, psatpute, rrajput, sadas, saydas, smeyer, tasander, vchepkov |
| Target Milestone: | Unspecified | Keywords: | Patch, Performance, PrioBumpGSS, Triaged |
| Target Release: | Unused | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2029606 | | |
Description
Hao Chang Yu
2021-11-11 07:30:57 UTC
I think I found the actual root cause for the intermittent HTTP 502 status caused by "connection reset by peer":

------------------------------------------
# /var/log/httpd/foreman-ssl_error_ssl.log
[proxy_http:error] [pid XXXX] (104)Connection reset by peer: [client XX.XX.XX.XX:XXXX] AH01102: error reading status line from remote server httpd-UDS:0
[proxy:error] [pid XXXX] [client XX.XX.XX.XX:XXXX] AH00898: Error reading from remote server returned by /rhsm/consumers/<uuid>
-------------------------------------------

This looks like an Apache bug[1]. See https://serverfault.com/questions/206738/intermittent-error-when-using-mod-proxy-to-do-reverse-proxy-to-soap-service. There is a race condition around connection close and reuse between the proxy (Apache) and the backend server (Puma here). Also see "proxy-initial-not-pooled" in the Apache documentation[2].

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=37770
[2] http://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html

After several tries I was finally able to reproduce it reliably.

Adding either "SetEnv proxy-nokeepalive 1" or "SetEnv proxy-initial-not-pooled 1" to the foreman httpd config solves the issue in my reproducer. In my opinion, "proxy-initial-not-pooled" is better because it does not disable keepalive completely.

According to my tests, I don't see any noticeable performance degradation after applying the above httpd config. However, I will leave it to the performance team to verify further.

It is still good to increase "net.core.somaxconn" as described in comment 0 to match the Puma backlog and avoid listen queue overflow or SYN packet drops.

Script to reproduce the HTTP 502 status:

require 'benchmark'
require 'rest_client'

# 10 rounds of 5 bursts, each burst firing 250 concurrent requests at the
# /rhsm/consumers endpoint, with short pauses between bursts and rounds.
puts Benchmark.measure {
  10.times {
    5.times {
      250.times {
        Thread.new {
          begin
            RestClient::Resource.new("https://example.satellite.com/rhsm/consumers/<consumer uuid>",
                                     user: "your admin username",
                                     password: "your password",
                                     timeout: 3600,
                                     open_timeout: 3600,
                                     verify_ssl: OpenSSL::SSL::VERIFY_NONE).get
          rescue StandardError => e
            p e.message
          end
        }
      }
      sleep 5
    }
    sleep 30
  }
}

Eric Helms (comment #21):

Hao++ for the BZ, the reproducer and the ways to fix it.

While we look at how we present this option to the user based on our performance data (present by default, or configurable to turn on), at least Satellite 6.10+ should be able to configure this persistently for the time being using custom-hiera.
Add the following to /etc/foreman-installer/custom-hiera.yaml:
foreman::config::apache::https_vhost_options: {'setenv': 'proxy-initial-not-pooled 1'}
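
For illustration, with that hiera option in place the installer should render a SetEnv directive inside the installer-managed Foreman SSL vhost, roughly like the excerpt below. This is a sketch only: the file path (typically /etc/httpd/conf.d/05-foreman-ssl.conf) and the surrounding directives depend on the Satellite version, and the ServerName is a placeholder.

# Illustrative excerpt of the installer-generated vhost after the next run
<VirtualHost *:443>
  ServerName satellite.example.com
  # ... rest of the installer-generated SSL and proxy configuration ...
  SetEnv proxy-initial-not-pooled 1
</VirtualHost>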
(In reply to Eric Helms from comment #21)
> While we look at how we present this option to the user based on our
> performance data (present by default, or configurable to turn on), at least
> Satellite 6.10+ should be able to configure this persistently for the time
> being using custom-hiera.
>
> Add the following to /etc/foreman-installer/custom-hiera.yaml:
>
> foreman::config::apache::https_vhost_options: {'setenv': 'proxy-initial-not-pooled 1'}

On the Satellite itself, this configuration should apply to the Pulp endpoints as well, given that Foreman and Pulp share a vhost.

On a Capsule, this option should be applicable via custom hiera:

pulpcore::apache::https_vhost_options: {'setenv': 'proxy-initial-not-pooled 1'}

I'm reading through this thread, so some observations.

(In reply to Hao Chang Yu from comment #2)
> Adding either "SetEnv proxy-nokeepalive 1" or "SetEnv
> proxy-initial-not-pooled 1" to the foreman httpd config solves the issue in
> my reproducer. In my opinion, "proxy-initial-not-pooled" is better because
> it does not disable keepalive completely.

proxy-initial-not-pooled has the downside of downgrading to HTTP/1.0. Note that HTTP/1.1 specifically made changes around Keep-Alive: it changed from off by default to on by default. I wonder if proxy-initial-not-pooled effectively implies proxy-nokeepalive.

https://stackoverflow.com/questions/169453/bad-gateway-502-error-with-apache-mod-proxy-and-tomcat suggests that just setting ProxyTimeout is insufficient. It should be mentioned that they talk about Apache 2.2, but the description of the scenario is very similar.

Then https://iximiuz.com/en/posts/reverse-proxy-http-keep-alive-and-502s/ explains it very clearly, and even though it's a completely different stack, I suspect it's exactly what we're seeing. It lists clear mitigations:

> * Make sure the upstream (application) idle connection timeout is longer than the downstream (proxy) connection timeout

I'm not sure how feasible this is. We currently have KeepAliveTimeout 15 on the Satellite, but set the proxy timeout very high on the Capsule because some requests can take very long. Setting a KeepAliveTimeout as high as we do on the Capsule feels like a disaster waiting to happen.

> * Disable connection reuse between the upstream and the proxy (turn off HTTP Keep-Alive)

I think you already observed that disabling Keep-Alive helped. That also increases my belief that proxy-initial-not-pooled implies disabling Keep-Alive. I'd prefer proxy-nokeepalive over proxy-initial-not-pooled.

> * Employ HTTP 5xx retries on the client-side (well, they are a must-have anyway).

This would mean changing subscription-manager, and I don't think that's feasible in the short term.

I'd like to investigate the HTTP/2 direction though. We recently started to support HTTP/2, so I'd like to investigate whether we can utilize HTTP/2 by using mod_proxy_http2. It is marked as experimental, but should be available. Since HTTP/2 supports multiple streams over the same TCP socket, it would perhaps sidestep the whole problem.

Quoting https://www.rfc-editor.org/rfc/rfc7540.html#section-8.1.2.2:

> This means that an intermediary transforming an HTTP/1.x message to
> HTTP/2 will need to remove any header fields nominated by the
> Connection header field, along with the Connection header field
> itself. Such intermediaries SHOULD also remove other connection-
> specific header fields, such as Keep-Alive, Proxy-Connection,
> Transfer-Encoding, and Upgrade, even if they are not nominated by the
> Connection header field.

Which makes me think Keep-Alive simply doesn't exist at all in HTTP/2 as it did with HTTP/1.1: there are better mechanisms for it.

https://httpd.apache.org/docs/current/mod/mod_proxy_http2.html
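
To make that direction concrete, here is a minimal sketch of what proxying a path over HTTP/2 with mod_proxy_http2 could look like. It is purely illustrative and untested here: the hostname and path are placeholders, it is not the actual Satellite or Capsule configuration, and it assumes mod_proxy, mod_http2 and mod_proxy_http2 are all available in the httpd build.

# Illustrative only -- not the installer-generated configuration.
LoadModule proxy_http2_module modules/mod_proxy_http2.so

<Location "/rhsm">
    # "h2://" proxies to the upstream over HTTP/2 via TLS ("h2c://" would be
    # cleartext). Requests are multiplexed over one connection, so there is
    # no per-request Keep-Alive reuse for the proxy and backend to race on.
    ProxyPass "h2://satellite.example.com/rhsm"
    ProxyPassReverse "https://satellite.example.com/rhsm"
</Location>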
> It is still good to increase "net.core.somaxconn" as described in comment 0 to match
> the Puma backlog and avoid listen queue overflow or SYN packet drops.

We set the Backlog= parameter in foreman.socket. From man 5 systemd.socket:

> Backlog=
>     Takes an unsigned integer argument. Specifies the number of connections to queue that have not been
>     accepted yet. This setting matters only for stream and sequential packet sockets. See listen(2) for
>     details. Defaults to SOMAXCONN (128).

However, I can confirm that on EL8 (systemd 239) I see it has a backlog of 128, while testing the same thing on Fedora 36 (systemd 250) does set the queue to match. Certainly surprising, and it makes me suspect a systemd bug, but I haven't looked into this.

Apologies for the quick follow-up on myself.

(In reply to Ewoud Kohl van Wijngaarden from comment #24)
> (In reply to Hao Chang Yu from comment #2)
> > It is still good to increase "net.core.somaxconn" as described in comment 0 to match
> > the Puma backlog and avoid listen queue overflow or SYN packet drops.
>
> We set the Backlog= parameter in foreman.socket. From man 5 systemd.socket:
>
> > Backlog=
> >     Takes an unsigned integer argument. Specifies the number of connections to queue that have not been
> >     accepted yet. This setting matters only for stream and sequential packet sockets. See listen(2) for
> >     details. Defaults to SOMAXCONN (128).
>
> However, I can confirm that on EL8 (systemd 239) I see it has a backlog of 128,
> while testing the same thing on Fedora 36 (systemd 250) does set the queue to
> match. Certainly surprising, and it makes me suspect a systemd bug, but I
> haven't looked into this.

Disregard that: https://github.com/systemd/systemd/issues/4629#issuecomment-260311719 states:

> It is clamped to that value by the kernel

On my Fedora box net.core.somaxconn is 4096, so that explains why I was able to raise it using Backlog=.

I didn't share this before, but I did start work on installer support so we can try it out. https://github.com/puppetlabs/puppetlabs-apache/pull/2393 aims to add support for mod_proxy_http2 in the Puppet module we rely on. Then https://github.com/theforeman/puppet-foreman_proxy_content/pull/442 builds on that. Right now it's untested.
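
As a footnote to the backlog discussion above, a rough sketch of how the kernel clamp and the socket backlog could be checked and raised on an EL8 Satellite host. The value 4096, the drop-in file name and the restart step are illustrative assumptions; the installer may already manage an equivalent setting, so verify before applying.

# Show the kernel clamp that Backlog= gets limited to
sysctl net.core.somaxconn

# Raise the clamp persistently (illustrative value)
echo 'net.core.somaxconn = 4096' > /etc/sysctl.d/90-somaxconn.conf
sysctl --system

# Give the systemd-managed Puma listener a matching backlog via a drop-in
mkdir -p /etc/systemd/system/foreman.socket.d
cat > /etc/systemd/system/foreman.socket.d/backlog.conf <<'EOF'
[Socket]
Backlog=4096
EOF

# Reload units and restart the socket and service so the new backlog applies
# (your preferred restart tooling may differ)
systemctl daemon-reload
systemctl restart foreman.socket foreman.service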