Bug 2022240 - Intermittent HTTP 502 Bad Gateway caused by connection reset by peer when the Satellite server is under high HTTP load
Summary: Intermittent HTTP 502 Bad Gateway caused by connection reset by peer when the Satellite server is under high HTTP load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Installation
Version: 6.9.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high with 4 votes
Target Milestone: 6.15.0
Assignee: satellite6-bugs
QA Contact: Gaurav Talreja
URL:
Whiteboard:
Depends On:
Blocks: 2029606
 
Reported: 2021-11-11 07:30 UTC by Hao Chang Yu
Modified: 2024-12-20 21:35 UTC
CC List: 28 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-05-29 17:27:44 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker SAT-11132 0 None None None 2023-09-28 05:08:54 UTC
Red Hat Issue Tracker SAT-8907 0 None None None 2023-05-11 12:03:34 UTC
Red Hat Knowledge Base (Solution) 6507661 0 None None None 2021-11-12 11:48:54 UTC

Description Hao Chang Yu 2021-11-11 07:30:57 UTC
Description of problem:

After fixing bug 1976694, I found that clients will still get "connection reset by peer" from the Puma server when we put an even higher HTTP load on it.

Based on my understanding, Puma has a default backlog of 1024, but the default SOMAXCONN kernel parameter is only 128.

According to "man listen"
---------------------------
If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128.  In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
---------------------------

This means the 1024 backlog will be capped to 128. We can confirm this by running the following commands:

# sysctl net.core.somaxconn
net.core.somaxconn = 128  <========

# ss -l | grep foreman
Netid State      Recv-Q Send-Q                                                        Local Address:Port
u_str  LISTEN     0      128    /run/foreman.sock XXXXXXX               * 0   <============ Send-Q (max listen backlog) is 128

Steps to Reproduce:

1. Run the following Ruby snippet (in irb) on any host to generate 600 simultaneous requests to the Satellite server:

irb
require 'rest_client'
600.times { Thread.new { begin; RestClient::Resource.new("https://example.satellite.com/api/v2/hosts?installed_package_name=kernel&page=1&per_page=20", user: "your admin username", password: "your password", timeout: 3600, open_timeout: 3600, verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end } }


2. On the Satellite server, run the following command to monitor the backlog queue. "Recv-Q" (current listen backlog) will increase up to 129.

# watch 'ss -l | grep foreman'
Netid State      Recv-Q Send-Q                                                        Local Address:Port
u_str  LISTEN     129      128   /run/foreman.sock XXXXXXX               * 0 


3. Open another window and run the following command to monitor queue overflows and packet drops. The overflow and dropped-SYN counters will increase once "Recv-Q" reaches 129.

watch 'netstat -s | grep -i LISTEN'
    16712 times the listen queue of a socket overflowed
    17602 SYNs to LISTEN sockets dropped


Actual results:
Clients receive many "Connection reset by peer - SSL_connect" errors

Expected results:
No "Connection reset by peer - SSL_connect" error


Additional info:
I was able to avoid this issue completely after increasing the SOMAXCONN kernel parameter.

Temporarily increase somaxconn
-------------------------------------
sysctl -w net.core.somaxconn=1024
systemctl daemon-reload
systemctl restart foreman.socket foreman
-------------------------------------

# ss -l | grep foreman
u_str  LISTEN     0      1024   /run/foreman.sock XXXXXXX               * 0  <================== Send-Q (max listen backlog) is now 1024
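
To make the change persistent across reboots, a sysctl drop-in can be used. A minimal sketch (the file name is only an example):
-------------------------------------
echo 'net.core.somaxconn = 1024' > /etc/sysctl.d/90-foreman-somaxconn.conf
sysctl --system
systemctl daemon-reload
systemctl restart foreman.socket foreman
-------------------------------------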

Comment 2 Hao Chang Yu 2021-11-17 04:55:12 UTC
I think I found the actual root cause for the intermittent HTTP 502 status caused by "connection reset by peer":
------------------------------------------
# /var/log/httpd/foreman-ssl_error_ssl.log
[proxy_http:error] [pid XXXX] (104)Connection reset by peer: [client XX.XX.XX.XX:XXXX] AH01102: error reading status line from remote server httpd-UDS:0
[proxy:error] [pid XXXX] [client XX.XX.XX.XX:XXXX] AH00898: Error reading from remote server returned by /rhsm/consumers/<uuid>
-------------------------------------------

This seems like an Apache bug[1]. See https://serverfault.com/questions/206738/intermittent-error-when-using-mod-proxy-to-do-reverse-proxy-to-soap-service. There is a race condition around connection close and reuse between the proxy (Apache) and the backend server (Puma in this case). Also see "proxy-initial-not-pooled" in the Apache documentation[2].

[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=37770
[2] http://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html

After several tries I was finally able to reproduce it reliably.

Adding either "SetEnv proxy-nokeepalive 1" or "SetEnv proxy-initial-not-pooled 1" to the foreman httpd config solves the issue in my reproducer. In my opinion, "proxy-initial-not-pooled" is better because it does not disable keep-alive completely.
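
For illustration, the directive would go into the Foreman SSL vhost, something like this (the path is typically /etc/httpd/conf.d/05-foreman-ssl.conf, but that is an assumption here, and on a Satellite the file is installer-managed and may be overwritten, so this is only suitable for testing):
-------------------------------------
# inside the <VirtualHost> block of the Foreman SSL vhost
SetEnv proxy-initial-not-pooled 1
-------------------------------------
followed by a reload of httpd.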

According to my tests, I don't see any noticeable performance degradation after applying the above httpd config. However, I will leave it for the performance team to verify further.

It is still good to increase "net.core.somaxconn" as described in comment 0 to match the Puma backlog, to avoid queue overflows and SYN packet drops.


Script to reproduce HTTP 502 status:

require 'benchmark'
require 'rest_client'
puts Benchmark.measure { 10.times { 5.times { 250.times { Thread.new { begin; RestClient::Resource.new("https://example.satellite.com/rhsm/consumers/<consumer uuid>", user: "your admin username", password: "your password", timeout: 3600, open_timeout: 3600, verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end }}; sleep 5 }; sleep 30} }

Comment 5 Pavel Moravec 2022-01-26 13:31:01 UTC
Hao++ for the BZ, the reproducer, and the ways to fix it.

Comment 21 Eric Helms 2023-03-01 02:57:23 UTC
While we look at how we present this option to the user based on our performance data (enabled by default or configurable to turn on), Satellite 6.10+ can, for the time being, configure this persistently using custom-hiera.

Add the following to /etc/foreman-installer/custom-hiera.yaml:

foreman::config::apache::https_vhost_options: {'setenv': 'proxy-initial-not-pooled 1'}
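
Assuming the usual custom-hiera workflow, the change is applied on the next installer run, e.g.:

# satellite-installer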

Comment 23 Eric Helms 2023-03-01 19:56:15 UTC
(In reply to Eric Helms from comment #21)
> While we look at how we present this option to the user based on our
> performance data (enabled by default or configurable to turn on), Satellite
> 6.10+ can, for the time being, configure this persistently using
> custom-hiera.
> 
> Add the following to /etc/foreman-installer/custom-hiera.yaml:
> 
> foreman::config::apache::https_vhost_options: {'setenv':
> 'proxy-initial-not-pooled 1'}

On the Satellite itself, this configuration should also apply to Pulp endpoints, given that Foreman and Pulp share a vhost.


On a Capsule, this option should be applicable via custom hiera:

pulpcore::apache::https_vhost_options: {'setenv': 'proxy-initial-not-pooled 1'}

Comment 24 Ewoud Kohl van Wijngaarden 2023-03-08 19:30:32 UTC
I'm reading through this thread, so here are some observations.

(In reply to Hao Chang Yu from comment #2)
> Adding either "SetEnv proxy-nokeepalive 1" or "SetEnv
> proxy-initial-not-pooled 1" to the foreman httpd config solves the issue in
> my reproducer. In my opinion, "proxy-initial-not-pooled" is better because
> it is not disabling the keepalive completely.

proxy-initial-not-pooled has the downside of downgrading to HTTP/1.0. Note that HTTP/1.1 specifically made changes around Keep-Alive: it changed from off by default to on by default. I wonder if proxy-initial-not-pooled effectively implies proxy-nokeepalive.

https://stackoverflow.com/questions/169453/bad-gateway-502-error-with-apache-mod-proxy-and-tomcat suggests that just setting ProxyTimeout is insufficient. It should be mentioned that they talk about Apache 2.2, but the description of the scenario is very similar. Then https://iximiuz.com/en/posts/reverse-proxy-http-keep-alive-and-502s/ explains it very clearly, and even though it's a completely different stack, I suspect it's exactly what we're seeing.

It has clear mitigations:

> * Make sure the upstream (application) idle connection timeout is longer than the downstream (proxy) connection timeout

I'm not sure how feasible this is. We currently have KeepAliveTimeout 15 on Satellite, but set the proxy timeout very high on the Capsule because some requests can take very long. Setting KeepAliveTimeout as high as the proxy timeout we use on the Capsule feels like a disaster waiting to happen. (A rough sketch of this timeout relationship, for the Apache-to-Puma hop on the Satellite itself, follows after the last mitigation below.)

> * Disable connection reuse between the upstream and the proxy (turn off HTTP Keep-Alive)

I think you already observed that disabling Keep-Alive helped. And that also increases my belief that proxy-initial-not-pooled implies disabling Keep-Alive. I'd prefer proxy-nokeepalive over proxy-initial-not-pooled.

> * Employ HTTP 5xx retries on the client-side (well, they are a must-have anyway).

This would mean changing subscription-manager and I don't think that's feasible in the short term.
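
Returning to the first mitigation, a rough sketch of what the relationship could look like for the Apache-to-Puma hop on the Satellite itself (the values and the ttl= knob are illustrative assumptions, not what we configure today):

# Apache side: discard pooled backend connections after 10 seconds of inactivity
ProxyPass / unix:///run/foreman.sock|http://foreman/ ttl=10
# Puma side (puma config): keep idle connections open longer than the proxy's ttl
persistent_timeout 20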

I'd like to investigate the HTTP/2 direction though. We recently started to support HTTP/2, so I'd like to investigate whether we can utilize HTTP/2 by using mod_proxy_http2. It is marked as experimental, but should be available. Since HTTP/2 supports multiple streams over the same TCP socket, it could perhaps sidestep the whole problem. Quoting https://www.rfc-editor.org/rfc/rfc7540.html#section-8.1.2.2:

>    This means that an intermediary transforming an HTTP/1.x message to
>   HTTP/2 will need to remove any header fields nominated by the
>   Connection header field, along with the Connection header field
>   itself.  Such intermediaries SHOULD also remove other connection-
>   specific header fields, such as Keep-Alive, Proxy-Connection,
>   Transfer-Encoding, and Upgrade, even if they are not nominated by the
>   Connection header field.

Which makes me think Keep-Alive simply doesn't exist at all in HTTP/2 as it did with HTTP/1.1: there are better mechanisms for it.

https://httpd.apache.org/docs/current/mod/mod_proxy_http2.html
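
For reference, a minimal sketch of what proxying to the backend over cleartext HTTP/2 could look like (purely illustrative: the backend address is made up, and whether Puma can speak h2c at all is exactly what needs investigating):

LoadModule proxy_http2_module modules/mod_proxy_http2.so
# speak HTTP/2 (h2c, cleartext) to the backend instead of HTTP/1.1
ProxyPass / h2c://127.0.0.1:3000/
ProxyPassReverse / http://127.0.0.1:3000/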

> It is still good to increase the "net.core.somaxconn" in comment 0 to match
> the Puma backlog to avoid overflow or syn packet drop.

We set the Backlog= parameter in foreman.socket. From man 5 systemd.socket:

> Backlog=
>    Takes an unsigned integer argument. Specifies the number of connections to queue that have not been accepted yet. This setting matters only for stream and sequential
>    packet sockets. See listen(2) for details. Defaults to SOMAXCONN (128).

However, I can confirm that on EL8 (systemd 239) the socket has a backlog of 128, while testing the same thing on Fedora 36 (systemd 250) does set the queue to match. That is certainly surprising and makes me suspect a systemd bug, but I haven't looked into this.

Comment 25 Ewoud Kohl van Wijngaarden 2023-03-08 19:35:41 UTC
Apologies for the quick follow up on myself.

(In reply to Ewoud Kohl van Wijngaarden from comment #24)
> (In reply to Hao Chang Yu from comment #2)
> > It is still good to increase the "net.core.somaxconn" in comment 0 to match
> > the Puma backlog to avoid overflow or syn packet drop.
> 
> We set Backlog= parameter in foreman.socket. From man 5 systemd.socket:
> 
> > Backlog=
> >    Takes an unsigned integer argument. Specifies the number of connections to queue that have not been accepted yet. This setting matters only for stream and sequential
> >    packet sockets. See listen(2) for details. Defaults to SOMAXCONN (128).
> 
> However, I can confirm that on EL8 (systemd 239) the socket has a backlog of
> 128, while testing the same thing on Fedora 36 (systemd 250) does set the
> queue to match. That is certainly surprising and makes me suspect a systemd
> bug, but I haven't looked into this.

Disregard that: https://github.com/systemd/systemd/issues/4629#issuecomment-260311719 states:

> It is clamped to that value by the kernel

On my Fedora machine net.core.somaxconn is 4096, which explains why I was able to raise it using Backlog=.
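
For illustration, this is the kind of systemd drop-in that raises the socket backlog (the file name is only an example); net.core.somaxconn has to be raised as well, as in comment 0, since the kernel clamps Backlog= to it:

# /etc/systemd/system/foreman.socket.d/backlog.conf
[Socket]
Backlog=1024

followed by systemctl daemon-reload and a restart of foreman.socket.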

Comment 26 Ewoud Kohl van Wijngaarden 2023-04-24 17:41:49 UTC
I didn't share this before, but I did start work on installer support so we can try it out. https://github.com/puppetlabs/puppetlabs-apache/pull/2393 aims to add support for mod_proxy_http2 in the Puppet module we rely on. Then https://github.com/theforeman/puppet-foreman_proxy_content/pull/442 builds on that. Right now it's untested.

Comment 31 Brad Buckingham 2023-10-30 11:29:29 UTC
Bulk setting Target Milestone = 6.15.0 where sat-6.15.0+ is set.

Comment 41 Eric Helms 2024-04-05 23:05:00 UTC
There have been multiple changes through our most recent releases targeted at helping to reduce the occurrence of this issue. The nature of this issue is that at certain scales or under certain loads it may still be possible to trigger this particular error code. We have decided to close this as done in Satellite 6.15, given in part the breadth of the attached issues.

If you encounter this issue on a supported version of Satellite, please open a new bug that is specific to your particular issue and that targets the workload you are performing when it occurs. Please include as much detail as possible about what you were trying to do, how many hosts or actions were involved, CPU count, memory, the version of Satellite, and any details that help indicate how to reproduce it. This will help us to investigate the issue and not just the symptom.

Comment 42 Brad Buckingham 2024-05-29 17:27:44 UTC
Closing based upon comment 41.

