Bug 1976694 - HTTP 502 Bad Gateway and 503 Service Unavailable due to Puma not accepting connections fast enough
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Installation
Version: 6.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: 6.9.6
Assignee: satellite6-bugs
QA Contact: Devendra Singh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-28 04:27 UTC by Hao Chang Yu
Modified: 2024-12-20 20:20 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-21 14:37:26 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHBA-2021:3628 (last updated 2021-09-21 14:37:41 UTC)

Description Hao Chang Yu 2021-06-28 04:27:56 UTC
Description of problem:
First, I couldn't find a component that matches either the web server or Puma. Feel free to move this to a more suitable component if you find one.

Puma does not queue requests, or does not queue them fast enough, even when the backlog is low, if we send enough requests to keep all the workers busy (no more available threads) for a while. Some requests then begin to time out after about 75 seconds (6 TCP SYN retries, I think) and get their connections reset by Puma. After that, the client receives either 502 Bad Gateway or 503 Service Unavailable.

Refer to the links below for more information:
----------------------------------------
https://github.com/puma/puma/issues/1405
https://github.com/puma/puma/pull/1278 <=== Before this PR, each worker would accept and queue connections without limit. If I roll back this PR, all the 502 and 503 errors are gone.
----------------------------------------
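
To illustrate the mechanism (a toy sketch only, not Puma's actual code; the thread count, backlog size and timeouts are arbitrary values chosen to make the effect visible): once every worker thread is busy with a slow request, nothing calls accept() any more, new connections pile up in the kernel listen backlog, and clients whose SYNs can no longer be queued eventually time out or get reset, which Apache then reports as 502/503.
----------------------------------------
# toy_saturation.rb - illustration only
require 'socket'

WORKER_THREADS = 4
server = TCPServer.new('127.0.0.1', 0)
server.listen(1)                 # tiny listen backlog to make the effect visible
port = server.addr[1]

# "Worker" threads: each one accepts a connection only when it is free, then
# holds it for a long time, like a slow /api/v2/hosts request.
WORKER_THREADS.times do
  Thread.new do
    loop do
      sock = server.accept
      sleep 30
      sock.close
    end
  end
end

# Clients: the first few connect, the rest sit in the backlog or are dropped
# and eventually fail with a connect timeout.
20.times.map { |i|
  Thread.new do
    begin
      Socket.tcp('127.0.0.1', port, connect_timeout: 5)
      puts "client #{i}: connected"
    rescue StandardError => e
      puts "client #{i}: #{e.class}"
    end
  end
}.each(&:join)
----------------------------------------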

Another cause of 503 Service Unavailable is insufficient database connections in the Active Record pool when more than 16 query requests are sent at the same time. I will open another Bugzilla for this.
----------------------------------------
ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 6.756 seconds); all pooled connections were in use
----------------------------------------
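
As a side note, the state of the pool that this error refers to can be inspected from the Foreman Rails console (a hedged sketch; it assumes a standard Rails/ActiveRecord setup, reached on Satellite via "foreman-rake console", with the maximum given by the "pool" value in /etc/foreman/database.yml):
----------------------------------------
# Run inside the Foreman Rails console, e.g. `foreman-rake console`.
pool = ActiveRecord::Base.connection_pool
puts "configured pool size: #{pool.size}"           # the pool: value from database.yml
puts "connections in use:   #{pool.stat[:busy]}"    # currently checked-out connections
puts "threads waiting:      #{pool.stat[:waiting]}" # requests blocked waiting for a connection
----------------------------------------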

Steps to Reproduce:
1. On any remote host, or on the Satellite itself, run the following in an irb console to send many concurrent requests to the web server:

irb
require 'rest_client'
200.times { Thread.new { begin; RestClient::Resource.new("https://satellite.example.com/api/v2/hosts?installed_package_name=kernel&page=1&per_page=500", user: "admin", password: "pass", timeout: 3600, open_timeout: 3600, verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end } }


2. In another window, run puma-status (this gem needs to be installed manually) to check the status:

while true; do scl enable tfm 'puma-status /usr/share/foreman/tmp/puma.state'; sleep 3; done

17938 (/usr/share/foreman/tmp/puma.state) Uptime: 17m28s | Phase: 0 | Load: 32[████████████████████████████████]32
 └ 27171 CPU: 100.0% Mem: 1126 MB Uptime:  3m25s | Load: 16[████████████████]16 Queue: 1  <==== Will see either low or no queue
 └ 27177 CPU:  93.3% Mem: 1228 MB Uptime:  3m25s | Load: 16[████████████████]16    <==== Will see either low or no queue


3. In another window, tail the error log; after around 75 seconds you should see the following errors:

tail -f /var/log/httpd/foreman-ssl_error_ssl.log

[proxy_http:error] [pid 3466] (104)Connection reset by peer: [client xx.xx.xx.xx:xxxxx] AH01102: error reading status line from remote server 127.0.0.1:3000
[proxy:error] [pid 3466] [client xx.xx.xx.xx:xxxxx] AH00898: Error reading from remote server returned by /api/v2/hosts
[proxy_http:error] [pid 3484] (104)Connection reset by peer: [client xx.xx.xx.xx:xxxxx] AH01102: error reading status line from remote server 127.0.0.1:3000
[proxy:error] [pid 3484] [client xx.xx.xx.xx:xxxxx] AH00898: Error reading from remote server returned by /api/v2/hosts
[proxy_http:error] [pid 3492] (104)Connection reset by peer: [client xx.xx.xx.xx:xxxxx] AH01102: error reading status line from remote server 127.0.0.1:3000
[proxy:error] [pid 3492] [client xx.xx.xx.xx:xxxxx] AH00898: Error reading from remote server returned by /api/v2/hosts
[proxy:error] [pid 3494] (110)Connection timed out: AH00957: HTTP: attempt to connect to 127.0.0.1:3000 (127.0.0.1) failed
[proxy:error] [pid 3494] AH00959: ap_proxy_connect_backend disabling worker for (127.0.0.1) for 0s
[proxy_http:error] [pid 3494] [client xx.xx.xx.xx:xxxxx] AH01114: HTTP: failed to make connection to backend: 127.0.0.1
[proxy:error] [pid 3483] (110)Connection timed out: AH00957: HTTP: attempt to connect to 127.0.0.1:3000 (127.0.0.1) failed
[proxy:error] [pid 3483] AH00959: ap_proxy_connect_backend disabling worker for (127.0.0.1) for 0s
[proxy_http:error] [pid 3483] [client xx.xx.xx.xx:xxxxx] AH01114: HTTP: failed to make connection to backend: 127.0.0.1
[proxy:error] [pid 3481] (110)Connection timed out: AH00957: HTTP: attempt to connect to 127.0.0.1:3000 (127.0.0.1) failed
[proxy:error] [pid 3481] AH00959: ap_proxy_connect_backend disabling worker for (127.0.0.1) for 0s

Actual results:
Puma stops accepting connections (queuing requests) even when the queue/backlog is low, causing connection timeouts and connection resets. This is because the requests sit in the socket backlog and time out.

Expected results:
Each worker should queue requests up to a certain limit, even when all of its threads are busy.


Additional info:
This never happened in Passenger.

This might not be considered a bug (some people may disagree that it is one) given how Puma is designed; it probably just means that the web server needs more workers in order to serve more concurrent long-running requests, since each worker has its own backlog queue. Maybe we need to document this in the tuning guide or something?
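
For reference, the relevant knobs in generic puma.rb terms (a hedged sketch; on Satellite these values are normally set through the installer/systemd unit rather than a hand-edited Puma config, and the numbers below are examples, not recommendations):
----------------------------------------
# Illustrative puma.rb fragment
workers 4          # more worker processes => more concurrent long-running requests
threads 5, 16      # min/max threads per worker
preload_app!       # load the app before forking so workers share memory
----------------------------------------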

Comment 1 Brad Buckingham 2021-06-28 13:10:37 UTC
Placing this on the Installer; however, please do move if another component is more appropriate.

Comment 2 Hao Chang Yu 2021-07-01 05:19:28 UTC
I tested this on Satellite 6.10 snap 6 and I am not able to reproduce the issue. It appears that Puma in Satellite 6.10 listens on a UNIX socket instead of a TCP socket, which fixes the connection timeouts and the "connection reset by peer" errors. If I switch Puma back to a TCP socket, the errors come back.

The issue is also fixed after I switch Puma in Satellite 6.9 to use a UNIX socket.

Therefore, my conclusion is that the issue still exists in Puma 5.3, but it can be avoided by using a UNIX socket.
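
For reference, the difference in generic puma.rb terms (a hedged sketch; on Satellite the bind is set by the installer/systemd unit, and the socket path below is only illustrative):
----------------------------------------
# Satellite 6.9 style: Puma listens on a local TCP port behind Apache.
bind 'tcp://127.0.0.1:3000'

# Satellite 6.10 style: Puma listens on a UNIX socket instead (path illustrative).
bind 'unix:///run/foreman.sock'
----------------------------------------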

Comment 10 Devendra Singh 2021-09-13 15:26:53 UTC
Verified on 6.9.6 Snap2

Verification Points:

1- Created the base version of the Satellite setup using 6.9.4 GA template.
2- Checked the content host count and it was around 40.
3- Ran the upgrade from 6.9.4 to 6.9.6 Snap2.
4- Upgrade completed successfully.
5- Checked for "connection reset by peer" in /var/log/httpd/foreman-ssl_error_ssl.log and didn't see any such errors.
6- Triggered 300 and then 400 concurrent requests from another host against this Satellite machine:

irb(main):007:0> 300.times { Thread.new { begin; RestClient::Resource.new("https://xyz.com/api/v2/hosts?installed_package_name=kernel", user: "admin", password: "xyz", timeout: 3600, open_timeout: 3600, verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end } }
=> 300
irb(main):008:0> 400.times { Thread.new { begin; RestClient::Resource.new("https://xyz.com/api/v2/hosts?installed_package_name=kernel", user: "admin", password: "xyz", timeout: 3600, open_timeout: 3600, verify_ssl: OpenSSL::SSL::VERIFY_NONE).get; rescue StandardError => e; p e.message; end } }
=> 400


7- Checked the access logs and confirmed that all the requests reached the server:

# tail -f foreman-ssl_access_ssl.log
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:46 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4065 "-" "Ruby"
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:47 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4066 "-" "Ruby"
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:48 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4069 "-" "Ruby"
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:47 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4061 "-" "Ruby"
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:47 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4065 "-" "Ruby"
XXX.XXX.XXX.XXX - - [13/Sep/2021:11:06:49 -0400] "GET /api/v2/hosts?installed_package_name=kernel HTTP/1.1" 200 4065 "-" "Ruby"


8- Checked the logs /var/log/httpd/foreman-ssl_error_ssl.log and didn't see any "connection reset by peer" error.
# less /var/log/httpd/foreman-ssl_error_ssl*|grep -i "connection reset by peer"
# 

For now, I am marking this bug as verified based on the above verification points. If something is missing, please confirm and I will check that part too.

Comment 15 errata-xmlrpc 2021-09-21 14:37:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Satellite 6.9.6 Async Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3628

Comment 16 Red Hat Bugzilla 2023-09-15 01:34:46 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

