Bug 1432985 - pulp_streamer returns HTTP 502 Bad Gateway from failed squid request
Summary: pulp_streamer returns HTTP 502 Bad Gateway from failed squid request
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.2.7
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-16 14:01 UTC by Tasos Papaioannou
Modified: 2017-07-07 14:50 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-07 14:50:28 UTC
Target Upstream Version:
Embargoed:



Description Tasos Papaioannou 2017-03-16 14:01:04 UTC
Description of problem:

On QCI 1.1, which is built on Satellite 6.2.7, I'm deploying 4 OpenShift nodes simultaneously. They consistently encounter failed yum installs/updates when trying to download rpms from a repository that uses the lazy-sync (on_demand) download policy. The exact rpm that fails varies.


Version-Release number of selected component (if applicable):

fusor-1.1.1-1.el7sat.noarch
satellite-6.2.7-1.0.el7sat.noarch
python-pulp-streamer-2.8.7.5-1.el7sat.noarch
squid-3.5.20-2.el7.x86_64

How reproducible:

100% (4-5 reproducers in a row, with the steps below)

Steps to Reproduce:

1.) Use QCI or another mechanism to trigger simultaneous yum install/update on 3-4 clients, from a repository that has been synced w/ on_demand download policy but not yet used.

2.) See an rpm download failure, with HTTP 502 from the /streamer/var/lib/pulp/content/units/rpm/[...] request logged in /var/log/httpd/foreman-ssl_access_ssl.log and no corresponding entry in /var/log/squid/access.log (example checks follow this list).

3.) Verify that re-running the failed yum update/install works for that rpm.
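
For reference, one way to check for the signature in step 2 (assuming the default log locations above; FAILED-RPM-NAME below is a placeholder for whichever rpm the client reported as failed):

# grep ' 502 ' /var/log/httpd/foreman-ssl_access_ssl.log
# grep FAILED-RPM-NAME /var/log/squid/access.log

The first grep should show the /streamer/ request that returned 502; the second should show no request for that rpm around the time of the failure.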

Actual results:

Failed rpm download from pulp_streamer.

Expected results:

No failed rpm download.

Additional info:

Running tcpdump on localhost port 3128 to capture the traffic between pulp_streamer and squid, I saw that the failed rpm request occurred on a previously-established persistent TCP connection that was just past its expiration time (default = 120 seconds). The sequence of events is:

1.) pulp_streamer establishes a TCP connection to squid on port 3128.
2.) pulp_streamer sends HTTP GET for an rpm.
3.) squid responds with the rpm content.
4.) The HTTP request completes, but the TCP connection remains open.
5.) 120 seconds later, pulp_streamer requests another rpm on the same TCP connection.
6.) squid responds with an ACK then a FIN/ACK, closing the TCP connection without responding to the HTTP request.
7.) pulp_streamer responds to the client with HTTP 502.
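
For reference, a capture like the one described above can be taken with something along these lines (assuming squid is listening on the default port 3128 on the loopback interface):

# tcpdump -i lo -nn -s 0 -w streamer-squid.pcap 'tcp port 3128'

The resulting pcap can then be inspected for the reuse-then-FIN pattern described in steps 5.)-6.) above.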

In the next reproducer attempt, I verified that changing the *_persistent_connections settings in /etc/squid/squid.conf to "off" resolved the issue. So, a proposed fix would be either to disable client persistent connections in squid, or to make pulp_streamer retry the request if squid terminates its connection without sending an HTTP response.
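
For reference, the "*_persistent_connections" settings referred to here are squid's client_persistent_connections and server_persistent_connections directives. After the change, a quick check should show (assuming the default /etc/squid/squid.conf path):

# grep persistent_connections /etc/squid/squid.conf
client_persistent_connections off
server_persistent_connections off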

Comment 1 Michael Hrivnak 2017-03-16 16:22:23 UTC
Thanks for the detailed report! The analysis you did helps a lot.

What you describe makes sense, but I don't understand yet how it can be easily reproducible. It seems like the connection must sit idle for 120 seconds, and then just at the same moment that a new request is being made to squid, the 120 second timeout is hit and squid closes the connection. Since it's on the local machine with little latency, it seems like you'd need to be very unlucky to have both of those things happen at exactly the same moment. Perhaps there is some additional contributing factor here?

Jeff, what do you think?

Comment 3 Jeff Ortel 2017-03-17 16:33:33 UTC
(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
> 
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
> 
> Jeff, what do you think?

Agreed, the exact timing needed would seem to make this harder to reproduce.  Also, seems odd that squid (tried-and-true) would not deal with this race condition better.

Tasos, what level of concurrency is happening here?  How many client requests are happening concurrently?

Comment 4 Tasos Papaioannou 2017-04-03 20:11:24 UTC
I can reproduce this with 5 clients running concurrent yum installs. I'm attaching an sosreport from a reproducer Satellite. One HTTP 502 occurred at 15:32:02:

# grep ' 502 ' /var/log/httpd/foreman-ssl_access_ssl.log 
192.168.106.130 - 4028ea815b34efd3015b35433cd00500 [03/Apr/2017:15:32:02 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/63/53ced84547bdde4c87f5044e86cfe37460ec95cf61a3accaded5bec60d4875/libnetfilter_cthelper-1.0.0-9.el7.x86_64.rpm?policy=eyJleHRlbnNpb25zIjogeyJyZW1vdGVfaXAiOiAiMTkyLjE2OC4xMDYuMTMwIn0sICJyZXNvdXJjZSI6ICIvc3RyZWFtZXIvdmFyL2xpYi9wdWxwL2NvbnRlbnQvdW5pdHMvcnBtLzYzLzUzY2VkODQ1NDdiZGRlNGM4N2Y1MDQ0ZTg2Y2ZlMzc0NjBlYzk1Y2Y2MWEzYWNjYWRlZDViZWM2MGQ0ODc1L2xpYm5ldGZpbHRlcl9jdGhlbHBlci0xLjAuMC05LmVsNy54ODZfNjQucnBtIiwgImV4cGlyYXRpb24iOiAxNDkxMjQ4MDEyfQ%3D%3D;signature=IXhw0M4D3vFfP5l9BONk8ZIgnJ_fTVu3_bUYl8raxcTHEKRvjfRUL5CIbrHiQP4rGu0NBuqjDtznFG3mdMnP2gnl1m24w55oB1L0fzRKP3tGSU71Td5uh1iU5enPAzMdFTFq0FguPJY5a7bnq6UgC8r1RTwktm7lAj2ve28JRY0BxYA0Y9T0mCancGQB6aE2LPGCASCxtZTY466fSaUIZf0Uozmr0_e72H7LkspXQLZMbGl70YV-nUZB2c_PaWsGvpBf2TWJs9mZmodpWVpnqUtyFbc-ib2oeneY_L60d01ijHZZpjWPVDNsf3rdn-UKoWanZ3bIXSAQ6ydeuJcJuA%3D%3D HTTP/1.1" 502 679 "-" "urlgrabber/3.10 yum/3.4.3"

The amount of httpd->squid traffic is not very high:

# grep /streamer/ /var/log/httpd/foreman-ssl_access_ssl.log|awk '{print $4}'|sort|uniq -c
      4 [03/Apr/2017:15:29:59
     16 [03/Apr/2017:15:30:00
      4 [03/Apr/2017:15:30:01
      4 [03/Apr/2017:15:30:02
     15 [03/Apr/2017:15:31:47
      6 [03/Apr/2017:15:31:55
      4 [03/Apr/2017:15:31:56
      9 [03/Apr/2017:15:32:02
     11 [03/Apr/2017:15:32:03

If I leave out the 'sort' in the above command, I notice that some of the GET requests from 120 seconds prior to the HTTP 502 error were logged with their timestamps out of order, indicating that they took a while to complete:
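
(For reference, the same pipeline without the sort stage:)

# grep /streamer/ /var/log/httpd/foreman-ssl_access_ssl.log|awk '{print $4}'|uniq -c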

      2 [03/Apr/2017:15:29:59
      9 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:29:59
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      2 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:29:59
      2 [03/Apr/2017:15:30:00
      3 [03/Apr/2017:15:30:02
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:02
     15 [03/Apr/2017:15:31:47
      1 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      1 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      2 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      4 [03/Apr/2017:15:32:02
      1 [03/Apr/2017:15:32:03
      4 [03/Apr/2017:15:32:02
      2 [03/Apr/2017:15:32:03
      1 [03/Apr/2017:15:32:02

The requests took a few seconds to complete because they were for large (~35MB) rpms:

----
192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-Storable-2.45-3.el7.x86_64.rpm HTTP
/1.1" 200 78940 "-" "urlgrabber/3.10 yum/3.4.3"

192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:00 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/ed/39676894cacd695536a9d88d011c33d999dc3f5a163f96ea1ce615cd3bc398/atomic-openshift-3.
4.1.10-1.git.0.c96aed3.el7.x86_64.rpm?[snip] HTTP/1.1" 200 37173376 "-" "urlgrabber/3.10 yum/3.4.3"

192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-TermReadKey-2.30-20.el7.x86_64.rpm 
HTTP/1.1" 302 - "-" "urlgrabber/3.10 yum/3.4.3"
----

The timing might be a coincidence, but I thought it was worth pointing out.

Comment 6 Jeff Ortel 2017-06-27 18:21:24 UTC
(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
> 
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
> 
> Jeff, what do you think?

Yeah, that seems like a really tight race condition to reproduce so easily.

Comment 7 Jeff Ortel 2017-06-27 18:27:37 UTC
PR 3008 [1] mitigates some badness in python-requests and urllib3.  I suspect it may resolve this.  Tasos, do you want to try it?

[1] https://github.com/pulp/pulp/pull/3008

Comment 8 Tasos Papaioannou 2017-07-06 18:42:00 UTC
I haven't been able to reproduce the issue in my recent testing. So, as long as the session-caching changes pass regular QE testing, that's ok with me.

