Bug 1432985

Summary: pulp_streamer returns HTTP 502 Bad Gateway from failed squid request
Product: Red Hat Satellite
Component: Pulp
Reporter: Tasos Papaioannou <tpapaioa>
Assignee: satellite6-bugs <satellite6-bugs>
QA Contact: Katello QA List <katello-qa-list>
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Version: 6.2.7
CC: bbuckingham, jcallaha, jortel, mhrivnak, pcreech, tpapaioa, ttereshc
Keywords: Triaged
Target Milestone: Unspecified
Target Release: Unused
Hardware: All
OS: Linux
Type: Bug
Last Closed: 2017-07-07 14:50:28 UTC

Description Tasos Papaioannou 2017-03-16 14:01:04 UTC
Description of problem:

On QCI 1.1, which is built on Satellite 6.2.7, I'm deploying 4 OpenShift nodes simultaneously. They consistently encounter failed yum installs/updates when downloading rpms from a repository that uses the lazy-sync (on_demand) download policy. The exact rpm that fails varies.


Version-Release number of selected component (if applicable):

fusor-1.1.1-1.el7sat.noarch
satellite-6.2.7-1.0.el7sat.noarch
python-pulp-streamer-2.8.7.5-1.el7sat.noarch
squid-3.5.20-2.el7.x86_64

How reproducible:

100% (4-5 reproducers in a row, with the steps below)

Steps to Reproduce:

1.) Use QCI or another mechanism to trigger simultaneous yum installs/updates on 3-4 clients, from a repository that has been synced with the on_demand download policy but not yet used (an example invocation follows the steps).

2.) Observe an rpm download failure: an HTTP 502 for the /streamer/var/lib/pulp/content/units/rpm/[...] request is logged in /var/log/httpd/foreman-ssl_access_ssl.log, with no corresponding entry in /var/log/squid/access.log.

3.) Verify that re-running the failed yum update/install works for that rpm.
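
For example, something along these lines (the client hostnames here are hypothetical; the package is one that failed in the logs further down):

----
# Kick off the same install on several clients at once; hostnames are
# illustrative, not from the original report.
for h in client1 client2 client3 client4; do
    ssh root@$h 'yum -y install libnetfilter_cthelper' &
done
wait
----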

Actual results:

Failed rpm download from pulp_streamer.

Expected results:

No failed rpm download.

Additional info:

Running tcpdump on localhost port 3128 to capture the traffic between pulp_streamer and squid (an example capture invocation appears after the list), I saw that the failed rpm request occurred on a previously-established persistent TCP connection that was just past its idle expiration time (default = 120 seconds). The sequence of events is:

1.) pulp_streamer establishes a TCP connection to squid on port 3128.
2.) pulp_streamer sends HTTP GET for an rpm.
3.) squid responds with the rpm content.
4.) The HTTP request completes, but the TCP connection remains open.
5.) 120 seconds later, pulp_streamer requests another rpm on the same TCP connection.
6.) squid responds with an ACK then a FIN/ACK, closing the TCP connection without responding to the HTTP request.
7.) pulp_streamer responds to the client with HTTP 502.
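
For reference, a capture like the one described above can be taken with something along these lines (the interface and output filename are illustrative):

# tcpdump -i lo -s 0 -w streamer-squid.pcap 'tcp port 3128'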

In the next reproducer attempt, I verified that changing the *_persistent_connections settings in /etc/squid/squid.conf to "off" resolved the issue. So, a proposed fix would be either to disable client persistent connections in squid, or to make pulp_streamer retry the request if squid terminates the connection without sending an HTTP response.
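
For reference, the two persistent-connection directives in question (both default to "on") are:

----
# /etc/squid/squid.conf -- disable HTTP keep-alive on both the
# client-facing and server-facing sides of the proxy
client_persistent_connections off
server_persistent_connections off
----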

Comment 1 Michael Hrivnak 2017-03-16 16:22:23 UTC
Thanks for the detailed report! The analysis you did helps a lot.

What you describe makes sense, but I don't understand yet how it can be easily reproducible. It seems like the connection must sit idle for 120 seconds, and then just at the same moment that a new request is being made to squid, the 120 second timeout is hit and squid closes the connection. Since it's on the local machine with little latency, it seems like you'd need to be very unlucky to have both of those things happen at exactly the same moment. Perhaps there is some additional contributing factor here?

Jeff, what do you think?

Comment 3 Jeff Ortel 2017-03-17 16:33:33 UTC
(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
> 
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
> 
> Jeff, what do you think?

Agreed, the exact timing required would seem to make this much harder to reproduce. Also, it seems odd that squid (tried-and-true) would not handle this race condition better.

Tasos, what level of concurrency is happening here? How many client requests are happening concurrently?

Comment 4 Tasos Papaioannou 2017-04-03 20:11:24 UTC
I can reproduce this with 5 clients running concurrent yum installs. I'm attaching an sosreport from a reproducer Satellite. One HTTP 502 occurred at 15:32:02:

# grep ' 502 ' /var/log/httpd/foreman-ssl_access_ssl.log 
192.168.106.130 - 4028ea815b34efd3015b35433cd00500 [03/Apr/2017:15:32:02 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/63/53ced84547bdde4c87f5044e86cfe37460ec95cf61a3accaded5bec60d4875/libnetfilter_cthelper-1.0.0-9.el7.x86_64.rpm?policy=eyJleHRlbnNpb25zIjogeyJyZW1vdGVfaXAiOiAiMTkyLjE2OC4xMDYuMTMwIn0sICJyZXNvdXJjZSI6ICIvc3RyZWFtZXIvdmFyL2xpYi9wdWxwL2NvbnRlbnQvdW5pdHMvcnBtLzYzLzUzY2VkODQ1NDdiZGRlNGM4N2Y1MDQ0ZTg2Y2ZlMzc0NjBlYzk1Y2Y2MWEzYWNjYWRlZDViZWM2MGQ0ODc1L2xpYm5ldGZpbHRlcl9jdGhlbHBlci0xLjAuMC05LmVsNy54ODZfNjQucnBtIiwgImV4cGlyYXRpb24iOiAxNDkxMjQ4MDEyfQ%3D%3D;signature=IXhw0M4D3vFfP5l9BONk8ZIgnJ_fTVu3_bUYl8raxcTHEKRvjfRUL5CIbrHiQP4rGu0NBuqjDtznFG3mdMnP2gnl1m24w55oB1L0fzRKP3tGSU71Td5uh1iU5enPAzMdFTFq0FguPJY5a7bnq6UgC8r1RTwktm7lAj2ve28JRY0BxYA0Y9T0mCancGQB6aE2LPGCASCxtZTY466fSaUIZf0Uozmr0_e72H7LkspXQLZMbGl70YV-nUZB2c_PaWsGvpBf2TWJs9mZmodpWVpnqUtyFbc-ib2oeneY_L60d01ijHZZpjWPVDNsf3rdn-UKoWanZ3bIXSAQ6ydeuJcJuA%3D%3D HTTP/1.1" 502 679 "-" "urlgrabber/3.10 yum/3.4.3"

The amount of httpd->squid traffic is not very high:

# grep /streamer/ /var/log/httpd/foreman-ssl_access_ssl.log|awk '{print $4}'|sort|uniq -c
      4 [03/Apr/2017:15:29:59
     16 [03/Apr/2017:15:30:00
      4 [03/Apr/2017:15:30:01
      4 [03/Apr/2017:15:30:02
     15 [03/Apr/2017:15:31:47
      6 [03/Apr/2017:15:31:55
      4 [03/Apr/2017:15:31:56
      9 [03/Apr/2017:15:32:02
     11 [03/Apr/2017:15:32:03

If I leave out the 'sort' in the above command, some of the GET requests from 120 seconds prior to the HTTP 502 error are logged with their timestamps out of order. Apache httpd records a request's start time but writes the access log entry only when the response completes, so out-of-order timestamps indicate requests that took a while to finish:
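# grep /streamer/ /var/log/httpd/foreman-ssl_access_ssl.log|awk '{print $4}'|uniq -c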

      2 [03/Apr/2017:15:29:59
      9 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:29:59
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      2 [03/Apr/2017:15:30:01
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:29:59
      2 [03/Apr/2017:15:30:00
      3 [03/Apr/2017:15:30:02
      1 [03/Apr/2017:15:30:00
      1 [03/Apr/2017:15:30:02
     15 [03/Apr/2017:15:31:47
      1 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      1 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      2 [03/Apr/2017:15:31:56
      2 [03/Apr/2017:15:31:55
      4 [03/Apr/2017:15:32:02
      1 [03/Apr/2017:15:32:03
      4 [03/Apr/2017:15:32:02
      2 [03/Apr/2017:15:32:03
      1 [03/Apr/2017:15:32:02

The requests took a few seconds to complete because they were for large (~35MB) rpms:

----
192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-Storable-2.45-3.el7.x86_64.rpm HTTP/1.1" 200 78940 "-" "urlgrabber/3.10 yum/3.4.3"

192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:00 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/ed/39676894cacd695536a9d88d011c33d999dc3f5a163f96ea1ce615cd3bc398/atomic-openshift-3.4.1.10-1.git.0.c96aed3.el7.x86_64.rpm?[snip] HTTP/1.1" 200 37173376 "-" "urlgrabber/3.10 yum/3.4.3"

192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-TermReadKey-2.30-20.el7.x86_64.rpm HTTP/1.1" 302 - "-" "urlgrabber/3.10 yum/3.4.3"
----

The timing might be a coincidence, but I thought it was worth pointing out.

Comment 6 Jeff Ortel 2017-06-27 18:21:24 UTC
(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
> 
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
> 
> Jeff, what do you think?

Yeah, that seems like a really tight race condition to reproduce so easily.

Comment 7 Jeff Ortel 2017-06-27 18:27:37 UTC
PR 3008 [1] mitigates some badness in python-requests and urllib3. I suspect it may resolve this. Tasos, do you want to try it?

[1] https://github.com/pulp/pulp/pull/3008
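
For illustration only (this is not the code from the PR), the general shape of the mitigation -- reusing a cached requests session and retrying idempotent GETs when the proxy drops a stale keep-alive connection -- looks something like the sketch below; the proxy URL, retry counts, and target URL are assumptions:

----
# Illustrative sketch, not code from pulp or PR 3008.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()                  # cached/reused session
retries = Retry(total=3, connect=3, read=3,   # retry dropped connections
                backoff_factor=0.1)           # brief pause between attempts
session.mount('http://', HTTPAdapter(max_retries=retries))
session.proxies = {'http': 'http://localhost:3128'}  # squid, per this report

# GET is idempotent, so urllib3's default retry policy will re-issue it
# if the proxy closes the pooled connection before sending a response.
resp = session.get('http://cdn.example.com/some.rpm', stream=True)
----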

Comment 8 Tasos Papaioannou 2017-07-06 18:42:00 UTC
I haven't been able to reproduce the issue in my recent testing. So, as long as the session-caching changes pass regular QE testing, that's ok with me.