Bug 1432985
| Summary: | pulp_streamer returns HTTP 502 Bad Gateway from failed squid request | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Tasos Papaioannou <tpapaioa> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED WORKSFORME | QA Contact: | Katello QA List <katello-qa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.2.7 | CC: | bbuckingham, jcallaha, jortel, mhrivnak, pcreech, tpapaioa, ttereshc |
| Target Milestone: | Unspecified | Keywords: | Triaged |
| Target Release: | Unused | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-07 14:50:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Tasos Papaioannou
2017-03-16 14:01:04 UTC
Michael Hrivnak (comment #1):

Thanks for the detailed report! The analysis you did helps a lot.

What you describe makes sense, but I don't understand yet how it can be easily reproducible. It seems like the connection must sit idle for 120 seconds, and then just at the same moment that a new request is being made to squid, the 120-second timeout is hit and squid closes the connection. Since it's on the local machine with little latency, it seems like you'd need to be very unlucky to have both of those things happen at exactly the same moment. Perhaps there is some additional contributing factor here?

Jeff, what do you think?

----

(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
>
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
>
> Jeff, what do you think?

Agreed, the exact timing needed would seem to make this harder to reproduce. It also seems odd that squid (tried and true) would not handle this race condition better. Tasos, what level of concurrency is happening here? How many client requests are happening concurrently?

----

I can reproduce this with 5 clients running concurrent yum installs. I'm attaching an sosreport from a reproducer Satellite. One HTTP 502 occurred at 15:32:02:
# grep ' 502 ' /var/log/httpd/foreman-ssl_access_ssl.log
192.168.106.130 - 4028ea815b34efd3015b35433cd00500 [03/Apr/2017:15:32:02 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/63/53ced84547bdde4c87f5044e86cfe37460ec95cf61a3accaded5bec60d4875/libnetfilter_cthelper-1.0.0-9.el7.x86_64.rpm?policy=eyJleHRlbnNpb25zIjogeyJyZW1vdGVfaXAiOiAiMTkyLjE2OC4xMDYuMTMwIn0sICJyZXNvdXJjZSI6ICIvc3RyZWFtZXIvdmFyL2xpYi9wdWxwL2NvbnRlbnQvdW5pdHMvcnBtLzYzLzUzY2VkODQ1NDdiZGRlNGM4N2Y1MDQ0ZTg2Y2ZlMzc0NjBlYzk1Y2Y2MWEzYWNjYWRlZDViZWM2MGQ0ODc1L2xpYm5ldGZpbHRlcl9jdGhlbHBlci0xLjAuMC05LmVsNy54ODZfNjQucnBtIiwgImV4cGlyYXRpb24iOiAxNDkxMjQ4MDEyfQ%3D%3D;signature=IXhw0M4D3vFfP5l9BONk8ZIgnJ_fTVu3_bUYl8raxcTHEKRvjfRUL5CIbrHiQP4rGu0NBuqjDtznFG3mdMnP2gnl1m24w55oB1L0fzRKP3tGSU71Td5uh1iU5enPAzMdFTFq0FguPJY5a7bnq6UgC8r1RTwktm7lAj2ve28JRY0BxYA0Y9T0mCancGQB6aE2LPGCASCxtZTY466fSaUIZf0Uozmr0_e72H7LkspXQLZMbGl70YV-nUZB2c_PaWsGvpBf2TWJs9mZmodpWVpnqUtyFbc-ib2oeneY_L60d01ijHZZpjWPVDNsf3rdn-UKoWanZ3bIXSAQ6ydeuJcJuA%3D%3D HTTP/1.1" 502 679 "-" "urlgrabber/3.10 yum/3.4.3"
The amount of httpd->squid traffic is not very high:
# grep /streamer/ /var/log/httpd/foreman-ssl_access_ssl.log|awk '{print $4}'|sort|uniq -c
4 [03/Apr/2017:15:29:59
16 [03/Apr/2017:15:30:00
4 [03/Apr/2017:15:30:01
4 [03/Apr/2017:15:30:02
15 [03/Apr/2017:15:31:47
6 [03/Apr/2017:15:31:55
4 [03/Apr/2017:15:31:56
9 [03/Apr/2017:15:32:02
11 [03/Apr/2017:15:32:03
If I leave out the 'sort' in the above command, I notice that some of the GET requests from 120 seconds prior to the HTTP 502 error were logged with their timestamps out of order, indicating that they took a while to complete:
2 [03/Apr/2017:15:29:59
9 [03/Apr/2017:15:30:00
1 [03/Apr/2017:15:29:59
1 [03/Apr/2017:15:30:00
1 [03/Apr/2017:15:30:01
1 [03/Apr/2017:15:30:00
1 [03/Apr/2017:15:30:01
1 [03/Apr/2017:15:30:00
2 [03/Apr/2017:15:30:01
1 [03/Apr/2017:15:30:00
1 [03/Apr/2017:15:29:59
2 [03/Apr/2017:15:30:00
3 [03/Apr/2017:15:30:02
1 [03/Apr/2017:15:30:00
1 [03/Apr/2017:15:30:02
15 [03/Apr/2017:15:31:47
1 [03/Apr/2017:15:31:56
2 [03/Apr/2017:15:31:55
1 [03/Apr/2017:15:31:56
2 [03/Apr/2017:15:31:55
2 [03/Apr/2017:15:31:56
2 [03/Apr/2017:15:31:55
4 [03/Apr/2017:15:32:02
1 [03/Apr/2017:15:32:03
4 [03/Apr/2017:15:32:02
2 [03/Apr/2017:15:32:03
1 [03/Apr/2017:15:32:02
The requests took a few seconds to complete because they were for large (~35MB) rpms:
----
192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-Storable-2.45-3.el7.x86_64.rpm HTTP/1.1" 200 78940 "-" "urlgrabber/3.10 yum/3.4.3"
192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:00 -0400] "GET /streamer/var/lib/pulp/content/units/rpm/ed/39676894cacd695536a9d88d011c33d999dc3f5a163f96ea1ce615cd3bc398/atomic-openshift-3.4.1.10-1.git.0.c96aed3.el7.x86_64.rpm?[snip] HTTP/1.1" 200 37173376 "-" "urlgrabber/3.10 yum/3.4.3"
192.168.106.132 - 4028ea815b34efd3015b354336db04e5 [03/Apr/2017:15:30:02 -0400] "GET /pulp/repos/Default_Organization/Library/content/dist/rhel/server/7/7Server/x86_64/os/perl-TermReadKey-2.30-20.el7.x86_64.rpm HTTP/1.1" 302 - "-" "urlgrabber/3.10 yum/3.4.3"
----
The timing might be a coincidence, but I thought it was worth pointing out.
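
Not part of the original report, but for anyone checking the same correlation in their own sosreport, a rough Python sketch along the following lines can flag 502s on /streamer/ that land roughly one squid idle-timeout after an earlier /streamer/ request. The log path, the 120-second window, and the +/- 2 second tolerance are illustrative assumptions, and the access log records request start times, so this is only an approximation of connection idle time:

```python
#!/usr/bin/env python3
# Rough correlation helper (illustrative only; not from the original report).
import re
from datetime import datetime, timedelta

LOG = "/var/log/httpd/foreman-ssl_access_ssl.log"   # assumed default location
IDLE_TIMEOUT = timedelta(seconds=120)               # squid persistent-connection idle timeout
TOLERANCE = timedelta(seconds=2)

# Matches the combined-log-format fields shown above:
# ... [03/Apr/2017:15:32:02 -0400] "GET /path HTTP/1.1" 502 ...
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

entries = []
with open(LOG) as fh:
    for line in fh:
        m = LINE_RE.search(line)
        if m and "/streamer/" in m.group("path"):
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            entries.append((ts, m.group("path"), m.group("status")))

for ts, path, status in entries:
    if status != "502":
        continue
    # Any /streamer/ request logged ~120s before the 502 could have left an
    # idle httpd->squid connection that squid then closed at the wrong moment.
    lo = ts - IDLE_TIMEOUT - TOLERANCE
    hi = ts - IDLE_TIMEOUT + TOLERANCE
    earlier = [e for e in entries if lo <= e[0] <= hi]
    print(f"502 at {ts:%H:%M:%S} for {path.split('?')[0]}")
    print(f"  {len(earlier)} /streamer/ request(s) logged ~120s earlier")
```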
----

(In reply to Michael Hrivnak from comment #1)
> Thanks for the detailed report! The analysis you did helps a lot.
>
> What you describe makes sense, but I don't understand yet how it can be
> easily reproducible. It seems like the connection must sit idle for 120
> seconds, and then just at the same moment that a new request is being made
> to squid, the 120 second timeout is hit and squid closes the connection.
> Since it's on the local machine with little latency, it seems like you'd
> need to be very unlucky to have both of those things happen at exactly the
> same moment. Perhaps there is some additional contributing factor here?
>
> Jeff, what do you think?

Yeah, that seems like a really tight race condition to reproduce so easily. This PR [1] mitigates some badness in python-requests and urllib3; I suspect it may resolve this. Tasos, do you want to try it?

[1] https://github.com/pulp/pulp/pull/3008

----

I haven't been able to reproduce the issue in my recent testing. So, as long as the session-caching changes pass regular QE testing, that's ok with me.
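
For reference only (this is not part of the bug discussion, and it is not the actual change in pulp/pulp#3008): the general shape of the mitigation being discussed, retrying a request whose pooled keep-alive connection was closed by the far end rather than surfacing the failure, can be sketched with requests/urllib3 like this:

```python
# Illustrative sketch only -- NOT the actual change from pulp/pulp#3008.
# General idea: when a pooled keep-alive connection has been closed by the far
# end (e.g. after squid's ~120s idle timeout), retry on a fresh socket instead
# of letting the failure surface as a 502 / connection error.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    session = requests.Session()
    # Retry connection setup failures, plus read failures for idempotent
    # methods such as GET (the only methods urllib3 retries by default).
    # backoff_factor adds a small delay between attempts.
    retries = Retry(total=3, connect=3, read=2, backoff_factor=0.1)
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

if __name__ == "__main__":
    # Hypothetical URL, purely for illustration.
    session = make_session()
    response = session.get("http://cdn.example.com/some.rpm", timeout=30)
    print(response.status_code, len(response.content))
```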