| Summary: | Squid hangs after repeated queries |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Bryan Totty <btotty> |
| Component: | squid |
| Assignee: | Luboš Uhliarik <luhliari> |
| Status: | CLOSED INSUFFICIENT_DATA |
| QA Contact: | BaseOS QE - Apps <qe-baseos-apps> |
| Severity: | high |
| Docs Contact: | |
| Priority: | high |
| Version: | 7.2 |
| CC: | btotty, dborek, dcortijo, dwoodruf, hhorak, jkejda, jorton, luhliari, mfrodl, rloring, thgardne, thozza |
| Target Milestone: | rc |
| Flags: | luhliari: needinfo+ |
| Target Release: | --- |
| Hardware: | All |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2017-11-02 11:23:39 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Bug Depends On: | |
| Bug Blocks: | 1298243, 1380363, 1420851, 1473612 |
| Attachments: | |
**Description** (Bryan Totty, 2016-04-12 16:56:16 UTC)
I'm sorry I have not been paying more attention to this problem. There are some things that I should have included early on. First off, the definition of "hang" here is that squid gets caught in an infinite loop. The CPU gets driven to 100%, but it doesn't seem to be gobbling memory (no pressure to swap). I will attach a debugging log that clearly shows the loop shortly, but this sequence:

    2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving Initiate::noteInitiatorAborted()
    2016/03/17 11:50:47.944 kid1| event.cc(250) checkEvents: checkEvents
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(18) AsyncCall: The AsyncCall storeClientCopyEvent constructed, this=0x7f38a45af430 [call119]
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(85) ScheduleCall: event.cc(259) will call storeClientCopyEvent(0x7f38a45826e8*?) [call119]
    2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(51) fireNext: entering storeClientCopyEvent(0x7f38a45826e8*?)
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(30) make: make call storeClientCopyEvent [call119]
    2016/03/17 11:50:47.944 kid1| store_client.cc(176) storeClientCopyEvent: storeClientCopyEvent: Running
    2016/03/17 11:50:47.944 kid1| store_client.cc(324) storeClientCopy2: storeClientCopy2: 5AB38E13116399D2E66C60F78B77F8B5
    2016/03/17 11:50:47.944 kid1| store_client.cc(353) doCopy: store_client::doCopy: co: 170, hi: 170
    2016/03/17 11:50:47.944 kid1| store_client.cc(357) doCopy: There is no more to send!
    2016/03/17 11:50:47.944 kid1| peer_digest.cc(775) peerDigestFetchedEnough: peerDigestHandleReply: peer 130.199.59.110, offset: 170 size: 50.
    2016/03/17 11:50:47.944 kid1| peer_digest.cc(775) peerDigestFetchedEnough: peerDigestSwapInCBlock: peer 130.199.59.110, offset: 170 size: 50.
    2016/03/17 11:50:47.944 kid1| store_client.cc(237) copy: store_client::copy: 5AB38E13116399D2E66C60F78B77F8B5, from 170, for length 4046, cb 1, cbdata 0x7f38a45814e8
    2016/03/17 11:50:47.944 kid1| store_client.cc(319) storeClientCopy2: storeClientCopy2: Queueing storeClientCopyEvent()
    2016/03/17 11:50:47.944 kid1| event.cc(342) schedule: schedule: Adding 'storeClientCopyEvent', in 0.00 seconds

just gets repeated over and over and over until the process is killed. The lines leading up to the first instance of that sequence are:

    2016/03/17 11:50:47.943 kid1| ModEpoll.cc(139) SetSelect: FD 13, type=1, handler=0, client_data=0, timeout=0
    2016/03/17 11:50:47.943 kid1| ModEpoll.cc(139) SetSelect: FD 13, type=2, handler=0, client_data=0, timeout=0
    2016/03/17 11:50:47.944 kid1| AcceptLimiter.cc(47) kick: size=0
    2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving comm_close_complete(FD 13)
    2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(51) fireNext: entering Initiate::noteInitiatorAborted()
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(30) make: make call Initiate::noteInitiatorAborted [call118]
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(48) cancel: will not call Initiate::noteInitiatorAborted [call118] because job gone
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(40) make: will not call Initiate::noteInitiatorAborted [call118] because of job gone
    2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving Initiate::noteInitiatorAborted()
    2016/03/17 11:50:47.944 kid1| event.cc(250) checkEvents: checkEvents
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(18) AsyncCall: The AsyncCall storeClientCopyEvent constructed, this=0x7f38a45af430 [call119]
    2016/03/17 11:50:47.944 kid1| AsyncCall.cc(85) ScheduleCall: event.cc(259) will call storeClientCopyEvent(0x7f38a45826e8*?) [call119]

I don't understand it, but that bit about "will not call this-n-such because job is gone" looks like it might be significant. What do we mean the job is "gone", sez I to self....
I will get that log attached to this BZ as soon as possible. The customer also sent us some cores (taken manually using gdb against the running process, IIRC), so I'll get at least one of those posted too. I think this answers what you were asking of Bryan, so I'll clear the needinfo. Please set it again if I'm mistaken about that.

I was able to attach the two core files, but not the log. Even compressed at better than 138:1, it's too big. I'll trim it down some; it's not all useful, I'm sure. I forgot to mention in my earlier comment: yes, the customer is definitely using RHEL 7. I think _that_ answers all your questions. :-)

(In reply to Thomas Gardner from comment #30) Thanks for all the information as well as your help via IRC. I think I finally found the problem. storeClientCopy2() calls store_client::doCopy(), which calls it back, and that new call schedules another storeClientCopy2(). This happens when 170 bytes out of 170 have already been copied and the copy procedure should therefore be finalized. The flags.store_copying attribute serves as a state variable for the nested calls. I looked at the latest RHEL 7.x rebased squid code, and the new code has an explicit guard for that situation. Therefore I believe the rebased squid is already fixed. It would be nice if we could verify this assessment.

I am Rich's coworker here at BNL. He and I were looking into this again, and can confirm that the issue is still present on squid-3.3.8-26.el7_2.4.x86_64. Is the newly rebased squid due to come out with RHEL 7.3? If so, is there any indication as to when the beta will actually be available, so that we can test on the rebased squid?

So RHEL 7.3 rolled out with a supposed bug fix for this, which did not work. Squid once again silently hung; however, it took somewhat longer to do so.

Moving bug back to NEW, since this bug was not fixed in the rebased package. Rich, could you please specify how you make squid hang? Has anything changed after the rebase?
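For what it's worth, the recursion described in that analysis can be sketched in isolation. The following is a toy model, not Squid's actual code (only the names are borrowed from the trace): doCopy() delivers data, its consumer immediately asks for more, and a store_copying flag turns the nested request into a deferred event. The "nothing left to send" check in storeClientCopy2() plays the role of the guard the rebased code reportedly added; remove it, and the deferred event re-queues itself forever, which is the loop visible in the log.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <queue>

// Toy model of the storeClientCopy2()/doCopy() re-entrancy pattern.
// NOT Squid source; names mirror the trace for readability.
struct StoreClient {
    int offset = 0;            // bytes already delivered ("co" in the log)
    int available = 170;       // bytes the store holds ("hi" in the log)
    bool store_copying = false; // state variable for nested calls
    bool want_more = true;      // the consumer keeps requesting data
    std::queue<std::function<void()>> *events; // simulated event queue

    void copy() { storeClientCopy2(); }

    void storeClientCopy2() {
        if (store_copying) {
            // Nested call: defer it, like "Queueing storeClientCopyEvent()".
            events->push([this] { storeClientCopy2(); });
            return;
        }
        // The guard (sketched): everything has been sent, so finalize
        // instead of letting the consumer schedule yet another copy.
        // Removing this check makes the deferred event loop forever.
        if (offset >= available) {
            want_more = false;
            return;
        }
        store_copying = true;
        doCopy();
        store_copying = false;
    }

    void doCopy() {
        offset = available;    // deliver the remaining bytes
        if (want_more)
            copy();            // consumer immediately asks for more
    }
};

// Drain the simulated event queue; returns how many deferred events ran.
int iterationsUntilIdle(StoreClient &sc, int limit) {
    int n = 0;
    sc.copy();
    while (!sc.events->empty() && n < limit) {
        auto ev = sc.events->front();
        sc.events->pop();
        ++n;
        ev();
    }
    return n;
}
```

With the guard in place the queue settles after a single deferred event; with the `offset >= available` check deleted, `iterationsUntilIdle()` runs until the `limit` cap, mimicking the 100%-CPU loop.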
Uhliarik, the bug has not changed after the rebase. I currently use a test reverse proxy to test with, because whenever I use a production RHEL 7.3 server it silently hangs squid and all reverse proxy entries in squid.conf become unusable. To re-create the issue I do a wget request from the test system to a system the bug dislikes. For example:

    wget --no-check-certificate --header="Host: idp-1.comeplace.com" https://192.168.1.1/idp/Authn/UserPassword

Please note this only seems to happen with systems using Shibboleth, and not with all versions of Shibboleth.

Hello Rich, thank you for your response. Can you please specify with which version of Shibboleth squid hangs? Could you please also provide the current squid configuration you are using? I'll need this information so I am able to reproduce the issue and try to work on the fix. Thank you for your cooperation.

Uhliarik, below are the versions of Shibboleth, Java, and Tomcat being used:

    Tomcat 6.0.35
    jdk-1.7.0_02-fcs.x86_64
    shibboleth-identityprovider-2.3.5

And these are the squid configurations:

    cache_peer 192.168.1.1 parent 443 0 no-query ssl originserver front-end-https=on sslflags=DONT_VERIFY_PEER login=PASS ssloptions=NO_SSLv2,NO_SSLv3 name=some.dns.domain-ssl
    acl dst_dns-ssl dstdomain some.dns.domain
    cache_peer_access some.dns.domain-ssl deny !dst_dns-ssl !SSL_Ports
    cache_peer_access some.dns.domain-ssl allow dst_dns-ssl SSL_Ports
    http_access allow SSL_Ports dst_dst_dns-ssl

Please note there is a separate file which points to the SSL cert.

Is there any movement on this bug? RHEL 5 is coming to an end, and I am unable to move certain sites to the current squid on RHEL 7.3.

Created attachment 1244267 [details]
Another strace of what seems to me to be the same problem (squid version squid-3.5.20-2.el7.x86_64).
This was from the same customer as the previous new core I most recently attached.
Does it help to know that this customer reports they have no trouble whatsoever handling HTTP requests? The problem only starts when they begin getting HTTPS requests.
This customer discovered early on that if they increase "workers" to 4, then even with only a single connection they could send all four workers into this loop (each of the four processes chewing up 100% CPU). We didn't capture a core of that, though, so we can't really prove it's the same loop in each of the four processes, but what are the odds that it's not? The previous single-process core shows that this customer is hitting the same bug as the customer from the other case.
Hello Rich, I think I found a workaround. Could you please add the "no-digest" option to the cache_peer directive? Does squid still fail after this change?

The workaround in comment 54 works. no-digest was added to the cache_peer definitions in squid.conf, and the problem can't be reproduced any more.

Since we have not managed to find a reliable reproduction recipe for this issue, but there is a known workaround, no update is planned in Squid to address this.
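For reference, the workaround amounts to appending no-digest to each cache_peer line. Using the anonymized example peer from the earlier comment (the addresses and names are placeholders from this report, not real values), the changed directive would look like:

```
cache_peer 192.168.1.1 parent 443 0 no-query no-digest ssl originserver front-end-https=on sslflags=DONT_VERIFY_PEER login=PASS ssloptions=NO_SSLv2,NO_SSLv3 name=some.dns.domain-ssl
```

no-digest tells squid not to request a cache digest from that peer, which sidesteps the digest-fetch code path (peer_digest.cc calling into store_client.cc) that the looping trace above runs through.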