Bug 1326442

Summary: Squid hangs after repeated queries
Product: Red Hat Enterprise Linux 7
Component: squid
Version: 7.2
Hardware: All
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
Reporter: Bryan Totty <btotty>
Assignee: Luboš Uhliarik <luhliari>
QA Contact: BaseOS QE - Apps <qe-baseos-apps>
CC: btotty, dborek, dcortijo, dwoodruf, hhorak, jkejda, jorton, luhliari, mfrodl, rloring, thgardne, thozza
Target Milestone: rc
Flags: luhliari: needinfo+
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-11-02 11:23:39 UTC
Bug Blocks: 1298243, 1380363, 1420851, 1473612
Attachments: Another strace of what seems to me to be the same problem (squid version squid-3.5.20-2.el7.x86_64).

Description Bryan Totty 2016-04-12 16:56:16 UTC
Description of problem:
Every so often squid quietly hangs, causing disruption to service.

squid-3.3.8-26.el7.x86_64

The squid PID in this example is 2617.

--------------------
# strace -p 2617
Process 2617 attached
wait4(-1, 
--------------------

No other useful output is produced by strace.

The hang can be triggered manually with wget; the fourth run of the command below triggers it.

The date and time should line up with the attached debug logs:

--------------------
2016-03-25 10:31:12 (1.21 MB/s) - ‘UserPassword.4’ saved [3999/3999]

# wget --no-check-certificate --header="Host: system.example.com" https://192.168.1.2/idp/Authn/UserPassword
--------------------

A private core dump and sosreport have been attached.

Comment 27 Thomas Gardner 2016-07-06 15:18:37 UTC
I'm sorry I have not been paying more attention to this problem.  There are some things that I should have included early on in this.

First off, the definition of "hang" here is that squid gets caught in an infinite loop.  The CPU gets driven to 100%, but it doesn't seem to be gobbling memory (no pressure to swap).  I will shortly attach a debugging log that clearly shows the loop, but this sequence:

2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving Initiate::noteInitiatorAborted()
2016/03/17 11:50:47.944 kid1| event.cc(250) checkEvents: checkEvents
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(18) AsyncCall: The AsyncCall storeClientCopyEvent constructed, this=0x7f38a45af430 [call119]
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(85) ScheduleCall: event.cc(259) will call storeClientCopyEvent(0x7f38a45826e8*?) [call119]
2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(51) fireNext: entering storeClientCopyEvent(0x7f38a45826e8*?)
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(30) make: make call storeClientCopyEvent [call119]
2016/03/17 11:50:47.944 kid1| store_client.cc(176) storeClientCopyEvent: storeClientCopyEvent: Running
2016/03/17 11:50:47.944 kid1| store_client.cc(324) storeClientCopy2: storeClientCopy2: 5AB38E13116399D2E66C60F78B77F8B5
2016/03/17 11:50:47.944 kid1| store_client.cc(353) doCopy: store_client::doCopy: co: 170, hi: 170
2016/03/17 11:50:47.944 kid1| store_client.cc(357) doCopy: There is no more to send!
2016/03/17 11:50:47.944 kid1| peer_digest.cc(775) peerDigestFetchedEnough: peerDigestHandleReply: peer 130.199.59.110, offset: 170 size: 50.
2016/03/17 11:50:47.944 kid1| peer_digest.cc(775) peerDigestFetchedEnough: peerDigestSwapInCBlock: peer 130.199.59.110, offset: 170 size: 50.
2016/03/17 11:50:47.944 kid1| store_client.cc(237) copy: store_client::copy: 5AB38E13116399D2E66C60F78B77F8B5, from 170, for length 4046, cb 1, cbdata 0x7f38a45814e8
2016/03/17 11:50:47.944 kid1| store_client.cc(319) storeClientCopy2: storeClientCopy2: Queueing storeClientCopyEvent()
2016/03/17 11:50:47.944 kid1| event.cc(342) schedule: schedule: Adding 'storeClientCopyEvent', in 0.00 seconds

just gets repeated over and over and over until the thing is killed.  The lines leading up to the first instance of that sequence are:

2016/03/17 11:50:47.943 kid1| ModEpoll.cc(139) SetSelect: FD 13, type=1, handler=0, client_data=0, timeout=0
2016/03/17 11:50:47.943 kid1| ModEpoll.cc(139) SetSelect: FD 13, type=2, handler=0, client_data=0, timeout=0
2016/03/17 11:50:47.944 kid1| AcceptLimiter.cc(47) kick: size=0
2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving comm_close_complete(FD 13)
2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(51) fireNext: entering Initiate::noteInitiatorAborted()
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(30) make: make call Initiate::noteInitiatorAborted [call118]
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(48) cancel: will not call Initiate::noteInitiatorAborted [call118] because job gone
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(40) make: will not call Initiate::noteInitiatorAborted [call118] because of job gone
2016/03/17 11:50:47.944 kid1| AsyncCallQueue.cc(53) fireNext: leaving Initiate::noteInitiatorAborted()
2016/03/17 11:50:47.944 kid1| event.cc(250) checkEvents: checkEvents
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(18) AsyncCall: The AsyncCall storeClientCopyEvent constructed, this=0x7f38a45af430 [call119]
2016/03/17 11:50:47.944 kid1| AsyncCall.cc(85) ScheduleCall: event.cc(259) will call storeClientCopyEvent(0x7f38a45826e8*?) [call119]

I don't understand it, but that bit about "will not call this-n-such because job is gone" looks like it might be significant.  What do we mean the job is "gone", sez I to self....

I will get that log attached to this BZ as soon as possible.  The customer also sent us some cores (taken manually using gdb against the running process IIRC), so I'll get at least one of those posted too.

I think this answers what you were asking of Bryan, so I'll clear the needinfo.  Please set it again if I'm mistaken about that.

Comment 30 Thomas Gardner 2016-07-06 16:23:06 UTC
I was able to attach the two core files, but not the log.  Even compressed over 138:1, it's too big.  I'll trim it down some.  It's not all useful, I'm sure.

I forgot to mention in my earlier comment:  Yes, the customer is definitely using RHEL7.  I think _that_ answers all your questions.  :-)

Comment 31 Pavel Šimerda (pavlix) 2016-07-07 17:13:03 UTC
(In reply to Thomas Gardner from comment #30)
Thanks for all the information as well as your help via IRC.

I think I finally found the problem. storeClientCopy2() calls store_client::doCopy(), which calls back into the client, and that callback schedules another storeClientCopy2(). This happens when all 170 of the 170 bytes have already been copied and the copy procedure should therefore be finalized. The flags.store_copying attribute serves as a state variable for the nested calls.
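The loop described above can be sketched with a toy event queue (hypothetical names; this is only a model of the call pattern visible in the debug log, not Squid's actual code):

```python
from collections import deque

# A rough model of the self-rescheduling event: once all 170 bytes have
# been copied, the "There is no more to send!" path keeps queueing the
# same copy event instead of finalizing the copy.
class CopyLoopDemo:
    def __init__(self, size, guarded, cap=25):
        self.size = size          # bytes available in the store entry (170)
        self.offset = size        # the client has already copied everything
        self.guarded = guarded    # models the rebased code's explicit guard
        self.cap = cap            # safety cap so the unguarded demo terminates
        self.queue = deque()      # stands in for Squid's async event queue
        self.fired = 0

    def schedule_copy_event(self):
        # event.cc: "Adding 'storeClientCopyEvent', in 0.00 seconds"
        self.queue.append(self.copy_event)

    def copy_event(self):
        self.fired += 1
        if self.offset >= self.size:
            # store_client.cc: "There is no more to send!"
            if self.guarded:
                return            # fixed behaviour: finalize and stop here
            if self.fired < self.cap:
                self.schedule_copy_event()   # buggy behaviour: requeue itself

    def run(self):
        self.schedule_copy_event()
        while self.queue:                    # the main event loop
            self.queue.popleft()()
        return self.fired

print(CopyLoopDemo(170, guarded=True).run())    # 1: event fires once, done
print(CopyLoopDemo(170, guarded=False).run())   # 25: spins until the demo cap
```

Without the guard, the event keeps firing and requeueing itself inside the event loop, which matches the 100% CPU spin with no memory growth reported in comment 27.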

I looked at the latest RHEL 7.x rebased squid code, and the new code contains an explicit guard for that situation. Therefore I believe the rebased squid is already fixed. It would be nice if we could verify this assessment.

Comment 39 David Cortijo 2016-08-24 17:48:44 UTC
I am Rich's coworker here at BNL. He and I were looking into this again, and can confirm that the issue is still present on squid-3.3.8-26.el7_2.4.x86_64.

Is the newly-rebased squid due to come out with RHEL 7.3? If so, is there any indication as to when the beta will actually be available so that we can test on the rebased squid?

Comment 45 Rich Loring 2016-11-07 14:17:37 UTC
So RHEL 7.3 rolled out with a supposed bug fix for this, which did not work. Squid once again hung silently; however, it took somewhat longer to do so.

Comment 46 Luboš Uhliarik 2016-11-24 12:15:47 UTC
Moving bug back to NEW, since this bug was not fixed in the rebased package.

Rich, could you please specify how you make squid hang? Has anything changed since the rebase?

Comment 47 Rich Loring 2016-12-13 18:17:20 UTC
Uhliarik,

The bug has not changed after the rebase. I currently use a test reverse proxy, because whenever I use a production RHEL 7.3 server the bug silently hangs squid and all reverse proxy entries in squid.conf become unusable.

In order to re-create the issue I do a wget request from the test system to a system the bug dislikes. For example:

 wget --no-check-certificate --header="Host: idp-1.comeplace.com" https://192.168.1.1/idp/Authn/UserPassword

Please note this only seems to happen with systems using Shibboleth, and not with all versions of Shibboleth.

Comment 48 Luboš Uhliarik 2016-12-14 09:40:32 UTC
Hello Rich,

thank you for your response. Can you please specify with which version of Shibboleth squid hangs?

Could you also provide the current squid configuration you are using?

I'll need this information to reproduce the issue and work on a fix.

Thank you for your cooperation.

Comment 49 Rich Loring 2016-12-14 16:22:16 UTC
Uhliarik,

Below are the versions of Shibboleth, Java, and Tomcat being used.

Tomcat 6.0.35
Jdk-1.7.0_02-fcs.x86_64
Shibboleth-identityprovider-2.3.5

And this is the squid configuration.

cache_peer 192.168.1.1 parent 443 0 no-query ssl originserver front-end-https=on sslflags=DONT_VERIFY_PEER login=PASS ssloptions=NO_SSLv2,NO_SSLv3 name=some.dns.domain-ssl
acl dst_dns-ssl dstdomain some.dns.domain
cache_peer_access some.dns.domain-ssl deny !dst_dns-ssl !SSL_Ports
cache_peer_access some.dns.domain-ssl allow dst_dns-ssl SSL_Ports
http_access allow SSL_Ports dst_dns-ssl

Please note there is a separate file which points to the SSL cert.

Comment 51 Rich Loring 2017-01-23 17:04:18 UTC
Is there any movement on this bug? RHEL 5 is coming to an end and I am unable to move certain sites to the current squid on RHEL 7.3.

Comment 52 Thomas Gardner 2017-01-25 13:46:10 UTC
Created attachment 1244267 [details]
Another strace of what seems to me to be the same problem (squid version squid-3.5.20-2.el7.x86_64).

This was from the same customer as the previous new core I most recently attached.

Does it help anything to know that this customer reports they have no trouble whatsoever handling http requests?  The problem only starts when they start getting https requests.

This customer discovered early on that if they increase "workers" to 4, then even with only a single connection they could send all four workers into this loop (each of the four processes chewing up 100% CPU).  We didn't capture a core of that, though, so we can't really prove that it's the same loop in each of the four processes, but what are the odds that it's not?  The previous single-process core shows that this customer is hitting the same bug as the customer from the other case.

Comment 54 Luboš Uhliarik 2017-02-01 10:08:13 UTC
Hello Rich, 

I think I found a workaround. Could you please add the "no-digest" option to the cache_peer directive?

Does squid still fail after this change?
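For reference, this is what the workaround would look like applied to the cache_peer line from comment 49 (the IP and name values are the placeholders from that comment):

--------------------
cache_peer 192.168.1.1 parent 443 0 no-query no-digest ssl originserver front-end-https=on sslflags=DONT_VERIFY_PEER login=PASS ssloptions=NO_SSLv2,NO_SSLv3 name=some.dns.domain-ssl
--------------------

no-digest stops squid from requesting cache digests from this peer, which should avoid the peer_digest.cc code path visible in the looping log sequence in comment 27.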

Comment 55 Daniel Borek 2017-02-01 10:47:20 UTC
The workaround in comment 54 works. no-digest was added to cache_peer definitions in squid.conf and the problem can't be reproduced any more.

Comment 57 Joe Orton 2017-11-02 11:23:39 UTC
Since we have not managed to find a reliable reproduction recipe for this issue, and there is a known workaround, no update to Squid is planned to address this.