Bug 855111

Summary: Cache rebuild goes wild and fills disk
Product: [Fedora] Fedora Reporter: Petr Tuma <petr.tuma>
Component: squidAssignee: Michal Luscon <mluscon>
Status: CLOSED RAWHIDE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: henrik, jonathansteffan, mluscon, stadtkind2, thozza
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-27 11:18:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Petr Tuma 2012-09-06 18:02:55 UTC
Description of problem: A small local installation of squid (one sibling, aufs storage) works just fine across multiple days, startups and shutdowns, until, on one of the startups, the cache rebuild process starts, goes wild and fills the disk.

This is what is in the log (cache.log):

2012/09/06 19:17:33 kid1| Startup: Initializing Authentication Schemes ...
2012/09/06 19:17:33 kid1| Startup: Initialized Authentication Scheme 'basic'
2012/09/06 19:17:33 kid1| Startup: Initialized Authentication Scheme 'digest'
2012/09/06 19:17:33 kid1| Startup: Initialized Authentication Scheme 'negotiate'
2012/09/06 19:17:33 kid1| Startup: Initialized Authentication Scheme 'ntlm'
2012/09/06 19:17:33 kid1| Startup: Initialized Authentication.
2012/09/06 19:17:33 kid1| Processing Configuration File: /etc/squid/squid.conf (depth 0)
2012/09/06 19:17:33 kid1| Initializing https proxy context
2012/09/06 19:17:33 kid1| Logfile: opening log daemon:/var/log/squid/access.log
2012/09/06 19:17:33 kid1| Logfile Daemon: opening log /var/log/squid/access.log
2012/09/06 19:17:33 kid1| Squid plugin modules loaded: 0
2012/09/06 19:17:33 kid1| Adaptation support is off.
2012/09/06 19:17:33 kid1| Store logging disabled
2012/09/06 19:17:33 kid1| DNS Socket created at [::], FD 10
2012/09/06 19:17:33 kid1| DNS Socket created at 0.0.0.0, FD 11
2012/09/06 19:17:33 kid1| Adding domain tmatmouci.cz from /etc/resolv.conf
2012/09/06 19:17:33 kid1| Adding domain tmatmouci.cz from /etc/resolv.conf
2012/09/06 19:17:33 kid1| Adding nameserver 192.168.0.1 from /etc/resolv.conf
2012/09/06 19:17:33 kid1| HTCP Disabled.
2012/09/06 19:17:33 kid1| Configuring Sibling 192.168.0.1/3128/3130
2012/09/06 19:17:33 kid1| Loaded Icons.
2012/09/06 19:17:33 kid1| Accepting HTTP Socket connections at local=127.0.0.1:3128 remote=[::] FD 13 flags=9
2012/09/06 19:17:33 kid1| Accepting ICP messages on [::]:3130
2012/09/06 19:17:33 kid1| Sending ICP messages from [::]:3130
2012/09/06 19:17:38 kid1| WARNING: Disk space over limit: 610715120.00 KB > 67108864 KB
2012/09/06 19:17:42 kid1| Store rebuilding is 963997.35% complete
2012/09/06 19:17:49 kid1| WARNING: Disk space over limit: 615944012.00 KB > 67108864 KB
2012/09/06 19:17:57 kid1| Store rebuilding is 975840.41% complete
2012/09/06 19:18:00 kid1| WARNING: Disk space over limit: 621642344.00 KB > 67108864 KB
2012/09/06 19:18:11 kid1| WARNING: Disk space over limit: 626752920.00 KB > 67108864 KB
2012/09/06 19:18:12 kid1| Store rebuilding is 987658.22% complete
...
2012/09/06 19:32:27 kid1| WARNING: Disk space over limit: 835379880.00 KB > 67108864 KB
2012/09/06 19:32:30 kid1| Store rebuilding is 1316233.70% complete
2012/09/06 19:32:38 kid1| WARNING: Disk space over limit: 835691972.00 KB > 67108864 KB
2012/09/06 19:32:45 kid1| Store rebuilding is 1317622.55% complete
...

The store configuration is:

maximum_object_size 1 GB
cache_dir aufs /var/spool/squid 65536 16 256

Some additional observations. The content of the storage directory (minus swap.state) is not really over the limit, du reports around 1.5 GB occupied. The disk is filled by the growing swap.state file, in this particular occurence it grew to over 30 GB and kept growing. Shutting down squid, deleting the swap.state file and starting up squid solves the problem (until next occurence). Prior to the problem, squid was shut down cleanly (and likely there was no unclean shutdown before that either).

Version-Release number of selected component (if applicable):

squid-3.2.0.16-1.fc17.x86_64
all other packages current with Fedora 17
quad core machine

How reproducible:

On this machine, the problem happens once every few weeks, slightly more often than once a month.

Comment 1 Henrik Nordström 2012-09-06 20:43:58 UTC
This is known to happen when swap.state gets corrupted for some reason. But swap.state corruption is not easily seen unless the machine as such crashes, so this is a quite unexpected.

Comment 2 Petr Tuma 2012-09-10 06:37:16 UTC
I've scanned the logs several months back and see no unclean shutdown. Is it possible to manually detect whether swap.state is corrupted, so that I can perhaps track down when the corruption happens (sooner than when I see the disk getting filled to maximum capacity) ?

Comment 3 Stefan Krüger 2012-09-30 09:06:44 UTC
I see the same thing with F17, systemd doesn't seem to honor squids *long* shutdown time (up to 30 seconds) and after every reboot I get warnings like:

2012/09/30 01:57:51 kid1| WARNING: Disk space over limit: 5278968.00 KB > 209715
2 KB

in /var/log/squid/cache.log.

Stopping squid, removing /var/spool/squid/swap.state and then restarting squid again helped but that's just a workaround.

Comment 4 Petr Tuma 2012-09-30 13:58:01 UTC
I confirm the observation above. Another pass through the logs suggests that while clean shutdown was always initiated, it sometimes did not run to completion (the log says "Preparing for shutdown" but other common shutdown messages do not follow).

What is strange is that default systemd shutdown timeout for the squid service should be 90 seconds, well above the 30 seconds mentioned above. I'm now trying with "TimeoutSec=0" added to /lib/systemd/system/squid.service to see if that helps.

Comment 5 Henrik Nordström 2012-10-01 08:48:38 UTC
For a large Squid cache 90 seconds may not be enough.

The 30 second mentioned is how log Squid waits for existing connections to finish before it shuts down for real. After that is the actual shutdown procedure which can take quite some time if either the cache is large or your system have swapped out significant parts of Squid.

Comment 6 Petr Tuma 2012-10-07 08:45:47 UTC
Summary of the problem: Squid shutdown easily takes more than one minute, even if it is not particularly active. Systemd does not wait for Squid to shut down 
properly. When killed, Squid can leave its cache files in an inconsistent state, which cause it to go wild and use up all disk space on some of the next starts.

Current solution that seems to work for me: Wait for proper Squid shutdown in Systemd. Change the ExecStop line and add the next two lines in /lib/systemd/system/squid.service:

ExecStop=/bin/sh -c "/usr/sbin/squid -k shutdown && while kill -0 $MAINPID ; do sleep 1 ; done"
PIDFile=/run/squid.pid
TimeoutSec=300

What I do not like on the current solution: The active wait for the Squid process to disappear. I could not find any better way to do this in Systemd, but I could easily be missing something. Squid itself apparently does not offer any way of shutting down synchronously (if it did, it could simply be used in ExecStop instead of the asynchronous shutdown command).

Finally, there is a SQUID_SHUTDOWN_TIMEOUT variable in /etc/sysconfig/squid, which is apparently not used anywhere. Perhaps we could do something like TimeoutSec=$SQUID_SHUTDOWN_TIMEOUT in the Systemd service file, but I'm not sure if Systemd expands shell variables in other than the Exec lines.

Comment 7 Petr Tuma 2012-12-19 06:57:52 UTC
This bug is still present in the current Squid package in Fedora 18 (ExecStop command in /lib/systemd/system/squid.service kills Squid rather than shutting it down properly).

Comment 8 Fedora End Of Life 2013-12-21 08:50:27 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Fedora End Of Life 2014-02-05 12:08:58 UTC
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 10 Michal Luscon 2014-06-27 11:18:21 UTC
Resolved in the current rawhide version of squid -> timeout option is now set to zero(unlimited) as it can't be predicted in advance.