Description of problem:
beakerd crashes during startup due to EBADF when ConcurrentRotatingFileHandler tries to flock /var/log/beaker/beakerd.lock.

Version-Release number of selected component (if applicable):
beaker-server-0.12.0-3.el6eng.noarch
python-daemon-1.6-2.el6eng.noarch
python-concurrentloghandler-0.8.4-5.el6eng.noarch

How reproducible:
Not reliably reproducible. I have only seen this in my development environment, nowhere else. It's sensitive to timing (GC sweeps are not predictable) and also depends on which file descriptors are open during daemonization; for example, nss_ldap may cause the fd numbering to work out differently, so that the bug is not observable (or has other symptoms).

Steps to Reproduce:
1. service beakerd start

Actual results:
During beakerd startup:

Traceback (most recent call last):
  File "/usr/bin/beakerd", line 9, in <module>
    load_entry_point('bkr.server==0.12.0', 'console_scripts', 'beakerd')()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 880, in main
    schedule()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 753, in schedule
    log.debug("starting update_dirty_jobs thread")
  File "/usr/lib64/python2.6/logging/__init__.py", line 1044, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1173, in _log
    self.handle(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1183, in handle
    self.callHandlers(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1220, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 677, in handle
    self.acquire()
  File "/usr/lib/python2.6/site-packages/cloghandler.py", line 194, in acquire
    lock(self.stream_lock, LOCK_EX)
  File "/usr/lib/python2.6/site-packages/portalocker.py", line 130, in lock
    fcntl.flock(_getfd(file), flags)
IOError: [Errno 9] Bad file descriptor

Note that this stack trace is not normally observable (stderr is closed); to capture it I modified beakerd.py as follows:

    d = daemon.DaemonContext(pidfile=pidfile.TimeoutPIDLockFile(pid_file, acquire_timeout=0),
                             stdout=open('/tmp/beakerd.out', 'w'),
                             stderr=open('/tmp/beakerd.err', 'w'))

Expected results:
beakerd should start normally.

Additional info:
After a *lot* of digging I eventually figured out this is a bad interaction between daemonization (done by daemon.DaemonContext.open) and the flocking in ConcurrentRotatingFileHandler.

One of the things DaemonContext does during daemonization is to close all open file descriptors. It does that by literally looping over every possible file descriptor and calling os.close on it, ignoring EBADF. So there is an (implicit, undocumented?) assumption that the caller has not opened any files prior to daemonization -- otherwise their file descriptors will just have been closed underneath them.

The real problem, though, is that after daemonization, when we do start opening more files, the kernel assigns them file descriptor numbers which may correspond to existing (incorrectly still-open) file objects in Python. In this particular case the problem is with the stream_lock file corresponding to /var/log/beaker/beakerd.lock. In the beakerd code we do go to some lengths to ensure logs are closed before daemonization and re-opened afterwards, but the stream_lock files were being missed because (a) the ConcurrentRotatingFileHandler.close method doesn't actually close its stream_lock, and (b) the handler is apparently held in a reference cycle somewhere, so it is not immediately garbage collected.
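For anyone unfamiliar with what DaemonContext does here, a rough sketch of that close-everything loop (this paraphrases the behaviour described above, not python-daemon's exact code; the fallback limit is an assumption):

import os
import resource

def close_all_open_files():
    # Close every fd the process could possibly have open, ignoring EBADF
    # for descriptors that were never open in the first place.
    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
    if maxfd == resource.RLIM_INFINITY:
        maxfd = 65536   # assumed fallback for the sketch
    for fd in range(maxfd):
        try:
            os.close(fd)
        except OSError:
            pass        # EBADF: this fd wasn't open

Any Python file object that was created before this runs is left holding a descriptor number that the kernel is now free to hand out again.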
So the sequence of events is:

1. beakerd starts (before daemonization)
2. beakerd calls load_config, which configures logging, which opens stream_lock files --> instance A of ConcurrentRotatingFileHandler has /var/log/beaker/beakerd.lock open as fd 4
3. beakerd calls purge_handlers, which attempts to close down all logging, *BUT* there is a cycle somewhere, so instance A is still alive with fd 4 open
4. beakerd daemonizes; all fds are closed (including fd 4) even though instance A still thinks it is open
5. beakerd calls load_config again, which creates new log handlers and opens everything up again --> instance B of ConcurrentRotatingFileHandler has /var/log/beaker/beakerd.lock open, also as fd 4, because the kernel re-used that number
6. at some arbitrary point during startup, GC is triggered and collects the cycle containing instance A, which closes its stream_lock, which closes fd 4
7. the next time logging tries to use instance B, fd 4 has been closed and its stream_lock is invalid
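A condensed sketch of that sequence, with the handler reduced to a stand-in class, a /tmp path in place of /var/log/beaker/beakerd.lock, and the fd closed directly rather than by DaemonContext (illustrative only, not the real beakerd code):

import fcntl
import gc
import os

class FakeHandler(object):
    # Stand-in for ConcurrentRotatingFileHandler: all that matters here is
    # that it holds a lock file open via self.stream_lock.
    def __init__(self, path):
        self.stream_lock = open(path, 'a')

LOCK = '/tmp/beakerd.lock'   # stand-in for /var/log/beaker/beakerd.lock

# Steps 2-3: first load_config(); purge_handlers() misses instance A because
# a reference cycle keeps it alive.
a = FakeHandler(LOCK)
cycle = [a]
cycle.append(cycle)
fd = a.stream_lock.fileno()
a = None   # no direct reference left, but the cycle still holds instance A

# Step 4: daemonization closes every fd out from under the survivors
# (roughly what daemon.DaemonContext.open() does).
os.close(fd)

# Step 5: second load_config(); instance B is handed the same fd number back.
b = FakeHandler(LOCK)
assert b.stream_lock.fileno() == fd

# Step 6: an arbitrary later GC sweep collects the cycle; finalizing instance
# A's stream_lock closes that fd -- which now belongs to instance B.
cycle = None
gc.collect()

# Step 7: instance B's stream_lock still looks open to Python, but the
# underlying fd is gone, so the next flock fails exactly as in the traceback.
fcntl.flock(b.stream_lock.fileno(), fcntl.LOCK_EX)   # [Errno 9] Bad file descriptor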
The bug is *probably* that ConcurrentRotatingFileHandler doesn't close its stream_lock in the close method, but adding code to do that causes other problems elsewhere due to the (IMHO bizarre) sequence of acquire/release that the logging module calls on its handlers. Given all the ridiculous complexity, inefficiency, and bugginess that ConcurrentRotatingFileHandler and logging in general have caused us, I would be strongly in favour of eliminating it all and sending everything unconditionally to syslog. We can ship a config in /etc/rsyslog.d to mimic the existing file locations, and a config in /etc/logrotate.d to handle rotation. I think I've mentioned this idea previously in other bugs, but now is the time to actually do it.
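For what it's worth, the Python side of that would be small; a rough sketch using the standard library's SysLogHandler (the facility, socket path, logger name, and format string are assumptions for illustration, not a settled design):

import logging
from logging.handlers import SysLogHandler

# Send beakerd's logging to the local syslog daemon instead of to files.
# rsyslog (via a file dropped into /etc/rsyslog.d) would route these messages
# to whatever files we want, and /etc/logrotate.d would handle rotation.
handler = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_DAEMON)
handler.setFormatter(logging.Formatter('beakerd: %(levelname)s %(name)s %(message)s'))

log = logging.getLogger('bkr.server')
log.addHandler(handler)
log.setLevel(logging.DEBUG)

log.debug("starting update_dirty_jobs thread")   # goes to syslog, no flock involved

No lock files, no rotation logic in-process, and nothing for daemonization to close out from under us.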
To get beakerd going in my environment I used the following patches. I consider this just a workaround, though; the real fix should be to stop using ConcurrentLogHandler.

diff --git a/Server/bkr/server/tools/beakerd.py b/Server/bkr/server/tools/beakerd.py
index 701b753..31bf535 100644
--- a/Server/bkr/server/tools/beakerd.py
+++ b/Server/bkr/server/tools/beakerd.py
@@ -811,14 +811,12 @@ def sigterm_handler(signal, frame):
     raise SystemExit("received SIGTERM")
 
 def purge_handlers():
-    #shutdown logging subsystem
-    logging.shutdown()
-
     # Remove handlers
     for (_, logger) in logging.root.manager.loggerDict.items():
         if hasattr(logger, 'handlers'):
             for handler in logger.handlers:
                 logger.removeHandler(handler)
+                handler.close()
 
     #clear out logging's internal handler list
     logging._handlerList = []

diff -ur ConcurrentLogHandler-0.8.4.orig/src/cloghandler.py ConcurrentLogHandler-0.8.4/src/cloghandler.py
--- ConcurrentLogHandler-0.8.4.orig/src/cloghandler.py 2009-11-03 04:40:37.000000000 +1000
+++ ConcurrentLogHandler-0.8.4/src/cloghandler.py 2013-04-17 12:56:37.230730149 +1000
@@ -208,6 +208,7 @@
         if not self.stream.closed:
             self.stream.flush()
             self.stream.close()
+        self.stream_lock.close()
         Handler.close(self)
 
     def flush(self):
+1 for moving to syslog-based logging and ditching ConcurrentLogHandler (it's not like we intend to ever support running the server components on anything other than Linux).

If it is a cyclic garbage problem, does calling gc.collect() 3 times (to trigger at least 3 generations of collection) after purging the handlers (but prior to daemonisation) make the misbehaviour go away?
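A sketch of what that experiment would look like in beakerd's startup path (the surrounding calls are paraphrased from the sequence described above, not the exact code):

import gc

# ... during beakerd startup, before daemonizing ...
purge_handlers()          # drop and close all existing log handlers

# Force a full collection a few times so any handlers kept alive only by
# reference cycles are finalized *now*, while the fds they hold are still
# theirs, rather than at some arbitrary point after daemonization.
for _ in range(3):
    gc.collect()

# ... then daemonize (daemon.DaemonContext(...).open()) and call load_config()
# again to re-create the handlers ...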
Acked with the caveat to look into Ray's observation that this may just be a bug due to modifying the "logger.handlers" list while iterating over it. The migration to syslog should be proposed as a separate RFE.
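For reference, the usual fix for that kind of mutate-while-iterating bug is to iterate over a snapshot of the handler list; a sketch of the general pattern (not necessarily what the Gerrit change does):

import logging

def purge_handlers():
    for logger in logging.root.manager.loggerDict.values():
        if not hasattr(logger, 'handlers'):
            continue    # PlaceHolder entries have no handlers
        # Iterate over a copy so that removeHandler(), which mutates
        # logger.handlers, cannot cause entries to be skipped mid-loop.
        for handler in list(logger.handlers):
            logger.removeHandler(handler)
            handler.close()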
Ray's patch at http://gerrit.beaker-project.org/1647 doesn't fix this issue. So we can do the quick and dodgy fix from comment 2 (which includes patching ConcurrentLogHandler again, blech) or we can just switch to syslog.
RFE for switching to syslog is bug 963542. We should go ahead with the fix in comment 2 as well.
The Beaker side is on Gerrit: http://gerrit.beaker-project.org/1945

Patch for ConcurrentLogHandler is in dist-git:
http://pkgs.fedoraproject.org/cgit/python-concurrentloghandler.git/commit/?id=17ea3d7735846045079f3fbc004128f3e2221e06

Fedora builds of ConcurrentLogHandler are here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=419464
http://koji.fedoraproject.org/koji/buildinfo?buildID=419461
http://koji.fedoraproject.org/koji/buildinfo?buildID=419462
http://koji.fedoraproject.org/koji/buildinfo?buildID=419463
Beaker 0.13.1 has been released.