Description of problem:
beakerd crashes during startup due to EBADF when ConcurrentRotatingFileHandler tries to flock /var/log/beaker/beakerd.lock.

Version-Release number of selected component (if applicable):
beaker-server-0.12.0-3.el6eng.noarch
python-daemon-1.6-2.el6eng.noarch
python-concurrentloghandler-0.8.4-5.el6eng.noarch

How reproducible:
Not reliably reproducible. I have only seen this in my development environment, nowhere else. It's sensitive to timing (GC sweeps are not predictable) and also depends on which file descriptors are open during daemonization; for example, nss_ldap may cause the fd numbering to work out differently, so that the bug is not observable (or has other symptoms).

Steps to Reproduce:
1. service beakerd start

Actual results:
During beakerd startup:

Traceback (most recent call last):
  File "/usr/bin/beakerd", line 9, in <module>
    load_entry_point('bkr.server==0.12.0', 'console_scripts', 'beakerd')()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 880, in main
    schedule()
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 753, in schedule
    log.debug("starting update_dirty_jobs thread")
  File "/usr/lib64/python2.6/logging/__init__.py", line 1044, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1173, in _log
    self.handle(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1183, in handle
    self.callHandlers(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 1220, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python2.6/logging/__init__.py", line 677, in handle
    self.acquire()
  File "/usr/lib/python2.6/site-packages/cloghandler.py", line 194, in acquire
    lock(self.stream_lock, LOCK_EX)
  File "/usr/lib/python2.6/site-packages/portalocker.py", line 130, in lock
    fcntl.flock(_getfd(file), flags)
IOError: [Errno 9] Bad file descriptor

Note that this stack trace is not normally observable (stderr is closed); to capture it I modified beakerd.py as follows:

    d = daemon.DaemonContext(pidfile=pidfile.TimeoutPIDLockFile(pid_file, acquire_timeout=0),
                             stdout=open('/tmp/beakerd.out', 'w'),
                             stderr=open('/tmp/beakerd.err', 'w'))

Expected results:
beakerd should start normally.

Additional info:
After a *lot* of digging I eventually figured out this is a bad interaction between daemonization (done by daemon.DaemonContext.open) and the flocking in ConcurrentRotatingFileHandler.

One of the things DaemonContext does during daemonization is to close all open file descriptors. It does that by literally looping over every possible file descriptor and calling os.close on it, ignoring EBADF. So there is an (implicit, undocumented?) assumption that the caller has not opened any files prior to daemonization -- otherwise their file descriptors will just have been closed underneath them.

The real problem, though, is that after daemonization, when we do start opening more files, the kernel assigns them file descriptor numbers which may correspond to existing (incorrectly still-open) file objects in Python. In this particular case the problem is with the stream_lock file corresponding to /var/log/beaker/beakerd.lock. In the beakerd code we do go to some lengths to ensure logs are closed before daemonization and re-opened afterwards, but the stream_lock files were being missed because (a) the ConcurrentRotatingFileHandler.close method doesn't actually close its stream_lock, and (b) the handler is apparently held in a reference cycle somewhere, so it is not immediately garbage collected.
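For anyone unfamiliar with what DaemonContext does here, a rough sketch of that close-everything loop (this paraphrases the behaviour described above, not python-daemon's exact code; the fallback limit is an assumption):

import os
import resource

def close_all_open_files():
    # Close every fd the process could possibly have open, ignoring EBADF
    # for descriptors that were never open in the first place.
    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
    if maxfd == resource.RLIM_INFINITY:
        maxfd = 65536   # assumed fallback for the sketch
    for fd in range(maxfd):
        try:
            os.close(fd)
        except OSError:
            pass        # EBADF: this fd wasn't open

Any Python file object that was created before this runs is left holding a descriptor number that the kernel is now free to hand out again.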
So the sequence of events is:

1. beakerd starts (before daemonization)
2. beakerd calls load_config, which configures logging, which opens stream_lock files --> instance A of ConcurrentRotatingFileHandler has /var/log/beaker/beakerd.lock open as fd 4
3. beakerd calls purge_handlers, which attempts to close down all logging, *BUT* there is a cycle somewhere, so instance A is still alive with fd 4 open
4. beakerd daemonizes; all fds are closed (including fd 4) even though instance A still thinks it is open
5. beakerd calls load_config again, which creates new log handlers and opens everything up again --> instance B of ConcurrentRotatingFileHandler has /var/log/beaker/beakerd.lock open, also as fd 4, because the kernel re-used that number
6. at some arbitrary point during startup, GC is triggered and collects the cycle containing instance A, which closes its stream_lock, which closes fd 4
7. the next time logging tries to use instance B, fd 4 has been closed and its stream_lock is invalid
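A condensed sketch of that sequence, with the handler reduced to a stand-in class, a /tmp path in place of /var/log/beaker/beakerd.lock, and the fd closed directly rather than by DaemonContext (illustrative only, not the real beakerd code):

import fcntl
import gc
import os

class FakeHandler(object):
    # Stand-in for ConcurrentRotatingFileHandler: all that matters here is
    # that it holds a lock file open via self.stream_lock.
    def __init__(self, path):
        self.stream_lock = open(path, 'a')

LOCK = '/tmp/beakerd.lock'   # stand-in for /var/log/beaker/beakerd.lock

# Steps 2-3: first load_config(); purge_handlers() misses instance A because
# a reference cycle keeps it alive.
a = FakeHandler(LOCK)
cycle = [a]
cycle.append(cycle)
fd = a.stream_lock.fileno()
a = None   # no direct reference left, but the cycle still holds instance A

# Step 4: daemonization closes every fd out from under the survivors
# (roughly what daemon.DaemonContext.open() does).
os.close(fd)

# Step 5: second load_config(); instance B is handed the same fd number back.
b = FakeHandler(LOCK)
assert b.stream_lock.fileno() == fd

# Step 6: an arbitrary later GC sweep collects the cycle; finalizing instance
# A's stream_lock closes that fd -- which now belongs to instance B.
cycle = None
gc.collect()

# Step 7: instance B's stream_lock still looks open to Python, but the
# underlying fd is gone, so the next flock fails exactly as in the traceback.
fcntl.flock(b.stream_lock.fileno(), fcntl.LOCK_EX)   # [Errno 9] Bad file descriptor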
The bug is *probably* that ConcurrentRotatingFileHandler doesn't close its stream_lock in the close method, but adding code to do that causes other problems elsewhere due to the (IMHO bizarre) sequence of acquire/release that the logging module calls on its handlers. Given all the ridiculous complexity, inefficiency, and bugginess that ConcurrentRotatingFileHandler and logging in general have caused us, I would be strongly in favour of eliminating it all and sending everything unconditionally to syslog. We can ship a config in /etc/rsyslog.d to mimic the existing file locations, and a config in /etc/logrotate.d to handle rotation. I think I've mentioned this idea previously in other bugs, but now is the time to actually do it.
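For what it's worth, the Python side of that would be small; a rough sketch using the standard library's SysLogHandler (the facility, socket path, logger name, and format string are assumptions for illustration, not a settled design):

import logging
from logging.handlers import SysLogHandler

# Send beakerd's logging to the local syslog daemon instead of to files.
# rsyslog (via a file dropped into /etc/rsyslog.d) would route these messages
# to whatever files we want, and /etc/logrotate.d would handle rotation.
handler = SysLogHandler(address='/dev/log', facility=SysLogHandler.LOG_DAEMON)
handler.setFormatter(logging.Formatter('beakerd: %(levelname)s %(name)s %(message)s'))

log = logging.getLogger('bkr.server')
log.addHandler(handler)
log.setLevel(logging.DEBUG)

log.debug("starting update_dirty_jobs thread")   # goes to syslog, no flock involved

No lock files, no rotation logic in-process, and nothing for daemonization to close out from under us.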
To get beakerd going in my environment I used the following patches. I consider this just a workaround, though; the real fix should be to stop using ConcurrentLogHandler.

diff --git a/Server/bkr/server/tools/beakerd.py b/Server/bkr/server/tools/beakerd.py
index 701b753..31bf535 100644
--- a/Server/bkr/server/tools/beakerd.py
+++ b/Server/bkr/server/tools/beakerd.py
@@ -811,14 +811,12 @@ def sigterm_handler(signal, frame):
     raise SystemExit("received SIGTERM")
 
 def purge_handlers():
-    #shutdown logging subsystem
-    logging.shutdown()
-
     # Remove handlers
     for (_, logger) in logging.root.manager.loggerDict.items():
         if hasattr(logger, 'handlers'):
             for handler in logger.handlers:
                 logger.removeHandler(handler)
+                handler.close()
 
     #clear out logging's internal handler list
     logging._handlerList = []

diff -ur ConcurrentLogHandler-0.8.4.orig/src/cloghandler.py ConcurrentLogHandler-0.8.4/src/cloghandler.py
--- ConcurrentLogHandler-0.8.4.orig/src/cloghandler.py 2009-11-03 04:40:37.000000000 +1000
+++ ConcurrentLogHandler-0.8.4/src/cloghandler.py 2013-04-17 12:56:37.230730149 +1000
@@ -208,6 +208,7 @@
         if not self.stream.closed:
             self.stream.flush()
             self.stream.close()
+        self.stream_lock.close()
         Handler.close(self)
 
     def flush(self):
+1 for moving to syslog-based logging and ditching ConcurrentLogHandler (it's not like we intend to ever support running the server components on anything other than Linux).

If it is a cyclic garbage problem, does calling gc.collect() 3 times (to trigger at least 3 generations of collection) after purging the handlers (but prior to daemonisation) make the misbehaviour go away?
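A sketch of what that experiment would look like in beakerd's startup path (the surrounding calls are paraphrased from the sequence described above, not the exact code):

import gc

# ... during beakerd startup, before daemonizing ...
purge_handlers()          # drop and close all existing log handlers

# Force a full collection a few times so any handlers kept alive only by
# reference cycles are finalized *now*, while the fds they hold are still
# theirs, rather than at some arbitrary point after daemonization.
for _ in range(3):
    gc.collect()

# ... then daemonize (daemon.DaemonContext(...).open()) and call load_config()
# again to re-create the handlers ...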
Acked with the caveat to look into Ray's observation that this may just be a bug due to modifying the "logger.handlers" list while iterating over it. The migration to syslog should be proposed as a separate RFE.
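For reference, the usual fix for that kind of mutate-while-iterating bug is to iterate over a snapshot of the handler list; a sketch of the general pattern (not necessarily what the Gerrit change does):

import logging

def purge_handlers():
    for logger in logging.root.manager.loggerDict.values():
        if not hasattr(logger, 'handlers'):
            continue    # PlaceHolder entries have no handlers
        # Iterate over a copy so that removeHandler(), which mutates
        # logger.handlers, cannot cause entries to be skipped mid-loop.
        for handler in list(logger.handlers):
            logger.removeHandler(handler)
            handler.close()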
Ray's patch at http://gerrit.beaker-project.org/1647 doesn't fix this issue. So we can do the quick and dodgy fix from comment 2 (which includes patching ConcurrentLogHandler again, blech) or we can just switch to syslog.
RFE for switching to syslog is bug 963542. We should go ahead with the fix in comment 2 as well.
The Beaker side is on Gerrit: http://gerrit.beaker-project.org/1945

Patch for ConcurrentLogHandler is in dist-git:
http://pkgs.fedoraproject.org/cgit/python-concurrentloghandler.git/commit/?id=17ea3d7735846045079f3fbc004128f3e2221e06

Fedora builds of ConcurrentLogHandler are here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=419464
http://koji.fedoraproject.org/koji/buildinfo?buildID=419461
http://koji.fedoraproject.org/koji/buildinfo?buildID=419462
http://koji.fedoraproject.org/koji/buildinfo?buildID=419463
Beaker 0.13.1 has been released.