Beaker-watchdog is structured as a polling loop (of sorts), like beaker-provision. But if the server goes down or becomes unreachable while beaker-watchdog is running, the daemon dies completely instead of retrying each loop iteration until the server comes back.

Steps to reproduce:
1. Set up beaker-watchdog to run happily
2. Isolate beaker-watchdog from the server (e.g. stop httpd on the server, or use iptables)
3. Wait for the retry period to expire

Eventually the daemon dies like this:

Traceback (most recent call last):
  File "src/bkr/labcontroller/watchdog.py", line 127, in <module>
    main()
  File "src/bkr/labcontroller/watchdog.py", line 115, in main
    main_loop(watchdog, conf)
  File "src/bkr/labcontroller/watchdog.py", line 39, in main_loop
    watchdog.hub._login()
  File "/usr/lib/python2.6/site-packages/kobo/client/__init__.py", line 206, in _login
    if force or self._hub.auth.renew_session():
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/home/dcallagh/work/beaker/LabController/src/bkr/labcontroller/proxy.py", line 54, in request
    result = transport_class.request(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 234, in _request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 41, in connect
    httplib.HTTPConnection.connect(self)
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused

See also bug 734850 and http://git.beaker-project.org/cgit/beaker/commit/?id=e92b6e9951e37db1c1e7d9c5c721edd4305291e4

Simple fix is to expand the exception types caught in main_loop at the bottom of the while loop. It might also be worth porting beaker-watchdog to use gevent, which makes it easy to write these kinds of polling loops with suitable error handling (as in beaker-provision).
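
For illustration, a rough sketch of what a more forgiving loop could look like. This is not the actual beaker-watchdog code: the function signature, the poll_once and should_stop callables, and the retry_delay default are all made up for the example, and it is written for Python 3 rather than the Python 2.6 shown in the traceback. The idea is just that any ordinary exception raised during one iteration (connection refused, XML-RPC faults, and so on) gets logged and the loop carries on, while KeyboardInterrupt and SystemExit still stop the daemon.

```python
import logging
import time

logger = logging.getLogger("watchdog-sketch")


def main_loop(poll_once, should_stop, retry_delay=20.0, sleep=time.sleep):
    """Call poll_once() until should_stop() returns True.

    All names here are hypothetical, not taken from watchdog.py.
    Returns the number of iterations that failed with an exception.
    """
    failures = 0
    while not should_stop():
        try:
            poll_once()
        except Exception:
            # Catching Exception covers socket errors and XML-RPC errors alike.
            # KeyboardInterrupt and SystemExit do not derive from Exception,
            # so the daemon can still be shut down normally.
            failures += 1
            logger.exception("Polling iteration failed, will retry")
        # Wait before the next iteration, whether or not this one succeeded.
        sleep(retry_delay)
    return failures
```

Passing sleep in as a parameter is only there so the loop can be exercised without real delays. A real fix would probably want to catch a narrower set of exception types, as suggested above.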
On Gerrit: http://gerrit.beaker-project.org/#/c/4764/
This is getting urgent... On beaker-devel, now that we are using the OpenStack integration more heavily, beaker-watchdog is dying regularly. I suspect the cause is a failure to fetch OpenStack console logs through the server. Unfortunately, since we are on RHEL6 (without systemd), the traceback on stderr is lost, and we can't do any automatic restart logic either. Sigh.
Found the cause of the current crashes: bug 1504527. But we should really get this fixed too, to make beaker-watchdog more resilient instead of dying with an error on stderr that we will never see.
https://gerrit.beaker-project.org/#/c/beaker/+/6240
Beaker 26.0 has been released.