Bug 991269 - beaker-watchdog dies if it cannot connect to the server
Status: ASSIGNED
Product: Beaker
Classification: Community
Component: lab controller
Version: 0.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: future_maint
Assigned To: Dan Callaghan
QA Contact: tools-bugs
Keywords: Patch
Reported: 2013-08-01 23:26 EDT by Dan Callaghan
Modified: 2018-02-05 19:41 EST
Doc Type: Bug Fix
Type: Bug
Description Dan Callaghan 2013-08-01 23:26:39 EDT
beaker-watchdog is structured as a polling loop (of sorts), like beaker-provision. But if the server goes down or becomes unreachable while beaker-watchdog is running, the daemon dies completely instead of retrying on each loop iteration until the server comes back.

Steps to reproduce:
1. Set up beaker-watchdog to run happily
2. Isolate beaker-watchdog from the server (e.g. stop httpd on the server, or use iptables)
3. Wait for the retry period to expire

Eventually the daemon dies like this:

Traceback (most recent call last):
  File "src/bkr/labcontroller/watchdog.py", line 127, in <module>
    main()
  File "src/bkr/labcontroller/watchdog.py", line 115, in main
    main_loop(watchdog, conf)
  File "src/bkr/labcontroller/watchdog.py", line 39, in main_loop
    watchdog.hub._login()
  File "/usr/lib/python2.6/site-packages/kobo/client/__init__.py", line 206, in _login
    if force or self._hub.auth.renew_session():
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/home/dcallagh/work/beaker/LabController/src/bkr/labcontroller/proxy.py", line 54, in request
    result = transport_class.request(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 234, in _request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 41, in connect
    httplib.HTTPConnection.connect(self)
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused

See also bug 734850 and
http://git.beaker-project.org/cgit/beaker/commit/?id=e92b6e9951e37db1c1e7d9c5c721edd4305291e4

Simple fix is to expand the exception types caught in main_loop at the bottom of the while loop. It might also be worth porting beaker-watchdog to use gevent, which makes it easy to write these kinds of polling loops with suitable error handling (as in beaker-provision).
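
A rough sketch of the "simple fix" idea, not the actual patch: the loop body is wrapped so that connection-level errors are logged and the next iteration retries. The names run_polling_loop, poll_once and TRANSIENT_ERRORS are hypothetical, and the set of exception types is an assumption. The sketch is written for Python 3; on the Python 2.6 shown in the traceback the module is xmlrpclib rather than xmlrpc.client.

```python
import logging
import socket
import time
import xmlrpc.client  # xmlrpclib on Python 2

log = logging.getLogger(__name__)

# Errors treated as "server temporarily unreachable". This tuple is an
# assumption for illustration, not the set used by any real fix.
TRANSIENT_ERRORS = (socket.error, xmlrpc.client.ProtocolError)


def run_polling_loop(poll_once, interval=20, max_iterations=None,
                     sleep=time.sleep):
    """Call poll_once() repeatedly, surviving transient connection errors.

    Runs forever when max_iterations is None. Returns the number of
    iterations that failed with a transient error.
    """
    failures = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            poll_once()
        except TRANSIENT_ERRORS:
            # Log the traceback and fall through to the sleep, so the
            # next iteration tries again instead of killing the daemon.
            failures += 1
            log.exception('Server unreachable, retrying on next iteration')
        sleep(interval)
    return failures
```

In this sketch an "Errno 111 Connection refused" like the one above would be caught as socket.error and only cost one iteration. Any exception type outside the tuple would still terminate the loop.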
Comment 3 matt jia 2016-03-29 20:05:05 EDT
On Gerrit:

   http://gerrit.beaker-project.org/#/c/4764/
Comment 4 Dan Callaghan 2017-10-19 19:15:57 EDT
This is getting urgent... On beaker-devel, now that we are using the OpenStack integration more heavily, beaker-watchdog is dying regularly. I suspect the cause is a failure to fetch OpenStack console logs through the server. Unfortunately, since we are on RHEL6 (without systemd), the traceback on stderr is lost, and we can't do any automatic restart logic either. Sigh.
Comment 5 Dan Callaghan 2017-10-20 02:15:42 EDT
Found the cause of the current crashes: bug 1504527. But we should really get this fixed too, to make beaker-watchdog more resilient instead of dying with an error on stderr that we will never see.
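
For illustration only, one way to avoid losing the fatal traceback is to route it through the logging module before exiting, so that whatever handler is configured (file, syslog) records it even when stderr goes nowhere. The helper name run_and_log_fatal and the logger name are hypothetical; this is a sketch of the idea, not how beaker-watchdog is written.

```python
import logging

log = logging.getLogger('beaker-watchdog-sketch')  # hypothetical logger name


def run_and_log_fatal(main):
    """Run main(); if it raises, log the traceback and return exit status 1."""
    try:
        main()
    except Exception:
        # log.exception() records the message at ERROR level together
        # with the current traceback.
        log.exception('Unhandled exception, daemon is exiting')
        return 1
    return 0
```

A daemon entry point could then end with sys.exit(run_and_log_fatal(main)), assuming a log handler has been set up beforehand.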