Bug 991269

Summary: beaker-watchdog dies if it cannot connect to the server
Product: [Retired] Beaker
Reporter: Dan Callaghan <dcallagh>
Component: lab controller
Assignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE
QA Contact: tools-bugs <tools-bugs>
Severity: unspecified
Priority: unspecified
Version: 0.13
CC: dcallagh, mjia, qwan, tools-bugs, xtian
Target Milestone: 26.0
Keywords: Patch
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2018-10-08 02:16:46 UTC

Description Dan Callaghan 2013-08-02 03:26:39 UTC
Beaker-watchdog is structured as a polling loop (of sorts), like beaker-provision. But if the server goes down or becomes unreachable while beaker-watchdog is running, the daemon dies completely instead of retrying each loop iteration until the server comes back.

Steps to reproduce:
1. Set up beaker-watchdog to run happily
2. Isolate beaker-watchdog from the server (e.g. stop httpd on the server, or use iptables)
3. Wait for the retry period to expire

Eventually the daemon dies like this:

Traceback (most recent call last):
  File "src/bkr/labcontroller/watchdog.py", line 127, in <module>
    main()
  File "src/bkr/labcontroller/watchdog.py", line 115, in main
    main_loop(watchdog, conf)
  File "src/bkr/labcontroller/watchdog.py", line 39, in main_loop
    watchdog.hub._login()
  File "/usr/lib/python2.6/site-packages/kobo/client/__init__.py", line 206, in _login
    if force or self._hub.auth.renew_session():
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/home/dcallagh/work/beaker/LabController/src/bkr/labcontroller/proxy.py", line 54, in request
    result = transport_class.request(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 234, in _request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib/python2.6/site-packages/kobo/xmlrpc.py", line 41, in connect
    httplib.HTTPConnection.connect(self)
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused

See also bug 734850 and
http://git.beaker-project.org/cgit/beaker/commit/?id=e92b6e9951e37db1c1e7d9c5c721edd4305291e4

A simple fix is to expand the exception types caught in main_loop at the bottom of the while loop, so that a connection failure is logged and the next iteration retries instead of the exception propagating out of main(). It might also be worth porting beaker-watchdog to use gevent, which makes it easy to write these kinds of polling loops with suitable error handling (as in beaker-provision).
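
A minimal sketch of the "catch more exception types and retry" idea follows. It is not the actual watchdog.py code: poll_once, interval, sleep and max_iterations are hypothetical names introduced for illustration, and the real loop would also need to catch whatever XML-RPC errors the hub proxy raises.

```python
import logging
import socket
import time

logger = logging.getLogger("watchdog-sketch")  # hypothetical logger name


def main_loop(poll_once, interval=20, sleep=time.sleep, max_iterations=None):
    """Call poll_once() repeatedly, surviving errors raised by one iteration.

    poll_once stands in for the per-iteration work (session renewal,
    checking watchdogs, ...). max_iterations exists only so that the
    sketch can be exercised in a test; a daemon would leave it as None.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            poll_once()
        except socket.error as e:
            # Server down or unreachable (e.g. "[Errno 111] Connection
            # refused"): log it and try again on the next iteration.
            logger.warning("Server unreachable, will retry: %s", e)
        except Exception:
            # Anything else unexpected: record the full traceback and
            # keep the daemon alive. KeyboardInterrupt and SystemExit are
            # not subclasses of Exception, so they still stop the loop.
            logger.exception("Unexpected error in polling loop, will retry")
        sleep(interval)
```

With this shape, a refused connection costs one skipped iteration rather than the whole daemon.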

Comment 3 matt jia 2016-03-30 00:05:05 UTC
On Gerrit:

   http://gerrit.beaker-project.org/#/c/4764/

Comment 4 Dan Callaghan 2017-10-19 23:15:57 UTC
This is getting urgent... On beaker-devel, now that we are using the OpenStack integration more heavily, beaker-watchdog is dying regularly. I suspect the cause might be a failure to fetch OpenStack console logs through the server. Unfortunately, since we are on RHEL6 (without systemd), the traceback on stderr is lost, and we can't do any automatic restart logic either. Sigh.

Comment 5 Dan Callaghan 2017-10-20 06:15:42 UTC
Found the cause of the current crashes: bug 1504527. But we should really get this fixed too, to make beaker-watchdog more resilient instead of dying with an error on stderr that we will never see.
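
One possible way to avoid losing the final traceback when stderr goes nowhere is to route uncaught exceptions through the logging module. The sketch below assumes the daemon's logger is already configured to write to a file or syslog; log_uncaught_exceptions is a hypothetical helper, not something that exists in Beaker.

```python
import logging
import sys


def log_uncaught_exceptions(logger):
    """Install a sys.excepthook that logs the traceback via `logger`.

    Returns the hook so that callers (or tests) can invoke it directly.
    """
    def hook(exc_type, exc_value, exc_tb):
        # exc_info makes the logging module format the full traceback,
        # so it ends up wherever the logger's handlers write to.
        logger.critical("Uncaught exception, daemon exiting",
                        exc_info=(exc_type, exc_value, exc_tb))

    sys.excepthook = hook
    return hook
```

This only records why the process died; it does not keep it running, so it complements rather than replaces retrying inside the loop.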

Comment 6 Dan Callaghan 2018-08-02 07:41:41 UTC
https://gerrit.beaker-project.org/#/c/beaker/+/6240

Comment 8 Dan Callaghan 2018-10-08 02:16:46 UTC
Beaker 26.0 has been released.