In our Jenkins environment I found a pile of bkr job-watch processes waiting for jobs that have already completed in Beaker. Using lsof/strace I can see that they are blocked forever in a read on their connection to beaker-devel. Presumably there was some kind of network glitch and the connections were dropped on the other side, and bkr job-watch has failed to notice.

I feel like there *should* already be a read timeout on XMLRPC requests from bkr, because we have implemented that so many times over the years, but I haven't dug in to check exactly. Evidently it's not working though.

It's very important that bkr job-watch terminates in a timely fashion, so we need to make sure there is a reasonably aggressive read and connect timeout for XMLRPC (we have used 2 minutes elsewhere) and that the XMLRPC retrying code is not retrying forever.
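To illustrate the "not retrying forever" requirement, here is a rough sketch of a bounded retry wrapper. It is not Beaker's actual retry code: the name call_with_retries, the default of 3 attempts and the 1 second delay are all assumptions made up for this example.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on network errors, at most `attempts` times in total.

    Hypothetical helper for illustration only. Socket timeouts and dropped
    connections are subclasses of OSError in Python 3, so that is what we catch.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as exc:
            last_error = exc
            # Only sleep if another attempt is still coming.
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: give up and surface the last error to the caller.
    raise last_error
```

The point is only that the loop has a hard upper bound, so a dead connection eventually produces an error instead of a process that hangs around indefinitely.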
The problem is that Kobo doesn't set any timeout on its xmlrpclib Transport directly. Back when we were still using Kobo we had hacked in a timeout for LabController code but not Client. We can just move the timeout handling into HubProxy now that we have our own copy. Very tempted to replace it all with requests + xmlrpclib marshalling...
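For illustration, a minimal sketch of what a timeout on the XML-RPC transport could look like. This is not the actual patch: it uses Python 3's xmlrpc.client (the renamed xmlrpclib), and the class name TimeoutTransport and the 120 second default are assumptions, the latter taken from the 2 minutes mentioned above.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """Hypothetical Transport that applies a socket timeout to its connections."""

    def __init__(self, timeout=120, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        # The base class returns an http.client.HTTPConnection, which has not
        # connected yet. Its `timeout` attribute is used both when the socket
        # is opened (connect timeout) and for subsequent reads.
        conn = super().make_connection(host)
        conn.timeout = self.timeout
        return conn
```

An instance can be passed as the `transport` argument of `xmlrpc.client.ServerProxy`. For https URLs the same override would have to go on a `SafeTransport` subclass instead.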
(In reply to Dan Callaghan from comment #1)
> The problem is that Kobo doesn't set any timeout on its xmlrpclib Transport
> directly. Back when we were still using Kobo we had hacked in a timeout for
> LabController code but not Client.

That was bug 717500: https://git.beaker-project.org/cgit/beaker/commit/?id=c2fb5974d4dfc16a30138901180fa00503268028
http://gerrit.beaker-project.org/4759
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions
This bug fix is included in beaker-client-22.4-0.git.6.5613dcf which is currently available for download here: https://beaker-project.org/nightlies/release-22/
This patch was merged to the release-22 branch but the next release will be 23.0.
Beaker 23.0 has been released.