1309906 – bkr job-watch can get stuck waiting forever, no read timeout on XMLRPC requests?

Summary: bkr job-watch can get stuck waiting forever, no read timeout on XMLRPC requests?

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	command line
Sub Component:
Version:	22
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	23.0
Assignee:	Dan Callaghan
QA Contact:	tools-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-02-18 22:52 UTC by Dan Callaghan
Modified:	2016-07-07 23:11 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-07-07 23:11:03 UTC
Embargoed:

Description Dan Callaghan 2016-02-18 22:52:44 UTC

In our Jenkins environment I found a pile of bkr job-watch processes waiting for jobs that have already completed in Beaker. Using lsof/strace I can see that they are still connected to beaker-devel and stuck reading from it forever. Presumably there was some kind of network glitch and the connections were dropped on the other side, but bkr job-watch never noticed.

I feel like there *should* already be a read timeout on XMLRPC requests from bkr, because we have implemented that so many times over the years, but I haven't dug in to check exactly. Evidently it's not working though.

It's very important that bkr job-watch terminates in a timely fashion, so we need to make sure that there is a reasonably aggressive read and connect timeout for XMLRPC (we have used 2 minutes elsewhere) and that the XMLRPC retrying code does not retry forever.
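
For illustration only, the "not retrying forever" requirement above can be sketched as a bounded retry loop. This is not Beaker's actual retry code: the helper name call_with_retries and its attempts, delay and retry_on parameters are hypothetical, and the sketch assumes the underlying call raises a socket-level error (including a timeout) when the connection fails.

```python
import socket
import time


def call_with_retries(func, attempts=3, delay=1.0, retry_on=(socket.error,)):
    """Call func(), retrying on the given exceptions a bounded number of times.

    Hypothetical helper: the name and parameters are assumptions made for
    this sketch, not part of bkr or Kobo.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Give up after the last allowed attempt instead of looping forever.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

With a read timeout on the connection, each attempt fails within a bounded time, so the whole call is bounded by roughly attempts * (timeout + delay).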

Comment 1 Dan Callaghan 2016-03-22 06:43:57 UTC

The problem is that Kobo doesn't set any timeout on its xmlrpclib Transport directly. Back when we were still using Kobo we had hacked in a timeout for LabController code but not Client.

We can just move the timeout handling into HubProxy now that we have our own copy.

Very tempted to replace it all with requests + xmlrpclib marshalling...
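
As a rough sketch of what setting a timeout on an XML-RPC transport can look like (this is not the actual Beaker patch, and the class name TimeoutTransport is hypothetical), the example below subclasses the standard library transport and sets a timeout on each connection it creates. It uses Python 3's xmlrpc.client, the renamed successor of the xmlrpclib module mentioned above, and covers plain HTTP only; an HTTPS variant would subclass SafeTransport in the same way.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport that applies a socket timeout to its connections.

    Hypothetical class for illustration; the default of 120 seconds mirrors
    the "2 minutes" figure mentioned in the description.
    """

    def __init__(self, timeout=120, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        # The base class returns an http.client.HTTPConnection that has not
        # connected yet, so the timeout set here applies to both the connect
        # and every subsequent blocking read on the socket.
        conn = super().make_connection(host)
        conn.timeout = self.timeout
        return conn
```

A proxy would then be built with something like xmlrpc.client.ServerProxy(url, transport=TimeoutTransport()), where url stands for the hub's XML-RPC endpoint. A request that stalls for longer than the timeout raises a socket timeout error instead of blocking indefinitely.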

Comment 2 Dan Callaghan 2016-03-22 06:47:44 UTC

(In reply to Dan Callaghan from comment #1)
> The problem is that Kobo doesn't set any timeout on its xmlrpclib Transport
> directly. Back when we were still using Kobo we had hacked in a timeout for
> LabController code but not Client.

That was bug 717500:

https://git.beaker-project.org/cgit/beaker/commit/?id=c2fb5974d4dfc16a30138901180fa00503268028

Comment 3 Dan Callaghan 2016-03-22 07:01:08 UTC

http://gerrit.beaker-project.org/4759

Comment 4 Mike McCune 2016-03-28 22:26:18 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 5 Dan Callaghan 2016-04-08 06:44:56 UTC

This bug fix is included in beaker-client-22.4-0.git.6.5613dcf which is currently available for download here:

https://beaker-project.org/nightlies/release-22/

Comment 7 Dan Callaghan 2016-06-10 01:40:27 UTC

This patch was merged to the release-22 branch but the next release will be 23.0.

Comment 8 Dan Callaghan 2016-07-07 23:11:03 UTC

Beaker 23.0 has been released.
