Bug 717500

Summary: reserved guest doesn't return after timeout
Product: [Retired] Beaker
Reporter: Han Pingtian <phan>
Component: lab controller
Assignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 0.6
CC: bpeck, dcallagh, mcsontos, rmancy, stl
Doc Type: Bug Fix
Last Closed: 2011-07-14 02:07:17 UTC

Comment 1 Dan Callaghan 2011-06-29 03:33:03 UTC
It looks like there is a valid watchdog record for the system in question, with the correct kill time (which is now well past), but it was never triggered. I am currently investigating why.

Comment 2 Dan Callaghan 2011-06-29 04:01:32 UTC
The beaker-watchdog daemon on the lab controller in question was stuck reading from a dead HTTP connection. Apparently the system-wide default TCP timeout for established connections is 5 days(!), at least on that box, and we never set any stricter timeouts in the beaker-watchdog daemon itself. I think that is probably the real bug we should be fixing...
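For illustration only, here is a minimal sketch of the failure mode and of how a per-socket timeout bounds it. This is not Beaker code: `read_with_timeout` and `start_silent_server` are hypothetical helpers, and the silent local server merely stands in for a peer that has gone dead while the connection stays established.

```python
import socket
import threading


def read_with_timeout(host, port, timeout_seconds):
    """Read from a peer, returning None if it stays silent for too long.

    Without the timeout, recv() would block until the kernel gives up on
    the connection, which can take a very long time.
    """
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            return None


def start_silent_server():
    """Listen on an ephemeral local port, accept one client, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    held = []  # keep the accepted socket open so the client sees no EOF

    def accept_once():
        conn, _ = server.accept()
        held.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return server, held
```

Calling `read_with_timeout("127.0.0.1", port, 0.2)` against the silent server gives up after roughly 0.2 seconds instead of hanging indefinitely.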

Comment 3 Dan Callaghan 2011-06-29 05:19:25 UTC
I was wrong: it seems we *do* set a timeout on the kobo hub transport for all the lab controller processes.

So the question is, why in this case did the timeout not kick in and prevent beaker-watchdog from getting stuck for 19 hours?

Comment 5 Dan Callaghan 2011-07-01 04:09:59 UTC
Hmm okay I thought I wrote another comment about this yesterday but perhaps I never hit save...

I think the problem is that although the Watchdog object itself has a timeout set, it creates Monitor objects which do not have the timeout set. I think it was one of those which was stuck in a read yesterday. (That explains why there were two connections open to the server, and it was the second one which was stuck.)

I think the best fix is to move the timeout setting into ProxyHelper, which is a parent class for all the objects which talk to the server. That way it will apply to Monitor as well as any other classes we have missed (or add in the future).
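A rough sketch of that idea follows, using only the Python 3 standard library rather than the kobo transport Beaker actually uses. Only the class names `ProxyHelper`, `Watchdog` and `Monitor` come from this report. `TimeoutTransport`, `DEFAULT_TIMEOUT`, `monitor_for`, the constructor signatures and the 120-second value are assumptions made for illustration.

```python
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose HTTP connections carry a socket timeout."""

    def __init__(self, timeout, **kwargs):
        super().__init__(**kwargs)
        self.timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self.timeout  # applied when the socket is opened
        return conn


class ProxyHelper:
    """Hypothetical base class for every object that talks to the server."""

    DEFAULT_TIMEOUT = 120  # seconds; an assumed value, not Beaker's

    def __init__(self, hub_url, timeout=DEFAULT_TIMEOUT):
        self.hub_url = hub_url
        self.timeout = timeout
        # The timeout is set here once, so no subclass can forget it.
        self.transport = TimeoutTransport(timeout)
        self.hub = xmlrpc.client.ServerProxy(hub_url, transport=self.transport)


class Monitor(ProxyHelper):
    """Inherits the timeout without any extra code."""


class Watchdog(ProxyHelper):
    def monitor_for(self):
        # Each Monitor opens its own connection, also with the timeout.
        return Monitor(self.hub_url, timeout=self.timeout)
```

Because the transport is built in the base class constructor, a `Monitor` created by a `Watchdog` gets the same timeout automatically, and so would any helper class added later.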