It looks like there is a valid watchdog record for the system in question, with the correct kill time (which is now well past). But it was never triggered. Currently investigating why that is so.
The beaker-watchdog daemon on the lab controller in question was stuck reading from a dead HTTP connection. Apparently the system-wide default TCP timeout for established connections is 5 days(!), at least on that box, and we never set any stricter timeouts in the beaker-watchdog daemon itself. I think that is probably the real bug we should be fixing...
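For reference, something like the following is the kind of client-side timeout that would have bounded that read instead of leaving it to the kernel's multi-day retransmission window. This is only a sketch: the URL, the timeout value, and the RPC method name are illustrative, not Beaker's actual code.

    import socket
    import xmlrpc.client

    # Without an explicit timeout, a blocking read on a dead connection only
    # fails when the kernel finally gives up on the peer, which can take days.
    # A process-wide default bounds every new socket, including the ones
    # xmlrpc.client opens under the hood.
    socket.setdefaulttimeout(120)  # seconds; illustrative value

    server = xmlrpc.client.ServerProxy("http://lab.example.com/RPC2")
    try:
        server.get_expired_watchdogs()  # hypothetical RPC, for illustration only
    except socket.timeout:
        # The daemon now notices the dead connection and can reconnect,
        # instead of sitting in recv() until the TCP stack gives up.
        pass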
I was wrong; it seems we *do* set a timeout on the kobo hub transport for all the lab controller processes. So the question is: why, in this case, did the timeout not kick in and prevent beaker-watchdog from getting stuck for 19 hours?
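For context, a transport-level timeout is usually applied along these lines. This is a generic xmlrpc sketch, not kobo's actual transport class, and the names and values are assumptions. The important property is that only proxies built through this particular transport object are covered; any connection made some other way gets no timeout at all.

    import xmlrpc.client

    class TimeoutTransport(xmlrpc.client.Transport):
        # Applies a socket timeout to every HTTP connection this transport
        # opens. Generic sketch of the technique, not kobo's transport class.
        def __init__(self, timeout=120):
            super().__init__()
            self._timeout = timeout

        def make_connection(self, host):
            conn = super().make_connection(host)
            conn.timeout = self._timeout  # honoured by http.client on connect()
            return conn

    # Only proxies that go through this transport instance get the timeout;
    # anything that builds its own connection separately is unprotected.
    hub = xmlrpc.client.ServerProxy(
        "http://lab.example.com/RPC2",
        transport=TimeoutTransport(timeout=120),
    )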
Hmm, okay, I thought I wrote another comment about this yesterday, but perhaps I never hit save... I think the problem is that although the Watchdog object itself has a timeout set, it creates Monitor objects which do not have the timeout set. I think it was one of those which was stuck in a read yesterday. (That explains why there were two connections open to the server, and it was the second one which was stuck.) I think the best fix is to move the timeout setting into ProxyHelper, which is the parent class for all the objects that talk to the server. That way it will apply to Monitor as well as any other classes we have missed (or add in the future).
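To make the proposed shape concrete, here is a simplified sketch of what that refactoring could look like. The class names follow the comment above, but the hub setup, the timeout value, and the method names are assumptions rather than the real Beaker/kobo code.

    import xmlrpc.client

    class _TimeoutTransport(xmlrpc.client.Transport):
        # Minimal transport that puts a socket timeout on each connection it opens.
        def __init__(self, timeout):
            super().__init__()
            self._timeout = timeout

        def make_connection(self, host):
            conn = super().make_connection(host)
            conn.timeout = self._timeout
            return conn

    class ProxyHelper:
        # Shared base for every object that talks to the server. The timeout is
        # applied here, in the one place all client objects pass through,
        # instead of only in Watchdog.
        def __init__(self, hub_url, timeout=120):
            self.hub_url = hub_url
            self.hub = xmlrpc.client.ServerProxy(
                hub_url, transport=_TimeoutTransport(timeout))

    class Watchdog(ProxyHelper):
        # Previously the only class that set a timeout on its own connection.
        def spawn_monitor(self):
            # Monitor opens its own hub connection, but because it also
            # inherits from ProxyHelper, that second connection now carries
            # the timeout too -- that was the one stuck in a read.
            return Monitor(self.hub_url)

    class Monitor(ProxyHelper):
        pass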