Bug 1253918 - Multihost Beaker jobs hang forever
Summary: Multihost Beaker jobs hang forever
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Beaker
Classification: Retired
Component: general
Version: develop
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: beaker-dev-list
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-08-15 16:09 UTC by Jiri Hladky
Modified: 2020-10-21 14:20 UTC
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-21 14:15:42 UTC
Embargoed:



Description Jiri Hladky 2015-08-15 16:09:18 UTC
Description of problem:


Multihost jobs with exactly the same XML as before have suddenly started to fail. We started to experience this behaviour on August 10. Please refer to

https://beaker.engineering.redhat.com/jobs/1048984

and check console.log:

2015-08-12 12:16:34,250 rhts_task.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable 
 
2015-08-12 12:16:41,257 rhts_task.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable 
 
2015-08-12 12:16:48,265 rhts_task.twisted emit: ERROR Unhandled Error 
Traceback (most recent call last): 
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable 

Bill Peck looked into it and recommended that we use restraint instead of beah as a workaround. The workaround is working fine, but we would like beah to be fixed.

Thanks a lot
Jirka

Comment 1 Jeff Burke 2015-09-09 18:54:14 UTC
Jirka,
I took a brief look at this. I think this is a good job (one that worked):
https://beaker.engineering.redhat.com/jobs/1043190
vs. one that failed:
https://beaker.engineering.redhat.com/jobs/1048912
Also the one you originally reported, J:1048984.

Although the XMLs are identical, the test versions that were used are different.
In the one that worked, J:1043190, the version of the test was:
 Package kernel_netperf-performance-network_perftest.noarch 0:3.0-7
In the one that failed, J:1048912 (or the job in your description, J:1048984), the version of the test was:
 Package kernel_netperf-performance-network_perftest.noarch 0:3.0-11

Do you know what changed in the test?

Thanks,
Jeff

Comment 2 Otto Sabart 2015-09-29 10:17:55 UTC
Hi Jeff,
in version 0:3.0-7 of our tests, we did not use RHTS synchronization at all.
The tests worked because we used _our_ own synchronization implementation, written in Python using XML-RPC.
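For readers unfamiliar with this pattern, below is a minimal sketch of a self-hosted XML-RPC synchronization barrier of the kind described above. It is NOT the implementation used by the netperf tests (that code is not shown in this bug). The names `SyncStore`, `set_state`, `get_state`, `start_server` and `block_until` are hypothetical. Each host reports a state string to a small server. A waiting host polls until every peer has reached the expected state, or raises a RuntimeError once a timeout expires, which loosely resembles the "Timeout waiting for RHTS variable" failure seen in the console log. Only the Python standard library is used.

```python
import threading
import time
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class SyncStore:
    """Hypothetical in-memory record of the last state each host announced."""

    def __init__(self):
        self._states = {}
        self._lock = threading.Lock()

    def set_state(self, host, state):
        with self._lock:
            self._states[host] = state
        return True  # XML-RPC methods must return a marshallable value

    def get_state(self, host):
        with self._lock:
            # XML-RPC cannot marshal None by default, so "" means "no state yet"
            return self._states.get(host, "")


def start_server(port=0):
    """Serve a SyncStore on localhost in a background thread (port 0 = any free port)."""
    server = SimpleXMLRPCServer(("127.0.0.1", port), logRequests=False)
    server.register_instance(SyncStore())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def block_until(url, hosts, state, timeout=10.0, interval=0.1):
    """Poll the server until every host in `hosts` reports `state`, or time out."""
    proxy = ServerProxy(url)
    deadline = time.monotonic() + timeout
    pending = set(hosts)
    while pending:
        # Drop the hosts that have already reached the expected state
        pending = {h for h in pending if proxy.get_state(h) != state}
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise RuntimeError("Timeout waiting for hosts: " + ", ".join(sorted(pending)))
        time.sleep(interval)
```

In a real job, the server side would call `start_server()`, each client would call `ServerProxy(url).set_state(<its hostname>, "READY")`, and the waiting side would call `block_until(url, peers, "READY", timeout=...)`. If one peer never reports its state, the waiter fails with a timeout rather than hanging forever.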

Starting with version 0:3.0-11, we moved our tests to RHTS
synchronization, following the Beaker documentation [0].

[0] https://beaker-project.org/docs/user-guide/multihost.html

After that, our jobs started stalling with "Timeout waiting for RHTS
variable" in the console log.

Comment 3 Nate Straz 2015-11-11 20:56:24 UTC
I've run into something similar on the Cluster QE Beaker instance and found that restarting beah-fwd-backend on the affected hosts works around the problem. I'm not sure what the actual bug is. Perhaps beah-fwd-backend.service needs a dependency on beah-srv.service.

Comment 4 Abhijeet Kasurde 2016-03-14 09:15:07 UTC
I am also observing the same results while running IPA QE downstream automation in Beaker:

https://beaker.engineering.redhat.com/jobs/1259997

