Description of problem:
Multi-host jobs whose XML is exactly the same as before have suddenly started to fail. We first saw this behaviour on August 10.

Please refer to https://beaker.engineering.redhat.com/jobs/1048984 and check console.log:

2015-08-12 12:16:34,250 rhts_task.twisted emit: ERROR Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable

2015-08-12 12:16:41,257 rhts_task.twisted emit: ERROR Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable

2015-08-12 12:16:48,265 rhts_task.twisted emit: ERROR Unhandled Error
Traceback (most recent call last):
Failure: exceptions.RuntimeError: Timeout waiting for RHTS variable

Bill Peck has looked into it and recommended that we use restraint instead of beah as a workaround. The workaround is working fine, but we would like beah to be fixed.

Thanks a lot
Jirka
Jirka,

I took a brief look at this. I think a good job (one that worked) is https://beaker.engineering.redhat.com/jobs/1043190, versus one that failed: https://beaker.engineering.redhat.com/jobs/1048912, and also your originally reported one, 1048984.

Although the XMLs are identical, the test versions that were used are different.

In the one that worked (J:1043190), the version of the test was:
Package kernel_netperf-performance-network_perftest.noarch 0:3.0-7

In the ones that failed (J:1048912, or the job in your description, J:1048984), the test was:
Package kernel_netperf-performance-network_perftest.noarch 0:3.0-11

Do you know what changed in the test?

Thanks,
Jeff
Hi Jeff,

in version 0:3.0-7 of our tests we did not use RHTS synchronization at all. The tests worked because we used _our_ own implementation of synchronization, written in Python and using XML-RPC. Starting with version 0:3.0-11 we wanted to move our tests to RHTS synchronization, following the Beaker documentation [0].

[0] https://beaker-project.org/docs/user-guide/multihost.html

After that, our jobs started to stall with "Timeout waiting for RHTS variable" in the console log.
I've run into something similar on the Cluster QE Beaker instance and found that restarting beah-fwd-backend on the affected hosts works around the problem. I'm not sure what the actual bug is. Perhaps beah-fwd-backend.service needs a dependency on beah-srv.service.
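For anyone who wants to script the restart workaround, here is a minimal sketch in Python. It assumes the affected host runs systemd and that the unit is named beah-fwd-backend.service, as mentioned above. The helper names (restart_command, restart_service) are hypothetical and are not part of beah. This only automates the workaround; it does not address the underlying bug.

```python
import subprocess

# Unit name taken from the comment above; the helpers below are illustrative only.
SERVICE = "beah-fwd-backend.service"


def restart_command(service=SERVICE):
    """Return the systemctl argv that restarts the given unit."""
    return ["systemctl", "restart", service]


def restart_service(service=SERVICE, runner=subprocess.run):
    """Restart the unit.

    With the default runner, subprocess.CalledProcessError is raised
    if systemctl exits with a non-zero status. The runner parameter
    exists so the call can be stubbed out in a dry run.
    """
    return runner(restart_command(service), check=True)
```

Calling restart_service() as root on an affected host should have the same effect as restarting the service by hand.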
I am also observing the same results while running IPA QE downstream automation in Beaker: https://beaker.engineering.redhat.com/jobs/1259997