
Bug 1394956

Summary:	restraint process never returns on an ssh error
Product:	[Retired] Restraint	Reporter:	Vimal Patel <vipatel>
Component:	general	Assignee:	Artem Savkov <asavkov>
Status:	CLOSED NOTABUG	QA Contact:
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	master	CC:	asavkov, bpeck
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-10 17:58:23 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Vimal Patel 2016-11-14 21:19:11 UTC

Description of problem:
Restraint process never returns on an ssh error.

Our team runs Restraint jobs that take a long time, and recently we have seen a lot of cases where the Restraint process never returns because of an ssh issue.

All I see in the log is:
Using ./FFinterop_final.01 for job run
Connecting to http://localhost:8091/run for host: 10.34.54.61:8081, recipe id:1
Error writing to ssh channel
Error reading from ssh channel
Connection terminated unexpectedly [g-io-error-quark, 34]

When I looked into the client it was having an issue with, I saw that there was a core dump for the restraintd process:

● restraintd.service - The restraint harness.
Loaded: loaded (/usr/lib/systemd/system/restraintd.service; enabled; vendor preset: disabled)
Active: failed (Result: core-dump) since Mon 2016-11-14 21:14:54 CET; 18min ago
Process: 2411 ExecStart=/usr/bin/restraintd (code=dumped, signal=SEGV)
Process: 2405 ExecStartPre=/usr/bin/check_beaker (code=exited, status=0/SUCCESS)
Main PID: 2411 (code=dumped, signal=SEGV)
CGroup: /system.slice/restraintd.service

I don't know if this is always the case, but I run this from Jenkins jobs and I have a lot of issues with s390x systems and ppc64 VMs from Beaker.

I would be able to handle the issue if the Restraint process ended and returned an error code; however, it never ends.

Case 1: An ssh error on initial connect, shows an error, but never returns.
Case 2: An ssh error in the middle of a job, shows an error, but there seems to be an infinite number of retries to connect.

For Case 1, can the connection be reattempted a fixed number of times (for example, 5 attempts with a 60-second wait between them), and if it is still not connected at that point, give up and exit with an error code? I would like to see similar behavior for Case 2.
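
To make the request concrete, here is a minimal sketch of the bounded-retry behavior described above. It is not restraint's code (restraint is written in C); the function name connect_with_retries and the connect callable are hypothetical, and the defaults of 5 attempts and a 60-second wait are simply the numbers suggested in this report.

```python
import time
from typing import Callable


def connect_with_retries(
    connect: Callable[[], bool],
    attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call connect() up to `attempts` times, waiting between failures.

    `connect` is a hypothetical callable that returns True on success and
    returns False or raises OSError on failure. Returns True as soon as one
    attempt succeeds, or False once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            if connect():
                return True
        except OSError as exc:
            # ConnectionError and similar socket errors are OSError subclasses.
            print(f"attempt {attempt}/{attempts} failed: {exc}")
        if attempt < attempts:
            # No point sleeping after the final failed attempt.
            sleep(wait_seconds)
    return False
```

A caller would then exit with a non-zero code when this returns False, so that a wrapper such as a Jenkins job can tell that nothing was run.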

This is a major issue for our team, because we are waiting on a process that is expected to take 3-4 hours in an automated Jenkins job; however, in some cases nothing was ever run, and I have to implement a kill time for the process myself. If it returned, I would be able to know it exited early, restart the restraint process on the client and retry, but currently this is not possible, since it does not return.
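
For reference, the kill-time workaround mentioned above can be sketched roughly as follows. This is only an illustration of the workaround on the Jenkins side, not anything restraint provides: run_with_deadline is a hypothetical helper, and returning 124 on timeout is an assumption borrowed from the coreutils timeout convention.

```python
import subprocess

# Assumption: reuse the exit code that coreutils `timeout` reports on expiry.
TIMEOUT_EXIT_CODE = 124


def run_with_deadline(cmd, timeout_seconds):
    """Run `cmd` (a list of arguments) and return its exit code.

    If the command is still running after `timeout_seconds`, subprocess.run
    kills it and raises TimeoutExpired, which is mapped to TIMEOUT_EXIT_CODE
    so the caller always gets a return code back.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE
    return completed.returncode
```

The job would pass the restraint command line as the list and pick a deadline somewhat above the expected 3-4 hours. The drawback is that a failure on initial connect is still only noticed once the full deadline has passed, which is why an early exit from restraint itself is needed.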

Version-Release number of selected component (if applicable):
0.1.28

How reproducible:
100%

Steps to Reproduce:
1. Have the restraintd process stopped or in a failed state on the client.
2. Run restraint to start a test, e.g.:
/usr/bin/restraint -vvv --host 1=10.34.54.61:8081 --job /home/jenkins/workspace/PIT-FFInterop-nss-ppc-vipatel-runtest/qe-pit-scenarios/ffinterop_ppc-nss/FFinterop_final.xml

Actual results:
the restraint process never returns

Expected results:
the restraint process returns a failure after an acceptable number of retries has been attempted.

Additional info:

Comment 1 Artem Savkov 2016-11-16 13:21:25 UTC

 http://gerrit.beaker-project.org/5439 Fix infinite loop on ssh failure.
 http://gerrit.beaker-project.org/5440 Limit number of reconnection retries.