Bug 1394956 - restraint process never returns on an ssh error
Status: CLOSED NOTABUG
Alias: None
Product: Restraint
Classification: Retired
Component: general
Version: master
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Artem Savkov
Reported: 2016-11-14 21:19 UTC by Vimal Patel
Modified: 2017-08-10 17:58 UTC (History)

Last Closed: 2017-08-10 17:58:23 UTC


Description Vimal Patel 2016-11-14 21:19:11 UTC
Description of problem:
Restraint process never returns on an ssh error.

Our team runs long Restraint jobs, and recently we have seen many cases where the Restraint process never returns because of an ssh issue.

All I see in the log is:
Using ./FFinterop_final.01 for job run
Connecting to http://localhost:8091/run for host: 10.34.54.61:8081, recipe id:1
Error writing to ssh channel
Error reading from ssh channel
Connection terminated unexpectedly [g-io-error-quark, 34]

When I looked into the client it was having an issue with, I saw that the restraintd process had dumped core:

● restraintd.service - The restraint harness.
   Loaded: loaded (/usr/lib/systemd/system/restraintd.service; enabled; vendor preset: disabled)
   Active: failed (Result: core-dump) since Mon 2016-11-14 21:14:54 CET; 18min ago
  Process: 2411 ExecStart=/usr/bin/restraintd (code=dumped, signal=SEGV)
  Process: 2405 ExecStartPre=/usr/bin/check_beaker (code=exited, status=0/SUCCESS)
 Main PID: 2411 (code=dumped, signal=SEGV)
   CGroup: /system.slice/restraintd.service

I don't know if this is always the case, but I run this from Jenkins jobs and I see a lot of these issues with s390x systems and ppc64 VMs from Beaker.

I would be able to handle the issue if the Restraint process ended and returned an error code; however, it never ends.

Case 1: An ssh error on initial connect, shows an error, but never returns.
Case 2: An ssh error in the middle of a job, shows an error, but there seems to be an infinite number of retries to connect.

For Case 1, could the connection be retried a fixed number of times (for example, 5 attempts with a 60-second wait between them), and if it is still not connected at that point, give up and exit with an error code? I would like to see similar behavior for Case 2.
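To make the request concrete, the bounded retry described above could look roughly like the following. This is only a sketch of the requested behavior, not Restraint's code (Restraint is written in C); `ConnectError`, `connect_with_retries`, and the `connect` callable are hypothetical names used for illustration.

```python
import time


class ConnectError(Exception):
    """Hypothetical error type standing in for an ssh connection failure."""


def connect_with_retries(connect, attempts=5, delay_s=60, sleep=time.sleep):
    """Call connect() up to `attempts` times, sleeping `delay_s` between tries.

    Returns whatever connect() returns on success. Once the attempts are
    used up, the last ConnectError is re-raised so the caller can exit
    with a non-zero status instead of waiting forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            sleep(delay_s)  # wait before the next attempt
```

With the defaults, a host that never answers fails after 5 attempts and 4 waits of 60 seconds, roughly 4 minutes, rather than hanging.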

This is a major issue for our team, because we are waiting on a process that is expected to take 3-4 hours in an automated Jenkins job; however, in some cases nothing was ever run, and I have to implement a kill timeout for the process myself. If it returned, I would know that it exited early, and could restart the restraint process on the client and retry, but currently this is not possible, since it does not return.
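For reference, the kind of kill timeout I mean on the Jenkins side is roughly the following. It is a minimal sketch using Python's standard `subprocess` module; `run_with_deadline` and the choice of 124 as the timeout exit code are my own assumptions, not anything Restraint provides.

```python
import subprocess

# Assumed convention: report a timeout with exit code 124, the same value
# the coreutils `timeout` command uses.
TIMEOUT_EXIT_CODE = 124


def run_with_deadline(cmd, timeout_s):
    """Run `cmd` (a list of arguments) and return its exit code.

    If the command is still running after `timeout_s` seconds,
    subprocess.run() kills it and raises TimeoutExpired, which is
    mapped to TIMEOUT_EXIT_CODE here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE
```

A job would call this with the restraint command line, for example `run_with_deadline(["/usr/bin/restraint", "-vvv", "--host", "1=10.34.54.61:8081", "--job", job_xml], timeout_s=5 * 3600)`, where `job_xml` and the 5-hour limit are placeholders. The drawback is that a hang on the initial connect is only detected after the full timeout, which is why an early exit from restraint itself would be better.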


Version-Release number of selected component (if applicable):
0.1.28

How reproducible:
100%

Steps to Reproduce:
1. restraintd process stopped or errored on client
2. run a restraint call to run a test (e.g.:
/usr/bin/restraint -vvv --host 1=10.34.54.61:8081 --job /home/jenkins/workspace/PIT-FFInterop-nss-ppc-vipatel-runtest/qe-pit-scenarios/ffinterop_ppc-nss/FFinterop_final.xml)

Actual results:
the restraint process never returns

Expected results:
the restraint process returns a failure after an acceptable number of retries has been attempted.

Additional info:

Comment 1 Artem Savkov 2016-11-16 13:21:25 UTC
 http://gerrit.beaker-project.org/5439 Fix infinite loop on ssh failure.
 http://gerrit.beaker-project.org/5440 Limit number of reconnection retries.
