Description of problem:
The Restraint process never returns on an ssh error.

Our team runs Restraint jobs that take a long time, and recently we have seen many cases where the Restraint process never returns because of an ssh issue. All I see in the log is:

    Using ./FFinterop_final.01 for job run
    Connecting to http://localhost:8091/run for host: 10.34.54.61:8081, recipe id:1
    Error writing to ssh channel
    Error reading from ssh channel
    Connection terminated unexpectedly [g-io-error-quark, 34]

When I looked at the client it was having an issue with, I saw that the restraintd process had dumped core:

    ● restraintd.service - The restraint harness.
       Loaded: loaded (/usr/lib/systemd/system/restraintd.service; enabled; vendor preset: disabled)
       Active: failed (Result: core-dump) since Mon 2016-11-14 21:14:54 CET; 18min ago
      Process: 2411 ExecStart=/usr/bin/restraintd (code=dumped, signal=SEGV)
      Process: 2405 ExecStartPre=/usr/bin/check_beaker (code=exited, status=0/SUCCESS)
     Main PID: 2411 (code=dumped, signal=SEGV)
       CGroup: /system.slice/restraintd.service

I don't know if this is always the cause, but I run these jobs from Jenkins and I see a lot of these failures with s390x systems and ppc64 VMs from Beaker. I could handle the problem if the Restraint process ended and returned an error code; however, it never ends.

Case 1: An ssh error on the initial connect shows an error, but the process never returns.
Case 2: An ssh error in the middle of a job shows an error, but the process appears to retry the connection indefinitely.

For Case 1, can the connection be reattempted a fixed number of times (for example, 5 attempts with a 60-second wait between them), and if it is still not connected at that point, give up and exit with an error code? I would like to see similar behavior for Case 2.

This is a major issue for our team, because we wait on a process that is expected to take 3-4 hours in an automated Jenkins job; in some cases nothing was ever run, and I have to implement a kill time for the process myself. If the process returned, I would know that it exited early, and I could restart the restraintd process on the client and retry. Currently this is not possible, since it does not return.

Version-Release number of selected component (if applicable):
0.1.28

How reproducible:
100%

Steps to Reproduce:
1. The restraintd process is stopped or has errored on the client.
2. Run a restraint call to run a test (e.g.: /usr/bin/restraint -vvv --host 1=10.34.54.61:8081 --job /home/jenkins/workspace/PIT-FFInterop-nss-ppc-vipatel-runtest/qe-pit-scenarios/ffinterop_ppc-nss/FFinterop_final.xml)

Actual results:
The restraint process never returns.

Expected results:
The restraint process returns a failure after an acceptable number of retries has been attempted.

Additional info:
http://gerrit.beaker-project.org/5439 Fix infinite loop on ssh failure.
http://gerrit.beaker-project.org/5440 Limit number of reconnection retries.
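For builds that do not yet include these changes, the "kill time" workaround mentioned in the description could look roughly like the wrapper below. This is only a sketch and not part of Restraint: the function name `run_with_deadline`, its parameters, and the use of exit code 124 to signal a timeout are assumptions made for illustration. The defaults of 5 attempts and a 60-second wait are taken from the request in the description.

```python
import subprocess
import sys
import time

# Assumption: 124 is used here to mark a timeout, mirroring GNU timeout's convention.
TIMEOUT_EXIT_CODE = 124


def run_with_deadline(cmd, timeout_s, attempts=5, wait_s=60, sleep=time.sleep):
    """Run cmd with a hard deadline, retrying a bounded number of times.

    Returns 0 as soon as one attempt succeeds; otherwise returns the exit
    code of the last failed attempt (TIMEOUT_EXIT_CODE if it was killed).
    """
    last_code = 1
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not finished within timeout_s seconds.
            last_code = subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            last_code = TIMEOUT_EXIT_CODE
        if last_code == 0:
            return 0
        print(f"attempt {attempt}/{attempts} failed with code {last_code}",
              file=sys.stderr)
        if attempt < attempts:
            sleep(wait_s)  # wait before the next attempt
    return last_code
```

A Jenkins step could then call, for example, `sys.exit(run_with_deadline(["/usr/bin/restraint", "-vvv", "--host", "1=10.34.54.61:8081", "--job", "/home/jenkins/workspace/PIT-FFInterop-nss-ppc-vipatel-runtest/qe-pit-scenarios/ffinterop_ppc-nss/FFinterop_final.xml"], timeout_s=4 * 3600))`. Two caveats: the timeout only kills the direct child process, and each retry starts the whole job again, so this is probably only appropriate when the failure happens on the initial connect (Case 1).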