Bug 1386270

Summary: [RFE] Job invocations should happen asynchronously
Product: Red Hat Satellite
Reporter: Daniel Lobato Garcia <dlobatog>
Component: Remote Execution
Assignee: Adam Ruzicka <aruzicka>
Status: CLOSED ERRATA
QA Contact: jcallaha
Severity: high
Docs Contact: satellite6-bugs <satellite6-bugs>
Priority: high
Version: 6.1.9
CC: aruzicka, bkearney, dcaplan, ealcaniz, fgarciad, inecas, jcallaha, molasaga, m.r.watts, satellite6-bugs, tbrisker
Target Milestone: Unspecified
Keywords: FieldEngineering, FutureFeature, PrioBumpGSS, PrioBumpPM, Triaged
Target Release: Unused
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-21 16:54:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Daniel Lobato Garcia 2016-10-18 14:08:07 UTC
Description of problem:

Customer needs to run many long-running job invocations at the same time on multiple machines. 

These machines are on a low-bandwidth network, so keeping many connections alive isn't possible, as some jobs can take a long time (e.g. reposync, yum update).

These connections waste resources on the client hosts which are not very powerful machines. 

This could be implemented by adding a provider other than SSH, or possibly by making SSH start the job and return right away (the capsule could then check the status of the job somehow).

Additional info:

Currently they are running their own custom remote execution scripts, which use Ansible core libraries to make calls asynchronously and poll for the status of the execution. The solution provided by Satellite does not necessarily have to poll for the status, but it would need to provide a way to check its status.
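
As a rough illustration of the "run the job and return right away" idea (this is not how Satellite or the upstream fix implements it), a minimal shell sketch could start the command detached and record its exit code in a file that a later, short-lived connection can read. The function names start_job and job_status and the directory layout are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Satellite or Foreman.

# Start a command detached from the calling session and return immediately.
# Output goes to "$dir/output"; the exit code lands in "$dir/exit_code"
# once the command has finished.
start_job() {
    dir=$1
    shift
    mkdir -p "$dir"
    rm -f "$dir/exit_code"
    # $0 inside the inner shell is the job directory, "$@" is the command.
    # The exit code is written to a temp file and renamed so that a reader
    # never sees a half-written file.
    nohup sh -c '"$@" >"$0/output" 2>&1; echo $? >"$0/exit_code.tmp"; mv "$0/exit_code.tmp" "$0/exit_code"' \
        "$dir" "$@" >/dev/null 2>&1 &
}

# Report the state of a job; nothing has to stay connected between calls.
job_status() {
    dir=$1
    if [ -f "$dir/exit_code" ]; then
        echo "finished: $(cat "$dir/exit_code")"
    else
        echo "running"
    fi
}
```

For example, `start_job /tmp/job1 yum -y update` would return at once, and `job_status /tmp/job1` could be called from a separate connection at whatever interval suits the network.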

Comment 2 Adam Ruzicka 2016-11-29 09:51:34 UTC
Created redmine issue http://projects.theforeman.org/issues/17514 from this bug

Comment 7 Satellite Program 2017-05-22 16:15:11 UTC
Upstream bug assigned to aruzicka

Comment 8 Satellite Program 2017-05-22 16:15:16 UTC
Upstream bug assigned to aruzicka

Comment 10 Satellite Program 2017-09-11 12:15:42 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/17514 has been resolved.

Comment 13 Mark Watts 2018-02-14 13:45:56 UTC
Has Foreman issue 17514 been ported to Satellite yet?
We're on Satellite 6.2.14 and the "--async" option doesn't seem to have any effect as per:


# hammer job-invocation create --job-template 'Run Command - SSH Default'  --inputs 'command=ls' --search-query name=play01273.example.com --async
Job invocation 31 created
[...................................................................................................................................................................................................................................] [100%]
1 task(s), 1 success, 0 fail


It would be really useful to have the same type of async operation as when using:

# hammer host errata apply --errata-ids $errataList --host $host --async

Comment 15 Adam Ruzicka 2018-02-14 14:18:12 UTC
@Mark:
Please note this feature has nothing to do with hammer's --async flag. Hammer's --async flag tells hammer not to wait for the job invocation to finish.

Preliminary steps:
This feature can be enabled on a per-proxy basis by setting :async_ssh to true in /etc/smart_proxy_dynflow_core/settings.d/remote_execution_ssh.yml. The interval for checking on the remote jobs can be set in the same file under the runner_refresh_interval key.

Apparently it is not exposed in the installer and needs to be uncommented and toggled in the file by hand.
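
Since the setting has to be toggled by hand, a small check along these lines may help confirm it is actually active before testing. This is only a sketch: the function name async_ssh_enabled is made up, and it assumes the key is written as `:async_ssh: true` on a line of its own.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given settings file contains an
# uncommented ":async_ssh: true" line, fails otherwise.
async_ssh_enabled() {
    grep -Eq '^[[:space:]]*:async_ssh:[[:space:]]*true[[:space:]]*$' "$1"
}
```

Usage: `async_ssh_enabled /etc/smart_proxy_dynflow_core/settings.d/remote_execution_ssh.yml && echo "async_ssh is on"`.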

Steps to reproduce:
1) Complete the preliminary steps
2) Run a remote execution job which will take some time (sleep 600)
3) Log in to the server and use ss or netstat to look for opened SSH connections
4) (note) it may take up to a minute (iirc) for the kernel to completely "forget" the tcp connection

Expected results:
There should NOT be a persistent connection opened to the remote host

Comment 16 jcallaha 2018-02-14 15:20:41 UTC
Verified in Satellite 6.3 Snap 35.

Negative Test:
Kicked off a job that executed the command `sleep 600`
Satellite immediately opened a connection.
While the command was running (sleeping), the connection was maintained.
Finally, Satellite closed the connection once the job was complete.

Every 2.0s: ss | grep ssh                              Wed Feb 14 15:29:49 2018

tcp    ESTAB	  0	 0      <host>:ssh                  <satellite>:37704
tcp    ESTAB	  0	 0      <host>:ssh                  <self>:44772

Every 2.0s: ss | grep ssh                              Wed Feb 14 15:36:22 2018

tcp    ESTAB	  0	 0      <host>:ssh                  <satellite>:37704
tcp    ESTAB	  0	 0      <host>:ssh                  <self>:44772

Every 2.0s: ss | grep ssh                              Wed Feb 14 15:40:26 2018

tcp    ESTAB	  0	 0      <host>:ssh                  <self>:44772



Positive Test:
Added `:async_ssh: true` to /etc/smart_proxy_dynflow_core/settings.d/remote_execution_ssh.yml
Restarted satellite.
Kicked off the job from before (sleep 600). 
Satellite checked in on the host, then exited. 
Satellite then periodically checked in until the job completed.

# for i in {1..60}; do ss | grep ssh >> connections.txt && sleep 2; done
# cat connections.txt
...
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
tcp    ESTAB      0      52     <host>:ssh                  <satellite>:40490
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
tcp    ESTAB      0      0      <host>:ssh                  <self>:44772
...

In both cases the job completed successfully. However, the job ran asynchronously as expected only after the settings change was made.
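
The samples above were read by eye. As a sketch, a filter like the following could count established SSH sessions from a given peer in each sample. The function name count_ssh_from is made up, and it assumes the column layout shown in this comment (Netid, State, Recv-Q, Send-Q, Local:Port, Peer:Port) as printed by `ss` on the managed host.

```shell
#!/bin/sh
# Hypothetical helper: count established SSH sessions from a given peer in
# `ss` output read on stdin. Expected columns:
#   Netid State Recv-Q Send-Q Local:Port Peer:Port
count_ssh_from() {
    peer=$1
    awk -v peer="$peer" '
        $2 == "ESTAB" && $5 ~ /:ssh$/ {
            # The peer address is everything before the last colon.
            n = split($6, parts, ":")
            addr = substr($6, 1, length($6) - length(parts[n]) - 1)
            if (addr == peer) count++
        }
        END { print count + 0 }
    '
}
```

Running `ss | count_ssh_from satellite.example.com` in a loop on the host (the hostname is a placeholder) should then print 1 for the whole job in the synchronous case, and mostly 0 with occasional 1s once async_ssh is enabled.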

Comment 17 Mark Watts 2018-02-14 15:34:44 UTC
Ok, I'm misunderstanding what the --async flag for "hammer job-invocation" is doing here then.

My expectation was that:


# time hammer job-invocation create --job-template 'Run Command - SSH Default'  --inputs 'command=ls' --search-query name=play01273.example.com --async
Job invocation 32 created
[...................................................................................................................................................................................................................................] [100%]
1 task(s), 1 success, 0 fail

real    3m17.119s
user    0m1.752s
sys     0m0.644s



Would return to the console immediately, which it does not.

Comment 18 Adam Ruzicka 2018-02-15 07:56:26 UTC
(In reply to Mark Watts from comment #17)
> My expectation was that:
> 
> Would return to the console immediately, which it does not.

Your expectation was right; however, this feature was broken for quite some time and should be fixed in 6.3. Please see the BZ[1] for it.

[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1440962

Comment 19 Satellite Program 2018-02-21 16:54:17 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336