Bug 1431956

Summary: about 22 sub-tasks of remote execution task on 5000 systems were left in pending after 7.5 hours
Product: Red Hat Satellite Reporter: Jan Hutař <jhutar>
Component: Remote Execution Assignee: Adam Ruzicka <aruzicka>
Status: CLOSED ERRATA QA Contact: Roman Plevka <rplevka>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2.8 CC: aruzicka, inecas, jcallaha, ktordeur, pcreech, rdrazny, rplevka
Target Milestone: 6.4.0 Keywords: Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tfm-rubygem-foreman_remote_execution-1.5.4 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-16 19:27:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---

Description Jan Hutař 2017-03-14 07:08:03 UTC
Description of problem:
About 22 sub-tasks of a remote execution task on 5000 systems were still pending after 7.5 hours.


Version-Release number of selected component (if applicable):
Sat: satellite-6.2.8-4.0.el7sat.noarch
Capsule: satellite-capsule-6.2.8-4.0.el7sat.noarch


How reproducible:
often


Steps to Reproduce:
1. Run ReX `date` on 5000 systems


Actual results:
A sub-task is still pending:

Id: 2123a83d-2325-4d5c-befb-57f2105fc78c
Label: Actions::RemoteExecution::RunHostJob
Name: Remote action:
Owner:
Execution type: Delayed
Start at: 2017-03-14 00:30:33 +0100
Start before: -
Started at: 2017-03-14 00:30:33 +0100
Ended at:
State: running
Result: -
Params: Run date on gprfc028container342.example.com 

Copy and paste of the task's "Running Steps" tab, which says it is suspended:

Action:
Actions::RemoteExecution::RunProxyCommand
State: suspended
Input:
{"effective_user"=>"root",
 "ssh_user"=>"root",
 "effective_user_method"=>"sudo",
 "hostname"=>"172.22.57.86",
 "script"=>"date",
 "connection_options"=>{"retry_interval"=>15, "retry_count"=>4, "timeout"=>60},
 "proxy_url"=>"https://gprfc017capsule7....:9090",
 "locale"=>"en"}

Output:
{"metadata"=>{"timeout"=>"2017-03-13 21:59:24 -0400"},
 "proxy_task_id"=>"ab94b984-39cd-42a2-b51b-357a8de5a0df"}


Expected results:
All sub-tasks should finish instead of staying pending. ReX should be reliable.

Comment 1 Jan Hutař 2017-03-14 07:08:44 UTC
I can use the "Cancel" button to cancel individual sub-tasks as well.

Comment 2 Jan Hutař 2017-03-14 07:12:17 UTC
Oh, it looks like I have to click the "Cancel" button twice to cancel a sub-task.

Comment 7 Satellite Program 2018-03-21 10:08:26 UTC
Upstream bug assigned to aruzicka

Comment 9 Ivan Necas 2018-06-29 17:10:59 UTC
*** Bug 1596642 has been marked as a duplicate of this bug. ***

Comment 10 Ivan Necas 2018-06-29 17:14:54 UTC
Proposing for 6.4: we also got another report of this, https://bugzilla.redhat.com/show_bug.cgi?id=1595081, where a simpler reproducer is described.

The failure can be simulated by restarting the smart_proxy_dynflow_core service while the job execution is running.

After the fix, the task should time out after 10 minutes.
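
For illustration only, the expected post-fix behaviour could be sketched roughly as below. This is not the actual foreman_remote_execution code: `SubTask`, `last_proxy_contact`, `expire_stale_subtasks` and `PROXY_TIMEOUT` are hypothetical names, and the only value taken from this bug is the 10 minute limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumption: 10 minutes, as stated in this comment. The real setting name is not known here.
PROXY_TIMEOUT = timedelta(minutes=10)


@dataclass
class SubTask:
    """Hypothetical stand-in for one per-host sub-task of a ReX job."""
    host: str
    last_proxy_contact: datetime  # last time the proxy reported anything
    state: str = "suspended"
    result: Optional[str] = None


def expire_stale_subtasks(subtasks: List[SubTask], now: datetime,
                          timeout: timedelta = PROXY_TIMEOUT) -> List[SubTask]:
    """Stop suspended sub-tasks whose proxy has been silent longer than `timeout`.

    Returns the sub-tasks that were expired by this call.
    """
    expired = []
    for task in subtasks:
        if task.state == "suspended" and now - task.last_proxy_contact > timeout:
            # Give up waiting instead of leaving the sub-task pending forever.
            task.state = "stopped"
            task.result = "error"
            expired.append(task)
    return expired


if __name__ == "__main__":
    now = datetime(2018, 6, 29, 12, 0, 0)
    tasks = [
        SubTask("host-a.example.com", now - timedelta(minutes=11)),
        SubTask("host-b.example.com", now - timedelta(minutes=2)),
    ]
    # Only host-a has been silent for more than 10 minutes.
    print([t.host for t in expire_stale_subtasks(tasks, now)])
```

The point of the sketch is that a sub-task whose proxy-side task disappeared (for example after the service restart above) ends with an error result rather than staying suspended indefinitely.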

Comment 14 Roman Plevka 2018-09-13 12:20:58 UTC
VERIFIED
on sat6.4.0-21

Performed the `ls` command over SSH on 5000 hosts with 4 workers.
The task finished with a 100% success rate and no pending sub-tasks were left.

Comment 15 Bryan Kearney 2018-10-16 19:27:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2927