Bug 1431953

Summary: about 10 systems failed when running remote execution task on 5000 clients: "Error initializing command #<Proxy::RemoteExecution::Ssh::Dispatcher::Command:0x000000029fb800>"
Product: Red Hat Satellite Reporter: Jan Hutař <jhutar>
Component: Remote ExecutionAssignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED CANTFIX QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2.8CC: aruzicka, bbuckingham, inecas, jcallaha
Target Milestone: UnspecifiedKeywords: Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-01 07:45:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jan Hutař 2017-03-14 06:51:17 UTC
Description of problem:
About 10 systems failed when running remote execution task on 5000 clients: "Error initializing command #<Proxy::RemoteExecution::Ssh::Dispatcher::Command:0x000000029fb800>"


Version-Release number of selected component (if applicable):
Sat: satellite-6.2.8-4.0.el7sat.noarch
Capsule: satellite-capsule-6.2.8-4.0.el7sat.noarch


How reproducible:
sometimes


Steps to Reproduce:
1. Run ReX `date` on 5000 systems


Actual results:
In Satellite webUI there is:

Id: 60b12eb2-598a-451d-82a7-b1dcef4e46dc
Label: Actions::RemoteExecution::RunHostJob
Name: Remote action:
Owner:
Execution type: Delayed
Start at: 2017-03-14 00:21:09 +0100
Start before: -
Started at: 2017-03-14 00:21:09 +0100
Ended at: 2017-03-14 04:28:10 +0100
State: stopped
Result: warning
Params: Run date on gprfc020container459.example.com
Parent task
100.0% Complete
100%
Output:
Error initializing command #<Proxy::RemoteExecution::Ssh::Dispatcher::Command:0x000000029fb800>
Net::SSH::AuthenticationFailed Authentication failed for user root.25.203
Exit status: EXCEPTION

Errors:
Script execution failed

That matches to the log:

2017-03-13 19:37:39 [app] [I] Started POST "/foreman_tasks/api/tasks/callback" for 172.17.1.5 at 2017-03-13 19:37:39 -0400
2017-03-13 19:37:39 [app] [I] Started POST "/foreman_tasks/api/tasks/callback" for 172.17.1.5 at 2017-03-13 19:37:39 -0400
2017-03-13 19:37:39 [app] [I] Processing by ForemanTasks::Api::TasksController#callback as XML
2017-03-13 19:37:39 [app] [I]   Parameters: {"callback"=>{"task_id"=>"d1a95dee-75d6-4588-aa2a-e011cb058067", "step_id"=>3}, "data"=>{"result"=>[{"output_type"=>"stdout", "output"=>"Mon Mar 13 23:37:18 UTC 2017\r\n", "timestamp"=>148944823
9.105965}], "exit_status"=>0}, "task"=>{}}
2017-03-13 19:37:39 [app] [I] Processing by ForemanTasks::Api::TasksController#callback as XML
2017-03-13 19:37:39 [app] [I]   Parameters: {"callback"=>{"task_id"=>"60b12eb2-598a-451d-82a7-b1dcef4e46dc", "step_id"=>3}, "data"=>{"result"=>[{"output_type"=>"debug", "output"=>"Error initializing command #<Proxy::RemoteExecution::Ssh::Dispatcher::Command:0x000000029fb800>\nNet::SSH::AuthenticationFailed Authentication failed for user root.25.203", "timestamp"=>1489448243.2427125}], "exit_status"=>"EXCEPTION"}, "task"=>{}}
2017-03-13 19:37:39 [app] [I] Completed 200 OK in 7ms (Views: 0.2ms | ActiveRecord: 1.4ms)
2017-03-13 19:37:39 [app] [I] Completed 200 OK in 8ms (Views: 0.2ms | ActiveRecord: 1.3ms)


Expected results:
Should work, ReX should be reliable.

Comment 2 Jan Hutař 2017-03-14 06:57:27 UTC
There seems to be no log of the problem on capsule side:

[root@gprfc040capsule5 ~]# grep -r -e 3f0f668f-54a7-457e-ab68-56088c90a3bb -e  60b12eb2-598a-451d-82a7-b1dcef4e46dc -e gprfc020container459.example.com /var/log/

[root@gprfc040capsule5 ~]# journalctl --since "2 days ago" | grep -e 3f0f668f-54a7-457e-ab68-56088c90a3bb -e  60b12eb2-598a-451d-82a7-b1dcef4e46dc -e gprfc020container459.example.com