Bug 1628145

Summary: Ansible Remote Execution Job Stalls at Scale
Product: Red Hat Satellite Reporter: sbadhwar
Component: Remote Execution Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED DUPLICATE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: 6.4 CC: aruzicka, inecas, jhutar, mmccune, psuriset, sbadhwar
Target Milestone: Unspecified Keywords: Performance, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-18 11:23:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1628505, 1646745    
Bug Blocks:    

Description sbadhwar 2018-09-12 11:28:10 UTC
Description of problem:
While running an Ansible package install command at scale (45K hosts, with 1 package to be installed), the job stalled and made no further progress. On closer inspection, it appears that the smart_proxy service on one of the capsules was not processing any tasks.

Version-Release number of selected component (if applicable):
Satellite 6.4 Snap 18

How reproducible:
Not sure

Steps to Reproduce:
1. Create a new remote execution job with the Ansible package command
2. Execute the job at large scale (35k or more hosts)

Actual results:
The job stalls at some point and makes no further progress

Expected results:
The job execution finishes successfully

Additional info:
Foreman debug satellite: http://debugs.theforeman.org/foreman-debug-XHcxQ.tar.xz
Foreman debug capsule: http://debugs.theforeman.org/foreman-debug-a5qE7.tar.xz

Comment 1 Adam Ruzicka 2018-09-12 11:37:58 UTC
Additional info:
There were a number of tasks on the capsule. Dynflow status showed roughly 850 events in the queue and 0/5 free workers, yet the process was completely idle.
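
For triage, the symptom above (events queued, no free workers, idle process) can be told apart from a merely busy executor with a check along the following lines. This is a minimal sketch and not part of dynflow or smart_proxy: the `ExecutorSample` fields, the `looks_stalled` helper, and the thresholds are all hypothetical, and the sample values are illustrative rather than measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExecutorSample:
    """One observation of an executor; all field names are hypothetical."""
    queued_events: int   # events waiting in the queue
    free_workers: int    # workers not currently assigned any work
    cpu_percent: float   # CPU usage of the executor process


def looks_stalled(samples: Sequence[ExecutorSample],
                  idle_cpu_threshold: float = 1.0,
                  min_samples: int = 3) -> bool:
    """Return True if each of the last `min_samples` samples shows queued
    work, no free workers, and a near-idle process."""
    if len(samples) < min_samples:
        return False  # not enough observations to decide
    recent = samples[-min_samples:]
    return all(
        s.queued_events > 0
        and s.free_workers == 0
        and s.cpu_percent <= idle_cpu_threshold
        for s in recent
    )


# Illustrative values: work queued, no free workers, and either an idle
# process (stalled) or a busy one (saturated but still making progress).
stalled = [ExecutorSample(850, 0, 0.0)] * 3
busy = [ExecutorSample(850, 0, 95.0)] * 3
print(looks_stalled(stalled))  # True
print(looks_stalled(busy))     # False
```

Requiring several consecutive samples avoids flagging a single momentary lull between events as a stall.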