Let me explain with an example (reproducer):
- Have Satellite 6.13.
- Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
- Create 2 roles - nfs_hang and run_check.
- The nfs_hang role emulates a hung task: it tries to open a file hosted on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely.
- The run_check role simply appends a new line with the date & time of the Ansible run to /tmp/ansible_runs.txt.
- Assign role 'nfs_hang' to the jsenkyri-rhel9c host:

~~~
---
# tasks file for nfs_hang
- name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true

- name: Print the output
  debug:
    var: cat_output.stdout_lines
  when: cat_output is defined and cat_output.rc == 0
~~~

- Assign role 'run_check' to the other host, jsenkyri-rhel9d:

~~~
---
# tasks file for run_check
- name: Get current time and date
  set_fact:
    current_time: "{{ ansible_date_time.iso8601 }}"

- name: Append time and date to /tmp/ansible_runs.txt
  lineinfile:
    path: /tmp/ansible_runs.txt
    line: "Current time and date: {{ current_time }}"
    create: yes
~~~

- Select both hosts and do 'Run all Ansible roles'. Satellite will create 2 tasks - RunHostJob for 'jsenkyri-rhel9c' and RunHostJob for 'jsenkyri-rhel9d'.

Result:
- Both jobs hang forever. The tasks are left in a running/pending state until you cancel them. Nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.

Expectation:
- The job for 'jsenkyri-rhel9c' hangs. The job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.

##########

If you set batch_size=1 the result is as one would expect - 'jsenkyri-rhel9c' hangs forever, while 'jsenkyri-rhel9d' finishes successfully and writes a new line to '/tmp/ansible_runs.txt'. However, this impacts performance. (A sketch of where the batch size can be lowered follows below.)
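
For reference, a minimal sketch of how that batch size could be lowered, assuming the knob in question is the foreman-tasks 'Proxy tasks batch size' setting (foreman_tasks_proxy_batch_size, default 100); the setting name and location may differ between versions, so treat the commands below as illustrative rather than authoritative:

~~~
# Assumption: batch_size maps to the 'Proxy tasks batch size' setting
# (Administer > Settings > Tasks in the web UI); verify on your version.
# Show the current value (default 100):
hammer settings list --search 'name = foreman_tasks_proxy_batch_size'

# Lower it to 1 so each host gets its own ansible-runner, at the cost of
# overall job throughput:
hammer settings set --name foreman_tasks_proxy_batch_size --value 1
~~~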

I suppose batch_size=1 does the trick because each job then gets its own ansible-runner:

# ps -ef | grep ansible
~~~
foreman+ 3019108 1 0 12:49 ? 00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3019185 3018422 6 12:51 ? 00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
foreman+ 3019186 3018422 7 12:51 ? 00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
foreman+ 3019187 3019185 30 12:51 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019189 3019186 32 12:51 pts/1 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
foreman+ 3019201 3019187 10 12:51 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019209 1 0 12:51 ? 00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3019218 3019201 1 12:51 pts/0 00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
root 3019232 2954473 0 12:51 pts/9 00:00:00 grep --color=auto ansible
~~~

With the default batch size (100) there is just one runner for both:

# ps -ef | grep ansible
~~~
foreman+ 3021311 3021160 7 13:00 ? 00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
foreman+ 3021312 3021311 21 13:00 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021320 3021312 10 13:00 pts/0 00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021331 1 0 13:00 ? 00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3021334 1 0 13:00 ? 00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3021349 3021320 0 13:00 pts/0 00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
root 3021362 2954473 0 13:00 pts/9 00:00:00 grep --color=auto ansible
~~~

This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs the stuck jobs pile up quickly, which can lead to performance problems:

[root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
~~~
                 label                  | count |   state   | result
----------------------------------------+-------+-----------+---------
 ...
 ...
 Actions::RemoteExecution::RunHostJob   |   104 | paused    | pending
 Actions::RemoteExecution::RunHostJob   |  1996 | running   | pending
 Actions::RemoteExecution::RunHostsJob  |     1 | paused    | error
 Actions::RemoteExecution::RunHostsJob  |     2 | paused    | pending
 Actions::RemoteExecution::RunHostsJob  |    28 | running   | pending
 Actions::RemoteExecution::RunHostsJob  |     1 | scheduled | pending
~~~

Other than configuring batch_size=1 one can try:

a) Add 'timeout' [0] at the task level (see the sketch after this item):
- Once the timeout is reached, Satellite fails the hung task, which allows the remaining tasks on other hosts to continue.
- This doesn't scale well in big environments with many roles.
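
To illustrate option a), here is a minimal sketch of the task-level 'timeout' keyword applied to the reproducer's nfs_hang task; 'timeout' takes seconds, and the 300 used here is an arbitrary example value:

~~~
---
# Sketch only: fail the potentially hanging task after 5 minutes instead of
# letting it block the whole batch ('timeout' is a task keyword, see [0]).
- name: Try to open a file hosted on nfs server
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true
  timeout: 300
~~~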

b) Use the 'free' strategy [1]:
- By default Ansible uses the 'linear' strategy; with the 'free' strategy each host works through its tasks without waiting for the other hosts.
- One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free' as seen below:

~~~
---
- hosts: all
  strategy: free
  pre_tasks:
    - name: Display all parameters known for the Foreman host
      debug:
        var: foreman
      tags:
        - always
  tasks:
    - name: Apply roles
      include_role:
        name: "{{ role }}"
      tags:
        - always
      loop: "{{ foreman_ansible_roles }}"
      loop_control:
        loop_var: role
~~~

- This works to a certain extent. The task status in Satellite stays running/pending, so you have to cancel the tasks to get them unstuck. However, that status is misleading: the tasks actually execute successfully, the status just never gets reported back to Satellite:

~~~
TASK [Apply roles] *************************************************************

TASK [run_check : Get current time and date] ***********************************
ok: [jsenkyri-rhel9d.sysmgmt.lan]

TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
changed: [jsenkyri-rhel9d.sysmgmt.lan]
~~~

##########

I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible has certain limitations, see [2]. The part about the 'free' strategy & task status *seems* to be a Satellite bug, though. I am opening this BZ so we can check whether there is a bug or a potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.

Note: Bug 2156532 [3] seems related.

[0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
[1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
[2] https://github.com/ansible/ansible/issues/30411
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532