Bug 2231853 - Ansible handling of hung jobs
Summary: Ansible handling of hung jobs
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Ansible
Version: 6.13.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Satellite QE Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-14 12:52 UTC by Jan Senkyrik
Modified: 2023-08-18 00:19 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:



Description Jan Senkyrik 2023-08-14 12:52:25 UTC
Let me explain with an example (reproducer):

- Have Satellite 6.13
- Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
- Create 2 roles - nfs_hang and run_check.
- The nfs_hang role emulates a hung task: it tries to read a file on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely (see the setup sketch after the role below).
- The run_check role simply appends a new line with the date and time of the Ansible run to /tmp/ansible_runs.txt.
- Assign the 'nfs_hang' role to the jsenkyri-rhel9c host:

---
# tasks file for nfs_hang
- name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true

- name: Print the output
  debug:
    var: cat_output.stdout_lines
  when: cat_output is defined and cat_output.rc == 0
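
For the hang to occur, the share must be a hard NFS mount (the default) and the NFS server must be stopped or unreachable before the job runs. A minimal sketch of that setup, assuming a hypothetical NFS server nfs.example.com exporting /imports/test:

# mount the share on jsenkyri-rhel9c (hard mount is the default, so reads retry forever):
mount -t nfs nfs.example.com:/imports/test /nfs/imports/test

# stop the service on the NFS server so the read in the nfs_hang role hangs:
systemctl stop nfs-server.service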


- Assign the 'run_check' role to the other host, jsenkyri-rhel9d:

---
# tasks file for run_check
- name: Get current time and date
  set_fact:
    current_time: "{{ ansible_date_time.iso8601 }}"

- name: Append time and date to /tmp/ansible_runs.txt
  lineinfile:
    path: /tmp/ansible_runs.txt
    line: "Current time and date: {{ current_time }}"
    create: yes

- Select both hosts and run 'Run all Ansible roles'. Satellite creates 2 tasks: a RunHostJob for 'jsenkyri-rhel9c' and a RunHostJob for 'jsenkyri-rhel9d'.
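
For reference, the same run can also be triggered from the CLI. This is only a sketch, assuming the stock 'Ansible Roles - Ansible Default' job template and the host names from the reproducer:

# hammer job-invocation create --job-template "Ansible Roles - Ansible Default" --search-query "name ^ (jsenkyri-rhel9c, jsenkyri-rhel9d)"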


Result:
- Both jobs hang forever. The tasks remain in the running/pending state until you cancel them, and nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.

Expectation:
- The job for 'jsenkyri-rhel9c' hangs, while the job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.


##########

If you set batch_size=1, the result is as one would expect: 'jsenkyri-rhel9c' hangs forever, while 'jsenkyri-rhel9d' finishes successfully and writes a new line to '/tmp/ansible_runs.txt'. However, this impacts performance.
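
For completeness, a sketch of lowering the batch size, assuming the standard foreman_tasks_proxy_batch_size setting (verify the setting name on your installation; it can also be changed in the web UI under Administer > Settings):

# hammer settings set --name foreman_tasks_proxy_batch_size --value 1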

I suppose batch_size=1 does the trick because each job then gets its own ansible-runner process:

# ps -ef | grep ansible
~~~
foreman+ 3019108       1  0 12:49 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3019185 3018422  6 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
foreman+ 3019186 3018422  7 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
foreman+ 3019187 3019185 30 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019189 3019186 32 12:51 pts/1    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
foreman+ 3019201 3019187 10 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019209       1  0 12:51 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3019218 3019201  1 12:51 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
root     3019232 2954473  0 12:51 pts/9    00:00:00 grep --color=auto ansible
~~~

With the default batch size (100), there is just one runner for both hosts:

# ps -ef | grep ansible
~~~
foreman+ 3021311 3021160  7 13:00 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
foreman+ 3021312 3021311 21 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021320 3021312 10 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021331       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3021334       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3021349 3021320  0 13:00 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
root     3021362 2954473  0 13:00 pts/9    00:00:00 grep --color=auto ansible
~~~

This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs, the stuck jobs pile up quite quickly, which can lead to performance problems:

[root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
                        label                         | count |   state   | result  
------------------------------------------------------+-------+-----------+---------
...
...
 Actions::RemoteExecution::RunHostJob                 |   104 | paused    | pending
 Actions::RemoteExecution::RunHostJob                 |  1996 | running   | pending
 Actions::RemoteExecution::RunHostsJob                |     1 | paused    | error
 Actions::RemoteExecution::RunHostsJob                |     2 | paused    | pending
 Actions::RemoteExecution::RunHostsJob                |    28 | running   | pending
 Actions::RemoteExecution::RunHostsJob                |     1 | scheduled | pending


Other than configuring batch_size=1, one can try:

a) Add a 'timeout' [0] at the task level (sketch below):
- Once the timeout is reached, the hung task fails, which allows the remaining tasks on other hosts to continue.
- This doesn't scale well in big environments with many roles.
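
A minimal sketch of option a) applied to the nfs_hang task above (the 300-second value is just an example; the 'timeout' task keyword [0] requires ansible-core 2.10 or later):

---
# tasks file for nfs_hang, with a per-task timeout as a safety net
- name: Try to open a file hosted on nfs server, but fail after 5 minutes instead of hanging forever
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true
  timeout: 300   # seconds; tune to the longest expected runtime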

b) Use the 'free' strategy [1]:
- By default, Ansible uses the 'linear' strategy; the 'free' strategy lets each host proceed through its tasks without waiting for the other hosts.
- One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free', as seen below:

---
- hosts: all
  strategy: free
  pre_tasks:
    - name: Display all parameters known for the Foreman host
      debug:
        var: foreman
      tags:
        - always
  tasks:
    - name: Apply roles
      include_role:
        name: "{{ role }}"
      tags:
        - always
      loop: "{{ foreman_ansible_roles }}"
      loop_control:
        loop_var: role


- This works to a certain extent. The task status in Satellite stays running/pending, so you still have to cancel the tasks to unstick them. However, that status is misleading: the tasks actually execute successfully, the result just never gets passed back to Satellite:
---
TASK [Apply roles] *************************************************************

TASK [run_check : Get current time and date] ***********************************
ok: [jsenkyri-rhel9d.sysmgmt.lan]

TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
changed: [jsenkyri-rhel9d.sysmgmt.lan]
---


##########

I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible has certain limitations, see [2]. The part about the 'free' strategy and task status *seems* to be a Satellite bug, though.

Opening this BZ so we can check whether there is a bug or a potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.

Note:
Bug 2156532 [3] seems related.


[0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
[1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
[2] https://github.com/ansible/ansible/issues/30411
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532

