Bug 2231853 - Ansible handling of hung jobs
Summary: Ansible handling of hung jobs
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Ansible
Version: 6.13.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Satellite QE Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-08-14 12:52 UTC by Jan Senkyrik
Modified: 2023-08-18 00:19 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:



Description Jan Senkyrik 2023-08-14 12:52:25 UTC
Let me explain with an example (reproducer):

- Have Satellite 6.13
- Have 2 content hosts - jsenkyri-rhel9c and jsenkyri-rhel9d.
- Create 2 roles - nfs_hang and run_check.
- The nfs_hang role emulates a hung task: it tries to read a file on a mounted NFS share while the NFS service is not running, so the process hangs indefinitely (see the setup sketch after the role below).
- The run_check role simply appends a new line with the date and time of the Ansible run to /tmp/ansible_runs.txt.
- Assign the 'nfs_hang' role to the jsenkyri-rhel9c host:

---
# tasks file for nfs_hang
- name: Try to open a file hosted on nfs server. If the nfs-service is not running then this should hang forever.
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true

- name: Print the output
  debug:
    var: cat_output.stdout_lines
  when: cat_output is defined and cat_output.rc == 0
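
For the hang to occur, the share must be a hard NFS mount (the default) and the NFS server must be stopped or unreachable before the job runs. A minimal sketch of that setup, assuming a hypothetical NFS server nfs.example.com exporting /imports/test:

# mount the share on jsenkyri-rhel9c (hard mount is the default, so reads retry forever):
mount -t nfs nfs.example.com:/imports/test /nfs/imports/test

# stop the service on the NFS server so the read in the nfs_hang role hangs:
systemctl stop nfs-server.service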


- Assign the 'run_check' role to the other host, jsenkyri-rhel9d:

---
# tasks file for run_check
- name: Get current time and date
  set_fact:
    current_time: "{{ ansible_date_time.iso8601 }}"

- name: Append time and date to /tmp/ansible_runs.txt
  lineinfile:
    path: /tmp/ansible_runs.txt
    line: "Current time and date: {{ current_time }}"
    create: yes

- Select both hosts and run 'Run all Ansible roles'. Satellite creates 2 tasks: a RunHostJob for 'jsenkyri-rhel9c' and a RunHostJob for 'jsenkyri-rhel9d'.
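
For reference, the same run can also be triggered from the CLI. This is only a sketch, assuming the stock 'Ansible Roles - Ansible Default' job template and the host names from the reproducer:

# hammer job-invocation create --job-template "Ansible Roles - Ansible Default" --search-query "name ^ (jsenkyri-rhel9c, jsenkyri-rhel9d)"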


Result:
- Both jobs hang forever. The tasks remain in the running/pending state until you cancel them, and nothing is added to '/tmp/ansible_runs.txt' on 'jsenkyri-rhel9d'.

Expectation:
- The job for 'jsenkyri-rhel9c' hangs, while the job for 'jsenkyri-rhel9d' finishes successfully and a new line is added to '/tmp/ansible_runs.txt'.


##########

If you set batch_size=1, the result is as one would expect: 'jsenkyri-rhel9c' hangs forever, while 'jsenkyri-rhel9d' finishes successfully and writes a new line to '/tmp/ansible_runs.txt'. However, this impacts performance.
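
For completeness, a sketch of lowering the batch size, assuming the standard foreman_tasks_proxy_batch_size setting (verify the setting name on your installation; it can also be changed in the web UI under Administer > Settings):

# hammer settings set --name foreman_tasks_proxy_batch_size --value 1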

I suppose batch_size=1 does the trick because each job then gets its own ansible-runner process:

# ps -ef | grep ansible
~~~
foreman+ 3019108       1  0 12:49 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3019185 3018422  6 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-18j5dcu -p playbook.yml
foreman+ 3019186 3018422  7 12:51 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3018422-jjtqb3 -p playbook.yml
foreman+ 3019187 3019185 30 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019189 3019186 32 12:51 pts/1    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-jjtqb3/inventory playbook.yml
foreman+ 3019201 3019187 10 12:51 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3018422-18j5dcu/inventory playbook.yml
foreman+ 3019209       1  0 12:51 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3019218 3019201  1 12:51 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010300.641188-3019201-245584559019420/AnsiballZ_setup.py && sleep 0'
root     3019232 2954473  0 12:51 pts/9    00:00:00 grep --color=auto ansible
~~~

With the default batch size (100), there is just one runner for both hosts:

# ps -ef | grep ansible
~~~
foreman+ 3021311 3021160  7 13:00 ?        00:00:00 /usr/bin/python3.9 /usr/bin/ansible-runner run /tmp/d20230814-3021160-1gekqmh -p playbook.yml
foreman+ 3021312 3021311 21 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021320 3021312 10 13:00 pts/0    00:00:00 /usr/bin/python3.11 /usr/bin/ansible-playbook -i /tmp/d20230814-3021160-1gekqmh/inventory playbook.yml
foreman+ 3021331       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/f33162255c [mux]
foreman+ 3021334       1  0 13:00 ?        00:00:00 ssh: /var/lib/foreman-proxy/ansible/cp/b439dc56b3 [mux]
foreman+ 3021349 3021320  0 13:00 pts/0    00:00:00 ssh -o ProxyCommand=none -C -o ControlMaster=auto -o ControlPersist=60s -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -o Port=22 -o IdentityFile="/var/lib/foreman-proxy/ssh/id_rsa_foreman_proxy" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath="/var/lib/foreman-proxy/ansible/cp/f33162255c" -tt 10.37.195.32 /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1692010837.379527-3021320-273526151698778/AnsiballZ_setup.py && sleep 0'
root     3021362 2954473  0 13:00 pts/9    00:00:00 grep --color=auto ansible
~~~

This is a problem because a single hung/unresponsive host can freeze the entire batch of hosts. In bigger environments with recurring jobs, the stuck jobs pile up quite quickly, which can lead to performance problems:

[root@satellite tmp]# su - postgres -c "psql -d foreman -c 'select label,count(label),state,result from foreman_tasks_tasks where state <> '\''stopped'\'' group by label,state,result ORDER BY label;'"
                        label                         | count |   state   | result  
------------------------------------------------------+-------+-----------+---------
...
...
 Actions::RemoteExecution::RunHostJob                 |   104 | paused    | pending
 Actions::RemoteExecution::RunHostJob                 |  1996 | running   | pending
 Actions::RemoteExecution::RunHostsJob                |     1 | paused    | error
 Actions::RemoteExecution::RunHostsJob                |     2 | paused    | pending
 Actions::RemoteExecution::RunHostsJob                |    28 | running   | pending
 Actions::RemoteExecution::RunHostsJob                |     1 | scheduled | pending


Other than configuring batch_size=1, one can try:

a) Add a 'timeout' [0] at the task level (sketch below):
- Once the timeout is reached, the hung task fails, which allows the remaining tasks on other hosts to continue.
- This doesn't scale well in big environments with many roles.
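
A minimal sketch of option a) applied to the nfs_hang task above (the 300-second value is just an example; the 'timeout' task keyword [0] requires ansible-core 2.10 or later):

---
# tasks file for nfs_hang, with a per-task timeout as a safety net
- name: Try to open a file hosted on nfs server, but fail after 5 minutes instead of hanging forever
  command: cat /nfs/imports/test/file.txt
  register: cat_output
  ignore_errors: true
  timeout: 300   # seconds; tune to the longest expected runtime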

b) Use the 'free' strategy [1]:
- By default, Ansible uses the 'linear' strategy; the 'free' strategy lets each host proceed through its tasks without waiting for the other hosts.
- One can simply clone the 'Ansible Roles - Ansible Default' job template and add 'strategy: free', as seen below:

---
- hosts: all
  strategy: free
  pre_tasks:
    - name: Display all parameters known for the Foreman host
      debug:
        var: foreman
      tags:
        - always
  tasks:
    - name: Apply roles
      include_role:
        name: "{{ role }}"
      tags:
        - always
      loop: "{{ foreman_ansible_roles }}"
      loop_control:
        loop_var: role


- This works to a certain extent. The task status in Satellite stays running/pending, so you still have to cancel the tasks to unstick them. However, that status is misleading: the tasks actually execute successfully, the result just never gets passed back to Satellite:
---
TASK [Apply roles] *************************************************************

TASK [run_check : Get current time and date] ***********************************
ok: [jsenkyri-rhel9d.sysmgmt.lan]

TASK [run_check : Append time and date to /tmp/ansible_runs.txt] ***************
changed: [jsenkyri-rhel9d.sysmgmt.lan]
---


##########

I am not sure to what degree this is a problem on the Satellite side. When it comes to hung tasks, Ansible has certain limitations, see [2]. The part about the 'free' strategy and task status *seems* to be a Satellite bug, though.

Opening this BZ so we can check whether there is a bug or a potential RFE on the Satellite side. Any solutions/workarounds are very welcome as well.

Note:
Bug 2156532 [3] seems related.


[0] https://docs.ansible.com/ansible/latest/reference_appendices/playbooks_keywords.html#task
[1] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/free_strategy.html
[2] https://github.com/ansible/ansible/issues/30411
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2156532

