Bug 2101708

Summary: when host is deleted on hypervisor while ansible job is running, job gets stuck instead of failing
Product: Red Hat Satellite
Component: Remote Execution
Version: 6.10.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Status: CLOSED ERRATA
Reporter: Stefan Nemeth <snemeth>
Assignee: Adam Ruzicka <aruzicka>
QA Contact: Peter Ondrejka <pondrejk>
CC: aruzicka, pcreech
Keywords: Triaged
Target Milestone: 6.13.0
Target Release: Unused
Fixed In Version: smart_proxy_remote_execution_ssh-0.10.1
Type: Bug
Last Closed: 2023-05-03 13:21:12 UTC

Description Stefan Nemeth 2022-06-28 08:01:52 UTC
Description of problem:

While an ansible job is executed and running, the targeted host gets deleted at the hypervisor level.

The job never fails over time; it stays stuck either in the pending state or at some percentage of progress for a long time.

Version-Release number of selected component (if applicable):

6.10.6

How reproducible:

100%

Steps to Reproduce:
1. Execute an ansible job which runs longer than a few seconds.
2. Delete the targeted host on the hypervisor while the job is running.

Actual results:

The job gets stuck in its current state.

Expected results:

The job fails after some time.

Additional info:

Comment 1 Adam Ruzicka 2022-06-28 09:26:31 UTC
Just to double check, do I read that right that you do not remove the host from Satellite? Just kick off a job, go to the hypervisor and remove the host there?

Comment 2 Marek Hulan 2022-11-07 17:36:30 UTC
Adding a proper needinfo

Comment 4 Adam Ruzicka 2023-01-10 15:10:02 UTC
Right, I managed to reproduce it.

Local libvirt reproducer:
1) Have a satellite and a vm
2) Run a long-running ansible job against the vm
3) Do shut down > force off on the vm
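For a local libvirt setup, step 3 can also be done from the CLI; virsh destroy is the forceful-stop equivalent of "force off" (the VM name below is a placeholder):

  virsh destroy <vm-name>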

The foreman-proxy service runs ansible, and ansible runs ssh. When the remote host is forcefully killed (or removed), the connection does not break; it remains ESTABLISHED long after the host went away.
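One way to observe this from the box running foreman-proxy is to list established TCP connections and look for the ssh session to the now-gone target; the filter below is just a sketch, adjust the address to your target host:

  ss -tnp | grep ESTAB | grep <target-host-ip>

The entry keeps showing as ESTABLISHED because nothing on the wire tells the client that the peer is gone.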

We could probably start setting a combination of ServerAliveInterval and ServerAliveCountMax for ssh.
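For illustration (the interval and count values below are made up, the mechanism is standard OpenSSH): the client sends a keepalive probe every ServerAliveInterval seconds and tears the connection down after ServerAliveCountMax unanswered probes, so a vanished host is detected within roughly interval * count seconds.

  # one-off, on the command line
  ssh -o ServerAliveInterval=15 -o ServerAliveCountMax=3 user@host

  # or persistently, in ssh_config
  Host *
      ServerAliveInterval 15
      ServerAliveCountMax 3

With 15/3 the connection to a dead host is dropped after about 45 seconds instead of hanging indefinitely.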

Both ansible and script (in ssh mode) jobs are susceptible to this; the ansible part will need to be fixed in the puppet foreman_proxy modules, the rex part in rex itself.
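For the ansible side, the usual place for such ssh options is ssh_args in ansible.cfg. This is only a sketch of the end result (the real change is meant to land via the puppet foreman_proxy module and the installer, not by hand-editing the file, and note that setting ssh_args replaces ansible's default ssh arguments rather than appending to them):

  [ssh_connection]
  ssh_args = -o ServerAliveInterval=15 -o ServerAliveCountMax=3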

Comment 5 Bryan Kearney 2023-01-14 16:03:07 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/35924 has been resolved.

Comment 6 Adam Ruzicka 2023-01-18 15:17:23 UTC
To elaborate, the fix for ssh is merged and we can ship it for 6.13. The ansible part needs to happen in the puppet modules and will need an additional installer change. We can deliver the ssh part for 6.13, but not the rest.

Comment 7 Brad Buckingham 2023-01-23 14:53:12 UTC
Hi Adam,

Thank you for the details.

For the installer changes mentioned in comment 6, is there another bugzilla to track those changes or should this bugzilla be cloned for Installer?

Comment 8 Adam Ruzicka 2023-01-24 13:44:37 UTC
As far as I know there is no other BZ, although I have it laid out in jira as subtasks if that counts.

Comment 9 Adam Ruzicka 2023-03-17 12:02:25 UTC
Looking at a 6.13 snap 13 box, this seems to have been fully delivered already.

satellite-6.13.0-6.el8sat.noarch
rubygem-smart_proxy_remote_execution_ssh-0.10.1-1.el8sat.noarch
foreman-installer-katello-3.5.2.1-1.el8sat.noarch
foreman-installer-3.5.2.1-1.el8sat.noarch
satellite-installer-6.13.0.7-1.el8sat.noarch

Comment 10 Peter Ondrejka 2023-03-22 15:03:54 UTC
Verified on Sat 6.13 snap 15; both ansible and ssh script jobs get terminated when the target host becomes unreachable.

Comment 13 errata-xmlrpc 2023-05-03 13:21:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097