2136531 – Batch of job-invocation tasks are marked as failed when one of them fails

Summary: Batch of job-invocation tasks are marked as failed when one of them fails

Keywords:
Status:	CLOSED DUPLICATE of bug 2167396
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Ansible - Configuration Management
Sub Component:
Version:	6.12.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium with 1 vote
Target Milestone:	Unspecified
Assignee:	satellite6-bugs
QA Contact:	Satellite QE Team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-10-20 14:19 UTC by Pablo Mendez Hernandez
Modified:	2023-03-20 09:50 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-03-20 09:50:45 UTC
Target Upstream Version:
Embargoed:

Description Pablo Mendez Hernandez 2022-10-20 14:19:14 UTC

Description of problem:

When a job invocation runs and one of its tasks fails, the whole batch of tasks that contains it (usually 100) is marked as failed, all of them with "Exit status: 4".
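
For context, ansible-playbook reports its result as a bit mask, and 4 is the value it uses when one or more hosts are unreachable. A minimal sketch for decoding such a status follows. The values mirror the return codes defined in Ansible's task queue manager and should be checked against the installed version; the helper name decode_exit_status is made up for illustration. Whether Satellite copies the playbook-level code onto every host in the batch is an assumption, not something this report confirms.

```python
# Return codes used by ansible-playbook (bit flags, apart from 0 and 255).
# Assumption: values as defined in Ansible's TaskQueueManager; verify locally.
RUN_FLAGS = {
    1: "error",
    2: "one or more hosts failed",
    4: "one or more hosts unreachable",
    8: "play aborted (failed break play)",
}


def decode_exit_status(code):
    """Hypothetical helper: list the conditions encoded in a playbook exit code."""
    if code == 0:
        return ["ok"]
    if code == 255:
        return ["unknown error"]
    # Collect the label of every flag bit that is set in the code.
    return [label for bit, label in sorted(RUN_FLAGS.items()) if code & bit]


print(decode_exit_status(4))
```

Read this way, "Exit status: 4" on every task describes the playbook run as a whole rather than the outcome on each individual host.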


Version-Release number of selected component (if applicable):

Satellite 6.12.0 Snap 15


How reproducible:

Always


Steps to Reproduce:
1. Register one host, but do not configure the SSH keys used for SSH remote execution (ReX) on it.
2. Run: hammer job-invocation create --async --inputs command='date' --job-template 'Run Command - Ansible Default' --search-query 'name ~ $REGEXP_THAT_MATCHES_THAT_HOST'

Actual results:

The batch of 100 systems that includes that host is marked as failed. If that host is the only one without access, the remaining tasks in the batch complete their work but are still marked as failed.


Expected results:

Only the system(s) that actually had problems should be marked as failed, not the rest of the systems in the same batch.
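
The difference between the actual and the expected results can be sketched as follows. This is only an illustration under stated assumptions: the recap dictionary is hypothetical sample data shaped like Ansible's per-host play recap counters, the host names are invented, and both function names are made up. It is not Satellite's implementation.

```python
def per_host_status(recap):
    """Expected behaviour: judge each host by its own recap counters."""
    statuses = {}
    for host, stats in recap.items():
        bad = stats.get("failures", 0) + stats.get("unreachable", 0)
        statuses[host] = "failed" if bad else "success"
    return statuses


def batch_wide_status(recap, exit_code):
    """Reported behaviour: one playbook exit code applied to every host."""
    label = "success" if exit_code == 0 else "failed"
    return {host: label for host in recap}


# Hypothetical sample data: only host-002 could not be reached.
recap = {
    "host-001.example.com": {"ok": 1, "failures": 0, "unreachable": 0},
    "host-002.example.com": {"ok": 0, "failures": 0, "unreachable": 1},
    "host-003.example.com": {"ok": 1, "failures": 0, "unreachable": 0},
}

print(per_host_status(recap))       # only host-002 is marked failed
print(batch_wide_status(recap, 4))  # every host is marked failed
```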


Additional info:

Comment 3 Pablo Mendez Hernandez 2022-10-25 15:01:24 UTC

The same behavior occurs with the MQTT provider enabled on the Satellite side when the content host registration process fails to configure yggdrasild, yet the host still appears as registered from the Satellite's point of view. This is probably due to a communication problem during the registration process, possibly caused by multiple registrations.

Comment 4 Pablo Mendez Hernandez 2022-10-25 15:02:51 UTC

The /var/log/rhsm.log on the client reports this:

2022-10-24 16:59:38,068 [ERROR] subscription-manager:414:MainThread @cache.py:189 - Error updating system data on the server
2022-10-24 16:59:38,079 [ERROR] subscription-manager:414:MainThread @cache.py:190 - The read operation timed out
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 179, in update_check
    self._sync_with_server(uep, consumer_uuid)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 523, in _sync_with_server
    _combined_profile
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 1325, in updateCombinedProfile
    return self.conn.request_put(method, profile)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 943, in request_put
    return self._request("PUT", method, params, headers=headers)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 967, in _request
    info=info, headers=headers, cert_key_pairs=cert_key_pairs)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 749, in _request
    response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.6/ssl.py", line 971, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 833, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 590, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
2022-10-24 16:59:38,102 [ERROR] subscription-manager:414:MainThread @managercli.py:229 - exception caught in subscription-manager
2022-10-24 16:59:38,102 [ERROR] subscription-manager:414:MainThread @managercli.py:230 - The read operation timed out
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/subscription_manager/managercli.py", line 547, in main
    return_code = self._do_command()
  File "/usr/lib64/python3.6/site-packages/subscription_manager/managercli.py", line 2069, in _do_command
    profile_mgr.update_check(self.cp, consumer['uuid'], True)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 485, in update_check
    return CacheManager.update_check(self, uep, consumer_uuid, force)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 191, in update_check
    raise e
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 179, in update_check
    self._sync_with_server(uep, consumer_uuid)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 523, in _sync_with_server
    _combined_profile
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 1325, in updateCombinedProfile
    return self.conn.request_put(method, profile)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 943, in request_put
    return self._request("PUT", method, params, headers=headers)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 967, in _request
    info=info, headers=headers, cert_key_pairs=cert_key_pairs)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 749, in _request
    response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.6/ssl.py", line 971, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 833, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 590, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

Comment 5 Pablo Mendez Hernandez 2023-01-08 15:40:07 UTC

This is still happening in 6.13.

The last time we hit this, the content host was temporarily unreachable via SSH due to load, which caused the whole batch of 100 jobs to be marked as failed.

Comment 6 nalfassi 2023-02-28 09:20:53 UTC

Is this the same as the issue reported here? https://bugzilla.redhat.com/show_bug.cgi?id=2167396

Comment 7 Pablo Mendez Hernandez 2023-03-20 09:17:14 UTC

No. In our case the jobs finish with error code 4.

Comment 8 nalfassi 2023-03-20 09:50:45 UTC

(In reply to Pablo Mendez Hernandez from comment #7)
> No. In our case the jobs finish with error code 4.

Yeah, it's the same issue. I'll mark it as a duplicate of BZ #2167396.

*** This bug has been marked as a duplicate of bug 2167396 ***
