Description of problem:
When running a job invocation, if one of the tasks fails, the whole batch of tasks (usually 100) is marked as failed, all of them with "Exit status: 4".

Version-Release number of selected component (if applicable):
Satellite 6.12.0 Snap 15

How reproducible:
Always

Steps to Reproduce:
1. Register one host but do not configure SSH keys for SSH ReX
2. Run: hammer job-invocation create --async --inputs command='date' --job-template 'Run Command - Ansible Default' --search-query 'name ~ $REGEXP_THAT_MATCHES_THAT_HOST'

Actual results:
The batch of 100 systems that includes that host is marked as failed. If that is the only system without access, the tasks on the other systems run successfully but are still marked as failed.

Expected results:
Only the system(s) with problems should be marked as failed, not the other systems in the same batch.

Additional info:
The same behavior occurs when the MQTT provider is enabled on the Satellite side and the content host registration fails to configure yggdrasild, yet the host still appears registered from the Satellite's point of view (due to some communication problem during the registration process, probably because of multiple registrations).
The /var/log/rhsm.log on the client reports this:

2022-10-24 16:59:38,068 [ERROR] subscription-manager:414:MainThread @cache.py:189 - Error updating system data on the server
2022-10-24 16:59:38,079 [ERROR] subscription-manager:414:MainThread @cache.py:190 - The read operation timed out
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 179, in update_check
    self._sync_with_server(uep, consumer_uuid)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 523, in _sync_with_server
    _combined_profile
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 1325, in updateCombinedProfile
    return self.conn.request_put(method, profile)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 943, in request_put
    return self._request("PUT", method, params, headers=headers)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 967, in _request
    info=info, headers=headers, cert_key_pairs=cert_key_pairs)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 749, in _request
    response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.6/ssl.py", line 971, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 833, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 590, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
2022-10-24 16:59:38,102 [ERROR] subscription-manager:414:MainThread @managercli.py:229 - exception caught in subscription-manager
2022-10-24 16:59:38,102 [ERROR] subscription-manager:414:MainThread @managercli.py:230 - The read operation timed out
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/subscription_manager/managercli.py", line 547, in main
    return_code = self._do_command()
  File "/usr/lib64/python3.6/site-packages/subscription_manager/managercli.py", line 2069, in _do_command
    profile_mgr.update_check(self.cp, consumer['uuid'], True)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 485, in update_check
    return CacheManager.update_check(self, uep, consumer_uuid, force)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 191, in update_check
    raise e
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 179, in update_check
    self._sync_with_server(uep, consumer_uuid)
  File "/usr/lib64/python3.6/site-packages/subscription_manager/cache.py", line 523, in _sync_with_server
    _combined_profile
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 1325, in updateCombinedProfile
    return self.conn.request_put(method, profile)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 943, in request_put
    return self._request("PUT", method, params, headers=headers)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 967, in _request
    info=info, headers=headers, cert_key_pairs=cert_key_pairs)
  File "/usr/lib64/python3.6/site-packages/rhsm/connection.py", line 749, in _request
    response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib64/python3.6/ssl.py", line 971, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 833, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib64/python3.6/ssl.py", line 590, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
This is still happening in 6.13. The last time we hit it, the content host was temporarily unreachable via SSH due to load, which caused the whole batch of 100 jobs to be marked as failed.
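To make the expected behavior concrete, here is a rough sketch of what we would like to see: each host gets a status derived from its own result rather than from the exit status of the whole batch. This is not Satellite or ReX code. The function name and the shape of the `recap` dictionary (per-host counters similar to an Ansible play recap) are assumptions made up for illustration, and the values 4 (unreachable) and 2 (failed) are chosen to mirror the return codes ansible-playbook uses for a whole run, as far as I understand them.

```python
from typing import Dict

# Hypothetical per-host statuses, chosen to mirror ansible-playbook's
# run-level return codes (assumption: 2 = failed hosts, 4 = unreachable hosts).
STATUS_OK = 0
STATUS_FAILED = 2
STATUS_UNREACHABLE = 4


def per_host_exit_status(recap: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Map each host to a status based only on that host's own counters.

    `recap` is an assumed structure: host name -> counters such as
    {"ok": 1, "failures": 0, "unreachable": 0}.
    """
    statuses = {}
    for host, counters in recap.items():
        if counters.get("unreachable", 0) > 0:
            statuses[host] = STATUS_UNREACHABLE
        elif counters.get("failures", 0) > 0:
            statuses[host] = STATUS_FAILED
        else:
            statuses[host] = STATUS_OK
    return statuses


if __name__ == "__main__":
    # One unreachable host in the batch must not taint the others.
    batch = {
        "host-a.example.com": {"ok": 1, "failures": 0, "unreachable": 0},
        "host-b.example.com": {"ok": 0, "failures": 0, "unreachable": 1},
    }
    for host, status in per_host_exit_status(batch).items():
        print(f"{host}: exit status {status}")
```

With something along these lines, only host-b in the example would be reported as failed, while host-a would keep its successful result.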
Is this the same as the issue reported here? https://bugzilla.redhat.com/show_bug.cgi?id=2167396
No. In our case the jobs finish with error code 4.
(In reply to Pablo Mendez Hernandez from comment #7)
> No. In our case the jobs finish with error code 4.

Yeah, it's the same issue. I'll mark it as a duplicate of BZ #2167396.

*** This bug has been marked as a duplicate of bug 2167396 ***