Description of problem:
A network outage during a pulp repository sync leaves the task stuck forever, even though the network hiccup was resolved hours ago. For a reproducer, it is enough to block traffic for 90 seconds just after kicking off the sync.

It is arguable whether we should deal with such improbable scenarios (a network issue lasting over a minute between clouds), but:
- it sometimes really happens
- it affects custom repos as well, whose availability might not be 100% reliable
- pulp should be robust to this kind of issue at least to some extent
- it affects Satellite 6 the same way (where it might be seen as a bigger problem in general, though we have hit it with RHUI only so far)

Version-Release number of selected component (if applicable):
RHUI 4.3.0 - python39-pulpcore-3.21.0-1.0.1.el8ui.noarch

How reproducible:
100%

Steps to Reproduce:
1. Enable a bigger Red Hat repo on RHUI
2. Trigger a sync of it
3. Within a few seconds, simulate a network outage by simply dropping any packets from the CDN:

   iptables -A INPUT --proto tcp --source cdn.redhat.com -j DROP
   sleep 90
   iptables -F

4. Monitor the sync status

Actual results:
4. Gets stuck forever, like:

-= Repository Active Synchronization Status =-

Last Refreshed: 13:36:42
(updated every 5 seconds, ctrl+c to exit)

Current Sync Result
------------------------------------------------------------------------------
Red Hat Enterprise Linux 8 for x86_64 - AppStream from RHUI (RPMs) (8)
  2023-10-02 08:27:15 Running
  Associating Content
  Downloading Artifacts
  Downloading Metadata Files: 4

Expected results:
Either the synchronization fails over the network hiccup and is retried within the sync, or (rather?) the task is cancelled.

Additional info:
A possible implementation could be similar to what yum/dnf has: when the download speed drops below some threshold, it cancels the download. This was even seen at a customer site during a shared session.
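As a hedged illustration of the stall-detection idea from "Additional info" (this is not Pulp's actual code; the helper name and timeout value are made up): each chunk read can be bounded with asyncio.wait_for, so a connection that silently stops delivering data raises instead of blocking forever. In aiohttp the same effect is available via ClientTimeout(sock_read=...), and dnf exposes it through its minrate/timeout options.

```python
import asyncio

# Assumed value: how long we tolerate receiving no data before aborting.
STALL_TIMEOUT = 60

async def read_stream(chunks, timeout=STALL_TIMEOUT):
    """Consume an async iterator of body chunks, failing fast on a stall.

    `chunks` stands in for an HTTP response body iterator. The point is
    that a dead connection raises asyncio.TimeoutError instead of hanging.
    """
    received = bytearray()
    it = chunks.__aiter__()
    while True:
        try:
            # Bound each chunk read; a stalled stream trips the timeout.
            chunk = await asyncio.wait_for(it.__anext__(), timeout)
        except StopAsyncIteration:
            return bytes(received)
        received.extend(chunk)
```

The per-chunk bound (rather than a total-transfer timeout) is what lets large downloads run arbitrarily long while still detecting a dead connection quickly.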
Oh, I forgot to mention the backtrace of the child pulpcore worker process:

# grep -e ^Thread -e py, gdb.txt | cut -d'(' -f1-2
Thread 3 (Thread 0x7fb108865700
#10 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10886c230, for file /usr/lib64/python3.9/concurrent/futures/thread.py, line 81, in _worker
#18 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092e6900, for file /usr/lib64/python3.9/threading.py, line 917, in run
#25 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb1bfd20, for file /usr/lib64/python3.9/threading.py, line 980, in _bootstrap_inner
#32 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb1bfb00, for file /usr/lib64/python3.9/threading.py, line 937, in _bootstrap
Thread 2 (Thread 0x7fb1092ad700
#10 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092c7800, for file /usr/lib64/python3.9/concurrent/futures/thread.py, line 81, in _worker
#18 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092c9ac0, for file /usr/lib64/python3.9/threading.py, line 917, in run
#25 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092cb440, for file /usr/lib64/python3.9/threading.py, line 980, in _bootstrap_inner
#32 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092c9900, for file /usr/lib64/python3.9/threading.py, line 937, in _bootstrap
Thread 1 (Thread 0x7fb130de4740
#8 0x00007fb13077fcb5 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb18b980, for file /usr/lib64/python3.9/selectors.py, line 469, in select
#15 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb18b750, for file /usr/lib64/python3.9/asyncio/base_events.py, line 1869, in _run_once
#22 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1092bdd60, for file /usr/lib64/python3.9/asyncio/base_events.py, line 601, in run_forever
#29 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb108869b80, for file /usr/lib64/python3.9/asyncio/base_events.py, line 634, in run_until_complete
#36 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10886c610, for file /usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py, line 161, in create
#43 0x00007fb13077fcb5 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbafc2180, for file /usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py, line 577, in synchronize
#51 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb0f7350, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 452, in _perform_task
#59 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb109321550, for file /usr/lib64/python3.9/multiprocessing/process.py, line 108, in run
#66 0x00007fb13077e803 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbb0510e0, for file /usr/lib64/python3.9/multiprocessing/process.py, line 315, in _bootstrap
#75 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10933ec40, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 71, in _launch
#82 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10937cd60, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 19, in __init__
#95 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb109314860, for file /usr/lib64/python3.9/multiprocessing/context.py, line 277, in _Popen
#102 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1093146c0, for file /usr/lib64/python3.9/multiprocessing/context.py, line 224, in _Popen
#109 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb109314380, for file /usr/lib64/python3.9/multiprocessing/process.py, line 121, in start
#116 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbafae060, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 318, in supervise_task
#123 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb1093343e0, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 386, in run_forever
#130 0x00007fb13077fcb5 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10937e210, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py, line 28, in worker
#138 0x00007fb13077e803 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb10937d240, for file /usr/lib/python3.9/site-packages/click/core.py, line 760, in invoke
#148 0x00007fb130743233 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb109710900, for file /usr/lib/python3.9/site-packages/click/core.py, line 1404, in invoke
#155 0x00007fb13077fcb5 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x555bbafe42b0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1055, in main
#165 0x00007fb13077e803 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7fb121656cf0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1130, in __call__
I feel https://bugzilla.redhat.com/show_bug.cgi?id=2247646 is a duplicate of this pulp bug, just happening on a different product. I'll leave it to the devs / PrdM to decide whether they want to track it per product or just once.
This appears to have improved. With RHUI 4.6, which uses a newer Pulp version, the outage is handled as follows:

# grep cdn.redhat.com /var/log/messages
Nov 20 07:04:42 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [None])
Nov 20 07:04:42 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [None])
Nov 20 07:06:53 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:06:53 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:09:04 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 1.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:09:04 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 1.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:11:52 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 0.0s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:11:52 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 0.0s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:14:03 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 1.8s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:14:03 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 1.8s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:16:14 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 1.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:16:14 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 1.7s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:18:25 rhua pulpcore-worker[494938]: Backing off download_wrapper(...) for 3.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:18:25 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:INFO: Backing off download_wrapper(...) for 3.4s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:20:40 rhua pulpcore-worker[494938]: Giving up download_wrapper(...) after 5 tries (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:20:40 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: backoff:ERROR: Giving up download_wrapper(...) after 5 tries (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])
Nov 20 07:20:40 rhua pulpcore-worker[494938]: pulp [72d41791d3174160a142a84ba5f3ef8b]: pulpcore.tasking.tasks:INFO: Task 018beb8a-81ff-7d8d-9b13-fb8a6b9127ed failed (Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)])

Note that I kept the iptables rule for more than 90 seconds in this case. And the sync task ended:

rhui (sync) => vr
Select a repository:

Sync status

Repository
1 - Error
    Red Hat Developer Tools (for RHEL 7 Server for ARM 64) RPMs from RHUI (aarch64)
    Cannot connect to host cdn.redhat.com:443 ssl:default [Connect call failed ('2.22.12.251', 443)]
    It will be retried in 6 hours.

If I disable access to the CDN only for a while, the sync resumes after some time.
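The "Backing off ... / Giving up ... after 5 tries" messages come from the `backoff` library's decorator around Pulp's download_wrapper. A hand-rolled sketch of that retry pattern (hypothetical helper, not Pulp's actual code) looks like this:

```python
import asyncio
import random

MAX_TRIES = 5  # matches the "Giving up ... after 5 tries" seen in the log

async def retry_with_backoff(make_request, max_tries=MAX_TRIES, base=0.5):
    """Retry an async request with jittered exponential backoff, then give up.

    `make_request` is any zero-argument coroutine function standing in for
    one download attempt.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return await make_request()
        except ConnectionError:
            if attempt == max_tries:
                raise  # "Giving up ... after N tries"
            # "Backing off ... for X.Xs": random jitter under an
            # exponentially growing cap, which is why the logged delays
            # are not strictly increasing (0.3s, 1.4s, 0.0s, ...).
            delay = random.uniform(0, base * 2 ** (attempt - 1))
            await asyncio.sleep(delay)
```

The jitter explains the otherwise odd-looking sequence of delays in the log; only the upper bound grows exponentially.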
Hmm, maybe I spoke too soon. Although the sync resumes, it doesn't seem to finish; it's now stuck on:

Associating Content: 500 (90%)
Downloading Artifacts: 505

So I wonder if it moved on to download the other artifacts when connectivity was restored but didn't retry those artifacts that were affected. That would mean this bug is still unresolved.
(In reply to Radek Bíba from comment #5)
> Hmm, maybe I spoke too soon. Although the sync resumes, it doesn't seem to
> finish; it's now stuck on:
>
> Associating Content: 500 (90%)
> Downloading Artifacts: 505
>
> So I wonder if it moved on to download the other artifacts when connectivity
> was restored but didn't retry those artifacts that were affected. That would
> mean this bug is still unresolved.

Yes, this was my observation as well. The network issue happened / was mimicked while some packages were being downloaded. Those asyncio events didn't terminate properly (or went wrong in some other way), so those specific packages were never downloaded. Subsequent requests for the remaining packages went fine. But the orchestrator of the work keeps waiting for the stuck downloads to complete - forever.
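The failure mode described above boils down to awaiting a future that can never complete. A minimal asyncio illustration (hypothetical, not Pulp's actual pipeline code): the orchestrator gathers all downloads, and one silently dead download blocks the whole gather unless something bounds it.

```python
import asyncio

async def download(pkg, hang=False):
    """Stand-in for one artifact download."""
    if hang:
        # Simulates a connection that silently died during the outage:
        # no exception is raised, no data arrives, the coroutine never returns.
        await asyncio.Event().wait()
    return pkg

async def orchestrate(timeout=None):
    # Stand-in for the sync pipeline awaiting all downloads. Healthy
    # downloads finish, but gather() still waits on the dead one; without
    # a timeout (the pre-fix behavior) this blocks forever.
    downloads = [download("pkg-a"), download("pkg-b", hang=True)]
    return await asyncio.wait_for(asyncio.gather(*downloads), timeout)
```

With timeout=None this reproduces the hang; a per-download or overall timeout turns it into a failed task that can be retried.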
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHUI 4.9 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2024:4446