Bug 2247646
| Summary: | Capsule sync gets stuck forever after specific connection issues | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED MIGRATED | QA Contact: | Satellite QE Team <sat-qe-bz-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.12.0 | CC: | ahumbe, dalley, dhjoshi, dkliban, gformisa, ggainey, hyu, jbuckley, jpasqual, mkalyat, rchan, rlavi |
| Target Milestone: | Unspecified | Keywords: | MigratedToJIRA, PrioBumpGSS, Triaged, UserExperience |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-06-06 16:31:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Re-running some scenarios again, pulp behaves quite non-deterministically. Sometimes the BT follows the "Backing off download_wrapper" log, sometimes not - in various scenarios. Sometimes the task gets cancelled due to the "Cannot connect to host" error, sometimes not. So for QE it is worth running the scenarios a few times; the pulp task must (somehow) finish in either case.

From the client, it appears there might be a related issue of the capsules getting stuck in a sync cycle. Maybe I am misunderstanding, but the "failure not being detected, reported or exposing itself" seems like it might be either a detail that should be added to this bugzilla, or maybe a separate bz all by itself.

- We have working and fully functional TCP networking on each of the Capsules and the Satellite server.
- We suspect that there is a transport-layer condition that affects or alters TCP flow in the middle.
- 24-25 capsules work just fine.
- But on the 2-3 that aren’t working for synchronizations, these syncs seem to be able to fail without the failure being detected, reported, or exposing itself to error handling.

(In reply to Jeremiah Buckley from comment #2)
> From the client, it appears there might be a related issue of the capsules
> getting stuck in a sync cycle. Maybe I am misunderstanding, but the "failure
> not being detected, reported or exposing itself" seems like it might be
> either a detail that should be added to this bugzilla, or maybe a
> separate bz all by itself.
>
> - We have working and fully functional TCP networking on each of the
> Capsules and the Satellite server.
> - We suspect that there is a transport-layer condition that affects or
> alters TCP flow in the middle.
> - 24-25 capsules work just fine.
> - But on the 2-3 that aren’t working for synchronizations, these syncs seem
> to be able to fail without the failure being detected, reported, or exposing
> itself to error handling.

I was checking one instance of a stuck capsule for that customer and noticed the "Backing off .." error in /var/log/messages, while the WebUI, the pulp task and the like showed no error. Is that already the "failure not being detected, reported or exposing itself"? Or could you provide a sosreport of the Satellite and the Capsule taken shortly after such an event (attach it to the case and ping me, so I can investigate)?

I feel https://bugzilla.redhat.com/show_bug.cgi?id=2241729 is a duplicate of this pulp bug, just happening on a different product. Leaving it to the devels / PrdM to decide whether they want to track it per-product or just once.

Now we have another very similar user scenario where the sync gets stuck already while downloading metadata, per the core_progressreport PostgreSQL table:
pulp_id | pulp_created | pulp_last_updated | message | state | total | done | suffix | task_id | code
--------------------------------------+-------------------------------+-------------------------------+----------------------------------------------+-----------+-------+-------+--------+--------------------------------------+---------------------------------
7a5db8c8-4622-41da-acf9-3383330e62ef | 2024-02-13 21:50:44.353064+08 | 2024-02-13 21:50:44.353079+08 | Downloading Artifacts | running | | 0 | | 02ccb069-db06-420a-932d-de15d7d2c36b | sync.downloading.artifacts
040627ad-12e1-473d-b03b-c4df62867c2f | 2024-02-13 21:50:44.355208+08 | 2024-02-13 21:50:44.355222+08 | Associating Content | running | | 0 | | 02ccb069-db06-420a-932d-de15d7d2c36b | associating.content
ff4eb533-2bf3-47ba-b01d-ce296a04005d | 2024-02-13 21:50:44.349573+08 | 2024-02-13 23:03:58.765218+08 | Downloading Metadata Files | running | | 6 | | 02ccb069-db06-420a-932d-de15d7d2c36b | sync.downloading.metadata
(and it stays stuck there for many hours)
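As a convenience for spotting this state, here is a rough, read-only sketch (my addition for illustration, not from the original report) that lists progress reports still "running" but not updated for hours, based on the core_progressreport columns shown above. The database name, DB user and socket directory are assumptions and may differ on a given Satellite/Capsule:

```python
#!/usr/bin/env python3
# Illustrative helper: list progress reports that are still "running" but have
# not been updated for a long time. dbname, user and host (socket dir) are
# assumptions - adjust to the local setup.
import psycopg2

conn = psycopg2.connect(dbname="pulpcore", user="pulp", host="/var/run/postgresql")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT task_id, code, state, done, pulp_last_updated
          FROM core_progressreport
         WHERE state = 'running'
           AND pulp_last_updated < now() - interval '2 hours'
         ORDER BY pulp_last_updated
        """
    )
    for task_id, code, state, done, updated in cur.fetchall():
        print(f"{task_id}  {code}  {state}  done={done}  last update {updated}")
```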
.. and a third case. The bug has been there since 6.12 already (no surprise, though).

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "SAT-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
Description of problem:
After specific temporary Sat<->Caps connection issues, Capsule sync gets stuck forever, pending on artifact downloads that will never happen.

Particular scenarios for the connectivity issues:
- drop packets between Sat and Caps for a while (2-3 minutes is enough)
- reject packets between them for a specific amount of time (timing is crucial, see the reproducer below)
- optionally, apply connection rate limiting first, and then drop packets (or reject them for a specific amount of time)

I *think* the reproducer is applicable to Satellite (syncing from some upstream repo) as well, since the scenarios are quite generic - just restrict connectivity during a repo sync task.

Version-Release number of selected component (if applicable):
Sat 6.12 / python39-pulpcore-3.18.16-1.el8pc.noarch (I expect any Sat / pulp3 version is affected)

How reproducible:
100%

Steps to Reproduce:
1. Have a Capsule with the Immediate download policy and no content downloaded.
2. Have one CV with one bigger repo (say, RHEL9 BaseOS) so there is syncable content for the Capsule (optionally, do a Complete Sync of that one repo after purging all repos and content from the Capsule).
3. Launch the Capsule (Complete) Sync.
4. Check that the dynflow step tracks a pulp task that is downloading packages, like:
   message: Downloading Artifacts
   code: sync.downloading.artifacts
   state: running
   done: 259
5. While the package download is progressing, run *one* of the scenarios below to mimic connectivity issues (run on the Capsule):

5a. drop packets to/from Satellite:
SATELLITE=ip.address.of.satellite
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5b. reject packets to/from Satellite:
SATELLITE=ip.address.of.satellite
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

5c. apply rate limiting and drop packets over the limit:
SATELLITE=ip.address.of.satellite
LIMIT=10/s
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5d. apply rate limiting and reject packets over the limit:
SATELLITE=ip.address.of.satellite
LIMIT=10/s
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

6. Wait *some* time, at least until the Capsule logs something like:
Nov 2 13:43:36 pmoravec-caps612 pulpcore-worker-3[72785]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pmoravec-sat612.domain.redhat.com:443 ssl:default [Connect call failed ('1.2.3.4', 443)])
Once that is logged, run "iptables -F" (to stop the temporary connectivity problem you mimic).
7. Monitor the progress of the pulp task - it will download packages at the usual high speed now (see the polling sketch below).
8. Once the progress does not move for some time, check the pulp task status - either it is done (as is the whole Caps sync), or it is stuck forever (waiting on the download of the few packages affected by the "Backing off download_wrapper" error, which will never complete).
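To make steps 7.-8. less manual, a rough polling sketch (my addition for illustration, not part of the original reproducer) can watch the pulp task until it either finishes or stops making progress. The base URL, basic-auth credentials and task UUID below are placeholders; on a Katello-managed Capsule you may need the client certificate instead of basic auth:

```python
#!/usr/bin/env python3
# Hypothetical monitoring helper - flags a pulp task that stays "running"
# while its progress counters stop moving. All connection details below are
# placeholders/assumptions, not values from this bug report.
import time
import requests

BASE = "https://capsule.example.com"                                     # assumed Capsule FQDN
TASK_HREF = "/pulp/api/v3/tasks/02ccb069-db06-420a-932d-de15d7d2c36b/"   # task UUID from dynflow
AUTH = ("admin", "changeme")                                             # assumed API credentials

last_done = None
stalled_since = None
while True:
    task = requests.get(BASE + TASK_HREF, auth=AUTH, verify=False).json()
    if task["state"] in ("completed", "failed", "canceled"):
        print(f"task finished in state: {task['state']}")
        break
    done = sum(pr.get("done") or 0 for pr in task.get("progress_reports", []))
    if done == last_done:
        stalled_since = stalled_since or time.time()
        if time.time() - stalled_since > 1800:                           # no movement for 30 minutes
            print(f"task still '{task['state']}' but done counter stuck at {done}")
            break
    else:
        last_done, stalled_since = done, None
    time.sleep(60)
```

The idea is simply to flag a task that stays "running" while its done counters stop moving, which is exactly the symptom described in step 8.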
Actual results:
Scenario 5b. leads to a failed Caps sync that is unable to download some package - that is expected. /var/log/messages contains a backtrace about it (BT in Additional info).

Scenarios 5a., 5c. and sometimes 5d. lead to a stuck Capsule sync. 5d. gets stuck only when iptables are flushed *before* the exception from the "BT in Additional info" is raised. If you are too slow, pulp detects too many connection problems and cancels the sync like in Scenario 5b. Per my tests, the BT from Additional info pops up 13 seconds after the "Backing off download_wrapper" message, so those 13s are the window of opportunity to hit the bug in Scenario 5d. Scenarios 5a. and 5c. (packet drop) seem to get stuck every time (just flush iptables *after* the "Backing off .." error, any time later on).

(I *think* Scenario 5b. can mean that if you trigger a Capsule Sync, restart some Satellite services and are unlucky, you hit exactly 5b. and the Capsule gets stuck forever - that can be difficult to troubleshoot.)

Expected results:
The pulp task should somehow terminate in reasonable time in any scenario. Even canceling the task due to temporary connection problems is definitely a better outcome than a Capsule sync stuck forever.

Additional info:

BT from pulp task giving up sync attempts (expected behaviour):

Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: result = func(*args, **kwargs)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 576, in synchronize
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: repo_version = dv.create() or repo.latest_version()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: loop.run_until_complete(pipeline)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return future.result()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*futures)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await self.run()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 183, in run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: pb.done += task.result() # download_count
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 209, in _handle_content_unit
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*downloaders_for_content)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/models.py", line 89, in download
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: download_result = await downloader.run(extra_data=self.extra_data)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 273, in run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await download_wrapper()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/backoff/_async.py", line 133, in retry
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: ret = await target(*args, **kwargs)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 258, in download_wrapper
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await self._run(extra_data=extra_data)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 114, in _run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: async with self.session.get(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 1138, in __aenter__
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: self._resp = await self._coro
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 535, in _request
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: conn = await self._connector.connect(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: proto = await self._create_connection(req, traces, timeout)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: _, proto = await self._create_direct_connection(req, traces, timeout)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise last_exc
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: transp, proto = await self._wrap_create_connection(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise client_error(req.connection_key, exc) from exc

BT of a worker stuck forever in download attempts:

Thread 1 (Thread 0x7ff0931f1740
#8 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328068b10, for file /usr/lib64/python3.9/selectors.py, line 469, in select
#15 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280688e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 1869, in _run_once
#22 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4469a0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 601, in run_forever
#29 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4465e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 634, in run_until_complete
#36 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a461dd0, for file /usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py, line 161, in create
#43 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327fb7990, for file /usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py, line 576, in synchronize
#51 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328051300, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 410, in _perform_task
#59 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a431040, for file /usr/lib64/python3.9/multiprocessing/process.py, line 108, in run
#66 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280509b0, for file /usr/lib64/python3.9/multiprocessing/process.py, line 315, in _bootstrap
#75 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a492240, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 71, in _launch
#82 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c8040, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 19, in __init__
#95 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a496380, for file /usr/lib64/python3.9/multiprocessing/context.py, line 277, in _Popen
#102 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4961e0, for file /usr/lib64/python3.9/multiprocessing/context.py, line 224, in _Popen
#109 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a483d40, for file /usr/lib64/python3.9/multiprocessing/process.py, line 121, in start
#116 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x55932803e2a0, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 298, in supervise_task
#123 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a46d040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 366, in run_forever
#130 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c7040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py, line 28, in worker
#138 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a529840, for file /usr/lib/python3.9/site-packages/click/core.py, line 754, in invoke
#148 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a5a9580, for file /usr/lib/python3.9/site-packages/click/core.py, line 1395, in invoke
#155 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327e5b4f0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1053, in main
#165 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff083448b20, for file /usr/lib/python3.9/site-packages/click/core.py, line 1128, in __call__

(some "futures" object stays incomplete forever, due to an improperly caught exception..?)
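Regarding that last remark: the first backtrace shows download_wrapper() being retried via the backoff library and awaited through asyncio.gather(). Purely as an illustration of the suspected failure mode (this is not pulp's actual code), a connection attempt that neither succeeds nor raises can keep such an await pending indefinitely unless some timeout turns the stall into an exception:

```python
# Minimal, hypothetical illustration - not pulp's code. A backoff-retried
# download coroutine awaited by asyncio.gather(): if the connection attempt
# neither succeeds nor raises (e.g. packets silently dropped), the await can
# block forever unless a timeout bounds it.
import asyncio
import aiohttp
import backoff

@backoff.on_exception(backoff.expo, aiohttp.ClientConnectorError, max_tries=5)  # illustrative retry policy
async def download_wrapper(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main(urls):
    # sock_connect/sock_read timeouts (values are illustrative) turn a silent
    # stall into an exception instead of an indefinite await, so the task can
    # at least fail or be retried rather than hang forever.
    timeout = aiohttp.ClientTimeout(sock_connect=30, sock_read=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        await asyncio.gather(*(download_wrapper(session, u) for u in urls))

# asyncio.run(main(["https://satellite.example.com/pulp/content/..."]))  # placeholder URL
```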