Bug 2247646
| Summary: | Capsule sync gets stuck forever after specific connection issues | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Pavel Moravec <pmoravec> |
| Component: | Pulp | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED MIGRATED | QA Contact: | Satellite QE Team <sat-qe-bz-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.12.0 | CC: | ahumbe, dalley, dhjoshi, dkliban, gformisa, ggainey, hyu, jbuckley, jpasqual, mkalyat, rchan, rlavi |
| Target Milestone: | Unspecified | Keywords: | MigratedToJIRA, PrioBumpGSS, Triaged, UserExperience |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-06-06 16:31:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Re-running some scenarios again, pulp behaves quite non-deterministically. Sometimes the BT follows the "Backing off download_wrapper" log, sometimes not - in various scenarios. Sometimes the task gets cancelled due to the "Cannot connect to host" error, sometimes not. So for QE it is worth running the scenarios a few times; the pulp task must (somehow) finish in either case.

From the client, it appears there might be a related issue of the capsules getting stuck in a sync cycle. Maybe I am misunderstanding, but the "failure not being detected, reported or exposing itself" seems like it might be either a detail that should be added to this bugzilla, or maybe a separate bz all by itself.

- We have working and fully functional TCP networking on each of the Capsules and the Satellite server.
- We suspect that there is a transport-layer condition that affects or alters TCP flow in the middle.
- 24-25 capsules work just fine.
- But on the 2-3 that aren’t working for synchronizations, these syncs seem to be able to fail without the failure being detected, reported, or exposing itself to error handling.

(In reply to Jeremiah Buckley from comment #2)
> From the client, it appears there might be a related issue of the capsules
> getting stuck in a sync cycle. Maybe I am misunderstanding, but the "failure
> not being detected, reported or exposing itself" seems like it might be
> either a detail that should be added to this bugzilla, or maybe a
> separate bz all by itself.
>
> - We have working and fully functional TCP networking on each of the
> Capsules and the Satellite server.
> - We suspect that there is a transport-layer condition that affects or
> alters TCP flow in the middle.
> - 24-25 capsules work just fine.
> - But on the 2-3 that aren’t working for synchronizations, these syncs seem
> to be able to fail without the failure being detected, reported, or exposing
> itself to error handling.

I was checking one instance of a stuck capsule for that customer and noticed the "Backing off .." error in /var/log/messages, while the WebUI, the pulp task and the like showed no error. Is that already the "failure not being detected, reported or exposing itself"? Or could you provide a sosreport of the Satellite and the Capsule taken shortly after such an event (attach it to the case and ping me, so I can investigate)?

I feel https://bugzilla.redhat.com/show_bug.cgi?id=2241729 is a duplicate of this pulp bug, just happening on a different product. Leaving it to the devels / PrdM to decide whether they want to track it per-product or just once.

Now we have another very similar user scenario where the sync gets stuck already while downloading metadata, per the core_progressreport PostgreSQL table:
pulp_id | pulp_created | pulp_last_updated | message | state | total | done | suffix | task_id | code
--------------------------------------+-------------------------------+-------------------------------+----------------------------------------------+-----------+-------+-------+--------+--------------------------------------+---------------------------------
7a5db8c8-4622-41da-acf9-3383330e62ef | 2024-02-13 21:50:44.353064+08 | 2024-02-13 21:50:44.353079+08 | Downloading Artifacts | running | | 0 | | 02ccb069-db06-420a-932d-de15d7d2c36b | sync.downloading.artifacts
040627ad-12e1-473d-b03b-c4df62867c2f | 2024-02-13 21:50:44.355208+08 | 2024-02-13 21:50:44.355222+08 | Associating Content | running | | 0 | | 02ccb069-db06-420a-932d-de15d7d2c36b | associating.content
ff4eb533-2bf3-47ba-b01d-ce296a04005d | 2024-02-13 21:50:44.349573+08 | 2024-02-13 23:03:58.765218+08 | Downloading Metadata Files | running | | 6 | | 02ccb069-db06-420a-932d-de15d7d2c36b | sync.downloading.metadata
(and it stays stuck there for many hours)
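As a convenience for spotting this state, here is a rough, read-only sketch (my addition for illustration, not from the original report) that lists progress reports still "running" but not updated for hours, based on the core_progressreport columns shown above. The database name, DB user and socket directory are assumptions and may differ on a given Satellite/Capsule:

```python
#!/usr/bin/env python3
# Illustrative helper: list progress reports that are still "running" but have
# not been updated for a long time. dbname, user and host (socket dir) are
# assumptions - adjust to the local setup.
import psycopg2

conn = psycopg2.connect(dbname="pulpcore", user="pulp", host="/var/run/postgresql")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT task_id, code, state, done, pulp_last_updated
          FROM core_progressreport
         WHERE state = 'running'
           AND pulp_last_updated < now() - interval '2 hours'
         ORDER BY pulp_last_updated
        """
    )
    for task_id, code, state, done, updated in cur.fetchall():
        print(f"{task_id}  {code}  {state}  done={done}  last update {updated}")
```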
.. and a third case. The bug has been there since 6.12 already (no surprise, though).

This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it and begin with "SAT-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
Description of problem:
After specific temporary Sat<->Caps connection issues, Capsule sync gets stuck forever, pending on artifact downloads that will never happen.

Particular scenarios for the connectivity issues:
- drop packets between Sat and Caps for a while (2-3 minutes is enough)
- reject packets between them for a specific amount of time (timing is crucial, see the reproducer below)
- optionally, apply connection rate limiting first, and then drop packets (or reject them for a specific amount of time)

I *think* the reproducer is applicable to Satellite (syncing from some upstream repo) as well, since the scenarios are quite generic - just restrict connectivity during a repo sync task.

Version-Release number of selected component (if applicable):
Sat 6.12 / python39-pulpcore-3.18.16-1.el8pc.noarch (I expect any Sat / pulp3 version is affected)

How reproducible:
100%

Steps to Reproduce:
1. Have a Capsule with the Immediate download policy and no content downloaded.
2. Have one CV with one bigger repo (say, RHEL9 BaseOS) so there is syncable content for the Capsule (optionally, do a Complete Sync of that one repo after purging all repos and content from the Capsule).
3. Launch the Capsule (Complete) Sync.
4. Check that the dynflow step tracks a pulp task that is downloading packages, like:
   message: Downloading Artifacts
   code: sync.downloading.artifacts
   state: running
   done: 259
5. While the package download is progressing, run *one* of the scenarios below to mimic connectivity issues (run on the Capsule):

5a. drop packets to/from Satellite:
SATELLITE=ip.address.of.satellite
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5b. reject packets to/from Satellite:
SATELLITE=ip.address.of.satellite
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

5c. apply rate limiting and drop packets over the limit:
SATELLITE=ip.address.of.satellite
LIMIT=10/s
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A OUTPUT -p tcp -d $SATELLITE -j DROP
iptables -A INPUT -p tcp -s $SATELLITE -j DROP

5d. apply rate limiting and reject packets over the limit:
SATELLITE=ip.address.of.satellite
LIMIT=10/s
iptables -F
iptables -A OUTPUT -p tcp -d $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A INPUT -p tcp -s $SATELLITE -m limit --limit $LIMIT -j ACCEPT
iptables -A OUTPUT -p tcp -d $SATELLITE -j REJECT
iptables -A INPUT -p tcp -s $SATELLITE -j REJECT

6. Wait *some* time, at least until the Capsule logs something like:
Nov 2 13:43:36 pmoravec-caps612 pulpcore-worker-3[72785]: Backing off download_wrapper(...) for 0.3s (aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pmoravec-sat612.domain.redhat.com:443 ssl:default [Connect call failed ('1.2.3.4', 443)])
Once that is logged, run "iptables -F" (to stop the temporary connectivity problem you mimic).
7. Monitor the progress of the pulp task - it will download packages at the usual high speed now (see the polling sketch below).
8. Once the progress does not move for some time, check the pulp task status - either it is done (as is the whole Caps sync), or it is stuck forever (waiting on the download of the few packages affected by the "Backing off download_wrapper" error, which will never complete).
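To make steps 7.-8. less manual, a rough polling sketch (my addition for illustration, not part of the original reproducer) can watch the pulp task until it either finishes or stops making progress. The base URL, basic-auth credentials and task UUID below are placeholders; on a Katello-managed Capsule you may need the client certificate instead of basic auth:

```python
#!/usr/bin/env python3
# Hypothetical monitoring helper - flags a pulp task that stays "running"
# while its progress counters stop moving. All connection details below are
# placeholders/assumptions, not values from this bug report.
import time
import requests

BASE = "https://capsule.example.com"                                     # assumed Capsule FQDN
TASK_HREF = "/pulp/api/v3/tasks/02ccb069-db06-420a-932d-de15d7d2c36b/"   # task UUID from dynflow
AUTH = ("admin", "changeme")                                             # assumed API credentials

last_done = None
stalled_since = None
while True:
    task = requests.get(BASE + TASK_HREF, auth=AUTH, verify=False).json()
    if task["state"] in ("completed", "failed", "canceled"):
        print(f"task finished in state: {task['state']}")
        break
    done = sum(pr.get("done") or 0 for pr in task.get("progress_reports", []))
    if done == last_done:
        stalled_since = stalled_since or time.time()
        if time.time() - stalled_since > 1800:                           # no movement for 30 minutes
            print(f"task still '{task['state']}' but done counter stuck at {done}")
            break
    else:
        last_done, stalled_since = done, None
    time.sleep(60)
```

The idea is simply to flag a task that stays "running" while its done counters stop moving, which is exactly the symptom described in step 8.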
Actual results:
Scenario 5b. leads to a failed Caps sync that is unable to download some package - that is expected. /var/log/messages contains a backtrace about it (BT in Additional info).

Scenarios 5a., 5c. and sometimes 5d. lead to a stuck Capsule sync. 5d. gets stuck only when iptables are flushed *before* the exception from the "BT in Additional info" is raised. If you are too slow, pulp detects too many connection problems and cancels the sync like in Scenario 5b. Per my tests, the BT from Additional info pops up 13 seconds after the "Backing off download_wrapper" message, so those 13s are the window of opportunity to hit the bug in Scenario 5d. Scenarios 5a. and 5c. (packet drop) seem to get stuck every time (just flush iptables *after* the "Backing off .." error, any time later on).

(I *think* Scenario 5b. can mean that if you trigger a Capsule Sync, restart some Satellite services and are unlucky, you hit exactly 5b. and the Capsule gets stuck forever - that can be difficult to troubleshoot.)

Expected results:
The pulp task should somehow terminate in reasonable time in any scenario. Even canceling the task due to temporary connection problems is definitely a better outcome than a Capsule sync stuck forever.

Additional info:

BT from pulp task giving up sync attempts (expected behaviour):

Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: result = func(*args, **kwargs)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 576, in synchronize
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: repo_version = dv.create() or repo.latest_version()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: loop.run_until_complete(pipeline)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return future.result()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*futures)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await self.run()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 183, in run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: pb.done += task.result() # download_count
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 209, in _handle_content_unit
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: await asyncio.gather(*downloaders_for_content)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/models.py", line 89, in download
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: download_result = await downloader.run(extra_data=self.extra_data)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 273, in run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await download_wrapper()
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/backoff/_async.py", line 133, in retry
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: ret = await target(*args, **kwargs)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulpcore/download/http.py", line 258, in download_wrapper
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: return await self._run(extra_data=extra_data)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/downloaders.py", line 114, in _run
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: async with self.session.get(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 1138, in __aenter__
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: self._resp = await self._coro
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/client.py", line 535, in _request
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: conn = await self._connector.connect(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: proto = await self._create_connection(req, traces, timeout)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: _, proto = await self._create_direct_connection(req, traces, timeout)
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise last_exc
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: transp, proto = await self._wrap_create_connection(
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: File "/usr/lib64/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
Nov 2 13:30:21 pmoravec-caps612 pulpcore-worker-4[72181]: raise client_error(req.connection_key, exc) from exc

BT of a worker stuck forever in download attempts:

Thread 1 (Thread 0x7ff0931f1740
#8 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328068b10, for file /usr/lib64/python3.9/selectors.py, line 469, in select
#15 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280688e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 1869, in _run_once
#22 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4469a0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 601, in run_forever
#29 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4465e0, for file /usr/lib64/python3.9/asyncio/base_events.py, line 634, in run_until_complete
#36 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a461dd0, for file /usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py, line 161, in create
#43 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327fb7990, for file /usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py, line 576, in synchronize
#51 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559328051300, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 410, in _perform_task
#59 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a431040, for file /usr/lib64/python3.9/multiprocessing/process.py, line 108, in run
#66 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x5593280509b0, for file /usr/lib64/python3.9/multiprocessing/process.py, line 315, in _bootstrap
#75 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a492240, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 71, in _launch
#82 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c8040, for file /usr/lib64/python3.9/multiprocessing/popen_fork.py, line 19, in __init__
#95 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a496380, for file /usr/lib64/python3.9/multiprocessing/context.py, line 277, in _Popen
#102 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4961e0, for file /usr/lib64/python3.9/multiprocessing/context.py, line 224, in _Popen
#109 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a483d40, for file /usr/lib64/python3.9/multiprocessing/process.py, line 121, in start
#116 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x55932803e2a0, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 298, in supervise_task
#123 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a46d040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py, line 366, in run_forever
#130 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a4c7040, for file /usr/lib/python3.9/site-packages/pulpcore/tasking/entrypoint.py, line 28, in worker
#138 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a529840, for file /usr/lib/python3.9/site-packages/click/core.py, line 754, in invoke
#148 0x00007ff092b562d3 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff06a5a9580, for file /usr/lib/python3.9/site-packages/click/core.py, line 1395, in invoke
#155 0x00007ff092bea7ca in _PyEval_EvalFrame (throwflag=0, f=Frame 0x559327e5b4f0, for file /usr/lib/python3.9/site-packages/click/core.py, line 1053, in main
#165 0x00007ff092be4e63 in _PyEval_EvalFrame (throwflag=0, f=Frame 0x7ff083448b20, for file /usr/lib/python3.9/site-packages/click/core.py, line 1128, in __call__

(some "futures" object stays incomplete forever, due to an improperly caught exception..?)
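Regarding that last remark: the first backtrace shows download_wrapper() being retried via the backoff library and awaited through asyncio.gather(). Purely as an illustration of the suspected failure mode (this is not pulp's actual code), a connection attempt that neither succeeds nor raises can keep such an await pending indefinitely unless some timeout turns the stall into an exception:

```python
# Minimal, hypothetical illustration - not pulp's code. A backoff-retried
# download coroutine awaited by asyncio.gather(): if the connection attempt
# neither succeeds nor raises (e.g. packets silently dropped), the await can
# block forever unless a timeout bounds it.
import asyncio
import aiohttp
import backoff

@backoff.on_exception(backoff.expo, aiohttp.ClientConnectorError, max_tries=5)  # illustrative retry policy
async def download_wrapper(session, url):
    async with session.get(url) as resp:
        return await resp.read()

async def main(urls):
    # sock_connect/sock_read timeouts (values are illustrative) turn a silent
    # stall into an exception instead of an indefinite await, so the task can
    # at least fail or be retried rather than hang forever.
    timeout = aiohttp.ClientTimeout(sock_connect=30, sock_read=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        await asyncio.gather(*(download_wrapper(session, u) for u in urls))

# asyncio.run(main(["https://satellite.example.com/pulp/content/..."]))  # placeholder URL
```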