Bug 2062526 - Another deadlock issue when syncing repos with high concurrency
Summary: Another deadlock issue when syncing repos with high concurrency
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.10.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high (2 votes)
Target Milestone: 6.13.0
Assignee: satellite6-bugs
QA Contact: Vladimír Sedmík
URL:
Whiteboard:
Duplicates: 2119888 (view as bug list)
Depends On:
Blocks:
 
Reported: 2022-03-10 04:55 UTC by Hao Chang Yu
Modified: 2023-11-07 04:25 UTC (History)
CC List: 39 users

Fixed In Version: python-pulpcore-3.21.2-1, python-pulpcore-3.18.11-1, python-pulpcore-3.16.15-1
Doc Type: Known Issue
Doc Text:
Clone Of:
Clones: 2082209 2166750 2170639 (view as bug list)
Environment:
Last Closed: 2023-05-03 13:21:11 UTC
Target Upstream Version:
Embargoed:


Attachments
RHEL 7 Hotfix RPM for Satellite 6.11.4 (692.07 KB, application/x-rpm) - 2023-02-03 16:25 UTC, Ian Ballou
RHEL 8 Hotfix RPM for Satellite 6.11.4 (662.82 KB, application/x-rpm) - 2023-02-03 16:26 UTC, Ian Ballou


Links
System ID Private Priority Status Summary Last Updated
Github pulp pulpcore issues 2420 0 None closed bulk_create() deadlock 2022-04-05 17:14:56 UTC
Github pulp pulpcore issues 2430 0 None closed bulk_update() in content-stages can cause (very rare) deadlock 2022-04-07 18:11:22 UTC
Github pulp pulpcore issues 3111 0 None closed bulk_update() can still deadlock in content_stages 2022-11-02 14:03:40 UTC
Github pulp pulpcore issues 3192 0 None closed content_stages deadlock - Once More Unto The Breach 2022-11-02 14:03:38 UTC
Github pulp pulpcore issues 3284 0 None closed contentartifact bulk_update can still deadlock 2022-11-02 14:03:35 UTC
Red Hat Issue Tracker SAT-12593 0 None None None 2022-08-30 15:24:05 UTC
Red Hat Issue Tracker SAT-15718 0 None None None 2023-05-11 11:22:34 UTC
Red Hat Knowledge Base (Solution) 6718181 0 None None None 2022-08-10 05:19:33 UTC
Red Hat Product Errata RHSA-2023:2097 0 None None None 2023-05-03 13:21:23 UTC

Description Hao Chang Yu 2022-03-10 04:55:14 UTC
Description of problem:
We fixed a deadlock issue when updating timestamps in bug #2043710. Now there is another deadlock error when bulk-updating artifacts. Perhaps we need to fix all the places that use bulk_update to completely rule out the possibility of deadlock.


pulpcore.tasking.pulpcore_worker:INFO: Task 01370352-5e23-454c-b44c-cf6c59f738fa failed (deadlock detected
DETAIL:  Process 13970 waits for ShareLock on transaction 3618061; blocked by process 13969.
Process 13969 waits for ShareLock on transaction 3618064; blocked by process 13970.
HINT:  See server log for query details.
CONTEXT:  while rechecking updated tuple (314,15) in relation "core_contentartifact"
)
pulpcore.tasking.pulpcore_worker:INFO:   File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 339, in _perform_task
result = func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 491, in synchronize
version = dv.create()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/declarative_version.py", line 151, in create
loop.run_until_complete(pipeline)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
await asyncio.gather(*futures)
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
await self.run()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/content_stages.py", line 160, in run
ContentArtifact.objects.bulk_update(to_update_ca_bulk, ["artifact"])
File "/usr/lib/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 525, in bulk_update
self.filter(pk__in=pks).update(**update_kwargs)
File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 741, in update
rows = query.get_compiler(self.db).execute_sql(CURSOR)
File "/usr/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1471, in execute_sql
cursor = super().execute_sql(result_type)
File "/usr/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
cursor.execute(sql, params)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 67, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
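
To illustrate, the deadlock-prone shape boils down to something like the sketch below (illustration only, not a supported reproducer; it assumes a pulpcore django shell, e.g. via pulpcore-manager shell, where the models are importable):

=====
# Two transactions bulk_update overlapping ContentArtifact rows without
# agreeing on an order. Each UPDATE locks rows as it reaches them, so each
# transaction can end up holding locks the other still needs -- the
# "deadlock detected ... core_contentartifact" error above.
import threading

from django.db import transaction
from pulpcore.app.models.content import ContentArtifact

cas = list(ContentArtifact.objects.order_by("pulp_id")[:1000])

def unordered_bulk_update(ca_list):
    with transaction.atomic():
        # batch_size=500 means two UPDATE statements per transaction here,
        # issued in the order of ca_list.
        ContentArtifact.objects.bulk_update(ca_list, ["artifact"], batch_size=500)

threads = [
    threading.Thread(target=unordered_bulk_update, args=(cas,)),
    threading.Thread(target=unordered_bulk_update, args=(list(reversed(cas)),)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
=====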

Comment 2 Hao Chang Yu 2022-03-14 10:41:08 UTC
I can reproduce another, similar deadlock that happens during "bulk_create".

I think both deadlocks only happen when the download policy of the repositories on the Capsule is set to "immediate", because Pulp then creates/updates the artifact data in bulk.

Steps to reproduce:

Note: Ensure the Capsule is set up with 8 pulpcore workers

1) Enable and sync a large Red Hat repository. I use "Red Hat Software Collections RPMs for Red Hat Enterprise Linux 7 Server x86_64 7Server" in my reproducer.
2) Set the Capsule download policy to "immediate".
3) Create a new lifecycle environment named 'devel'. Add only this environment to the Capsule and remove all other lifecycle environments from it.
4) On the Settings page, Content tab, set "Sync Capsules after Content View promotion" to "No"
5) Create 1 content view and attach repo in step (1) to it.
6) Make 7 copies of the content view.
7) Publish and promote all 8 content views to 'devel' environment.
8) Trigger a complete capsule sync.


Expected results:
No deadlocks


Actual results:
Multiple deadlocks occur as below:

  error:
    traceback: |2
        File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 339, in _perform_task
          result = func(*args, **kwargs)
        File "/usr/lib/python3.6/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 491, in synchronize
          version = dv.create()
        File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/declarative_version.py", line 151, in create
          loop.run_until_complete(pipeline)
        File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
          return future.result()
        File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
          await asyncio.gather(*futures)
        File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
          await self.run()
        File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 241, in run
          d_artifact.artifact for d_artifact in da_to_save
        File "/usr/lib/python3.6/site-packages/pulpcore/app/models/content.py", line 84, in bulk_get_or_create
          return super().bulk_create(objs, batch_size=batch_size)
        File "/usr/lib/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
          return getattr(self.get_queryset(), name)(*args, **kwargs)
        File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 468, in bulk_create
          self._batched_insert(objs_with_pk, fields, batch_size, ignore_conflicts=ignore_conflicts)
        File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 1204, in _batched_insert
          ignore_conflicts=ignore_conflicts,
        File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 1186, in _insert
          return query.get_compiler(using=using).execute_sql(return_id)
        File "/usr/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1377, in execute_sql
          cursor.execute(sql, params)
        File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 67, in execute
          return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
        File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
          return executor(sql, params, many, context)
        File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
          return self.cursor.execute(sql, params)
        File "/usr/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
          raise dj_exc_value.with_traceback(traceback) from exc_value
        File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
          return self.cursor.execute(sql, params)
    description: |
      deadlock detected
      DETAIL:  Process 24672 waits for ShareLock on transaction 34053354; blocked by process 24652.
      Process 24652 waits for ShareLock on transaction 34053353; blocked by process 24672.
      HINT:  See server log for query details.
      CONTEXT:  while inserting index tuple (3,5) in relation "core_artifact_sha256_key"
  worker: "/pulp/api/v3/workers/55462376-7328-4c42-b644-0d58b3f7e873/"

Comment 3 Hao Chang Yu 2022-03-16 08:07:04 UTC
Forgot to provide the pg log details:

2022-03-14 20:10:24 AEST ERROR:  deadlock detected
2022-03-14 20:10:24 AEST DETAIL:  Process 24672 waits for ShareLock on transaction 34053354; blocked by process 24652.
        Process 24652 waits for ShareLock on transaction 34053353; blocked by process 24672.
        Process 24672: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('c9e2704c-0eba-4cb4-9580-bd2219600973'::uuid, '2022-03-14T10:10:22.763785+00:00'::timestamptz, '2022-03-14T10:10:22.763823+00:00'::timestamptz, 'artifact/04/e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', 93712, '1f84f08e6e3c605994c144036d237787', 'f7c701d5280d09b2e988dcaabe6823b561255c35', '6ed8480a9c89a673baf58150366bfd08326004c33ade2152c9548ab7', '04e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', '5dde72af5557855a1483aafdee29d33da6daa66428267e95b15c7ddb99abe469c4430c45e4e927d5526bc04b2480da68', 'dbbbb267bb9a913e0996e73a6697da898e3dae47e24d63ef49641509a99ed7ce2835e567b8fcc1dcae5497d2a9b7783916a45fd135cb81ee1a9d903f97e07207', '2022-03-14T10:10:22.768579+00:00'::timestamptz), ('b7db0464-69ec-40cb-b677-8959ec2bab91'::uuid, '2022-03-14T10:10:22.768624+00:00'::timestamptz, '2022-03-14T10:10:22.768636

        Process 24652: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('3c2a12a5-139f-4a1b-afb3-116c04fd24d2'::uuid, '2022-03-14T10:10:22.808572+00:00'::timestamptz, '2022-03-14T10:10:22.808631+00:00'::timestamptz, 'artifact/50/7a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 201448, '31e0e7891c42e1d65531369ebae5fce3', '87ddb611dcc983eb5e51a8b9e8ec3b52471b9aa7', 'b26f083064b4739fd2a51a2820be2cab931c8f42e5ee7c63a1924ce0', '507a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 'cb245087311bd0a0a2e73b41cbf7d6d3723d102ef5a1f906839bdf80f95a11919c48f2b64b41d358c684035f12572848', '6d8ef5a2f13601174147acb0c3b8ce69052a9a155e36586a596cb88822cf84a8cec814bbcdae6c359b2d74669e2f265aeb83f86c0172c23468015ea3f55a9a31', '2022-03-14T10:10:22.814954+00:00'::timestamptz), ('c798aee7-7152-4205-ac7c-1b639a4e3abe'::uuid, '2022-03-14T10:10:22.815036+00:00'::timestamptz, '2022-03-14T10:10:22.81506
2022-03-14 20:10:24 AEST HINT:  See server log for query details.
2022-03-14 20:10:24 AEST CONTEXT:  while inserting index tuple (3,5) in relation "core_artifact_sha256_key"


2022-03-14 20:10:26 AEST ERROR:  deadlock detected
2022-03-14 20:10:26 AEST DETAIL:  Process 24680 waits for ShareLock on transaction 34053354; blocked by process 24652.
        Process 24652 waits for ShareLock on transaction 34053358; blocked by process 24680.
        Process 24680: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('fc1621f8-dce2-4ff9-b0f6-590af82a609d'::uuid, '2022-03-14T10:10:23.458319+00:00'::timestamptz, '2022-03-14T10:10:23.458388+00:00'::timestamptz, 'artifact/04/e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', 93712, '1f84f08e6e3c605994c144036d237787', 'f7c701d5280d09b2e988dcaabe6823b561255c35', '6ed8480a9c89a673baf58150366bfd08326004c33ade2152c9548ab7', '04e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', '5dde72af5557855a1483aafdee29d33da6daa66428267e95b15c7ddb99abe469c4430c45e4e927d5526bc04b2480da68', 'dbbbb267bb9a913e0996e73a6697da898e3dae47e24d63ef49641509a99ed7ce2835e567b8fcc1dcae5497d2a9b7783916a45fd135cb81ee1a9d903f97e07207', '2022-03-14T10:10:23.465319+00:00'::timestamptz), ('b8b49a74-7a79-4812-b86d-72114529cdb0'::uuid, '2022-03-14T10:10:23.465392+00:00'::timestamptz, '2022-03-14T10:10:23.465421
        Process 24652: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('3c2a12a5-139f-4a1b-afb3-116c04fd24d2'::uuid, '2022-03-14T10:10:22.808572+00:00'::timestamptz, '2022-03-14T10:10:22.808631+00:00'::timestamptz, 'artifact/50/7a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 201448, '31e0e7891c42e1d65531369ebae5fce3', '87ddb611dcc983eb5e51a8b9e8ec3b52471b9aa7', 'b26f083064b4739fd2a51a2820be2cab931c8f42e5ee7c63a1924ce0', '507a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 'cb245087311bd0a0a2e73b41cbf7d6d3723d102ef5a1f906839bdf80f95a11919c48f2b64b41d358c684035f12572848', '6d8ef5a2f13601174147acb0c3b8ce69052a9a155e36586a596cb88822cf84a8cec814bbcdae6c359b2d74669e2f265aeb83f86c0172c23468015ea3f55a9a31', '2022-03-14T10:10:22.814954+00:00'::timestamptz), ('c798aee7-7152-4205-ac7c-1b639a4e3abe'::uuid, '2022-03-14T10:10:22.815036+00:00'::timestamptz, '2022-03-14T10:10:22.81506
2022-03-14 20:10:26 AEST HINT:  See server log for query details.
2022-03-14 20:10:26 AEST CONTEXT:  while inserting index tuple (12,15) in relation "core_artifact_sha256_key"

Comment 4 Hao Chang Yu 2022-03-16 08:18:20 UTC
To prevent the deadlock during bulk_create, I think we need to enforce the order of the inserts too. Maybe sorting the objects before passing them to bulk_create will work?
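
Something like the following (a hypothetical sketch only, not the change that actually landed upstream):

=====
# Sort before bulk_create so that every worker inserts overlapping Artifacts
# in the same (sha256) order; concurrent syncs then queue on the unique index
# instead of deadlocking on it.
from pulpcore.app.models.content import Artifact

def ordered_bulk_create_artifacts(artifacts, batch_size=500):
    ordered = sorted(artifacts, key=lambda a: a.sha256)
    # ignore_conflicts lets the "losers" of a duplicate-sha256 race proceed
    # without an IntegrityError (Django's ON CONFLICT DO NOTHING).
    return Artifact.objects.bulk_create(ordered, batch_size=batch_size, ignore_conflicts=True)
=====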

Comment 6 Hao Chang Yu 2022-03-25 07:05:13 UTC
(In reply to Grant Gainey from comment #5)
> @hyu, @pmoravec - I could use some eyes on a) the
> reproducer-script in the associated github issue (2420), and b) the fix in
> its associated PR (#2421). The script will force a deadlock every time if
> you a) have >10 workers, and b) aren't just doing get_or_create on existing
> Artifacts each time. With the PR< I haven't been able to recreate again.
> But, with deadlocks, More Eyes are *always* better.

Hi Grant

Regarding PR #2421:
1) In my opinion, the sorting should be done inside "bulk_get_or_create" so that it is safe to call from everywhere.
2) IIUC, each stage processes 500 records per batch, so we can only guarantee that each batch of 500 records is created in order; I think a deadlock could still happen, with very low probability.
3) The PR only fixes the issue in #comment 2, not the one in #comment 0, which I am still unable to reproduce.

Regards
Hao

Comment 7 Grant Gainey 2022-03-25 12:05:43 UTC
[NOTE: I'm marking the discussion(s) here public - there's no sensitive info in them, and they may save some other poor soul in the future from having to wonder how we arrived at our fix(es)]

(In reply to Hao Chang Yu from comment #6)
> (In reply to Grant Gainey from comment #5)
> > @hyu, @pmoravec - I could use some eyes on a) the
> > reproducer-script in the associated github issue (2420), and b) the fix in
> > its associated PR (#2421). The script will force a deadlock every time if
> > you a) have >10 workers, and b) aren't just doing get_or_create on existing
> > Artifacts each time. With the PR< I haven't been able to recreate again.
> > But, with deadlocks, More Eyes are *always* better.
> 
> Hi Grant
> 
> Regarding PR #2421:
> 1) In my opinion, the sorting should be done inside "bulk_get_or_create" so
> that it will be safe to call everywhere.

You can't sort "inside", because you have to know what to sort on, which is known only by the caller. We only use this in two places currently, and the other one is already protected against this deadlock: 

https://github.com/pulp/pulpcore/blob/main/pulpcore/plugin/stages/content_stages.py#L165

> 2) IIUC, each stage is processing 500 records per batch so we are only able
> to guarantee each 500 records are created in order so I think deadlock could
> still happen in a very low possibility.

As long as the batch is processed as a single transaction (which I believe it is), and as long as every batch uses the same sort-algorithm, we're good here. If the entire set was one gigantic transaction of multiple batches, then you're right - but if that were the case, *none* of the deadlock-fixes we've done over the last few months would have worked.

> 3) The PR only fixing the issue in #comment 2 but not #comment 0 which I am
> still unable to reproduce it.

Oh ugh, you are correct, I managed to miss that there are two different sources here. The first tb comes from here:

https://github.com/pulp/pulpcore/blob/3.14/pulpcore/plugin/stages/content_stages.py#L160

We know that postgres updates in whatever order it feels like, unless you do the select_for_update() trick we ironed out while fixing touch(). However, in this case, the skipLocks=true trick may NOT be appropriate, because this update actually matters (unlike the touch() timestamp). I need to understand exactly how/where we're colliding.

On the plus side, we only rely on bulk_update() in two places - the one above, and

https://github.com/pulp/pulpcore/blob/main/pulpcore/app/tasks/reclaim_space.py#L63

which isn't in 3.14, and which I believe is only ever called from one worker.

I've opened github issue #2430 for this problem, and will start working on figuring out if we even *can* fix it. Stay tuned. 

G

Comment 8 Grant Gainey 2022-03-25 21:52:09 UTC
@hyu I am having zero luck attempting to force a deadlock on bulk_update(), even with code deliberately designed to Behave Badly. I'll attach the latest attempt; maybe you can see a way to make it more-deadlock-prone than I can.

In the place where we're calling bulk_update() in content_stages, we know that the Content and ContentArtifact entities already exist in the database, so all we need to do is update them after associating the ContentArtifact with an Artifact. I'm creating unique artifacts and content, creating sets of unique ContentArtifact entities and persisting them, and then linking them to the matching Artifacts and updating, each list-update in its own thread. I'm running with 10 workers, have thrown A LOT of concurrency at this machine today, and have not had a single deadlock.

Comment 9 Grant Gainey 2022-03-25 21:53:27 UTC
Created attachment 1868374 [details]
Script (attempting to) force deadlock using bulk_update

Comment 10 Hao Chang Yu 2022-03-28 10:48:56 UTC
I am able to reproduce the deadlock error with the following modified script, but not every time. I think it is a matter of timing, which is hard to pin down.

https://gist.github.com/hao-yu/ffc2408cfebe0ad98c3f6235691a7f09

Comment 11 Grant Gainey 2022-03-28 11:05:18 UTC
(In reply to Hao Chang Yu from comment #10)
> I am able to reproduce the deadlock error with the following modified script
> but not every time. I think it is the matter of timing which is hard to
> tackle.
> 
> https://gist.github.com/hao-yu/ffc2408cfebe0ad98c3f6235691a7f09

Oh-ho! Thank you, the changes are instructive. That gives me a direction to go in, thanks much!

Comment 12 Grant Gainey 2022-03-28 17:46:41 UTC
@hyu @pmoravec I am going to Think Out Loud here, and would appreciate thoughts/comments.

The problem we're (rarely) having here, is that:
* two threads are syncing 
* there exists overlapping content
* The Content/ContentArtifact elements for some of that overlapping content have already been persisted into the DB.
* Both threads notice the above.
* Both threads are attempting to link up the ContentArtifacts, for the Content, with the Artifacts-on-disk that have just been sync'd.
* The threads update these elements in-bulk, and in arbitrary order.
* deadlock

Based on what we learned while fixing the touch() bulk-update problem (see https://hackmd.io/@pulp/touch_deadlock to refresh your minds on details) - select_for_update to enforce ordering fixes the problem.

The downside we ran into, is select_for_update() loses a lot of concurrency, and makes sync quite a bit slower.

The way we fixed that, was to realize that touch() doesn't care "who wins", as long as *somebody* does - so skip_locks got us our concurrency back.

If we knew that the only way to end up with a lock-conflict in this new case, was due to two threads updating Artifact at the same time - I contend that we *also* don't care who wins, *as long as somebody does*. If the conflicting threads are both updating the specific CA with the same Artifact, the result will be the same no matter which thread gets to "win". If the threads are trying to update with different Artifacts, right now whoever happens to execute "last" (assuming no deadlocks) will "win". If we were to use select_for_update(skip_locks=True) then whoever got there *first* would win. This sounds like it would change behavior - but where we are right now, which thread "wins" is a coin-toss, the threads aren't order-enforced in any way. So again, as long as *someone* updates the Artifact, we're where we are now.

The problem, however, is this - in the touch() case, we knew that, no matter who had an entity locked, *or why*, the timestamp field was going to be updated. In this case, we may not be able to "assume" that the right thing is going to happen - the current-lock-holder might have the ContentArtifact locked for some reason *other than* "updating the Artifact pointer".

My first thought was, "damn, we're going to have to just take the concurrency hit".

My second thought went like this:

* there are only three active fields in core_contentartifact: content_id, relative_path, and artifact_id. created and pulp_id are part of creating the entity, and pulp_last_updated changes only in response to one fo these three being changed.
* the content_id/relative_path combo is a unique identifier for CA. If either of them is changing - we're not dealing with the "same" ContentArtifact any more. Pulp asks the question "does a CA exist with Content foo and rel_path bar/blech", and if the answer is no, creates a new one - it doesn't change an existing one.
* that leaves us artifact_id - and I've already decided, I don't care *who* updates it; as long as *someone* wins, we're good.

SO - my tl;dr here is this: I contend that we can address this deadlock issue using the same pattern we used to fix touch(), *including* skip_locks=True to avoid blowing up our sync concurrency.

Y'all have been really involved in deadlock/concurrency issues; I would value any insights you have!
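
To make the above concrete, the pattern would look roughly like this sketch (illustrative only, not the final upstream patch; the skip_locked=True variant discussed above would additionally drop the rows it could not lock from the update, i.e. "first writer wins"):

=====
# Take the row locks for this batch in a deterministic order before
# bulk_update, so two concurrent syncs touching the same ContentArtifacts
# queue behind each other instead of deadlocking.
from django.db import transaction
from pulpcore.app.models.content import ContentArtifact

def update_artifacts_in_order(ca_batch):
    ids = [ca.pulp_id for ca in ca_batch]
    with transaction.atomic():
        locked = (
            ContentArtifact.objects.filter(pulp_id__in=ids)
            .order_by("pulp_id")
            .select_for_update()
        )
        # Evaluating the queryset is what actually issues SELECT ... FOR UPDATE.
        list(locked.values_list("pulp_id", flat=True))
        ContentArtifact.objects.bulk_update(ca_batch, ["artifact"], batch_size=500)
=====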

Comment 13 Hao Chang Yu 2022-03-30 05:45:47 UTC
Actually, I doubt that 'select_for_update' without 'skip_lock' will cause a huge performance trade-off in our case, because we are only updating rows in quite small batches - 500 rows per batch, I think. I guess that if deadlocks rarely happen, then select_for_update might not block other transactions from acquiring locks very often. Maybe we can use the existing script to do some tests.

If select_for_update without skip_lock has only a minimal impact, then I think it is safer to do that, so we don't have to worry about the issue that you mentioned above.

Otherwise, skipping locks sounds OK to me.

Comment 14 Pavel Moravec 2022-04-01 06:14:43 UTC
Grant's rationale sounds sane to me. It just relies on "there is just one field in core_contentartifact we would ever really update". Shouldn't this be documented somewhere, to prevent future issues if an "editable row" is theoretically added to the table?

Comment 15 Hao Chang Yu 2022-04-01 07:09:46 UTC
(In reply to Pavel Moravec from comment #14)
Shall not be
> this somewhere documented to prevent future issues when theoretically adding
> an "editable row" to the table?

+1 for this. There is a chance that someone will make a change that breaks this in the future, such as adding a new column. That is one of the reasons I lean towards not skipping locks, if the price to pay is minimal.

Comment 16 Grant Gainey 2022-04-04 19:10:07 UTC
(note: made previous comments public, there's nothing confidential in them and they may help future deadlock-fixers...)

(In reply to Hao Chang Yu from comment #13)
> Actually I am a bit doubt that 'select_for_update' without 'skip_lock' will
> cause huge performance trade off in our case because we are only updating
> the rows in quite a small batch. 500 rows per batch I think. I guess if a
> deadlock is rarely happened then select_for_update might not always block
> other transactions for acquiring locks. Maybe we can use the existing script
> to do some tests.
> 
> If select_for_update without skip_lock has only minimal impact then I think
> it is safer to do that so we don't have to worry the issue that you
> mentioned above.
> 
> Otherwise, I think skip lock sounds ok to me.

After A LOT of experimentation, the performance difference between skip_locked=True, and Not, is noticeable, but not overwhelming - on the order of 50%, in the pathologically-bad test case I'm using (many threads attempting to update the *exact same* content simultaneously). Given the...unease we share around assuming skip_lock=True being "safe", I'd rather take a small/manageable performance hit and be correct, than otherwise.

See https://github.com/pulp/pulpcore/pull/2478 for applying the fix; comments/critiques welcome!

The script I settled on to gather some timings is this:
=====
from pulpcore.app.models.content import Artifact, ContentArtifact                            
from pulp_file.app.models import FileContent                                                 
from hashlib import sha256                                                                   
from django.db import connection, transaction                                                
from multiprocessing import Process, Array                                                   
import statistics                                                                            
from time import time, sleep                                                                 
import _thread                                                                               
                                                                                             
artifacts = {}  # filename: Artifact                                                         
content = {}  # filename: FileContent                                                        
print(">>>BUILDING CONTENT/CA/ARTIFACTS...")                                                 
for i in range(10000):                                                                       
  path = f'/tmp/{i:06d}.txt'                                                                 
  content_path = f'{i:06d}.txt'                                                              
  with open(path, "w") as f:                                                                 
    f.write(path)                                                                            
  with open(path, "rb") as f:                                                                
    sum256 = sha256(f.read()).hexdigest()                                                    
  attrs = {"relative_path": content_path, "digest": sum256}                                  
  fc = FileContent(**attrs)                                                                  
  fc.save()                                                                                  
  content[path] = fc                                                                         
                                                                                             
  attrs = {"file": path, "sha256": sum256, "size": i}                                        
  a = Artifact(**attrs)                                                                      
  a.save()                                                                                   
  artifacts[path] = a                                                                        
                                                                                             
# lists of ContentArtifacts                                                                  
ca_lists = [[], [], [], [], []]                                                              
#ca_lists = [[]]                                                                             
for k in content.keys():                                                                     
  for i in range(len(ca_lists)):                                                             
    attrs = {"content": fc, "relative_path": f'{i}/{content[k].relative_path}'}              
    ca = ContentArtifact(**attrs)                                                            
    ca.save()                                                                                
    ca.artifact = artifacts[k]                                                               
    ca_lists[i].append(ca) 

OUTER=5                                                                                      
INNER=len(ca_lists)                                                                          
                                                                                             
def bulk_doit(ca_list):                                                          
  print(">>> ENTER...")                                                                      
  ids = [k.pulp_id for k in ca_list]                                                         
  connection.close()                                                                         
  connection.connect()                                                                       
  start = time()                                                                             
  with transaction.atomic():  
    # len() forces the QS to be evaluated in a way that emits the s_f_u - exists() and count() don't                                                               
    len(ContentArtifact.objects.filter(pulp_id__in=ids).only("pulp_id").order_by("pulp_id").select_for_update().values_list())
    ContentArtifact.objects.bulk_update(ca_list, ["artifact"], batch_size=500)               
  end = time()                                                                               
  print(f">>> EXIT {end-start}...")                                     
                                                                                             
for r in range(5):                                                                           
  print("Try round: %s" % str(r + 1))                                                        
  # Ensure there are something to update when rerunning this part                            
  connection.connect()                                                                       
  ContentArtifact.objects.filter().update(artifact_id=None)                                  
  processes = []                                                                             
  for i in range(OUTER):                                                                     
    for j in range(INNER):                                                                   
      p = Process(target=bulk_doit, args=(ca_lists[j],))            
      processes.append(p)                                                                    
      p.start()                                                                              
    sleep(3)                                                                                
  for p in processes:                                                                        
    p.join()                                                                                 
=====

Comment 23 Grant Gainey 2022-05-12 20:01:02 UTC
(In reply to Hao Chang Yu from comment #21)
> Sorry, for the late reply. I was off sick yesterday.
> 
> 1) About 500 ISE errors
> 
> It seems like pulpcore is using a lot of postgres connections here. There
> are 14 pulp content gunicorn processes each has 13 workers. 6 pulp wsgi
> guncorn processes each has 5 workers and 6 pulpcore workers. It appears that
> each worker is using more postgres connections. It could be normal but I am
> not familiar to this. @ggainey any idea why pulp is using so many
> postgres connections?
> 
> # ps -aux | grep -P "postgres: pulp pulpcore.* idle" | wc -l
> 422

I have spent some time experimenting/investigating here, and I can neither explain nor reproduce excessive pgres cnx from pulp. The number of connections open at once will be (pulp-content * workers-per-process) + (pulp-api(wsgi) * workers-per) + pulpcore-workers. For the setup on this reproducer, that would be (14*13) + (6*5) + 6 = 218 connections. That's a lot - but it's not 422.  Typically, when testing high-concurrency situations on my dev-setup, I run with 20-30 pulpcore-workers without hitting max-pgres-cnx problems.

I can't explain why the Satellite is using so many connections - but it's the reason the syncs on the capsule are failing, not the dup-key error, nor a deadlock (which we don't see any of)

> 2) About "psycopg2.errors.UniqueViolation: duplicate key value violates
> unique constraint "core_artifact_sha256_key" error. This appear to be
> multiple same rpms were downloading at the same time and caused race
> condition updating the same row.
> 
> Is it causing the Capsule failed to download the rpms? It seems not. All
> related requests responded with HTTP 200 (see below) and I don't see error
> from Capsule side downloading those rpms.
> 
> I think we need to raise another error to handle this error properly.

Actually, unless we can show the sync **failing** because of this, I suspect this is the database telling us "I tried to create an Artifact that wasn't here when I started, but Some Other Thread created it while I was busy (re)downloading it." This is absolutely something that happens under high-concurrency, overlapping-content situations - each repo is checking "does this already exist", getting a "no", and downloading it at the same time. One of them wins. The others fail. As you note in your investigation, Hao, *the content is there* once the sync completes.

The net is, I wouldn't take a BZ on that traceback, until we can show that it's causing the affected repo to be missing the overlapping content.

@ltran - any chance we can restart the reproducer-machines here? Whatever is going on with the pgres-connections, it's not "expected pulp behavior", and seems to be connected to this particular configuration/setup.

Comment 24 pulp-infra@redhat.com 2022-05-17 10:01:16 UTC
Requesting needsinfo from upstream developer dkliban, ggainey because the 'FailedQA' flag is set.

Comment 25 pulp-infra@redhat.com 2022-05-17 10:01:20 UTC
Requesting needsinfo from upstream developer dkliban, ggainey because the 'FailedQA' flag is set.

Comment 30 Grant Gainey 2022-06-07 19:21:30 UTC
Debugging notes:

On the Satellite, I poked around inside postgres asking questions about pulpcore connections.

# su - postgres -c "psql pulpcore"
psql (12.11)
Type "help" for help.

pulpcore=# 

This query

pulpcore=# select datid, datname, pid, usesysid, usename, query from pg_stat_activity where datname = 'pulpcore';

shows lots of rows like this one:

 24264 | pulpcore | 21198 |    24263 | pulp     | SELECT "core_remote"."pulp_id", "core_remote"."pulp_created", "core_remote"."pulp_last_updated", "core_remote"."pulp_type", "core_
remote"."name", "core_remote"."url", "core_remote"."ca_cert", "core_remote"."client_cert", "core_remote"."client_key", "core_remote"."tls_validation", "core_remote"."username", "co
re_remote"."password", "core_remote"."proxy_url", "core_remote"."proxy_username", "core_remote"."proxy_password", "core_remote"."download_concurrency", "core_remote"."max_retries",
 "core_remote"."policy", "core_remote"."total_timeout", "core_remote"."connect_timeout", "core_remote"."sock_connect_timeout", "core_remote"."sock_read_timeout", "core_remote"."hea
ders", "core_remote"."rate_limit" FROM "core_remote" WHERE "core_remote"."pulp_id" = '09325f08-7805-4e44-8580-98e65c11434e'::uuid LIMIT 21

pulpcore=# select count(pid) from pg_stat_activity where datname = 'pulpcore' and query like '%09325f08-7805-4e44-8580-98e65c11434e%';
 count 
-------
   413

Of the 434 connections pulp is holding to postgres, 413 (!) of them are idle, having last asked about this one remote.

So - we're running out of connections, because something is querying a remote, *and then not closing its db connection*. I think. I have no idea why/where that's happening. Debugging continues.

Comment 31 Grant Gainey 2022-06-07 20:01:28 UTC
The Artifact UQ-errors are due to the content-handler delivering content to the capsule, for 8 identical CVs, for a repo that was sync'd "on_demand". So occasionally, multiple content-handler threads ask for the same content "for the first time", at the same time. The on-demand processing for each such collision attempts to "make the content real" multiple times, one such thread "wins" and stores the Artifact, everyone else fails. I *think* the traceback is due to Thread-1 wins, Thread-2 fails, catches the error, and throws away the file it was downloading based on its sha256, Thread-3 *also* fails, catches the exception, and then fails during cleanup because Thread-2 already did it (that's the os.unlink() point in the traceback from #c21).  This should be harmless - we might consider a low-priority "please notice this harmless failure and do not log a traceback if it happens" BZ, but it isn't anything to do with the deadlock.
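
To sketch what "losing the race gracefully" looks like (hypothetical helper names, not pulpcore's actual content-handler code):

=====
# Several threads download the same on-demand content; whoever saves the
# Artifact first wins, and the losers reuse the winner's row while tolerating
# the temp file already having been removed by another loser.
import os

from django.db import IntegrityError, transaction
from pulpcore.app.models.content import Artifact

def save_downloaded_artifact(tmp_path, sha256, size):
    try:
        with transaction.atomic():
            artifact = Artifact(file=tmp_path, sha256=sha256, size=size)
            artifact.save()
            return artifact
    except IntegrityError:
        # Another thread hit core_artifact_sha256_key first; clean up our copy,
        # tolerating a third loser having unlinked it already (the harmless
        # traceback from #c21), and hand back the winner's row.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        return Artifact.objects.get(sha256=sha256)
=====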

My current thinking is that the multiple-connections are involved in the "attempting to realize too many overlapping on_demand Artifacts at the same time". I don't yet see exactly *how*, but that's the trail I am currently following.

@hyu, @dkliban - any thoughts on whether I'm making sense here? (Also, any ideas would be very welcome)

Comment 32 Daniel Alley 2022-06-07 21:09:51 UTC
@ggainey I don't have any immediate answers but the google-able name for this problem is "request collapsing" or "request coalescing" and I think Nginx and some cache servers like Varnish have settings that can help with it.

https://developer.fastly.com/learning/concepts/request-collapsing/
https://cloud.google.com/cdn/docs/caching#request-coalescing
https://info.varnish-software.com/blog/request-coalescing-and-other-reasons-to-use-varnish-as-an-origin-shield
https://fly.io/blog/the-5-hour-content-delivery-network/

Comment 33 Hao Chang Yu 2022-06-08 07:02:12 UTC
We could probably implement something similar to the Pulp 2 model, where Squid cached the first downloaded content and a deferred download process ran periodically to move the cached content into storage.

Comment 35 Grant Gainey 2022-06-09 14:24:19 UTC
OK gang, here we go. The "too many connections" issue is coming from the content-app; I have opened a pulpcore issue : https://github.com/pulp/pulpcore/issues/2816

This needs its own BZ - I think QE is the best person to be the "opener". Please link the above issue to the BZ when it's open.

In the meantime - to verify the *actual* BZ in question here (which is "the capsule deadlocks when syncing") - we should start with syncing RHEL7 *immediate* instead of on-demand on the Satellite side, and then following the rest of the test-steps, to see if we can force a deadlock on the capsule.

Comment 40 Stephen Wadeley 2022-06-14 13:07:17 UTC
(In reply to Grant Gainey from comment #35)
> OK gang, here we go. The "too many connections" issue is coming from the
> content-app; I have opened a pulpcore issue :
> https://github.com/pulp/pulpcore/issues/2816
> 
> This needs its own BZ - I think QE is the best person to be the "opener".
Bug 2096198 - Too many connection issue occurring for on-demand content sync

> Please link the above issue to the BZ when it's open.
> 
its done (2022-06-13)

Comment 43 pulp-infra@redhat.com 2022-06-14 16:34:52 UTC
Requesting needsinfo from upstream developer ggainey because the 'FailedQA' flag is set.

Comment 44 pulp-infra@redhat.com 2022-06-14 16:34:55 UTC
Requesting needsinfo from upstream developer ggainey because the 'FailedQA' flag is set.

Comment 47 ir. Jan Gerrit Kootstra 2022-06-28 10:14:47 UTC
We encountered this issue when upgrading Capsules from 6.9.9 to 6.10.6.

Comment 48 ir. Jan Gerrit Kootstra 2022-07-13 10:38:49 UTC
On 2022-06-28 10:14:47 UTC we encountered the issue on Acceptance, and yesterday on Production.

Comment 50 pulp-infra@redhat.com 2022-08-10 16:36:05 UTC
Requesting needsinfo from upstream developer ggainey because the 'FailedQA' flag is set.

Comment 51 pulp-infra@redhat.com 2022-08-10 16:36:10 UTC
Requesting needsinfo from upstream developer ggainey because the 'FailedQA' flag is set.

Comment 62 Grant Gainey 2022-09-13 13:41:05 UTC
*** Bug 2119888 has been marked as a duplicate of this bug. ***

Comment 64 Lai 2022-09-15 17:14:57 UTC
Followed the description for retesting.

from UI:
deadlock detected
DETAIL:  Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (16,67) in relation "core_contentartifact"
Could not lookup a publication_href for repo 22

From /var/log/messages
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: pulp [427040d9-f4ce-4908-abf2-1b65caf52169]: pulpcore.tasking.pulpcore_worker:INFO: Task 14f464a2-78b4-42a1-a75
8-36a74165a5e1 failed (deadlock detected
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: DETAIL:  Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: HINT:  See server log for query details.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: CONTEXT:  while locking tuple (16,67) in relation "core_contentartifact"
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: )
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: pulp [427040d9-f4ce-4908-abf2-1b65caf52169]: pulpcore.tasking.pulpcore_worker:INFO:   File "/usr/lib/python3.9/
site-packages/pulpcore/tasking/pulpcore_worker.py", line 410, in _perform_task
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    result = func(*args, **kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 556, in synchronize
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    repo_version = dv.create() or repo.latest_version()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    loop.run_until_complete(pipeline)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib64/python3.9/asyncio/base_events.py", line 642, in run_until_complete
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    return future.result()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    await asyncio.gather(*futures)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    await self.run()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/content_stages.py", line 200, in run
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    await sync_to_async(process_batch)()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/asgiref/sync.py", line 414, in __call__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    ret = await asyncio.wait_for(future, timeout=None)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib64/python3.9/asyncio/tasks.py", line 442, in wait_for
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    return await fut
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 52, in run
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    result = self.fn(*self.args, **self.kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/asgiref/sync.py", line 455, in thread_handler
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    return func(*args, **kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/content_stages.py", line 189, in process_batch
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    len(ContentArtifact.objects.filter(pk__in=subq).values_list())
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 262, in __len__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    self._fetch_all()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:    self._result_cache = list(self._iterable_class(self))
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]:  File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 140, in __iter__


Failed QA on 6.12.0 snap 11 with python39-pulpcore-3.18.10-1.el8pc.noarch

Comment 65 Grant Gainey 2022-09-15 17:30:03 UTC
Postgres log output from the above stacktrace:

===
2022-09-15 11:14:40 EDT ERROR:  deadlock detected
2022-09-15 11:14:40 EDT DETAIL:  Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
        Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
        Process 2271: SELECT "core_contentartifact"."pulp_id", "core_contentartifact"."pulp_created", "core_contentartifact"."pulp_last_updated", "core_contentartifact"."artifact_id", "core_contentartifact"."content_id", "core_contentartifact"."relative_path" FROM "core_contentartifact" WHERE "core_contentartifact"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_contentartifact" U0 WHERE U0."pulp_id" IN ('0a540da6-7eb0-4fa7-9d35-0c09c82b0849'::uuid, 'a31f82bf-00ad-475d-99b1-5aa125832052'::uuid, 'a387baf4-5cfe-4719-bb74-bec2f05fca1b'::uuid, 'a3975a19-dcd9-4358-8a27-7c343d3b4968'::uuid, 'a3e9026b-3ba8-4d0f-a9bf-b33c5daffbd7'::uuid, 'a407bf4e-f19e-42b1-8754-d4a87d4f15dc'::uuid, 'a4121c97-9708-4fca-b610-937e9af677b6'::uuid, 'a4853235-ae93-4aa8-8afa-8a8416393de7'::uuid, 'a48f4c67-e8cc-4b84-8e55-adbced9a67e5'::uuid, 'a5a95699-251c-483d-9c89-82f6835378ee'::uuid, 'a7268c36-03c6-418f-8327-0b4cd1429f21'::uuid, 'a780ec88-c44f-4e93-8e35-5a9014041644'::uuid, 'a79cde66-17c9-47c2-91a7-66c3467b3ce2'::uuid, 'a81051c3-59e9-4ba1-9097-e8386c10678e'::uu
        Process 2253: COMMIT
2022-09-15 11:14:40 EDT HINT:  See server log for query details.
2022-09-15 11:14:40 EDT CONTEXT:  while locking tuple (16,67) in relation "core_contentartifact"
2022-09-15 11:14:40 EDT STATEMENT:  SELECT "core_contentartifact"."pulp_id", "core_contentartifact"."pulp_created", "core_contentartifact"."pulp_last_updated", "core_contentartifact"."artifact_id", "core_contentartifact"."content_id", "core_contentartifact"."relative_path" FROM "core_contentartifact" WHERE "core_contentartifact"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_contentartifact" U0 WHERE U0."pulp_id" IN ('0a540da6-7eb0-4fa7-9d35-0c09c82b0849'::uuid, 'a31f82bf-00ad-475d-99b1-5aa125832052'::uuid, 'a387baf4-5cfe-4719-bb74-bec2f05fca1b'::uuid, 'a3975a19-dcd9-4358-8a27-7c343d3b4968'::uuid, 'a3e9026b-3ba8-4d0f-a9bf-b33c5daffbd7'::uuid, 'a407bf4e-f19e-42b1-8754-d4a87d4f15dc'::uuid, 'a4121c97-9708-4fca-b610-937e9af677b6'::uuid, 'a4853235-ae93-4aa8-8afa-8a8416393de7'::uuid, [insert many ids here] '9fbd8d81-61a6-4d88-a358-07712d88153b'::uuid, 'a07d4e89-e7e5-401c-898c-687d08740fdc'::uuid, 'a09c7ca9-c2eb-4e70-bf0d-31afd8063554'::uuid, 'a1e5cd9c-8886-4bda-8d30-3b69d64f77f3'::uuid, 'a1f811a8-43c0-4bd0-b59a-145d7e5c22cb'::uuid) ORDER BY U0."pulp_id" ASC FOR UPDATE)
2022-09-15 11:14:40 EDT LOG:  duration: 1315.635 ms  statement: COMMIT
===

Comment 66 Lai 2022-09-15 18:17:47 UTC
Just an observation, but doing a complete resync does sync the capsule successfully.

Comment 67 Sayan Das 2022-09-22 10:39:24 UTC
(In reply to Lai from comment #66)
> Just an observation, but doing a complete resync does sync the capsule
> successfully.

Perhaps I understood wrong, but did you mean that:

* The first "Complete Sync" attempt fails with a deadlock

* But an immediate second "Complete Sync" attempt works fine and completes successfully?


Is it the same for the cloned BZ (for 6.11.3) as well?

Comment 69 Lai 2022-09-23 13:40:05 UTC
Sayan,

That is correct.  A second "Complete Sync" completes successfully.  And yes, it's the same as the cloned bz for 6.11.3

Comment 92 Ian Ballou 2023-02-03 16:25:29 UTC
Created attachment 1942081 [details]
RHEL 7 Hotfix RPM for Satellite 6.11.4

Comment 93 Ian Ballou 2023-02-03 16:26:24 UTC
Created attachment 1942082 [details]
RHEL 8 Hotfix RPM for Satellite 6.11.4

Comment 94 Ian Ballou 2023-02-03 16:29:10 UTC
Hotfix RPMs are available for this BZ for Satellite 6.11.4 on RHEL 7 and RHEL 8.

INSTALL INSTRUCTIONS (Satellite 6.11.4 on RHEL 7):

1. Take a complete backup or snapshot of Satellite 6.11.4 server.

2. Download the applicable RHEL 7 hotfix RPM attached to this BZ.

3. # yum localinstall tfm-pulpcore-python3-pulpcore-3.16.14-2.HOTFIXRHBZ2062526.el7pc.noarch.rpm --disableplugin=foreman-protector

4. # satellite-maintain service restart

INSTALL INSTRUCTIONS (Satellite 6.11.4 on RHEL 8):

1. Take a complete backup or snapshot of Satellite 6.11.4 server.

2. Download the applicable RHEL 8 hotfix RPM attached to this BZ.

3. # yum localinstall python38-pulpcore-3.16.14-2.HOTFIXRHBZ2062526.el8pc.noarch.rpm --disableplugin=foreman-protector

4. # satellite-maintain service restart

Comment 95 Mike G 2023-02-03 19:43:28 UTC
I would recommend also noting that this should be applied to the capsule(s) as well? :) At least I assume that to be the case.

Much appreciated you guys getting this hotfix out ASAP. I will update here once I have more info from the latest full sync. 

Thanks!

Comment 96 Pavel Moravec 2023-02-03 21:30:06 UTC
(In reply to Mike G from comment #95)
> I would recommend also noting that this should be applied to
> the capsule(s) as well? :) At least I assume that to be the case.
> 
> Much appreciated you guys getting this hotfix out ASAP. I will update here
> once I have more info from the latest full sync. 
> 
> Thanks!

Good point! Unless I am mistaken, the hotfix should be applied *primarily* on the *Capsules* - it is the Capsule's workers that hit the deadlock, and what pulp on the Satellite does has no effect on that.

Comment 98 Mike G 2023-02-07 20:29:51 UTC
Still having sync issues with the capsules; not sure if it's related to this patch or not, but we're no longer getting deadlock warnings - now it's something new. I am working with Paul in a case and can update once we have a more definitive conclusion.

Thanks

Comment 105 errata-xmlrpc 2023-05-03 13:21:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097

Comment 106 Red Hat Bugzilla 2023-11-07 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

