Bug 1974482 - deadlock detected on capsule sync
Summary: deadlock detected on capsule sync
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: 6.10.0
Assignee: satellite6-bugs
QA Contact: Lai
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-21 19:54 UTC by Vladimír Sedmík
Modified: 2021-10-05 19:08 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-23 15:06:36 UTC
Target Upstream Version:
Embargoed:


Links
* Pulp Redmine 8750, High, CLOSED - CURRENTRELEASE: Deadlock on rpm repository pulp2pulp sync (last updated 2021-10-05 19:08:02 UTC)
* Pulp Redmine 8985, High, CLOSED - DUPLICATE: deadlock detected during pulp3 to pulp3 sync (last updated 2021-07-08 15:09:50 UTC)

Description Vladimír Sedmík 2021-06-21 19:54:11 UTC
Description of problem:
Optimized synchronization of varied content (yum, files, container images) to the external capsule fails with a `deadlock detected` error.


Version-Release number of selected component (if applicable):
6.10.0 snap 5


How reproducible:
always


Steps to Reproduce:
1. Have a Sat with yum/files/container_images content and a registered external capsule in the Library LCE
2. Trigger the Optimized sync
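
For reference, step 2 can also be driven from the API. A minimal sketch, assuming Katello's capsule content-sync endpoint and its skip_metadata_check parameter (hostname, credentials, and capsule ID are placeholders):
===
# Hypothetical sketch: trigger an Optimized capsule sync over the Katello API.
# Hostname, credentials, and capsule ID are placeholders.
import requests

SATELLITE = "https://satellite.example.com"
CAPSULE_ID = 2  # placeholder ID of the external capsule

resp = requests.post(
    f"{SATELLITE}/katello/api/capsules/{CAPSULE_ID}/content/sync",
    auth=("admin", "changeme"),
    # skip_metadata_check=False => Optimized sync; True would request the
    # Complete sync that succeeds per "Additional info" (assumption, mirrors
    # hammer's --skip-metadata-check option).
    json={"skip_metadata_check": False},
    verify=False,  # lab-only; verify certificates in real environments
)
resp.raise_for_status()
print(resp.json())  # the foreman task that tracks the sync
===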


Actual results:
The sync task ends in a warning state with:
deadlock detected
DETAIL:  Process 21456 waits for ShareLock on transaction 25847; blocked by process 21471.
Process 21471 waits for ShareLock on transaction 25727; blocked by process 21456.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (0,2) in relation "rpm_repometadatafile_data_type_checksum_relat_c9d7364a_uniq"


Expected results:
no deadlock


Additional info:
Complete sync succeeds without errors.
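
For context, the ShareLock messages above are PostgreSQL's classic unique-index deadlock: two open transactions each insert a key that the other has already inserted but not yet committed, so each waits on the other's transaction. A minimal standalone sketch of that pattern, not Pulp code (DSN and table name are placeholders):
===
# Sketch of the reported failure mode: two transactions insert the same
# pair of unique keys in opposite orders, each blocks waiting for a
# ShareLock on the other's transaction, and PostgreSQL aborts one with
# "deadlock detected".
# One-time setup (placeholder table): CREATE TABLE uniq_demo (k text UNIQUE);
import threading

import psycopg2

DSN = "dbname=demo user=demo"  # placeholder connection string
barrier = threading.Barrier(2)

def insert_pair(first, second):
    conn = psycopg2.connect(DSN)
    try:
        with conn:  # one transaction covering both inserts
            with conn.cursor() as cur:
                cur.execute("INSERT INTO uniq_demo (k) VALUES (%s)", (first,))
                barrier.wait(timeout=10)  # both rows locked before crossing
                cur.execute("INSERT INTO uniq_demo (k) VALUES (%s)", (second,))
    except psycopg2.errors.DeadlockDetected as exc:
        print("deadlock:", exc)  # one of the two threads lands here
    finally:
        conn.close()

t1 = threading.Thread(target=insert_pair, args=("a", "b"))
t2 = threading.Thread(target=insert_pair, args=("b", "a"))
t1.start(); t2.start(); t1.join(); t2.join()
===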

Comment 3 Tanya Tereshchenko 2021-06-29 11:34:18 UTC
Your main Satellite is on 6.10.0 snap 5, right?
What about the Capsule? Is it the same version?

If it's a capsule sync, please provide the Pulp logs from the capsule:
journalctl -u pulpcore-* or /var/log/messages.

Comment 4 Vladimír Sedmík 2021-06-29 13:12:23 UTC
Yes, both are the same version/snap. The log follows...

Comment 6 pulp-infra@redhat.com 2021-06-29 18:08:41 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 7 pulp-infra@redhat.com 2021-06-29 18:08:42 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 9 pulp-infra@redhat.com 2021-06-29 22:07:39 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2021-06-29 22:07:40 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 12 pulp-infra@redhat.com 2021-07-01 16:10:41 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 13 pulp-infra@redhat.com 2021-07-08 15:09:52 UTC
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 14 pulp-infra@redhat.com 2021-07-14 22:07:53 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 15 pulp-infra@redhat.com 2021-07-19 15:08:52 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 16 pulp-infra@redhat.com 2021-07-28 21:07:56 UTC
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.

Comment 17 Grant Gainey 2021-07-30 12:00:46 UTC
Update on investigation:

* Failed to reproduce on the first attempt on the reproducer machines
* vsedmik removed pulp data and ran 'satellite-installer reset-data' - still couldn't reproduce
* Recreated from snapshot; reproduced immediately (again) on first sync
* No success reproducing via pulp3-sync-from-external in a dev-env

One thing we've discussed is that the reproducer setup is on Snap 5, which is core/3.11 et al.

I would like to have a reproducer on 6.10-latest - there have been A LOT of fixes between 3.11 and 3.14, and we want to make sure we're debugging the system that will be released for 6.10. Can we retest this on 6.10 snap 11 (I believe that's what just came out) and report the status here?

Meanwhile, I will see if any of my standalone pulp tests fail when I run in a 3.11 environment.

Comment 18 Grant Gainey 2021-07-30 15:46:34 UTC
Set up a dev-env with core/3.11 and friends, and was able to reproduce this problem on the first try.

The script I am using: https://github.com/ggainey/pulp_startup/blob/main/8750_deadlocks/repro_3-11.sh

3.14 is running on the new tasking system, so the timing is **completely** different; it's possible that the underlying issue still exists, but it's become **much** harder to hit (a dozen test runs failed to deadlock even once on 3.14, with the same repo/remote setup). Now that I have a reproducer, I am going to see if I can build a fix based on a lock-ordering theory I have.
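
For the record, a lock-ordering fix in Django-based code like Pulp typically means giving every concurrent writer the same insert order, so no two transactions can each hold an index entry the other needs. A hypothetical sketch, not the actual Pulp patch (the helper name and sort key are illustrative):
===
# Hypothetical lock-ordering fix: sort each batch by a stable key before
# bulk-inserting, so concurrent transactions acquire unique-index entries
# in the same order and cannot form the circular ShareLock wait seen above.
from django.db import transaction

def save_content_batch(model, instances):
    # natural_key() is assumed as the stable sort key here; any total
    # ordering shared by all writers would do.
    ordered = sorted(instances, key=lambda unit: unit.natural_key())
    with transaction.atomic():
        model.objects.bulk_create(ordered, ignore_conflicts=True)
===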

HOWEVER - unless we can get a regular reproducer on core/3.14 and rpm/3.13, I would remove this as a blocker.

Output:
===
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ prestart
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ sudo systemctl start pulpcore-worker@3 pulpcore-worker@4
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ ./repro_3-11.sh 
CLEANUP
Error: Could not find remote with {'name': 'sat-maint-1'}.
Error: Could not find repository with {'name': 'sat-maint-1'}.
Error: Could not find remote with {'name': 'sat-maint-2'}.
Error: Could not find repository with {'name': 'sat-maint-2'}.
Error: Could not find remote with {'name': 'ansible-2.9-1'}.
Error: Could not find repository with {'name': 'ansible-2.9-1'}.
Error: Could not find remote with {'name': 'ansible-2.9-2'}.
Error: Could not find repository with {'name': 'ansible-2.9-2'}.
DeprecationWarning: The command orphans is deprecated.
DeprecationWarning: The command delete is deprecated.
Started background task /pulp/api/v3/tasks/1386df85-df82-4185-8c68-f767c323b89a/
Done.
{
  "pulp_href": "/pulp/api/v3/tasks/1386df85-df82-4185-8c68-f767c323b89a/",
  "pulp_created": "2021-07-30T15:35:47.234683Z",
  "state": "completed",
  "name": "pulpcore.app.tasks.orphan.orphan_cleanup",
  "logging_cid": "c2cf62e0f03e4b29840576b1deca6f40",
  "started_at": "2021-07-30T15:35:47.347304Z",
  "finished_at": "2021-07-30T15:35:47.391540Z",
  "error": null,
  "worker": "/pulp/api/v3/workers/38db073a-ff81-4799-96a9-11170b1573d4/",
  "parent_task": null,
  "child_tasks": [],
  "task_group": null,
  "progress_reports": [
    {
      "message": "Clean up orphan Content",
      "code": "clean-up.content",
      "state": "completed",
      "total": 0,
      "done": 0,
      "suffix": null
    },
    {
      "message": "Clean up orphan Artifacts",
      "code": "clean-up.content",
      "state": "completed",
      "total": 0,
      "done": 0,
      "suffix": null
    }
  ],
  "created_resources": [],
  "reserved_resources_record": []
}

CYCLE-START
SETUP REPOS AND REMOTES
"/pulp/api/v3/remotes/rpm/rpm/0cd8c91b-cc70-47c3-b7b8-63d3506eee2b/"
"/pulp/api/v3/repositories/rpm/rpm/a2a5dff1-9f41-48f5-97a0-3724cc5e81dc/"
"/pulp/api/v3/remotes/rpm/rpm/131db04d-24fd-4964-9fc1-0b7bb24a358a/"
"/pulp/api/v3/repositories/rpm/rpm/e2733877-af84-45e5-8399-d6f977c0402c/"
"/pulp/api/v3/remotes/rpm/rpm/3163bb03-5afe-416e-92b7-dde5f70310b7/"
"/pulp/api/v3/repositories/rpm/rpm/6b127970-8b58-46a3-a9ec-c681d5466115/"
"/pulp/api/v3/remotes/rpm/rpm/038a0a51-7c8d-4dc2-a40d-913dfac3deea/"
"/pulp/api/v3/repositories/rpm/rpm/87e1245a-caf0-487a-b450-f43162fc8f34/"
SYNCING...
Started background task /pulp/api/v3/tasks/c0379632-86ac-44e7-afaf-0d3708b22f48/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/6d4dfc3a-f813-4780-91f8-1c0e13382ea2/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/5ed0780e-f3b1-423d-8bda-efa6dc1d0ac3/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/292f2c2e-d008-463e-bb1d-1da4a65147cd/
Not waiting for task because --background was specified.
WAIT FOR COMPLETION....
.DONE
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 pulp_startup]$ 
===

From journalctl:
===
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: return self.cursor.execute(sql, params)
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: django.db.utils.OperationalError: deadlock detected
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: DETAIL:  Process 14488 waits for ShareLock on transaction 1158585; blocked by process 14476.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: Process 14476 waits for ShareLock on transaction 1158640; blocked by process 14488.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: HINT:  See server log for query details.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: CONTEXT:  while inserting index tuple (7,4) in relation "rpm_updaterecord_digest_key"
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14266]: pulp [5dfdaf995a894cda8c47d30cf0c5d30c]: rq.worker:INFO: 14266.example.com: Job OK (5ed0780e-f3b1-423d-8bda-efa6dc1d0ac3)
Jul 30 15:36:01 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com gunicorn[14031]: pulp [8343faac596941d0af27301acada1973]: 127.0.0.1 - admin [30/Jul/2021:15:36:01 +0000] "GET /pulp/api/v3/tasks/?state=running&offset=0&limit=25 HTTP/1.1" 200 52 "-" "python-requests/2.26.0"
Jul 30 15:36:06 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com gunicorn[14031]: pulp [c7e999cf48114a8ebf4a400660a75e60]: 127.0.0.1 - admin [30/Jul/2021:15:36:06 +0000] "GET /pulp/api/v3/tasks/?state=failed&offset=0&limit=25 HTTP/1.1" 200 5297 "-" "python-requests/2.26.0"
===

Comment 19 Vladimír Sedmík 2021-07-30 17:33:51 UTC
Retested on snap 11 with python3-pulpcore-3.14.3-1.el7pc.noarch - I was not able to reproduce the issue when running the new (default) tasking system. After switching back to the old RQ tasking system, the issue was reproduced successfully.
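
For anyone repeating this comparison, pulpcore 3.14 still shipped the RQ-based workers behind a settings toggle. A sketch of the switch, assuming the USE_NEW_WORKER_TYPE setting from that release (check the 3.14 docs before relying on it):
===
# /etc/pulp/settings.py (sketch, assumes pulpcore 3.14's toggle)
# False selects the old RQ-based tasking system; the default True selects
# the new distributed tasking system. Restart the workers after changing it.
USE_NEW_WORKER_TYPE = False
===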

Comment 20 Grant Gainey 2021-08-02 22:25:34 UTC
Switching my dev-env to current-master, and using the old tasking system, I am still **unable** to reproduce. There have been A LOT of changes in the affected code-paths in pulp_rpm especially; it may just be that the timing on my dev-box is "different enough" that even the old tasking system won't trigger the problem for me.

Where I think we are with the upstream issue is "not critical, since we can't reproduce it either standalone or on Satellite under 3.14".

Comment 28 pulp-infra@redhat.com 2021-09-21 23:09:11 UTC
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.

Comment 29 pulp-infra@redhat.com 2021-09-28 21:07:05 UTC
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.

Comment 30 pulp-infra@redhat.com 2021-10-05 19:08:03 UTC
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.

