Description of problem:
Optimized synchronization of varied content (yum, files, container images) to an external Capsule fails with a `deadlock detected` error.

Version-Release number of selected component (if applicable):
6.10.0 snap 5

How reproducible:
Always

Steps to Reproduce:
1. Have a Satellite with yum/files/container-image content and an external Capsule registered to the Library LCE
2. Trigger the Optimized sync of the Capsule

Actual results:
The sync task ends in a warning state with:

deadlock detected
DETAIL: Process 21456 waits for ShareLock on transaction 25847; blocked by process 21471.
Process 21471 waits for ShareLock on transaction 25727; blocked by process 21456.
HINT: See server log for query details.
CONTEXT: while inserting index tuple (0,2) in relation "rpm_repometadatafile_data_type_checksum_relat_c9d7364a_uniq"

Expected results:
No deadlock.

Additional info:
A Complete sync succeeds without errors.
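For reference, the two sync modes can be triggered from the Satellite with hammer. This is only a sketch - the Capsule name is a placeholder and the --skip-metadata-check option should be verified against the installed hammer version:

# Optimized sync (the default): only content changed since the last sync is considered
hammer capsule content synchronize --name "capsule.example.com"

# Complete sync: revalidates all metadata; this is the mode that succeeds without errors
hammer capsule content synchronize --name "capsule.example.com" --skip-metadata-check true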
Your main Satellite is on 6.10.0 snap 5, right? What about the Capsule - is it the same version? If it's a Capsule sync, please provide the Pulp logs from the Capsule: journalctl -u "pulpcore-*" or /var/log/messages.
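For completeness, one way to capture those logs on the Capsule (a sketch; the time window and output path are arbitrary):

# Collect pulpcore service logs around the failing sync
journalctl -u "pulpcore-*" --since "2 hours ago" > /tmp/pulpcore-journal.log

# The deadlock usually also shows up in the system log
grep -i 'deadlock detected' /var/log/messages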
Yes, both are the same version/snap. The log follows...
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.
The Pulp upstream bug status is at ASSIGNED. Updating the external tracker on this bug.
Update on the investigation:
* Failed to reproduce on the first attempt on the reproducer machines.
* vsedmik removed the Pulp data and ran 'satellite-installer reset-data' - still couldn't reproduce.
* Recreated from snapshot - reproduced immediately (again) on the first sync.
* No success reproducing using pulp3-sync-from-external in a dev-env.

One thing we've discussed is that the reproducer setup is on snap 5, which is core/3.11 et al. I would like to have a reproducer on 6.10-latest - there have been A LOT of fixes between 3.11 and 3.14, and we want to make sure we're debugging the system that will actually be released for 6.10. Can we retest this on 6.10 snap 11 (I believe that's what just came out) and report the status here? Meanwhile, I will see if any of my standalone Pulp tests fail when I run them in a 3.11 environment.
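A quick way to confirm which Pulp components a given snap actually ships (a sketch; the status endpoint may need different auth/cert options depending on the install):

# Installed package versions on the box under test
rpm -q python3-pulpcore python3-pulp-rpm

# Or ask the running API which component versions it reports
curl -sk https://localhost/pulp/api/v3/status/ | python3 -m json.tool | grep -A1 '"component"'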
Set up a dev-env with core/3.11 and friends, and was able to reproduce this problem on the first try. The script I am using: https://github.com/ggainey/pulp_startup/blob/main/8750_deadlocks/repro_3-11.sh

3.14 is running on the new tasking system, so the timing is **completely** different; it's possible that the underlying issue still exists, but it has become **much** harder to hit (a dozen test runs failed to deadlock even once on 3.14, with the same repo/remote setup). Now that I have a reproducer, I am going to see if I can build a fix based on a lock-ordering theory I have. HOWEVER - unless we can get a regular reproducer on core/3.14 and rpm/3.13, I would remove this as a blocker. (A rough sketch of what the reproducer script does follows after the output below.)

Output:
===
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ prestart
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ sudo systemctl start pulpcore-worker@3 pulpcore-worker@4
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 8750_deadlocks]$ ./repro_3-11.sh
CLEANUP
Error: Could not find remote with {'name': 'sat-maint-1'}.
Error: Could not find repository with {'name': 'sat-maint-1'}.
Error: Could not find remote with {'name': 'sat-maint-2'}.
Error: Could not find repository with {'name': 'sat-maint-2'}.
Error: Could not find remote with {'name': 'ansible-2.9-1'}.
Error: Could not find repository with {'name': 'ansible-2.9-1'}.
Error: Could not find remote with {'name': 'ansible-2.9-2'}.
Error: Could not find repository with {'name': 'ansible-2.9-2'}.
DeprecationWarning: The command orphans is deprecated.
DeprecationWarning: The command delete is deprecated.
Started background task /pulp/api/v3/tasks/1386df85-df82-4185-8c68-f767c323b89a/
Done.
{
  "pulp_href": "/pulp/api/v3/tasks/1386df85-df82-4185-8c68-f767c323b89a/",
  "pulp_created": "2021-07-30T15:35:47.234683Z",
  "state": "completed",
  "name": "pulpcore.app.tasks.orphan.orphan_cleanup",
  "logging_cid": "c2cf62e0f03e4b29840576b1deca6f40",
  "started_at": "2021-07-30T15:35:47.347304Z",
  "finished_at": "2021-07-30T15:35:47.391540Z",
  "error": null,
  "worker": "/pulp/api/v3/workers/38db073a-ff81-4799-96a9-11170b1573d4/",
  "parent_task": null,
  "child_tasks": [],
  "task_group": null,
  "progress_reports": [
    {
      "message": "Clean up orphan Content",
      "code": "clean-up.content",
      "state": "completed",
      "total": 0,
      "done": 0,
      "suffix": null
    },
    {
      "message": "Clean up orphan Artifacts",
      "code": "clean-up.content",
      "state": "completed",
      "total": 0,
      "done": 0,
      "suffix": null
    }
  ],
  "created_resources": [],
  "reserved_resources_record": []
}
CYCLE-START
SETUP REPOS AND REMOTES
"/pulp/api/v3/remotes/rpm/rpm/0cd8c91b-cc70-47c3-b7b8-63d3506eee2b/"
"/pulp/api/v3/repositories/rpm/rpm/a2a5dff1-9f41-48f5-97a0-3724cc5e81dc/"
"/pulp/api/v3/remotes/rpm/rpm/131db04d-24fd-4964-9fc1-0b7bb24a358a/"
"/pulp/api/v3/repositories/rpm/rpm/e2733877-af84-45e5-8399-d6f977c0402c/"
"/pulp/api/v3/remotes/rpm/rpm/3163bb03-5afe-416e-92b7-dde5f70310b7/"
"/pulp/api/v3/repositories/rpm/rpm/6b127970-8b58-46a3-a9ec-c681d5466115/"
"/pulp/api/v3/remotes/rpm/rpm/038a0a51-7c8d-4dc2-a40d-913dfac3deea/"
"/pulp/api/v3/repositories/rpm/rpm/87e1245a-caf0-487a-b450-f43162fc8f34/"
SYNCING...
Started background task /pulp/api/v3/tasks/c0379632-86ac-44e7-afaf-0d3708b22f48/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/6d4dfc3a-f813-4780-91f8-1c0e13382ea2/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/5ed0780e-f3b1-423d-8bda-efa6dc1d0ac3/
Not waiting for task because --background was specified.
Started background task /pulp/api/v3/tasks/292f2c2e-d008-463e-bb1d-1da4a65147cd/
Not waiting for task because --background was specified.
WAIT FOR COMPLETION....
.DONE
(pulp) [vagrant@pulp2-nightly-pulp3-source-centos7 pulp_startup]$
===

From journalctl:
===
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: return self.cursor.execute(sql, params)
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: django.db.utils.OperationalError: deadlock detected
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: DETAIL: Process 14488 waits for ShareLock on transaction 1158585; blocked by process 14476.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: Process 14476 waits for ShareLock on transaction 1158640; blocked by process 14488.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: HINT: See server log for query details.
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14029]: CONTEXT: while inserting index tuple (7,4) in relation "rpm_updaterecord_digest_key"
Jul 30 15:35:59 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com rq[14266]: pulp [5dfdaf995a894cda8c47d30cf0c5d30c]: rq.worker:INFO: 14266.example.com: Job OK (5ed0780e-f3b1-423d-8bda-efa6dc1d0ac3)
Jul 30 15:36:01 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com gunicorn[14031]: pulp [8343faac596941d0af27301acada1973]: 127.0.0.1 - admin [30/Jul/2021:15:36:01 +0000] "GET /pulp/api/v3/tasks/?state=running&offset=0&limit=25 HTTP/1.1" 200 52 "-" "python-requests/2.26.0"
Jul 30 15:36:06 pulp2-nightly-pulp3-source-centos7.padre-fedora.example.com gunicorn[14031]: pulp [c7e999cf48114a8ebf4a400660a75e60]: 127.0.0.1 - admin [30/Jul/2021:15:36:06 +0000] "GET /pulp/api/v3/tasks/?state=failed&offset=0&limit=25 HTTP/1.1" 200 5297 "-" "python-requests/2.26.0"
===
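For anyone without access to the script above, the heart of the reproducer looks roughly like this (a sketch, not the actual script; the repo names and upstream URL are placeholders, and pulp-cli option placement may differ between versions). It creates several rpm remotes/repositories pointing at the same upstream and starts all the syncs in the background so they overlap on different workers:

#!/bin/bash
UPSTREAM_URL="https://example.com/some/rpm/repo/"   # placeholder upstream

# Create the remotes and repositories
for name in sat-maint-1 sat-maint-2 ansible-2.9-1 ansible-2.9-2; do
  pulp rpm remote create --name "${name}" --url "${UPSTREAM_URL}"
  pulp rpm repository create --name "${name}" --remote "${name}"
done

echo "SYNCING..."
# --background returns immediately, so all four syncs run concurrently on the workers
for name in sat-maint-1 sat-maint-2 ansible-2.9-1 ansible-2.9-2; do
  pulp --background rpm repository sync --name "${name}"
done
# The real script then polls the tasks API until all syncs have finished.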
Retested on snap 11 with python3-pulpcore-3.14.3-1.el7pc.noarch - I was not able to reproduce the issue when running the new (default) tasking system. After switching back to the old RQ tasking system, the issue was reproduced successfully.
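For the record, switching between the two tasking systems on pulpcore 3.14 is a settings toggle plus a worker restart; roughly, and from memory (assumptions: the toggle is USE_NEW_WORKER_TYPE, settings live in /etc/pulp/settings.py, and the service names are as shown - verify all of these against your install before using):

# Fall back to the old RQ tasking system (pulpcore 3.13/3.14 only)
echo "USE_NEW_WORKER_TYPE = False" >> /etc/pulp/settings.py

# Restart the workers so they pick up the change; the RQ system also needs
# the resource-manager service running
systemctl restart 'pulpcore-worker@*' pulpcore-resource-manager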
Switching my dev-env to current-master, and using the old tasking system, I am still **unable** to reproduce. There have been A LOT of changes in the affected code-paths, in pulp_rpm especially; it may just be that the timing on my dev-box is "different enough" that even the old tasking system won't trigger the problem for me. Where I think we are with the upstream issue is "not critical, since we can't reproduce it either standalone or in Satellite under 3.14".
The Pulp upstream bug status is at POST. Updating the external tracker on this bug.
The Pulp upstream bug status is at MODIFIED. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - CURRENTRELEASE. Updating the external tracker on this bug.