Bug 2062526
| Summary: | Another deadlock issue when syncing repos with high concurrency |
|---|---|
| Product: | Red Hat Satellite |
| Component: | Pulp |
| Version: | 6.10.3 |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Hao Chang Yu <hyu> |
| Assignee: | satellite6-bugs <satellite6-bugs> |
| QA Contact: | Vladimír Sedmík <vsedmik> |
| CC: | ahumbe, aperotti, avnkumar, bdm, dalley, dhjoshi, dkliban, fgrosjea, ggainey, gkeegan, hdo, iballou, jalviso, jentrena, jjansky, jpasqual, jsenkyri, karnsing, mgomon, momran, msunil, musman, oddvar.tjugen, osousa, pcreech, pdudley, pmendezh, pmoravec, rchan, sadas, satellite6-bugs, saydas, sfroemer, sigbjorn.lie, smeyer, swadeley, tasander, vdeshpan, welter |
| Target Milestone: | 6.13.0 |
| Target Release: | Unused |
| Keywords: | PrioBumpGSS, Triaged |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | python-pulpcore-3.21.2-1, python-pulpcore-3.18.11-1, python-pulpcore-3.16.15-1 |
| Doc Type: | Known Issue |
| Clones: | 2082209 2166750 2170639 (view as bug list) |
| Last Closed: | 2023-05-03 13:21:11 UTC |
| Type: | Bug |
Description
Hao Chang Yu
2022-03-10 04:55:14 UTC
I can reproduce another similar deadlock which happens when doing "bulk_create".
I think both deadlocks only happen when the download policy of the repositories on the Capsule is set to "immediate", because Pulp is then creating/updating the artifact data in bulk.
Steps to reproduce:
Note: Ensure the Capsule is set up with 8 pulpcore workers.
1) Enable and sync a large Red Hat repository. I use "Red Hat Software Collections RPMs for Red Hat Enterprise Linux 7 Server x86_64 7Server" in my reproducer.
2) Set the Capsule download policy to "immediate".
3) Create a new lifecycle environment named 'devel'. Add only this environment to the Capsule, and remove all other lifecycle environments from the Capsule.
4) In the Settings page, Content tab, set "Sync Capsules after Content View promotion" to "No".
5) Create 1 content view and attach the repo from step (1) to it.
6) Make 7 copies of the content view.
7) Publish and promote all 8 content views to 'devel' environment.
8) Trigger a complete capsule sync.
Expected results:
No deadlocks
Actual results:
Multiple deadlocks occur as below:
error:
traceback: |2
File "/usr/lib/python3.6/site-packages/pulpcore/tasking/pulpcore_worker.py", line 339, in _perform_task
result = func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 491, in synchronize
version = dv.create()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/declarative_version.py", line 151, in create
loop.run_until_complete(pipeline)
File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
await asyncio.gather(*futures)
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
await self.run()
File "/usr/lib/python3.6/site-packages/pulpcore/plugin/stages/artifact_stages.py", line 241, in run
d_artifact.artifact for d_artifact in da_to_save
File "/usr/lib/python3.6/site-packages/pulpcore/app/models/content.py", line 84, in bulk_get_or_create
return super().bulk_create(objs, batch_size=batch_size)
File "/usr/lib/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 468, in bulk_create
self._batched_insert(objs_with_pk, fields, batch_size, ignore_conflicts=ignore_conflicts)
File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 1204, in _batched_insert
ignore_conflicts=ignore_conflicts,
File "/usr/lib/python3.6/site-packages/django/db/models/query.py", line 1186, in _insert
return query.get_compiler(using=using).execute_sql(return_id)
File "/usr/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 1377, in execute_sql
cursor.execute(sql, params)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 67, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/usr/lib/python3.6/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/usr/lib/python3.6/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
description: |
deadlock detected
DETAIL: Process 24672 waits for ShareLock on transaction 34053354; blocked by process 24652.
Process 24652 waits for ShareLock on transaction 34053353; blocked by process 24672.
HINT: See server log for query details.
CONTEXT: while inserting index tuple (3,5) in relation "core_artifact_sha256_key"
worker: "/pulp/api/v3/workers/55462376-7328-4c42-b644-0d58b3f7e873/"
Forgot to provide the pg log details:
2022-03-14 20:10:24 AEST ERROR: deadlock detected
2022-03-14 20:10:24 AEST DETAIL: Process 24672 waits for ShareLock on transaction 34053354; blocked by process 24652.
Process 24652 waits for ShareLock on transaction 34053353; blocked by process 24672.
Process 24672: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('c9e2704c-0eba-4cb4-9580-bd2219600973'::uuid, '2022-03-14T10:10:22.763785+00:00'::timestamptz, '2022-03-14T10:10:22.763823+00:00'::timestamptz, 'artifact/04/e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', 93712, '1f84f08e6e3c605994c144036d237787', 'f7c701d5280d09b2e988dcaabe6823b561255c35', '6ed8480a9c89a673baf58150366bfd08326004c33ade2152c9548ab7', '04e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', '5dde72af5557855a1483aafdee29d33da6daa66428267e95b15c7ddb99abe469c4430c45e4e927d5526bc04b2480da68', 'dbbbb267bb9a913e0996e73a6697da898e3dae47e24d63ef49641509a99ed7ce2835e567b8fcc1dcae5497d2a9b7783916a45fd135cb81ee1a9d903f97e07207', '2022-03-14T10:10:22.768579+00:00'::timestamptz), ('b7db0464-69ec-40cb-b677-8959ec2bab91'::uuid, '2022-03-14T10:10:22.768624+00:00'::timestamptz, '2022-03-14T10:10:22.768636
Process 24652: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('3c2a12a5-139f-4a1b-afb3-116c04fd24d2'::uuid, '2022-03-14T10:10:22.808572+00:00'::timestamptz, '2022-03-14T10:10:22.808631+00:00'::timestamptz, 'artifact/50/7a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 201448, '31e0e7891c42e1d65531369ebae5fce3', '87ddb611dcc983eb5e51a8b9e8ec3b52471b9aa7', 'b26f083064b4739fd2a51a2820be2cab931c8f42e5ee7c63a1924ce0', '507a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 'cb245087311bd0a0a2e73b41cbf7d6d3723d102ef5a1f906839bdf80f95a11919c48f2b64b41d358c684035f12572848', '6d8ef5a2f13601174147acb0c3b8ce69052a9a155e36586a596cb88822cf84a8cec814bbcdae6c359b2d74669e2f265aeb83f86c0172c23468015ea3f55a9a31', '2022-03-14T10:10:22.814954+00:00'::timestamptz), ('c798aee7-7152-4205-ac7c-1b639a4e3abe'::uuid, '2022-03-14T10:10:22.815036+00:00'::timestamptz, '2022-03-14T10:10:22.81506
2022-03-14 20:10:24 AEST HINT: See server log for query details.
2022-03-14 20:10:24 AEST CONTEXT: while inserting index tuple (3,5) in relation "core_artifact_sha256_key"
2022-03-14 20:10:26 AEST ERROR: deadlock detected
2022-03-14 20:10:26 AEST DETAIL: Process 24680 waits for ShareLock on transaction 34053354; blocked by process 24652.
Process 24652 waits for ShareLock on transaction 34053358; blocked by process 24680.
Process 24680: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('fc1621f8-dce2-4ff9-b0f6-590af82a609d'::uuid, '2022-03-14T10:10:23.458319+00:00'::timestamptz, '2022-03-14T10:10:23.458388+00:00'::timestamptz, 'artifact/04/e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', 93712, '1f84f08e6e3c605994c144036d237787', 'f7c701d5280d09b2e988dcaabe6823b561255c35', '6ed8480a9c89a673baf58150366bfd08326004c33ade2152c9548ab7', '04e69fee288db61a77bc71512c573a97d4d57be3fee5335fdba38b92d4a75d16', '5dde72af5557855a1483aafdee29d33da6daa66428267e95b15c7ddb99abe469c4430c45e4e927d5526bc04b2480da68', 'dbbbb267bb9a913e0996e73a6697da898e3dae47e24d63ef49641509a99ed7ce2835e567b8fcc1dcae5497d2a9b7783916a45fd135cb81ee1a9d903f97e07207', '2022-03-14T10:10:23.465319+00:00'::timestamptz), ('b8b49a74-7a79-4812-b86d-72114529cdb0'::uuid, '2022-03-14T10:10:23.465392+00:00'::timestamptz, '2022-03-14T10:10:23.465421
Process 24652: INSERT INTO "core_artifact" ("pulp_id", "pulp_created", "pulp_last_updated", "file", "size", "md5", "sha1", "sha224", "sha256", "sha384", "sha512", "timestamp_of_interest") VALUES ('3c2a12a5-139f-4a1b-afb3-116c04fd24d2'::uuid, '2022-03-14T10:10:22.808572+00:00'::timestamptz, '2022-03-14T10:10:22.808631+00:00'::timestamptz, 'artifact/50/7a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 201448, '31e0e7891c42e1d65531369ebae5fce3', '87ddb611dcc983eb5e51a8b9e8ec3b52471b9aa7', 'b26f083064b4739fd2a51a2820be2cab931c8f42e5ee7c63a1924ce0', '507a394ae75878e2bc87a65fa68604f903b2c3c3dcf2690aec5ce31cdb8caf46', 'cb245087311bd0a0a2e73b41cbf7d6d3723d102ef5a1f906839bdf80f95a11919c48f2b64b41d358c684035f12572848', '6d8ef5a2f13601174147acb0c3b8ce69052a9a155e36586a596cb88822cf84a8cec814bbcdae6c359b2d74669e2f265aeb83f86c0172c23468015ea3f55a9a31', '2022-03-14T10:10:22.814954+00:00'::timestamptz), ('c798aee7-7152-4205-ac7c-1b639a4e3abe'::uuid, '2022-03-14T10:10:22.815036+00:00'::timestamptz, '2022-03-14T10:10:22.81506
2022-03-14 20:10:26 AEST HINT: See server log for query details.
2022-03-14 20:10:26 AEST CONTEXT: while inserting index tuple (12,15) in relation "core_artifact_sha256_key"
To prevent the deadlock during bulk_create, I think we need to enforce the order of the inserts too. Maybe sorting the objects before passing them to bulk_create will work?

(In reply to Grant Gainey from comment #5)
> @hyu, @pmoravec - I could use some eyes on a) the reproducer-script in the associated github issue (2420), and b) the fix in its associated PR (#2421). The script will force a deadlock every time if you a) have >10 workers, and b) aren't just doing get_or_create on existing Artifacts each time. With the PR, I haven't been able to recreate again. But, with deadlocks, More Eyes are *always* better.

Hi Grant

Regarding PR #2421:
1) In my opinion, the sorting should be done inside "bulk_get_or_create" so that it will be safe to call everywhere.
2) IIUC, each stage processes 500 records per batch, so we can only guarantee that each batch of 500 records is created in order; I think a deadlock could still happen, with very low probability.
3) The PR only fixes the issue in #comment 2, not the one in #comment 0, which I am still unable to reproduce.

Regards
Hao

[NOTE: I'm marking the discussion(s) here public - there's no sensitive info in them, and they may save some other poor soul in the future from having to wonder how we arrived at our fix(es)]

(In reply to Hao Chang Yu from comment #6)
> Regarding PR #2421:
> 1) In my opinion, the sorting should be done inside "bulk_get_or_create" so that it will be safe to call everywhere.
You can't sort "inside", because you have to know what to sort on, and that is known only by the caller. We only use this in two places currently, and the other one is already protected against this deadlock: https://github.com/pulp/pulpcore/blob/main/pulpcore/plugin/stages/content_stages.py#L165

> 2) IIUC, each stage is processing 500 records per batch so we are only able to guarantee each 500 records are created in order so I think deadlock could still happen in a very low possibility.

As long as each batch is processed as a single transaction (which I believe it is), and as long as every batch uses the same sort algorithm, we're good here. If the entire set were one gigantic transaction of multiple batches, then you'd be right - but if that were the case, *none* of the deadlock fixes we've done over the last few months would have worked.

> 3) The PR only fixing the issue in #comment 2 but not #comment 0 which I am still unable to reproduce it.

Oh ugh, you are correct - I managed to miss that there are two different sources here. The first traceback comes from here: https://github.com/pulp/pulpcore/blob/3.14/pulpcore/plugin/stages/content_stages.py#L160

We know that postgres updates in whatever order it feels like, unless you do the select_for_update() trick we ironed out while fixing touch(). However, in this case, the skip_locked=True trick may NOT be appropriate, because this update actually matters (unlike the touch() timestamp). I need to understand exactly how/where we're colliding. On the plus side, we only rely on bulk_update() in two places - the one above, and https://github.com/pulp/pulpcore/blob/main/pulpcore/app/tasks/reclaim_space.py#L63 which isn't in 3.14, and which I believe is only ever called from one worker.

I've opened github issue #2430 for this problem, and will start working on figuring out if we even *can* fix it. Stay tuned.
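The caller-side-ordering argument above can be illustrated with a toy model (pure Python, no Django; per-key threading locks stand in for Postgres row locks, and all names are hypothetical). Because every worker takes its locks in the same sorted-key order, no two workers can each hold a row the other is waiting on - which is exactly the wait cycle in the deadlock DETAIL output earlier in this bug:

```python
import threading

# Toy model: per-key locks stand in for Postgres row locks, and each
# worker's batch is one "transaction" that holds its locks until done.
row_locks = {key: threading.Lock() for key in ("a", "b", "c", "d")}
table = {key: 0 for key in row_locks}

def worker(batch):
    # Lock rows in sorted-key order before touching any of them - the
    # "sort before bulk-inserting" idea under discussion.
    keys = sorted(set(batch))
    for key in keys:
        row_locks[key].acquire()
    try:
        for key in batch:
            table[key] += 1  # the "insert/update" work
    finally:
        for key in keys:
            row_locks[key].release()

# Overlapping batches in different orders - the deadlock-prone shape.
threads = [
    threading.Thread(target=worker, args=(["b", "a", "c"],)),
    threading.Thread(target=worker, args=(["c", "b", "a"],)),
    threading.Thread(target=worker, args=(["d", "c"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table)  # {'a': 2, 'b': 2, 'c': 3, 'd': 1} - all updates land, no deadlock
```

Without the `sorted()` call, the first two workers could each grab one of the other's rows and wait forever, just as the two Postgres backends do in the log above.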
@hyu I am having zero luck attempting to force a deadlock on bulk_update(), even with code deliberately designed to Behave Badly. I'll attach the latest attempt; maybe you can see a way to make it more deadlock-prone than I can.

In the place where we call bulk_update() in content_stages, we know that the Content and ContentArtifact entities already exist in the database, so all we need to do is update them after associating the ContentArtifact with an Artifact. I'm creating unique artifacts and content, creating sets of unique ContentArtifact entities and persisting them, and then linking them to the matching Artifacts and updating, each list-update in its own thread. I'm running with 10 workers, have thrown A LOT of concurrency at this machine today, and have not had a single deadlock.

Created attachment 1868374 [details]
Script (attempting to) force deadlock using bulk_update
I am able to reproduce the deadlock error with the following modified script, but not every time. I think it is a matter of timing, which is hard to tackle.

https://gist.github.com/hao-yu/ffc2408cfebe0ad98c3f6235691a7f09

(In reply to Hao Chang Yu from comment #10)
> I am able to reproduce the deadlock error with the following modified script but not every time. I think it is a matter of timing which is hard to tackle.
>
> https://gist.github.com/hao-yu/ffc2408cfebe0ad98c3f6235691a7f09

Oh-ho! Thank you, the changes are instructive. That gives me a direction to go in, thanks much!

@hyu @pmoravec I am going to Think Out Loud here, and would appreciate thoughts/comments. The problem we're (rarely) having here is that:

* two threads are syncing
* there exists overlapping content
* the Content/ContentArtifact elements for some of that overlapping content have already been persisted into the DB
* both threads notice the above
* both threads attempt to link up the ContentArtifacts, for the Content, with the Artifacts-on-disk that have just been sync'd
* the threads update these elements in bulk, and in arbitrary order
* deadlock

Based on what we learned while fixing the touch() bulk-update problem (see https://hackmd.io/@pulp/touch_deadlock to refresh your minds on the details), select_for_update to enforce ordering fixes the problem. The downside we ran into is that select_for_update() loses a lot of concurrency and makes sync quite a bit slower. The way we fixed that was to realize that touch() doesn't care "who wins", as long as *somebody* does - so skip_locked got us our concurrency back.

If we knew that the only way to end up with a lock conflict in this new case was two threads updating Artifact at the same time, I contend that we *also* don't care who wins, *as long as somebody does*. If the conflicting threads are both updating the specific CA with the same Artifact, the result will be the same no matter which thread gets to "win".

If the threads are trying to update with different Artifacts, right now whoever happens to execute "last" (assuming no deadlocks) will "win". If we were to use select_for_update(skip_locked=True), then whoever got there *first* would win. This sounds like it would change behavior - but where we are right now, which thread "wins" is a coin toss; the threads aren't order-enforced in any way. So again, as long as *someone* updates the Artifact, we're where we are now.

The problem, however, is this: in the touch() case, we knew that, no matter who had an entity locked, *or why*, the timestamp field was going to be updated. In this case, we may not be able to "assume" that the right thing is going to happen - the current lock-holder might have the ContentArtifact locked for some reason *other than* "updating the Artifact pointer".

My first thought was, "damn, we're going to have to just take the concurrency hit". My second thought went like this:

* There are only three active fields in core_contentartifact: content_id, relative_path, and artifact_id. created and pulp_id are part of creating the entity, and pulp_last_updated changes only in response to one of these three being changed.
* The content_id/relative_path combo is a unique identifier for a CA. If either of them is changing, we're not dealing with the "same" ContentArtifact any more. Pulp asks the question "does a CA exist with Content foo and rel_path bar/blech", and if the answer is no, creates a new one - it doesn't change an existing one.
* That leaves us artifact_id - and I've already decided I don't care *who* updates it; as long as *someone* wins, we're good.

SO - my tl;dr here is this: I contend that we can address this deadlock issue using the same pattern we used to fix touch(), *including* skip_locked=True to avoid blowing up our sync concurrency. Y'all have been really involved in deadlock/concurrency issues; I would value any insights you have!
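The skip-locks reasoning above can be sketched with a toy model (pure Python, no Django; threading locks stand in for row locks, names are illustrative). It assumes, as the comment does, that any worker's update is acceptable as long as some worker's lands:

```python
import threading

# Toy model of select_for_update(skip_locked=True): every worker wants to
# set the same pointer on the same rows, and we don't care *who* wins as
# long as *somebody* does. A worker that can't take a row's lock skips it
# instead of waiting, so no wait cycle can form - hence no deadlock.
row_locks = {key: threading.Lock() for key in range(8)}
artifact_ptr = {}

def linker(worker_id):
    for key in range(8):
        if row_locks[key].acquire(blocking=False):   # skip_locked analogue
            try:
                artifact_ptr[key] = worker_id        # whoever gets there first wins
            finally:
                row_locks[key].release()
        # else: another worker holds the row and will set the pointer itself -
        # the "as long as somebody wins" assumption from the comment above.

threads = [threading.Thread(target=linker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(artifact_ptr))  # [0, 1, 2, 3, 4, 5, 6, 7] - every row got linked
```

Which worker wins each row is a coin toss, exactly as in the no-ordering status quo; the caveat in the comment (the holder might have the row locked for some *other* reason) is precisely what this toy model cannot capture.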
Actually, I have some doubt that 'select_for_update' without 'skip_locked' will cause a huge performance trade-off in our case, because we are only updating the rows in quite small batches - 500 rows per batch, I think. I guess if a deadlock rarely happens, then select_for_update might not often block other transactions from acquiring locks. Maybe we can use the existing script to do some tests.

If select_for_update without skip_locked has only minimal impact, then I think it is safer to do that, so we don't have to worry about the issue you mentioned above. Otherwise, skipping locks sounds ok to me.

Grant's rationale sounds sane to me. It just relies on "there is just one field in core_contentartifact we can really ask for an update". Shouldn't this be documented somewhere, to prevent future issues if an "editable column" is theoretically added to the table?

(In reply to Pavel Moravec from comment #14)
> Shall not be this somewhere documented to prevent future issues when theoretically adding an "editable row" to the table?

+1 for this. There is a chance that someone will make changes that break this in the future, such as adding a new column. That is one of the reasons I lean more towards not skipping locks, if the price to pay is minimal.

(note: made previous comments public, there's nothing confidential in them and they may help future deadlock-fixers...)

(In reply to Hao Chang Yu from comment #13)
> Actually I am a bit doubt that 'select_for_update' without 'skip_lock' will cause huge performance trade off in our case because we are only updating the rows in quite a small batch. 500 rows per batch I think. I guess if a deadlock is rarely happened then select_for_update might not always block other transactions for acquiring locks. Maybe we can use the existing script to do some tests.
>
> If select_for_update without skip_lock has only minimal impact then I think it is safer to do that so we don't have to worry the issue that you mentioned above.
> Otherwise, I think skip lock sounds ok to me.

After A LOT of experimentation, the performance difference between skip_locked=True and not is noticeable, but not overwhelming - on the order of 50% in the pathologically bad test case I'm using (many threads attempting to update the *exact same* content simultaneously). Given the... unease we share around assuming skip_locked=True is "safe", I'd rather take a small/manageable performance hit and be correct, than otherwise. See https://github.com/pulp/pulpcore/pull/2478 for applying the fix; comments/critiques welcome!

The script I settled on to gather some timings is this:

```python
from pulpcore.app.models.content import Artifact, ContentArtifact
from pulp_file.app.models import FileContent
from hashlib import sha256
from django.db import connection, transaction
from multiprocessing import Process, Array
import statistics
from time import time, sleep
import _thread

artifacts = {}  # filename: Artifact
content = {}    # filename: FileContent

print(">>>BUILDING CONTENT/CA/ARTIFACTS...")
for i in range(10000):
    path = f'/tmp/{i:06d}.txt'
    content_path = f'{i:06d}.txt'
    with open(path, "w") as f:
        f.write(path)
    with open(path, "rb") as f:
        sum256 = sha256(f.read()).hexdigest()
    attrs = {"relative_path": content_path, "digest": sum256}
    fc = FileContent(**attrs)
    fc.save()
    content[path] = fc
    attrs = {"file": path, "sha256": sum256, "size": i}
    a = Artifact(**attrs)
    a.save()
    artifacts[path] = a

# lists of ContentArtifacts
ca_lists = [[], [], [], [], []]
# ca_lists = [[]]
for k in content.keys():
    for i in range(len(ca_lists)):
        attrs = {"content": fc, "relative_path": f'{i}/{content[k].relative_path}'}
        ca = ContentArtifact(**attrs)
        ca.save()
        ca.artifact = artifacts[k]
        ca_lists[i].append(ca)

OUTER = 5
INNER = len(ca_lists)

def bulk_doit(ca_list):
    print(">>> ENTER...")
    ids = [k.pulp_id for k in ca_list]
    connection.close()
    connection.connect()
    start = time()
    with transaction.atomic():
        # len() forces the QS to be evaluated in a way that emits the s_f_u - exists() and count() don't
        len(ContentArtifact.objects.filter(pulp_id__in=ids).only("pulp_id").order_by("pulp_id").select_for_update().values_list())
        ContentArtifact.objects.bulk_update(ca_list, ["artifact"], batch_size=500)
    end = time()
    print(f">>> EXIT {end-start}...")

for r in range(5):
    print("Try round: %s" % str(r + 1))
    # Ensure there is something to update when rerunning this part
    connection.connect()
    ContentArtifact.objects.filter().update(artifact_id=None)
    processes = []
    for i in range(OUTER):
        for j in range(INNER):
            p = Process(target=bulk_doit, args=(ca_lists[j],))
            processes.append(p)
            p.start()
    sleep(3)
    for p in processes:
        p.join()
```

(In reply to Hao Chang Yu from comment #21)
> Sorry for the late reply. I was off sick yesterday.
>
> 1) About 500 ISE errors
>
> It seems like pulpcore is using a lot of postgres connections here. There are 14 pulp content gunicorn processes, each with 13 workers; 6 pulp wsgi gunicorn processes, each with 5 workers; and 6 pulpcore workers. It appears that each worker is using more postgres connections. It could be normal but I am not familiar with this. @ggainey any idea why pulp is using so many postgres connections?
>
> # ps -aux | grep -P "postgres: pulp pulpcore.* idle" | wc -l
> 422

I have spent some time experimenting/investigating here, and I can neither explain nor reproduce excessive pgres connections from pulp. The number of connections open at once will be (pulp-content * workers-per-process) + (pulp-api(wsgi) * workers-per-process) + pulpcore-workers. For the setup in this reproducer, that would be (14*13) + (6*5) + 6 = 218 connections. That's a lot - but it's not 422. Typically, when testing high-concurrency situations on my dev setup, I run with 20-30 pulpcore workers without hitting max-pgres-cnx problems.
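The expected-connection arithmetic above works out as follows (a trivial sketch; the one-connection-per-worker assumption is the comment's, not a measured fact):

```python
# Expected simultaneous Postgres connections for the reproducer setup,
# assuming one connection per gunicorn worker and per pulpcore worker
# (the process/worker counts come from the comment above).
content_procs, content_workers = 14, 13   # pulp-content gunicorn
api_procs, api_workers = 6, 5             # pulp-api (wsgi) gunicorn
pulpcore_workers = 6

expected = content_procs * content_workers + api_procs * api_workers + pulpcore_workers
print(expected)  # 218 - well short of the 422 idle connections observed
```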
I can't explain why the Satellite is using so many connections - but it's the reason the syncs on the capsule are failing, not the dup-key error, nor a deadlock (of which we don't see any).

> 2) About the "psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "core_artifact_sha256_key"" error. This appears to be multiple copies of the same rpms being downloaded at the same time, causing a race condition updating the same row.
>
> Is it causing the Capsule to fail to download the rpms? It seems not. All related requests responded with HTTP 200 (see below) and I don't see errors from the Capsule side downloading those rpms.
>
> I think we need to raise another error to handle this error properly.

Actually, unless we can show the sync **failing** because of this, I suspect this is the database telling us "I tried to create an Artifact that wasn't here when I started, but Some Other Thread created it while I was busy (re)downloading it." This is absolutely something that happens under high-concurrency, overlapping-content situations - each repo is checking "does this already exist", getting a "no", and downloading it at the same time. One of them wins. The others fail. As you note in your investigation, Hao, *the content is there* once the sync completes. The net is, I wouldn't take a BZ on that traceback until we can show that it's causing the affected repo to be missing the overlapping content.

@ltran - any chance we can restart the reproducer machines here? Whatever is going on with the pgres connections, it's not "expected pulp behavior", and seems to be connected to this particular configuration/setup.

Requesting needsinfo from upstream developer dkliban, ggainey because the 'FailedQA' flag is set.

Debugging notes: On the Satellite, I poked around inside postgres asking questions about pulpcore connections.
```
# su - postgres -c "psql pulpcore"
psql (12.11)
Type "help" for help.

pulpcore=# select datid, datname, pid, usesysid, usename, query from pg_stat_activity where datname = 'pulpcore';
```

This query shows lots of rows like this one:

```
24264 | pulpcore | 21198 | 24263 | pulp | SELECT "core_remote"."pulp_id", "core_remote"."pulp_created", "core_remote"."pulp_last_updated", "core_remote"."pulp_type", "core_remote"."name", "core_remote"."url", "core_remote"."ca_cert", "core_remote"."client_cert", "core_remote"."client_key", "core_remote"."tls_validation", "core_remote"."username", "core_remote"."password", "core_remote"."proxy_url", "core_remote"."proxy_username", "core_remote"."proxy_password", "core_remote"."download_concurrency", "core_remote"."max_retries", "core_remote"."policy", "core_remote"."total_timeout", "core_remote"."connect_timeout", "core_remote"."sock_connect_timeout", "core_remote"."sock_read_timeout", "core_remote"."headers", "core_remote"."rate_limit" FROM "core_remote" WHERE "core_remote"."pulp_id" = '09325f08-7805-4e44-8580-98e65c11434e'::uuid LIMIT 21
```

```
pulpcore=# select count(pid) from pg_stat_activity where datname = 'pulpcore' and query like '%09325f08-7805-4e44-8580-98e65c11434e%';
 count
-------
   413
```

Of the 434 connections pulp is holding to postgres, 413 (!) are idle, having last asked about this one remote. So we're running out of connections because something is querying a remote, *and then not closing its db connection*. I think. I have no idea why/where that's happening. Debugging continues.

The Artifact UQ errors are due to the content handler delivering content to the capsule, for 8 identical CVs, for a repo that was sync'd "on_demand". So occasionally, multiple content-handler threads ask for the same content "for the first time" at the same time. The on-demand processing for each such collision attempts to "make the content real" multiple times; one such thread "wins" and stores the Artifact, and everyone else fails.
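The one-thread-wins collision described above can be sketched with a toy model (pure Python; a dict guarded by a lock stands in for the unique index on sha256, and all names are illustrative):

```python
import threading

index_lock = threading.Lock()   # stands in for the unique index on sha256
artifacts_by_sha = {}
outcomes = []

def store_artifact(sha, payload):
    # Several threads finish downloading the same artifact and race to
    # insert it; exactly one insert succeeds, the rest hit the "duplicate
    # key" path and fall back to the row that already exists - so the
    # content is still there once the dust settles.
    with index_lock:
        if sha in artifacts_by_sha:
            outcomes.append("conflict")   # UniqueViolation analogue
            return artifacts_by_sha[sha]
        artifacts_by_sha[sha] = payload
        outcomes.append("created")
        return payload

threads = [
    threading.Thread(target=store_artifact, args=("sha-of-one-rpm", f"download-{i}"))
    for i in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outcomes.count("created"), outcomes.count("conflict"))  # 1 4
```

One thread "creates", the other four see a conflict yet end up with the same stored artifact - which is why the traceback is harmless as long as the content is present after the sync.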
I *think* the traceback is due to: Thread-1 wins; Thread-2 fails, catches the error, and throws away the file it was downloading based on its sha256; Thread-3 *also* fails, catches the exception, and then fails during cleanup because Thread-2 already did it (that's the os.unlink() point in the traceback from #c21). This should be harmless - we might consider a low-priority "please notice this harmless failure and do not log a traceback if it happens" BZ, but it isn't anything to do with the deadlock.

My current thinking is that the multiple connections are involved in "attempting to realize too many overlapping on_demand Artifacts at the same time". I don't yet see exactly *how*, but that's the trail I am currently following. @hyu, @dkliban - any thoughts on whether I'm making sense here? (Also, any ideas would be very welcome)

@ggainey I don't have any immediate answers, but the googleable name for this problem is "request collapsing" or "request coalescing", and I think Nginx and some cache servers like Varnish have settings that can help with it.

https://developer.fastly.com/learning/concepts/request-collapsing/
https://cloud.google.com/cdn/docs/caching#request-coalescing
https://info.varnish-software.com/blog/request-coalescing-and-other-reasons-to-use-varnish-as-an-origin-shield
https://fly.io/blog/the-5-hour-content-delivery-network/

We could probably implement something similar to the Pulp 2 model, in which Squid cached the first downloaded content and a deferred download process ran periodically to move the cached contents into storage.

OK gang, here we go. The "too many connections" issue is coming from the content-app; I have opened a pulpcore issue: https://github.com/pulp/pulpcore/issues/2816

This needs its own BZ - I think QE is the best person to be the "opener". Please link the above issue to the BZ when it's open.
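The request-coalescing idea referenced above can be sketched in a few lines (a toy model, not Pulp or Varnish code; names are illustrative): concurrent requests for the same uncached key share one in-flight fetch instead of each downloading independently, which would also sidestep the N-threads-insert-the-same-Artifact race.

```python
import threading

in_flight = {}          # key -> Event signalling "download finished"
results = {}            # key -> downloaded content
in_flight_lock = threading.Lock()
fetch_count = 0

def fetch_origin(key):
    # Stands in for the actual (slow) download from the origin server.
    global fetch_count
    fetch_count += 1
    return f"content-for-{key}"

def get_content(key):
    with in_flight_lock:
        ev = in_flight.get(key)
        if ev is None:          # first requester becomes the leader
            ev = threading.Event()
            in_flight[key] = ev
            leader = True
        else:
            leader = False
    if leader:
        results[key] = fetch_origin(key)
        ev.set()                # wake every coalesced waiter
    else:
        ev.wait()               # coalesce: ride along on the leader's fetch
    return results[key]

returned = []
threads = [threading.Thread(target=lambda: returned.append(get_content("some-rpm")))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(fetch_count, len(returned))  # 1 5 - five requests, a single origin fetch
```

A production version would also evict the in_flight entry and handle fetch failures, but the shape - a leader per key, everyone else waiting on its result - is the whole trick.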
In the meantime - to verify the *actual* BZ in question here (which is "the capsule deadlocks when syncing") - we should start by syncing RHEL7 *immediate* instead of on-demand on the Satellite side, and then follow the rest of the test steps, to see if we can force a deadlock on the capsule.

(In reply to Grant Gainey from comment #35)
> OK gang, here we go. The "too many connections" issue is coming from the content-app; I have opened a pulpcore issue: https://github.com/pulp/pulpcore/issues/2816
>
> This needs its own BZ - I think QE is the best person to be the "opener".

Bug 2096198 - Too many connection issue occurring for on-demand content sync

> Please link the above issue to the BZ when it's open.

It's done (2022-06-13).

Requesting needsinfo from upstream developer ggainey because the 'FailedQA' flag is set.

We encountered this issue when upgrading Capsules from 6.9.9 to 6.10.6. On 2022-06-28 10:14:47 UTC we encountered the issue on Acceptance, and yesterday on Production.

*** Bug 2119888 has been marked as a duplicate of this bug. ***

Followed the description for retesting. From the UI:

deadlock detected
DETAIL: Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
HINT: See server log for query details.
CONTEXT: while locking tuple (16,67) in relation "core_contentartifact"

Could not lookup a publication_href for repo 22

From /var/log/messages:

Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: pulp [427040d9-f4ce-4908-abf2-1b65caf52169]: pulpcore.tasking.pulpcore_worker:INFO: Task 14f464a2-78b4-42a1-a758-36a74165a5e1 failed (deadlock detected
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: DETAIL: Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: HINT: See server log for query details.
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: CONTEXT: while locking tuple (16,67) in relation "core_contentartifact"
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: )
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: pulp [427040d9-f4ce-4908-abf2-1b65caf52169]: pulpcore.tasking.pulpcore_worker:INFO: File "/usr/lib/python3.9/site-packages/pulpcore/tasking/pulpcore_worker.py", line 410, in _perform_task
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: result = func(*args, **kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulp_rpm/app/tasks/synchronizing.py", line 556, in synchronize
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: repo_version = dv.create() or repo.latest_version()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/declarative_version.py", line 161, in create
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: loop.run_until_complete(pipeline)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib64/python3.9/asyncio/base_events.py", line 642, in run_until_complete
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: return future.result()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 225, in create_pipeline
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: await asyncio.gather(*futures)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/api.py", line 43, in __call__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: await self.run()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/content_stages.py", line 200, in run
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: await sync_to_async(process_batch)()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/asgiref/sync.py", line 414, in __call__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: ret = await asyncio.wait_for(future, timeout=None)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib64/python3.9/asyncio/tasks.py", line 442, in wait_for
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: return await fut
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 52, in run
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: result = self.fn(*self.args, **self.kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/asgiref/sync.py", line 455, in thread_handler
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: return func(*args, **kwargs)
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/pulpcore/plugin/stages/content_stages.py", line 189, in process_batch
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: len(ContentArtifact.objects.filter(pk__in=subq).values_list())
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 262, in __len__
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: self._fetch_all()
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: self._result_cache = list(self._iterable_class(self))
Sep 15 11:14:40 localhost pulpcore-worker-5[2254]: File "/usr/lib/python3.9/site-packages/django/db/models/query.py", line 140, in __iter__

Failed QA on 6.12.0 snap 11 with python39-pulpcore-3.18.10-1.el8pc.noarch.

Postgres log output from the above stacktrace:
===
2022-09-15 11:14:40 EDT ERROR: deadlock detected
2022-09-15 11:14:40 EDT DETAIL: Process 2271 waits for ShareLock on transaction 76177; blocked by process 2253.
Process 2253 waits for ShareLock on transaction 70667; blocked by process 2271.
Process 2271: SELECT "core_contentartifact"."pulp_id", "core_contentartifact"."pulp_created", "core_contentartifact"."pulp_last_updated", "core_contentartifact"."artifact_id", "core_contentartifact"."content_id", "core_contentartifact"."relative_path" FROM "core_contentartifact" WHERE "core_contentartifact"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_contentartifact" U0 WHERE U0."pulp_id" IN ('0a540da6-7eb0-4fa7-9d35-0c09c82b0849'::uuid, 'a31f82bf-00ad-475d-99b1-5aa125832052'::uuid, 'a387baf4-5cfe-4719-bb74-bec2f05fca1b'::uuid, 'a3975a19-dcd9-4358-8a27-7c343d3b4968'::uuid, 'a3e9026b-3ba8-4d0f-a9bf-b33c5daffbd7'::uuid, 'a407bf4e-f19e-42b1-8754-d4a87d4f15dc'::uuid, 'a4121c97-9708-4fca-b610-937e9af677b6'::uuid, 'a4853235-ae93-4aa8-8afa-8a8416393de7'::uuid, 'a48f4c67-e8cc-4b84-8e55-adbced9a67e5'::uuid, 'a5a95699-251c-483d-9c89-82f6835378ee'::uuid, 'a7268c36-03c6-418f-8327-0b4cd1429f21'::uuid, 'a780ec88-c44f-4e93-8e35-5a9014041644'::uuid, 'a79cde66-17c9-47c2-91a7-66c3467b3ce2'::uuid, 'a81051c3-59e9-4ba1-9097-e8386c10678e'::uu
Process 2253: COMMIT
2022-09-15 11:14:40 EDT HINT: See server log for query details.
2022-09-15 11:14:40 EDT CONTEXT: while locking tuple (16,67) in relation "core_contentartifact"
2022-09-15 11:14:40 EDT STATEMENT: SELECT "core_contentartifact"."pulp_id", "core_contentartifact"."pulp_created", "core_contentartifact"."pulp_last_updated", "core_contentartifact"."artifact_id", "core_contentartifact"."content_id", "core_contentartifact"."relative_path" FROM "core_contentartifact" WHERE "core_contentartifact"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_contentartifact" U0 WHERE U0."pulp_id" IN ('0a540da6-7eb0-4fa7-9d35-0c09c82b0849'::uuid, 'a31f82bf-00ad-475d-99b1-5aa125832052'::uuid, 'a387baf4-5cfe-4719-bb74-bec2f05fca1b'::uuid, 'a3975a19-dcd9-4358-8a27-7c343d3b4968'::uuid, 'a3e9026b-3ba8-4d0f-a9bf-b33c5daffbd7'::uuid, 'a407bf4e-f19e-42b1-8754-d4a87d4f15dc'::uuid, 'a4121c97-9708-4fca-b610-937e9af677b6'::uuid, 'a4853235-ae93-4aa8-8afa-8a8416393de7'::uuid, [insert many ids here] '9fbd8d81-61a6-4d88-a358-07712d88153b'::uuid, 'a07d4e89-e7e5-401c-898c-687d08740fdc'::uuid, 'a09c7ca9-c2eb-4e70-bf0d-31afd8063554'::uuid, 'a1e5cd9c-8886-4bda-8d30-3b69d64f77f3'::uuid, 'a1f811a8-43c0-4bd0-b59a-145d7e5c22cb'::uuid) ORDER BY U0."pulp_id" ASC FOR UPDATE)
2022-09-15 11:14:40 EDT LOG: duration: 1315.635 ms statement: COMMIT
===
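The SELECT in the log already tries to defend against this by taking its row locks in a deterministic order (`ORDER BY U0."pulp_id" ASC FOR UPDATE`); the deadlock arises against a competing transaction's COMMIT, whose earlier writes had acquired row locks in a different order. The general principle - every participant acquires shared resources in one global order, so no cyclic wait can form - can be illustrated with plain threading locks. This is a toy analogy only, not Pulp or Postgres code:

```python
import threading

# Two "rows" guarded by locks; the names stand in for pulp_id values.
locks = {name: threading.Lock() for name in ("artifact-a", "artifact-b")}


def update_rows(names):
    # Sorting mimics "ORDER BY pulp_id ASC FOR UPDATE": every thread takes
    # the locks in the same global order, so no cyclic wait can form.
    ordered = sorted(names)
    for name in ordered:
        locks[name].acquire()
    try:
        pass  # ... the batched UPDATE would happen here ...
    finally:
        for name in reversed(ordered):
            locks[name].release()


# Each thread *asks* for the rows in a different order; the sort prevents
# the classic A-waits-for-B / B-waits-for-A deadlock seen in the log above.
threads = [
    threading.Thread(target=update_rows, args=(["artifact-b", "artifact-a"],)),
    threading.Thread(target=update_rows, args=(["artifact-a", "artifact-b"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
done = True
```

Note also that Postgres resolves a detected deadlock by aborting one of the transactions, which is consistent with the observation later in this thread that an immediate second "Complete Sync" completes successfully - the surviving state lets a retry go through.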
Just an observation, but doing a complete resync does sync the capsule successfully.

(In reply to Lai from comment #66)
> Just an observation, but doing a complete resync does sync the capsule
> successfully.

Perhaps I understood wrong, but did you mean that:

* The first "Complete Sync" attempt fails with the deadlock
* But an immediate second "Complete Sync" attempt works fine and completes successfully?

Is it the same for the cloned BZ (for 6.11.3) as well?

Sayan,

That is correct. A second "Complete Sync" completes successfully. And yes, it's the same as the cloned BZ for 6.11.3.

Created attachment 1942081 [details]
RHEL 7 Hotfix RPM for Satellite 6.11.4
Created attachment 1942082 [details]
RHEL 8 Hotfix RPM for Satellite 6.11.4
Hotfix RPMs are available for this BZ for Satellite 6.11.4 on RHEL 7 and RHEL 8.

INSTALL INSTRUCTIONS (Satellite 6.11.4 on RHEL 7):

1. Take a complete backup or snapshot of the Satellite 6.11.4 server.
2. Download the applicable RHEL 7 hotfix RPM attached to this BZ.
3. # yum localinstall tfm-pulpcore-python3-pulpcore-3.16.14-2.HOTFIXRHBZ2062526.el7pc.noarch.rpm --disableplugin=foreman-protector
4. # satellite-maintain service restart

INSTALL INSTRUCTIONS (Satellite 6.11.4 on RHEL 8):

1. Take a complete backup or snapshot of the Satellite 6.11.4 server.
2. Download the applicable RHEL 8 hotfix RPM attached to this BZ.
3. # yum localinstall python38-pulpcore-3.16.14-2.HOTFIXRHBZ2062526.el8pc.noarch.rpm --disableplugin=foreman-protector
4. # satellite-maintain service restart

I would recommend also noting that this should be applied to the Capsule(s) as well? :) At least I assume that to be the case.

Much appreciated you guys getting this hotfix out ASAP. I will update here once I have more info from the latest full sync.

Thanks!

(In reply to Mike G from comment #95)
> I would recommend also noting that this should be applied to the
> Capsule(s) as well? :) At least I assume that to be the case.
>
> Much appreciated you guys getting this hotfix out ASAP. I will update here
> once I have more info from the latest full sync.
>
> Thanks!

Good point! Unless I am mistaken, the hotfix should be applied *primarily* on the *Capsules* - it is the Capsule's workers that enter the deadlock, so what pulp does on the Satellite does not affect it.

Still having sync issues with the capsules; not sure if it's related to this patch or not, but we are no longer getting deadlock warnings - now it's something new. I am working with Paul in a case and can update once we have a more definitive conclusion. Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: Satellite 6.13 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2097

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.