Description of problem:

Trying to publish a CV with filters, it gets stuck for hours in a very specific situation:

- dynflow shows "Actions::Pulp3::Repository::MultiCopyContent" stuck
- a particular pulp task has been running for hours:

  pulp_data: !ruby/hash:ActiveSupport::HashWithIndifferentAccess
    pulp_href: "/pulp/api/v3/tasks/b80f4c0d-53bf-43b3-8795-d91abc4d5459/"
    pulp_created: '2021-08-18T08:20:03.469+00:00'
    state: running
    name: pulp_rpm.app.tasks.copy.copy_content
    logging_cid: c8dd201a-d6f1-424f-9e73-5ab21de80409
    started_at: '2021-08-18T08:27:22.914+00:00'
    worker: "/pulp/api/v3/workers/ebd2e107-c180-4fab-9a92-034453ae5cbe/"
    child_tasks: []
    progress_reports: []
    created_resources: []
    reserved_resources_record:
    - "/pulp/api/v3/repositories/rpm/rpm/9abdfada-1688-469c-9444-cedaec1571d3/"
    - "/pulp/api/v3/repositories/rpm/rpm/747f3752-43f0-4979-b35f-34b39ed03ece/"
    - "/pulp/api/v3/repositories/rpm/rpm/b9ea0ab0-3afd-4f06-8963-39c57c4e95bd/"
    - "/pulp/api/v3/repositories/rpm/rpm/b5532b58-1266-45c1-a013-934fd480631e/"
    - "/pulp/api/v3/repositories/rpm/rpm/1cbe9139-97df-4671-87d9-30e4c392de31/"
    - "/pulp/api/v3/repositories/rpm/rpm/273702ef-bcad-4d5c-97f6-7019081c059d/"
    - "/pulp/api/v3/repositories/rpm/rpm/57de3423-d519-407b-aeea-5a807d8795d1/"
    - "/pulp/api/v3/repositories/rpm/rpm/1a317e8f-035f-4efb-b208-592089b0bc8a/"
    href: "/pulp/api/v3/tasks/b80f4c0d-53bf-43b3-8795-d91abc4d5459/"

- no pulp worker is active at all
- BUT one postgres process spins at 100% CPU the whole time
- some debugging shows the postgres process is executing a pulpcore-worker query:

  SELECT (1) AS "a" FROM "rpm_package"
  WHERE ("rpm_package"."name" = 'authd'
         AND "rpm_package"."content_ptr_id" IN
           (SELECT V0."pulp_id" FROM "core_content" V0
            INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id")
            WHERE (V1."pulp_id" IN
                     (SELECT U0."pulp_id" FROM "core_repositorycontent" U0
                      INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id")
                      LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id")
                      WHERE (U0."repository_id" = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid
                             AND U2."number" <= 1
                             AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL)))
                   AND V0."pulp_id" IN ('001955cb-b57d-4f12-bf56-0bb44f9d423a'::uuid,
                                        '001d1a44-6b78-4e33-a6d9-cbe208451b19'::uuid,
                                        '0061c430-8991-4a84-9f8b-96e1f5634e9e'::uuid,
                                        ........
                                        .. (a list of 3916 core_content.pulp_id items follows) ..
                                        '10578ec2-acef-4d11-a0d4-b17635343eea'::uuid))))
  LIMIT 1

Sadly, I am unable to see the pulpcore-worker backtrace (I would *love* a procedure for generating a backtrace with locals from a running python3 process). I have a coredump of the pulpcore-worker available, if needed.

Version-Release number of selected component (if applicable):

Sat 6.10 snap 13:
pulpcore-selinux-1.2.4-1.el7pc.x86_64
python3-pulp-file-1.8.1-1.el7pc.noarch
python3-pulp-rpm-3.14.0-1.el7pc.noarch
pulp-client-1.0-1.noarch
python3-pulp-ansible-0.9.0-1.el7pc.noarch
python3-pulpcore-3.14.3-1.el7pc.noarch
python3-pulp-container-2.7.1-1.el7pc.noarch

How reproducible:

I bet 100%, as I have noticed the same a few times on a few different CVs with filters on multiple repos.

Steps to Reproduce:
1. Sync 4 repos (basically those for a Capsule, but skip the satellite maintenance repo):
   - RHEL7 software collections
   - Sat6.9 Capsule
   - RHEL7
   - Ansible 2.9
2. Create a CV with the four repos and a filter "all content frozen on 2019-01-01".
One can use the hammer commands below (customize repo IDs 1,2,3,4 if needed):

content-view create --name mycv_Caps_no_include_2019-01-01 --organization-id=1 --repository-ids=1,2,3,4 --solve-dependencies=no
content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_base --inclusion=true --original-packages=true --type=rpm
content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_errata --inclusion=true --type=erratum
content-view filter rule create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --content-view-filter=include_errata --date-type='updated' --end-date='2019-01-01'
content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_modules --inclusion=true --original-module-streams=true --type=modulemd
content-view publish --name mycv_Caps_no_include_2019-01-01 --organization-id 1

3. Wait for Godot... er, for hours, hoping the Publish completes.

Actual results:
3. CV publish stuck for >8 hours in the state described above.

Expected results:
3. CV publish completes within <30 minutes (6.9 needs 22 mins).

Additional info:
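Regarding the note above about wanting a backtrace with locals from a running python3 process: a minimal sketch of one way to do it, assuming py-spy (`pip3 install py-spy`) can be installed on the box. The helper functions are mine, not part of any Pulp tooling:

```python
# Hedged sketch: capture the Python stack (with local variables) of a live
# pulpcore-worker process from the outside, using py-spy's "dump" subcommand.
import shutil
import subprocess

def dump_command(pid: int) -> list:
    # --locals asks py-spy to print the local variables of every frame.
    return ["py-spy", "dump", "--pid", str(pid), "--locals"]

def dump_worker_stack(pid: int) -> str:
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found; try: pip3 install py-spy")
    # Must run as root (or the worker's user) to be allowed to attach.
    result = subprocess.run(dump_command(pid), capture_output=True,
                            text=True, check=True)
    return result.stdout
```

From a shell this is just `sudo py-spy dump --pid <worker_pid> --locals`; for the coredump already captured, gdb with the python debug extensions (`py-bt`) would be the offline alternative.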
There is some progress: the pulpcore-worker is querying one package after another with the same pattern, i.e. queries like:

SELECT (1) AS "a" FROM "rpm_package" WHERE ("rpm_package"."name" = 'authd' AND "rpm_package"."content_ptr_id" IN ..
..
SELECT (1) AS "a" FROM "rpm_package" WHERE ("rpm_package"."name" = 'mpitests-mvapich23-psm' AND "rpm_package"."content_ptr_id" IN ..
SELECT (1) AS "a" FROM "rpm_package" WHERE ("rpm_package"."name" = 'mvapich2-2.0-doc' AND ..
..
SELECT (1) AS "a" FROM "rpm_package" WHERE ("rpm_package"."name" = 'mvapich2-2.2-doc' AND ..
..
SELECT (1) AS "a" FROM "rpm_package" WHERE ("rpm_package"."name" = 'mvapich2-2.2-psm-devel' AND ..

(sometimes a different query appears, like:

SELECT "core_content"."pulp_id", "core_content"."pulp_created", "core_content"."pulp_last_updated", "core_content"."pulp_type", "core_content"."upstream_id", "core_content"."timestamp_of_interest", "rpm_package"."content_ptr_id", ..

but the queries above dominate)

postgresql logs show that such a query (of either type) takes 50 seconds(!!!) on average. Now multiply that by the number of packages in the CV.
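A back-of-envelope for the "multiply it by the number of packages" point. The per-query cost is from the postgresql logs above; the count of distinct package names is an illustrative guess, not a measured number:

```python
# Each distinct package name triggers one membership query at ~50 s apiece.
avg_query_s = 50        # average observed in postgresql logs
distinct_names = 2000   # illustrative guess at package names in the CV
total_hours = avg_query_s * distinct_names / 3600
print(f"~{total_hours:.0f} hours of query time alone")  # ~28 hours
```

Even a conservative name count puts the publish at a day-plus of pure SQL time, which matches the ">8 hours and still stuck" observation.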
EXPLAIN ANALYZE of the "SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE .. ":

Limit  (cost=3719.94..5444.00 rows=1 width=4) (actual time=50092.715..50092.726 rows=0 loops=1)
  ->  Nested Loop Semi Join  (cost=3719.94..5444.00 rows=1 width=4) (actual time=50092.714..50092.724 rows=0 loops=1)
        Join Filter: (rpm_package.content_ptr_id = v0.pulp_id)
        Rows Removed by Join Filter: 23436
        ->  Bitmap Heap Scan on rpm_package  (cost=4.50..44.07 rows=10 width=16) (actual time=0.069..0.144 rows=6 loops=1)
              Recheck Cond: ((name)::text = 'xorg-x11-apps'::text)
              Heap Blocks: exact=6
              ->  Bitmap Index Scan on rpm_package_name_epoch_version_relea_c9003ffa_uniq  (cost=0.00..4.50 rows=10 width=0) (actual time=0.057..0.058 rows=6 loops=1)
                    Index Cond: ((name)::text = 'xorg-x11-apps'::text)
        ->  Materialize  (cost=3715.44..5399.78 rows=1 width=32) (actual time=60.107..8347.616 rows=3906 loops=6)
              ->  Nested Loop  (cost=3715.44..5399.77 rows=1 width=32) (actual time=360.634..50073.121 rows=3906 loops=1)
                    ->  Nested Loop  (cost=3715.02..3715.39 rows=1 width=16) (actual time=359.339..874.968 rows=36969 loops=1)
                          ->  HashAggregate  (cost=3714.60..3714.61 rows=1 width=16) (actual time=359.307..401.529 rows=36969 loops=1)
                                Group Key: u0.pulp_id
                                ->  Nested Loop  (cost=986.92..3714.60 rows=1 width=16) (actual time=2.282..334.230 rows=36969 loops=1)
                                      Join Filter: (u0.version_added_id = u2.pulp_id)
                                      Rows Removed by Join Filter: 1072101
                                      ->  Hash Left Join  (cost=986.92..3711.87 rows=1 width=32) (actual time=2.270..18.856 rows=36969 loops=1)
                                            Hash Cond: (u0.version_removed_id = u3.pulp_id)
                                            Filter: ((u3.number > 1) OR (u3.number IS NULL))
                                            ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=984.36..3611.79 rows=37154 width=48) (actual time=2.205..9.221 rows=36969 loops=1)
                                                  Recheck Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                                  Heap Blocks: exact=494
                                                  ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..975.07 rows=37154 width=0) (actual time=2.145..2.145 rows=36969 loops=1)
                                                        Index Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                            ->  Hash  (cost=1.69..1.69 rows=69 width=20) (actual time=0.048..0.050 rows=82 loops=1)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                  ->  Seq Scan on core_repositoryversion u3  (cost=0.00..1.69 rows=69 width=20) (actual time=0.012..0.025 rows=82 loops=1)
                                      ->  Seq Scan on core_repositoryversion u2  (cost=0.00..1.86 rows=69 width=16) (actual time=0.002..0.006 rows=30 loops=36969)
                                            Filter: (number <= 1)
                          ->  Index Scan using core_repositorycontent_pkey on core_repositorycontent v1  (cost=0.42..0.79 rows=1 width=32) (actual time=0.011..0.011 rows=1 loops=36969)
                                Index Cond: (pulp_id = u0.pulp_id)
                    ->  Index Only Scan using core_content_pkey on core_content v0  (cost=0.42..1683.52 rows=1 width=16) (actual time=0.193..0.193 rows=0 loops=36969)
                          Index Cond: ((pulp_id = v1.content_id) AND (pulp_id = ANY ('{001955cb-b57d-4f12-bf56-0bb44f9d423a,.......}'::uuid[])))
                          Heap Fetches: 3906
Planning Time: 6.272 ms
Execution Time: 50093.049 ms

EXPLAIN ANALYZE of the other query,

SELECT "core_content"."pulp_id", "core_content"."pulp_created", "core_content"."pulp_last_updated", "core_content"."pulp_type",
       "core_content"."upstream_id", "core_content"."timestamp_of_interest", "rpm_package"."content_ptr_id", "rpm_package"."name",
       "rpm_package"."epoch", "rpm_package"."version", "rpm_package"."release", "rpm_package"."arch", "rpm_package"."evr",
       "rpm_package"."pkgId", "rpm_package"."checksum_type", "rpm_package"."summary", "rpm_package"."description",
       "rpm_package"."url", "rpm_package"."changelogs", "rpm_package"."files", "rpm_package"."requires", "rpm_package"."provides",
       "rpm_package"."conflicts", "rpm_package"."obsoletes", "rpm_package"."suggests", "rpm_package"."enhances",
       "rpm_package"."recommends", "rpm_package"."supplements", "rpm_package"."location_base", "rpm_package"."location_href",
       "rpm_package"."rpm_buildhost", "rpm_package"."rpm_group", "rpm_package"."rpm_license", "rpm_package"."rpm_packager",
       "rpm_package"."rpm_sourcerpm", "rpm_package"."rpm_vendor", "rpm_package"."rpm_header_start", "rpm_package"."rpm_header_end",
       "rpm_package"."size_archive", "rpm_package"."size_installed", "rpm_package"."size_package", "rpm_package"."time_build",
       "rpm_package"."time_file", "rpm_package"."is_modular"
FROM "rpm_package" INNER JOIN "core_content" ON ("rpm_package"."content_ptr_id" = "core_content"."pulp_id")
WHERE ("rpm_package"."name" = 'gcc-objc'
       AND "rpm_package"."content_ptr_id" IN
         (SELECT V0."pulp_id" FROM "core_content" V0
          INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id")
          WHERE (V1."pulp_id" IN
                   (SELECT U0."pulp_id" FROM "core_repositorycontent" U0
                    INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id")
                    LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id")
                    WHERE (U0."repository_id" = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid
                           AND U2."number" <= 1
                           AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL)))
                 AND V0."pulp_id" IN ('001955cb-b57d-4f12-bf56-0bb44f9d423a'::uuid,....(huge_list)

:

Nested Loop  (cost=3717.79..5444.32 rows=1 width=1665) (actual time=49916.914..49934.622 rows=1 loops=1)
  Join Filter: (rpm_package.content_ptr_id = core_content.pulp_id)
  ->  Nested Loop Semi Join  (cost=3717.37..5441.43 rows=1 width=1629) (actual time=49916.883..49934.589 rows=1 loops=1)
        Join Filter: (rpm_package.content_ptr_id = v0.pulp_id)
        Rows Removed by Join Filter: 113144
        ->  Bitmap Heap Scan on rpm_package  (cost=4.50..44.07 rows=10 width=1597) (actual time=0.557..3.550 rows=29 loops=1)
              Recheck Cond: ((name)::text = 'gcc-objc'::text)
              Heap Blocks: exact=29
              ->  Bitmap Index Scan on rpm_package_name_epoch_version_relea_c9003ffa_uniq  (cost=0.00..4.50 rows=10 width=0) (actual time=0.485..0.486 rows=29 loops=1)
                    Index Cond: ((name)::text = 'gcc-objc'::text)
        ->  Materialize  (cost=3712.88..5397.22 rows=1 width=32) (actual time=3.528..1721.304 rows=3902 loops=29)
              ->  Nested Loop  (cost=3712.88..5397.21 rows=1 width=32) (actual time=102.279..49901.653 rows=3906 loops=1)
                    ->  Nested Loop  (cost=3712.46..3712.83 rows=1 width=16) (actual time=100.993..600.317 rows=36969 loops=1)
                          ->  HashAggregate  (cost=3712.04..3712.05 rows=1 width=16) (actual time=100.910..143.389 rows=36969 loops=1)
                                Group Key: u0.pulp_id
                                ->  Nested Loop  (cost=987.06..3712.03 rows=1 width=16) (actual time=5.008..78.188 rows=36969 loops=1)
                                      ->  Hash Left Join  (cost=986.92..3711.87 rows=1 width=32) (actual time=4.972..28.655 rows=36969 loops=1)
                                            Hash Cond: (u0.version_removed_id = u3.pulp_id)
                                            Filter: ((u3.number > 1) OR (u3.number IS NULL))
                                            ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=984.36..3611.79 rows=37154 width=48) (actual time=4.871..19.649 rows=36969 loops=1)
                                                  Recheck Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                                  Heap Blocks: exact=494
                                                  ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..975.07 rows=37154 width=0) (actual time=4.759..4.760 rows=36969 loops=1)
                                                        Index Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                            ->  Hash  (cost=1.69..1.69 rows=69 width=20) (actual time=0.075..0.076 rows=82 loops=1)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                  ->  Seq Scan on core_repositoryversion u3  (cost=0.00..1.69 rows=69 width=20) (actual time=0.023..0.034 rows=82 loops=1)
                                      ->  Index Scan using core_repositoryversion_pkey on core_repositoryversion u2  (cost=0.14..0.16 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=36969)
                                            Index Cond: (pulp_id = u0.version_added_id)
                                            Filter: (number <= 1)
                          ->  Index Scan using core_repositorycontent_pkey on core_repositorycontent v1  (cost=0.42..0.79 rows=1 width=32) (actual time=0.011..0.011 rows=1 loops=36969)
                                Index Cond: (pulp_id = u0.pulp_id)
                    ->  Index Only Scan using core_content_pkey on core_content v0  (cost=0.42..1683.52 rows=1 width=16) (actual time=0.194..0.194 rows=0 loops=36969)
                          Index Cond: ((pulp_id = v1.content_id) AND (pulp_id = ANY ('{001955cb-b57d-4f12-bf56-0bb44f9d423a,001d1a44-6b78-4e33-a6d9-cbe208451b19,.......,10578ec2-acef-4d11-a0d4-b17635343eea}'::uuid[])))
                          Heap Fetches: 3906
  ->  Index Scan using core_content_pkey on core_content  (cost=0.42..2.88 rows=1 width=68) (actual time=0.023..0.023 rows=1 loops=1)
        Index Cond: (pulp_id = v0.pulp_id)
Planning Time: 22.829 ms
Execution Time: 49935.637 ms
The explain-plans above are from the point where, during copy-with-dependencies, we fetch "all the packages in the source and destination repositories". There is no doubt those queries need to be better - but they are only issued once per advanced-copy invocation, and are (in my experience, anyway) completely swamped in large copy requests by the 800-1100 msec per call to run_solver_jobs(), which is invoked **once per content-unit** being copied. Current investigation shows postgres at 100% CPU for the first 2-3 minutes of a copy request; after that, the pulpcore-worker is 100% in Python until completion, and the 10K-unit chunk size in Satellite can take 2-3 **hours** in Python.
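A quick sanity check that the per-call solver cost alone accounts for the observed wall-clock time (the per-call range and the 10K-unit chunk size are the figures from the comment above; the rest is arithmetic):

```python
# run_solver_jobs() is invoked once per content unit being copied.
per_call_ms = (800, 1100)   # observed cost range per call
chunk_units = 10_000        # Satellite's copy chunk size, per the comment above
lo, hi = (ms * chunk_units / 1000 / 3600 for ms in per_call_ms)
print(f"{lo:.1f}-{hi:.1f} hours per chunk")  # 2.2-3.1 hours
```

Which lands squarely on the reported "2-3 hours in python", independent of any SQL cost.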
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
I do feel this can be closed as dup of https://bugzilla.redhat.com/show_bug.cgi?id=2003764 or is there something more required here? The performance improvement per https://bugzilla.redhat.com/show_bug.cgi?id=2003764 sounds sufficient to me, though that BZ addressed a different issue (with similar improvement outcome).
I'm good w/ closing this as a dup of '3764. There may be some suboptimal sql in here, but a) it's kind of necessary for what the workflow is trying to accomplish, and b) as noted is swamped by the bulk of the work in that workflow - optimizing it to *zero* would not have a noticeable effect on large publishes. Pavel, you're the opener - if you're good w/ close-as-dup, who am I to disagree? :) *** This bug has been marked as a duplicate of bug 2003764 ***
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.