Bug 1995232 - Publishing CV with filters stuck on psql request from pulpcore-worker for hours
Summary: Publishing CV with filters stuck on psql request from pulpcore-worker for hours
Keywords:
Status: CLOSED DUPLICATE of bug 2003764
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.10.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Lai
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-18 16:29 UTC by Pavel Moravec
Modified: 2022-05-16 16:21 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Content View promotion performance Promoting a Content View with filters for a large repository may take noticeably longer than under Pulp 2.
Clone Of:
Environment:
Last Closed: 2022-03-03 21:34:22 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github pulp pulp_rpm issues 2301 0 None closed Backport sql-performance fixes for dep solving to 3.14.z 2022-05-16 16:21:23 UTC

Description Pavel Moravec 2021-08-18 16:29:34 UTC
Description of problem:
Trying to publish a CV with filters, it gets stuck for hours in a very specific situation:

- dynflow shows "Actions::Pulp3::Repository::MultiCopyContent" stuck
- particular pulp task running for hours: 
  pulp_data: !ruby/hash:ActiveSupport::HashWithIndifferentAccess
    pulp_href: "/pulp/api/v3/tasks/b80f4c0d-53bf-43b3-8795-d91abc4d5459/"
    pulp_created: '2021-08-18T08:20:03.469+00:00'
    state: running
    name: pulp_rpm.app.tasks.copy.copy_content
    logging_cid: c8dd201a-d6f1-424f-9e73-5ab21de80409
    started_at: '2021-08-18T08:27:22.914+00:00'
    worker: "/pulp/api/v3/workers/ebd2e107-c180-4fab-9a92-034453ae5cbe/"
    child_tasks: []
    progress_reports: []
    created_resources: []
    reserved_resources_record:
    - "/pulp/api/v3/repositories/rpm/rpm/9abdfada-1688-469c-9444-cedaec1571d3/"
    - "/pulp/api/v3/repositories/rpm/rpm/747f3752-43f0-4979-b35f-34b39ed03ece/"
    - "/pulp/api/v3/repositories/rpm/rpm/b9ea0ab0-3afd-4f06-8963-39c57c4e95bd/"
    - "/pulp/api/v3/repositories/rpm/rpm/b5532b58-1266-45c1-a013-934fd480631e/"
    - "/pulp/api/v3/repositories/rpm/rpm/1cbe9139-97df-4671-87d9-30e4c392de31/"
    - "/pulp/api/v3/repositories/rpm/rpm/273702ef-bcad-4d5c-97f6-7019081c059d/"
    - "/pulp/api/v3/repositories/rpm/rpm/57de3423-d519-407b-aeea-5a807d8795d1/"
    - "/pulp/api/v3/repositories/rpm/rpm/1a317e8f-035f-4efb-b208-592089b0bc8a/"
  href: "/pulp/api/v3/tasks/b80f4c0d-53bf-43b3-8795-d91abc4d5459/"
- no pulp worker is anyhow active
- BUT one postgres process spins 100% CPU all the time
- some debugging shows the postgres process is processing a pulpcore-worker request:

SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'authd' AND \"rpm_package\".\"content_ptr_id\" IN (SELECT V0.\"pulp_id\" FROM \"core_content\" V0 INNER JOIN \"core_repositorycontent\" V1 ON (V0.\"pulp_id\" = V1.\"content_id\") WHERE (V1.\"pulp_id\" IN (SELECT U0.\"pulp_id\" FROM \"core_repositorycontent\" U0 INNER JOIN \"core_repositoryversion\" U2 ON (U0.\"version_added_id\" = U2.\"pulp_id\") LEFT OUTER JOIN \"core_repositoryversion\" U3 ON (U0.\"version_removed_id\" = U3.\"pulp_id\") WHERE (U0.\"repository_id\" = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid AND U2.\"number\" <= 1 AND NOT (U3.\"number\" <= 1 AND U3.\"number\" IS NOT NULL))) AND V0.\"pulp_id\" IN ('001955cb-b57d-4f12-bf56-0bb44f9d423a'::uuid, '001d1a44-6b78-4e33-a6d9-cbe208451b19'::uuid, '0061c430-8991-4a84-9f8b-96e1f5634e9e'::uuid, ........
..
(a list of 3916 core_content.pulp_id items follow)
..
'10578ec2-acef-4d11-a0d4-b17635343eea'::uuid))))  LIMIT 1"


Sadly I am unable to see what the pulpcore-worker backtrace is (I would *love* to get a procedure to generate a backtrace with locals for a python3 running process). I have a coredump available of the pulpcore-worker, if needed.


Version-Release number of selected component (if applicable):
Sat 6.10 snap 13:
  pulpcore-selinux-1.2.4-1.el7pc.x86_64
  python3-pulp-file-1.8.1-1.el7pc.noarch
  python3-pulp-rpm-3.14.0-1.el7pc.noarch
  pulp-client-1.0-1.noarch
  python3-pulp-ansible-0.9.0-1.el7pc.noarch
  python3-pulpcore-3.14.3-1.el7pc.noarch
  python3-pulp-container-2.7.1-1.el7pc.noarch


How reproducible:
I bet 100%, as I noticed the same few times on few different CVs with filters on multiple repos.


Steps to Reproduce:
1. Sync 4 repos (basically those for a Capsule but forget satellite maintenance repo):
- RHEL7 software collections
- Sat6.9 Capsule
- RHEL7
- Ansible 2.9
2. Create a CV with the four repos and filter "all content frozen on 2019-01-01". One can use below hammer commands (customize repo IDs 1,2,3,4 if needed):

 content-view create --name mycv_Caps_no_include_2019-01-01 --organization-id=1 --repository-ids=1,2,3,4 --solve-dependencies=no
 content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_base --inclusion=true --original-packages=true --type=rpm
 content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_errata --inclusion=true --type=erratum
 content-view filter rule create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --content-view-filter=include_errata --date-type='updated' --end-date='2019-01-01'
 content-view filter create --organization-id=1 --content-view=mycv_Caps_no_include_2019-01-01 --name=include_modules --inclusion=true --original-module-streams=true --type=modulemd
content-view publish --name mycv_Caps_no_include_2019-01-01 --organization-id 1

3. Wait for Godo.. eh for hours hoping to complete the Publish.


Actual results:
3. CV publish stuck for >8 hours in the state described above.


Expected results:
3. CV Publish to complete within <30 minutes (6.9 needs 22mins).


Additional info:

Comment 1 Pavel Moravec 2021-08-18 20:54:10 UTC
There is some progress: the pulpcore-worker is querying one package after another for the same. I.e. queries like:

"SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'authd' AND \"rpm_package\".\"content_ptr_id\" IN ..
..
SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'mpitests-mvapich23-psm' AND \"rpm_package\".\"content_ptr_id\" IN
..
SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'mvapich2-2.0-doc' AND ..
..
SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'mvapich2-2.2-doc' AND ..
..
SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE (\"rpm_package\".\"name\" = 'mvapich2-2.2-psm-devel' AND ..


(sometimes there is a different psql query like:  SELECT \"core_content\".\"pulp_id\", \"core_content\".\"pulp_created\", \"core_content\".\"pulp_last_updated\", \"core_content\".\"pulp_type\", \"core_content\".\"upstream_id\", \"core_content\".\"timestamp_of_interest\", \"rpm_package\".\"content_ptr_id\", ..  but the above dominate)


postgresql logs show that an average such psql query (of either of the two types) takes 50 seconds(!!!). Now multiply it by the number of packages in the CV.

Comment 2 Pavel Moravec 2021-08-19 14:08:27 UTC
EXPLAIN ANALYZE of the "SELECT (1) AS \"a\" FROM \"rpm_package\" WHERE .. "

 Limit  (cost=3719.94..5444.00 rows=1 width=4) (actual time=50092.715..50092.726 rows=0 loops=1)
   ->  Nested Loop Semi Join  (cost=3719.94..5444.00 rows=1 width=4) (actual time=50092.714..50092.724 rows=0 loops=1)
         Join Filter: (rpm_package.content_ptr_id = v0.pulp_id)
         Rows Removed by Join Filter: 23436
         ->  Bitmap Heap Scan on rpm_package  (cost=4.50..44.07 rows=10 width=16) (actual time=0.069..0.144 rows=6 loops=1)
               Recheck Cond: ((name)::text = 'xorg-x11-apps'::text)
               Heap Blocks: exact=6
               ->  Bitmap Index Scan on rpm_package_name_epoch_version_relea_c9003ffa_uniq  (cost=0.00..4.50 rows=10 width=0) (actual time=0.057..0.058 rows=6 loops=1)
                     Index Cond: ((name)::text = 'xorg-x11-apps'::text)
         ->  Materialize  (cost=3715.44..5399.78 rows=1 width=32) (actual time=60.107..8347.616 rows=3906 loops=6)
               ->  Nested Loop  (cost=3715.44..5399.77 rows=1 width=32) (actual time=360.634..50073.121 rows=3906 loops=1)
                     ->  Nested Loop  (cost=3715.02..3715.39 rows=1 width=16) (actual time=359.339..874.968 rows=36969 loops=1)
                           ->  HashAggregate  (cost=3714.60..3714.61 rows=1 width=16) (actual time=359.307..401.529 rows=36969 loops=1)
                                 Group Key: u0.pulp_id
                                 ->  Nested Loop  (cost=986.92..3714.60 rows=1 width=16) (actual time=2.282..334.230 rows=36969 loops=1)
                                       Join Filter: (u0.version_added_id = u2.pulp_id)
                                       Rows Removed by Join Filter: 1072101
                                       ->  Hash Left Join  (cost=986.92..3711.87 rows=1 width=32) (actual time=2.270..18.856 rows=36969 loops=1)
                                             Hash Cond: (u0.version_removed_id = u3.pulp_id)
                                             Filter: ((u3.number > 1) OR (u3.number IS NULL))
                                             ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=984.36..3611.79 rows=37154 width=48) (actual time=2.205..9.221 rows=36969 loops=1)
                                                   Recheck Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                                   Heap Blocks: exact=494
                                                   ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..975.07 rows=37154 width=0) (actual time=2.145..2.145 ro
ws=36969 loops=1)
                                                         Index Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                             ->  Hash  (cost=1.69..1.69 rows=69 width=20) (actual time=0.048..0.050 rows=82 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                   ->  Seq Scan on core_repositoryversion u3  (cost=0.00..1.69 rows=69 width=20) (actual time=0.012..0.025 rows=82 loops=1)
                                       ->  Seq Scan on core_repositoryversion u2  (cost=0.00..1.86 rows=69 width=16) (actual time=0.002..0.006 rows=30 loops=36969)
                                             Filter: (number <= 1)
                           ->  Index Scan using core_repositorycontent_pkey on core_repositorycontent v1  (cost=0.42..0.79 rows=1 width=32) (actual time=0.011..0.011 rows=1 loops=36969)
                                 Index Cond: (pulp_id = u0.pulp_id)
                     ->  Index Only Scan using core_content_pkey on core_content v0  (cost=0.42..1683.52 rows=1 width=16) (actual time=0.193..0.193 rows=0 loops=36969)
                           Index Cond: ((pulp_id = v1.content_id) AND (pulp_id = ANY ('{001955cb-b57d-4f12-bf56-0bb44f9d423a,.......}'::uuid[])))
                           Heap Fetches: 3906
 Planning Time: 6.272 ms
 Execution Time: 50093.049 ms


EXPLAIN ANALYZE of the SELECT "core_content"."pulp_id", "core_content"."pulp_created", "core_content"."pulp_last_updated", "core_content"."pulp_type", "core_content"."upstream_id", "core_content"."timestamp_of_interest", "rpm_package"."content_ptr_id", "rpm_package"."name", "rpm_package"."epoch", "rpm_package"."version", "rpm_package"."release", "rpm_package"."arch", "rpm_package"."evr", "rpm_package"."pkgId", "rpm_package"."checksum_type", "rpm_package"."summary", "rpm_package"."description", "rpm_package"."url", "rpm_package"."changelogs", "rpm_package"."files", "rpm_package"."requires", "rpm_package"."provides", "rpm_package"."conflicts", "rpm_package"."obsoletes", "rpm_package"."suggests", "rpm_package"."enhances", "rpm_package"."recommends", "rpm_package"."supplements", "rpm_package"."location_base", "rpm_package"."location_href", "rpm_package"."rpm_buildhost", "rpm_package"."rpm_group", "rpm_package"."rpm_license", "rpm_package"."rpm_packager", "rpm_package"."rpm_sourcerpm", "rpm_package"."rpm_vendor", "rpm_package"."rpm_header_start", "rpm_package"."rpm_header_end", "rpm_package"."size_archive", "rpm_package"."size_installed", "rpm_package"."size_package", "rpm_package"."time_build", "rpm_package"."time_file", "rpm_package"."is_modular" FROM "rpm_package" INNER JOIN "core_content" ON ("rpm_package"."content_ptr_id" = "core_content"."pulp_id") WHERE ("rpm_package"."name" = 'gcc-objc' AND "rpm_package"."content_ptr_id" IN (SELECT V0."pulp_id" FROM "core_content" V0 INNER JOIN "core_repositorycontent" V1 ON (V0."pulp_id" = V1."content_id") WHERE (V1."pulp_id" IN (SELECT U0."pulp_id" FROM "core_repositorycontent" U0 INNER JOIN "core_repositoryversion" U2 ON (U0."version_added_id" = U2."pulp_id") LEFT OUTER JOIN "core_repositoryversion" U3 ON (U0."version_removed_id" = U3."pulp_id") WHERE (U0."repository_id" = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid AND U2."number" <= 1 AND NOT (U3."number" <= 1 AND U3."number" IS NOT NULL))) AND V0."pulp_id" IN ('001955cb-b57d-4f12-bf56-0bb44f9d423a'::uuid,....(huge_list) :


 Nested Loop  (cost=3717.79..5444.32 rows=1 width=1665) (actual time=49916.914..49934.622 rows=1 loops=1)
   Join Filter: (rpm_package.content_ptr_id = core_content.pulp_id)
   ->  Nested Loop Semi Join  (cost=3717.37..5441.43 rows=1 width=1629) (actual time=49916.883..49934.589 rows=1 loops=1)
         Join Filter: (rpm_package.content_ptr_id = v0.pulp_id)
         Rows Removed by Join Filter: 113144
         ->  Bitmap Heap Scan on rpm_package  (cost=4.50..44.07 rows=10 width=1597) (actual time=0.557..3.550 rows=29 loops=1)
               Recheck Cond: ((name)::text = 'gcc-objc'::text)
               Heap Blocks: exact=29
               ->  Bitmap Index Scan on rpm_package_name_epoch_version_relea_c9003ffa_uniq  (cost=0.00..4.50 rows=10 width=0) (actual time=0.485..0.486 rows=29 loops=1)
                     Index Cond: ((name)::text = 'gcc-objc'::text)
         ->  Materialize  (cost=3712.88..5397.22 rows=1 width=32) (actual time=3.528..1721.304 rows=3902 loops=29)
               ->  Nested Loop  (cost=3712.88..5397.21 rows=1 width=32) (actual time=102.279..49901.653 rows=3906 loops=1)
                     ->  Nested Loop  (cost=3712.46..3712.83 rows=1 width=16) (actual time=100.993..600.317 rows=36969 loops=1)
                           ->  HashAggregate  (cost=3712.04..3712.05 rows=1 width=16) (actual time=100.910..143.389 rows=36969 loops=1)
                                 Group Key: u0.pulp_id
                                 ->  Nested Loop  (cost=987.06..3712.03 rows=1 width=16) (actual time=5.008..78.188 rows=36969 loops=1)
                                       ->  Hash Left Join  (cost=986.92..3711.87 rows=1 width=32) (actual time=4.972..28.655 rows=36969 loops=1)
                                             Hash Cond: (u0.version_removed_id = u3.pulp_id)
                                             Filter: ((u3.number > 1) OR (u3.number IS NULL))
                                             ->  Bitmap Heap Scan on core_repositorycontent u0  (cost=984.36..3611.79 rows=37154 width=48) (actual time=4.871..19.649 rows=36969 loops=1)
                                                   Recheck Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                                   Heap Blocks: exact=494
                                                   ->  Bitmap Index Scan on core_repositorycontent_repository_id_1946e61b  (cost=0.00..975.07 rows=37154 width=0) (actual time=4.759..4.760 ro
ws=36969 loops=1)
                                                         Index Cond: (repository_id = '57de3423-d519-407b-aeea-5a807d8795d1'::uuid)
                                             ->  Hash  (cost=1.69..1.69 rows=69 width=20) (actual time=0.075..0.076 rows=82 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                   ->  Seq Scan on core_repositoryversion u3  (cost=0.00..1.69 rows=69 width=20) (actual time=0.023..0.034 rows=82 loops=1)
                                       ->  Index Scan using core_repositoryversion_pkey on core_repositoryversion u2  (cost=0.14..0.16 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops
=36969)
                                             Index Cond: (pulp_id = u0.version_added_id)
                                             Filter: (number <= 1)
                           ->  Index Scan using core_repositorycontent_pkey on core_repositorycontent v1  (cost=0.42..0.79 rows=1 width=32) (actual time=0.011..0.011 rows=1 loops=36969)
                                 Index Cond: (pulp_id = u0.pulp_id)
                     ->  Index Only Scan using core_content_pkey on core_content v0  (cost=0.42..1683.52 rows=1 width=16) (actual time=0.194..0.194 rows=0 loops=36969)
                           Index Cond: ((pulp_id = v1.content_id) AND (pulp_id = ANY ('{001955cb-b57d-4f12-bf56-0bb44f9d423a,001d1a44-6b78-4e33-a6d9-cbe208451b19,.......
.....,10578ec2-acef-4d11-a0d4-b17635343eea}'::uuid[])))
                           Heap Fetches: 3906
   ->  Index Scan using core_content_pkey on core_content  (cost=0.42..2.88 rows=1 width=68) (actual time=0.023..0.023 rows=1 loops=1)
         Index Cond: (pulp_id = v0.pulp_id)
 Planning Time: 22.829 ms
 Execution Time: 49935.637 ms

Comment 4 Grant Gainey 2021-09-02 18:38:26 UTC
The explain-plans above happen when, during copy-with-dependencies, we're getting "all the packages in the source and destination repositories". There is no doubt that we need these to be better - but they're only called once per advanced-copy invocation, and are completely swamped (in my experience, anyway) in large copy-requests by the 800-1100msec-per-call to run_solver_jobs(), which is invoked **once per content-unit** being copied.

Current investigation shows postgres being 100%, for 2-3 minutes, at the start of a copy-request. After that, it's 100% in python in pulpcore-worker until completion; the 10K-unit-chunk-size in Satellite can take 2-3 **hours** in python.

Comment 5 pulp-infra@redhat.com 2021-09-10 16:11:15 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 6 pulp-infra@redhat.com 2021-09-10 16:11:17 UTC
The Pulp upstream bug priority is at Normal. Updating the external tracker on this bug.

Comment 10 pulp-infra@redhat.com 2021-10-11 20:09:03 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 11 pulp-infra@redhat.com 2021-10-11 20:09:05 UTC
The Pulp upstream bug priority is at High. Updating the external tracker on this bug.

Comment 13 pulp-infra@redhat.com 2021-12-22 16:13:12 UTC
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 14 pulp-infra@redhat.com 2021-12-22 17:18:57 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 16 Pavel Moravec 2022-03-03 21:28:08 UTC
I do feel this can be closed as dup of https://bugzilla.redhat.com/show_bug.cgi?id=2003764 or is there something more required here?

The performance improvement per https://bugzilla.redhat.com/show_bug.cgi?id=2003764 sounds sufficient to me, though that BZ addressed a different issue (with similar improvement outcome).

Comment 17 Grant Gainey 2022-03-03 21:34:22 UTC
I'm good w/ closing this as a dup of '3764. There may be some suboptimal sql in here, but a) it's kind of necessary for what the workflow is trying to accomplish, and b) as noted is swamped by the bulk of the work in that workflow - optimizing it to *zero* would not have a noticeable effect on large publishes.

Pavel, you're the opener - if you're good w/ close-as-dup, who am I to disagree? :)

*** This bug has been marked as a duplicate of bug 2003764 ***

Comment 18 pulp-infra@redhat.com 2022-05-16 16:21:24 UTC
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.


Note You need to log in before you can comment on or make changes to this bug.