Bug 2251200 - getting versions of an ansible collection does not scale [NEEDINFO]
Summary: getting versions of an ansible collection does not scale
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: 6.13.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: 6.15.0
Assignee: satellite6-bugs
QA Contact: Gaurav Talreja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-23 12:17 UTC by Pavel Moravec
Modified: 2024-04-23 17:15 UTC
CC List: 15 users

Fixed In Version: pulp-ansible-0.18.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-04-23 17:15:52 UTC
Target Upstream Version:
Embargoed:
lvrtelov: needinfo? (ikaur)




Links
GitHub pulp/pulp_ansible issue 1410 (open): refactor version sorting in CollectionVersionViewSet.list() - last updated 2023-11-23 12:50:56 UTC
Red Hat Issue Tracker SAT-21473 - last updated 2023-11-23 12:19:57 UTC
Red Hat Product Errata RHSA-2024:2010 - last updated 2024-04-23 17:15:54 UTC

Description Pavel Moravec 2023-11-23 12:17:37 UTC
Description of problem:
When one synchronizes the whole https://galaxy.ansible.com/api/ to Satellite, Satellite (Katello and Pulp) hits several scalability and performance issues:

1) The Actions::Katello::Repository::IndexContent dynflow step often fails during repo sync, either with a 502 response code or with a Faraday::ConnectionFailed EOFError (EOFError). This is partly a consequence of 2), but I guess also due to a lack of pagination, since the dynflow step issues this query:

Nov 23 09:22:22 pmoravec-sat614 pulpcore-api[2078]: pulp [6643347abff54c979e14631fe16d71ed]:  - - [23/Nov/2023:08:22:22 +0000] "GET /pulp/api/v3/content/ansible/collection_versions/?limit=2000&offset=0&repository_version=%2Fpulp%2Fapi%2Fv3%2Frepositories%2Fansible%2Fansible%2F2b92bb80-5770-4d3a-a02d-61cb30eeb5e2%2Fversions%2F1%2F HTTP/1.1" 200 1246516303 "-" "OpenAPI-Generator/0.16.1/ruby"

which ran for 8 minutes until pulpcore-api got signal 9. Note the response length: over 1 GB of data. Worth paginating it?

Also, the sidekiq process consumed 8.5 GB of memory during that, which is too much.


2) The underlying Pulp queries are VERY inefficient - I guess they lack some filters, so a lot of redundant data is processed. E.g.:

2a) The particular query from Katello: it always failed for me /o\ after it consumed 7 GB of memory.

2b) ansible-galaxy raises queries like:

/api/v3/collections/community/general/versions/?limit=100

and then ..../versions/8.0.2/ for each individual version. The "get me a list of versions" query is the slow one here, in particular:

  - /api/v3/collections/community/general/versions/?limit=100 :
    - ran 22 s, pulp gunicorn process consumed 3091340 kB RSS
  - /api/v3/collections/community/general/versions/?limit=1 :
    - 22 s, 3092328 kB RSS
  - /api/v3/collections/community/general/versions/?limit=1&ordering=is_highest :
    - 22 s, 3186600 kB RSS

Curiously, querying:
  - /api/v3/collections/community/general/versions/?limit=1&is_highest=true :
    - 0.3s, no memory increase spotted!

Checking where Pulp spends most of its time: it is inside queryset = sorted(..).

/usr/lib/python3.9/site-packages/pulp_ansible/app/galaxy/v3/views.py :

    def list(self, request, *args, **kwargs):
        """
        Returns paginated CollectionVersions list.
        """
        queryset = self.filter_queryset(self.get_queryset())
        queryset = sorted(
            queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
        )

The "queryset = self.filter_queryset(self.get_queryset())" call takes 1-2 seconds, while the queryset = sorted(..) call takes 20 seconds, even for "limit=1" or "limit=1&ordering=is_highest". Yet the "limit=1&is_highest=true" query, which goes through the same code, is pretty fast.
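
A minimal sketch of why the two code paths differ so much - this is an illustration of the behaviour, not the actual Pulp code, and it assumes the is_highest field exercised by the "is_highest=true" filter above:

    import semantic_version

    def slow_page(queryset, limit):
        # sorted() evaluates the *entire* queryset: every (huge) CollectionVersion
        # row is fetched and instantiated before the first "limit" items are sliced off.
        ordered = sorted(
            queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
        )
        return ordered[:limit]

    def fast_page(queryset, limit):
        # Filtering in the database (as "?is_highest=true" does) lets PostgreSQL apply
        # the WHERE clause and LIMIT, so only a handful of rows ever reach Python.
        return list(queryset.filter(is_highest=True)[:limit])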


I understand there is usually no need to sync all collections and that we should use Requirements to filter the collections of interest, but then the documentation (https://access.redhat.com/documentation/en-us/red_hat_satellite/6.13/html-single/managing_configurations_using_ansible_integration_in_red_hat_satellite/index#synchronizing-ansible-collections_ansible) should state that we support only filtered content.

But I would rather see some improvement: why does getting 139 entries take 20 seconds and consume 3 GB of memory?


Version-Release number of selected component (if applicable):
Sat 6.13 (also in 6.14)


How reproducible:
100%


Steps to Reproduce:
1. Sync repo of type ansible, with upstream URL https://galaxy.ansible.com/api/ and monitor sidekiq + pulp's gunicorn memory usage.
2. Try to list versions of some collection, like:

time curl -L -k 'https://localhost/pulp_ansible/galaxy/ORGANIZATION/Library/custom/PRODUCT/REPOSITORY/api/v3/collections/community/general/versions/?limit=100&offset=0' > galaxy_whole.versions.100.json

and use other URIs from above.

3. Monitor memory usage of pulp's gunicorn processes (a sampling sketch follows below).
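
A hedged helper for step 3: it periodically samples the total RSS of the gunicorn workers via /proc. The "gunicorn" substring match on the command line is an assumption about the process name; adjust as needed.

    import glob
    import time

    def gunicorn_rss_kb():
        """Sum VmRSS (in kB) of all processes whose command line contains 'gunicorn'."""
        total = 0
        for status_path in glob.glob("/proc/[0-9]*/status"):
            proc_dir = status_path.rsplit("/", 1)[0]
            try:
                if "gunicorn" not in open(proc_dir + "/cmdline").read():
                    continue
                for line in open(status_path):
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])  # value is reported in kB
            except OSError:
                continue  # process exited while we were reading it
        return total

    while True:
        print(time.strftime("%H:%M:%S"), gunicorn_rss_kb(), "kB")
        time.sleep(10)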


Actual results:
1. sidekiq consumes 8.5 GB of memory, pulp consumes 7 GB of memory, and the sync often/always fails while indexing Katello content.
2. times and memory usage for given URIs:

  - /api/v3/collections/community/general/versions/?limit=100 :
    - 22 s, 3091340 kB RSS
  - /api/v3/collections/community/general/versions/?limit=1 :
    - 22 s, 3092328 kB RSS
  - /api/v3/collections/community/general/versions/?limit=1&ordering=is_highest :
    - 22 s, 3186600 kB RSS
  - /api/v3/collections/community/general/versions/?limit=1&is_highest=true :
    - 0.3 s, no memory increase spotted!
(only this last one is fast)


Expected results:
Several times faster requests, less memory usage.


Additional info:
Does the "limit" parameter work correctly for this query? The sub-URIs:

/versions/?limit=100
/versions/?limit=200
/versions/?limit=200&offset=0

all return:

{
    "meta": {
        "count": 139
    },
    "links": {
        "first": "/pulp_ansible/galaxy/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/api/v3/plugin/ansible/content/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/collections/index/community/general/versions/?limit=100&offset=0",
        "previous": null,
        "next": "/pulp_ansible/galaxy/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/api/v3/plugin/ansible/content/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/collections/index/community/general/versions/?limit=100&offset=100",
        "last": "/pulp_ansible/galaxy/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/api/v3/plugin/ansible/content/RedHat/Library/custom/Ansible_Galaxy/galaxy_whole/collections/index/community/general/versions/?limit=100&offset=39"
    },

followed by *100* items of data, not the 200 that limit=200 requests.
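
Since "limit" appears to be capped at 100 items per page, here is a client-side sketch that walks the relative "links.next" URLs shown above instead of relying on a large limit (host and TLS handling are illustrative, matching the curl -k calls above):

    import requests

    BASE = "https://localhost"
    START = ("/pulp_ansible/galaxy/ORGANIZATION/Library/custom/PRODUCT/REPOSITORY"
             "/api/v3/collections/community/general/versions/?limit=100&offset=0")

    def iter_versions(path):
        # Follow the paginated galaxy v3 response (meta / links / data) page by page.
        while path:
            page = requests.get(BASE + path, verify=False).json()
            yield from page["data"]
            path = page["links"]["next"]  # None on the last page

    print(sum(1 for _ in iter_versions(START)))  # should match meta["count"], e.g. 139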

Comment 1 Pavel Moravec 2023-11-23 12:50:57 UTC
This seems very relevant: https://github.com/pulp/pulp_ansible/issues/1410 ; maybe https://github.com/pulp/pulp_ansible/pull/1476 provides sufficient improvement.

Comment 2 hakon.gislason 2023-11-23 13:02:02 UTC
>I understand there is usually no need to sync all collections and we should use Requirements to filter collections of interest, but then state in documentation (https://access.redhat.com/documentation/en-us/red_hat_satellite/6.13/html-single/managing_configurations_using_ansible_integration_in_red_hat_satellite/index#synchronizing-ansible-collections_ansible) we support only filtered content.

Usually, maybe.

In our case, synchronizing individual collections and/or versions is not an option. We can't maintain a list of collections users want to use or expect users to create a ticket asking for collection.xyz version 1.2.3 to be added to the mirror. Imagine doing that with RPM packages.

The majority of our systems have no internet access and rely entirely on Satellite (and other servers) for content like RPM packages, ansible collections, etc...

Even systems with internet access are set to prefer the internal mirror due to:
1) not wanting to hit rate-limits on galaxy.ansible.com
2) keeping content installation (rpm/collections/etc..) as fast as possible.

(think: CI/CD pipelines using containers that may execute multiple 'ansible-galaxy collection install' commands in each job for each stage of the pipeline, resulting in 10+ 'ansible-galaxy collection install community.general' runs per pipeline. The expectation is that this should take seconds, about as long as 'dnf install htop', not minutes.)

Our ansible lint/test/deploy pipelines took 10 minutes in total (2 minutes to lint, 8 minutes to test+deploy) before hitting this issue, but now take 45-60 minutes.

Comment 3 Pavel Moravec 2023-11-23 13:35:27 UTC
These bits from #1476 will *definitely* help in this bugzilla report:

https://github.com/pulp/pulp_ansible/blame/main/pulp_ansible/app/galaxy/v3/views.py#L943-L955 :

        queryset = queryset.only(
            "pk",
            "content_ptr_id",
            "marks",
            "namespace",
            "name",
            "version",
            "pulp_created",
            "pulp_last_updated",
            "requires_ansible",
            "collection",
        )

limiting the queryset to just the relevant fields. In the end, the API response contains only a few fields per item, not the whole `ansible_collectionversion` record (which is huge).

Comment 4 Pavel Moravec 2023-11-23 13:41:35 UTC
(In reply to Pavel Moravec from comment #3)
> These bits from #1476 will *definitely* help in this bugzilla report:
> 
> https://github.com/pulp/pulp_ansible/blame/main/pulp_ansible/app/galaxy/v3/
> views.py#L943-L955 :
> 
>         queryset = queryset.only(
>             "pk",
>             "content_ptr_id",
>             "marks",
>             "namespace",
>             "name",
>             "version",
>             "pulp_created",
>             "pulp_last_updated",
>             "requires_ansible",
>             "collection",
>         )
> 
> limiting the queryset to just some relevant fields. Since at the end, the
> API response contains a few fields per an item, not whole
> `ansible_collectionversion` record at all (which is huge).

Just plain adding this to /usr/lib/python3.9/site-packages/pulp_ansible/app/galaxy/v3/views.py :

    def list(self, request, *args, **kwargs):
        """
        Returns paginated CollectionVersions list.
        """
        import logging
        log = logging.getLogger(__name__)
        log.info(f"PavelM: collection list, args={args}")
        queryset = self.filter_queryset(self.get_queryset())
        queryset = queryset.only(
            "pk",
            "content_ptr_id",
            "marks",
            "namespace",
            "name",
            "version",
            "pulp_created",
            "pulp_last_updated",
            "requires_ansible",
            "collection",
        )
        log.info(f"PavelM: collection list, args={args} queryset={queryset}")
        queryset = sorted(
            queryset, key=lambda obj: semantic_version.Version(obj.version), reverse=True
        )
        log.info(f"PavelM: collection list, sorted")


helps drastically: from 22 s to 0.38 s, and from 3 GB of memory to no noticeable memory increase.

(This change alone could have a harmful impact on other requests, so I discourage patching a system with it; it is just an observation of where the culprit is.)

Comment 5 Ron Lavi 2023-11-23 15:12:29 UTC
 maybe this one is related: https://bugzilla.redhat.com/show_bug.cgi?id=2247864

Comment 7 Daniel Alley 2023-11-27 14:53:47 UTC
> maybe this one is related: https://bugzilla.redhat.com/show_bug.cgi?id=2247864

I don't see how that would be related

@gerrod Do you think that https://github.com/pulp/pulp_ansible/pull/1476/ covers this BZ fully?  Are there other recent optimization PRs that may also help?  And is it possible to backport them, if so?

Comment 8 Gerrod 2023-11-27 15:07:47 UTC
Yes, https://github.com/pulp/pulp_ansible/pull/1476/ covers the majority of this BZ. Maybe some extra optimizations on the pulp content API could cover the first part, about the Katello index failing (they could also use the `fields` option to limit which fields they are bringing in). What version would this need to be backported to? #1476 has a migration in it, so it is not fully backportable.
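
For illustration, this is roughly what using the `fields` option against the content API could look like. The `fields`, `limit` and `offset` query parameters are standard pulpcore ones; the host, credentials, field list and repository-version href below are placeholders, not values from this report:

    import requests

    PULP = "https://localhost/pulp/api/v3"
    params = {
        # placeholder href; use the real repository version from the sync task
        "repository_version": "/pulp/api/v3/repositories/ansible/ansible/<uuid>/versions/1/",
        # ask only for the fields the indexer actually needs, not the full record
        "fields": "namespace,name,version,pulp_href",
        "limit": 500,
        "offset": 0,
    }
    resp = requests.get(f"{PULP}/content/ansible/collection_versions/",
                        params=params, auth=("admin", "password"), verify=False)
    data = resp.json()
    print(data["count"], len(data["results"]))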

Comment 9 Daniel Alley 2023-11-27 15:15:53 UTC
It's reported against Satellite 6.13, so I believe the backport would be needed to pulp_ansible 0.15 and 0.16.

Pavel mentioned that cherry-picking a small portion of the patch helps some queries and hurts others - hopefully it is possible to get the speedup without doing any harm? I'm not sure, maybe that relies on the migration.

Comment 10 Pavel Moravec 2023-11-27 15:31:11 UTC
(In reply to Daniel Alley from comment #9)
> It's reported against Satellite 6.13, so I believe the backport would be
> needed to pulp_ansible 0.15 and 0.16.
> 
> Pavel mentioned that cherrypicking a small portion of the patch helps some
> queries and hurts others - hopefully it is possible to get the speedup
> without doing any harm? I'm not sure, maybe that relies on the migration.

Cherry-picking the "queryset = queryset.only( .." bit might hurt elsewhere if the current downstream code relies on fields outside the projection. I expect that is a false suspicion, but somebody more knowledgeable about the code should bless it - I saw the pulp_ansible code for the first time last week :).
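
For what it is worth, a hedged note on this failure mode: with Django, .only() does not make the other fields unavailable - accessing a deferred field triggers one extra query per object instead of an error, so code outside the projection keeps working but may get slower rather than break. Minimal sketch (the deferred field name is illustrative):

    from pulp_ansible.app.models import CollectionVersion

    cv = CollectionVersion.objects.only("namespace", "name", "version").first()
    cv.version       # in the projection: already loaded, no extra query
    cv.dependencies  # deferred: Django issues an additional SELECT for this object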

Comment 11 Daniel Alley 2023-11-27 16:41:55 UTC
I'm marking this as POST because it ought to be fixed in Stream / future-6.15 currently. However, we will likely still want to backport, and it might be a somewhat labor-intensive backport, as we can't backport the migration. So @pmoravec, I'm not sure if we want a separate BZ filed for that, or if we can just do standard clones?

Comment 12 Pavel Moravec 2023-11-28 08:04:48 UTC
(In reply to Daniel Alley from comment #11)
> I'm marking this as POST because it ought to be fixed in Stream /
> future-6.15 currently.  However we will likely still want to backport, and
> it might be a somewhat labor intensive backport as we can't backport the
> migration.  So @pmoravec I'm not sure if we want a separate BZ
> filed for that, or if we can just do standard clones?

I would vote for regular z-stream clones and a limited backport there, i.e. backporting just some part(s) of the big upstream fix, without the migration. Maybe backporting just https://bugzilla.redhat.com/show_bug.cgi?id=2251200#c3 is sufficient, *if* those bits are really safe.

Comment 17 Gaurav Talreja 2024-02-16 14:24:30 UTC
Verified.

Tested on Satellite 6.15.0 Snap 9.0
Version: rubygem-pulp_ansible_client-0.20.2-1.el8sat.noarch

Steps:
1. Sync repo of type ansible collection, with upstream URL https://galaxy.ansible.com/api/ or https://old-galaxy.ansible.com/api/
2. Monitor sidekiq + pulp's gunicorn memory usage using the ps command below and check monitor_ps.log:
# nohup bash -c 'while true; do date; ps aux | sort -nk6 | tail; sleep 10; done' > monitor_ps.log 2>&1 &

Observation:
To reproduce and verify the difference against 6.14.2: the sync task took 4+ hours on 6.14 and ~1 hour on 6.15. For memory usage, on 6.14 the sidekiq process took >~5 GB (5662096 kB) and the pulp gunicorn process took ~1.3 GB (1337144 kB), whereas on 6.15.0 the sidekiq process took ~0.78 GB (781636 kB) and pulp gunicorn was not captured in monitor_ps.log, presumably because of sort -nk6. This looks like an excellent optimization and reduction in memory usage.

I also tested the mentioned endpoints to check the response time of pulp_ansible with the fix, and they show a great reduction in response time:
# time curl -L -k 'https://localhost/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/collections/community/general/versions/?limit=100&offset=0'
{"meta":{"count":0},"links":{"first":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=100&offset=0","previous":null,"next":null,"last":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=100&offset=0"},"data":[]}
real	0m0.103s
user	0m0.011s
sys	0m0.005s
 
# time curl -L -k 'https://localhost/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/collections/community/general/versions/?limit=1'
{"meta":{"count":0},"links":{"first":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=1&offset=0","previous":null,"next":null,"last":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=1&offset=0"},"data":[]}
real	0m0.147s
user	0m0.007s
sys	0m0.009s
 
# time curl -L -k 'https://localhost/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/collections/community/general/versions/?limit=1&ordering=is_highest'
{"meta":{"count":0},"links":{"first":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=1&offset=0&ordering=is_highest","previous":null,"next":null,"last":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?limit=1&offset=0&ordering=is_highest"},"data":[]}
real	0m0.083s
user	0m0.010s
sys	0m0.007s
 
# time curl -L -k 'https://localhost/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/collections/community/general/versions/?limit=1&is_highest=true'
{"meta":{"count":0},"links":{"first":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?is_highest=true&limit=1&offset=0","previous":null,"next":null,"last":"/pulp_ansible/galaxy/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/api/v3/plugin/ansible/content/Default_Organization/Library/custom/test_ansible_product/test_ansible_repo/collections/index/community/general/versions/?is_highest=true&limit=1&offset=0"},"data":[]}
real	0m0.073s
user	0m0.010s
sys	0m0.

Also, the error mentioned by Imaan is related to the upstream issue https://issues.redhat.com/browse/AAH-2836, which suggests using old-galaxy (https://old-galaxy.ansible.com/api/) as a workaround, so we used old-galaxy for the above test.

Comment 20 errata-xmlrpc 2024-04-23 17:15:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.15.0 release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2010

