Bug 1925165 - [RFE] Unordered RPMs in repodata decrease compression efficiency
Summary: [RFE] Unordered RPMs in repodata decrease compression efficiency
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Pulp
Version: Unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: 6.12.0
Assignee: satellite6-bugs
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-04 14:09 UTC by Daniel Mach
Modified: 2022-11-16 13:32 UTC

Fixed In Version: pulp_rpm-3.17.6-1, pulp_rpm-3.14.17-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2116565 (view as bug list)
Environment:
Last Closed: 2022-11-16 13:32:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Private | Priority | Status | Summary | Last Updated
Github pulp pulp_rpm issues 2274 | 0 | None | closed | Published RPM metadata isn't sorted properly | 2022-06-16 13:39:09 UTC
Red Hat Product Errata RHSA-2022:8506 | 0 | None | None | None | 2022-11-16 13:32:36 UTC

Description Daniel Mach 2021-02-04 14:09:50 UTC
The RHEL 8 gz-compressed repodata is simply too big: over 300MB.
It is possible to compress it to about 14MB using zstd -15 --long.
The --long option enables long distance matching.

When I looked closer, I realized that createrepo_c always produces repodata with sorted RPMs, whereas the RHEL repodata does not have the RPMs sorted. As a result, similar metadata chunks are not next to each other, which significantly decreases the compression ratio when using zstd (sorting has only a minimal, but still positive, impact on gz).

I propose that Pulp should switch to createrepo_c behavior and always sort RPMs in repodata.
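
To illustrate why record order can matter to a compressor with a bounded match window, here is a minimal Python sketch on synthetic data. It uses zlib from the standard library as a stand-in for gz. The package names, the record layout, and the `fake_record` helper are all hypothetical and are not real repodata. The records are built to be highly redundant per package, so the gap it prints is far larger than anything to expect on real metadata, and it does not reproduce the gz or zstd figures quoted above.

```python
import hashlib
import random
import zlib


def fake_record(name: str, version: int) -> bytes:
    """Build a synthetic metadata record whose body repeats across versions of a package."""
    # Chained hashes give text that is unique per package name and has no internal repeats.
    body = "".join(hashlib.sha256(f"{name}-{i}".encode()).hexdigest() for i in range(30))
    return f"<package name='{name}' ver='{version}'>{body}</package>\n".encode()


# 200 hypothetical packages with 5 versions each, keyed by (name, version).
records = {
    (f"pkg{n:03d}", v): fake_record(f"pkg{n:03d}", v)
    for n in range(200)
    for v in range(5)
}

sorted_keys = sorted(records)
shuffled_keys = list(records)
random.Random(0).shuffle(shuffled_keys)  # fixed seed so the run is repeatable

# Same records, two orders: similar records adjacent vs. scattered.
sorted_size = len(zlib.compress(b"".join(records[k] for k in sorted_keys), 9))
shuffled_size = len(zlib.compress(b"".join(records[k] for k in shuffled_keys), 9))

print(f"sorted:   {sorted_size} bytes")
print(f"shuffled: {shuffled_size} bytes")
```

In the sorted stream, the five versions of a package sit within a few kilobytes of each other, so the compressor can reference the earlier copy. In the shuffled stream, most of them fall outside the window.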

Comment 1 Robin Chan 2021-02-04 20:50:55 UTC
@dmach Did you mean to file this as an upstream Pulp 2 behavior so that Red Hat can host RHEL8 content in this manner on the CDN/EXD or were you specifically looking for Satellite to behave this way?

Comment 2 Daniel Mach 2021-02-05 09:40:29 UTC
(In reply to Robin Chan from comment #1)
> @dmach Did you mean to file this as an upstream Pulp 2 behavior
> so that Red Hat can host RHEL8 content in this manner on the CDN/EXD or were
> you specifically looking for Satellite to behave this way?

I'd say both.
Tanya already reached out to me yesterday saying basically the same thing as you. I'll need to follow up with the CDN/EXD people to make the change on their end as well. I was under the impression that they had migrated their Pulp to a newer version, but that turned out to be a false assumption...

Comment 3 Robin Chan 2021-02-05 16:09:19 UTC
Would it be a true statement that Satellite customers would not benefit from this much if CDN/EXD changes are not made? We can link it to a JIRA ticket on their end to consider if they plan to do this work.

Comment 4 Daniel Alley 2021-02-05 20:47:12 UTC
I would say that's probably not true. Satellite customers would download metadata from Satellite much more frequently than they download from the CDN, and as long as they aren't explicitly mirroring the CDN's metadata, they would benefit from Pulp making this change.

Customers that are just mirroring the CDN explicitly, without doing any management of content, would be dependent on the EXD implementing this to see a benefit.

At least, I believe so. It's been a long time since I did significant work with Pulp 2.

Speculatively, I think we would just need to lexicographically sort the RPM queryset here to implement this: [0]. See caveat above.

[0] https://github.com/pulp/pulp_rpm/blob/2-master/plugins/pulp_rpm/plugins/distributors/yum/publish.py#L795

Comment 5 Daniel Alley 2021-02-05 20:49:19 UTC
Pulp 3 uses createrepo_c for publishing metadata so we probably don't need to change anything there.

Comment 6 Ina Panova 2021-02-09 12:14:04 UTC
It would be great to know which Sat release this request would land in. If it lands in a Sat release that uses Pulp 2, then an upstream Pulp issue should be filed and tracked.

Comment 10 Daniel Alley 2021-05-07 18:15:43 UTC
Update: createrepo_c "the library" doesn't natively do any sorting; only the CLI tool does. So Pulp 3 has the same unsorted packages issue, and we will need to address that intentionally.

Comment 11 Neal Gompa 2021-05-08 14:21:37 UTC
Would it make sense for the library itself to do that sorting?

Comment 12 Daniel Alley 2021-05-08 14:24:27 UTC
For the APIs in question, no.  There's no repository abstraction so the way to write the XML is to create the metadata file objects and "add_pkg()" to them, which is a very thin abstraction around just writing and appending XML strings.  No room to do automatic sorting.

https://github.com/pulp/pulp_rpm/blob/master/pulp_rpm/app/tasks/publishing.py#L463-L470

We just need to sort the packages as they come out of the database.
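
As an illustration of the idea (not the actual pulp_rpm change), the following Python sketch sorts package records before they would be handed to a metadata writer. The `Package` class, `sort_key`, and `publish_order` are hypothetical names, and the sample entries are invented. A plain lexicographic key is assumed here; it gives a stable order but is not RPM version comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    # Hypothetical stand-in for a package row; not the pulp_rpm model.
    name: str
    epoch: str
    version: str
    release: str
    arch: str


def sort_key(pkg: Package) -> tuple:
    """Plain lexicographic key over the identifying fields."""
    return (pkg.name, pkg.epoch, pkg.version, pkg.release, pkg.arch)


def publish_order(packages):
    """Return packages in a stable order, whatever order the database returned them in."""
    return sorted(packages, key=sort_key)


# Invented sample rows, deliberately out of order.
from_db = [
    Package("zlib", "0", "1.2.11", "17.el8", "x86_64"),
    Package("bash", "0", "4.4.20", "1.el8", "x86_64"),
    Package("zlib", "0", "1.2.11", "17.el8", "i686"),
]

ordered = publish_order(from_db)
for pkg in ordered:
    print(pkg.name, pkg.arch)
```

With a real database, the same ordering would more likely be requested in the query itself rather than done in memory; the in-memory sort is used here only to keep the sketch self-contained.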

Comment 13 Daniel Alley 2021-05-08 14:25:00 UTC
*No repository abstraction in createrepo_c

Comment 14 Tanya Tereshchenko 2021-06-07 16:26:34 UTC
This RFE is not on the roadmap in the short term, we will re-evaluate it in a few months.

Comment 15 pulp-infra@redhat.com 2021-06-07 17:03:48 UTC
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.

Comment 16 pulp-infra@redhat.com 2021-06-07 17:03:49 UTC
The Pulp upstream bug priority is at Low. Updating the external tracker on this bug.

Comment 17 Tanya Tereshchenko 2021-10-30 15:22:33 UTC
This RFE is not on the roadmap in the short term, we will re-evaluate it in a few months.

Comment 18 pulp-infra@redhat.com 2021-12-22 16:14:05 UTC
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.

Comment 19 pulp-infra@redhat.com 2021-12-22 17:20:01 UTC
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.

Comment 21 Daniel Alley 2022-06-07 23:50:23 UTC
I suggest this be closed. The huge metadata problem needs to be dealt with at the source: https://issues.redhat.com/browse/RHELDST-11212

The compression factor is mostly a roundabout way of addressing the above issue. Pulp 3 / Satellite 6.10 does a better job of handling this (by truncating the changelog list to 10 for newly published repos), and while ordering them would still help, it wouldn't help by much.

Comment 22 Robin Chan 2022-06-08 18:22:11 UTC
We will look to our internal Red Hat release pipeline to address this issue (RHELDST issue linked in comment 21) and live with the behavior added in Sat 6.10/Pulp 3.

Comment 23 pulp-infra@redhat.com 2022-06-16 13:39:10 UTC
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.

Comment 24 Daniel Alley 2022-07-29 16:54:34 UTC
While this issue isn't really valid as written, there are other benefits to doing this, so it has been done. The compression benefit is marginal (single-digit percentage), but because the RPMs can be expected to be in the same order every time, the resulting metadata is the same across publishes of the same repository version, which means the artifacts can be deduplicated. Additionally, it means the differences between metadata can more easily be discerned by diffing tools, which is a support benefit.
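
The deduplication point can be sketched in Python as follows. This is a toy model, not Pulp code: `render_metadata` and `checksum` are hypothetical helpers, the package strings are invented, and SHA-256 is assumed as the content checksum. If the writer always sorts, two publishes that receive the same packages in different orders produce byte-identical output and therefore the same checksum.

```python
import hashlib


def render_metadata(package_ids, sort=True):
    """Render a toy metadata document, one line per package."""
    ids = sorted(package_ids) if sort else list(package_ids)
    return "".join(f"<package>{pid}</package>\n" for pid in ids).encode()


def checksum(data: bytes) -> str:
    """Content checksum of the rendered metadata (SHA-256 assumed for illustration)."""
    return hashlib.sha256(data).hexdigest()


first_publish = ["zlib-1.2.11-17.el8.x86_64", "bash-4.4.20-1.el8.x86_64"]
second_publish = list(reversed(first_publish))  # same content, different database order

# Sorted output is identical across publishes, so the artifact can be stored once.
same_when_sorted = (
    checksum(render_metadata(first_publish)) == checksum(render_metadata(second_publish))
)
# Unsorted output depends on input order, so the checksums differ.
same_when_unsorted = (
    checksum(render_metadata(first_publish, sort=False))
    == checksum(render_metadata(second_publish, sort=False))
)

print(same_when_sorted, same_when_unsorted)
```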

Comment 25 Lai 2022-08-15 16:08:15 UTC
Steps to retest:

1. Enable and sync a few RHEL repos
2. Create a CV, add the repos, and publish
3. Verify that the publish is successful
4. Register a client to the Satellite
5. Install a package from the repos in step 1

Expected results:
3. Publish should be successful
5. Packages should install successfully

Actual results:
3. Publish is successful
5. Packages install successfully

Verified on 6.12 snap 6.1

Comment 30 errata-xmlrpc 2022-11-16 13:32:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Satellite 6.12 Release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8506

