The RHEL 8 gz-compressed repodata is simply too big: over 300MB. It is possible to compress it to about 14MB using zstd -15 --long. The --long option enables long distance matching. When I looked closer, I realized that createrepo_c always produces repodata with sorted RPMs, while the RHEL repodata do not have the RPMs sorted. As a result, similar metadata chunks are not next to each other, which significantly decreases the compression ratio when using zstd (sorting has only a minimal, but still positive, impact on gz). I propose that Pulp should switch to the createrepo_c behavior and always sort RPMs in repodata.
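To illustrate the mechanism only, here is a minimal sketch on synthetic data. It assumes 200 hypothetical source packages, each producing three binary RPMs that share one changelog, and it uses Python's standard `zlib` as a stand-in for gz rather than zstd. The package names and XML layout are made up, and the size difference it prints is not representative of real RHEL repodata.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable


def fake_changelog(length=2000):
    # Random hex text stands in for a changelog unique to one source package.
    return "".join(rng.choice("0123456789abcdef") for _ in range(length))


# Hypothetical data: each source package yields three binary RPMs
# that all carry the same changelog.
entries = []
for i in range(200):
    changelog = fake_changelog()
    for suffix in ("", "-devel", "-libs"):
        name = f"pkg{i:03d}{suffix}"
        xml = f"<package><name>{name}</name><changelog>{changelog}</changelog></package>\n"
        entries.append((name, xml))


def compressed_size(items):
    # Concatenate the per-package XML and compress it as one stream.
    payload = "".join(xml for _, xml in items).encode()
    return len(zlib.compress(payload, 9))


shuffled = entries[:]
rng.shuffle(shuffled)  # related packages end up far apart
sorted_entries = sorted(entries, key=lambda e: e[0])  # related packages are adjacent

shuffled_size = compressed_size(shuffled)
sorted_size = compressed_size(sorted_entries)
print("shuffled:", shuffled_size, "sorted:", sorted_size)
```

In this toy setup the sorted stream is smaller because the repeated changelog falls inside the compressor's match window. How much real repodata gains depends on entry sizes and on the compressor's window.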
@dmach Did you mean to file this as an upstream Pulp 2 behavior so that Red Hat can host RHEL8 content in this manner on the CDN/EXD or were you specifically looking for Satellite to behave this way?
(In reply to Robin Chan from comment #1)
> @dmach Did you mean to file this as an upstream Pulp 2 behavior
> so that Red Hat can host RHEL8 content in this manner on the CDN/EXD or were
> you specifically looking for Satellite to behave this way?
I'd say both. Tanya already reached out to me yesterday saying basically the same as you. I'll need to follow up with the CDN/EXD people to make the change on their end as well. I was under the impression that they had migrated their Pulp to a newer version, but that turned out to be a false assumption...
Would it be a true statement that Satellite customers would not benefit from this much if CDN/EXD changes are not made? We can link it to a JIRA ticket on their end to consider if they plan to do this work.
I would say that's probably not true. Satellite customers would download metadata from Satellite much more frequently than they download from the CDN, and as long as they aren't explicitly mirroring the CDN's metadata, they would benefit from Pulp making this change. Customers that are just mirroring the CDN explicitly, without doing any management of content, would be dependent on the EXD implementing this to see a benefit. At least, I believe so. It's been a long time since I did significant work with Pulp 2. Speculatively, I think we would just need to lexicographically sort the RPM queryset here to implement this: [0]. See caveat above.
[0] https://github.com/pulp/pulp_rpm/blob/2-master/plugins/pulp_rpm/plugins/distributors/yum/publish.py#L795
Pulp 3 uses createrepo_c for publishing metadata so we probably don't need to change anything there.
It would be great to know which Sat release this request would land in. If it lands in a Sat release that uses Pulp 2, then an upstream Pulp issue should be filed and tracked.
Update: createrepo_c "the library" doesn't natively do any sorting, just the CLI tool. So Pulp 3 has the same unsorted packages issue, and we will need to address that intentionally.
Would it make sense for the library itself to do that sorting?
For the APIs in question, no. There's no repository abstraction so the way to write the XML is to create the metadata file objects and "add_pkg()" to them, which is a very thin abstraction around just writing and appending XML strings. No room to do automatic sorting. https://github.com/pulp/pulp_rpm/blob/master/pulp_rpm/app/tasks/publishing.py#L463-L470 We just need to sort the packages as they come out of the database.
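A minimal sketch of that idea, assuming packages arrive as plain dicts. `XmlFileSketch` is a hypothetical stand-in for an append-only metadata file object, not the createrepo_c class, and `nevra_key` and `publish` are illustrative names. The key sorts strings lexicographically, which gives a stable order but is not RPM version comparison.

```python
import io


class XmlFileSketch:
    """Hypothetical stand-in for a metadata file object: add_pkg() only appends."""

    def __init__(self):
        self._buf = io.StringIO()
        self._buf.write("<metadata>\n")

    def add_pkg(self, pkg):
        # Each call writes immediately, so the writer cannot reorder anything later.
        self._buf.write(f'  <package name="{pkg["name"]}" arch="{pkg["arch"]}"/>\n')

    def close(self):
        self._buf.write("</metadata>\n")
        return self._buf.getvalue()


def nevra_key(pkg):
    # Lexicographic sort key; gives a deterministic order, not RPM version order.
    return (pkg["name"], pkg["epoch"], pkg["version"], pkg["release"], pkg["arch"])


def publish(packages):
    writer = XmlFileSketch()
    # The caller sorts, because the writer is append-only.
    for pkg in sorted(packages, key=nevra_key):
        writer.add_pkg(pkg)
    return writer.close()


packages = [
    {"name": "zlib", "epoch": "0", "version": "1.2.11", "release": "1", "arch": "x86_64"},
    {"name": "bash", "epoch": "0", "version": "5.1", "release": "2", "arch": "x86_64"},
    {"name": "curl", "epoch": "0", "version": "7.76", "release": "1", "arch": "x86_64"},
]
print(publish(packages))
```

In a real publish task the ordering would more likely be requested from the database query itself, which avoids holding every package in memory.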
*No repository abstraction in createrepo_c
This RFE is not on the roadmap in the short term; we will re-evaluate it in a few months.
The Pulp upstream bug status is at NEW. Updating the external tracker on this bug.
The Pulp upstream bug priority is at Low. Updating the external tracker on this bug.
The Pulp upstream bug status is at CLOSED - DUPLICATE. Updating the external tracker on this bug.
All upstream Pulp bugs are at MODIFIED+. Moving this bug to POST.
I suggest this be closed. The huge metadata problem needs to be dealt with at the source: https://issues.redhat.com/browse/RHELDST-11212 The compression factor is mostly a roundabout way of addressing the above issue. Pulp 3 / Satellite 6.10 does a better job of handling this (by truncating the changelog list to 10 for newly published repos), and while ordering the packages would still help, it wouldn't help by much.
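The changelog truncation mentioned above could look roughly like the sketch below. It assumes each entry is a hypothetical `(timestamp, author, text)` tuple and that the newest entries are the ones kept; the comment only states that the list is cut to 10, so both details are assumptions.

```python
def truncate_changelogs(entries, limit=10):
    # Assumption: keep the `limit` newest entries, ordered newest first.
    newest_first = sorted(entries, key=lambda entry: entry[0], reverse=True)
    return newest_first[:limit]


# Hypothetical package history with 25 changelog entries.
history = [(1600000000 + day * 86400, "packager@example.com", f"- change {day}") for day in range(25)]
kept = truncate_changelogs(history)
print(len(history), "->", len(kept))
```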
We will look to our internal Red Hat release pipeline to address this issue (RHELDST issue linked in comment 21) and live with the behavior added in Sat 6.10/Pulp 3.
The Pulp upstream bug status is at closed. Updating the external tracker on this bug.
While this issue isn't really valid as-written, there are other benefits to doing this, so it has been done. The compression benefit is marginal (single-digit percentage) but because the RPMs can be expected to be in the same order every time, the resulting metadata is the same across publishes of the same repository version, which means the artifacts can be deduplicated. Additionally it means the differences between metadata can more easily be discerned by diffing tools, which is a support benefit.
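A small sketch of both benefits, assuming metadata is rendered from a list of hypothetical NEVRA strings and that artifacts are identified by a SHA-256 digest. `render` and `digest` are illustrative helpers, not Pulp functions.

```python
import difflib
import hashlib


def render(packages, sort=True):
    # Render one line per package, optionally in sorted order.
    rows = sorted(packages) if sort else list(packages)
    return "".join(f"<package>{nevra}</package>\n" for nevra in rows)


def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()


# Same repository content, returned by the database in two different orders.
first_publish = ["zlib-1.2.11-1.x86_64", "bash-5.1-2.x86_64", "curl-7.76-1.x86_64"]
second_publish = ["curl-7.76-1.x86_64", "zlib-1.2.11-1.x86_64", "bash-5.1-2.x86_64"]

same_when_sorted = digest(render(first_publish)) == digest(render(second_publish))
same_when_unsorted = (
    digest(render(first_publish, sort=False)) == digest(render(second_publish, sort=False))
)

# A later repository version that upgrades one package.
upgraded = ["bash-5.1-2.x86_64", "curl-7.77-1.x86_64", "zlib-1.2.11-1.x86_64"]
diff = list(
    difflib.unified_diff(
        render(first_publish).splitlines(),
        render(upgraded).splitlines(),
        lineterm="",
        n=0,
    )
)
# Keep only the added and removed lines, skipping the file header lines.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(same_when_sorted, same_when_unsorted)
print("\n".join(changed))
```

With sorted output the two publishes hash identically, so a content-addressed store would keep a single copy, and the diff between versions shows only the package that actually changed.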
Steps to retest:
1. Enable and sync a few RHEL repos
2. Create a CV, add the repos, and publish
3. Verify that the publish is successful
4. Register a client to the Satellite
5. Install a package from the repos in step 1

Expected results:
3. Publish should be successful
5. Packages should install successfully

Actual results:
3. Publish is successful
5. Packages install successfully

Verified on 6.12 snap 6.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.12 Release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:8506