Red Hat Bugzilla – Bug 1255090
Kickstart Trees duplicate packages in /var/lib/pulp/content/rpm
Last modified: 2017-02-06 17:09:37 EST
Description of problem:
When downloading a kickstart tree with Sat6 together with the appropriate product repo, all packages in the kickstart tree appear as separate files in /var/lib/pulp/content/rpm, consuming additional space.
For example, after syncing RHEL7 7Server base and the 7.1 kickstart I end up with two copies of 389-ds-base-184.108.40.206-13.el7.x86_64.rpm
and the other in
Exactly the same file, two different inodes
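A minimal sketch of how the "same file, two different inodes" observation could be checked for any pair of paths. The function name is hypothetical, and it assumes bash with GNU coreutils (`sha256sum`, `stat -c`):

```shell
# same_content_different_inode A B
# Succeeds only when A and B have identical SHA-256 digests but are stored
# as separate files (different device:inode pairs, so not hard links).
same_content_different_inode() {
    local a=$1 b=$2
    local sum_a sum_b ino_a ino_b
    sum_a=$(sha256sum < "$a") || return 2
    sum_b=$(sha256sum < "$b") || return 2
    # -L dereferences symlinks so the check applies to the file actually read
    ino_a=$(stat -L -c '%d:%i' "$a") || return 2
    ino_b=$(stat -L -c '%d:%i' "$b") || return 2
    [ "$sum_a" = "$sum_b" ] && [ "$ino_a" != "$ino_b" ]
}
```

Called with the two copies of a package as arguments, an exit status of 0 would indicate a true on-disk duplicate rather than a hard link.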
Version-Release number of selected component (if applicable):
This problem is present in Sat6.0 and in Sat6.1.1
The duplicates appear in pulp/content/rpm on any system that has synced both a kickstart tree and the matching base repo.
Steps to Reproduce:
1. hammer repository-set enable --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --basearch='x86_64' --releasever='7Server' --name 'Red Hat Enterprise Linux 7 Server (RPMs)'
2. hammer repository-set enable --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --basearch='x86_64' --releasever='7.1' --name 'Red Hat Enterprise Linux 7 Server (Kickstart)'
3. hammer repository synchronize --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --name 'Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server'
4. wait until sync is finished
5. du -sh /var/lib/pulp/content/rpm
6. hammer repository synchronize --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --name 'Red Hat Enterprise Linux 7 Server Kickstart x86_64 7.1'
7. du -sh /var/lib/pulp/content/rpm
Actual results:
more than 3G additional space consumed
Expected results:
no additional space consumed since all packages in the kickstart tree for RHEL 7.1 are already present in 7Server
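The before/after `du` comparison in steps 5 to 7 could be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and it assumes bash with GNU `du` (`-b` reports apparent size in bytes):

```shell
# dir_bytes DIR: total apparent size of DIR in bytes (GNU du).
dir_bytes() {
    du -sb "$1" | cut -f1
}

# growth_bytes DIR CMD [ARGS...]
# Runs CMD and prints how many bytes DIR grew by while it ran.
growth_bytes() {
    local dir=$1; shift
    local before after
    before=$(dir_bytes "$dir") || return 1
    # Send the command's own output to stderr so stdout carries only the number
    "$@" >&2 || return 1
    after=$(dir_bytes "$dir") || return 1
    echo $(( after - before ))
}

# Example (not run here), using the kickstart sync from step 6:
#   growth_bytes /var/lib/pulp/content/rpm hammer repository synchronize \
#       --organization "$ORG" --product 'Red Hat Enterprise Linux Server' \
#       --name 'Red Hat Enterprise Linux 7 Server Kickstart x86_64 7.1'
```

A result near zero would match the expected behavior; a result of several gigabytes would match the reported one.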
With the current 6.2 beta snap this problem remains unresolved.
find /var/lib/pulp/content/units/rpm/ -name bash-4.2.46-19\*
sha256sum /var/lib/pulp/content/units/rpm/b5/701df8aa2a86d9884c8c946a55593fea073e6a9575207556378752dc768055/bash-4.2.46-19.el7.x86_64.rpm /var/lib/pulp/content/units/rpm/1a/53b1410864e0b5fd1bc2d71409c8e11c0c03fac1d0b65b30416eddeaa4f512/bash-4.2.46-19.el7.x86_64.rpm
So the cryptic directory name apparently does not reflect the checksum anymore.
We still have the complete content of the kickstart tree duplicated in the pulp space.
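To see the extent of the duplication rather than checking one package at a time, the `find` and `sha256sum` commands above can be combined into a scan of the whole tree. A sketch, assuming bash with GNU findutils and coreutils; the function name is hypothetical:

```shell
# list_duplicate_rpms DIR
# Prints groups of *.rpm files under DIR that share a SHA-256 digest,
# one "digest  path" line per file, with a blank line between groups.
list_duplicate_rpms() {
    find "$1" -type f -name '*.rpm' -exec sha256sum {} + |
        sort |
        # Compare only the 64 hex characters of the digest, keep repeated lines
        uniq --check-chars=64 --all-repeated=separate
}

# Example (not run here):
#   list_duplicate_rpms /var/lib/pulp/content/units/rpm
```

Hard links to the same inode would also show up as a group here, so any hit is worth confirming with `stat` before counting it as wasted space.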
Michael, is the behavior described by Sebastian expected for 6.2?
What you see is expected. Pulp will not remove content from the filesystem unless two things happen in order:
1) the content is removed from all repositories. If the RPM is still associated with any other repo, pulp will keep it.
2) an "orphan purge" task is initiated
Both of those are in katello's control, so I'm not sure when they would be expected to happen.
To be clear, this duplication happens on a freshly installed Satellite.
There is no existing content so the expectation is not that Pulp or Katello removes something but just does not store something in a second location that it already has.
Ok, thank you for that clarification. I don't think the dual-storage problem is going to be completely resolved in the near future, but we have been working with RCM on a longer-term resolution.
I believe RCM cleaned up the checksum mis-match on their end, so that should help future syncs.
We also changed pulp to always store RPMs using the sha256 checksum value for uniqueness, regardless of which algorithm is used in the remote repo metadata. That will also prevent this problem during future syncs. This change landed in pulp 2.9.0.
Based on comment 18, I am moving this bug to 6.3 for verification since 6.3 will include pulp 2.9.
In Satellite 6.3 snap 6 there is a change in paths but this seems to persist, for example:
~]# find /var/lib/pulp/ -name bash-4.2.46-12.el7.x86_64.rpm
Also, you can list duplicate packages in UI at Content > Packages, revealing the checksum mismatch persists (compare attached screenshots)
Created attachment 1225908
Created attachment 1225909
(In reply to Peter Ondrejka from comment #20)
> In Satellite 6.3 snap 6 there is a change in paths but this seems to
> persist, for example:
> ~]# find /var/lib/pulp/ -name bash-4.2.46-12.el7.x86_64.rpm
> Also, you can list duplicate packages in UI at Content > Packages, revealing
> the checksum mismatch persists (compare attached screenshots)
Was this using a download policy of on_demand or background? In that use case, there is nothing Pulp can do, because we only know the checksum for an rpm as it is listed in the repo metadata.
For example, consider that you have repo_a whose metadata uses sha1, and repo_b whose metadata uses sha256. They contain the same RPMs.
If you sync repo_a with the on_demand policy, it will create an entry in the database for each RPM using a sha1 checksum.
If you then sync repo_b with the on_demand policy, it will create new entries for each RPM using the sha256 checksum. Pulp has no way to compare these with the existing sha1 checksums, so the equivalence goes unrecognized.
If you synced both using the "immediate" policy, pulp calculates all supported checksum types, and always makes DB entries using the sha256 algorithm. That would enable pulp to recognize the equivalence during the sync of repo_b.
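The repo_a/repo_b scenario can be illustrated with a toy model. This is not Pulp code: `unit_key` is a hypothetical helper that prints the key a unit would be recorded under, and for the deferred policies it computes the digest locally only as a stand-in for the value that would be read from repo metadata. It assumes bash with GNU coreutils:

```shell
# unit_key POLICY METADATA_ALGO FILE
# Toy model (not Pulp code): prints "algo:digest" for the DB entry.
unit_key() {
    local policy=$1 algo=$2 file=$3 digest
    case $policy in
        immediate)
            # The file is downloaded, so the key is always normalized to sha256
            algo=sha256 ;;
        on_demand|background)
            # Nothing is downloaded yet: only the metadata's checksum is known
            ;;
        *)
            echo "unknown policy: $policy" >&2; return 2 ;;
    esac
    case $algo in
        sha1)   digest=$(sha1sum < "$file") || return 2 ;;
        sha256) digest=$(sha256sum < "$file") || return 2 ;;
        *)      echo "unsupported algorithm: $algo" >&2; return 2 ;;
    esac
    # Strip the trailing "  -" that the *sum tools print for stdin
    echo "$algo:${digest%% *}"
}
```

For the same file, `unit_key on_demand sha1` and `unit_key on_demand sha256` print different keys, so the two entries cannot be matched, whereas both `immediate` calls print the same sha256 key.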
I'm closing as "wontfix" since I believe Pulp is already doing everything it can given available information. We could consider an RFE to de-duplicate data after the fact, but that would not be possible without major changes to the way content is served. Those changes would only be possible post-pulp3.
Let me know if you have any additional questions or feedback.
To explain at a high level: if pulp recognized that two RPMs are in the DB twice, and thus on disk twice, published data likely exists with symlinks to both. Pulp doesn't currently track publications, so it would be impossible to remove one of the files without the risk of breaking many published repos.
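The risk described here can be made concrete by listing which symlinks resolve to a given unit file before anyone considers removing it. A sketch only: `links_to` is a hypothetical helper, the published-tree location in the example is an assumption, and GNU `find` and `readlink -f` are required:

```shell
# links_to TARGET ROOT
# Prints every symlink under ROOT that resolves to the same file as TARGET.
links_to() {
    local target root=$2
    target=$(readlink -f -- "$1") || return 1
    find "$root" -type l -exec sh -c '
        target=$1; shift
        for link do
            # Resolve each symlink fully and compare with the canonical target
            if [ "$(readlink -f -- "$link")" = "$target" ]; then
                printf "%s\n" "$link"
            fi
        done' sh "$target" {} +
}

# Example (not run here; the published path is an assumption):
#   links_to /var/lib/pulp/content/units/rpm/b5/.../bash-4.2.46-19.el7.x86_64.rpm \
#       /var/lib/pulp/published
```

Any path printed is a published entry that would break if that copy were deleted, which is why removing one of two duplicate files is unsafe without tracking publications.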
*** Bug 1418676 has been marked as a duplicate of this bug. ***