Bug 1255090

Summary: Kickstart Trees duplicate packages in /var/lib/pulp/content/rpm
Product: Red Hat Satellite
Reporter: Sebastian Hetze <shetze>
Component: Pulp
Assignee: Brad Buckingham <bbuckingham>
Status: CLOSED WONTFIX
QA Contact: Peter Ondrejka <pondrejk>
Severity: high
Priority: unspecified
Version: 6.1.0
CC: bbuckingham, bkearney, dgregor, mhrivnak, mverma, pmutha, pondrejk, sthirugn
Target Milestone: Unspecified
Keywords: Triaged
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2016-12-19 21:29:12 UTC
Type: Bug
Attachments:
Description          Flags
389-kickstart_repo   none
389-server_repo      none

Description Sebastian Hetze 2015-08-19 15:00:56 UTC
Description of problem:
When downloading a kickstart tree with Sat6 together with the appropriate product repo, all packages in the kickstart tree appear as separate files in /var/lib/pulp/content/rpm, consuming additional space.

For example, after syncing the RHEL7 7Server base repo and the 7.1 kickstart, I end up with two copies of 389-ds-base-1.3.3.1-13.el7.x86_64.rpm,
one in
389-ds-base/1.3.3.1/13.el7/x86_64/4c3d975b338d3cda0abdc962fd1db60f2218224fdd2312601f2ea958da52dd8a
and the other in
389-ds-base/1.3.3.1/13.el7/x86_64/66ffb47302738ca3f7f55649ffd79f6d186bde97

Both are exactly the same file, stored under two different inodes.
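
The duplication can be confirmed independently of the directory names by grouping files on their content hash. The following is only a sketch: find_duplicate_files and sha256_of are hypothetical helpers, not part of Pulp or Satellite. It walks a directory tree, hashes every regular file with SHA-256, and reports content that is stored under more than one inode.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(root):
    """Map content digest -> sorted paths for content stored under several inodes.

    Hypothetical helper for illustration; hard links to the same inode are
    counted once, and symlinks are skipped.
    """
    by_digest = defaultdict(dict)  # digest -> {(device, inode): first path seen}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            info = os.stat(path)
            by_digest[sha256_of(path)].setdefault((info.st_dev, info.st_ino), path)
    return {
        digest: sorted(inodes.values())
        for digest, inodes in by_digest.items()
        if len(inodes) > 1
    }
```

Calling find_duplicate_files("/var/lib/pulp/content/rpm") on an affected system would be expected to list each duplicated RPM once per stored copy. Hashing several gigabytes of packages takes a while.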


Version-Release number of selected component (if applicable):
This problem is present in Sat6.0 and in Sat6.1.1

How reproducible:
The duplicates appear in any pulp/content/rpm directory that holds both a kickstart tree and its base repo.

Steps to Reproduce:
1. hammer repository-set enable --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --basearch='x86_64' --releasever='7Server' --name 'Red Hat Enterprise Linux 7 Server (RPMs)'
2. hammer repository-set enable --organization "$ORG" --product 'Red Hat Enterprise Linux Server' --basearch='x86_64' --releasever='7.1' --name 'Red Hat Enterprise Linux 7 Server (Kickstart)'
3. hammer repository synchronize --organization "$ORG" --product 'Red Hat Enterprise Linux Server'  --name  'Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server'
4. wait until sync is finished
5. du -sh /var/lib/pulp/content/rpm
6. hammer repository synchronize --organization "$ORG" --product 'Red Hat Enterprise Linux Server'  --name  'Red Hat Enterprise Linux 7 Server Kickstart x86_64 7.1'
7. du -sh /var/lib/pulp/content/rpm


Actual results:
more than 3G additional space consumed

Expected results:
no additional space consumed since all packages in the kickstart tree for RHEL 7.1 are already present in 7Server

Comment 11 Sebastian Hetze 2016-04-05 16:42:05 UTC
With the current 6.2 beta snap this problem remains unresolved.

find /var/lib/pulp/content/units/rpm/ -name bash-4.2.46-19\*
/var/lib/pulp/content/units/rpm/b5/701df8aa2a86d9884c8c946a55593fea073e6a9575207556378752dc768055/bash-4.2.46-19.el7.x86_64.rpm
/var/lib/pulp/content/units/rpm/1a/53b1410864e0b5fd1bc2d71409c8e11c0c03fac1d0b65b30416eddeaa4f512/bash-4.2.46-19.el7.x86_64.rpm

sha256sum /var/lib/pulp/content/units/rpm/b5/701df8aa2a86d9884c8c946a55593fea073e6a9575207556378752dc768055/bash-4.2.46-19.el7.x86_64.rpm /var/lib/pulp/content/units/rpm/1a/53b1410864e0b5fd1bc2d71409c8e11c0c03fac1d0b65b30416eddeaa4f512/bash-4.2.46-19.el7.x86_64.rpm
88b662408745b64513268d6c3c57484a0625c264f63e4509174c6b6f2507cf96  /var/lib/pulp/content/units/rpm/b5/701df8aa2a86d9884c8c946a55593fea073e6a9575207556378752dc768055/bash-4.2.46-19.el7.x86_64.rpm
88b662408745b64513268d6c3c57484a0625c264f63e4509174c6b6f2507cf96  /var/lib/pulp/content/units/rpm/1a/53b1410864e0b5fd1bc2d71409c8e11c0c03fac1d0b65b30416eddeaa4f512/bash-4.2.46-19.el7.x86_64.rpm

So the cryptic directory name apparently does not reflect the checksum anymore.
We still have the complete content of the kickstart tree duplicated in the pulp space.

Comment 12 Brad Buckingham 2016-04-06 15:06:26 UTC
Michael, is the behavior described by Sebastian expected for 6.2?

Comment 13 Michael Hrivnak 2016-04-14 14:15:16 UTC
What you see is expected. Pulp will not remove content from the filesystem unless two things happen in order:

1) the content is removed from all repositories. If the RPM is still associated with any other repo, pulp will keep it.
2) an "orphan purge" task is initiated

Both of those are in katello's control, so I'm not sure when they would be expected to happen.
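
The two conditions above can be modelled with a small in-memory sketch. The ContentStore class and its method names are hypothetical and are not Pulp's API or data model. The sketch only illustrates that a unit stays on disk while any repository still references it, and that even an unreferenced unit remains until a purge is explicitly run.

```python
class ContentStore:
    """Toy model of unit storage and repo associations (not Pulp's real model)."""

    def __init__(self):
        self.units = set()        # unit ids currently present "on disk"
        self.associations = {}    # repo id -> set of unit ids in that repo

    def associate(self, repo_id, unit_id):
        """Store the unit and link it to a repository."""
        self.units.add(unit_id)
        self.associations.setdefault(repo_id, set()).add(unit_id)

    def unassociate(self, repo_id, unit_id):
        """Remove the unit from one repository; the stored file is untouched."""
        self.associations.get(repo_id, set()).discard(unit_id)

    def orphans(self):
        """Units that no repository references any more."""
        referenced = set().union(*self.associations.values())
        return self.units - referenced

    def purge_orphans(self):
        """Delete orphaned units from storage and return what was removed."""
        removed = self.orphans()
        self.units -= removed
        return removed


store = ContentStore()
store.associate("server", "bash")
store.associate("kickstart", "bash")
store.unassociate("server", "bash")
print(store.purge_orphans())   # still referenced by "kickstart": nothing removed
store.unassociate("kickstart", "bash")
print("bash" in store.units)   # orphaned but not yet purged
print(store.purge_orphans())   # now the unit is removed
```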

Comment 14 Sebastian Hetze 2016-04-18 10:54:00 UTC
To be clear, this duplication happens on a freshly installed Satellite.

There is no existing content, so the expectation is not that Pulp or Katello removes anything, but simply that it does not store a second copy of something it already has.

Comment 15 Michael Hrivnak 2016-04-19 14:22:28 UTC
Ok, thank you for that clarification. I don't think the dual-storage problem is going to be completely resolved in the near future, but we have been working with RCM on a longer-term resolution.

Comment 18 Michael Hrivnak 2016-10-05 12:57:51 UTC
I believe RCM cleaned up the checksum mismatch on their end, so that should help future syncs.

We also changed pulp to always store RPMs using the sha256 checksum value for uniqueness, regardless of which algorithm is used in the remote repo metadata. That will also prevent this problem during future syncs. This change landed in pulp 2.9.0.

Comment 19 Bryan Kearney 2016-10-05 13:09:42 UTC
Based on comment 18, I am moving this bug to 6.3 for verification since 6.3 will include pulp 2.9.

Comment 20 Peter Ondrejka 2016-11-29 15:57:57 UTC
In Satellite 6.3 snap 6 there is a change in paths but this seems to persist, for example:

~]# find /var/lib/pulp/ -name bash-4.2.46-12.el7.x86_64.rpm
/var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server/1480431662.42/bash-4.2.46-12.el7.x86_64.rpm
/var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_Kickstart_x86_64_7_1/1480432704.29/Packages/bash-4.2.46-12.el7.x86_64.rpm
/var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_Kickstart_x86_64_7_1/1480432704.29/bash-4.2.46-12.el7.x86_64.rpm

Also, you can list duplicate packages in UI at Content > Packages, revealing the checksum mismatch persists (compare attached screenshots)

Comment 21 Peter Ondrejka 2016-11-29 15:58:45 UTC
Created attachment 1225908 [details]
389-kickstart_repo

Comment 22 Peter Ondrejka 2016-11-29 15:59:29 UTC
Created attachment 1225909 [details]
389-server_repo

Comment 23 Michael Hrivnak 2016-12-13 20:37:32 UTC
(In reply to Peter Ondrejka from comment #20)
> In Satellite 6.3 snap 6 there is a change in paths but this seems to persist, for example:
> 
> ~]# find /var/lib/pulp/ -name bash-4.2.46-12.el7.x86_64.rpm
> /var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_RPMs_x86_64_7Server/1480431662.42/bash-4.2.46-12.el7.x86_64.rpm
> /var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_Kickstart_x86_64_7_1/1480432704.29/Packages/bash-4.2.46-12.el7.x86_64.rpm
> /var/lib/pulp/published/yum/master/yum_distributor/Default_Organization-Red_Hat_Enterprise_Linux_Server-Red_Hat_Enterprise_Linux_7_Server_Kickstart_x86_64_7_1/1480432704.29/bash-4.2.46-12.el7.x86_64.rpm
> 
> Also, you can list duplicate packages in UI at Content > Packages, revealing the checksum mismatch persists (compare attached screenshots)

Was this using a download policy of on_demand or background? In that use case, there is nothing Pulp can do, because we only know the checksum for an rpm as it is listed in the repo metadata.

For example, consider that you have repo_a whose metadata uses sha1, and repo_b whose metadata uses sha256. They contain the same RPMs.

If you sync repo_a with the on_demand policy, it will create an entry in the database for each RPM using a sha1 checksum.

If you then sync repo_b with the on_demand policy, it will create new entries for each RPM using the sha256 checksum. Pulp has no way to compare these with the existing sha1 checksums, so the equivalence goes unrecognized.

If you sync'd both using the "immediate" policy, pulp calculates all supported checksum types, and always makes DB entries using the sha256 algorithm. That would enable pulp to recognize the equivalence during the sync of repo_b.
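
The repo_a / repo_b scenario can be reproduced with a short simulation. This is a sketch under stated assumptions: RemoteRpm, sync, and the three-field unit key are hypothetical simplifications, not Pulp's actual code or unit key. It shows only that keying on the metadata checksum yields two entries for identical bytes, while keying on a locally computed sha256 yields one.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteRpm:
    """An RPM as advertised by a remote repo (simplified, hypothetical)."""
    filename: str
    checksum_type: str   # algorithm used in the repo metadata, e.g. "sha1"
    payload: bytes       # file content; only fetched under the immediate policy

    @property
    def metadata_checksum(self):
        """Checksum as it would be listed in the remote repo metadata."""
        return hashlib.new(self.checksum_type, self.payload).hexdigest()


def sync(database, remote_rpms, policy):
    """Add one entry per unit key to `database` (a set) and return it."""
    for rpm in remote_rpms:
        if policy == "immediate":
            # The file is downloaded, so sha256 can be computed locally.
            key = (rpm.filename, "sha256", hashlib.sha256(rpm.payload).hexdigest())
        else:
            # on_demand / background: only the metadata checksum is known.
            key = (rpm.filename, rpm.checksum_type, rpm.metadata_checksum)
        database.add(key)
    return database


payload = b"identical rpm bytes"
repo_a = [RemoteRpm("bash.rpm", "sha1", payload)]     # metadata uses sha1
repo_b = [RemoteRpm("bash.rpm", "sha256", payload)]   # metadata uses sha256

on_demand_db = sync(sync(set(), repo_a, "on_demand"), repo_b, "on_demand")
immediate_db = sync(sync(set(), repo_a, "immediate"), repo_b, "immediate")
print(len(on_demand_db), len(immediate_db))  # prints: 2 1
```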

Comment 24 Michael Hrivnak 2016-12-19 21:29:12 UTC
I'm closing as "wontfix" since I believe Pulp is already doing everything it can given available information. We could consider an RFE to de-duplicate data after-the-fact, but that would not be possible without major changes to the way content is served [0]. Those changes would only be possible post-pulp3.

Let me know if you have any additional questions or feedback.

[0] To explain from a high level, if pulp recognized that two RPMs are in the DB twice, and thus on disk twice, that means published data likely exists with symlinks to both. Pulp doesn't currently track publications, so it would be impossible to remove one of the files without risk of breaking lots of published repos.
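
To illustrate the kind of reverse lookup that footnote [0] implies, the sketch below scans a published tree for symlinks that resolve to a given content file. symlinks_pointing_to is a hypothetical helper, not Pulp functionality. Without tracked publications, a full scan of this sort would be needed before any duplicate could be removed safely.

```python
import os


def symlinks_pointing_to(published_root, content_path):
    """Return symlinks under `published_root` that resolve to `content_path`.

    Hypothetical helper for illustration only.
    """
    target = os.path.realpath(content_path)
    matches = []
    for dirpath, dirnames, filenames in os.walk(published_root):
        # os.walk lists symlinks to directories in dirnames and all other
        # symlinks in filenames, so both lists are checked.
        for name in dirnames + filenames:
            link = os.path.join(dirpath, name)
            if os.path.islink(link) and os.path.realpath(link) == target:
                matches.append(link)
    return sorted(matches)
```

If the returned list is non-empty, deleting the content file would leave those published links dangling.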

Comment 25 Michael Hrivnak 2017-02-06 22:09:37 UTC
*** Bug 1418676 has been marked as a duplicate of this bug. ***