Bug 1903367 - Publishing a new Content View version brings in old metadata files from multiple previous versions; regenerating CV metadata then fixes it [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Repositories
Version: 6.8.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: 6.9.5
Assignee: Justin Sherrill
QA Contact: Cole Higgins
URL:
Whiteboard:
Duplicates: 1921752
Depends On:
Blocks:
Reported: 2020-12-01 22:14 UTC by Pablo Hess
Modified: 2022-03-09 20:14 UTC (History)

Fixed In Version: foreman-installer-2.3.1.19-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-31 12:04:00 UTC
Target Upstream Version:
musman: needinfo? (satellite6-bugs)


Attachments
bz1903367-pulp-steps (4.78 KB, text/plain)
2020-12-11 18:17 UTC, Ina Panova


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Foreman Issue Tracker | 32966 | 0 | Normal | Closed | support remove_old_repodata options in yum_distributor for puppet-pulp | 2021-07-23 11:44:57 UTC
Foreman Issue Tracker | 32996 | 0 | Normal | Ready For Testing | support deleting unused yum metadata files within /var/lib/pulp for pulp2 | 2021-07-29 13:37:35 UTC
Red Hat Knowledge Base (Solution) | 5923911 | 0 | None | None | None | 2021-04-01 12:29:06 UTC
Red Hat Product Errata | RHBA-2021:3387 | 0 | None | None | None | 2021-08-31 12:04:18 UTC

Description Pablo Hess 2020-12-01 22:14:50 UTC
Description of problem:
Verified and easily reproduced on Satellite 6.8.1, possibly also on 6.8.0: when you create a new Content View on Satellite 6.8.1 it will bring multiple copies of each repodata file. Here is the repodata dir from v2.0 of an example CV:

[root@sat68a ~]# ls /var/lib/pulp/published/yum/master/yum_distributor/1-cv_capsule-v2_0-626293c0-3cc2-4a90-af5b-6d48a18ac53e/1606858126.05/repodata/ | sort -k2 -t '-'
repomd.xml
078bdcd5-96a5-45d2-b838-9414c0bc1a84
a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
0dcc1ba33605c46cb7f13182dca411cd5674cc2d6439f1cca7cfb4ab900a297b-filelists.xml.gz
155ad64e8d53d6181cf6a5699430ec7dd8ba6c5d5dc4b9a5885e1701d7f0243b-filelists.xml.gz
8f5557c9a8c1aeda866068c1f4ceb33c26a32018182cfc687cf760bced85d610-filelists.xml.gz
dff9cdfa1661020eedd26c2d75604d8c4ab9db07e88d447ebe0051d168333c85-filelists.xml.gz
1320ec97c60d14f9000cefcccd075fc4996ba89cd73ec1600dcc987c0011ce64-other.xml.gz
ce270097fdb29bb8631a3d53e8151bf80ea0d9fe600159417257619d7171dbbc-other.xml.gz
e56d7eaaa87fd95453e6c9b410faffa2473e0cef088b38cc83929f8d0c2d3c86-other.xml.gz
f2e0545170c19c4099846e429b1c27207cb3e687a3ec7c6cec4052977e92bfda-other.xml.gz
710b62803ec0f2eb614dbe1580afdbdabcbd44a1b5a6525573fdac2fe593ddad-primary.xml.gz
723299a84ed858cc01a703a7a581716b62c6f69b1bf6ba712faa5a3e8d18d245-primary.xml.gz
9bcc3ec9efc83f2d1335fa564026b40a52095acb58a187f5d17828a0a0369d98-primary.xml.gz
a01cfe348212eaa7b37cc30e780f866463b4438b6c9f16ac24cca6431a638131-primary.xml.gz
3fd64dd74cb72e980d3a16b3963bc46f7bf191f180eecbc94a2bb2e48d325c1e-updateinfo.xml.gz
7b275643bd56d0dbb92b9944a7b5c998d1631a313bf379cff413d30c272f53ca-updateinfo.xml.gz
7f83f5fc77d09449182d645632acb260e02e41a39512bb8db8ca451f2695f62f-updateinfo.xml.gz
94b654e18f09b60ce4dfd2caccec5ca974661971d782e8471445985c84f7a93e-updateinfo.xml.gz


After triggering a new regenerate metadata task on this CV, the task completes successfully and the result is a much saner outlook:

[root@sat68a ~]# ls /var/lib/pulp/published/yum/master/yum_distributor/1-cv_capsule-v2_0-626293c0-3cc2-4a90-af5b-6d48a18ac53e/*/repodata/ | sort -k2 -t '-'                                                                       
repomd.xml
078bdcd5-96a5-45d2-b838-9414c0bc1a84
a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
17b2942633263c77adf92181ad818b10025367cde29a6e85d6df6c838ffff739-filelists.xml.gz
95c725baf5ae32fd4b7d2261eac81f078d9630c21e76ab36800a9fc2ae62f451-other.xml.gz
40aadea28d5af03a4017476f1e8985c678e5808883bdebc82abd19490cbdf26e-primary.xml.gz
9ad5b0fc5c1967f68457a905208efec6f8e12e962bd68d39819969c1b19b3f76-updateinfo.xml.gz


The `repomd.xml` file in the directory references only one "copy" of each metadata type.

I believe I can trace the issue back to the "upstream" publishers, e.g.:

[root@sat68a ~]# ls /var/lib/pulp/published/yum/master/yum_distributor/626293c0-3cc2-4a90-af5b-6d48a18ac53e/*/repodata/                                                                                                                       
078bdcd5-96a5-45d2-b838-9414c0bc1a84                                                94b654e18f09b60ce4dfd2caccec5ca974661971d782e8471445985c84f7a93e-updateinfo.xml.gz
0dcc1ba33605c46cb7f13182dca411cd5674cc2d6439f1cca7cfb4ab900a297b-filelists.xml.gz   9bcc3ec9efc83f2d1335fa564026b40a52095acb58a187f5d17828a0a0369d98-primary.xml.gz
1320ec97c60d14f9000cefcccd075fc4996ba89cd73ec1600dcc987c0011ce64-other.xml.gz       a01cfe348212eaa7b37cc30e780f866463b4438b6c9f16ac24cca6431a638131-primary.xml.gz
155ad64e8d53d6181cf6a5699430ec7dd8ba6c5d5dc4b9a5885e1701d7f0243b-filelists.xml.gz   a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
3fd64dd74cb72e980d3a16b3963bc46f7bf191f180eecbc94a2bb2e48d325c1e-updateinfo.xml.gz  ce270097fdb29bb8631a3d53e8151bf80ea0d9fe600159417257619d7171dbbc-other.xml.gz
710b62803ec0f2eb614dbe1580afdbdabcbd44a1b5a6525573fdac2fe593ddad-primary.xml.gz     dff9cdfa1661020eedd26c2d75604d8c4ab9db07e88d447ebe0051d168333c85-filelists.xml.gz
723299a84ed858cc01a703a7a581716b62c6f69b1bf6ba712faa5a3e8d18d245-primary.xml.gz     e56d7eaaa87fd95453e6c9b410faffa2473e0cef088b38cc83929f8d0c2d3c86-other.xml.gz
7b275643bd56d0dbb92b9944a7b5c998d1631a313bf379cff413d30c272f53ca-updateinfo.xml.gz  f2e0545170c19c4099846e429b1c27207cb3e687a3ec7c6cec4052977e92bfda-other.xml.gz
7f83f5fc77d09449182d645632acb260e02e41a39512bb8db8ca451f2695f62f-updateinfo.xml.gz  repomd.xml
8f5557c9a8c1aeda866068c1f4ceb33c26a32018182cfc687cf760bced85d610-filelists.xml.gz


So, would a sync from the CDN bringing in those duplicates cause every new derived repo on Satellite to also bring in the same duplicates?






Version-Release number of selected component (if applicable):
Tested and verified on Satellite 6.8.1:
pulp-katello-1.0.3-1.el7sat.noarch
pulp-server-2.21.3-1.el7sat.noarch
python-pulp-bindings-2.21.3-1.el7sat.noarch
python-pulp-client-lib-2.21.3-1.el7sat.noarch
python-pulp-common-2.21.3-1.el7sat.noarch
python-pulp-integrity-2.21.3.1-1.el7sat.noarch
python-pulp-oid_validation-2.21.3-1.el7sat.noarch
python-pulp-repoauth-2.21.3-1.el7sat.noarch
python-pulp-rpm-common-2.21.3.1-1.el7sat.noarch
satellite-6.8.1-1.el7sat.noarch


How reproducible:
100% of the time on tested Satellites.

Steps to Reproduce:
1. Have Satellite 6.8.1.
2. Create a new CV containing any given repository from the Red Hat CDN (not yet tested with custom repos).
3. Publish this new CV.

Actual results:
Duplicated metadata seen under /var/lib/pulp/published/yum/master/yum_distributor/<org_id>-<cv_name>-<cv_version>-<root_uuid>/<timestamp>/repodata/.

Expected results:
Non-duplicated metadata.

Additional info:
Duplicated metadata per se is not so urgent. I'm tagging this ticket as urgent because it becomes very feasible (likely, even) for a Satellite to fill up its /var/lib/pulp filesystem after a single round of publishing all existing CVs, say, overnight.

Comment 2 Pablo Hess 2020-12-01 22:33:09 UTC
Script that may safely be run on any fully-published repository that will remove any repodata files that are not referenced in repomd.xml:

# ( grep '^<data type' repomd.xml | fgrep '.xml' | cut -f3 -d= |cut -f2 -d '"' | cut -f2 -d/ ; ls *-*.xml* ) | sort | uniq -u | xargs rm

This is an alternative to re-publishing metadata for the CV that will also work fine to reduce storage space consumption on a published repository under /var/lib/pulp/.
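
For anyone who would rather not depend on the line layout of repomd.xml, the same check can be sketched in Python. This is only an illustration and not part of Satellite or Pulp: the function name `unreferenced_repodata` is made up here, and it assumes the usual repomd.xml layout where each `<data>` element carries a `<location href="repodata/...">`. Like the one-liner, it only considers files matching `*-*.xml*`, and it deletes nothing, it just lists candidates.

```python
import fnmatch
import os
import xml.etree.ElementTree as ET


def unreferenced_repodata(repodata_dir):
    """List metadata files in repodata_dir that repomd.xml does not reference.

    Hypothetical helper for illustration; it reports only and never deletes.
    """
    tree = ET.parse(os.path.join(repodata_dir, "repomd.xml"))
    referenced = set()
    for elem in tree.iter():
        # Match <location> regardless of the XML namespace in front of the tag.
        if elem.tag.rsplit("}", 1)[-1] == "location" and elem.get("href"):
            referenced.add(os.path.basename(elem.get("href")))
    return sorted(
        name
        for name in os.listdir(repodata_dir)
        # Same scope as the shell glob *-*.xml* used above.
        if fnmatch.fnmatch(name, "*-*.xml*") and name not in referenced
    )
```

Reviewing the returned list before removing anything keeps the cleanup a deliberate, manual step.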

Comment 4 Pablo Hess 2020-12-02 15:11:02 UTC
Here is one confirmed permanent solution: remove the unnecessary "duplicated" metadata files from the root repository yum_distributor dir. From then on, new CV versions containing this repository will no longer contain the "duplicates" and will behave as usual and expected.


The root repository dir at /var/lib/pulp/published/yum/master/yum_distributor/e463a3be-9beb-4b58-954b-3c70b4d76ff9/1606714503.72/repodata contains those "duplicated" metadata files:

[root@sat68a repodata]# ls /var/lib/pulp/published/yum/master/yum_distributor/e463a3be-9beb-4b58-954b-3c70b4d76ff9/1606714503.72/repodata/
170d2873c8c64bf2284ddf4cfd286b1a8d00e745bf059697ae0cc1a80595b27b-primary.xml.gz    
29b35070e9e5e0ce665634a6695657dd2c0e90f984ecf9edc9ae0c2514317609-primary.xml.gz    
50cfa17ec516642f998f9b4a1f3fbd69c7ae2a801725f74c47dc262f908a4bc0-filelists.xml.gz  
81a50a950c81d298ed82532f06ff8e77b5c2baae6f614b2b7954ed52d2cfe217-other.xml.gz      
887f98b08fd5a31c2f862b5a813212c2edacb7b71025f31739bdae1b056c6a58-other.xml.gz      
8d4daf96-3bdc-41fd-b5ad-7ab37ef3cd58
9f9055ce8ec3f81556b037756b9db2646fe9a85c2b9edf607dd0305c7edb96d8-updateinfo.xml.gz
af22d713b7535abba900ee1d1163e04fdaca3aa5cbc80e631032553cb3afaa2e-filelists.xml.gz
b468996a0064ff1f778008a88e75ebda41dbb0e9067243c7fb83aad3bd8ea652-updateinfo.xml.gz
c716c2a27a0c069aa95fc982573b552da9089b870cb1534b412704ba4e66ca91-comps.xml
repomd.xml


A published CV's yum_distributor dir that contains this repository also contains the "duplicates":

[root@sat68a repodata]# ls /var/lib/pulp/published/yum/master/yum_distributor/1-cv_test_bz-v1_0-e463a3be-9beb-4b58-954b-3c70b4d76ff9/1606920752.03/repodata/
170d2873c8c64bf2284ddf4cfd286b1a8d00e745bf059697ae0cc1a80595b27b-primary.xml.gz    
29b35070e9e5e0ce665634a6695657dd2c0e90f984ecf9edc9ae0c2514317609-primary.xml.gz    
50cfa17ec516642f998f9b4a1f3fbd69c7ae2a801725f74c47dc262f908a4bc0-filelists.xml.gz  
81a50a950c81d298ed82532f06ff8e77b5c2baae6f614b2b7954ed52d2cfe217-other.xml.gz      
887f98b08fd5a31c2f862b5a813212c2edacb7b71025f31739bdae1b056c6a58-other.xml.gz      
8d4daf96-3bdc-41fd-b5ad-7ab37ef3cd58
9f9055ce8ec3f81556b037756b9db2646fe9a85c2b9edf607dd0305c7edb96d8-updateinfo.xml.gz
af22d713b7535abba900ee1d1163e04fdaca3aa5cbc80e631032553cb3afaa2e-filelists.xml.gz
b468996a0064ff1f778008a88e75ebda41dbb0e9067243c7fb83aad3bd8ea652-updateinfo.xml.gz
c716c2a27a0c069aa95fc982573b552da9089b870cb1534b412704ba4e66ca91-comps.xml
repomd.xml



So I'll remove all the unnecessary "duplicates" i.e. metadata files that are not referenced by repomd.xml:

# ( grep '^<data type' repomd.xml | fgrep '.xml' | cut -f3 -d= |cut -f2 -d '"' | cut -f2 -d/ ; ls *-*.xml* ) | sort | uniq -u | xargs rm


Then I'll publish a new version of my CV that contains this repository and check if the latest version has the stubborn "duplicates":

[root@sat68a repodata]# ls /var/lib/pulp/published/yum/master/yum_distributor/1-cv_test_bz-v2_0-e463a3be-9beb-4b58-954b-3c70b4d76ff9/*/repodata/ -1
170d2873c8c64bf2284ddf4cfd286b1a8d00e745bf059697ae0cc1a80595b27b-primary.xml.gz
887f98b08fd5a31c2f862b5a813212c2edacb7b71025f31739bdae1b056c6a58-other.xml.gz
8d4daf96-3bdc-41fd-b5ad-7ab37ef3cd58
af22d713b7535abba900ee1d1163e04fdaca3aa5cbc80e631032553cb3afaa2e-filelists.xml.gz
b468996a0064ff1f778008a88e75ebda41dbb0e9067243c7fb83aad3bd8ea652-updateinfo.xml.gz
c716c2a27a0c069aa95fc982573b552da9089b870cb1534b412704ba4e66ca91-comps.xml
repomd.xml


End result: no more duplicates.


The question that remains is: why did the root repo's yum_distributor dir end up with duplicated metadata files?

Comment 5 Ina Panova 2020-12-11 18:17:31 UTC
This behaviour is expected. Old repodata files are kept for 14 days by default.
This can be configured by setting remove_old_repodata_threshold, although I am not sure whether it is exposed in Katello: https://docs.pulpproject.org/en/2.21/plugins/pulp_rpm/tech-reference/yum-plugins.html?highlight=remove_old_repodata_threshold

To reproduce this behaviour, it is enough to add new content to the repo multiple times and publish in between.
Files older than 14 days (by default) are removed.
Forcing metadata regeneration removes old files, since in this case the publish operation is performed from scratch.


If keeping old repodata files for 14 days is not desirable, the threshold can be set to a shorter period.

Steps to reproduce this directly in Pulp are attached.
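
To make the retention rule concrete, here is a minimal sketch of an age check of this kind. It is not Pulp's implementation: the function name `files_past_threshold` is invented for illustration, and it assumes the threshold is expressed in seconds and that a file is dropped only when its age is strictly greater than the threshold.

```python
import datetime

# 14 days expressed in seconds, matching the default retention described above.
DEFAULT_THRESHOLD = datetime.timedelta(days=14).total_seconds()


def files_past_threshold(file_ages, threshold=DEFAULT_THRESHOLD):
    """Return the names whose age in seconds exceeds the threshold.

    file_ages maps an old (unreferenced) repodata file name to its age in
    seconds. Illustrative only; the strict '>' comparison is an assumption.
    """
    return sorted(name for name, age in file_ages.items() if age > threshold)
```

Under these assumptions a one-day-old file survives the default threshold but is selected as soon as the threshold is lowered to a few seconds.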

Comment 6 Ina Panova 2020-12-11 18:17:59 UTC
Created attachment 1738485 [details]
bz1903367-pulp-steps

Comment 9 Pablo Hess 2021-02-02 16:28:01 UTC
Hi Ina, to turn your comment into action I understand we can enable a 1-day threshold by adding the contents below to the `/etc/pulp/server/plugins.conf.d/yum_distributor.json` file which does not exist by default on Red Hat Satellite 6.x:


    # cat /etc/pulp/server/plugins.conf.d/yum_distributor.json
    {
        "remove_old_repodata": True,
        "remove_old_repodata_threshold": 1
    }


New question: can we set the threshold to zero if we want to completely prevent old repodata from needlessly existing in new repos?

Comment 10 Pablo Hess 2021-02-04 19:19:56 UTC
WARNING: my mistake: if you use the `yum_distributor.json` contents from my comment above you will get this error below when running `pulp-manage-db` either directly or indirectly (through e.g. `satellite-installer`):

Updating the database with types []
Found the following type definitions that were not present in the update collection [puppet_module, docker_tag, ostree, modulemd_defaults, package_langpacks, erratum, docker_blob, docker_manifest, yum_repo_metadata_file, package_group, package_category, iso, package_environment, drpm, distribution, modulemd, rpm, srpm, docker_image, docker_manifest_list]
Updating the database with types [puppet_module, drpm, ostree, modulemd_defaults, package_langpacks, docker_manifest, docker_blob, erratum, yum_repo_metadata_file, package_group, package_category, iso, package_environment, docker_tag, distribution, modulemd, rpm, srpm, docker_image, docker_manifest_list]
Content types loaded.
Ensuring the admin role and user are in place.
Admin role and user are in place.
Beginning database migrations.
No JSON object could be decoded
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/pulp/server/db/manage.py", line 280, in main
    return _auto_manage_db(options)
  File "/usr/lib/python2.7/site-packages/pulp/server/db/manage.py", line 347, in _auto_manage_db
    migrate_database(options)
  File "/usr/lib/python2.7/site-packages/pulp/server/db/manage.py", line 83, in migrate_database
    migration_packages = models.get_migration_packages()
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 348, in get_migration_packages
    migration_packages.append(MigrationPackage(migration_package_module))
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 172, in __init__
    available_versions = self.available_versions
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 219, in available_versions
    migrations = self.migrations
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 248, in migrations
    migration_modules.append(MigrationModule(module_name))
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 91, in __init__
    self._module = _import_all_the_way(python_module_name)
  File "/usr/lib/python2.7/site-packages/pulp/server/db/migrate/models.py", line 365, in _import_all_the_way
    module = __import__(module_string)
  File "/usr/lib/python2.7/site-packages/pulp_rpm/plugins/migrations/0016_new_yum_distributor.py", line 33, in <module>
    NEW_DISTRIBUTOR_CONF = read_json_config(NEW_DISTRIBUTOR_CONF_FILE_PATH)
  File "/usr/lib/python2.7/site-packages/pulp/common/config.py", line 681, in read_json_config
    config = json.load(f)
  File "/usr/lib64/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded






FIX:
Use true (lowercase) instead of True.

THE CORRECT CONTENTS ARE:

    # cat /etc/pulp/server/plugins.conf.d/yum_distributor.json
    {
        "remove_old_repodata": true,                  <==== note `true` is all lowercase
        "remove_old_repodata_threshold": 1
    }
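
A quick way to catch this kind of mistake before running `pulp-manage-db` is to parse the file contents with a JSON parser first. The sketch below uses only the Python standard library; the helper name `is_valid_json` is made up for illustration, and the two strings simply mirror the broken and corrected contents shown in this thread.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # raised by the json module when decoding fails
        return False
    return True


broken = '{"remove_old_repodata": True, "remove_old_repodata_threshold": 1}'
fixed = '{"remove_old_repodata": true, "remove_old_repodata_threshold": 1}'

print(is_valid_json(broken))  # False: `True` is Python syntax, not JSON
print(is_valid_json(fixed))   # True
```

The same check can be done from the shell by feeding the file to `python -m json.tool`, which exits with an error on invalid JSON.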

Comment 11 Ina Panova 2021-02-09 13:17:35 UTC
The docs are suboptimal and do not mention it explicitly, but the threshold value is expected to be specified in seconds.

It should suffice to specify a small enough value, and any old repodata past that threshold will be removed. I believe setting the value to 0 will remove any old repodata regardless of its age; however, I think this case might not be fully tested.

Comment 13 Pablo Hess 2021-02-10 16:58:08 UTC
After some testing, I see that this is not working to prevent the propagation of unnecessary old metadata:

    # cat /etc/pulp/server/plugins.conf.d/yum_distributor.json
    {
        "remove_old_repodata": true,
        "remove_old_repodata_threshold": 1
    }


(In reply to Ina Panova from comment #11)
> The docs are suboptimal and do not explicitly mention but the value for the
> threshold is expected to be specified in seconds.
> 
> It should suffice to specify small enough value and any old repodata that
> pass that threshold will be removed. I believe setting the value to 0 will
> remove any old repodata regardless of its age, however, I think this case
> might not be fully tested.

A value of 0 is also not working as a means to prevent the propagation of unnecessary old metadata.

Perhaps the logic that turns remove_old_repodata into action is misbehaving?

On Satellite 6.8, I've located and modified /usr/lib/python2.7/site-packages/pulp_rpm/plugins/distributors/yum/publish.py:

        threshold = self.get_config().get(
            'remove_old_repodata_threshold',
            # datetime.timedelta(days=14).total_seconds())  # original default, commented out
            1)  # new default value: 1 second


This did not solve the issue either.

So here is the test I'm performing:
# ls ./published/yum/master/yum_distributor/1-cv_mycv-v13_0-*/*/repodata

./published/yum/master/yum_distributor/1-cv_mycv-v13_0-626293c0-3cc2-4a90-af5b-6d48a18ac53e/1612975312.32/repodata:
078bdcd5-96a5-45d2-b838-9414c0bc1a84                                                86b01422fe44ee18f18cf43be24b09362dfbf22efaadc5e8c497647bf9419794-updateinfo.xml.gz
2a3d2c604a604e3796e83a2deb58906d3186b05d999da5871a8ef07caf874480-other.xml.gz       a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
3efa8ee78de989cd08682cb605de03c4e2e8a005ffc2026adc0dc615e953898d-updateinfo.xml.gz  a36be2bed4496e10a4af858bf743b4628c62175d2b11bab040ce0f1609405927-primary.xml.gz
3fbd7e543de1f82086d2f0b836d40f9b18c5359ba3b164df04018e509f11e2a9-filelists.xml.gz   d6d40f1d6f4e5663f5dcafa03fa1585652a5a07fe84b04e02ea4568cf8aaebcb-filelists.xml.gz
5b2a6258eb64ba18142e37454d820499fa993b3ad458116caacb25878c9b43a3-primary.xml.gz     repomd.xml
76bbf7f0bc7ac45ae01ff40a7b05681e703004f2b418f22afba32d06d7c5ed23-other.xml.gz

./published/yum/master/yum_distributor/1-cv_mycv-v13_0-7c326edf-b01b-4584-8f53-79696f4bb8d8/1612975312.39/repodata:
056dfc15b97ca51d6c0b17e6d96cfe9812a07bb9b9c0b6d13672b1082a784b05-updateinfo.xml.gz  91b29cd7bc395982f51ae7360297c6f4a107479d9b8a0504136037e3a5702d6c-filelists.xml.gz
0d053db8860939b4316087571ca80dcf91bd52adf3a4a8fb4258619b871262ac-other.xml.gz       a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
3fa96672cfc133a227eb733ed73507b65a46df74309fefa74f85aa0a0c0859dc-primary.xml.gz     cc39646353313b6b41f0aceda68182416e443eccebc38840c3e8816bae7dd443-filelists.xml.gz
41ef974230c69d2bd9888190f2f827d2e1201f31597a81f92a6aa74fec688ba1-other.xml.gz       fd3ab046ab62d410cf1e3a81a83dfcf56bb0fbca071542c159dfb58d1d63167f-primary.xml.gz
8745c868fea4065dcac832c8b343eeac7d9f82b5b11551dbac229264ee66a862-updateinfo.xml.gz  repomd.xml


Then, publish new version.
NOTE: no changes to contents happened, I'm simply publishing a new version that is bound to have no difference in contents to v13.

Re-run `ls`:
./published/yum/master/yum_distributor/1-cv_mycv-v14_0-626293c0-3cc2-4a90-af5b-6d48a18ac53e/1612975428.77/repodata:
078bdcd5-96a5-45d2-b838-9414c0bc1a84                                                86b01422fe44ee18f18cf43be24b09362dfbf22efaadc5e8c497647bf9419794-updateinfo.xml.gz
2a3d2c604a604e3796e83a2deb58906d3186b05d999da5871a8ef07caf874480-other.xml.gz       a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
3efa8ee78de989cd08682cb605de03c4e2e8a005ffc2026adc0dc615e953898d-updateinfo.xml.gz  a36be2bed4496e10a4af858bf743b4628c62175d2b11bab040ce0f1609405927-primary.xml.gz
3fbd7e543de1f82086d2f0b836d40f9b18c5359ba3b164df04018e509f11e2a9-filelists.xml.gz   d6d40f1d6f4e5663f5dcafa03fa1585652a5a07fe84b04e02ea4568cf8aaebcb-filelists.xml.gz
5b2a6258eb64ba18142e37454d820499fa993b3ad458116caacb25878c9b43a3-primary.xml.gz     repomd.xml
76bbf7f0bc7ac45ae01ff40a7b05681e703004f2b418f22afba32d06d7c5ed23-other.xml.gz

./published/yum/master/yum_distributor/1-cv_mycv-v14_0-7c326edf-b01b-4584-8f53-79696f4bb8d8/1612975428.58/repodata:
056dfc15b97ca51d6c0b17e6d96cfe9812a07bb9b9c0b6d13672b1082a784b05-updateinfo.xml.gz  91b29cd7bc395982f51ae7360297c6f4a107479d9b8a0504136037e3a5702d6c-filelists.xml.gz
0d053db8860939b4316087571ca80dcf91bd52adf3a4a8fb4258619b871262ac-other.xml.gz       a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
3fa96672cfc133a227eb733ed73507b65a46df74309fefa74f85aa0a0c0859dc-primary.xml.gz     cc39646353313b6b41f0aceda68182416e443eccebc38840c3e8816bae7dd443-filelists.xml.gz
41ef974230c69d2bd9888190f2f827d2e1201f31597a81f92a6aa74fec688ba1-other.xml.gz       fd3ab046ab62d410cf1e3a81a83dfcf56bb0fbca071542c159dfb58d1d63167f-primary.xml.gz
8745c868fea4065dcac832c8b343eeac7d9f82b5b11551dbac229264ee66a862-updateinfo.xml.gz  repomd.xml



We can see that metadata files are simply being copied over to the new version -- no evaluation of each metadata file's age is being done.

I can't say this (a new CV version with no content differences) is exactly what is happening on all Satellites facing this issue. But at least in this scenario the issue appears, even though, as far as I can tell, the 'remove_old_repodata_threshold' logic should be kicking in and removing any metadata files older than 1 second from the repodata/ directory.


@Ina, any thoughts or suggested next steps?

Comment 14 Pablo Hess 2021-02-10 17:24:19 UTC
I now see that not examining metadata file timestamps is part of the repository clone process that takes place when there are no changes to a given repo.

Can this behavior be changed by some config setting? Can RemoveOldRepodataStep be added to the repository cloning process?


So, back to the original situation: on Satellite we are seeing old repository metadata files be unnecessarily kept under /var/lib/pulp/, taking more and more storage space.
If one publishes multiple CVs every day -- and this is a perfectly valid usage scenario for Satellite -- one may end up with hundreds of GB of wasted storage space.

Setting remove_old_repodata_threshold to 1 inside /etc/pulp/server/plugins.conf.d/yum_distributor.json has not helped prevent this phenomenon.

This is proving to be a big problem, please prioritize it as such.

Comment 15 Ina Panova 2021-02-10 20:29:37 UTC
@Pablo, I have checked once again that the distributor.json works as it should on the Pulp side, even with the threshold set to 1 second.
There are some specifics at play on the Katello side. I will defer the rest to @jsherrill

Comment 16 Justin Sherrill 2021-02-11 14:25:22 UTC
Hey Pablo! 

There are a couple of issues at play here, I think:

1.  Generating metadata during repo sync and content view publish are keeping old copies of metadata around (you may need a 'skip metadata sync' to see it actually republish the metadata)

2a.  Promoting Content View versions with a repo with multiple copies of metadata is propagating that into lifecycle environment repos at promotion time
2b.  Even when you've published a new content view version with only a single set of metadata, promoting them is not causing any data to be 

I believe the configuration that ina mentions should resolve 1), can you confirm if it is?  

I understand that even if 1) is solved, new content view versions may not contain changes and thus may not have their metadata regenerated when promoting to a lifecycle environment?  If that's the case, you may want to manually trigger a 'regenerate repo metadata' for each content view version.  You can do this from the content view version list or via the 'hammer content-view version republish-repositories' command.

There's a lot in Satellite that attempts to reduce the amount of work done in cases where there are no changes, and I think this is preventing the fix from being seen (assuming 1) is working)

Comment 17 Pablo Hess 2021-03-04 14:33:11 UTC
Thank you Justin for chiming in and for the comprehensive review, and Ina for the constructive discussion and testing.

Regarding #1:
> 1.  Generating metadata during repo sync and content view publish are keeping old copies of metadata around (you may need a 'skip metadata sync' to see it actually republish the metadata)
    (...)
> I understand that even if 1) is solved, new content view versions may not contain changes and thus may not have their metadata regenerated when promoting to a lifecycle environment?  If thats the case you may want to manually trigger a 'regenerate repo metadata' for each content view versions.

I believe this is the core of the issue: having to regenerate metadata (this is expensive) in order to simply keep old metadata from being copied around doesn't seem more efficient than taking a quick, cheap look (it ought to be cheap) at metadata file age before blindly copying them over to the new repo (or CV version).

> There's a lot in satellite that attempts to reduce the amount of work done in the cases where there are no changes, and i think this is preventing the fix from being seen (assuming 1) is working)

I know this and I thoroughly appreciate it. I'm asking for the logic to just be improved a (hopefully) tiny bit to accommodate these cases. In my experience this (hopefully) tiny change in the logic would benefit a significant number of customers.

Comment 18 Pablo Hess 2021-03-04 18:02:06 UTC
One more question I think should be asked: why do we go with a default of 14 days for keeping old metadata along with new metadata? Why even keep old metadata? What purpose is the old metadata expected to serve since it's not referenced in repomd.xml?

Going forward, if we could have Satellite adopt a default of

    "remove_old_repodata_threshold": 0

...it would serve our users much better in terms of preventing old metadata build-up.


If the reluctance to keep old metadata around sounds like an overreaction, please bear in mind rhel-7-server-rpms metadata is now 809 MB in size, with a whopping 771 MB being used for the "other" data type alone. Having this repo metadata duplicated in one single CV means the CV uses 1+ GB of space for metadata alone. And if you then publish a new CV version with no changes, the new version metadata will also be 1+ GB in size. Now think of this problem applied to big CVs, and picture these big CVs being published with or without change every week, and also promoted. The problem gets *really* big.
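
As a rough back-of-the-envelope illustration of the numbers above, the sketch below estimates how much space repodata takes when every CV version carries more than one set of metadata. The function name `repodata_usage_mb` and its parameters are hypothetical, and it assumes every copy occupies its full size on disk, which may not hold if files are linked rather than copied.

```python
def repodata_usage_mb(metadata_mb, copies, cv_versions):
    """Estimate total MB of repodata across CV versions.

    metadata_mb: size of one full set of metadata for the repo
    copies:      number of metadata sets kept in each published repo
    cv_versions: number of CV versions that contain the repo
    Assumes each copy takes its full size on disk (an assumption).
    """
    return metadata_mb * copies * cv_versions


# 809 MB of metadata, duplicated once, in a single CV version.
print(repodata_usage_mb(809, 2, 1))
```

With the 809 MB figure quoted above and two sets per version, one CV version already accounts for more than 1.5 GB, and the total scales linearly with each additional version.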

Comment 19 Jean-Baptiste Dancre 2021-04-01 13:00:45 UTC
I second the comments made by Pablo, with a real-life example of the effect this has on large systems:

In my example, I'm going to publish a new version of the CV for rhel7, and then publish the associated CCVs (BASE is "just" rhel7 + sat repos, EXTENDED is rhel7 + sat tools + epel7 + extra + optionals).
Each time, I look at the free storage left on /var/lib/pulp.

Prior to new version of CV RHEL7:
> 288 645M
Post new version of CV RHEL7:
> 285 499M (aka approx 3GB more storage used)
Post New version of CCV (BASE) > Create the Library
> 281 790M (aka approx 4GB more storage used)
Post Promote of CCV to Early
> 281 561M (good news, stays approx the same)
Remove the old version of the CCV
> 283 141M (frees only 2GB - compared to 4GB used for the new version)
Post New version of CCV EXTENDED
> 279 513M  (approx 4GB more storage used)
Post promote of CCV to Early
> 278 726M (approx 1GB used, strange compared to the BASE CCV)
Remove the old version of the CCV
> 280 776M (frees about 2GB, in line with BASE)

Long story short, with approx 100 CCVs to publish (20 orgs, and approx 5 CCVs per org), I end up churning 300GB for my publish cycle!
This was not happening like that prior to 6.8.x
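
The approximate figures quoted in this comment can be recomputed from the raw free-space readings. The sketch below only does that arithmetic; the step labels are paraphrased from the list above and the helper name `deltas` is made up for illustration.

```python
# Free space on /var/lib/pulp in MB, copied from the readings in this comment.
free_mb = [
    ("before new CV version", 288645),
    ("after new CV version", 285499),
    ("after new CCV BASE version", 281790),
    ("after promoting BASE to Early", 281561),
    ("after removing old BASE version", 283141),
    ("after new CCV EXTENDED version", 279513),
    ("after promoting EXTENDED to Early", 278726),
    ("after removing old EXTENDED version", 280776),
]


def deltas(samples):
    """Return (label, MB consumed since the previous reading) pairs.

    A negative value means space was freed by that step.
    """
    return [
        (label, prev - cur)
        for (_, prev), (label, cur) in zip(samples, samples[1:])
    ]


for label, used in deltas(free_mb):
    print("%-40s %+6d MB" % (label, used))
```

Positive values are space consumed by a step and negative values are space given back, which makes it easy to see that removing an old version frees noticeably less than publishing a new one consumes.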

Comment 24 Tanya Tereshchenko 2021-04-08 13:34:41 UTC
(In reply to Pablo Hess from comment #17)
> Thank you Justin for chiming in and for the comprehensive review, and Ina
> for the constructive discussion and testing.
> 
> Regarding #1:
> > 1.  Generating metadata during repo sync and content view publish are keeping old copies of metadata around (you may need a 'skip metadata sync' to see it actually republish the metadata)
>     (...)
> > I understand that even if 1) is solved, new content view versions may not contain changes and thus may not have their metadata regenerated when promoting to a lifecycle environment?  If thats the case you may want to manually trigger a 'regenerate repo metadata' for each content view versions.
> 
> I believe this is the core of the issue: having to regenerate metadata (this
> is expensive) in order to simply keep old metadata from being copied around
> doesn't seem more efficient than taking a quick, cheap look (it ought to be
> cheap) at metadata file age before blindly copying them over to the new repo
> (or CV version).
> 
> > There's a lot in satellite that attempts to reduce the amount of work done in the cases where there are no changes, and i think this is preventing the fix from being seen (assuming 1) is working)
> 
> I know this and I thoroughly appreciate it. I'm asking for the logic to just
> be improved a (hopefully) tiny bit to accommodate these cases. In my
> experience this (hopefully) tiny change in the logic would benefit a
> significant number of customers.

Hi Pablo, it's Tanya from Pulp.

I understand the frustration. Unfortunately, fixing the problem without noticeably degrading performance (e.g. by removing some optimizations) is more complicated than it looks on the surface.
The first step towards the solution is to ensure that the threshold setting on pulp side works as expected, and then we can try to figure out what can be done on Katello side.

Could you confirm what Justin asks in bz1903367#c16?
> I believe the configuration that ina mentions should resolve 1), can you confirm if it is?

FWIW, your tests in the bz1903367#c13 with no changes to content and without 'skip metadata sync' didn't change metadata, which is as designed/expected.
This is due to optimizations on Pulp side. You need to change some content or bypass the optimizations explicitly with 'skip metadata sync'.

Thank you.

Comment 25 Pablo Hess 2021-04-08 16:43:51 UTC
(In reply to Tanya Tereshchenko from comment #24)

> Could you confirm what Justin asks in bz1903367#c16?
> > I believe the configuration that ina mentions should resolve 1), can you confirm if it is?

I believe so. However, like I said in c#18, we would need Satellite to adopt a default "remove_old_repodata_threshold" value of zero in order to prevent it from adding old metadata from the beginning. If old metadata sneaks in at any point in time, it will be blindly copied around from then on. Only by having a value of zero since the first content sync would Pulp be able to avoid holding old metadata.

On an existing Satellite, I have tried setting "remove_old_repodata_threshold" to 1 (I have not tried 0). This works if and only if we manually remove all unreferenced old metadata (e.g. with the commands in c#4) prior to doing any CV publish/promote. In this case, Satellite will not keep old unreferenced metadata around when we publish CVs next time.
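
As a rough illustration of what "unreferenced old metadata" means here, the sketch below lists files in a repodata directory that repomd.xml does not point at. It is not the set of commands from c#4. The function name `unreferenced_repodata` is hypothetical, and the sketch assumes that repomd.xml has one `<location href="repodata/...">` element per line. It only prints file names and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print repodata files that repomd.xml does not reference.
# Assumes one <location href="repodata/..."> element per line in repomd.xml.
unreferenced_repodata() {
    dir=$1
    # Basenames that repomd.xml points at.
    refs=$(sed -n 's/.*<location[^>]*href="repodata\/\([^"]*\)".*/\1/p' "$dir/repomd.xml")
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # repomd.xml is the index itself, so it is never listed.
        [ "$name" = "repomd.xml" ] && continue
        # Print the file name only if it is not an exact match for any reference.
        printf '%s\n' "$refs" | grep -qxF -- "$name" || printf '%s\n' "$name"
    done
}
```

Usage would look like `unreferenced_repodata <path to a published repodata directory>`. Any file it prints is a candidate for review, not something to remove automatically.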


> FWIW, your tests in the bz1903367#c13 with no changes to content and without
> 'skip metadata sync' didn't change metadata as designed/expected.
> This is due to optimizations on Pulp side. You need to change some content
> or bypass the optimizations explicitly with 'skip metadata sync'.

Right, makes sense. Thank you for clarifying.

Comment 27 Justin Sherrill 2021-05-25 17:45:18 UTC
Plan of action:

1) set remove_old_repodata_threshold to '0' by default
2) provide a foreman-maintain command to clean up these duplicates

If a user upgrades to a version that includes 1), their content views should get cleaned up automatically as they publish/promote new versions (and delete old versions). This only happens if a new version has some changes in it, so it should happen over time. A user who wants to speed this up can run the foreman-maintain script.

Comment 28 Justin Sherrill 2021-05-25 17:58:12 UTC
*** Bug 1921752 has been marked as a duplicate of this bug. ***

Comment 30 Bryan Kearney 2021-07-12 20:02:55 UTC
Upstream bug assigned to jsherril

Comment 31 Bryan Kearney 2021-07-12 20:02:58 UTC
Upstream bug assigned to jsherril

Comment 35 Bryan Kearney 2021-08-03 20:03:48 UTC
Moving this bug to POST for triage into Satellite since the upstream issue https://projects.theforeman.org/issues/32966 has been resolved.

Comment 37 Justin Sherrill 2021-08-19 20:07:25 UTC
The pulp puppet module didn't seem to get an update. I'd expect 8.2.0 according to https://gitlab.sat.engineering.redhat.com/satellite6/foreman-installer/-/commit/5d18354bda7d7e0f9926e2ce834d5319b46edf1b

but I see 8.1.1:

# cat  /usr/share/foreman-installer/modules/pulp/metadata.json  | grep version 
  "version": "8.1.1",


# rpm -q foreman-installer
foreman-installer-2.3.1.18-1.el7sat.noarch

Comment 39 Justin Sherrill 2021-08-19 21:25:12 UTC
This is a build issue; see https://bugzilla.redhat.com/show_bug.cgi?id=1903367#c37 for details.

Comment 41 Danny Synk 2021-08-26 20:03:52 UTC
Verified on Satellite 6.9.5, snap 3.

Steps to Test:
1. On a separate Satellite server from the Satellite 6.9.5 instance being tested, create a directory in /var/www/html/pub:

# mkdir /var/www/html/pub/repo

2. Change into the directory, download an RPM to the directory, and create a repository in the directory:

~~~
# cd /var/www/html/pub/repo
# wget https://jlsherrill.fedorapeople.org/fake-repos/needed-errata/bear-4.1-1.noarch.rpm
# createrepo .
~~~

3. On the Content > Products page of the Satellite webUI, create a new custom product.
4. Create a custom repository in the product with the repository created in the previous step as the upstream URL.
5. Synchronize the repository.
6. On the Satellite hosting the upstream repository, download a second RPM to the repository directory and run `createrepo` again:

~~~
# wget https://jlsherrill.fedorapeople.org/fake-repos/needed-errata/camel-0.1-1.noarch.rpm
# createrepo .
~~~

7. On the Satellite 6.9.5 instance being tested, synchronize the custom repository a second time.
8. Create a new content view containing the custom repository.
9. Publish three versions of the content view.
10. Check each published content view version for duplicate repodata:

~~~
# ls /var/lib/pulp/published/yum/master/yum_distributor/1-repodata_dupe_test-v3_0-cb49cf3b-fe30-4dad-96ec-a689a705fc53/1630006867.47/repodata/
3cf3c635f08ac92c0483f0ca96eaf60f62131d1ae828d44d94ae43e2fc2d0346-primary.xml.gz
3d9cbba2f0b04239db929749a6d57ed975cb659240040a46bd4616bbe33049ff-other.xml.gz
a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
ab2a4d42f8a2b35c95b8d9b5f946d76688f1316eb0bf48752948101da67447c9-updateinfo.xml.gz
cf62ad33787a800a447a349e4969df93e4e2b6c7ddb5395a2b98d9b84dfa5a26-filelists.xml.gz
repomd.xml

# ls /var/lib/pulp/published/yum/master/yum_distributor/1-repodata_dupe_test-v2_0-cb49cf3b-fe30-4dad-96ec-a689a705fc53/1630006855.15/repodata/
3cf3c635f08ac92c0483f0ca96eaf60f62131d1ae828d44d94ae43e2fc2d0346-primary.xml.gz
3d9cbba2f0b04239db929749a6d57ed975cb659240040a46bd4616bbe33049ff-other.xml.gz
a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
ab2a4d42f8a2b35c95b8d9b5f946d76688f1316eb0bf48752948101da67447c9-updateinfo.xml.gz
cf62ad33787a800a447a349e4969df93e4e2b6c7ddb5395a2b98d9b84dfa5a26-filelists.xml.gz
repomd.xml

# ls /var/lib/pulp/published/yum/master/yum_distributor/1-repodata_dupe_test-v1_0-cb49cf3b-fe30-4dad-96ec-a689a705fc53/1630006838.87/repodata/
3cf3c635f08ac92c0483f0ca96eaf60f62131d1ae828d44d94ae43e2fc2d0346-primary.xml.gz
3d9cbba2f0b04239db929749a6d57ed975cb659240040a46bd4616bbe33049ff-other.xml.gz
a27718cc28ec6d71432e0ef3e6da544b7f9d93f6bb7d0a55aacd592d03144b70-comps.xml
ab2a4d42f8a2b35c95b8d9b5f946d76688f1316eb0bf48752948101da67447c9-updateinfo.xml.gz
cf62ad33787a800a447a349e4969df93e4e2b6c7ddb5395a2b98d9b84dfa5a26-filelists.xml.gz
repomd.xml
~~~

Expected Results:
No duplicate repodata files are present in any of the content view versions.

Actual Results:
No duplicate repodata files are present in any of the content view versions.
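
The check in step 10 can also be scripted rather than done by eye. The following is a minimal sketch, not part of the verification that was actually run. The function name `duplicate_repodata_types` is hypothetical, and it assumes that metadata files are named `<checksum>-<type>`, as in the listings above.

```shell
#!/bin/sh
# Hypothetical helper: print each metadata type (e.g. primary.xml.gz) that
# occurs more than once in the given repodata directory.
# Assumes file names of the form <checksum>-<type>; repomd.xml has no dash
# in its name and is therefore skipped by the glob.
duplicate_repodata_types() {
    dir=$1
    for f in "$dir"/*-*; do
        [ -e "$f" ] || continue
        name=${f##*/}
        # Strip the leading checksum and the first dash, keeping the type.
        printf '%s\n' "${name#*-}"
    done | sort | uniq -d
}
```

For a directory like the ones listed in step 10, where each type appears once, this should print nothing. Any output names a metadata type that has more than one file.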

Comment 46 errata-xmlrpc 2021-08-31 12:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Satellite 6.9.5 Async Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3387

