Bug 1940951 - [RFE] To display total size of a yum\docker\iso\file\custom repository in Satellite GUI after syncing them [NEEDINFO]
Summary: [RFE] To display total size of a yum\docker\iso\file\custom repository in Sat...
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Repositories
Version: 6.11.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Satellite QE Team
URL:
Whiteboard:
Duplicates: 1694836 1902756
Depends On:
Blocks:
 
Reported: 2021-03-19 15:47 UTC by Sayan Das
Modified: 2024-03-26 00:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
fperalta: needinfo? (satellite6-bugs)
fperalta: needinfo? (dsinglet)




Links
System ID Private Priority Status Summary Last Updated
Foreman Issue Tracker 32143 0 Normal New Feature request to display the total size of a yum\docker\iso\file\custom repository in GUI after syncing them. 2023-04-05 14:28:14 UTC
Red Hat Knowledge Base (Solution) 7006345 0 None None None 2023-04-05 15:14:46 UTC

Description Sayan Das 2021-03-19 15:47:31 UTC
1. Proposed title of this feature request

To display the total size of a yum\docker\iso\file\custom repository in Satellite GUI after syncing them.

2. What is the nature and description of the request?

Satellite server should be able to show the total size of a repository [based on its metadata] once its sync has completed. If the size of the repository changes after a later sync, the updated total size should be reflected and displayed in the UI.


3. Why does the customer need this? (List the business requirements here)

Let's assume a scenario where a customer wants to enable 30+ repos and wants to sync all of their contents in Satellite while the download policy is set to immediate or the customer wants to do "Validate content sync" for 20+ yum repos.

They will literally have to guess how much disk space to allocate to /var/lib/pulp, and will sometimes end up adding either far too little or far too much space to that filesystem.

The idea here is for Satellite to calculate the total size of the repository from the PULP_MANIFEST or the stored metadata, and to display it in the UI at the end of the sync, wherever the repository data is presented.

That way, one can plan ahead for how much space all the enabled repositories may consume if we decide to download all of their content, and accordingly arrive at a near-accurate value for the disk space to allocate to /var/lib/pulp.
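
As a rough illustration of the idea for file-type repositories: a PULP_MANIFEST lists one file per line as relative path, SHA-256 digest, and size in bytes, so an expected total can be summed without downloading any content. The sketch below is only that, a sketch; the function name and the sample manifest are hypothetical, and this is not how Satellite or Pulp computes anything today.

```python
import csv
import io


def manifest_total_size(manifest_text):
    """Return (file_count, total_bytes) for the entries of a PULP_MANIFEST.

    Each non-empty line is expected to be: relative_path,sha256,size_in_bytes
    """
    count = 0
    total = 0
    for row in csv.reader(io.StringIO(manifest_text)):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        _relative_path, _sha256, size = row
        count += 1
        total += int(size)
    return count, total


# Hypothetical manifest content, for illustration only (digests are dummies).
example_manifest = (
    "images/boot.iso," + "0" * 64 + ",1048576\n"
    "docs/readme.txt," + "1" * 64 + ",2048\n"
)

file_count, total_bytes = manifest_total_size(example_manifest)
print(f"{file_count} files, {total_bytes} bytes expected")
```

For yum repositories the same approach would need the package sizes recorded in the repository metadata instead of a manifest.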


It is not strictly necessary to also show the actual size, i.e. the current size of the repo on the filesystem based on the content it has actually downloaded. If that can be calculated and displayed separately, however, it would be useful information for both the customer and the support team.

NOTE: I am well aware that some repos with different release versions, e.g. 7.9 vs 7Server, will have largely identical content, which is why I only want to calculate and show their expected total size.


4. How would the customer like to achieve this? (List the functional requirements here)

- From Satellite UI --> Content --> Sync Status --> Expand All, 
- Select a few repos and sync them
- Come back to the Content --> Products page, go into a product and then check the details of synced repositories where the Total Size (and if possible Consumer size) will be displayed.



5. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.

- From Satellite UI --> Content --> Sync Status --> Expand All, 
- Select a few repos and sync them
- Come back to the Content --> Products page, go into a product, and then check the details of synced repositories where the Total Size (and if possible Consumer size) will be displayed.

6. Is there already an existing RFE upstream or in Red Hat Bugzilla?

NA


7. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?

Suggested in Satellite 6.10 or Satellite 7.0


8. Is the sales team involved in this request and do they have any additional input?
No.


9. List any affected packages or components.

katello
pulp


10. Would the customer be able to assist in testing this functionality if implemented?

Not at this moment, but the support delivery team can.

Comment 2 Daniel Alley 2022-08-25 00:34:49 UTC
*** Bug 1694836 has been marked as a duplicate of this bug. ***

Comment 3 Daniel Alley 2022-08-25 00:37:57 UTC
*** Bug 1902756 has been marked as a duplicate of this bug. ***

Comment 8 Pavel Moravec 2023-04-05 15:14:46 UTC
This can be implemented relatively simply (up to the UX layer), at least for the already-downloaded content. See https://access.redhat.com/solutions/7006345 and in particular its core part:

(in pulpcore-manager shell):

from django.db.models import Count
from pulpcore.app.models import PublishedArtifact

# QuerySet of all artifacts that are published anywhere; each record consists of:
# - pulp_id of artifact (for reference from repo_artifacts below)
# - size of the artifact
# - count = number of times the artifact is published (e.g. via 2 versions of same CV for same root repo)
all_artifacts = PublishedArtifact.objects.filter(content_artifact__artifact__isnull=False) \
                                 .values('content_artifact__artifact__pulp_id', 'content_artifact__artifact__size') \
                                 .annotate(count=Count('content_artifact__artifact__pulp_id'))


If you generate the same QuerySet per Katello root repository (with rrepo being the repository name prefix), like:

    repo_artifacts = PublishedArtifact.objects.filter(content_artifact__artifact__isnull=False, publication__repository_version__repository__name__startswith=rrepo) \
                                              .values('content_artifact__artifact__pulp_id', 'content_artifact__artifact__size') \
                                              .annotate(count=Count('content_artifact__artifact__pulp_id'))

then you can get the "per root repo" statistics described in the KCS.

This data - for one root repository - can be computed ad hoc (e.g. on a click in the WebUI), since the underlying code runs within a few seconds for a root repo.

Just add some API and UX to it - *if* it is desired to provide exactly this information (it differs a bit from the RFE requirements, which ask for the total size of the repo, not the total size of already downloaded data).

Comment 9 Daniel Alley 2023-04-05 15:23:36 UTC
@Pavel the problem is that sizes are non-additive.  Sure, the number might say that the repo is 60gb, but if I delete it and run orphan cleanup it may be the case that zero space is actually freed up.  That's the primary issue - any UX we provide needs to be in sync with the user expectations around what those numbers mean and what actions they can take based on them.
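
To make the non-additivity concrete, here is a small self-contained sketch with made-up data (the repository names, artifact ids and sizes are all hypothetical). It compares the size a repository would report, the space actually used on disk, and the space that deleting one repository would free when artifacts are shared.

```python
# Hypothetical data: artifact id -> size in bytes.
artifact_size = {"a": 100, "b": 200, "c": 300}

# Hypothetical repositories and the artifacts each one references.
# Artifact "b" is shared, so it is stored on disk only once.
repo_artifacts = {
    "repo1": {"a", "b"},
    "repo2": {"b", "c"},
}


def reported_size(repo):
    """Sum of the sizes of all artifacts the repository references."""
    return sum(artifact_size[a] for a in repo_artifacts[repo])


def disk_usage():
    """Space used on disk: every distinct artifact counted once."""
    stored = set().union(*repo_artifacts.values())
    return sum(artifact_size[a] for a in stored)


def freed_by_deleting(repo):
    """Space reclaimed if only this repository is deleted and orphans are cleaned."""
    others = set().union(
        *(arts for name, arts in repo_artifacts.items() if name != repo)
    )
    return sum(artifact_size[a] for a in repo_artifacts[repo] - others)


print(sum(reported_size(r) for r in repo_artifacts))  # per-repo sizes added up
print(disk_usage())                                   # what the disk really holds
print(freed_by_deleting("repo1"))                     # what deleting repo1 frees
```

With this data the per-repository sizes add up to more than the disk actually holds, and deleting repo1 frees less than its reported size, which is the mismatch with user expectations described above.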

Comment 10 Pavel Moravec 2023-04-07 19:24:30 UTC
(In reply to Daniel Alley from comment #9)
> @Pavel the problem is that sizes are non-additive.  Sure, the number might
> say that the repo is 60gb, but if I delete it and run orphan cleanup it may
> be the case that zero space is actually freed up.  That's the primary issue
> - any UX we provide needs to be in sync with the user expectations around
> what those numbers mean and what actions they can take based on them.

That is why the script counts two pairs of values:
- number and sum of sizes of already downloaded packages - your comment applies to this
- number and sum of sizes of downloaded packages that are associated *just* with the repo (or any of its clones in a CV) - see how I calculate all_artifacts[artifact_uuid]['count'] as the number of repos associated with the artifact.

So when all_artifacts[artifact_uuid]['count'] == repo_artifacts[artifact_uuid]['count'], I know this artifact is associated just with this repo (or any of its clones). Only then do I add it to the "OwnSize".

So the OwnSize and OwnPkgs values counted for a repo should really stand for "if I deleted just the repo (and its clones), then orphan cleanup(*) would remove that number of packages of that cumulative size".

(*) with zero orphan protection time or after the "timeout" elapses.
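
The comparison described above can be sketched in plain Python, independent of the ORM. This is a simplified stand-in for the KCS script, not a copy of it: it assumes the two QuerySets have already been turned into dictionaries keyed by artifact id, and the "Pkgs"/"Size" key names and the sample data are hypothetical (only OwnPkgs and OwnSize are names taken from this comment).

```python
def own_stats(all_artifacts, repo_artifacts):
    """Compute overall and 'own' package counts and sizes for one root repo.

    Both arguments map artifact id -> {'size': bytes, 'count': publications}.
    all_artifacts counts publications across all repos, repo_artifacts only
    those of the root repo in question (including its clones).
    """
    stats = {"Pkgs": 0, "Size": 0, "OwnPkgs": 0, "OwnSize": 0}
    for artifact_id, rec in repo_artifacts.items():
        stats["Pkgs"] += 1
        stats["Size"] += rec["size"]
        # Equal counts mean no other root repo publishes this artifact.
        if all_artifacts[artifact_id]["count"] == rec["count"]:
            stats["OwnPkgs"] += 1
            stats["OwnSize"] += rec["size"]
    return stats


# Hypothetical data: artifact "x" is published only by this root repo
# (twice, e.g. via two CV versions); artifact "y" is also published elsewhere.
all_artifacts = {
    "x": {"size": 10, "count": 2},
    "y": {"size": 20, "count": 3},
}
repo_artifacts = {
    "x": {"size": 10, "count": 2},
    "y": {"size": 20, "count": 1},
}

print(own_stats(all_artifacts, repo_artifacts))
```

Here only artifact "x" contributes to OwnPkgs and OwnSize, because "y" has publications outside this root repo and would survive orphan cleanup.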


However, the original request can be read as "how much data might I need to store if all packages were downloaded?" - so I extended my script accordingly, by adding:

    repo_ca = PublishedArtifact.objects.filter(publication__repository_version__repository__name__startswith=rrepo).values('content_artifact__pulp_id')
    ra = RemoteArtifact.objects.filter(content_artifact__in=repo_ca)
    rrepos[rrepo]['RemotePkgs'] = ra.count()
    rrepos[rrepo]['RemoteSize'] = ra.aggregate(Sum('size'))['size__sum']

(ra = QuerySet of all RemoteArtifact objects associated with the repo or its clone)


Feel free to try(*) the script attached to https://access.redhat.com/solutions/7006345 ; I think it can be a handy tool on its own.

(*) preferably on Satellite, not on a standalone pulpcore server, since the script relies on the katello-defined naming convention of pulp repo names, per https://github.com/Katello/katello/blob/master/app/services/katello/pulp3/repository.rb#L113-L115 (the script truncates the trailing number in a repo name to identify sets of repo clones belonging to the same katello root repository)
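
The grouping step mentioned in the footnote could look roughly like the sketch below. It only assumes that clones of one root repository share a name prefix followed by a trailing number; the helper names and the sample repo names are hypothetical, and the actual script in the KCS may do this differently.

```python
import re
from collections import defaultdict


def root_prefix(pulp_repo_name):
    """Strip the trailing run of digits from a pulp repository name."""
    return re.sub(r"\d+$", "", pulp_repo_name)


def group_clones(names):
    """Group pulp repository names by their shared root prefix."""
    groups = defaultdict(list)
    for name in names:
        groups[root_prefix(name)].append(name)
    return dict(groups)


# Hypothetical pulp repository names, for illustration only.
names = ["my_repo-1234", "my_repo-98765", "other_repo-42"]
print(group_clones(names))
```

Each resulting prefix can then serve as the rrepo value in the name__startswith filters shown earlier. A repository name that does not follow the convention (for example on a standalone pulpcore server) would be grouped incorrectly, which is why the script is best tried on Satellite.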


Anyway, I agree the devil is in the details of how customers interpret the values (or what exactly they expect here). I understand there can be confusion, also due to the nonzero orphan protection time ("I removed the repo and ran orphan cleanup but nothing was really removed" support cases soon after the RFE is implemented :) ). So while I *think* I implemented a sort-of TUI version of the RFE, I understand it might not be possible to productise it in this way due to the interpretation of the values.

