Bug 1915976 - DNF/RPM Copy on Write enablement for all variants
Summary: DNF/RPM Copy on Write enablement for all variants
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: Changes Tracking
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Matthew Almond
QA Contact:
URL:
Whiteboard:
Depends On: 1919003 1922920
Blocks: F42Changes
Reported: 2021-01-13 20:58 UTC by Ben Cotton
Modified: 2024-08-26 17:48 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rpm-software-management librepo pull 222 | 0 | None | open | Add support for rpm2extents transcoder | 2023-07-31 19:04:18 UTC
Github | rpm-software-management rpm pull 1470 | 0 | None | closed | RPM with Copy on Write | 2023-07-31 19:04:21 UTC
Github | rpm-software-management rpm pull 2378 | 0 | None | closed | RPM with Copy on Write | 2023-07-31 19:04:24 UTC

Internal Links: 1922920

Description Ben Cotton 2021-01-13 20:58:19 UTC
This is a tracking bug for Change: DNF/RPM Copy on Write enablement for all variants
For more details, see: https://fedoraproject.org/wiki/Changes/RPMCoW

RPM Copy on Write provides a better experience for Fedora Users as it reduces the amount of I/O and offsets CPU cost of package decompression. RPM Copy on Write uses reflinking capabilities in btrfs, which is the default filesystem in Fedora 33 for most variants.
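
For readers unfamiliar with reflinking, here is a minimal sketch of the underlying mechanism, assuming Linux and Python 3. It is not code from the rpm or librepo PRs. The helper name clone_or_copy is hypothetical; FICLONE is the ioctl request number defined in <linux/fs.h>. On a filesystem with reflink support (such as btrfs) the destination shares the source's extents instead of receiving a byte copy; anywhere else the sketch falls back to a plain copy.

```python
import fcntl
import os
import shutil
import tempfile

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int)
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Hypothetical helper: reflink src to dst if possible, else copy bytes.

    Returns True when the filesystem cloned the extents, False when a
    regular byte-for-byte copy was made instead.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        try:
            # Ask the filesystem to make dst share src's extents.
            fcntl.ioctl(fout.fileno(), FICLONE, fin.fileno())
            return True
        except OSError:
            # Assumption for this sketch: any ioctl failure (no reflink
            # support, cross-device, sandboxing) means "do a normal copy".
            shutil.copyfileobj(fin, fout)
            return False


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "source.bin")
    target = os.path.join(workdir, "target.bin")
    with open(source, "wb") as f:
        f.write(b"x" * 4096)
    print("reflinked" if clone_or_copy(source, target) else "copied")
```

Whether the clone path is actually faster than a copy depends on the filesystem and the workload, which is why the thread below asks for measurements.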

Comment 1 Matthew Almond 2021-01-14 23:39:24 UTC
Here's my plan

1. Fix the simple stuff in librepo PR
2. Re-write my super trivial dnf python plugin as a libdnf plugin. This eliminates a new top level package from Fedora. It will be a sub-package.
3. Prototype signature/digest verification during transcode (rpm2extents) in rpm. I think this is challenging, but not impossible.
4. Address other issues in rpm PR
5. Produce PRs for the src.fedoraproject.org/rpms packages. This is simple spec changes, and patches derived from PRs. Depending on rate of updates, I might get some or all of the upstream PRs accepted/merged. I expect each will need some kind of version bump so we can update rawhide
6. Performance numbers. I have another patch for rpm that adds a "measure" plugin. It's super hacky and needs a dnf plugin to collect the values. I aim to get some public numbers
7. I want to make the measure tool usable for others, so we go beyond "trust me" to something independently verifiable.

I am aiming for end of January for all this, and I realize that this is insanely optimistic. For visibility, I'm giving a talk at CentOS Dojo 2021 @ Fosdem (https://hopin.com/events/centos-dojo-fosdem) on Feb 5th which is a 45 minute presentation. I aim to cover what we've got so far, and (more interestingly) what I think we can do next.

Comment 2 Matthew Almond 2021-01-20 02:15:14 UTC
1. done
2. I thought I could avoid creating a new package in Fedora by contributing a change to libdnf. Turns out this is not the right approach. The right approach for libdnf is to use separate sources and reference libdnf-devel. Add to this the complexity that CentOS doesn't ship with -devel packages, and the complexity is much higher than I anticipated. My plan now: I *will* create a new package in Fedora, nominally called 'dnf-plugin-cow' that builds python3-dnf-plugin-cow (naming convention), and later will build libdnf-plugin-cow.
3. I will switch focus to this - I have partial code. I need to spend more time on it.

Comment 3 Konstantin 2021-01-20 11:59:53 UTC
Proposal authors say:

> Ballpark performance difference is about half the duration for file download+install time

It sounds all good, the problem is, reflinks are not necessarily faster than plain copy, so seeing actual numbers would be helpful.

Incidentally, just yesterday I benchmarked ccache with and without reflinks¹, and I found that reflinks/CoW was consistently 30% *slower* in building libinput than plain copy.

Of course `ccache` is not dnf, and since I'm no ccache dev, I can't know for sure that it isn't because ccache screwed up their reflinks usage badly. I doubt it though, in part because the benchmark also shows 30% less CPU usage, so apparently the excess time is spent doing IO, which is likely inside BTRFS. Anyway, that's the reasoning behind the question on actual numbers.

1: https://github.com/ccache/ccache/issues/213#issuecomment-762714286
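
A minimal sketch of how such a comparison could separate wall-clock time from CPU time, assuming Python 3. The helper name measure is hypothetical and this is neither the ccache benchmark nor the rpm "measure" plugin. A run that shows lower CPU time but higher wall time is spending the difference waiting, typically on I/O.

```python
import statistics
import time


def measure(fn, runs: int = 5) -> dict:
    """Hypothetical helper: run fn() several times, report median timings.

    wall_median is elapsed real time; cpu_median is user+system CPU time
    of this process only (child processes are not counted).
    """
    wall, cpu = [], []
    for _ in range(runs):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        fn()
        wall.append(time.perf_counter() - wall_start)
        cpu.append(time.process_time() - cpu_start)
    return {
        "wall_median": statistics.median(wall),
        "cpu_median": statistics.median(cpu),
    }


if __name__ == "__main__":
    # Stand-in workload; a real comparison would call a copy routine and a
    # reflink routine on the same files, ideally with caches dropped first.
    print(measure(lambda: sum(range(1_000_000))))
```

Because process_time covers only the calling process, a workload that shells out to another program needs a different CPU measurement, for example os.times() for child processes.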

Comment 4 Matthew Almond 2021-02-01 05:50:50 UTC
I wrote an excellent reply, and I just lost it all due to a form-repost action in Bugzilla. Gah. I'm going to summarize what I remember.

> 1. Fix the simple stuff in librepo PR
Done. Now we're pending PR1470 on rpm before this is accepted.

> 2. Re-write my super trivial dnf python plugin as a libdnf plugin. This eliminates a new top level package from Fedora. It will be a sub-package.

See https://github.com/facebookincubator/dnf-plugin-cow/issues/1

> 3. Prototype signature/digest verification during transcode (rpm2extents) in rpm. I think this is challenging, but not impossible.

There are two concerns:

## DoS of disk space on client
There's a potential DoS exploit where a rogue mirror could feed valid data into a compressor that could fill up the local storage during download. This sounds bad, but after looking at rpmfiArchiveReadToFilePsm() I see the header's idea of size is honored, so we only need to verify the header, not the header+payload.
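
To illustrate the idea of honoring a declared size, here is a minimal sketch, assuming Python 3 with zlib standing in for the payload compressor. This is not the logic of rpmfiArchiveReadToFilePsm(); the function name bounded_decompress and the declared_size parameter are hypothetical. The point is only that once a trusted header states the expected size, decompression can stop as soon as the output would exceed it.

```python
import zlib


def bounded_decompress(chunks, declared_size: int) -> bytes:
    """Hypothetical helper: inflate chunks, never emitting more than declared_size.

    chunks is an iterable of compressed byte strings (for example, pieces
    arriving from a download). Raises ValueError on overflow.
    """
    inflater = zlib.decompressobj()
    out = bytearray()
    for chunk in chunks:
        data = chunk
        while data:
            # Allow one byte beyond the remaining budget so that an
            # oversized payload is detected rather than silently truncated.
            budget = declared_size - len(out) + 1
            out += inflater.decompress(data, budget)
            if len(out) > declared_size:
                raise ValueError("payload exceeds size declared in header")
            # Input that was not consumed because the output cap was hit.
            data = inflater.unconsumed_tail
    return bytes(out)


if __name__ == "__main__":
    payload = zlib.compress(b"a" * 1000)
    pieces = [payload[i:i + 16] for i in range(0, len(payload), 16)]
    print(len(bounded_decompress(pieces, 1000)))
```

With this shape, a highly compressible stream from a rogue mirror costs at most declared_size + 1 bytes of output before it is rejected.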

## Potential Remote Code Execution (RCE) exploit on a vulnerable decompression library
I've been experimenting with conditional transcoding dependent on file size. Files under a limit (e.g. 64MB?) can be buffered in memory, revealing the full file digest, so it's possible to employ all the checks. The main restriction is that signatures with trusted key should not be transcoded. If a given file is over the limit, it is not transcoded. The bit that sucks here is that larger rpms are the ones that benefit most from transcoding.
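
The size-conditional approach can be sketched roughly as follows, assuming Python 3 and a SHA-256 digest obtained from trusted metadata. The function name plan_transcode, the decision strings, and the expected_sha256 parameter are all hypothetical; the 64 MiB default mirrors the "e.g. 64MB?" figure above and is not a settled value. The file is buffered only up to the limit, its full digest is checked before any decompression would run, and anything larger is left untranscoded.

```python
import hashlib
import io

# Illustrative limit only, taken from the "e.g. 64MB?" remark.
BUFFER_LIMIT = 64 * 1024 * 1024


def plan_transcode(stream, expected_sha256: str, limit: int = BUFFER_LIMIT):
    """Hypothetical helper: decide whether a downloaded file may be transcoded.

    Returns (decision, data). data holds the buffered bytes only when the
    decision is "transcode"; otherwise it is None.
    Assumes stream.read(n) returns n bytes unless EOF is reached, which
    holds for io.BytesIO and buffered files but not for raw sockets.
    """
    # Read one byte past the limit so "exactly at the limit" is accepted.
    buf = stream.read(limit + 1)
    if len(buf) > limit:
        return "skip-too-large", None
    # The whole file is in memory, so the full digest is available before
    # any decompression library touches the payload.
    if hashlib.sha256(buf).hexdigest() != expected_sha256:
        return "reject-digest-mismatch", None
    return "transcode", buf


if __name__ == "__main__":
    blob = b"example package bytes"
    digest = hashlib.sha256(blob).hexdigest()
    print(plan_transcode(io.BytesIO(blob), digest)[0])
```

The trade-off named in the comment is visible here: the "skip-too-large" branch is exactly where the largest packages, which would benefit most, end up.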

> 4. Address other issues in rpm PR

https://github.com/rpm-software-management/rpm/pull/1470/commits

> 5. Produce PRs for the src.fedoraproject.org/rpms packages. This is simple spec changes, and patches derived from PRs. Depending on rate of updates, I might get some or all of the upstream PRs accepted/merged. I expect each will need some kind of version bump so we can update rawhide

I've got forks of rpm and librepo, which I need to keep in sync with the PRs. I've made a COPR repo for testing on f33: https://copr.fedorainfracloud.org/coprs/malmond/rpmcow/

> 6. Performance numbers. I have another patch for rpm that adds a "measure" plugin. It's super hacky and needs a dnf plugin to collect the values. I aim to get some public numbers
> 7. I want to make the measure tool usable for others, so we go beyond "trust me" to something independently verifiable.

The goal is to satisfy: https://fedoraproject.org/wiki/Changes/RPMCoW#Performance_Metrics . I've added a new section https://fedoraproject.org/wiki/Changes/RPMCoW#update_2021-01-31 to explain progress. I'm going to use bug 1922920 to track this work.

Comment 5 Matthew Almond 2021-02-03 17:06:37 UTC
I've been communicating with the maintainer of RPM on the pull request [1], and it's become clear that this likely depends on the creation of a public, supportable API for RPM. This is not achievable within the window for Fedora 34, so I'm withdrawing the change for Fedora 34 at this time. I will continue to work on this, and expect to re-submit for Fedora 35.

[1] https://github.com/rpm-software-management/rpm/pull/1470#issuecomment-772410935

Comment 6 Ben Cotton 2021-02-03 17:10:13 UTC
(In reply to Matthew Almond from comment #5)
> I'm withdrawing the change for Fedora 34 at this time. I will
> continue to work on this, and expect to re-submit for Fedora 35.

You don't need to withdraw it if you don't want to. We can defer it to F35. However, if you think the end result will be significantly different from your F34 proposal, then it's better to withdraw and resubmit at a later time. I'll let you decide which approach is more appropriate.

Comment 7 Matthew Almond 2021-02-03 18:36:41 UTC
Didn't know that was an option here[1]. I do prefer to defer - the substance of the change proposal doesn't change, just some of the implementation details.

[1] based on https://docs.fedoraproject.org/en-US/program_management/changes_policy/

Comment 8 Ben Cotton 2021-02-03 18:43:59 UTC
Okay, I'll update the appropriate trackers, etc. I'll also add improving that documentation to explicitly address deferring changes.

Comment 9 Ben Cotton 2021-08-10 11:32:06 UTC
Deferring to F36 per malmond in chat.

Comment 10 Colin Walters 2021-11-24 17:04:49 UTC
It'd be clearer to call this change something like "DNF/RPM Copy on Write"
i.e. drop the "for all variants", since it doesn't apply for rpm-ostree based systems.

Comment 11 Ben Cotton 2022-02-08 21:07:33 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 36 development cycle.
Changing version to 36.

Comment 12 Ben Cotton 2022-02-08 21:15:06 UTC
Today we reached the Code Complete (testable) milestone in the F36 schedule: https://fedorapeople.org/groups/schedule/f-36/f-36-key-tasks.html

All code for this change should be complete enough for testing. You can indicate this by setting the bug status to MODIFIED. (If the code is fully complete, you can go ahead and set it to ON_QA.)

If you need to defer this Change to F37, please needinfo bcotton.

Comment 13 Manu Bretelle 2022-02-09 23:55:45 UTC
Work has been resumed on RPMCoW, progress made, but this is definitely not on schedule for F36. Let's defer to F37.

Comment 14 Ben Cotton 2022-02-10 14:12:26 UTC
Will do, thanks!

Comment 15 Ben Cotton 2022-08-09 16:02:52 UTC
Today we reached the Code Complete (Testable) milestone on the F37 schedule: https://fedorapeople.org/groups/schedule/f-37/f-37-key-tasks.html

At this time, all F37 Changes should be complete enough to be testable. You can indicate this by setting this tracker to the MODIFIED status. If the Change is 100% code complete, you can set the tracker to ON_QA. If you need to defer this Change to F38, please NEEDINFO me.

Changes that have not reached at least the MODIFIED status will be given to FESCo for evaluation of contingency plans.

Comment 16 Davide Cavalca 2022-08-09 20:20:57 UTC
This is still in progress, we're going to defer to Fedora 38.

Comment 17 Ben Cotton 2022-08-10 15:38:00 UTC
Deferring to F38, thanks!

Comment 18 Ben Cotton 2023-02-07 14:27:33 UTC
Today we reached the Code Complete (Testable) milestone on the F38 schedule: https://fedorapeople.org/groups/schedule/f-38/f-38-key-tasks.html

At this time, all F38 Changes should be complete enough to be testable. You can indicate this by setting this tracker to the MODIFIED status. If the Change is 100% code complete, you can set the tracker to ON_QA. If you need to defer this Change to F39, please NEEDINFO me.

Changes that have not reached at least the MODIFIED status will be given to FESCo for evaluation of contingency plans.

Comment 19 Davide Cavalca 2023-02-07 16:29:46 UTC
Let's defer to F39 please, https://github.com/rpm-software-management/rpm/pull/2378 is the latest iteration on this but it's not ready to merge yet.

Comment 20 Ben Cotton 2023-02-10 14:10:09 UTC
Paperwork handled

Comment 21 Adam Williamson 2023-08-22 21:03:34 UTC
It looks like this is still not done for F39: neither the latest rpm PR attempt nor the librepo PR has been merged. This looks definitely at-risk for F39. Do we need to defer to F40?

Comment 22 Davide Cavalca 2023-08-22 21:19:25 UTC
Yes, we should defer, thanks for the reminder. For the record, the latest work is happening in https://github.com/rpm-software-management/rpm/pull/2557, https://github.com/rpm-software-management/rpm/pull/2416 and https://github.com/rpm-software-management/rpm/discussions/2057

Comment 23 Aoife Moloney 2024-02-19 17:45:02 UTC
Hi Davide, Matthew, Manu,

Is this change still targeting the F40 release? We are past the Testable deadline for changes and are approaching the 100% complete deadline and beta freeze, so I am checking in on all F40 changes to see whether they are still on track or not. Any status update you can share would be most appreciated.


Thanks!
Aoife

Comment 24 Davide Cavalca 2024-02-19 18:40:33 UTC
Richard made progress on this Change and Matteo is picking up the work, but we won't be ready for F40. Let's defer to F41 please.

Comment 25 Davide Cavalca 2024-02-19 18:41:50 UTC
The latest iteration (not yet submitted upstream) is being developed in https://github.com/teknoraver/rpm/tree/cow

Comment 26 Adam Williamson 2024-02-21 02:50:54 UTC
OK, deferred (Aoife is on vacation). Thanks.

Comment 27 Geraldo Simião 2024-06-14 18:21:25 UTC
I see that this change is aimed at f41 now. How is this affected by the dnf5 change, approved for the same release?

Comment 28 Davide Cavalca 2024-06-15 07:23:51 UTC
We plan to convert the plugin to the new API. It's worth noting however that based on f2f conversations at DevConf this week it is likely this Change will be deferred again to f42.

Comment 29 Aoife Moloney 2024-08-22 12:08:05 UTC
Hi Davide & Matthew, is this change still on track for F41, or do you need to defer to F42? The change needs to be code complete by next Tuesday 27th before Beta freeze, so please let me know the status of the change and next steps. Thanks!

Comment 30 Matteo Croce 2024-08-22 14:24:12 UTC
Hi Aoife,

the CoW code was ported to the RPM hyperscale package:
https://git.centos.org/rpms/rpm/pull-request/18

And to latest master:
https://github.com/teknoraver/rpm/commits/cow/

So the code is still on track, but we won't do it for F41.

Regards,

Comment 31 Aoife Moloney 2024-08-22 14:25:45 UTC
Thanks for the update Matteo, would it make sense then to retarget this to F42? I can update the change wiki accordingly if so.

Comment 32 Matteo Croce 2024-08-22 14:38:45 UTC
Yes please.

The maintainers asked to share some technical details to decide how to integrate it into the current code base, so F42 is a safer choice.

Thanks,

Comment 33 Aoife Moloney 2024-08-26 17:48:55 UTC
Thanks for the confirmation Matteo, I have retargeted this bug to track against F42 and will make sure the wiki is updated too and check back in early in the F42 release cycle.

