Bug 2029709 - [btrfs] rpm takes a very long of time and CPU to copy files from papirus-icon-theme and flat-remix-icon-theme, possible regression
Summary: [btrfs] rpm takes a very long of time and CPU to copy files from papirus-icon...
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 2035684 2044082 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-12-07 06:37 UTC by angel.segarra
Modified: 2022-04-20 08:32 UTC (History)
47 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-20 08:32:44 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
dnf log of installation (2.63 KB, text/plain)
2021-12-07 06:37 UTC, angel.segarra
no flags Details
contents of /usr/share/icons/Papirus* (410.09 KB, application/x-xz)
2022-01-23 13:22 UTC, Rubén
no flags Details
contents of /usr/share/icons (479.04 KB, application/x-xz)
2022-01-23 15:10 UTC, Rubén
no flags Details
Flamegraph of the rpm transaction (168.17 KB, image/svg+xml)
2022-02-01 22:48 UTC, Ondrej Mosnacek
no flags Details

Description angel.segarra 2021-12-07 06:37:53 UTC
Created attachment 1845031 [details]
dnf log of installation

Description of problem:

dnf took 6 minutes and 100% cpu usage to install papirus-icon-theme 20210901-2.fc35

I'm on a 11th Gen Intel i7-1165G7 (8) @ 4.700GHz


Version-Release number of selected component (if applicable):

dnf - 4.9.0
papirus-icon-theme 20210901-2


How reproducible:
Happened on first install, was tempted to kill it but it eventually finished

Steps to Reproduce:
1. install papirus-icon-theme

Actual results:

took 6 minutes and 100% cpu usage

Expected results:
This package should install relatively quickly, way less than 6 minutes.

Additional info:

Comment 1 Marek Blaha 2021-12-14 15:29:16 UTC
My guess is that the package is quite big (30M the package, 103M installed) and contains a lot of files (106805 according to `dnf repoquery -l papirus-icon-theme-0:20210901-2.fc35.noarch) therefore rpm needs a long time to deal with it.
On my machine (with similar processor but super fast nvme disk) the instalation takes approx. 50 seconds.

I'm re-assigning the bug to papirius-icon-theme component for further investigation.

Comment 2 Robert-André Mauchin 🐧 2021-12-17 23:15:27 UTC
@mblaha yes there is more than 100,000 small files but it didn't use to take that much time. I can't pinpoint when it did slow down but it was working fine earlier this year. 
It takes 600 seconds on my system with an nvme and a recent Ryzen.
It seems that the copy of the files is very long and single-threaded.

Comment 3 Marek Blaha 2021-12-20 10:45:36 UTC
I did some tests on various versions of dnf / rpm / papirus-icon-theme on Fedora 34 and 35. In minimal containers all combinations of different versions behave basically the same - installation takes approximately 15 seconds. On my system it is still 50 seconds (no matter what version of dnf / rpm is used).
So the slow down must be caused by some factor other than rpm/dnf versions. But unfortunately I haven't found it yet.

Comment 4 postix 2021-12-26 22:31:06 UTC
*** Bug 2035684 has been marked as a duplicate of this bug. ***

Comment 5 postix 2021-12-26 22:33:44 UTC
Can confirm it on `Fedora 35´, `dnf-0:4.9.0-1.fc35.noarch` and the package `papirus-icon-theme-20211201-1.fc35.noarch` using a NVMe SSD.
I had not experienced this issue earlier with other versions of the package on the same machine.

Comment 6 P D 2021-12-26 22:38:24 UTC
Happens with flat-remix-icon-theme also. Didn't happen on older Fedora releases.

Comment 7 postix 2021-12-26 22:54:56 UTC
I also tried to install the papirus icon theme with zypper:
```
dnf install zypper
zypper addrepo /etc/yum.repos.d/fedora.repo
zypper refresh
zypper install papirus-icon-theme
```
the first part of the installation of `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quiet fast (< a minute) on my system, but the postrans script `/usr/bin/gtk-update-icon-cache /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes and consumes 100% of a CPU core. 

I guess this obversation is also applicable on `dnf`?

Comment 8 Panu Matilainen 2022-01-03 08:37:49 UTC
(In reply to postix from comment #7)
> the first part of the installation of
> `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quiet fast (< a minute)
> on my system, but the postrans script `/usr/bin/gtk-update-icon-cache
> /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes
> and consumes 100% of a CPU core. 
> 
> I guess this obversation is also applicable on `dnf`?

Since the time is spent on a scriptlet executing an external command, yes it doesn't matter which rpm client is being used.

Reassigning to gtk3, dnf can't help it if gtk-update-icon-cache has a performance regression.

Comment 9 Rubén 2022-01-20 18:00:09 UTC
Hello.

Same problem with papirus-icon-theme 2022-01-01

I didn't let it finish. I killed the task.

Comment 10 Robert-André Mauchin 🐧 2022-01-22 23:13:24 UTC
(In reply to Rubén from comment #9)
> Hello.
> 
> Same problem with papirus-icon-theme 2022-01-01
> 
> I didn't let it finish. I killed the task.

Can you list the current content of /usr/share/icons/

I can't repro the performance regression when I run gtk-update-icon-cache manually.

Comment 11 Rubén 2022-01-23 13:22:16 UTC
Created attachment 1852843 [details]
contents of /usr/share/icons/Papirus*

Comment 12 Rubén 2022-01-23 13:23:23 UTC
(In reply to Robert-André Mauchin 🐧 from comment #10)
> (In reply to Rubén from comment #9)
> > Hello.
> > 
> > Same problem with papirus-icon-theme 2022-01-01
> > 
> > I didn't let it finish. I killed the task.
> 
> Can you list the current content of /usr/share/icons/
> 
> I can't repro the performance regression when I run gtk-update-icon-cache
> manually.

Hello.

See attachment.

Comment 13 Robert-André Mauchin 🐧 2022-01-23 14:19:51 UTC
(In reply to Rubén from comment #12)
> (In reply to Robert-André Mauchin 🐧 from comment #10)
> > (In reply to Rubén from comment #9)
> > > Hello.
> > > 
> > > Same problem with papirus-icon-theme 2022-01-01
> > > 
> > > I didn't let it finish. I killed the task.
> > 
> > Can you list the current content of /usr/share/icons/
> > 
> > I can't repro the performance regression when I run gtk-update-icon-cache
> > manually.
> 
> Hello.
> 
> See attachment.

>contents of /usr/share/icons/Papirus*

I don't only want the Papirus ones, I'd like to see if there is no formerly used directory that haven't been properly removed.

Comment 14 Rubén 2022-01-23 15:10:40 UTC
Created attachment 1852856 [details]
contents of /usr/share/icons

Comment 15 Rubén 2022-01-23 15:15:05 UTC
(In reply to Robert-André Mauchin 🐧 from comment #13)
> (In reply to Rubén from comment #12)
> > (In reply to Robert-André Mauchin 🐧 from comment #10)
> > > (In reply to Rubén from comment #9)
> > > > Hello.
> > > > 
> > > > Same problem with papirus-icon-theme 2022-01-01
> > > > 
> > > > I didn't let it finish. I killed the task.
> > > 
> > > Can you list the current content of /usr/share/icons/
> > > 
> > > I can't repro the performance regression when I run gtk-update-icon-cache
> > > manually.
> > 
> > Hello.
> > 
> > See attachment.
> 
> >contents of /usr/share/icons/Papirus*
> 
> I don't only want the Papirus ones, I'd like to see if there is no formerly
> used directory that haven't been properly removed.

papirus-icon-theme package is not installed. In my case it was taking more than six minutes and one core at 100%. I didn't let it finish, but the folders remained at /usr/share/icons.

Shall I remove them manually?

Comment 16 Robert-André Mauchin 🐧 2022-01-23 22:38:51 UTC
(In reply to postix from comment #7)
> I also tried to install the papirus icon theme with zypper:
> ```
> dnf install zypper
> zypper addrepo /etc/yum.repos.d/fedora.repo
> zypper refresh
> zypper install papirus-icon-theme
> ```
> the first part of the installation of
> `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quiet fast (< a minute)
> on my system, but the postrans script `/usr/bin/gtk-update-icon-cache
> /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes
> and consumes 100% of a CPU core. 
> 
> I guess this obversation is also applicable on `dnf`?

I tested with a package with no scriplets and the problem still happens, so it is elsewhere.


(In reply to Rubén from comment #14)
> Created attachment 1852856 [details]
> contents of /usr/share/icons


What I noticed is that the directories contains a lot of

│   │   │   ├── action-unavailable.svg;61e7d3cc
│   │   │   ├── action-unavailable.svg;61e7d4bc
│   │   │   ├── action-unavailable.svg;61e9a2ba

which is happening on my system when the install is slow.
It seems the files are first copied with a suffix, then renamed to their final file name. And this step takes a lot and a lot of time.

@pmatilai Could you weight in with your experience regarding how files are copied?

(In reply to P D from comment #6)
> Happens with flat-remix-icon-theme also. Didn't happen on older Fedora
> releases.

This makes me think it's a rpm issue.
It also happens with rpm -i.

I'm reassigning to rpm because it seems the right component,

Comment 17 Robert-André Mauchin 🐧 2022-01-23 23:03:57 UTC
> (In reply to Rubén from comment #14)
> > Created attachment 1852856 [details]
> > contents of /usr/share/icons
> 
> 
> What I noticed is that the directories contains a lot of
> 
> │   │   │   ├── action-unavailable.svg;61e7d3cc
> │   │   │   ├── action-unavailable.svg;61e7d4bc
> │   │   │   ├── action-unavailable.svg;61e9a2ba
> 
> which is happening on my system when the install is slow.
> It seems the files are first copied with a suffix, then renamed to their
> final file name. And this step takes a lot and a lot of time.
> 
> @pmatilai Could you weight in with your experience regarding how
> files are copied?
> 
It also doesn't happen in a Mock chroot.

Comment 18 Robert-André Mauchin 🐧 2022-01-24 00:21:11 UTC
It seems mv is massively slower on the host rather in a nspawn chroot

[root@cassini icons]# mkdir test
[root@cassini icons]# for (( i = 0; i < 10000; i++ )) do > test/file_$i; done
[root@cassini icons]# cd test

On host:

time $(for f in *; do     mv "$f" "${f%}.txt"; done)
real    2m3,500s
user    0m3,966s
sys     2m0,431s

In nspawn container:

<mock-chroot> sh-5.1# time $(for f in *; do     mv "$f" "${f%}.txt"; done)
real    0m6.702s
user    0m4.237s
sys     0m3.344s

Since papirus-icon-theme contains more than 100,000 files, it is a problem. One minute of waiting is ok, 20 mn is not.

Could it be a filesystem problem? A regression that happened around F35 or just before?
I know btrfs became the default in F35, but I was using it since at least F27.

What are your filesystems guys who have this problem?

Comment 19 P D 2022-01-24 00:52:53 UTC
I use BTRFS and I have this issue. I have a cheap SSD and this wastes writes trying to install papirus or flat remix icon themes from DNF so I just install them manually instead. For me, I noticed this issue in Fedora 35. I don't recall it happening in Fedora 34.

Comment 20 Marek Blaha 2022-01-24 14:35:31 UTC
*** Bug 2044082 has been marked as a duplicate of this bug. ***

Comment 21 Panu Matilainen 2022-01-24 15:07:16 UTC
Rpm has always worked by unpacking files to a temporary suffix and then moving into the final location. What has changed in F35 (rpm >= 4.17) is that now rpm first unpacks all files from a package and only if that succeeds, renames them to final destination, whereas older versions did this on file-by-file basis. This is always just a rename(3), never a copy.
So if rename() is taking ages then it sounds like a filesystem thing indeed. I haven't seen this on any of my systems, but then they're all on ext4.

Comment 22 Marek Blaha 2022-01-24 15:27:49 UTC
I can reproduce the behavior from comment#18. I'm on Fedora 35, ext4 filesystem:

On the host:

# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm 

real    0m30.845s
user    0m22.084s
sys     0m6.788s

In the podman container:

$ podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm

real    0m5.530s
user    0m1.748s
sys     0m3.561s

Comment 23 Rubén 2022-01-24 19:05:01 UTC
(In reply to Robert-André Mauchin 🐧 from comment #18)
> It seems mv is massively slower on the host rather in a nspawn chroot
> 
> [root@cassini icons]# mkdir test
> [root@cassini icons]# for (( i = 0; i < 10000; i++ )) do > test/file_$i; done
> [root@cassini icons]# cd test
> 
> On host:
> 
> time $(for f in *; do     mv "$f" "${f%}.txt"; done)
> real    2m3,500s
> user    0m3,966s
> sys     2m0,431s
> 
> In nspawn container:
> 
> <mock-chroot> sh-5.1# time $(for f in *; do     mv "$f" "${f%}.txt"; done)
> real    0m6.702s
> user    0m4.237s
> sys     0m3.344s
> 
> Since papirus-icon-theme contains more than 100,000 files, it is a problem.
> One minute of waiting is ok, 20 mn is not.
> 
> Could it be a filesystem problem? A regression that happened around F35 or
> just before?
> I know btrfs became the default in F35, but I was using it since at least
> F27.
> 
> What are your filesystems guys who have this problem?

Hi.

I also use btrfs, I've always had got papirus-icon-theme package in my system but never this issue until version 2022-01-01.

Regards.

Comment 24 Chris Murphy 2022-01-24 23:51:06 UTC
For what it's worth, Btrfs became the default on desktops in Fedora 33. And for Cloud edition in Fedora 35.

Comment 25 Chris Murphy 2022-01-25 00:45:55 UTC
I've asked a btrfs developer to take a look at this, even though the same problem is happening on ext4.

Is there an fsync() after each rename()? Or do you do one sync() at the completion of the rename()? Using fsync() 100K+ times is going to hammer the log on ext4, xfs, and btrfs, and it'll be slow. So it sounds like it could be a per file fsync()? If so, what's the motivation of doing that? Is rpm able to know and handle partially installed rpm's somehow and recover? Does it know the system is in some in-between state? Because if it doesn't, the package that's been interrupted needs to be reinstalled, correct? If that's true, fsync'ing the files individually after each rename() helps with what exactly?

Comment 26 Florian Festi 2022-01-25 09:59:07 UTC
Normally rpm does not call fsync on single files. It does fsync for database changes. But this is only once per package. We also run sync() or syncfs() once at the end of the transaction.

But there is a macro to call fsync for every file - not on rename but after writing: _flush_io
This feature should be turned off - as it obviously slows things down a lot. I wonder if something in a regular installation turns this on (try rpm -E "%{?_flush_io}") but those mock or other chroots are small enough to not include this.

Comment 27 Panu Matilainen 2022-01-27 10:40:36 UTC
I'm seeing effectively the reverse here, host is a bit sluggish but podman is almost twice slower. This is very repeatable:

[pmatilai🎩︎localhost rpm]$ podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
[root@13ca4f74dfa0 /]# 
[...]
[root@13ca4f74dfa0 /]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm 

real	0m55.563s
user	0m4.146s
sys	0m8.002s

Whereas on the host:
[root@localhost tmp]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm  
real	0m37.459s
user	0m26.588s
sys	0m9.677s

(In reply to Robert-André Mauchin 🐧 from comment #18)
> It seems mv is massively slower on the host rather in a nspawn chroot

Mv being slower too still suggests the issue is not rpm.

Comment 28 Panu Matilainen 2022-01-27 10:53:22 UTC
Oh... there's a huge difference between root and non-root podman. If running podman as root, I too see the enormous speedup:

[root@localhost ~]# podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
[...]
[root@93b06c0eea74 /]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm 
real	0m7.409s
user	0m2.341s
sys	0m4.868s

Comment 29 Panu Matilainen 2022-01-27 11:09:59 UTC
So the issue is two-fold:

1) On the container image, there's no /usr/bin/gtk-update-icon-cache which amounts to roughly half of the execution time (add --noscripts on host to witness the effect)
2) On the container image, there's also no SELinux whose labels need setting (or other plugins), which amounts to the other half of the execution time

So what you see on the container is quite literally equivalent to the following (in this case, not in general):

# time rpm -i --noplugins --noscripts papirus-icon-theme-20220101-1.fc35.noarch.rpm 
real	0m6.541s
user	0m2.263s
sys	0m3.437s

...and lo and behold, the time on the host is similar now. If you do more work it usually takes more time.

Comment 30 Robert-André Mauchin 🐧 2022-01-27 17:30:16 UTC
Thank you for taking the time to check this.

I don't understand why it started with F35 though. Nor what I can do about it.

Comment 31 Chris Murphy 2022-01-27 18:03:09 UTC
(In reply to Florian Festi from comment #26)
> Normally rpm does not call fsync on single files. It does fsync for database
> changes. But this is only once per package. We also run sync() or syncfs()
> once at the end of the transaction.

Seems reasonable.
 
> But there is a macro to call fsync for every file - not on rename but after
> writing: _flush_io
> This feature should be turned off - as it obviously slows things down a lot.

Agreed.

(In reply to Panu Matilainen from comment #27)
> I'm seeing effectively the reverse here, host is a bit sluggish but podman
> is almost twice slower. This is very repeatable:

The different storage stacks could be a partial factor in folks getting different relative results? Fedora default, rootful podman uses kernel overlayfs, rootless podman uses fuse-overlayfs. ext4 folks get a fairly expensive copy-up operation, when required. Whereas on XFS and Btrfs, overlayfs will automatically use reflinks for the copy-up which is much faster. Anyone on Btrfs who opts into the btrfs graph driver would also see different results.


(In reply to Robert-André Mauchin 🐧 from comment #30)
> I don't understand why it started with F35 though. Nor what I can do about
> it.

Me neither. bcc-tools includes 'fileslower' which looks at the VFS layer to see what files have high latencies, which could help give a clue *what* is slow, but not why.

Comment 32 Panu Matilainen 2022-01-28 11:12:13 UTC
> a clue *what* is slow, but not why

See comment #29, it seems that SELinux is what the "slow" thing is. As to whether it's slower than in F34, or whether people just didn't happen to have SELinux enabled in F34 or some other configuration difference, only apples to apples testing would tell.

Mind you, I don't think this is in any way a papirus-icon-theme bug. It may have more files than the average package, but that's not a bug, it just is.

Comment 33 Ondrej Mosnacek 2022-01-28 18:11:23 UTC
(In reply to Panu Matilainen from comment #32)
> See comment #29, it seems that SELinux is what the "slow" thing is. As to
> whether it's slower than in F34, or whether people just didn't happen to
> have SELinux enabled in F34 or some other configuration difference, only
> apples to apples testing would tell.

The SELinux labeling work does slow things down, but not to an unbearable extent (considering the package has >100k files). However, it seems most of the measurements here were done on ext4, but I did some experiments on the default Fedora Cloud images (which are BTRFS since F35) and there is evidence of something being seriously messed up on F35 + BTRFS, with no such issue on F34 + ext4:

Fedora 34 VM (ext4)
===================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc34.noarch.rpm
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 16.60    0.424584           3    110475           rename
 12.41    0.317400           5     54693           symlink
 11.37    0.290779           5     56230       134 openat
 10.66    0.272696           2    110607           lsetxattr
  9.73    0.248995           1    128825    111169 newfstatat
  6.82    0.174607           3     55843           write
  5.46    0.139770           1    110612           utimensat
  4.84    0.123849           1     90201         3 fcntl
  4.32    0.110634           2     54693           lchown
  3.71    0.094849           1     55914           chown
  2.86    0.073044           0    166394           getuid
  2.44    0.062481           1     55914           chmod
  2.14    0.054804           1     27593           pwrite64
  1.62    0.041437           0     56133           close
  1.61    0.041313           0    111572           umask
  1.44    0.036952           1     22999           pread64
  0.56    0.014367          11      1249           read
  0.50    0.012762         580        22        10 wait4
  0.19    0.004875        1218         4           fdatasync
  0.18    0.004621        4621         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00    2.558383           2   1271418    111370 total

Fedora 34 VM (ext4) + SELinux disabled
======================================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc34.noarch.rpm
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 21.34    0.778346           7    110475           rename
 12.87    0.469338           8     56148       134 openat
 11.26    0.410647           7     54693           symlink
 11.03    0.402248           3    128744    111166 newfstatat
  8.53    0.311059           5     55839           write
  6.73    0.245509           2    110612           utimensat
  4.69    0.170887           3     55914           chown
  4.47    0.162871           2     54693           lchown
  3.94    0.143688           0    166394           getuid
  3.39    0.123622           1     90201         3 fcntl
  3.36    0.122498           2     55914           chmod
  2.68    0.097760           0    111572           umask
  2.42    0.088147           1     56051           close
  1.21    0.044141           1     27543           pwrite64
  0.88    0.031937           1     22944           pread64
  0.43    0.015644          14      1070           read
  0.23    0.008402         381        22        10 wait4
  0.08    0.003030         757         4           fdatasync
  0.08    0.002875        2875         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00    3.646814           3   1160269    111368 total

Fedora 35 VM (btrfs)
====================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.31  798.092903        7224    110475           rename
  0.15    1.236912           9    128821    111169 newfstatat
  0.15    1.170149          20     56222       137 openat
  0.13    1.078128          19     54693           symlink
  0.08    0.653151          11     55843           write
  0.03    0.260786           2    110607           lsetxattr
  0.02    0.189932           1    111572           umask
  0.02    0.178879           1     90209         3 fcntl
  0.02    0.165392           2     56125           close
  0.01    0.094274           0    110612           utimensat
  0.01    0.088903           1     54693           lchown
  0.01    0.088510           3     27232           pwrite64
  0.01    0.067501           1     55914           chown
  0.01    0.063673           2     22672           pread64
  0.01    0.057810           0    166394           getuid
  0.01    0.040419           0     55914           chmod
  0.00    0.024754          20      1232           read
  0.00    0.021616         982        22        10 wait4
  0.00    0.009522        2380         4           fdatasync
  0.00    0.004072         313        13         5 execve
  0.00    0.004011        4011         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00  803.603148         632   1270718    111375 total


Fedora 35 VM (btrfs) + SELinux disabled
=======================================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.77  437.400856        3959    110475           rename
  0.24    1.062701           8    128740    111166 newfstatat
  0.16    0.702037          12     56140       137 openat
  0.14    0.637711          11     54693           symlink
  0.11    0.486302           4    110612           utimensat
  0.11    0.469732           2    166394           getuid
  0.11    0.465608           8     55839           write
  0.08    0.357098           6     55914           chown
  0.08    0.350235           6     54693           lchown
  0.06    0.250576           4     55914           chmod
  0.03    0.153681           1     90209         3 fcntl
  0.03    0.138061           1    111572           umask
  0.03    0.116479           2     56043           close
  0.02    0.110487           4     27291           pwrite64
  0.02    0.077224           3     22727           pread64
  0.00    0.019560          18      1052           read
  0.00    0.014404        3601         4           fdatasync
  0.00    0.013865         630        22        10 wait4
  0.00    0.009208        9208         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00  442.846943         381   1159792    111373 total

Basically, on BTRFS the rename(2) calls are extremely slow for some reason, regardless of SELinux (though disabling SELinux makes a difference, they're still way too slow). When watching the strace output during the rename(2) loop on BTRFS, I noticed that the rename(2) calls are fast for the first files in a given directory, but gradually get slower and slower as the directory fills up. As soon as rpm advances to the next directory, the rename(2)s go back to being fast until the directory fills up again.

The excessive 6m installation time reported in comment #0 therefore seems to be coming mostly from the BTRFS bug, not SELinux. Also note there were no excessive fsync-like calls as suspected in comment #26.

Comment 34 Chris Murphy 2022-01-28 20:05:53 UTC
@Ondrej can you let us know what kernel version you're using for the results in c33? Plus for the Btrfs file system setup, what mount options, and the result from 'btrfs fi df $mountpoint'. Thanks.

Comment 35 Ondrej Mosnacek 2022-01-28 21:03:39 UTC
It was freshly provisioned VMs from the official Fedora Cloud qcow2 images. So it should be very easy to replicate the setups. The virtual disk was located on a tmpfs mount (so there should be no effect of the host filesystem/disk).

F34 kernel: 5.11.12-300.fc34.x86_64
F35 kernel: 5.14.10-300.fc35.x86_64

BTRFS details:
# mount | grep btrfs
/dev/vda5 on / type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root)
/dev/vda5 on /home type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/home)

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
zram0  251:0    0  1.9G  0 disk [SWAP]
vda    252:0    0    5G  0 disk 
├─vda1 252:1    0    1M  0 part 
├─vda2 252:2    0  500M  0 part /boot
├─vda3 252:3    0  100M  0 part /boot/efi
├─vda4 252:4    0    4M  0 part 
└─vda5 252:5    0  4.4G  0 part /home
                                /

Before papirus-icon-theme installation:
# btrfs fi df /
Data, single: total=936.00MiB, used=247.16MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=24.73MiB
GlobalReserve, single: total=3.25MiB, used=0.00B

After papirus-icon-theme installation:
# btrfs fi df /
Data, single: total=936.00MiB, used=526.66MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=131.72MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
# df -h | grep vda5
/dev/vda5       4.5G  794M  3.4G  19% /
/dev/vda5       4.5G  794M  3.4G  19% /home

BTW, I also tried to downgrade to the F34 kernel (5.11.12-300.fc34.x86_64) on the BTFRS machine and with that kernel the overhead of rename(2)s and the duration of the whole installation was normal (the rename(2) overhead somewhat worse than on ext4, but OK). Also tried to upgrade to the latest F35 kernel (5.15.16-200.fc35), but got the same result as with the 5.14.10. Didn't try anything else.

Comment 36 Chris Murphy 2022-01-28 22:44:10 UTC
@Ondrej thanks for the detailed report!

Comment 37 Vitaly 2022-01-29 18:59:06 UTC
Can confirm the same issue with breeze-icon-theme on NVMe SSD and ext4+LUKS.

ML thread: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/5LZK3LPNKG74RSCCE5PWO45KPKLIOBGU/

Comment 38 Ondrej Mosnacek 2022-01-29 20:59:20 UTC
(In reply to Vitaly Zaitsev from comment #37)
> Can confirm the same issue with breeze-icon-theme on NVMe SSD and ext4+LUKS.
> 
> ML thread:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/
> thread/5LZK3LPNKG74RSCCE5PWO45KPKLIOBGU/

I digged into this and it's actually a different bug, this time in rpm itself:
https://bugzilla.redhat.com/show_bug.cgi?id=2048168

Comment 39 Josef Bacik 2022-02-01 19:54:34 UTC
FYI this is likely because of the behavior of rename() for btrfs and ext4.  RPM does the

write newfile
rename newfile oldfile
unlink newfile

dance.  RPM (and lots of other applications) assume that rename is atomic, as in they'll get either newfile or oldfile.  But instead you get the 0 length files if you crash.  What ext4 and btrfs do to make this less likely to happen is we detect you're doing this, and we flush the dirty data for newfile to disk, so the application gets the behavior it's expecting without needing to actually do the thing correctly, which would be to do

write newfile
fsync newfile
rename newfile oldfile
unlink newfile

If RPM/application did this, you'd see fsync taking forever, and rename would be normal.

But Josef, if both ext4 and btrfs are doing the same thing, why does btrfs suck so badly?  Well you're kind of comparing apples to oranges here.  Btrfs with transparent compression on means that your writes are getting splayed out into the CPU's to do the compressing, so you're going to eat a lot of CPU.  This is generally worth it, because generally you're saving space on disk, and time writing to disk is usually greater than CPU time spent compressing things.  ext4 doesn't have this, tho I imagine if you put it in front of LUKS or something else you'd see a similar drop.

As an aside there's also another gotcha when it comes to rename on btrfs.  If you were to, in a loop, do the following

write newfile
fsync newfile
rename newfile oldfile
unlink newfile

You would trip over another fun thing where we have to go through the fsync log and make sure we preserve the rename in the log, as we fsynced the file and then changed its name.  This can cause a certain amount of overhead compared with ext4, tho not on the level you're seeing.  There's work in this area to fix this overhead and it's been going in over the last few releases, but I believe the most recent work is still in our tree.

Comment 40 Ondrej Mosnacek 2022-02-01 22:46:57 UTC
(In reply to Josef Bacik from comment #39)
> If RPM/application did this, you'd see fsync taking forever, and rename
> would be normal.
> 
> But Josef, if both ext4 and btrfs are doing the same thing, why does btrfs
> suck so badly?  Well you're kind of comparing apples to oranges here.  Btrfs
> with transparent compression on means that your writes are getting splayed
> out into the CPU's to do the compressing, so you're going to eat a lot of
> CPU.  This is generally worth it, because generally you're saving space on
> disk, and time writing to disk is usually greater than CPU time spent
> compressing things.  ext4 doesn't have this, tho I imagine if you put it in
> front of LUKS or something else you'd see a similar drop.

Judging by the flame graph (will attach shortly), this doesn't seem to be what's happening (though I may misunderstand). Almost all of the rename(2) time is spent in btrfs_log_new_name()/btrfs_log_inode_parent() and I can't see any obvious evidence of compression or implicit data flushing being involved.

Also, let me reiterate that when I was watching the output of strace on the reproducer, I saw that for each directory the rename(2) calls were fast in the beginning, then got slower and slower as rpm crunched through the given directory, and then jumped back to fast at the beginning of the next directory. So for some reason batch-renaming N files in the same directory appears to have O(N^2) complexity instead of the ideal/expected O(N) and that could explain why the installation is so much slower for this particular package (it contains several directories with thousands of files).

I don't know if my non-filesystem-developer interpretation is correct, but the O(N^2)-like behavior of btrfs_log_new_name() seems suspicious to me. Any idea why it behaves like that and if it can be optimized?

Comment 41 Ondrej Mosnacek 2022-02-01 22:48:11 UTC
Created attachment 1858492 [details]
Flamegraph of the rpm transaction

Comment 42 Ondrej Mosnacek 2022-02-02 18:09:02 UTC
So I randomly decided to check the linux-btrfs mailing list for any relevant reports/patches and look what was right near the top :)
https://lore.kernel.org/linux-btrfs/cover.1642676248.git.fdmanana@suse.com/T/

Sounds like it fixes exactly this problem. And seems to be already merged in -next:
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

Comment 43 Josef Bacik 2022-02-03 00:16:56 UTC
Ondrej, yup that's the work I was referring to.  We were doing more than what was required of us for fsync and rename, and it could bog down like this.  Filipe has been working a lot in this area recently, the patches he has will make it much better, and IIRC he has some more work to do in this area, so it should just get better from here.

In the meantime I'll try and work out an artificial performance test to reproduce this behavior and add it to our nightly performance testing so we don't regress this in the future.

Comment 44 Zbigniew Jędrzejewski-Szmek 2022-02-13 16:14:52 UTC
I think I saw the same issue in a different context: dovecot (IMAP server) would take a few minutes to
write changes to a user's Maildir, after a modification that changed read status of thousands of messages.
Attaching strace showed that it was renaming thousands of files very slowly. (Maildir has one file
per message and uses filename suffixes like ":2,RS" or ":2,ajm" to set flags on messages.)

kernel-core-5.14.10-300.fc35.x86_64
btrfs with rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root
selinux enabled

Comment 45 Ondrej Mosnacek 2022-04-20 08:32:44 UTC
This seems to be already fixed in current rawhide kernel (and even in the stable 5.16.x kernels).


Note You need to log in before you can comment on or make changes to this bug.