Created attachment 1845031 [details]
dnf log of installation

Description of problem:
dnf took 6 minutes and 100% CPU usage to install papirus-icon-theme 20210901-2.fc35. I'm on an 11th Gen Intel i7-1165G7 (8) @ 4.700GHz.

Version-Release number of selected component (if applicable):
dnf 4.9.0
papirus-icon-theme 20210901-2

How reproducible:
Happened on first install; I was tempted to kill it, but it eventually finished.

Steps to Reproduce:
1. Install papirus-icon-theme

Actual results:
Took 6 minutes at 100% CPU usage.

Expected results:
This package should install relatively quickly, far less than 6 minutes.

Additional info:
My guess is that the package is quite big (30M for the package, 103M installed) and contains a lot of files (106805 according to `dnf repoquery -l papirus-icon-theme-0:20210901-2.fc35.noarch`), therefore rpm needs a long time to deal with it. On my machine (with a similar processor but a super fast NVMe disk) the installation takes approx. 50 seconds.

I'm re-assigning the bug to the papirus-icon-theme component for further investigation.
@mblaha yes, there are more than 100,000 small files, but it didn't use to take that much time. I can't pinpoint when it slowed down, but it was working fine earlier this year. It takes 600 seconds on my system with an NVMe and a recent Ryzen. It seems that the copying of the files is very long and single-threaded.
I did some tests on various versions of dnf / rpm / papirus-icon-theme on Fedora 34 and 35. In minimal containers, all combinations of different versions behave basically the same: installation takes approximately 15 seconds. On my system it is still 50 seconds (no matter which version of dnf / rpm is used). So the slowdown must be caused by some factor other than the rpm/dnf versions, but unfortunately I haven't found it yet.
*** Bug 2035684 has been marked as a duplicate of this bug. ***
Can confirm it on `Fedora 35`, `dnf-0:4.9.0-1.fc35.noarch` and the package `papirus-icon-theme-20211201-1.fc35.noarch` using an NVMe SSD. I had not experienced this issue earlier with other versions of the package on the same machine.
Happens with flat-remix-icon-theme also. Didn't happen on older Fedora releases.
I also tried to install the papirus icon theme with zypper:
```
dnf install zypper
zypper addrepo /etc/yum.repos.d/fedora.repo
zypper refresh
zypper install papirus-icon-theme
```
The first part of the installation of `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quite fast (< a minute) on my system, but the posttrans script `/usr/bin/gtk-update-icon-cache /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes and consumes 100% of a CPU core.

I guess this observation is also applicable to `dnf`?
(In reply to postix from comment #7)
> the first part of the installation of
> `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quiet fast (< a minute)
> on my system, but the postrans script `/usr/bin/gtk-update-icon-cache
> /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes
> and consumes 100% of a CPU core.
>
> I guess this obversation is also applicable on `dnf`?

Since the time is spent on a scriptlet executing an external command, yes, it doesn't matter which rpm client is being used.

Reassigning to gtk3; dnf can't help it if gtk-update-icon-cache has a performance regression.
Hello. Same problem with papirus-icon-theme 2022-01-01 I didn't let it finish. I killed the task.
(In reply to Rubén from comment #9)
> Hello.
>
> Same problem with papirus-icon-theme 2022-01-01
>
> I didn't let it finish. I killed the task.

Can you list the current content of /usr/share/icons/?

I can't repro the performance regression when I run gtk-update-icon-cache manually.
Created attachment 1852843 [details] contents of /usr/share/icons/Papirus*
(In reply to Robert-André Mauchin 🐧 from comment #10)
> (In reply to Rubén from comment #9)
> > Hello.
> >
> > Same problem with papirus-icon-theme 2022-01-01
> >
> > I didn't let it finish. I killed the task.
>
> Can you list the current content of /usr/share/icons/
>
> I can't repro the performance regression when I run gtk-update-icon-cache
> manually.

Hello.

See attachment.
(In reply to Rubén from comment #12)
> (In reply to Robert-André Mauchin 🐧 from comment #10)
> > (In reply to Rubén from comment #9)
> > > Hello.
> > >
> > > Same problem with papirus-icon-theme 2022-01-01
> > >
> > > I didn't let it finish. I killed the task.
> >
> > Can you list the current content of /usr/share/icons/
> >
> > I can't repro the performance regression when I run gtk-update-icon-cache
> > manually.
>
> Hello.
>
> See attachment.
> contents of /usr/share/icons/Papirus*

I don't want only the Papirus ones; I'd like to see whether any formerly used directories haven't been properly removed.
Created attachment 1852856 [details] contents of /usr/share/icons
(In reply to Robert-André Mauchin 🐧 from comment #13)
> (In reply to Rubén from comment #12)
> > (In reply to Robert-André Mauchin 🐧 from comment #10)
> > > (In reply to Rubén from comment #9)
> > > > Hello.
> > > >
> > > > Same problem with papirus-icon-theme 2022-01-01
> > > >
> > > > I didn't let it finish. I killed the task.
> > >
> > > Can you list the current content of /usr/share/icons/
> > >
> > > I can't repro the performance regression when I run gtk-update-icon-cache
> > > manually.
> >
> > Hello.
> >
> > See attachment.
> > contents of /usr/share/icons/Papirus*
>
> I don't only want the Papirus ones, I'd like to see if there is no formerly
> used directory that haven't been properly removed.

The papirus-icon-theme package is not installed. In my case it was taking more than six minutes with one core at 100%. I didn't let it finish, but the folders remained in /usr/share/icons. Shall I remove them manually?
(In reply to postix from comment #7)
> the first part of the installation of
> `papirus-icon-theme-20210701-1.fc35.noarch.rpm` was quiet fast (< a minute)
> on my system, but the postrans script `/usr/bin/gtk-update-icon-cache
> /usr/share/icons/{ePapirus|Papirus|Papirus-Light|...}` takes several minutes
> and consumes 100% of a CPU core.
>
> I guess this obversation is also applicable on `dnf`?

I tested with a package with no scriptlets and the problem still happens, so it is elsewhere.

(In reply to Rubén from comment #14)
> Created attachment 1852856 [details]
> contents of /usr/share/icons

What I noticed is that the directories contain a lot of

│ │ │ ├── action-unavailable.svg;61e7d3cc
│ │ │ ├── action-unavailable.svg;61e7d4bc
│ │ │ ├── action-unavailable.svg;61e9a2ba

which happens on my system when the install is slow. It seems the files are first copied with a suffix, then renamed to their final file name, and this step takes a lot of time.

@pmatilai Could you weigh in with your experience regarding how files are copied?

(In reply to P D from comment #6)
> Happens with flat-remix-icon-theme also. Didn't happen on older Fedora
> releases.

This makes me think it's an rpm issue. It also happens with rpm -i. I'm reassigning to rpm because it seems the right component.
> (In reply to Rubén from comment #14)
> > Created attachment 1852856 [details]
> > contents of /usr/share/icons
>
> What I noticed is that the directories contains a lot of
>
> │ │ │ ├── action-unavailable.svg;61e7d3cc
> │ │ │ ├── action-unavailable.svg;61e7d4bc
> │ │ │ ├── action-unavailable.svg;61e9a2ba
>
> which is happening on my system when the install is slow.
> It seems the files are first copied with a suffix, then renamed to their
> final file name. And this step takes a lot and a lot of time.
>
> @pmatilai Could you weight in with your experience regarding how
> files are copied?

It also doesn't happen in a Mock chroot.
It seems mv is massively slower on the host than in an nspawn chroot:

[root@cassini icons]# mkdir test
[root@cassini icons]# for (( i = 0; i < 10000; i++ )) do > test/file_$i; done
[root@cassini icons]# cd test

On host:

time $(for f in *; do mv "$f" "${f%}.txt"; done)
real 2m3,500s
user 0m3,966s
sys 2m0,431s

In nspawn container:

<mock-chroot> sh-5.1# time $(for f in *; do mv "$f" "${f%}.txt"; done)
real 0m6.702s
user 0m4.237s
sys 0m3.344s

Since papirus-icon-theme contains more than 100,000 files, it is a problem. One minute of waiting is OK; 20 min is not.

Could it be a filesystem problem? A regression that happened around F35 or just before? I know btrfs became the default in F35, but I had been using it since at least F27.

What are your filesystems, those of you who have this problem?
I use BTRFS and I have this issue. I have a cheap SSD and this wastes writes trying to install papirus or flat remix icon themes from DNF so I just install them manually instead. For me, I noticed this issue in Fedora 35. I don't recall it happening in Fedora 34.
*** Bug 2044082 has been marked as a duplicate of this bug. ***
Rpm has always worked by unpacking files to a temporary suffix and then moving them into the final location. What has changed in F35 (rpm >= 4.17) is that rpm now first unpacks all files from a package and, only if that succeeds, renames them to their final destination, whereas older versions did this on a file-by-file basis. This is always just a rename(2), never a copy. So if rename() is taking ages, then it sounds like a filesystem thing indeed. I haven't seen this on any of my systems, but then they're all on ext4.
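[Editorial note: the two-pass unpack-then-rename strategy described above can be sketched roughly like this. This is a hypothetical Python illustration, not rpm's actual C code; the temp-suffix format and the `install_files`/`payload` names are made up for the example.]

```python
import os

def install_files(payload, destdir):
    """Sketch of the rpm >= 4.17 strategy: unpack every file under a
    temporary suffixed name first, and only once the whole package has
    unpacked successfully, rename everything into place. Older rpm did
    the write+rename on a file-by-file basis instead."""
    staged = []
    for name, data in payload:  # payload: iterable of (filename, bytes)
        tmp = os.path.join(destdir, name + ";1a2b3c4d")  # hypothetical suffix
        with open(tmp, "wb") as f:
            f.write(data)
        staged.append((tmp, os.path.join(destdir, name)))
    # Second pass: rename(2) is atomic, so readers never observe a
    # half-written file under its final name. Note there is no per-file
    # fsync here; durability is handled separately (syncfs at txn end).
    for tmp, final in staged:
        os.rename(tmp, final)
```

On a package with >100k files, this second pass is exactly the burst of rename(2) calls visible in the strace output later in this bug.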
I can reproduce the behavior from comment #18. I'm on Fedora 35, ext4 filesystem.

On the host:

# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m30.845s
user 0m22.084s
sys 0m6.788s

In a podman container:

$ podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m5.530s
user 0m1.748s
sys 0m3.561s
(In reply to Robert-André Mauchin 🐧 from comment #18)
> It seems mv is massively slower on the host rather in a nspawn chroot
> [...]
> Could it be a filesystem problem? A regression that happened around F35 or
> just before? I know btrfs became the default in F35, but I was using it
> since at least F27.
>
> What are your filesystems guys who have this problem?

Hi.

I also use btrfs. I've always had the papirus-icon-theme package on my system, but never had this issue until version 2022-01-01.

Regards.
For what it's worth, Btrfs became the default on desktops in Fedora 33. And for Cloud edition in Fedora 35.
I've asked a btrfs developer to take a look at this, even though the same problem is happening on ext4.

Is there an fsync() after each rename()? Or do you do one sync() at the completion of the renames? Using fsync() 100K+ times is going to hammer the log on ext4, xfs, and btrfs, and it'll be slow.

So it sounds like it could be a per-file fsync()? If so, what's the motivation for doing that? Is rpm able to know and handle partially installed rpms somehow and recover? Does it know the system is in some in-between state? Because if it doesn't, the package that's been interrupted needs to be reinstalled, correct? If that's true, fsyncing the files individually after each rename() helps with what exactly?
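[Editorial note: the cost difference being asked about here can be illustrated with a hypothetical micro-benchmark, not rpm code. On a journaling filesystem, the per-file fsync() variant forces a log commit for every single file, while a single global flush at the end batches them all; the function name and shape below are made up for the example.]

```python
import os, time

def write_files(destdir, count, per_file_fsync):
    """Write `count` small files. Either fsync() each one (one journal/log
    commit per file, which is what hammers the log with 100K+ files) or
    do a single sync() at the very end, as rpm's transaction-end flush does.
    Returns elapsed wall-clock seconds."""
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(destdir, "f%d" % i), "wb") as f:
            f.write(b"x" * 4096)
            if per_file_fsync:
                os.fsync(f.fileno())   # commit this file's data now
    if not per_file_fsync:
        os.sync()                      # one flush covering everything above
    return time.perf_counter() - start
```

With large counts on a real disk, the per-file variant is typically slower by a large factor; as the following comments establish, rpm normally takes the batched path unless the `_flush_io` macro is enabled.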
Normally rpm does not call fsync on single files. It does fsync for database changes, but only once per package. We also run sync() or syncfs() once at the end of the transaction.

But there is a macro that makes rpm call fsync for every file (not on rename, but after writing): `_flush_io`. This feature should be turned off, as it obviously slows things down a lot. I wonder if something in a regular installation turns this on (try `rpm -E "%{?_flush_io}"`) but those mock or other chroots are small enough not to include it.
I'm seeing effectively the reverse here: the host is a bit sluggish, but podman is almost twice as slow. This is very repeatable:

[pmatilai🎩︎localhost rpm]$ podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
[root@13ca4f74dfa0 /]# [...]
[root@13ca4f74dfa0 /]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m55.563s
user 0m4.146s
sys 0m8.002s

Whereas on the host:

[root@localhost tmp]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m37.459s
user 0m26.588s
sys 0m9.677s

(In reply to Robert-André Mauchin 🐧 from comment #18)
> It seems mv is massively slower on the host rather in a nspawn chroot

Mv being slower too still suggests the issue is not rpm.
Oh... there's a huge difference between root and non-root podman. If running podman as root, I too see the enormous speedup:

[root@localhost ~]# podman run --rm -ti registry.fedoraproject.org/fedora:35 bash
[...]
[root@93b06c0eea74 /]# time rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m7.409s
user 0m2.341s
sys 0m4.868s
So the issue is two-fold:

1) On the container image, there's no /usr/bin/gtk-update-icon-cache, which accounts for roughly half of the execution time (add --noscripts on the host to witness the effect)

2) On the container image, there's also no SELinux whose labels need setting (or other plugins), which accounts for the other half of the execution time

So what you see on the container is quite literally equivalent to the following (in this case, not in general):

# time rpm -i --noplugins --noscripts papirus-icon-theme-20220101-1.fc35.noarch.rpm
real 0m6.541s
user 0m2.263s
sys 0m3.437s

...and lo and behold, the time on the host is similar now. If you do more work, it usually takes more time.
Thank you for taking the time to check this. I don't understand why it started with F35 though. Nor what I can do about it.
(In reply to Florian Festi from comment #26)
> Normally rpm does not call fsync on single files. It does fsync for database
> changes. But this is only once per package. We also run sync() or syncfs()
> once at the end of the transaction.

Seems reasonable.

> But there is a macro to call fsync for every file - not on rename but after
> writing: _flush_io
> This feature should be turned off - as it obviously slows things down a lot.

Agreed.

(In reply to Panu Matilainen from comment #27)
> I'm seeing effectively the reverse here, host is a bit sluggish but podman
> is almost twice slower. This is very repeatable:

Could the different storage stacks be a partial factor in folks getting different relative results? By Fedora default, rootful podman uses kernel overlayfs and rootless podman uses fuse-overlayfs. ext4 folks get a fairly expensive copy-up operation when required, whereas on XFS and Btrfs, overlayfs will automatically use reflinks for the copy-up, which is much faster. Anyone on Btrfs who opts into the btrfs graph driver would also see different results.

(In reply to Robert-André Mauchin 🐧 from comment #30)
> I don't understand why it started with F35 though. Nor what I can do about
> it.

Me neither. bcc-tools includes 'fileslower', which looks at the VFS layer to see what files have high latencies; that could help give a clue *what* is slow, but not why.
> a clue *what* is slow, but not why

See comment #29, it seems that SELinux is what the "slow" thing is. As to whether it's slower than in F34, or whether people just didn't happen to have SELinux enabled in F34 or some other configuration difference, only apples to apples testing would tell.

Mind you, I don't think this is in any way a papirus-icon-theme bug. It may have more files than the average package, but that's not a bug, it just is.
(In reply to Panu Matilainen from comment #32)
> See comment #29, it seems that SELinux is what the "slow" thing is. As to
> whether it's slower than in F34, or whether people just didn't happen to
> have SELinux enabled in F34 or some other configuration difference, only
> apples to apples testing would tell.

The SELinux labeling work does slow things down, but not to an unbearable extent (considering the package has >100k files). However, it seems most of the measurements here were done on ext4, so I did some experiments on the default Fedora Cloud images (which are BTRFS since F35) and there is evidence of something being seriously messed up on F35 + BTRFS, with no such issue on F34 + ext4:

Fedora 34 VM (ext4)
===================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc34.noarch.rpm

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 16.60    0.424584           3    110475           rename
 12.41    0.317400           5     54693           symlink
 11.37    0.290779           5     56230       134 openat
 10.66    0.272696           2    110607           lsetxattr
  9.73    0.248995           1    128825    111169 newfstatat
  6.82    0.174607           3     55843           write
  5.46    0.139770           1    110612           utimensat
  4.84    0.123849           1     90201         3 fcntl
  4.32    0.110634           2     54693           lchown
  3.71    0.094849           1     55914           chown
  2.86    0.073044           0    166394           getuid
  2.44    0.062481           1     55914           chmod
  2.14    0.054804           1     27593           pwrite64
  1.62    0.041437           0     56133           close
  1.61    0.041313           0    111572           umask
  1.44    0.036952           1     22999           pread64
  0.56    0.014367          11      1249           read
  0.50    0.012762         580        22        10 wait4
  0.19    0.004875        1218         4           fdatasync
  0.18    0.004621        4621         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00    2.558383           2   1271418    111370 total

Fedora 34 VM (ext4) + SELinux disabled
======================================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc34.noarch.rpm

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 21.34    0.778346           7    110475           rename
 12.87    0.469338           8     56148       134 openat
 11.26    0.410647           7     54693           symlink
 11.03    0.402248           3    128744    111166 newfstatat
  8.53    0.311059           5     55839           write
  6.73    0.245509           2    110612           utimensat
  4.69    0.170887           3     55914           chown
  4.47    0.162871           2     54693           lchown
  3.94    0.143688           0    166394           getuid
  3.39    0.123622           1     90201         3 fcntl
  3.36    0.122498           2     55914           chmod
  2.68    0.097760           0    111572           umask
  2.42    0.088147           1     56051           close
  1.21    0.044141           1     27543           pwrite64
  0.88    0.031937           1     22944           pread64
  0.43    0.015644          14      1070           read
  0.23    0.008402         381        22        10 wait4
  0.08    0.003030         757         4           fdatasync
  0.08    0.002875        2875         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00    3.646814           3   1160269    111368 total

Fedora 35 VM (btrfs)
====================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.31  798.092903        7224    110475           rename
  0.15    1.236912           9    128821    111169 newfstatat
  0.15    1.170149          20     56222       137 openat
  0.13    1.078128          19     54693           symlink
  0.08    0.653151          11     55843           write
  0.03    0.260786           2    110607           lsetxattr
  0.02    0.189932           1    111572           umask
  0.02    0.178879           1     90209         3 fcntl
  0.02    0.165392           2     56125           close
  0.01    0.094274           0    110612           utimensat
  0.01    0.088903           1     54693           lchown
  0.01    0.088510           3     27232           pwrite64
  0.01    0.067501           1     55914           chown
  0.01    0.063673           2     22672           pread64
  0.01    0.057810           0    166394           getuid
  0.01    0.040419           0     55914           chmod
  0.00    0.024754          20      1232           read
  0.00    0.021616         982        22        10 wait4
  0.00    0.009522        2380         4           fdatasync
  0.00    0.004072         313        13         5 execve
  0.00    0.004011        4011         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00  803.603148         632   1270718    111375 total

Fedora 35 VM (btrfs) + SELinux disabled
=======================================

# strace -f --summary-only rpm -i papirus-icon-theme-20220101-1.fc35.noarch.rpm

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.77  437.400856        3959    110475           rename
  0.24    1.062701           8    128740    111166 newfstatat
  0.16    0.702037          12     56140       137 openat
  0.14    0.637711          11     54693           symlink
  0.11    0.486302           4    110612           utimensat
  0.11    0.469732           2    166394           getuid
  0.11    0.465608           8     55839           write
  0.08    0.357098           6     55914           chown
  0.08    0.350235           6     54693           lchown
  0.06    0.250576           4     55914           chmod
  0.03    0.153681           1     90209         3 fcntl
  0.03    0.138061           1    111572           umask
  0.03    0.116479           2     56043           close
  0.02    0.110487           4     27291           pwrite64
  0.02    0.077224           3     22727           pread64
  0.00    0.019560          18      1052           read
  0.00    0.014404        3601         4           fdatasync
  0.00    0.013865         630        22        10 wait4
  0.00    0.009208        9208         1           syncfs
[...]
------ ----------- ----------- --------- --------- ----------------
100.00  442.846943         381   1159792    111373 total

Basically, on BTRFS the rename(2) calls are extremely slow for some reason, regardless of SELinux (though disabling SELinux makes a difference, they're still way too slow). When watching the strace output during the rename(2) loop on BTRFS, I noticed that the rename(2) calls are fast for the first files in a given directory, but gradually get slower and slower as the directory fills up. As soon as rpm advances to the next directory, the rename(2)s go back to being fast until that directory fills up again.

The excessive 6m installation time reported in comment #0 therefore seems to be coming mostly from the BTRFS bug, not SELinux. Also note there were no excessive fsync-like calls, as suspected in comment #26.
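[Editorial note: the per-directory slowdown described above can be checked with a small script along the following lines. This is an illustrative sketch, not part of the report; the function name and bucket sizes are made up. On an affected kernel, later buckets should take disproportionately longer than earlier ones within the same directory.]

```python
import os, tempfile, time

def rename_latency_buckets(n=5000, bucket=500):
    """Create n empty files in one directory, then rename them all in
    order, timing each bucket of `bucket` renames. If renaming N files
    in one directory is O(N) overall, bucket times stay roughly flat;
    O(N^2)-like behavior shows up as buckets that keep growing as the
    directory fills with renamed files."""
    d = tempfile.mkdtemp()
    for i in range(n):
        open(os.path.join(d, "file_%06d" % i), "w").close()
    names = sorted(os.listdir(d))
    times = []
    for start in range(0, n, bucket):
        t0 = time.perf_counter()
        for name in names[start:start + bucket]:
            os.rename(os.path.join(d, name), os.path.join(d, name + ".new"))
        times.append(time.perf_counter() - t0)
    return times
```

Plotting the returned list for a directory with a few thousand entries would make the O(N) vs O(N^2) distinction visible at a glance.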
@Ondrej can you let us know what kernel version you're using for the results in c33? Plus for the Btrfs file system setup, what mount options, and the result from 'btrfs fi df $mountpoint'. Thanks.
It was freshly provisioned VMs from the official Fedora Cloud qcow2 images, so it should be very easy to replicate the setups. The virtual disk was located on a tmpfs mount (so there should be no effect of the host filesystem/disk).

F34 kernel: 5.11.12-300.fc34.x86_64
F35 kernel: 5.14.10-300.fc35.x86_64

BTRFS details:

# mount | grep btrfs
/dev/vda5 on / type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root)
/dev/vda5 on /home type btrfs (rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=258,subvol=/home)

# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
zram0   251:0    0  1.9G  0 disk [SWAP]
vda     252:0    0    5G  0 disk
├─vda1  252:1    0    1M  0 part
├─vda2  252:2    0  500M  0 part /boot
├─vda3  252:3    0  100M  0 part /boot/efi
├─vda4  252:4    0    4M  0 part
└─vda5  252:5    0  4.4G  0 part /home
                                 /

Before papirus-icon-theme installation:

# btrfs fi df /
Data, single: total=936.00MiB, used=247.16MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=24.73MiB
GlobalReserve, single: total=3.25MiB, used=0.00B

After papirus-icon-theme installation:

# btrfs fi df /
Data, single: total=936.00MiB, used=526.66MiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=131.72MiB
GlobalReserve, single: total=3.25MiB, used=0.00B

# df -h | grep vda5
/dev/vda5 4.5G 794M 3.4G 19% /
/dev/vda5 4.5G 794M 3.4G 19% /home

BTW, I also tried downgrading to the F34 kernel (5.11.12-300.fc34.x86_64) on the BTRFS machine, and with that kernel the overhead of the rename(2)s and the duration of the whole installation were normal (the rename(2) overhead somewhat worse than on ext4, but OK). I also tried upgrading to the latest F35 kernel (5.15.16-200.fc35), but got the same result as with 5.14.10. Didn't try anything else.
@Ondrej thanks for the detailed report!
Can confirm the same issue with breeze-icon-theme on NVMe SSD and ext4+LUKS. ML thread: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/5LZK3LPNKG74RSCCE5PWO45KPKLIOBGU/
(In reply to Vitaly Zaitsev from comment #37)
> Can confirm the same issue with breeze-icon-theme on NVMe SSD and ext4+LUKS.
>
> ML thread:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/
> thread/5LZK3LPNKG74RSCCE5PWO45KPKLIOBGU/

I dug into this and it's actually a different bug, this time in rpm itself: https://bugzilla.redhat.com/show_bug.cgi?id=2048168
FYI this is likely because of the behavior of rename() for btrfs and ext4. RPM does the

write newfile
rename newfile oldfile
unlink newfile

dance. RPM (and lots of other applications) assume that rename is atomic, as in they'll get either newfile or oldfile. But instead you get 0-length files if you crash. What ext4 and btrfs do to make this less likely to happen is detect that you're doing this and flush the dirty data for newfile to disk, so the application gets the behavior it's expecting without needing to actually do the thing correctly, which would be:

write newfile
fsync newfile
rename newfile oldfile
unlink newfile

If RPM/the application did this, you'd see fsync taking forever, and rename would be normal.

But Josef, if both ext4 and btrfs are doing the same thing, why does btrfs suck so badly? Well, you're kind of comparing apples to oranges here. Btrfs with transparent compression on means that your writes are getting splayed out onto the CPUs to do the compressing, so you're going to eat a lot of CPU. This is generally worth it, because you're usually saving space on disk, and time writing to disk is usually greater than CPU time spent compressing things. ext4 doesn't have this, though I imagine if you put it in front of LUKS or something else you'd see a similar drop.

As an aside, there's also another gotcha when it comes to rename on btrfs. If you were to, in a loop, do the following:

write newfile
fsync newfile
rename newfile oldfile
unlink newfile

you would trip over another fun thing where we have to go through the fsync log and make sure we preserve the rename in the log, as we fsynced the file and then changed its name. This can cause a certain amount of overhead compared with ext4, though not on the level you're seeing. There's work in this area to fix this overhead and it's been going in over the last few releases, but I believe the most recent work is still in our tree.
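[Editorial note: the "correct" write/fsync/rename sequence described above looks roughly like this in application code. This is a generic crash-safe-replacement sketch, not rpm's implementation; the `atomic_replace` name is made up for the example. The directory fsync at the end additionally makes the rename itself durable, which goes a step beyond the sequence in the comment.]

```python
import os

def atomic_replace(path, data):
    """Crash-safe replacement of `path`: write a temp file, fsync it so
    the data is on disk *before* the rename, atomically rename over the
    target, then fsync the parent directory so the rename is durable."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # without this, a crash can leave a 0-length file
    os.rename(tmp, path)       # atomic: readers see old or new content, never empty
    dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dirfd)        # persist the directory entry change as well
    finally:
        os.close(dirfd)
```

The trade-off discussed in this bug is exactly the per-file fsync() here: it restores the crash guarantee, but with 100K+ files it moves the cost from rename() to fsync() rather than eliminating it.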
(In reply to Josef Bacik from comment #39)
> If RPM/application did this, you'd see fsync taking forever, and rename
> would be normal.
>
> But Josef, if both ext4 and btrfs are doing the same thing, why does btrfs
> suck so badly? Well you're kind of comparing apples to oranges here. Btrfs
> with transparent compression on means that your writes are getting splayed
> out into the CPU's to do the compressing, so you're going to eat a lot of
> CPU. This is generally worth it, because generally you're saving space on
> disk, and time writing to disk is usually greater than CPU time spent
> compressing things. ext4 doesn't have this, tho I imagine if you put it in
> front of LUKS or something else you'd see a similar drop.

Judging by the flame graph (will attach shortly), this doesn't seem to be what's happening (though I may misunderstand). Almost all of the rename(2) time is spent in btrfs_log_new_name()/btrfs_log_inode_parent(), and I can't see any obvious evidence of compression or implicit data flushing being involved.

Also, let me reiterate that when I was watching the output of strace on the reproducer, I saw that for each directory the rename(2) calls were fast in the beginning, then got slower and slower as rpm crunched through the given directory, and then jumped back to fast at the beginning of the next directory. So for some reason batch-renaming N files in the same directory appears to have O(N^2) complexity instead of the ideal/expected O(N), and that could explain why the installation is so much slower for this particular package (it contains several directories with thousands of files).

I don't know if my non-filesystem-developer interpretation is correct, but the O(N^2)-like behavior of btrfs_log_new_name() seems suspicious to me. Any idea why it behaves like that and if it can be optimized?
Created attachment 1858492 [details] Flamegraph of the rpm transaction
So I randomly decided to check the linux-btrfs mailing list for any relevant reports/patches and look what was right near the top :) https://lore.kernel.org/linux-btrfs/cover.1642676248.git.fdmanana@suse.com/T/ Sounds like it fixes exactly this problem. And seems to be already merged in -next: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
Ondrej, yup that's the work I was referring to. We were doing more than what was required of us for fsync and rename, and it could bog down like this. Filipe has been working a lot in this area recently, the patches he has will make it much better, and IIRC he has some more work to do in this area, so it should just get better from here. In the meantime I'll try and work out an artificial performance test to reproduce this behavior and add it to our nightly performance testing so we don't regress this in the future.
I think I saw the same issue in a different context: dovecot (IMAP server) would take a few minutes to write changes to a user's Maildir, after a modification that changed the read status of thousands of messages. Attaching strace showed that it was renaming thousands of files very slowly. (Maildir has one file per message and uses filename suffixes like ":2,RS" or ":2,ajm" to set flags on messages.)

kernel-core-5.14.10-300.fc35.x86_64
btrfs with rw,relatime,seclabel,compress=zstd:1,space_cache,subvolid=256,subvol=/root
selinux enabled
This seems to be already fixed in current rawhide kernel (and even in the stable 5.16.x kernels).