1. Please describe the problem:
Hello, on kernel 6.5.5 I have now twice encountered a problem where a kernel worker thread named `kworker/u8:12+flush-btrfs-2` keeps one core at 100% and all subsequent reads and writes to one of the attached btrfs filesystems become completely impossible. Both times, this happened after more than one day of operation. The machine on which I observed this is used as a server. None of the attached drives report SMART errors, and after the first occurrence I ran `btrfs check` on both filesystems, which did not report any errors.

2. What is the Version-Release number of the kernel:
6.5.5

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I never encountered this problem before kernel 6.5.5. I am now running kernel 6.4.15 again and will report whether the problem occurs there as well.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
I am not completely sure that it is related, but I suspect the error was triggered by btrbk transferring snapshots from one drive to another, which involves btrfs send and receive operations. I assume this because later invocations of btrbk always reported that the subvolume of the snapshot existed but that its received UUID was not set yet.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
To be quite honest, I don't really want to run a Rawhide kernel on my server.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

Reproducible: Always
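For item 7, the ``-b`` flag takes a boot offset: 0 is the current boot, -1 the previous one, and so on. The following is only a sketch of how the command for a given boot could be assembled. The helper name `kernel_log_cmd` is made up for illustration and is not part of journalctl or the Fedora template.

```shell
# Hypothetical helper: print the journalctl invocation for a given boot.
# Offset 0 is the current boot, -1 the previous boot, -2 the one before that.
kernel_log_cmd() {
    printf 'journalctl --no-hostname -k -b %s' "${1:-0}"
}

# Show the command for the previous boot. To actually collect the log, run:
#   eval "$(kernel_log_cmd -1)" > dmesg.txt
kernel_log_cmd -1
echo
```

``journalctl --list-boots`` lists the boots that are available, which helps to pick the right offset if the hang happened several reboots ago.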
Created attachment 1992294: journalctl -k output of the boot with the problem.
I forgot to mention that unmounting the affected filesystem fails since the device is busy. Even after lazily unmounting the filesystem, the machine did not shut down correctly and I had to perform a hard reset.
The CPU usage graph of Cockpit supports my theory regarding btrbk and the snapshot transfer. Both times, the error apparently occurred during the transfer of the snapshots to another drive.
This also happens on an F37 system with an ext4 partition on an mdadm RAID1 array (HDD, 4 TB). The whole 6.5 kernel line (tested 6.5.4, 6.5.5 and 6.5.6, which are available on koji) displays this behaviour, while the previous 6.4 line (tested 6.4.4 and 6.4.15) doesn't.

What triggers this behaviour in my case is creating small files on the RAID array's partition, i.e.:
# for i in {0001..0200}; do echo "some text" > "file_${i}.txt"; done

After a few seconds the kworker/flush kicks in for a variable amount of time that depends on the number of created files. While the kworker/flush is at 100% CPU, deleting these files is more or less impossible. Removing the files once the kworker/flush has gone away is fast and doesn't trigger this behaviour. Writing one huge file (dd if=/dev/zero of=/raid/file) doesn't seem to trigger it either.

I also experienced the behaviour described in Comment #2, which led to a reconstruction of the RAID array, youpii.

On the same system, a small SSD (16G) holds the system on an ext4 partition, no RAID. Writing small files to this SSD partition doesn't cause the kworker/flush to eat 100% CPU.

I am willing to test kernels as long as they work on F37 (for now) and I don't have to build them. Building Fedora kernels is not an option for me. The last time I tried, it took several hours just to fail after filling the remaining 16G of disk space on an i7 laptop (OK, not the latest generation, but still).
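For anyone who wants to try the small-file workload from this comment, here is a minimal sketch as a script. It uses a plain POSIX loop instead of bash brace expansion. `TARGET_DIR`, `FILE_COUNT` and the function name `make_small_files` are assumptions introduced for illustration. `TARGET_DIR` defaults to a temporary directory, so it has to be pointed at a directory on the affected filesystem to exercise anything interesting.

```shell
# Sketch of the small-file workload: many tiny files in one directory.
# TARGET_DIR is a hypothetical setting; set it to a directory on the
# filesystem under test. It defaults to a fresh temporary directory.
TARGET_DIR="${TARGET_DIR:-$(mktemp -d)}"
FILE_COUNT="${FILE_COUNT:-200}"

make_small_files() {
    dir="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        # Zero-padded names, following the file_0001.txt pattern.
        name=$(printf 'file_%04d.txt' "$i")
        echo "some text" > "$dir/$name"
        i=$((i + 1))
    done
}

make_small_files "$TARGET_DIR" "$FILE_COUNT"
echo "created files in $TARGET_DIR"
```

While this runs (and for some seconds afterwards), watching top for a kworker thread with "flush" in its name should show whether the behaviour described above occurs on a given kernel.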
I can confirm that 6.4.15 does not show this behaviour. The error has not occurred with this kernel version yet.