Bug 2242391

Summary:

Kernel worker thread on 100% CPU core utilisation and one btrfs file system completely unusable

Product:

[Fedora] Fedora

Reporter:

Joshua Noeske <fedora>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED EOL

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

CC:

acaringi, adscvr, airlied, alciregi, bskeggs, glandvador, hdegoede, hpa, jarod, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved

Hardware:

x86_64

OS:

Linux

Last Closed:

2024-05-31 08:38:14 UTC

Attachments:

Description	Flags
journalctl -k output of the boot with the problem.	none

Description Joshua Noeske 2023-10-05 20:25:24 UTC

1. Please describe the problem:
Hello, on kernel 6.5.5 I have now twice encountered a problem where a kernel worker thread named `kworker/u8:12+flush-btrfs-2` pins one core at 100% utilisation and all subsequent reads from and writes to one of the attached btrfs filesystems become impossible. Both times, this happened after more than a day of operation. The machine on which I observed this is used as a server.
All attached drives report no SMART errors, and after the first occurrence I ran `btrfs check` on both filesystems, which did not report any errors.

2. What is the Version-Release number of the kernel:
6.5.5

3. Did it work previously in Fedora? If so, what kernel version did the issue
*first* appear? Old kernels are available for download at
https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I have never encountered this problem before kernel 6.5.5. Now, I am running kernel 6.4.15 again and I'll report if the problem exists there as well.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
the issue below:
I am not completely sure whether it is related, but I suspect that the error was triggered by btrbk transferring snapshots from one drive to another, which involves btrfs send and receive operations. I assume this because later invocations of btrbk always reported that the snapshot's subvolume existed but that its received UUID was not yet set.

5. Does this problem occur with the latest Rawhide kernel? To install the
Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
``sudo dnf update --enablerepo=rawhide kernel``:
To be quite honest, I don't really want to run a rawhide kernel on my server.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No

7. Please attach the kernel logs. You can get the complete kernel log
for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
issue occurred on a previous boot, use the journalctl ``-b`` flag.

Reproducible: Always
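
The btrbk symptom from item 4 (subvolume exists, received UUID not set) can be checked by hand from the `Received UUID` line that `btrfs subvolume show` prints. The following is only a minimal sketch: it assumes an unset value is printed as `-`, and the function name `has_received_uuid` and the snapshot path in the usage comment are hypothetical.

```bash
# Reads `btrfs subvolume show <path>` output on stdin.
# Exit status 0 if a Received UUID is set; 1 if it is "-", empty, or missing.
has_received_uuid() {
    awk -F': *' '
        /^[[:space:]]*Received UUID:/ {
            gsub(/[[:space:]]/, "", $2)          # strip padding around the value
            found = ($2 != "" && $2 != "-")
        }
        END { exit !found }
    '
}

# Usage (hypothetical path; needs root):
#   btrfs subvolume show /mnt/backup/snapshots/home.20231005 | has_received_uuid \
#       && echo "received UUID set" || echo "received UUID missing"
```

A snapshot whose receive was interrupted would be expected to fail this check, which matches what btrbk reported on later runs.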

Comment 1 Joshua Noeske 2023-10-05 20:26:15 UTC

Created attachment 1992294
journalctl -k output of the boot with the problem.

Comment 2 Joshua Noeske 2023-10-05 20:43:40 UTC

I forgot to mention that unmounting the affected filesystem fails since the device is busy. Even after lazily unmounting the filesystem, the machine did not shut down correctly and I had to perform a hard reset.

Comment 3 Joshua Noeske 2023-10-05 20:52:20 UTC

The CPU usage graph of Cockpit supports my theory regarding btrbk and the snapshot transfer. Both times, the error apparently occurred during the transfer of the snapshots to another drive.

Comment 4 Eduard Kohler 2023-10-08 10:30:20 UTC

This also happens on an F37 system with an ext4 partition on an mdadm RAID1 (4 TB HDD).

The whole 6.5 kernel line (I tested 6.5.4, 6.5.5 and 6.5.6, which are available on Koji) displays this behaviour, while the previous 6.4 line (tested 6.4.4 and 6.4.15) doesn't.

What triggers this behaviour in my case is creating small files on the RAID array's partition, i.e.:

# for i in {0001..0200}; do echo "some text" > "file_${i}.txt"; done

After a few seconds the kworker/flush thread kicks in for a variable amount of time, depending on the number of files created. While the kworker/flush thread is at 100% CPU, deleting these files is more or less impossible.
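
For anyone who wants to repeat the small-file test with a different file count or target directory, the loop above can be wrapped as follows. This is only a sketch of the same idea: `TARGET_DIR`, `COUNT` and the function name `make_small_files` are hypothetical, and the `sync` at the end is my addition to force writeback instead of waiting for it.

```bash
#!/usr/bin/env bash
# Sketch of the small-file reproducer; TARGET_DIR and COUNT are hypothetical knobs.
set -euo pipefail

# Create <count> small text files named file_0001.txt, file_0002.txt, ... in <dir>.
make_small_files() {
    local dir=$1 count=$2 i name
    mkdir -p "$dir"
    for ((i = 1; i <= count; i++)); do
        printf -v name 'file_%04d.txt' "$i"
        echo "some text" > "$dir/$name"
    done
}

target_dir=${TARGET_DIR:-$(mktemp -d)}   # defaults to a scratch directory
count=${COUNT:-200}

SECONDS=0
make_small_files "$target_dir" "$count"
sync    # ask the kernel to write back dirty data now
echo "created $count files in $target_dir; write + sync took ${SECONDS}s"
```

Pointing `TARGET_DIR` at a directory on the affected array is what should exercise the flush worker; the default scratch directory only demonstrates that the script runs.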

Removing these files (once the kworker/flush thread goes away) is fast and doesn't trigger this behaviour.

Writing one huge file (dd if=/dev/zero of=/raid/file) doesn't seem to trigger this behaviour either.
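
To compare cases like these with numbers rather than by watching top, the CPU share of the flush worker can be sampled from `/proc/<pid>/stat`, where utime and stime are fields 14 and 15 (in clock ticks). This is a rough sketch, not something I ran on the affected machine; the function names and the PID in the usage comment are hypothetical.

```bash
# Sum utime + stime (clock ticks) from one /proc/<pid>/stat line read on stdin.
cpu_ticks() {
    local line rest
    IFS= read -r line
    rest=${line##*) }     # drop "pid (comm) "; comm itself may contain spaces
    set -- $rest          # $1 = state ... $12 = utime, $13 = stime
    echo $(( ${12} + ${13} ))
}

# Print the average CPU percentage of <pid> over <interval> seconds (default 5).
sample_cpu_percent() {
    local pid=$1 interval=${2:-5} hz before after
    hz=$(getconf CLK_TCK)
    before=$(cpu_ticks < "/proc/$pid/stat")
    sleep "$interval"
    after=$(cpu_ticks < "/proc/$pid/stat")
    echo $(( (after - before) * 100 / (hz * interval) ))
}

# Usage (hypothetical PID of the kworker/...flush-... thread, taken from top):
#   sample_cpu_percent 12345 5
```

A value near 100 for the whole interval would correspond to the one-core saturation described here.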

I also experienced the behaviour described in Comment #2, which led to a reconstruction of the RAID array, yippee.

On the same system, a small SSD (16G) is installed for the system, with an ext4 partition and no RAID. Writing small files to this SSD partition doesn't cause the kworker/flush thread to eat 100% CPU.

I am willing to test kernels as long as they work on F37 (for now) and I don't have to build them. Building Fedora kernels is not an option for me. The last time I tried, it took several hours only to fail after filling the remaining 16G of disk space on an i7 laptop (OK, not the latest generation, but still).

Comment 5 Joshua Noeske 2023-10-11 09:13:51 UTC

I can confirm that 6.4.15 does not show this behaviour. The error has not occurred with this kernel version yet.

Comment 6 Aoife Moloney 2024-05-31 08:38:14 UTC

Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.