Bug 2182228 - kernel 6.2 turns on discard=async by default for btrfs
Summary: kernel 6.2 turns on discard=async by default for btrfs
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker
Depends On:
Blocks:
 
Reported: 2023-03-28 00:51 UTC by Scott Williams
Modified: 2023-06-13 18:09 UTC
CC: 26 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-13 18:09:53 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
CPU temp spike (30.58 KB, image/png)
2023-04-05 23:08 UTC, Gurenko Alex


Links
System ID: Fedora Pagure fedora-btrfs/project issue 6
Last Updated: 2023-03-28 01:13:02 UTC

Description Scott Williams 2023-03-28 00:51:03 UTC
1. Please describe the problem:
As of kernel-6.2, the defaults for btrfs have apparently changed such that the "discard=async" option is applied to btrfs mounts unless the "nodiscard" option is explicitly given at mount time or in fstab.  This impacts all Fedora 37 and 38 users who upgraded to kernel-6.2.  The result of this redundant trim (on top of our current fstrim.timer method) is additional IO overhead and possibly a negative impact on SSD longevity.
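Until this is resolved, users who prefer the old behavior can opt out per filesystem by adding `nodiscard` to the mount options in `/etc/fstab` (a sketch; the UUID and the subvol/compress options below are placeholders — keep whatever your existing entry already has):

```
# /etc/fstab — the UUID and subvol/compress options are placeholders;
# "nodiscard" is the only addition being illustrated here
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  btrfs  subvol=root,compress=zstd:1,nodiscard  0 0
```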

This was reported through discussion.fedoraproject.org: https://discussion.fedoraproject.org/t/btrfs-discard-storm-on-fedora/79997/9

2. What is the Version-Release number of the kernel:
kernel-6.2.7-300.fc38
kernel-6.2.8-200.fc37 
Fedora 36 very likely impacted, but I did not test: kernel-6.2.8-100.fc36 

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
It first appeared as of the 6.2 kernel release.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
I have reproduced this issue on two laptops, one running Fedora 37 and one running Fedora 38, both with nvme drives.  

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
It is an issue as of kernel-6.2

6. Are you running any modules not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

`discard=async` is shown on each btrfs mount line when `mount` is run as of 6.2.
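A quick way to check which mode a given mount is in is to classify its option string (a sketch; `findmnt` from util-linux is assumed to be available, and `/` is assumed to be btrfs):

```shell
# classify_discard OPTS: report the discard mode encoded in a
# comma-separated mount-option string.
classify_discard() {
    case "$1" in
        *nodiscard*)     echo "disabled" ;;  # explicit opt-out
        *discard=async*) echo "async" ;;     # the new 6.2 default
        *discard*)       echo "sync" ;;      # plain -o discard
        *)               echo "none" ;;      # no discard option shown
    esac
}

# On a live system (assumption: / is btrfs):
#   classify_discard "$(findmnt -no OPTIONS /)"
```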


One last thing to note: the latest btrfs documentation still seems to suggest that no discard is the default: https://btrfs.readthedocs.io/en/latest/Administration.html

Comment 1 Scott Williams 2023-03-28 00:53:47 UTC
For context, this is the expected behavior, that discards happen periodically through fstrim.timer:  https://fedoraproject.org/wiki/Changes/EnableFSTrimTimer

This change is an abrupt departure from that behavior.

Comment 2 Fedora Blocker Bugs Application 2023-03-28 00:57:51 UTC
Proposed as a Blocker for 38-final by Fedora user vwbusguy using the blocker tracking app because:

 This is a fundamental change in expected behavior for default filesystem maintenance, introduced recently in the 6.2 kernel release.  Previous changes to this behavior were well documented and intentionally tested ahead of release: https://fedoraproject.org/wiki/Changes/EnableFSTrimTimer

This change is likely to decrease performance through increased disk IO and could potentially lessen the life of an SSD.

Comment 3 Justin M. Forbes 2023-03-28 01:16:34 UTC
Absolutely against this being an F38 Blocker. I am happy to revert the patch that turned it on if necessary for F37 and F36, but I am not going to carry a revert against upstream forever. This change was in 6.2 long before F38 even branched. When the working groups chose to move to a btrfs default against the kernel team's recommendation, it was understood that any bugfixes in btrfs have to be coordinated upstream. F38 will need to come up with an appropriate way to deal with this change in line with upstream.

Comment 4 Scott Williams 2023-03-28 01:20:23 UTC
I don't feel strongly about this being a blocker, to be clear, but thought it should be raised since it was already a surprise change for users, including myself.  In the past changes like this were done more intentionally: https://fedoraproject.org/wiki/Changes/EnableFSTrimTimer .  Reverting for Fedora 36 and 37 but keeping it for Fedora 38 and communicating that in the upgrade notes seems like a good compromise to me.

Comment 5 Justin M. Forbes 2023-03-28 01:29:54 UTC
It is upstream commit 63a7cb1307184 for reference. It came in with the 6.2 merge window and has been included since 6.2-rc1.

Comment 6 Neal Gompa 2023-03-28 08:09:19 UTC
Not only is this absolutely *not a blocker*, I do not want this behavior reverted in *any* Fedora release because we asked upstream to enable it *last year*.

We've also discussed upstream what discard=async + fstrim means, and they've assured us that it means "not much" and it's safe to have both.

Comment 7 Neal Gompa 2023-03-28 08:11:39 UTC
(In reply to Justin M. Forbes from comment #3)
> Absolutely against this being an F38 Blocker. I am happy to revert the patch
> that turned it on if necessary for F37 and F36, but I am not going to carry
> a revert against upstream forever. This change was in 6.2 long before F38
> even branched. When the working groups chose to move to a btrfs default
> against the kenrel team's recomendation, it was understood that any bugfixes
> in btrfs have to be coordinated upstream. F38 will need to come up with an
> appropriate way to deal with this change in-line with upstream.

I absolutely do not appreciate the jab here. This policy better not be different for other storage subsystems just because there happens to be someone *at Red Hat*. That would be a gross misuse of having a blended Fedora/RHEL kernel tree for Fedora's upstream tree.

Comment 8 Neal Gompa 2023-03-28 08:14:59 UTC
(In reply to Scott Williams from comment #0)
> 
> One last thing to note, is that the current latest btrfs documentation still
> seems to suggest that no discard should still be the default:
> https://btrfs.readthedocs.io/en/latest/Administration.html

This, however, is probably something that should be fixed upstream...

Comment 9 Neal Gompa 2023-03-28 08:30:26 UTC
(In reply to Scott Williams from comment #0)
> 1. Please describe the problem:
> As of kernel-6.2, the defaults for btrfs have apparently changed such that
> the "discard=async" option is applied to btrfs unless "nodiscard" option is
> explicitly given at mount/fstab.  This impacts all Fedora 37 and 38 users
> who upgraded to kernel-6.2.  The result by this redundant trim (over our
> current fstrim.timer method) is additional IO overhead and possibly a
> negative impact to SSD longevity.
> 
> This was reported through discussion.fedoraproject.org:
> https://discussion.fedoraproject.org/t/btrfs-discard-storm-on-fedora/79997/9
> 

fstrim.timer will do nothing if everything has been handled by discard=async.

https://lore.kernel.org/linux-btrfs/20220727174726.GU13489@suse.cz/

Comment 10 Justin M. Forbes 2023-03-28 11:48:41 UTC
(In reply to Neal Gompa from comment #7)
> (In reply to Justin M. Forbes from comment #3)
> I absolutely do not appreciate the jab here. This policy better not be
> different for other storage subsystems just because there happens to be
> someone *at Red Hat*. That would be a gross misuse of having a blended
> Fedora/RHEL kernel tree for Fedora's upstream tree.

Apologies, it was a bit testy sounding, but no, there is *no* different policy. The policy would be the same. The expectation is that we would be a bit more in the loop on changes, and would have more influence on upstream to get changes in so that we do not have to carry a patch forever, but if this were an XFS change, my policy would be no different. I would be willing to carry a revert if necessary for existing stable releases, not willing to revert for pre-release or rawhide. This has nothing to do with a blended Fedora/RHEL tree. It never has.  My focus is Fedora, and always has been. 

As for this specific change, I do understand why this is desired behavior going forward. I do find it a bit odd that it wasn't put behind a Kconfig, but if everyone is happy with it, I suppose it doesn't matter much.

Comment 11 Neal Gompa 2023-03-28 12:45:09 UTC
(In reply to Justin M. Forbes from comment #10)
> (In reply to Neal Gompa from comment #7)
> > (In reply to Justin M. Forbes from comment #3)
> > I absolutely do not appreciate the jab here. This policy better not be
> > different for other storage subsystems just because there happens to be
> > someone *at Red Hat*. That would be a gross misuse of having a blended
> > Fedora/RHEL kernel tree for Fedora's upstream tree.
> 
> Apologies, it was a bit testy sounding, but no, there is *no* different
> policy. The policy would be the same. The expectation is that we would be a
> bit more in the loop on changes, and would have more influence on upstream
> to get changes in so that we do not have to carry a patch forever, but if
> this were an XFS change, my policy would be no different. I would be willing
> to carry a revert if necessary for existing stable releases, not willing to
> revert for pre-release or rawhide. This has nothing to do with a blended
> Fedora/RHEL tree. It never has.  My focus is Fedora, and always has been. 
> 

Thank you for the clarification. For what it's worth, we have plenty of upstream influence. At least two Btrfs kernel developers are actively tracking our bugs, and I bring things up to the whole group when they're pointed out to me as needed. 

> As for this specific change, I do understand why this is desired behavior
> going forward. I do find it a bit odd, that it wasn't dropped behind a
> Kconfig, but if everyone is happy with it, I suppose it doesn't matter much.

The reasoning in the F2F discussion in Rochester last year was that it basically no-ops in conditions it's not needed anyway, so making it a compile-time flag was considered overkill.

Comment 12 Scott Williams 2023-03-28 15:27:31 UTC
After seeing the context for this in https://pagure.io/releng/issue/11341, I'm now inclined to view this change as a surprise, but not a bug.  My assumption was that setting discard on a filesystem adds a lot of extra overhead and could shorten the life of an SSD (which is a significant part of why it wasn't enabled by default until now).  That linked discussion addresses this: discard=async shouldn't have the same drawbacks.  Unless someone else feels strongly about it, I'm OK with keeping things as is and closing this out.

Comment 13 Adam Williamson 2023-03-28 16:16:34 UTC
-4 in https://pagure.io/fedora-qa/blocker-review/issue/1131 , marking rejected.

Comment 14 Thorsten Leemhuis 2023-03-28 17:00:33 UTC
TWIMC: this new behaviour might be a regression that is also discussed upstream here:

https://lore.kernel.org/all/Y%2F+n1wS%2F4XAH7X1p@nz/

If somebody can show that this reduces SSD lifetime or performance even for some SSDs, I'll put my weight behind getting that change reverted or the defaults adjusted.

Comment 15 Scott Williams 2023-03-28 17:30:50 UTC
Reading the logs, it seems like this was pushed based on testing at Facebook datacenters against NVMe drives doing server-based workloads, which may or may not be representative of laptop workloads or older SSDs.  I assume that this update enables this behavior across all SSDs and not just newer NVMe drives?

Quote from Christoph Hellwig:
FYI, discard performance differs a lot between different SSDs.
It used to be pretty horrible for most devices early on, and then a
certain hyperscaler started requiring decent performance for enterprise
drives, so many of them are good now.  A lot less so for the typical
consumer drive, especially at the lower end of the spectrum.

And that jut[sic] NVMe, the still shipping SATA SSDs are another different
story.  Not helped by the fact that we don't even support ranged
discards for them in Linux.

Comment 16 Josef Bacik 2023-03-31 19:06:31 UTC
-o discard=async was introduced to solve two problems.  First, discards were being handled in the transaction commit, which for most drives was fine, but particularly crappy drives would perform poorly with discards, and this would increase latencies in our testing.  The async part was to move it out of the transaction commit path to even out latencies.

Secondly because of these incredibly awful drives, and additionally to provide better drive performance with heavy IO workloads, we wanted to also limit discards.  Because btrfs is COW we will discard far more often than most file systems, and additionally this extra discard'ing wasn't actually necessary because often times we'd come along and write over areas we had just freed up.  So in testing we looked at a sample size of our workloads and decided to have some default characteristics, namely to make sure we're only discarding when extents are relatively large, and to limit the number of discard iops in order to not interfere with actual IO.

The reason there are tunable knobs is because obviously this is very workload dependent.  We have a lot of data that shows these defaults make sense in the vast majority of cases, with our huge number of workloads and our pretty diverse set of drives that we have deployed in our fleet.  Auto-tuning is not really a reasonable option as we would have to do things like monitor the io pattern for the fs, as well as the latency characteristics of the underlying device, which adds a certain amount of overhead.  Additionally we're obviously talking about machines in a data center, we're comfortable with discards being slowly trickled out onto the disk.

On top of that, I've been running discard=async on all my personal machines since it was introduced, including 3 laptops, and I've not experienced this sort of behavior.  The defaults are the default for a reason; they're a solid starting point.

However, it is definitely workload dependent, and having your hard drive light on all the time is not good.  If the defaults are creating problems for certain workloads we need to figure out a way to solve this in a way that makes sense for everybody.  It is unreasonable to expect normal users to understand all the things that go into their own workloads and how their particular drive responds to discards.  Honestly it's unreasonable that I have to know anything about discards, but that's a thing for me and my therapist to deal with.

We btrfs developers have been discussing how to deal with this in a sane way.  The general consensus is that there are tuning knobs for a reason, but it's not reasonable to expect users to have to know how to mess with them.  As I indicated before discard=async is really 2 different behaviors, 1 the async part, and 2 the limiting part.  We are looking at the code to see if we can simply disable the limiting part completely.  This would give users the same behavior as normal -o discard, but with it async'ed off on the side.  If we can't disable it with the current code we're going to fix it so you can, and then make the default have no limiting whatsoever.  This will return the default behavior to what happens with -o discard, but with the benefit of it not being in the transaction commit path so you still get the latency benefits.
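For reference, the tuning knobs Josef mentions are exposed through sysfs on kernels with this feature. A small sketch for inspecting them (the knob names follow the 6.2 btrfs sysfs interface; whether each one exists on your kernel build is an assumption worth checking):

```shell
# print_discard_knobs BASEDIR: list the btrfs discard tuning knobs
# found under BASEDIR, skipping any that this kernel does not expose.
# On a real system BASEDIR would be /sys/fs/btrfs/<UUID>/discard.
print_discard_knobs() {
    for knob in iops_limit kbps_limit max_discard_size min_discard_size; do
        f="$1/$knob"
        [ -r "$f" ] && printf '%s=%s\n' "$knob" "$(cat "$f")"
    done
    return 0
}

# e.g. (assumption: / is btrfs and the discard directory exists):
#   print_discard_knobs "/sys/fs/btrfs/$(findmnt -no UUID /)/discard"
```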

Comment 17 Gurenko Alex 2023-04-05 23:08:36 UTC
Created attachment 1956015 [details]
CPU temp spike

As suggested here (https://discussion.fedoraproject.org/t/btrfs-discard-storm-on-fedora/79997/21), I would like to move my concerns here (re-phrased original post)

After using it (discard=async) for some time now, I have finally pinpointed an annoyance caused by this change.
My water-cooled (360mm AIO) Ryzen 5900X spikes from a normal 40°C to 65°C every 8-10 minutes or so (usually, but it can vary) for ~10-15 seconds (see screenshot).

I'm running the following configuration:

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0 931.5G  0 disk 
└─sda1        8:1    0 931.5G  0 part /mnt/ssd
sdb           8:16   0 931.5G  0 disk 
└─sdb1        8:17   0 931.5G  0 part 
sdc           8:32   0 931.5G  0 disk 
└─sdc1        8:33   0 931.5G  0 part 
sdd           8:48   0 931.5G  0 disk 
└─sdd1        8:49   0 931.5G  0 part 
zram0       252:0    0     8G  0 disk [SWAP]
nvme0n1     259:0    0 465.8G  0 disk 
└─nvme0n1p1 259:1    0 465.8G  0 part /home
nvme1n1     259:2    0 465.8G  0 disk 
├─nvme1n1p1 259:3    0     1G  0 part /boot/efi
├─nvme1n1p2 259:4    0     1G  0 part /boot
└─nvme1n1p3 259:5    0 463.8G  0 part /


/dev/sd{a..d} are in BTRFS RAID10 (m=raid10, d=raid10)

nvme1n1  - Samsung 970 EVO Plus 500Gb
nmve0n1  - WD Black SN850 - 500Gb
sd{a..d} - Seagate Firecuda 120 1Tb

Probably a lower drive count does not have this issue, but it took me a while to find the reason for my spikes.  The upgrade to the 6.2.x kernel coincided with a BIOS update, and I thought it was the latter, but after flashing the old BIOS and then the new BIOS again I wasn't able to confirm that.  However, running the system since yesterday with `nodiscard` finally proved it: temperature and CPU load remain completely flat during normal daily operation.

Comment 18 Thorsten Leemhuis 2023-04-06 04:52:22 UTC
FYI: upstream is now working out a solution afaics: https://lore.kernel.org/all/20230404193909.GC344341@zen/

Comment 19 Dagan McGregor 2023-04-24 13:22:48 UTC
Will the patch for Kernel 6.3 be added to Fedora 38, which is on 6.2 kernels? Or will we need to update to 6.3 kernel from Rawhide?

https://lore.kernel.org/linux-btrfs/cover.1680711209.git.boris@bur.io/

https://lore.kernel.org/linux-btrfs/45f813c5fabdb32df67ba661c396c592b863ff25.1680711209.git.boris@bur.io/

https://lore.kernel.org/linux-btrfs/b2938cfd0da3f8368f18b05b774e0f4207f8a3fe.1680711209.git.boris@bur.io/


Quote:

"Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.

Set the default to 1000, which results in using the smallest possible
delay, currently, which is 1ms. This has shown to not pathologically
keep the disk busy by the original reporter."
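
The numbers quoted above can be sanity-checked with a little shell arithmetic (a sketch; the `1000 ms / iops_limit` delay-per-discard relation is taken from the patch description):

```shell
# discard_window_secs IOPS_LIMIT NUM_DISCARDS: estimate how long a batch
# of discards takes when each one is delayed by (1000 / iops_limit) ms.
discard_window_secs() {
    delay_ms=$(( 1000 / $1 ))
    echo $(( delay_ms * $2 / 1000 ))
}

# Old default: iops_limit=10   -> 100 ms per discard; 300 discards take ~30 s.
# New default: iops_limit=1000 -> 1 ms per discard; the batch finishes sub-second.
```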


I am sure lots of Fedora users would really appreciate getting this patch.

Comment 20 Neal Gompa 2023-04-24 13:46:35 UTC
It is already being queued for 6.2 and will be part of an upcoming upstream Linux 6.2.y release.

Comment 21 Thorsten Leemhuis 2023-04-24 13:48:26 UTC
6.2.13, to be precise; -rc1 of that release just dropped.

Comment 22 Dagan McGregor 2023-04-24 14:08:52 UTC
(In reply to Neal Gompa from comment #20)
> It is already being queued for 6.2 and will be part of an upcoming upstream
> Linux 6.2.y release.

(In reply to Thorsten Leemhuis from comment #21)
> 6.2.13, to be precise; -rc1 of that release just dropped.

Thanks for the prompt response. I will keep an eye out for those updates.

Comment 23 Dagan McGregor 2023-04-29 10:31:28 UTC
I installed the 6.2.13-300 package from updates-testing, and so far the graph for disk usage looks much smoother. I haven't done any specific performance tuning or tests.

