Bug 1851783

Summary: drop bfq scheduler, instead use mq-deadline across the board
Product: Fedora
Reporter: Chris Murphy <bugzilla>
Component: systemd
Assignee: systemd-maint
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: bberg, ego.cordatus, igor.raits, josef, lnykryn, mavit, mcatanza, msekleta, ngompa13, paolo.valente, samuel-rhbugs, ssahani, s, systemd-maint, tseewald, vitaly, zbyszek
Hardware: Unspecified
OS: Unspecified
Whiteboard: FutureFeature, Triaged
Type: Bug
Attachments:
  Causes of high latencies induced by writes (flags: none)

Description Chris Murphy 2020-06-28 23:15:36 UTC
Description of problem:

/usr/lib/udev/rules.d/60-block-scheduler.rules means some things like SSDs use scheduler bfq, and other things like NVMe use none. 

After speaking to Josef about it, recommend using mq-deadline for everything: NVMe, SAS/SATA SSD and HDD, USB sticks and drives, and mmcblk.
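
To check which scheduler each disk is currently using, the kernel exposes `/sys/block/<dev>/queue/scheduler`, where the active scheduler is shown in square brackets. The following is a minimal Python sketch of reading it, assuming a Linux sysfs layout; the helper names `parse_scheduler_line` and `current_schedulers` are made up for this illustration.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split a sysfs scheduler line into (active, available).

    The kernel marks the active scheduler with square brackets,
    e.g. "mq-deadline kyber [bfq] none".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def current_schedulers(sys_block="/sys/block"):
    """Map each block device name to its active scheduler (None if unknown)."""
    result = {}
    for sched_file in sorted(Path(sys_block).glob("*/queue/scheduler")):
        device = sched_file.parent.parent.name
        try:
            active, _ = parse_scheduler_line(sched_file.read_text())
        except OSError:
            continue  # device vanished or file unreadable; skip it
        result[device] = active
    return result


if __name__ == "__main__":
    for device, scheduler in current_schedulers().items():
        print(f"{device}: {scheduler}")
```

Running it before and after changing the udev rule is one way to confirm that the rule took effect on a given machine.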


Version-Release number of selected component (if applicable):
systemd-udev-245.4-1.fc32.x86_64

How reproducible: Always

Comment 1 Josef Bacik 2020-06-28 23:23:46 UTC
Bfq consistently causes long latencies, measured in seconds, for seemingly no reason.  We evaluated it several times in production and each time it would cause outages because of uncontrolled max latencies.  mq-deadline is straightforward and generally a sane default for the average use case.  It's what we use by default, using only kyber or none in very unique and specific circumstances.

Comment 2 Chris Murphy 2020-06-28 23:36:16 UTC
This works for me for SSD and NVMe, but I'm almost certain the nvme entry is not correct. For me it needs to be set on nvme0n1, but the 0 and 1 could vary elsewhere.

ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*[0-9]|sd*[!0-9]|sr*", \
  ENV{DEVTYPE}=="disk", \
  ATTR{queue/scheduler}="mq-deadline"

Comment 3 Chris Murphy 2020-06-28 23:37:18 UTC
Also we should add something like vd*[0-9] so that VM virtblkio devices also get mq-deadline.
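
To see which device names the KERNEL== pattern from comment 2 would select, the glob alternatives can be tried against a few names. This is a rough sketch only: it assumes Python's `fnmatchcase` is a close enough stand-in for udev's own glob matching (both support `*`, `?`, `[...]` and `[!...]`), and the helper name `kernel_matches` is hypothetical.

```python
from fnmatch import fnmatchcase

# KERNEL== pattern from the rule in comment 2; "|" separates alternatives.
KERNEL_PATTERN = "mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*[0-9]|sd*[!0-9]|sr*"


def kernel_matches(name, pattern=KERNEL_PATTERN):
    """Return True if a kernel device name matches any "|" alternative.

    Assumption: fnmatchcase() approximates udev's glob matching.
    """
    return any(fnmatchcase(name, alt) for alt in pattern.split("|"))


if __name__ == "__main__":
    for name in ("nvme0n1", "nvme0n1p1", "sda", "sda1", "mmcblk0", "sr0", "vda"):
        print(f"{name}: {kernel_matches(name)}")
```

Under this approximation nvme0n1 and sda match, sda1 does not, and vda does not match at all, which is why a vd* entry would be needed for VMs. The partition name nvme0n1p1 also matches the glob, so in the rule it is the ENV{DEVTYPE}=="disk" condition that keeps partitions out.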

Comment 4 Igor Raits 2020-06-29 07:38:01 UTC
What is the reason to not use `none` on NVMe devices? After quick googling, it seems that it is designed exactly for such fast devices as nvmes..

Comment 5 Chris Murphy 2020-06-29 20:32:27 UTC
None is fast for single-task workloads, and in particular does well on synthetic benchmarks predicated on single-task workloads. In the multiple-application case, however, there can be multiple sources of IO pressure, and with no scheduler they can variably starve one another in ways that lead to latency spikes. mq-deadline will balance that out for better overall results in a mixed application workload, especially with consumer hardware. Lemme know if you want a change proposal so I can be quick like a bunny within the next 24 hours!

Comment 6 Igor Raits 2020-06-29 20:39:34 UTC
(In reply to Chris Murphy from comment #5)
> None is fast for single task workloads, in particular does well on synthetic
> benchmarks predicated on single task workloads. Whereas the multiple
> application case, where there can be multiple sources of IO pressure, and
> variably starve them in ways that lead to latency spikes. mq-deadline will
> balance that out for better overall results in a mixed application workload.
> Especially with consumer hardware. Lemme know if you want a change proposal
> so I can be quick like a bunny within the next 24 hours!

Well, if you can write down benefits - would be much appreciated.

I have NVMe and I did not see anything bad with "none". Also it is multiqueue scheduler, so I probably do not understand what "single task workloads" means.

I guess we should move this discussion to the mailing list instead of keeping it here.

Comment 7 Josef Bacik 2020-06-29 20:58:23 UTC
The only problem we've had with 'none' is on our relatively busy boxes you can sometimes exhaust NVME 'tags' (basically the number of IO's you can have in flight) and thus induce latency spikes to other tasks.  So if you have somebody doing a lot of tiny writes, they use up all of the available IO's you can have in flight, and with no scheduler it's basically luck of who gets woken up as to who gets to do their IO next.  mq-deadline makes it so you don't hit this issue.  That isn't to say that "none" is awful, just that "mq-deadline" is probably a better default, and "none" should probably be reserved for those who know how to read warning labels.
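
The tag exhaustion described in comment 7 can be pictured with a toy model. This is a sketch under strong assumptions, not a description of how the kernel, NVMe, or mq-deadline actually hand out tags: requests complete in lockstep batches, the read is served strictly behind the writer's backlog (the unlucky case), and `writer_cap` is a hypothetical arbiter that simply keeps the writer from holding every tag.

```python
def ticks_until_tag(tags, writer_backlog, service_ticks, writer_cap=None):
    """Toy model: ticks a single late-arriving read waits for a free tag.

    tags: number of requests the device can have in flight at once.
    writer_backlog: writer requests already queued when the read arrives.
    service_ticks: ticks each request holds its tag before completing.
    writer_cap: optional maximum number of tags the writer may hold
        (hypothetical arbiter; None means no arbitration at all).
    """
    limit = tags if writer_cap is None else min(tags, writer_cap)
    if limit < tags:
        return 0  # at least one tag is always left over for the read
    # No cap: the read queues behind the whole backlog, which drains
    # in batches of `tags` requests every `service_ticks` ticks.
    batches_ahead = writer_backlog // tags
    return batches_ahead * service_ticks


if __name__ == "__main__":
    print(ticks_until_tag(tags=4, writer_backlog=8, service_ticks=5))
    print(ticks_until_tag(tags=4, writer_backlog=8, service_ticks=5, writer_cap=3))
```

In this model the read's wait grows linearly with the writer's backlog when nothing arbitrates, and drops to zero as soon as the writer cannot occupy every tag. Real behavior depends on wakeup order and device internals, which the model deliberately ignores.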

Comment 9 Zbigniew Jędrzejewski-Szmek 2020-06-29 22:09:33 UTC
This was previously discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1738828.
I'll repeat some arguments from the previous round:

- setting this through udev rules after the device has already been detected is really backwards.
Kernel should have an option to specify the default scheduler at compilation so that
block devices are brought up with the appropriate scheduler.

- kernel developers know best which scheduler in a given version of the kernel works best
and should provide the default value of the default.

Essentially, userspace has to be involved in this because of internal kernel politics and
we don't have a way to configure a default and the implicit default is not appropriate.

All that said, if we're to change the configuration in userspace, we need some benchmarks.
Previously, various benchmarks showed that bfq was giving good results. Has this changed?
If yes, let's do the change. But this should be based on some quantified results.

Comment 10 Paolo 2020-06-30 14:58:42 UTC
BFQ performance has even further improved since then. Especially, latency figures are still incomparably better than mq-deadline. Throughput is on par with mq-deadline for SSDs, higher than mq-deadline on HDDs. If useful, I can run a fresh batch of tests, with both an HDD and an SSD. Let me know.

Comment 11 Chris Murphy 2020-06-30 20:28:56 UTC
Hi Paolo thanks for the response.

I'm a benchmark skeptic, mainly because of criticisms of them by file system kernel developers. A benchmark is only as relevant as how well it mimics the workload we care about. And the difficulty there is that 'desktop' workloads in Fedora are heterogeneous. We've got folks with NVMe and SSD and some HDD. Some folks compile software, others are working on video and audio. Another factor is the Btrfs by default proposal, and we'd want to make certain the workloads we care about run well with the default file system and the default IO scheduler.

Also I notice that the udev rule for this applies Fedora wide. Is it the intention that it apply to Fedora Server and IoT editions as well as on the desktop? In VM's right now virtioblk devices (/dev/vda) will get bfq IO scheduler, but SATA devices (/dev/sda) get mq-deadline. Currently on the desktop, we're using 'none' for NVMe which as Josef states can lead to tag starvation in some cases because there's no arbiter. I'd rather see none used as an optimization for certain workloads rather than risk even a significant minority of Fedora users running into latency spikes due to a tag aggressive task.


These benchmarks drive me nuts. None of them are very representative of desktop workloads. And thus the geometric mean is still misleading, and yet it suggests none or mq-deadline. And is run on a recent kernel. And that's sorta why I'm skeptical of running a more complex scheduler for a wide ranging set of uses, it just strikes me as highly likely the more complex anything is the more edge cases there will be. And even if mq-deadline isn't squeaking out best performances in benchmarks, what I care about is not seeing latency spikes anywhere but that is quite hard to detect.

https://www.phoronix.com/scan.php?page=article&item=linux-56-nvme&num=4

Comment 12 Michael Catanzaro 2020-06-30 20:42:37 UTC
(In reply to Chris Murphy from comment #11)
> These benchmarks drive me nuts. None of them are very representative of
> desktop workloads.

Well the workload is "time to launch gnome-terminal" (admittedly while under heavy I/O pressure). It seems plausible to me? How else can we possibly measure...?

Comment 13 Chris Murphy 2020-06-30 21:08:22 UTC
(In reply to Michael Catanzaro from comment #12)
> Well the workload is "time to launch gnome-terminal" (admittedly while under
> heavy I/O pressure). It seems plausible to me? How else can we possibly
> measure...?

That seems like a specious metric. But maybe we're talking about different benchmarks. The one I mention in comment 11 is not the 'time to load gnome terminal' one that I dislike. It's a different one that I also dislike. But yes you're right, a synthetic test could be designed where task A and B hog all the tags they possibly can as fast as they can, and measure IO pressure for unacceptable (defined in advance both in magnitude and duration) latency spikes. A scheduler that isn't well suited for such a decently likely though not common workload is probably not a good fit for Fedora.
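
The pass/fail criterion for such a synthetic test, with magnitude and duration fixed in advance, could be expressed roughly as follows. This is only a sketch: the function name, the sample values, and the choice of "consecutive samples" as the measure of duration are assumptions made for illustration, and collecting the latency samples is left out entirely.

```python
def find_latency_spikes(samples_ms, max_latency_ms, min_run):
    """Return (start_index, length) for each run of at least `min_run`
    consecutive samples whose latency exceeds `max_latency_ms`."""
    spikes = []
    run_start = None
    for i, latency in enumerate(samples_ms):
        if latency > max_latency_ms:
            if run_start is None:
                run_start = i  # a new run of slow samples begins here
        else:
            if run_start is not None and i - run_start >= min_run:
                spikes.append((run_start, i - run_start))
            run_start = None
    # A run that is still open at the end of the data also counts.
    if run_start is not None and len(samples_ms) - run_start >= min_run:
        spikes.append((run_start, len(samples_ms) - run_start))
    return spikes


if __name__ == "__main__":
    # Hypothetical per-interval latencies in milliseconds.
    samples = [5, 900, 1200, 8, 2000, 3, 1500, 1600, 1700]
    print(find_latency_spikes(samples, max_latency_ms=500, min_run=2))
```

A scheduler would fail the test if the returned list is non-empty for the agreed thresholds.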

Comment 14 Zbigniew Jędrzejewski-Szmek 2020-06-30 21:25:24 UTC
The one thing that seems relatively clear from this discussion is that we should set mq-deadline on nvme devices.

Comment 15 Zbigniew Jędrzejewski-Szmek 2020-06-30 21:27:50 UTC
> If useful, I can run a fresh batch of tests, with both an HDD and an SSD.

Yeah, that'd be useful.

Comment 16 Michael Catanzaro 2020-06-30 21:29:01 UTC
Sorry, I was referring to http://algo.ing.unimo.it/people/paolo/disk_sched/results.php

Comment 17 Chris Murphy 2020-06-30 22:28:28 UTC
Has bfq been evaluated in a cgroup2 context? We have quite a lot of work being done in this area to do proper memory, cpu, and io isolation. That seems to be what the focus should be on for solving some applications' launch times, rather than asking a self-admitted complex scheduler to do this for us, while making it very hard for anyone to prove that it has no side-effects whatsoever. And yet we have correlating anecdotes on devel@ and elsewhere that hangs (latency spikes) can happen with bfq as the IO scheduler that don't happen when mq-deadline is the scheduler. So whose burden is it supposed to be?

I do not think a 38-second application launch time is something that should be solved by an IO scheduler. That isn't even necessarily a real problem - it may be the exact correct outcome given the load, absent proper resource control measures that compel an alternative outcome as a UI/UX preference. Not as an IO scheduler benefit.

And what do those benchmarks have to do with Fedora Cloud and Server? They're using this same udev rule.

Comment 18 Artem 2020-07-01 08:00:24 UTC
https://blogs.gnome.org/wjjt/2018/11/15/the-devil-makes-work-for-idle-processes/

TLDR: in Endless OS, we switched the IO scheduler from CFQ to BFQ, and set the IO priority of the threads doing Flatpak downloads, installs and upgrades to “idle”; this makes the interactive performance of the system while doing Flatpak operations indistinguishable from when the system is idle.
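
Running background work in the "idle" IO class can be sketched from user space as below. Note the substitution: the blog post talks about setting the priority of specific threads, whereas this sketch wraps a whole command with the util-linux `ionice` tool (`-c 3` selects the idle class). The helper names and the example command are assumptions, and the effect depends on the active IO scheduler honoring IO priority classes.

```python
import subprocess


def idle_io_command(command):
    """Prefix a command so it runs in the idle IO scheduling class.

    Uses util-linux `ionice -c 3` (class 3 = idle).
    """
    return ["ionice", "-c", "3", *command]


def run_with_idle_io(command):
    """Run `command` under the idle IO class and return its exit status."""
    return subprocess.run(idle_io_command(command), check=False).returncode


if __name__ == "__main__":
    # Example only: print the wrapped command instead of running it.
    print(idle_io_command(["flatpak", "update", "-y"]))
```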

Comment 19 Paolo 2020-07-01 09:21:30 UTC
Hi,
I'm sorry, I replied only to the first comment I received after being added to this thread. Now I'll try to reply to all the main points, and then get to the specific regression reported.

BFQ allows the desired bandwidth and latency to be guaranteed to each process or group of processes. In particular, BFQ honors I/O priorities and priority classes, and complies with cgroups-v1 and v2.

In this article you can see BFQ at work in controlling bandwidth on a cgroup basis (which also translates into control on a container or VM basis):
https://lwn.net/Articles/763603/

Or, in this article, you find a general survey of current solutions for controlling bandwidth in production:
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

The only alternatives to BFQ for controlling bandwidth and latency with cgroups are the two new I/O controllers made by Facebook people. But, as I show in this presentation (through repeatable tests also in this case), these solutions do not work on any of the several machines and storage devices I've used as a testbed:
https://www.usenix.org/conference/vault20/presentation/valente

The authors of the controllers however claim that their controllers work well on their machines.

mq-deadline does not support any bandwidth or latency control using groups. Actually, mq-deadline performs rather little I/O control in general.

One of the reasons why BFQ succeeds in controlling I/O is exactly that BFQ also controls request tags. In particular, BFQ's mechanism for controlling tags is an extended version of the coarse mechanism available in kyber.

BFQ has not yet been tuned for multi-queue, very fast devices. So, for the moment, it may have regressions with nvme devices.

Let's go now to the problem reported by Josef. I have no idea why these high latencies are occurring. There is no precise description of the workload and of the setting in this thread, so I have no chance to reproduce it. For sure, mq-deadline performs no tag control at all AFAIK. In such cases, i.e., when one solution's logic has a problem with a workload and another solution has no logic at all, the second solution works better mainly because of luck. This has already happened in favor of BFQ too, a few times, for no technical merit.

IMO the right way to go is to just find out why BFQ's tag-handling logic is failing in this case. It should then be easy and quick to fix the bug. Or at least it has been so, in all cases, in the last 10+ years.

Comment 20 Ben Cotton 2020-08-11 13:41:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.

Comment 21 Chris Murphy 2020-10-20 17:37:54 UTC
Cgroups developers tell me that BFQ "can't currently implement comprehensive IO" isolation, doesn't handle "writeback / forced IO backcharging", and has scalability issues.

Resource control is a significant effort in Fedora, in particular GNOME and KDE, and one of the reasons for switching to Btrfs by default in Fedora 33. I think the presumption should be that we revert to mq-deadline by default in Fedora 34.

Comment 22 Tom Seewald 2020-10-20 17:52:14 UTC
(In reply to Chris Murphy from comment #21)
> Cgroups developers tell me that BFQ "can't currently implement comprehensive
> IO" isolation, doesn't handle 'writeback / forced IO backcharging", and has
> scalability issues.
> 
> Resource control is a significant effort in Fedora, in particular GNOME and
> KDE, and one of the reasons for switching to Btrfs by default in Fedora 33.
> I think the presumption should be that we revert to mq-deadline by default
> in Fedora 34.

What is your response to the links Paolo has posted above that strongly suggest bfq improves system responsiveness and that the current IO controllers are not sufficient? I think we should actually talk about this openly rather than trying to force a change based on private conversations. It's not clear to me how bfq is an impediment to implementing other IO control mechanisms or how mq-deadline improves IO control.

Comment 23 Paolo 2020-10-21 05:54:16 UTC
(In reply to Chris Murphy from comment #21)
> Cgroups developers tell me that BFQ "can't currently implement comprehensive
> IO" isolation, doesn't handle 'writeback / forced IO backcharging",

false

> and has
> scalability issues.
> 

false, with the (only) drives for which BFQ is enabled in Fedora.

This time (and in my future replies) I'll save us links to public, easily repeatable results.

Comment 24 Chris Murphy 2020-10-21 23:56:10 UTC
(In reply to Paolo from comment #23)
> (In reply to Chris Murphy from comment #21)
> > Cgroups developers tell me that BFQ "can't currently implement comprehensive
> > IO" isolation, doesn't handle 'writeback / forced IO backcharging",
> 
> false

What is false? BFQ can do comprehensive IO isolation? When BFQ is used and a cgroup generates a lot of swap writes, are those writes attributed to the root cgroup? Or to the cgroup that generated the swap writes?
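
One way to look at attribution empirically is to compare the write counters that cgroup v2 reports per cgroup in `io.stat`. The sketch below only parses that file; it assumes cgroup v2 is mounted with the io controller enabled, the helper names are hypothetical, and it does not by itself answer whether swap writes are charged to the originating cgroup.

```python
from pathlib import Path


def parse_io_stat(text):
    """Parse cgroup v2 io.stat text into {"MAJ:MIN": {key: value}}.

    Integer values are converted; anything else is kept as a string.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        entry = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            entry[key] = int(value) if value.isdigit() else value
        stats[fields[0]] = entry
    return stats


def total_written_bytes(cgroup_dir):
    """Sum wbytes over all devices listed in <cgroup_dir>/io.stat."""
    text = (Path(cgroup_dir) / "io.stat").read_text()
    return sum(dev.get("wbytes", 0) for dev in parse_io_stat(text).values())


if __name__ == "__main__":
    # Made-up sample in the io.stat line format, for illustration only.
    sample = "8:0 rbytes=4096 wbytes=1048576 rios=1 wios=256 dbytes=0 dios=0\n"
    print(parse_io_stat(sample))
```

Sampling `total_written_bytes` for the cgroup running the workload before and after a swap-heavy run, and comparing the delta with the device-wide write total, would show how much of the write traffic was charged to that cgroup.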

Whether BFQ or mq-deadline, using a decent SATA SSD, and with a heavy cpu, memory, and swap writes workload, the desktop becomes entirely unresponsive. Multiple seconds, even minutes, of complete total hang. This is trivially reproducible, just compile webkitgtk on any current release Fedora Workstation (or KDE spin). We at least have a way forward with cgroupsv2 memory, cpu, and io isolation to solve this problem, and pretty much always ensure the GUI has minimum resources necessary for a user to maintain decent interactivity with their own computer, including even being able to kill a wayward process rather than (a) wait indefinitely or (b) pull the power cord. BFQ alone isn't solving this problem, are you saying it can and should?

>  and has
> > scalability issues.
> > 
> 
> false, with the (only) drives for which BFQ is enabled in Fedora.
> 
> This time (and in my future replies) I'll save us links to public, easily
> repeatable results.

The udev rule in Fedora applies BFQ to all sd* devices. I explicitly asked Tejun about that and he replied:

"Even for good SATA SSDs, the scalability limitation can be a problem and the
control comprehensiveness issue is likely gonna matter more for lower
performance devices."

He also pointed me to:
https://lore.kernel.org/linux-block/7CD57B83-F067-4918-878C-BAC413C6A2B3@linaro.org/
https://github.com/facebookexperimental/resctl-demo

I don't know if it's better discussed on linux-block@ list or fedora-devel@. An upstream consensus doesn't appear likely anytime soon. And the previous fedora-devel@ discussion also seems misplaced, hindsight being 20/20.

However, folks with both servers and desktops report multi-second latencies using BFQ that don't happen with mq-deadline. It doesn't matter if BFQ is better overall 99% of the time if it has multi-second latency spikes even for a scant minority of users, who are very unlikely to know how to track down the cause. The people who can suggest that they "change to none or mq-deadline scheduler and try to reproduce the problem" are not omnipresent on the various lists where such problems are typically reported.

I also don't know if BFQ is more appropriate for ext4 than it is Btrfs. Or if that makes things even more complicated. My gut instinct is that it's specious and makes things more complicated.

Comment 25 Paolo 2020-10-22 06:29:28 UTC
(In reply to Chris Murphy from comment #24)
> (In reply to Paolo from comment #23)
> > (In reply to Chris Murphy from comment #21)
> > > Cgroups developers tell me that BFQ "can't currently implement comprehensive
> > > IO" isolation, doesn't handle 'writeback / forced IO backcharging",
> > 
> > false
> 
> What is false? BFQ can do comprehensive IO isolation? When BFQ is used and a
> cgroup generates a lot of swap writes, are those writes attributed to the
> root cgroup? Or to the cgroup that generated the swap writes?
> 

BFQ charges the cgroup for writes.  In a scenario where the writes
cannot be traced back to the originating cgroup, no solution at all can
provide any isolation.

> Whether BFQ or mq-deadline, using a decent SATA SSDs, and with a heavy cpu,
> memory, and swap writes workload, the desktop becomes entirely unresponsive.
> Multiple seconds, even minutes, of complete total hang. This is trivially
> reproducible, just compile webkitgtk on any current release Fedora
> Workstation (or KDE spin). We at least have a way forward with cgroupsv2
> memory, cpu, and io isolation to solve this problem, and pretty much always
> ensure the GUI has minimum resources necessary for a user to maintain decent
> interactivity with their own computer, including even being able to kill a
> wayward process rather than (a) wait indefinitely or (b) pull the power
> cord. BFQ alone isn't solving this problem, are you saying it can and should?
> 

Short answer: yes BFQ can, and only BFQ at the moment (see below for
the solution you use).

Long answer: the causes of the problem are outside BFQ, and we have
studied them very carefully.  I'm about to attach a report with a list
of these causes.  Because of these causes, the problem cannot be
solved by BFQ alone.  But it can be solved by BFQ plus a little extra
code outside BFQ.  We have already written that code, but it is not
production-ready.  It is not production-ready because we are working
on it very slowly.  This is because I have failed to find big support.
And I have failed to get important support because unfortunately this
problem hits only rather specific system configurations and workloads.

The solution you use (namely configuring memory, cpu and io controller
with magic parameter values) is simply unrealistic for an average
user, and it unavoidably leads to a waste of resources for a different
workload than that for which you tune it.

> >  and has
> > > scalability issues.
> > > 
> > 
> > false, with the (only) drives for which BFQ is enabled in Fedora.
> > 
> > This time (and in my future replies) I'll save us links to public, easily
> > repeatable results.
> 
> The udev rule in Fedora applies BFQ to all sd* devices. I explicitly asked
> Tejun about that and he replied:
> 
> "Even for good SATA SSDs, the scalability limitation can be a problem and the
> control comprehensiveness issue is likely gonna matter more for lower
> performance devices."
> 

You seem to prefer "ipse dixit" over numbers, so I'll save us
reproducible numbers again.

> He also pointed me to:
> https://lore.kernel.org/linux-block/7CD57B83-F067-4918-878C-BAC413C6A2B3@linaro.org/
> https://github.com/facebookexperimental/resctl-demo
> 

Those results are exactly for what's outside the scope of this thread:
a superfast nvme drive, attached to a superfast system.

> I don't know if it's better discussed on linux-block@ list or fedora-devel@.
> An upstream consensus doesn't appear likely anytime soon. And the previous
> fedora-devel@ discussion also seems misplaced, hindsight being 20/20.
> 
> However, folks with both servers and desktops report multi-second latencies
> using BFQ that don't happen with mq-deadline. It doesn't matter if BFQ is
> better overall 99% of the time if it has multi-second latency spikes even
> for a scant minority of users, who are very unlikely to know how to track
> down the cause. The people who can suggest that they "change to none or
> mq-deadline scheduler and try to reproduce the problem" are not omnipresent
> on the various lists where such problems are typically reported.
> 

I've already expressed my negative opinion on a non-collaborative
approach, based on non-shared, non-reproducible anecdotal evidence.

> I also don't know if BFQ is more appropriate for ext4 than it is Btrfs.

Same results.

> Or
> if that makes things even more complicated.

No, it doesn't.

> My gut instinct is that it's
> specious and makes things more complicated.

Comment 26 Paolo 2020-10-22 06:30:41 UTC
Created attachment 1723415 [details]
Causes of high latencies induced by writes

Comment 27 Michael Catanzaro 2020-10-22 14:38:02 UTC
(In reply to Paolo from comment #25)
> The solution you use (namely configuring memory, cpu and io controller
> with magic parameter values) is simply unrealistic for an average
> user, and it unavoidably leads to a waste of resources for a different
> workload than that for which you tune it.

The goal of uresourced's resource limits is to preserve desktop responsiveness at the cost of potentially wasted resources. That's an intentional design choice. You could consider them "reservations" rather than "limits" in that resource usage is only limited to whatever needed to protect the resources reserved for the desktop slice... but whatever, I suppose a limit is a limit. Point is, we have resource limits enabled by default in Fedora 33, and you should assume that almost all users will stick with our defaults.

Comment 28 Paolo 2020-10-22 14:49:34 UTC
(In reply to Michael Catanzaro from comment #27)
> (In reply to Paolo from comment #25)
> > The solution you use (namely configuring memory, cpu and io controller
> > with magic parameter values) is simply unrealistic for an average
> > user, and it unavoidably leads to a waste of resources for a different
> > workload than that for which you tune it.
> 
> The goal of uresourced's resource limits is to preserve desktop
> responsiveness at the cost of potentially wasted resources. That's an
> intentional design choice. You could consider them "reservations" rather
> than "limits" in that resource usage is only limited to whatever needed to
> protect the resources reserved for the desktop slice... but whatever, I
> suppose a limit is a limit. Point is, we have resource limits enabled by
> default in Fedora 33, and you should assume that almost all users will stick
> with our defaults.

Yep.  But the general (and longstanding) solution you mention is a
different story than the specific configurations that one must set to
counteract the nasty, low-level starvation issues caused by intense
writeback.

If you have some faith in numbers, here is, e.g., one more
presentation that shows how difficult the problem is, and how badly
I/O control works:
https://www.usenix.org/conference/vault20/presentation/valente

You can easily repeat these tests yourself.