Bug 1851783 - drop bfq scheduler, instead use mq-deadline across the board
Summary: drop bfq scheduler, instead use mq-deadline across the board
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 33
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-28 23:15 UTC by Chris Murphy
Modified: 2020-08-11 13:41 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug



Description Chris Murphy 2020-06-28 23:15:36 UTC
Description of problem:

The rules in /usr/lib/udev/rules.d/60-block-scheduler.rules mean that some devices, such as SSDs, use the bfq scheduler, while others, such as NVMe, use none.

After speaking to Josef about it, the recommendation is to use mq-deadline for everything: NVMe, SAS/SATA SSDs and HDDs, USB sticks and drives, and mmcblk devices.
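
For reference, here is a quick way to see which scheduler each block device ended up with (the entry in brackets is the active one):

grep . /sys/block/*/queue/scheduler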


Version-Release number of selected component (if applicable):
systemd-udev-245.4-1.fc32.x86_64

How reproducible: Always

Comment 1 Josef Bacik 2020-06-28 23:23:46 UTC
Bfq consistently causes long latencies, measured in seconds, for seemingly no reason.  We evaluated it several times in production and each time it caused outages because of uncontrolled max latencies.  mq-deadline is straightforward and generally a sane default for the average use case.  It's what we use by default, reserving kyber or none for very specific circumstances.

Comment 2 Chris Murphy 2020-06-28 23:36:16 UTC
This works for me for SSD and NVMe, but I'm almost certain the nvme entry is not correct. On my system it needs to be set on nvme0n1, and the 0 and 1 could vary on other machines.

ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*[0-9]|sd*[!0-9]|sr*", \
  ENV{DEVTYPE}=="disk", \
  ATTR{queue/scheduler}="mq-deadline"
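
For anyone who wants to try the change on a live system without touching the rules (sda is just an example device name; the sysfs write needs root):

echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler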

Comment 3 Chris Murphy 2020-06-28 23:37:18 UTC
Also we should add something like vd*[!0-9] so that virtio-blk devices (/dev/vda) in VMs also get mq-deadline.
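
A rough, untested sketch of what a local override could look like with the vd* glob added (it relies on the usual udev behavior that a same-named file under /etc/udev/rules.d takes precedence over the one in /usr/lib; the globs are illustrative, not a final rule):

sudo tee /etc/udev/rules.d/60-block-scheduler.rules >/dev/null <<'EOF'
# local override: mq-deadline everywhere, including virtio-blk disks (vd*)
ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*[0-9]|sd*[!0-9]|sr*|vd*[!0-9]", \
  ENV{DEVTYPE}=="disk", \
  ATTR{queue/scheduler}="mq-deadline"
EOF
# reload the rules and re-run them for existing block devices
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=block --action=add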

Comment 4 Igor Raits 2020-06-29 07:38:01 UTC
What is the reason not to use `none` on NVMe devices? After some quick googling, it seems that it is designed exactly for fast devices like NVMe.

Comment 5 Chris Murphy 2020-06-29 20:32:27 UTC
None is fast for single-task workloads, and in particular does well on synthetic benchmarks predicated on single-task workloads. In the multiple-application case, where there can be several sources of IO pressure, none can variably starve them in ways that lead to latency spikes. mq-deadline will balance that out for better overall results in a mixed application workload, especially on consumer hardware. Let me know if you want a change proposal, so I can be quick like a bunny within the next 24 hours!

Comment 6 Igor Raits 2020-06-29 20:39:34 UTC
(In reply to Chris Murphy from comment #5)
> None is fast for single-task workloads, and in particular does well on
> synthetic benchmarks predicated on single-task workloads. In the
> multiple-application case, where there can be several sources of IO
> pressure, none can variably starve them in ways that lead to latency
> spikes. mq-deadline will balance that out for better overall results in a
> mixed application workload, especially on consumer hardware. Let me know
> if you want a change proposal, so I can be quick like a bunny within the
> next 24 hours!

Well, if you can write down the benefits, it would be much appreciated.

I have an NVMe drive and have not seen anything bad with "none". Also, it is a multiqueue scheduler, so I probably do not understand what "single task workloads" means.

I guess we should move this discussion to the mailing list instead of keeping it here.

Comment 7 Josef Bacik 2020-06-29 20:58:23 UTC
The only problem we've had with 'none' is that on our relatively busy boxes you can sometimes exhaust NVMe 'tags' (basically the number of IOs you can have in flight) and thus induce latency spikes for other tasks.  So if somebody is doing a lot of tiny writes, they use up all of the IOs you can have in flight, and with no scheduler it's basically luck of the draw as to who gets woken up to do their IO next.  mq-deadline makes it so you don't hit this issue.  That isn't to say that "none" is awful, just that "mq-deadline" is probably a better default, and "none" should probably be reserved for those who know how to read warning labels.
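
If you want to see where that in-flight cap surfaces, it's visible in sysfs (nvme0n1 is just an example device name):

cat /sys/block/nvme0n1/queue/nr_requests   # upper bound on requests the block layer keeps in flight
cat /sys/block/nvme0n1/queue/scheduler     # the bracketed entry is the active scheduler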

Comment 9 Zbigniew Jędrzejewski-Szmek 2020-06-29 22:09:33 UTC
This was previously discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1738828.
I'll repeat some arguments from the previous round:

- Setting this through udev rules after the device has already been detected is really backwards.
The kernel should have an option to specify the default scheduler at compile time so that
block devices are brought up with the appropriate scheduler.

- Kernel developers know best which scheduler works best in a given kernel version
and should provide the default.

Essentially, userspace only has to be involved in this because of internal kernel politics:
we don't have a way to configure a default, and the implicit default is not appropriate.

All that said, if we're to change the configuration in userspace, we need some benchmarks.
Previously, various benchmarks showed that bfq was giving good results. Has this changed?
If yes, let's do the change. But this should be based on some quantified results.

Comment 10 Paolo 2020-06-30 14:58:42 UTC
BFQ performance has improved even further since then. In particular, latency figures are still incomparably better than with mq-deadline. Throughput is on par with mq-deadline on SSDs and higher than mq-deadline on HDDs. If useful, I can run a fresh batch of tests with both an HDD and an SSD. Let me know.

Comment 11 Chris Murphy 2020-06-30 20:28:56 UTC
Hi Paolo thanks for the response.

I'm a benchmark skeptic, mainly because of criticisms of them by file system kernel developers. A benchmark is only as relevant as how well it mimics the workload we care about. And the difficulty there is that 'desktop' workloads in Fedora are heterogeneous. We've got folks with NVMe, SSD, and some HDD. Some folks compile software, others work on video and audio. Another factor is the Btrfs-by-default proposal: we'd want to make certain the workloads we care about run well with the default file system and the default IO scheduler.

Also, I notice that the udev rule for this applies Fedora-wide. Is it the intention that it apply to Fedora Server and IoT editions as well as the desktop? In VMs right now, virtio-blk devices (/dev/vda) get the bfq IO scheduler, but SATA devices (/dev/sda) get mq-deadline. Currently on the desktop we're using 'none' for NVMe, which, as Josef states, can lead to tag starvation in some cases because there's no arbiter. I'd rather see none used as an optimization for certain workloads than risk even a significant minority of Fedora users running into latency spikes due to a tag-aggressive task.


These benchmarks drive me nuts. None of them are very representative of desktop workloads, so the geometric mean is still misleading, and yet it suggests none or mq-deadline; it was also run on a recent kernel. That's sort of why I'm skeptical of running a more complex scheduler across a wide-ranging set of uses: it strikes me as highly likely that the more complex anything is, the more edge cases there will be. And even if mq-deadline isn't squeaking out the best performance in benchmarks, what I care about is not seeing latency spikes anywhere, and that is quite hard to detect.

https://www.phoronix.com/scan.php?page=article&item=linux-56-nvme&num=4

Comment 12 Michael Catanzaro 2020-06-30 20:42:37 UTC
(In reply to Chris Murphy from comment #11)
> These benchmarks drive me nuts. None of them are very representative of
> desktop workloads.

Well the workload is "time to launch gnome-terminal" (admittedly while under heavy I/O pressure). It seems plausible to me? How else can we possibly measure...?

Comment 13 Chris Murphy 2020-06-30 21:08:22 UTC
(In reply to Michael Catanzaro from comment #12)
> Well the workload is "time to launch gnome-terminal" (admittedly while under
> heavy I/O pressure). It seems plausible to me? How else can we possibly
> measure...?

That seems like a specious metric. But maybe we're talking about different benchmarks. The one I mention in comment 11 is not the 'time to load gnome-terminal' one that I dislike; it's a different one that I also dislike. But yes, you're right, a synthetic test could be designed where tasks A and B hog all the tags they possibly can, as fast as they can, while IO pressure is measured for unacceptable latency spikes (with magnitude and duration defined in advance). A scheduler that isn't well suited for such a decently likely, though not common, workload is probably not a good fit for Fedora.
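
For example, something along these lines could approximate that, purely as an illustrative fio invocation against scratch files rather than a real workload:

# tasks A and B stand-in: hog the device with small async writes at high queue depth
fio --name=hog --filename=/var/tmp/fio-hog --size=2G --ioengine=libaio \
  --direct=1 --rw=randwrite --bs=4k --iodepth=256 --numjobs=4 \
  --time_based --runtime=60 &
# the victim: a light reader whose completion-latency percentiles show the spikes
fio --name=victim --filename=/var/tmp/fio-victim --size=256M --ioengine=libaio \
  --direct=1 --rw=randread --bs=4k --iodepth=1 --time_based --runtime=60
wait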

Comment 14 Zbigniew Jędrzejewski-Szmek 2020-06-30 21:25:24 UTC
The one thing that seems relatively clear from this discussion is that we should set mq-deadline on nvme devices.

Comment 15 Zbigniew Jędrzejewski-Szmek 2020-06-30 21:27:50 UTC
> If useful, I can run a fresh batch of tests, with both an HDD and an SSD.

Yeah, that'd be useful.

Comment 16 Michael Catanzaro 2020-06-30 21:29:01 UTC
Sorry, I was referring to http://algo.ing.unimo.it/people/paolo/disk_sched/results.php

Comment 17 Chris Murphy 2020-06-30 22:28:28 UTC
Has bfq been evaluated in a cgroup2 context? We have quite a lot of work being done in this area to do proper memory, cpu, and io isolation. That seems to be where the focus should be for solving application launch times, rather than asking a self-admittedly complex scheduler to do this for us while making it very hard for anyone to prove that it has no side effects whatsoever. And yet we have correlating anecdata on devel@ and elsewhere that hangs (latency spikes) can happen with bfq as the IO scheduler that don't happen when mq-deadline is the scheduler. So whose burden is it supposed to be?

I do not think 38-second application launch times are something that should be solved by an IO scheduler. That isn't even necessarily a real problem; it may be exactly the correct outcome given the load, absent proper resource-control measures that compel an alternative outcome as a UI/UX preference, not as an IO scheduler benefit.
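
As a rough example of what I mean by resource control (property names are from systemd.resource-control(5); the weights are arbitrary, the build job is just a stand-in for any heavy background task, and IOWeight= only has an effect when an IO policy that honors weights is active):

# run a heavy background job with reduced CPU, memory, and IO shares
systemd-run --scope -p CPUWeight=20 -p MemoryHigh=2G -p IOWeight=20 -- make -j8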

And what do those benchmarks have to do with Fedora Cloud and Server? They're using this same udev rule.

Comment 18 Artem 2020-07-01 08:00:24 UTC
https://blogs.gnome.org/wjjt/2018/11/15/the-devil-makes-work-for-idle-processes/

TLDR: in Endless OS, we switched the IO scheduler from CFQ to BFQ, and set the IO priority of the threads doing Flatpak downloads, installs and upgrades to “idle”; this makes the interactive performance of the system while doing Flatpak operations indistinguishable from when the system is idle.
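
The per-task half of that is easy to reproduce by hand (the commands are just examples; the idle class is only honored by schedulers that implement I/O priorities, such as bfq):

# run a background update in the idle IO scheduling class so foreground IO wins
ionice -c 3 flatpak update -y          # class 3 = idle
# or, for something started as a transient systemd unit:
systemd-run --scope -p IOSchedulingClass=idle -- flatpak update -y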

Comment 19 Paolo 2020-07-01 09:21:30 UTC
Hi,
I'm sorry, I replied only to the first comment I received after being added to this thread. Now I'll try to reply to all the main points, and then get to the specific regression reported.

BFQ allows the desired bandwidth and latency to be guaranteed to each process or group of processes. In particular, BFQ honors I/O priorities and priority classes, and complies with cgroups-v1 and v2.

In this article you can see BFQ at work controlling bandwidth on a per-cgroup basis (which translates into control on a per-container or per-VM basis as well):
https://lwn.net/Articles/763603/

Or, in this article, you find a general survey of current solutions for controlling bandwidth in production:
https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

The only alternatives to BFQ for controlling bandwidth and latency with cgroups are the two new I/O controllers made by Facebook people. But, as I show in this presentation (again through repeatable tests), these solutions do not work on any of the several machines and storage devices I've used as a testbed:
https://www.usenix.org/conference/vault20/presentation/valente

The authors of those controllers, however, claim that they work well on their machines.

mq-deadline does not support any bandwidth or latency control using groups. Actually, mq-deadline performs rather little I/O control in general.

One of the reasons why BFQ succeeds in controlling I/O is exactly that it also controls request tags. In particular, BFQ's mechanism for controlling tags is an extended version of the coarse mechanism available in kyber.

BFQ has not yet been tuned for very fast multi-queue devices, so for the moment it may have regressions with NVMe devices.

Let's now get to the problem reported by Josef. I have no idea why these high latencies are occurring. There is no precise description of the workload or of the setup in this thread, so I have no way to reproduce it. As far as I know, mq-deadline performs no tag control at all. In cases like this, i.e., when one solution's logic has a problem with a workload and another solution has no logic at all, the second solution works better mainly by luck. That has already happened in BFQ's favor too, a few times, through no technical merit.

IMO the right way to go is to just find out why BFQ's tag-handling logic is failing in this case. It should then be easy and quick to fix the bug, or at least it has been in every such case over the last 10+ years.

Comment 20 Ben Cotton 2020-08-11 13:41:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.

