Description of problem:
/usr/lib/udev/rules.d/60-block-scheduler.rules makes some devices, such as SSDs, use the bfq scheduler, while others, such as NVMe, use none.
After speaking to Josef about it, I recommend using mq-deadline for everything: NVMe, SAS/SATA SSDs and HDDs, USB sticks and drives, and mmcblk.
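For reference, the active scheduler for a device can be read from /sys/block/<dev>/queue/scheduler; the kernel lists all available schedulers and brackets the active one. A minimal parsing sketch (the sample string is illustrative):

```python
# Parse the contents of /sys/block/<dev>/queue/scheduler,
# e.g. "mq-deadline kyber [bfq] none" -> active scheduler is "bfq".

def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked active (the bracketed token)."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active scheduler marked in: " + sysfs_text)

def available_schedulers(sysfs_text: str) -> list:
    """Return all schedulers the device offers, brackets stripped."""
    return [t.strip("[]") for t in sysfs_text.split()]

if __name__ == "__main__":
    sample = "mq-deadline kyber [bfq] none"
    print(active_scheduler(sample))       # bfq
    print(available_schedulers(sample))   # ['mq-deadline', 'kyber', 'bfq', 'none']
```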
Version-Release number of selected component (if applicable):
How reproducible: Always
Bfq consistently causes long latencies, measured in seconds, for seemingly no reason. We evaluated it several times in production, and each time it caused outages because of uncontrolled maximum latencies. mq-deadline is straightforward and generally a sane default for the average use case. It's what we use by default, reserving kyber or none for very specific circumstances.
This works for me for SSD and NVMe, but I'm almost certain the nvme entry is not correct. For me it needs to be set on nvme0n1, but the 0 and 1 could vary elsewhere.
ACTION=="add", SUBSYSTEM=="block", \
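For illustration only, a complete rule along these lines might look like the following; the KERNEL globs are my guess at matching whole disks rather than partitions, and as noted above the nvme pattern in particular needs verification:

```
ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="sd*[!0-9]|nvme[0-9]n[0-9]", \
  ATTR{queue/scheduler}="mq-deadline"
```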
Also we should add something like vd*[0-9] so that virtio-blk devices in VMs also get mq-deadline.
What is the reason not to use `none` on NVMe devices? After some quick googling, it seems that it is designed exactly for fast devices like NVMe.
None is fast for single-task workloads; in particular it does well on synthetic benchmarks predicated on single-task workloads. Whereas in the multiple-application case, where there can be multiple sources of IO pressure, it can variably starve them in ways that lead to latency spikes. mq-deadline will balance that out for better overall results in a mixed application workload, especially with consumer hardware. Lemme know if you want a change proposal so I can be quick like a bunny within the next 24 hours!
(In reply to Chris Murphy from comment #5)
> None is fast for single-task workloads; in particular it does well on
> synthetic benchmarks predicated on single-task workloads. Whereas in the
> multiple-application case, where there can be multiple sources of IO
> pressure, it can variably starve them in ways that lead to latency spikes.
> mq-deadline will balance that out for better overall results in a mixed
> application workload, especially with consumer hardware. Lemme know if you
> want a change proposal so I can be quick like a bunny within the next 24 hours!
Well, if you can write down the benefits, it would be much appreciated.
I have NVMe and I did not see anything bad with "none". Also, it is a multiqueue scheduler, so I probably do not understand what "single task workloads" means.
I guess we should move this discussion to the mailing list instead of keeping it here.
The only problem we've had with 'none' is that on our relatively busy boxes you can sometimes exhaust NVMe 'tags' (basically the number of I/Os you can have in flight) and thus induce latency spikes in other tasks. So if you have somebody doing a lot of tiny writes, they use up all of the available in-flight I/Os, and with no scheduler it's basically luck of who gets woken up as to who gets to do their IO next. mq-deadline makes it so you don't hit this issue. That isn't to say that "none" is awful, just that "mq-deadline" is probably a better default, and "none" should probably be reserved for those who know how to read warning labels.
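That tag-exhaustion effect can be illustrated with a toy model (purely illustrative; the queue and "deadline" policy here are made up for the sketch, not kernel behavior): a flood of tiny writes shares the queue with one latency-sensitive read. With strict FIFO dispatch the read waits behind the entire backlog; a deadline-style promotion bounds its wait.

```python
import collections

# Toy model: a request queue where one read competes with a burst of
# tiny writes. Illustrates why an arbiter bounds worst-case latency.

def simulate(policy, n_writes=100, deadline=5):
    """Return how many requests are dispatched before the read.

    'fifo' dispatches strictly in arrival order; 'deadline' promotes
    the read to the front once it has waited `deadline` dispatches.
    """
    queue = collections.deque(("write", i) for i in range(n_writes))
    queue.append(("read", 0))
    waited = 0
    while queue:
        if policy == "deadline" and waited >= deadline:
            # The read is overdue: move it to the front of the queue.
            for idx, req in enumerate(queue):
                if req[0] == "read":
                    del queue[idx]  # O(n); fine for a toy model
                    queue.appendleft(req)
                    break
        req = queue.popleft()
        if req[0] == "read":
            return waited
        waited += 1
    return waited

print(simulate("fifo"))      # 100: the read waits behind every write
print(simulate("deadline"))  # 5: the read's wait is bounded
```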
OK I started a devel@ thread for it.
This was previously discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1738828.
I'll repeat some arguments from the previous round:
- setting this through udev rules after the device has already been detected is really backwards.
The kernel should have an option to specify the default scheduler at compile time so that
block devices are brought up with the appropriate scheduler.
- kernel developers know best which scheduler works best in a given version of the kernel
and should provide that default.
Essentially, userspace has to be involved in this because of internal kernel politics: we
don't have a way to configure a default, and the implicit default is not appropriate.
All that said, if we're to change the configuration in userspace, we need some benchmarks.
Previously, various benchmarks showed that bfq was giving good results. Has this changed?
If yes, let's do the change. But this should be based on some quantified results.
BFQ performance has improved even further since then. In particular, latency figures are still incomparably better than mq-deadline's. Throughput is on par with mq-deadline on SSDs, and higher than mq-deadline on HDDs. If useful, I can run a fresh batch of tests, with both an HDD and an SSD. Let me know.
Hi Paolo thanks for the response.
I'm a benchmark skeptic, mainly because of criticisms of them by filesystem kernel developers. A benchmark is only as relevant as how well it mimics the workload we care about. And the difficulty there is that 'desktop' workloads in Fedora are heterogeneous. We've got folks with NVMe and SSD and some HDD. Some folks compile software, others work on video and audio. Another factor is the Btrfs-by-default proposal: we'd want to make certain the workloads we care about run well with the default file system and the default IO scheduler.
Also, I notice that the udev rule for this applies Fedora-wide. Is it the intention that it apply to the Fedora Server and IoT editions as well as the desktop? In VMs right now, virtio-blk devices (/dev/vda) get the bfq IO scheduler, but SATA devices (/dev/sda) get mq-deadline. Currently on the desktop we're using 'none' for NVMe, which, as Josef states, can lead to tag starvation in some cases because there's no arbiter. I'd rather see none used as an optimization for certain workloads than risk even a significant minority of Fedora users running into latency spikes due to a tag-aggressive task.
These benchmarks drive me nuts. None of them is very representative of desktop workloads. Thus the geometric mean is still misleading, and yet it suggests none or mq-deadline. And it is run on a recent kernel. That's sort of why I'm skeptical of running a more complex scheduler for a wide-ranging set of uses; it strikes me as highly likely that the more complex anything is, the more edge cases there will be. And even if mq-deadline isn't squeaking out the best performances in benchmarks, what I care about is not seeing latency spikes anywhere, though that is quite hard to detect.
(In reply to Chris Murphy from comment #11)
> These benchmarks drive me nuts. None of them are very representative of
> desktop workloads.
Well the workload is "time to launch gnome-terminal" (admittedly while under heavy I/O pressure). It seems plausible to me? How else can we possibly measure...?
(In reply to Michael Catanzaro from comment #12)
> Well the workload is "time to launch gnome-terminal" (admittedly while under
> heavy I/O pressure). It seems plausible to me? How else can we possibly
That seems like a specious metric. But maybe we're talking about different benchmarks. The one I mention in comment 11 is not the 'time to load gnome-terminal' one that I dislike; it's a different one that I also dislike. But yes, you're right: a synthetic test could be designed where tasks A and B hog all the tags they possibly can, as fast as they can, and IO pressure is measured for unacceptable latency spikes (defined in advance in both magnitude and duration). A scheduler that isn't well suited to such a decently likely, though not common, workload is probably not a good fit for Fedora.
The one thing that seems relatively clear from this discussion is that we should set mq-deadline on nvme devices.
> If useful, I can run a fresh batch of tests, with both an HDD and an SSD.
Yeah, that'd be useful.
Sorry, I was referring to http://algo.ing.unimo.it/people/paolo/disk_sched/results.php
Has bfq been evaluated in a cgroup2 context? We have quite a lot of work being done in this area to do proper memory, CPU, and IO isolation. That seems to be where the focus should be for solving some applications' launch times, rather than asking a self-admittedly complex scheduler to do this for us, while making it very hard for anyone to prove that it has no side effects whatsoever. And yet we have correlating anecdata on devel@ and elsewhere that hangs (latency spikes) can happen with bfq as the IO scheduler that don't happen when mq-deadline is the scheduler. So whose burden is it supposed to be?
I do not think a 38 second application launch time is something that should be solved by an IO scheduler. It isn't even necessarily a real problem - it may be the exact correct outcome given the load, absent proper resource control measures that compel an alternative outcome as a UI/UX preference. Not as an IO scheduler benefit.
And what do those benchmarks have to do with Fedora Cloud and Server? They're using this same udev rule.
TLDR: in Endless OS, we switched the IO scheduler from CFQ to BFQ, and set the IO priority of the threads doing Flatpak downloads, installs and upgrades to “idle”; this makes the interactive performance of the system while doing Flatpak operations indistinguishable from when the system is idle.
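For reference, the "idle" I/O class can be applied without code changes. For a systemd-managed service, a drop-in like the following works (the unit name here is illustrative, not the actual helper Endless patched); `ionice -c 3` does the same for an ad-hoc process. Note that I/O priorities only take effect under a scheduler that honors them, such as BFQ:

```ini
# /etc/systemd/system/flatpak-system-helper.service.d/io-idle.conf
# (unit name illustrative; use whichever service performs the downloads)
[Service]
IOSchedulingClass=idle
```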
I'm sorry, I replied only to the first comment I received since I was added to this thread. Now I'll try to reply to all the main points, and then get to the specific regression reported.
BFQ allows the desired bandwidth and latency to be guaranteed to each process or group of processes. In particular, BFQ honors I/O priorities and priority classes, and complies with cgroups-v1 and v2.
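Concretely, with a cgroup-v2 hierarchy and BFQ active, per-group shares of disk time are set through the io controller. A configuration sketch (group names and weight values are illustrative; requires root and a cgroup2 mount at /sys/fs/cgroup):

```
# Enable the io controller for child groups, then weight two groups
# so "interactive" gets a much larger share of disk time than "builds".
mkdir -p /sys/fs/cgroup/builds /sys/fs/cgroup/interactive
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
echo "default 50"  > /sys/fs/cgroup/builds/io.bfq.weight
echo "default 500" > /sys/fs/cgroup/interactive/io.bfq.weight
```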
In this article you can see BFQ at work in controlling bandwidth on a cgroup basis (which translates into control on also a container or VM basis):
Or, in this article, you find a general survey of current solutions for controlling bandwidth in production:
The only alternatives to BFQ for controlling bandwidth and latency with cgroups are the two new I/O controllers made by Facebook people. But, as I show in this presentation (through repeatable tests also in this case), these solutions do not work on any of the several machines and storage devices I've used as a testbed:
The authors of the controllers however claim that their controllers work well on their machines.
mq-deadline does not support any bandwidth or latency control using groups. Actually, mq-deadline performs rather little I/O control in general.
One of the reasons why BFQ succeeds in controlling I/O is exactly that BFQ controls request tags too. In particular, BFQ's mechanism for controlling tags is an extended version of the coarse mechanism available in kyber.
BFQ has not been tuned for multi-queue, very fast devices yet. So, for the moment, it may have regressions on NVMe devices.
Let's now get to the problem reported by Josef. I have no idea why these high latencies are occurring. There is no precise description of the workload or of the setting in this thread, so I have no chance to reproduce it. For sure, mq-deadline performs no tag control at all, AFAIK. In cases like this, i.e., when a logic has a problem with a workload and some other solution has no logic at all, the second solution works better mainly because of luck. It has already happened in favor of BFQ too, a few times, for no technical merit.
IMO the right way to go is simply to find out why BFQ's tag-handling logic is failing in this case. It should then be easy and quick to fix the bug. Or at least it has been so, in all cases, over the last 10+ years.
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.