After Laura's suggestion, I'm filing this bug to propose selecting BFQ as the default I/O scheduler, through systemd. I'm not an expert on systemd, so I'm not proposing a configuration right away. But I'm willing to try to make one, if you prefer to have me do it.

To motivate this proposal, I'd like to summarize some of the benefits of BFQ. For completeness, I'll report BFQ's current limitations too.

These are some of the benefits provided by BFQ, on any type of storage medium (embedded flash storage, HDDs, SATA or NVMe SSDs, ...) and on systems ranging from minimal embedded systems to high-end servers:

- Under load, BFQ loads applications up to 20X as fast as any other I/O scheduler. In absolute terms, the system is virtually as responsive as if it were idle, regardless of the background I/O workload. As a concrete example, with writes as background workload on a Samsung SSD 970 PRO, gnome-terminal starts in 1.8 seconds with BFQ, and in at least 28.7 seconds with the other I/O schedulers [1].

- Soft real-time applications, such as audio and video players or audio- and video-streaming applications, enjoy smooth playback or streaming, regardless of the background I/O workload [1].

- In multi-client applications (i.e., when multiple clients, groups, containers, virtual machines or any other kind of entities compete for a shared medium), BFQ reaches from 5X to 10X higher throughput than any other solution for guaranteeing bandwidth to each entity competing for storage [2].

In addition, BFQ reaches up to 2X higher throughput than the other I/O schedulers on slow devices, and guarantees high throughput and responsiveness with code-development tasks.

Links to demos and, in general, more details are on BFQ's homepage [3].

The main limitation of the current version of BFQ is that it is not suited for drives delivering millions of IOPS. To provide very high responsiveness and throughput, BFQ implements considerably more sophisticated logic than the other I/O schedulers. In addition, BFQ still uses a single scheduler-wide lock. Currently, this limits the maximum I/O speed that can be reached with BFQ to, e.g., ~500 KIOPS (i.e., 2 GB/s with 4 KB random I/O) on a laptop CPU, against 800-1000 KIOPS with the other I/O schedulers. We are working on a multi-lock, parallel version of BFQ. We expect to submit it in the coming months.

That's all. I'm of course willing to answer any question and help with any step.

[1] https://algo.ing.unimo.it/people/paolo/BFQ/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
[3] https://algo.ing.unimo.it/people/paolo/BFQ
For clarification, the kernel already has all necessary pieces. This is just the portion to make a policy change. Based on the data provided, I'm for this change if we can have it go through the change process.
I think that BFQ makes sense as the default. After all, there are many more "small" users, and the owners of high-throughput machines who might be negatively impacted by this are better placed to adjust the defaults. They probably already override various kernel defaults anyway.

The question is why the kernel shouldn't do this on its own. My understanding is that to "enable" BFQ in userspace we'd need to install a udev rule to set ATTR{queue/scheduler}="bfq" on each device separately. This is of course possible, but it would be nicer to have CONFIG_IOSCHED_DEFAULT=bfq in the kernel or something like that.
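For illustration, the per-device setting that such a udev rule performs amounts to writing a scheduler name into the device's queue/scheduler attribute in sysfs. The following is a minimal Python sketch of that operation, not what udev or systemd actually runs. It assumes the usual layout under /sys/block and the file format in which the active scheduler is shown in brackets (e.g. "mq-deadline kyber [bfq] none"); the function names are hypothetical.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split e.g. 'mq-deadline kyber [bfq] none' into (available, active)."""
    available, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets marking the active one
            active = token
        available.append(token)
    return available, active


def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Select scheduler `name` for `device` (e.g. 'sda'); needs root on a real system."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    available, _ = parse_scheduler_line(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} is not offered for {device}: {available}")
    path.write_text(name)


if __name__ == "__main__":
    # Read-only demo: show the current choice for every block device.
    for dev in sorted(Path("/sys/block").iterdir()):
        sched = dev / "queue" / "scheduler"
        if sched.is_file():
            print(dev.name, parse_scheduler_line(sched.read_text()))
```

Because the write has to be repeated for every device and after every boot, a udev rule (or a kernel default) is the practical place for it; the sketch only shows what the rule boils down to.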
(In reply to Zbigniew Jędrzejewski-Szmek from comment #2)
> I think that BFQ makes sense as the default. After all, there's many more
> "small" users,

Yep, and not only small users. Probably not so many companies use storage that does millions of IOPS.

> and the owners of high-throughput machines who might be negatively impacted
> by this are better placed to adjust the defaults. They probably already
> override various kernel defaults anyway.

Exactly, they do override defaults, and accurately tune their systems for demanding performance. Actually, with millions of IOPS, the in-kernel I/O handling itself is often too heavy, even with no I/O scheduling. The most recent evidence of this is the new io_uring effort.

> The question is why the kernel should not do this on its own? My
> understanding is that to "enable" BFQ in userspace we'd need to install a
> udev rule to set ATTR{queue/scheduler}="bfq" on each device separately.
> This is of course possible, but it would be nicer to have
> CONFIG_IOSCHED_DEFAULT=bfq in the kernel or something like that.

Such a default-scheduler option was present, but it was removed by the block-layer maintainer, Jens Axboe. His motivation was that it applied across everything, while this has to be a per-device setting. I think his argument is flawed, in that there is *always* a default, even if an option with that name is removed. In particular, the selection of mq-deadline as default is now even hardwired in the block layer.

As a conclusion of the (public) discussions on this topic, Jens replied that it's one of the distros' tasks to configure the right I/O scheduler. In a sense, this is the root reason why I opened this bug.
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle. Changing version to 31.
After the discussion in systemd upstream, I'll add the patch from this PR as a patch in Fedora (F31+).
There may be an issue with Fedora systems that are using host-managed zoned block devices, since the only scheduler that knows how to "keep writes sequential" for zoned devices is the mq-deadline scheduler.

Before Fedora switched to the multi-queue schedulers, it defaulted to cfq, and I had to add this udev rule to ensure that the host-managed zoned block devices on my test system were using the "deadline" scheduler:

$ cat /etc/udev/rules.d/99-zoned-block-devices.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"

After the 5.0 kernel changed to only multi-queue schedulers, the default was "mq-deadline", so I wouldn't have needed this udev rule. But if the scheduler defaults to something other than "mq-deadline", the scheduler may reorder writes, and an out-of-order write can make the drive return a SCSI "Aborted command" error, which may cause problems for drivers attempting to use the drive.

Other GNU/Linux distributions may run into the same problem if they choose a default scheduler that's not "mq-deadline".
This rule would still work.
Right, but if this rule _isn't_ installed, a user with a host-managed zoned block device (for example, a Shingled Magnetic Recording hard drive) could end up having the scheduler reorder writes, which would then appear as "mysterious" errors from the device. After being bombarded with hundreds or thousands of these errors, more serious problems may occur, including a kernel oops.

I know this, because it happened to test systems of mine last year, with the cfq scheduler. I don't want to see a user experience the same thing.

Last year, there was a patch proposed to the Linux kernel ( https://patchwork.kernel.org/patch/10641923/ ) that (among other things) could have ensured that host-managed zoned block devices were configured to use the mq-deadline scheduler, the only scheduler that currently knows to keep writes sequential in "sequential-only" zones. This patch was never merged.

Therefore, now it's up to udev to prevent the scenario I mentioned. Ideally, the rule to ensure that host-managed zoned block devices use mq-deadline would be a default setting.
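To check whether a given machine is exposed to this scenario, one can compare each device's queue/zoned and queue/scheduler attributes in sysfs. Below is a minimal Python sketch under these assumptions: devices are listed under /sys/block, queue/zoned reads "host-managed" for the affected drives, and the active scheduler is the bracketed entry in queue/scheduler. The function names are hypothetical.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) entry of a queue/scheduler line, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def unprotected_zoned_devices(sysfs_root="/sys/block"):
    """List (device, scheduler) for host-managed zoned devices not on mq-deadline."""
    risky = []
    for dev in sorted(Path(sysfs_root).iterdir()):
        zoned = dev / "queue" / "zoned"
        sched = dev / "queue" / "scheduler"
        if not (zoned.is_file() and sched.is_file()):
            continue  # attribute missing, e.g. no zoned-device support
        if zoned.read_text().strip() != "host-managed":
            continue  # regular device: any scheduler is acceptable here
        current = active_scheduler(sched.read_text())
        if current != "mq-deadline":
            risky.append((dev.name, current))
    return risky


if __name__ == "__main__":
    for name, sched in unprotected_zoned_devices():
        print(f"{name}: host-managed zoned device using {sched!r}")
```

An empty result means no host-managed zoned device is currently running under a scheduler other than mq-deadline; it says nothing about what a later udev rule or reboot will select.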
I've waited a little bit before adding a comment, because the protection

(In reply to Bryan Gurney from comment #8)
> Right, but if this rule _isn't_ installed, a user with a host-managed zoned
> block device (for example, a Shingled Magnetic Recording hard drive) could
> end up having the scheduler reorder writes, which would then appear as
> "mysterious" errors from the device. After being bombarded with hundreds or
> thousands of these errors, more serious problems may occur, including a
> kernel oops.
>
> I know this, because it happened to test systems of mine last year, with the
> cfq scheduler. I don't want to see a user experience the same thing.
>
> Last year, there was a patch proposed to the Linux kernel (
> https://patchwork.kernel.org/patch/10641923/ ) that (among other things)
> could have ensured that host-managed zoned block devices were configured to
> use the mq-deadline scheduler, the only scheduler that currently knows to
> keep writes sequential in "sequential-only" zones. This patch was never
> merged.

FYI, a new patch series implementing this automatic in-kernel protection was proposed about twelve hours ago. Unfortunately, the thread doesn't seem to be available in any linux-block archive yet (the subject of the cover letter is "[PATCH 0/7] Elevator cleanups and improvements linux-block"). However, even if the new patch series is luckier than the old one, this change will appear only in later kernel versions.

It seems rather easy to add support for zoned block devices to BFQ, and it is on my TODO list. The problem is just my limited single-person bandwidth :) If BFQ goes on being successful, I can probably get it done in the next 3-5 months.

> Therefore, now it's up to udev to prevent the scenario I mentioned.
> Ideally, the rule to ensure that host-managed zoned block devices use
> mq-deadline would be a default setting.

So, why don't we just extend the rule for switching to BFQ with a rule for sticking to mq-deadline for zoned block devices? Regardless of BFQ, I guess that the very lack of any protection rule for zoned block devices may be dangerous for Fedora users. Or is such a rule already in place, and I simply misunderstand the problem? :)
(In reply to Paolo from comment #9)
> I've waited a little bit before adding a comment, because the protection(In
> reply to Bryan Gurney from comment #8)
> > Right, but if this rule _isn't_ installed, a user with a host-managed zoned
> > block device (for example, a Shingled Magnetic Recording hard drive) could
> > end up having the scheduler reorder writes, which would then appear as
> > "mysterious" errors from the device. After being bombarded with hundreds or
> > thousands of these errors, more serious problems may occur, including a
> > kernel oops.
> >
> > I know this, because it happened to test systems of mine last year, with the
> > cfq scheduler. I don't want to see a user experience the same thing.
> >
> > Last year, there was a patch proposed to the Linux kernel (
> > https://patchwork.kernel.org/patch/10641923/ ) that (among other things)
> > could have ensured that host-managed zoned block devices were configured to
> > use the mq-deadline scheduler, the only scheduler that currently knows to
> > keep writes sequential in "sequential-only" zones. This patch was never
> > merged.
> >
>
> FYI, a new patch series implementing this automatic in-kernel
> protection was proposed about twelve hours ago. Unfortunately, the
> thread doesn't seem to be available in any linux-block archive yet
> (the subject of the cover letter is "[PATCH 0/7] Elevator cleanups and
> improvements linux-block"). However, even if the new patch series is
> luckier than the old one, this change will appear only in later kernel
> versions.
>

I just spotted Damien Le Moal's series an hour ago when I checked my email; I like what I see so far. But yes, it could be a matter of months between the time the patches are merged, and when a "dnf update" on a Fedora system installs the kernel version with these updates.

> It seems rather easy to add support for zoned block devices to BFQ.
> And it is on my TODO list. The problem is just my limited
> single-person bandwidth :) If BFQ goes on being successful, I might
> probably make it in the next 3-5 months.
>
> > Therefore, now it's up to udev to prevent the scenario I mentioned.
> > Ideally, the rule to ensure that host-managed zoned block devices use
> > mq-deadline would be a default setting.
>
> So, why don't we just enrich the rule for switching to BFQ with the
> rule for sticking to mq-deadline for zoned block devices? Regardless
> of BFQ, I guess that the very lack of any protection rule for zoned
> block device may be dangerous for Fedora users. Or such a rule is
> already in place, and I simply misunderstand the problem? :)

As far as I know, there's no such "protection rule". Last year, Fedora had CONFIG_SCSI_MQ_DEFAULT disabled, I believe until the 5.0 kernel, when only the multi-queue schedulers were available.

If we can have the rule default to BFQ, but change host-managed zoned block devices to mq-deadline, that would be a good "safety mechanism" until the elevator features in the kernel patches above are widely available.
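The policy discussed here (BFQ by default, mq-deadline for host-managed zoned devices) can be written down as a small decision function. This is only a Python illustration of the intended logic, not the udev rule itself. The function name and parameters are hypothetical, and treating "host-aware" devices like regular ones is an assumption, since the thread only discusses host-managed devices.

```python
def pick_scheduler(zoned_model, available,
                   preferred="bfq", zoned_safe="mq-deadline"):
    """Choose a scheduler name for one device, or None to leave it untouched.

    zoned_model: contents of the device's queue/zoned attribute.
    available:   scheduler names offered in queue/scheduler.
    """
    if zoned_model.strip() == "host-managed":
        # Only mq-deadline keeps writes sequential, so never fall back to
        # the preferred scheduler for these devices.
        return zoned_safe if zoned_safe in available else None
    # Assumption: "none" and "host-aware" devices may use the preferred one.
    return preferred if preferred in available else None


if __name__ == "__main__":
    offered = ["mq-deadline", "kyber", "bfq", "none"]
    for model in ("none", "host-managed"):
        print(model, "->", pick_scheduler(model, offered))
```

Returning None rather than some other scheduler when the wanted one is unavailable keeps the kernel's own default in place, which is the conservative choice for a distro-wide rule.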
If somebody can provide such a patch for the udev rule, that'd be great. I'll close this bug though, since the main part is already implemented.