Bug 1738828 - adopting the BFQ I/O scheduler to boost responsiveness and throughput
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: systemd
Version: 31
Hardware: Unspecified
OS: Unspecified
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
 
Reported: 2019-08-08 09:00 UTC by Paolo
Modified: 2019-08-23 17:28 UTC

Fixed In Version: systemd-243~rc2-1.fc31
Last Closed: 2019-08-23 17:27:20 UTC




Links
GitHub systemd/systemd pull 13321 (last updated 2019-08-14 14:17:39 UTC)

Description Paolo 2019-08-08 09:00:32 UTC
After Laura's suggestion, I'm filing this bug to propose the selection
of BFQ as default I/O scheduler, through systemd. I'm not an expert on
systemd, so I'm not proposing a configuration right away. But I'm
willing to try to make one, if you prefer to have me do it.

To motivate this proposal, I'd like to summarize some of the benefits
of BFQ. For completeness, I'll report BFQ's current limitations too.

These are some of the benefits provided by BFQ, on any type of storage
medium (embedded flash storage, HDDs, SATA or NVMe SSDs, ...) and on
systems ranging from minimal embedded systems to high-end servers:

- Under load, BFQ loads applications up to 20X as fast as any
  other I/O scheduler. In absolute terms, the system is virtually as
  responsive as if it were idle, regardless of the background I/O
  workload. As a concrete example, with writes as background workload
  on a Samsung SSD 970 PRO, gnome-terminal starts in 1.8 seconds with
  BFQ, and in at least 28.7 seconds with the other I/O schedulers [1].

- Soft real-time applications, such as audio and video players or
  audio- and video-streaming applications, enjoy smooth playback or
  streaming, regardless of the background I/O workload [1].

- In multi-client applications---i.e., when multiple clients, groups,
  containers, virtual machines or any other kind of entities compete
  for a shared medium---BFQ reaches from 5X to 10X higher throughput
  than any other solution for guaranteeing bandwidth to each entity
  competing for storage [2].

In addition, BFQ reaches up to 2X higher throughput than the other I/O
schedulers on slow devices, and guarantees high throughput and
responsiveness with code-development tasks. Links to demos and, more
generally, further details can be found on BFQ's homepage [3].

The main limitation of the current version of BFQ is that it is not
suited for drives delivering millions of IOPS. To provide very high
responsiveness and throughput, BFQ implements considerably more
sophisticated logic than the other I/O schedulers. In addition, BFQ
still uses a single scheduler-wide lock. Currently, this limits the
maximum I/O speed that can be reached with BFQ to, e.g., ~500 KIOPS
(i.e., 2 GB/s with 4 KB random I/O) on a laptop CPU, against 800-1000
KIOPS with the other I/O schedulers. We are working on a multi-lock,
parallel version of BFQ, which we expect to submit in the coming months.
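As a quick sanity check on those figures (my arithmetic, not part of the report), converting an IOPS ceiling to a bandwidth figure is just IOPS times request size:

```python
# Throughput in bytes/s = IOPS * request size in bytes.
iops = 500_000            # ~500 KIOPS, BFQ's current ceiling on a laptop CPU
request_bytes = 4 * 1024  # 4 KB random I/O

throughput_bytes = iops * request_bytes
print(f"{throughput_bytes / 1e9:.2f} GB/s")  # -> 2.05 GB/s, i.e. the ~2 GB/s quoted above
```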

That's all. I'm of course willing to answer any question and help with
any step.

[1] https://algo.ing.unimo.it/people/paolo/BFQ/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
[3] https://algo.ing.unimo.it/people/paolo/BFQ

Comment 1 Laura Abbott 2019-08-08 09:52:28 UTC
For clarification, the kernel already has all necessary pieces. This is just the portion to make a policy change. Based on the data provided, I'm for this change if we can have it go through the change process.

Comment 2 Zbigniew Jędrzejewski-Szmek 2019-08-12 18:01:14 UTC
I think that BFQ makes sense as the default. After all, there are many more "small" users,
and the owners of high-throughput machines who might be negatively impacted by this are
better placed to adjust the defaults. They probably already override various kernel defaults anyway.

The question is why the kernel shouldn't do this on its own. My understanding is that
to "enable" BFQ in userspace we'd need to install a udev rule to set ATTR{queue/scheduler}="bfq"
on each device separately. This is of course possible, but it would be nicer to have
CONFIG_IOSCHED_DEFAULT=bfq in the kernel or something like that.
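A minimal sketch of what such a udev rule could look like (the file name and match conditions here are my illustration, not necessarily what an eventual systemd patch would use):

```
# /etc/udev/rules.d/60-block-scheduler.rules (illustrative path)
# Set BFQ on every block device that exposes a scheduler attribute.
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/scheduler}="bfq"
```

Writing ATTR{queue/scheduler} from a rule is equivalent to echoing "bfq" into /sys/block/<dev>/queue/scheduler for each device as it appears.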

Comment 3 Paolo 2019-08-13 07:49:41 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #2)
> I think that BFQ makes sense as the default. After all, there's many more
> "small" users,

Yep, and not only small users.  Probably not so many companies use storage
that does millions of IOPS.

> and
> the owners of high-throughput machines who might be negatively impacted by
> this are better
> placed to adjust the defaults. They probably already override various kernel
> defaults anyway.
> 

Exactly, they do override defaults, and carefully tune their systems for
demanding performance.  Actually, with millions of IOPS, even basic
in-kernel I/O handling is often too heavy, even without any I/O
scheduling.  The most recent evidence of this is the new io_uring effort.

> The question is why the kernel should not do this on its own? My
> understanding is that
> to "enable" BFQ in userspace we'd need to install a udev rule to set
> ATTR{queue/scheduler}="bfq"
> on each device separately. This is of course possible, but it would be nicer
> to have
> CONFIG_IOSCHED_DEFAULT=bfq in the kernel or something like that.

Such a default-scheduler option was present, but it was removed by
the block-layer maintainer, Jens Axboe.  His motivation is that it
applied across everything, while this has to be a per-device setting.  I
think his argument is flawed, in that there is *always* a default,
even if an option with that name is removed.  In particular, the
selection of mq-deadline as the default is now even hardwired in the
block layer.  As a conclusion of the (public) discussions on this
topic, Jens replied that it is one of the distros' tasks to configure the
right I/O scheduler.  In a sense, this is the root reason why I opened
this bug.

Comment 4 Ben Cotton 2019-08-13 17:03:40 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 5 Zbigniew Jędrzejewski-Szmek 2019-08-19 19:01:09 UTC
After the discussion in systemd upstream, I'll add the patch from this PR as a patch in Fedora (F31+).

Comment 6 Bryan Gurney 2019-08-22 21:41:01 UTC
There may be an issue with Fedora systems that are using host-managed zoned block devices, since the only scheduler that knows how to "keep writes sequential" for zoned devices is the mq-deadline scheduler.

Before Fedora switched to the multi-queue schedulers, it defaulted to cfq, and I had to add this udev rule to ensure that the host-managed zoned block devices on my test system were using the "deadline" scheduler:

$ cat /etc/udev/rules.d/99-zoned-block-devices.rules 
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"

After the 5.0 kernel changed to only multi-queue schedulers, the default was "mq-deadline", so I wouldn't have needed this udev rule.  But if the scheduler defaults to something other than "mq-deadline", the scheduler may attempt to schedule a write that could send a SCSI "Aborted command" error, which may cause problems for drivers attempting to use the drive.

Other GNU/Linux distributions may run into the same problem if they choose a default scheduler that's not "mq-deadline".

Comment 7 Zbigniew Jędrzejewski-Szmek 2019-08-23 08:57:21 UTC
This rule would still work.

Comment 8 Bryan Gurney 2019-08-23 12:28:48 UTC
Right, but if this rule _isn't_ installed, a user with a host-managed zoned block device (for example, a Shingled Magnetic Recording hard drive) could end up having the scheduler reorder writes, which would then appear as "mysterious" errors from the device.  After being bombarded with hundreds or thousands of these errors, more serious problems may occur, including a kernel oops.

I know this, because it happened to test systems of mine last year, with the cfq scheduler.  I don't want to see a user experience the same thing.

Last year, there was a patch proposed to the Linux kernel ( https://patchwork.kernel.org/patch/10641923/ ) that (among other things) could have ensured that host-managed zoned block devices were configured to use the mq-deadline scheduler, the only scheduler that currently knows to keep writes sequential in "sequential-only" zones.  This patch was never merged.

Therefore, now it's up to udev to prevent the scenario I mentioned.  Ideally, the rule to ensure that host-managed zoned block devices use mq-deadline would be a default setting.

Comment 9 Paolo 2019-08-23 13:38:40 UTC
I've waited a little bit before adding a comment.

(In reply to Bryan Gurney from comment #8)
> Right, but if this rule _isn't_ installed, a user with a host-managed zoned
> block device (for example, a Shingled Magnetic Recording hard drive) could
> end up having the scheduler reorder writes, which would then appear as
> "mysterious" errors from the device.  After being bombarded with hundreds or
> thousands of these errors, more serious problems may occur, including a
> kernel oops.
> 
> I know this, because it happened to test systems of mine last year, with the
> cfq scheduler.  I don't want to see a user experience the same thing.
> 
> Last year, there was a patch proposed to the Linux kernel (
> https://patchwork.kernel.org/patch/10641923/ ) that (among other things)
> could have ensured that host-managed zoned block devices were configured to
> use the mq-deadline scheduler, the only scheduler that currently knows to
> keep writes sequential in "sequential-only" zones.  This patch was never
> merged.
> 

FYI, a new patch series implementing this automatic in-kernel
protection was proposed about twelve hours ago.  Unfortunately, the
thread doesn't seem to be available in any linux-block archive yet
(the subject of the cover letter is "[PATCH 0/7] Elevator cleanups and
improvements linux-block").  However, even if the new patch series is
luckier than the old one, this change will appear only in later kernel
versions.

It seems rather easy to add support for zoned block devices to BFQ,
and it is on my TODO list.  The problem is just my limited
single-person bandwidth :) If BFQ continues to be successful, I can
probably get to it in the next 3-5 months.

> Therefore, now it's up to udev to prevent the scenario I mentioned. 
> Ideally, the rule to ensure that host-managed zoned block devices use
> mq-deadline would be a default setting.

So, why don't we just extend the rule for switching to BFQ with a
rule for sticking to mq-deadline on zoned block devices?  Regardless
of BFQ, I suspect that the very lack of any protection rule for zoned
block devices may be dangerous for Fedora users.  Or is such a rule
already in place, and I'm simply misunderstanding the problem?  :)

Comment 10 Bryan Gurney 2019-08-23 14:30:30 UTC
(In reply to Paolo from comment #9)
> I've waited a little bit before adding a comment.
> 
> (In reply to Bryan Gurney from comment #8)
> > Right, but if this rule _isn't_ installed, a user with a host-managed zoned
> > block device (for example, a Shingled Magnetic Recording hard drive) could
> > end up having the scheduler reorder writes, which would then appear as
> > "mysterious" errors from the device.  After being bombarded with hundreds or
> > thousands of these errors, more serious problems may occur, including a
> > kernel oops.
> > 
> > I know this, because it happened to test systems of mine last year, with the
> > cfq scheduler.  I don't want to see a user experience the same thing.
> > 
> > Last year, there was a patch proposed to the Linux kernel (
> > https://patchwork.kernel.org/patch/10641923/ ) that (among other things)
> > could have ensured that host-managed zoned block devices were configured to
> > use the mq-deadline scheduler, the only scheduler that currently knows to
> > keep writes sequential in "sequential-only" zones.  This patch was never
> > merged.
> > 
> 
> FYI, a new patch series implementing this automatic in-kernel
> protection was proposed about twelve hours ago.  Unfortunately, the
> thread doesn't seem to be available in any linux-block archive yet
> (the subject of the cover letter is "[PATCH 0/7] Elevator cleanups and
> improvements linux-block").  However, even if the new patch series is
> luckier than the old one, this change will appear only in later kernel
> versions.
> 

I just spotted Damien Le Moal's series an hour ago when I checked my email; I like what I see so far.  But yes, it could be a matter of months between the time the patches are merged, and when a "dnf update" on a Fedora system installs the kernel version with these updates.

> It seems rather easy to add support for zoned block devices to BFQ,
> and it is on my TODO list.  The problem is just my limited
> single-person bandwidth :) If BFQ continues to be successful, I can
> probably get to it in the next 3-5 months.
> 
> > Therefore, now it's up to udev to prevent the scenario I mentioned. 
> > Ideally, the rule to ensure that host-managed zoned block devices use
> > mq-deadline would be a default setting.
> 
> So, why don't we just extend the rule for switching to BFQ with a
> rule for sticking to mq-deadline on zoned block devices?  Regardless
> of BFQ, I suspect that the very lack of any protection rule for zoned
> block devices may be dangerous for Fedora users.  Or is such a rule
> already in place, and I'm simply misunderstanding the problem?  :)

As far as I know, there's no such "protection rule".  Last year, Fedora had CONFIG_SCSI_MQ_DEFAULT disabled; I believe that lasted until the 5.0 kernel, when only the multi-queue schedulers were available.

If we can have the rule default to BFQ, but change host-managed zoned block devices to mq-deadline, that would be a good "safety mechanism" until the elevator features in the kernel patches above are widely available.
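That combination could be sketched as a single rules file (file name is illustrative; note that rule order matters here, since the later, more specific match overwrites the scheduler attribute set by the first rule):

```
# /etc/udev/rules.d/60-block-scheduler.rules (illustrative)
# Default every block device to bfq...
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/scheduler}="bfq"
# ...but keep host-managed zoned devices on mq-deadline, currently the
# only scheduler that preserves sequential write order within zones.
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/zoned}=="host-managed", ATTR{queue/scheduler}="mq-deadline"
```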

Comment 11 Zbigniew Jędrzejewski-Szmek 2019-08-23 17:27:20 UTC
If somebody can provide such a patch for the udev rule, that'd be great.
I'll close this bug though, since the main part is already implemented.

