Bug 1899600 - Increased etcd fsync latency as of OCP 4.6 [NEEDINFO]
Summary: Increased etcd fsync latency as of OCP 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard: aos-scalability-46
Depends On:
Blocks: 1900666
 
Reported: 2020-11-19 16:15 UTC by Raul Sevilla
Modified: 2021-02-24 15:35 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In OCP 4.6, a change was made to use the BFQ (Budget Fair Queueing) Linux I/O scheduler. As a consequence, there was an increased fsync I/O latency in etcd. For OCP 4.7, the I/O scheduler has been changed to mq-deadline (except for NVMe devices, which are configured to not use an I/O scheduler). For RHEL Core OS updates, the BFQ scheduler is still used. As a result, latency times have been reduced to acceptable levels.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:35:02 UTC
Target Upstream Version:
sdodson: needinfo? (amurdaca)
jerzhang: needinfo? (walters)


Attachments
99th fsync latency 4.5 (91.97 KB, image/png)
2020-11-19 16:15 UTC, Raul Sevilla
99th fsync latency 4.6 (106.94 KB, image/png)
2020-11-19 16:16 UTC, Raul Sevilla


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2243 | 0 | None | closed | Bug 1899600: daemon: Only switch to bfq scheduler when we have an OS update | 2021-02-15 20:24:35 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:35:31 UTC

Description Raul Sevilla 2020-11-19 16:15:38 UTC
Created attachment 1730968
99th fsync latency 4.5

Hi, 

We have observed increased fsync I/O latency in etcd as of OCP 4.6 (our tests with 4.7 also show this increase). We haven't had any problems related to this increased latency so far, but we wonder what could have caused it.

The attached graphs were generated by running the same benchmark against each version (creating 4K namespaces, each with a bunch of objects, at 20 QPS).

The max fsync latency was 8 ms in the 4.5 iteration and 30.6 ms in the 4.6 scenario.

The platform used for both tests was similar: AWS with master nodes using preprovisioned 3000 IOPS NVMe disks (io1).
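
For anyone who wants a rough fsync number from a node without going through the etcd graphs, here is a minimal sketch. It is not how etcd measures its own WAL fsync latency; it simply appends to a temporary file, times each os.fsync call, and reports a nearest-rank 99th percentile. The function names, the 4 KiB payload, and the sample count are assumptions chosen for illustration.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, samples=200, payload=b"x" * 4096):
    """Append `payload` and fsync it `samples` times; return latencies in ms."""
    latencies_ms = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            f.write(payload)
            f.flush()  # move Python's userspace buffer into the kernel
            start = time.perf_counter()
            os.fsync(f.fileno())  # ask the kernel to persist the data
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: the temp directory sits on the disk you care about.
    results = measure_fsync_latencies(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(results, 99):.2f} ms")
```

The result is only meaningful if the target directory lives on the same device as the etcd data, and it does not reproduce the load pattern of the benchmark described above.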

Comment 1 Raul Sevilla 2020-11-19 16:16:10 UTC
Created attachment 1730969
99th fsync latency 4.6

Comment 5 Scott Dodson 2020-11-20 21:18:34 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 8 Colin Walters 2020-11-24 14:03:56 UTC
In RHEL8 the default is mq-deadline...except for NVMe devices, where it's "none".

Also xref https://github.com/systemd/systemd/pull/13321/ which changed the default to bfq...but also skipped NVMe.

I suspect at one point you switched instance types from an old one (e.g. m4) to a modern one like m5 which uses NVMe for EBS.
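
To check which scheduler a given node is actually using, the kernel exposes it in /sys/block/<device>/queue/scheduler, with the active entry shown in square brackets. A minimal sketch of reading it follows; the function names are made up for illustration, and the handling of a bare single-entry file (no brackets) is an assumption about older kernels.

```python
from pathlib import Path


def parse_active_scheduler(sysfs_text):
    """Return the active scheduler from the contents of queue/scheduler.

    The active entry is the one in [brackets], for example
    'mq-deadline kyber [bfq] none' gives 'bfq'.
    """
    tokens = sysfs_text.split()
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Assumption: a file with a single unbracketed entry names the
    # only (and therefore active) choice.
    if len(tokens) == 1:
        return tokens[0]
    return None


def active_scheduler(device):
    """Read the active scheduler for a block device name such as 'nvme0n1'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())
```

On an NVMe-backed control plane node with RHEL 8 defaults, active_scheduler("nvme0n1") would be expected to return "none"; the device name here is only an example.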

Comment 9 Raul Sevilla 2020-11-24 14:48:05 UTC
(In reply to Colin Walters from comment #8)
> In RHEL8 the default is mq-deadline...except for NVMe devices, where it's
> "none".
> 
> Also xref https://github.com/systemd/systemd/pull/13321/ which changed the
> default to bfq...but also skipped NVMe.
> 
> I suspect at one point you switched instance types from an old one (e.g. m4)
> to a modern one like m5 which uses NVMe for EBS.

As discussed internally, the control plane instance type was r5.4xlarge using preprovisioned 3000 IOPS NVMe disks (io1), so the default scheduler is indeed none.

Comment 10 Michael Nguyen 2020-11-24 16:50:45 UTC
Are there any more open questions?  If not, this will be closed as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1899600#c7

Comment 11 Scott Dodson 2020-11-25 02:22:54 UTC
No, I don't believe there are any outstanding questions at this point. This should be ready to be verified.

Comment 12 Scott Dodson 2020-11-25 15:41:18 UTC
I've confirmed that the scheduler is set per the new expected behavior.

Comment 13 Colin Walters 2020-11-25 16:40:41 UTC
One thing I overlooked in discussions around this is that the logic to set bfq lives in the MCD, which is a daemonset.  So during an upgrade we will actually roll out the default bfq switch to *all* control plane nodes still running the 4.5 OS version, even before we've started OS updates.

Once the fix merges into the MCO, we should probably just direct 4.5 people to upgrade to the latest 4.6.X and skip previous versions (but they should do that anyways of course).
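
As a rough model of the behavior after the fix (bfq only while an OS update is being applied, otherwise the RHEL 8 defaults from comment 8), something like the sketch below. This is not the MCD code, which is written in Go; desired_scheduler and its parameters are hypothetical, and applying bfq to NVMe devices during an update is an assumption rather than something confirmed in this bug.

```python
def desired_scheduler(device, os_update_pending):
    """Model the scheduler a device should end up with after the fix.

    device: block device name, e.g. 'nvme0n1' or 'sda' (illustrative).
    os_update_pending: True only while an OS update is being applied.
    """
    if os_update_pending:
        # Assumption: bfq is used for the duration of the OS update,
        # regardless of device type.
        return "bfq"
    if device.startswith("nvme"):
        # RHEL 8 default for NVMe devices: no I/O scheduler.
        return "none"
    # RHEL 8 default for everything else.
    return "mq-deadline"
```

Under this model, a 4.5 control plane node that merely receives the new daemonset (os_update_pending is False) keeps its default scheduler instead of being switched to bfq.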

Comment 15 Yu Qi Zhang 2021-01-06 16:55:29 UTC
The change in scheduler behaviour should be documented as a bugfix. @Colin could you add to the Doc Text field above?

Comment 16 Michael Burke 2021-02-16 14:49:50 UTC
Added doc text. Feel free to comment.

Comment 18 errata-xmlrpc 2021-02-24 15:35:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

