1899600 – Increased etcd fsync latency as of OCP 4.6

Summary: Increased etcd fsync latency as of OCP 4.6

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Colin Walters
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:	aos-scalability-46
Depends On:
Blocks:	1900666

Reported:	2020-11-19 16:15 UTC by Raul Sevilla
Modified:	2023-09-15 00:51 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	In OCP 4.6, a change was made to use the BFQ (Budget Fair Queueing) Linux I/O scheduler. As a consequence, fsync I/O latency in etcd increased. For OCP 4.7, the I/O scheduler has been changed to mq-deadline (except for NVMe devices, which are configured to not use an I/O scheduler). The BFQ scheduler is still used while RHEL CoreOS updates are applied. As a result, latency has been reduced to acceptable levels.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:35:02 UTC
Target Upstream Version:
Embargoed:

Attachments
99th fsync latency 4.5 (91.97 KB, image/png) 2020-11-19 16:15 UTC, Raul Sevilla	no flags
99th fsync latency 4.6 (106.94 KB, image/png) 2020-11-19 16:16 UTC, Raul Sevilla	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2243	0	None	closed	Bug 1899600: daemon: Only switch to bfq scheduler when we have an OS update	2021-02-15 20:24:35 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:35:31 UTC

Description Raul Sevilla 2020-11-19 16:15:38 UTC

Created attachment 1730968
99th fsync latency 4.5

Hi, 

We have observed increased fsync I/O latency in etcd as of OCP 4.6 (our tests with 4.7 show the same increase). We haven't had any problems related to this increased latency so far, but we wonder what could have caused it.

The attached graphs were generated by running the same benchmark against each version (creating 4K namespaces, each with a bunch of objects, at 20 QPS).

The max fsync latency was 8 ms in the 4.5 iteration and 30.6 ms in the 4.6 iteration.

The platform used for both tests was similar: AWS, with master nodes using preprovisioned 3000 IOPS NVMe disks (io1).
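
For anyone who wants a rough, local comparison of fsync latency between two nodes or scheduler settings, here is a minimal Python sketch. It is not how the attached graphs were produced (those come from etcd's own metrics under the benchmark described above), and the function names, write count, and payload size are all assumptions chosen for illustration. It times `os.fsync` after each small write to a scratch file and reports a nearest-rank 99th percentile.

```python
import os
import tempfile
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, int(-(-len(ordered) * pct // 100)))  # ceiling division
    return ordered[rank - 1]


def measure_fsync_latencies(directory, writes=200, payload_size=4096):
    """Time the fsync that follows each small write; returns seconds per fsync.

    `writes` and `payload_size` are illustrative defaults, not values
    taken from the benchmark in this bug.
    """
    payload = b"\0" * payload_size
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        fd = scratch.fileno()
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # only the flush to stable storage is timed
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Point this at a directory on the disk under test.
    samples = measure_fsync_latencies(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(samples, 99) * 1000:.2f} ms")
```

The absolute numbers from a loop like this will not match etcd's under load; it is only useful for comparing the same disk before and after a scheduler change.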

Comment 1 Raul Sevilla 2020-11-19 16:16:10 UTC

Created attachment 1730969
99th fsync latency 4.6

Comment 5 Scott Dodson 2020-11-20 21:18:34 UTC

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 8 Colin Walters 2020-11-24 14:03:56 UTC

In RHEL8 the default is mq-deadline...except for NVMe devices, where it's "none".

Also xref https://github.com/systemd/systemd/pull/13321/ which changed the default to bfq...but also skipped NVMe.

I suspect at one point you switched instance types from an old one (e.g. m4) to a modern one like m5 which uses NVMe for EBS.
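
To confirm which scheduler a node is actually using, the kernel exposes it in `/sys/block/<device>/queue/scheduler`, with the active one shown in square brackets. A minimal Python sketch for reading it follows; the helper names are made up for illustration and are not part of any OpenShift tooling.

```python
from pathlib import Path


def parse_active_scheduler(contents):
    """Return the bracketed (active) scheduler from a sysfs scheduler line.

    Example input: "[mq-deadline] kyber bfq none"
    """
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some kernels report a bare "none" with no brackets.
    if contents.strip() == "none":
        return "none"
    return None


def active_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler file for one block device, e.g. "nvme0n1"."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())


if __name__ == "__main__":
    # List every block device that exposes a scheduler file.
    for dev in sorted(Path("/sys/block").iterdir()):
        try:
            print(dev.name, active_scheduler(dev.name))
        except OSError:
            # No scheduler file for this device; skip it.
            continue
```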

Comment 9 Raul Sevilla 2020-11-24 14:48:05 UTC

(In reply to Colin Walters from comment #8)
> In RHEL8 the default is mq-deadline...except for NVMe devices, where it's
> "none".
> 
> Also xref https://github.com/systemd/systemd/pull/13321/ which changed the
> default to bfq...but also skipped NVMe.
> 
> I suspect at one point you switched instance types from an old one (e.g. m4)
> to a modern one like m5 which uses NVMe for EBS.

As discussed internally, the control plane instance type was r5.4xlarge, using preprovisioned 3000 IOPS NVMe disks (io1). So the default scheduler is indeed none.

Comment 10 Michael Nguyen 2020-11-24 16:50:45 UTC

Are there any more open questions? If not, this will be closed as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1899600#c7

Comment 11 Scott Dodson 2020-11-25 02:22:54 UTC

No, I don't believe there are any outstanding questions at this point. This should be ready to be verified.

Comment 12 Scott Dodson 2020-11-25 15:41:18 UTC

I've confirmed that the scheduler is set per the new expected behavior.

Comment 13 Colin Walters 2020-11-25 16:40:41 UTC

One thing I overlooked in discussions around this is that the logic to set bfq lives in the MCD, which runs as a daemonset.  So during an upgrade we will actually roll out the default bfq switch to *all* control plane nodes still running the 4.5 OS version, even before we've started OS updates.

Once the fix merges into the MCO, we should probably just direct 4.5 people to upgrade to the latest 4.6.X and skip previous versions (but they should do that anyways of course).
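
For readers following along, the intended behaviour after the fix can be sketched roughly as below. This is an illustrative Python sketch only, not the MCO's actual code; the function names are hypothetical. It encodes the defaults mentioned earlier in this bug ("none" for NVMe, mq-deadline otherwise) plus the idea of using bfq only while an OS update is being applied. Treating NVMe devices the same as other devices during an OS update is an assumption.

```python
def default_scheduler(device):
    """RHEL 8 default as described in this bug: "none" for NVMe, else mq-deadline."""
    return "none" if device.startswith("nvme") else "mq-deadline"


def scheduler_for(device, os_update_in_progress):
    """Pick a scheduler name for a block device (hypothetical helper).

    ASSUMPTION: bfq is applied to every device type while an OS update
    is in progress; the bug does not spell out the NVMe case.
    """
    if os_update_in_progress:
        return "bfq"
    return default_scheduler(device)


if __name__ == "__main__":
    for dev in ("nvme0n1", "sda"):
        for updating in (False, True):
            print(dev, updating, scheduler_for(dev, updating))
```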

Comment 15 Yu Qi Zhang 2021-01-06 16:55:29 UTC

The change in scheduler behaviour should be documented as a bugfix. @Colin could you add to the Doc Text field above?

Comment 16 Michael Burke 2021-02-16 14:49:50 UTC

Added doc text. Feel free to comment.

Comment 18 errata-xmlrpc 2021-02-24 15:35:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 19 W. Trevor King 2021-04-05 17:47:06 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 20 Red Hat Bugzilla 2023-09-15 00:51:30 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
