Created attachment 1730968
99th fsync latency 4.5
We have observed increased fsync I/O latency in etcd as of OCP 4.6 (our tests with 4.7 show the same increase). We haven't had any problems related to this increased latency so far, but we wonder what could have caused it.
The attached graphs were generated after running the same benchmark (creating 4K namespaces with a bunch of objects each, at 20 QPS).
The max fsync latency in the 4.5 iteration was 8 ms and 30.6 ms in the 4.6 scenario.
The platform used for both tests was similar: AWS with master nodes using preprovisioned 3000 IOPS NVMe disks (io1).
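For anyone who wants to sanity-check fsync latency directly on a node's disk, outside of etcd, here is a minimal Python sketch. It is not how the attached graphs were produced (those presumably come from etcd's own metrics, such as the `etcd_disk_wal_fsync_duration_seconds` histogram); it only times `os.fsync` on a scratch file and reports a nearest-rank 99th percentile. The function names, the 2 KiB payload, and the sample count are illustrative assumptions.

```python
import math
import os
import tempfile
import time


def measure_fsync_latencies(directory, samples=200, payload=b"x" * 2048):
    """Append `payload` and fsync `samples` times; return per-call latencies in ms.

    Assumption: small sequential appends followed by fsync roughly resemble
    a write-ahead-log workload. This is not etcd's actual code path.
    """
    latencies_ms = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            f.write(payload)
            f.flush()  # move data from Python's buffer into the kernel
            start = time.perf_counter()
            os.fsync(f.fileno())  # ask the kernel to persist it to the device
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    lat = measure_fsync_latencies(tempfile.gettempdir())
    print(f"p99 fsync latency: {percentile(lat, 99):.2f} ms, max: {max(lat):.2f} ms")
```

To compare two OS versions, point `directory` at a path on the same disk etcd uses, and run it under comparable load on each version.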
Created attachment 1730969
99th fsync latency 4.6
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact? Is it serious enough to warrant blocking edges?
example: Up to 2 minute disruption in edge routing
example: Up to 90 seconds of API downtime
example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
example: Issue resolves itself after five minutes
example: Admin uses oc to fix things
example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
example: No, it’s always been like this, we just never noticed
example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1
In RHEL8 the default is mq-deadline...except for NVMe devices, where it's "none".
Also xref https://github.com/systemd/systemd/pull/13321/ which changed the default to bfq...but also skipped NVMe.
I suspect at one point you switched instance types from an old one (e.g. m4) to a modern one like m5 which uses NVMe for EBS.
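To check which scheduler a node is actually using, the kernel exposes one line per block device in `/sys/block/<device>/queue/scheduler`, with the active scheduler in square brackets (for example `[none] mq-deadline kyber bfq`). A minimal Python sketch for reading it follows; the helper names are hypothetical, and the handling of a bare `none` line is a defensive assumption about kernels that offer no other choice.

```python
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from a sysfs line like '[none] mq-deadline bfq'."""
    tokens = text.split()
    available = [t.strip("[]") for t in tokens]
    active = next(
        (t[1:-1] for t in tokens if t.startswith("[") and t.endswith("]")),
        None,
    )
    # Assumption: a bare "none" with no brackets means no scheduler is in use.
    if active is None and available == ["none"]:
        active = "none"
    return active, available


def active_schedulers(sys_block=Path("/sys/block")):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for sched_file in sorted(sys_block.glob("*/queue/scheduler")):
        try:
            active, _ = parse_scheduler(sched_file.read_text())
        except OSError:
            continue  # device went away or is not readable; skip it
        result[sched_file.parent.parent.name] = active
    return result


if __name__ == "__main__":
    for device, scheduler in active_schedulers().items():
        print(f"{device}: {scheduler}")
```

On an instance where EBS volumes are exposed as NVMe devices, the `nvme*` entries would be expected to report `none` under the defaults described above.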
(In reply to Colin Walters from comment #8)
> In RHEL8 the default is mq-deadline...except for NVMe devices, where it's
> "none".
> Also xref https://github.com/systemd/systemd/pull/13321/ which changed the
> default to bfq...but also skipped NVMe.
> I suspect at one point you switched instance types from an old one (e.g. m4)
> to a modern one like m5 which uses NVMe for EBS.
As discussed internally, the control plane instance type was r5.4xlarge using preprovisioned 3000 IOPS NVMe disks (io1). So the default scheduler is indeed none.
Are there any more open questions? If not, this will be closed as verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1899600#c7
No, I don't believe there are any outstanding questions at this point. This should be ready to be verified.
I've confirmed that the scheduler is set per the new expected behavior.
One thing I overlooked in discussions around this is that the logic to set bfq lives in the MCD which is a daemonset. So during an upgrade we will actually roll out the default bfq switch to *all* control plane nodes still running 4.5 OS version, even before we've started OS updates.
Once the fix merges into the MCO, we should probably just direct 4.5 people to upgrade to the latest 4.6.X and skip previous versions (but they should do that anyways of course).
The change in scheduler behaviour should be documented as a bugfix. @Colin, could you add it to the Doc Text field above?
Added doc text. Feel free to comment.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.