1900666 – Increased etcd fsync latency as of OCP 4.6

Summary: Increased etcd fsync latency as of OCP 4.6

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1894272
Depends On:	1899600
Blocks:

Reported:	2020-11-23 13:55 UTC by OpenShift BugZilla Robot
Modified:	2021-04-05 17:48 UTC
CC List:	13 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-14 13:51:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2251	0	None	closed	[release-4.6] Bug 1900666: daemon: Only switch to bfq scheduler when we have an OS update	2021-02-04 20:40:54 UTC
Red Hat Product Errata	RHSA-2020:5259	0	None	None	None	2020-12-14 13:51:42 UTC

Comment 3 Sam Batschelet 2020-11-30 17:09:16 UTC

*** Bug 1894272 has been marked as a duplicate of this bug. ***

Comment 5 Michael Nguyen 2020-12-04 14:23:00 UTC

Verified on 4.6.0-0.nightly-2020-12-04-033739. The scheduler is no longer set to bfq on the master; the active scheduler is none.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-04-033739   True        False         12m     Cluster version is 4.6.0-0.nightly-2020-12-04-033739
$ oc get nodes | grep master
ip-10-0-157-110.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
ip-10-0-161-30.us-west-2.compute.internal    Ready    master   37m   v1.19.0+1348ff8
ip-10-0-216-159.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
$ oc debug node/ip-10-0-157-110.us-west-2.compute.internal
Starting pod/ip-10-0-157-110us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# cat /sys/block/nvme0n1/queue/scheduler 
[none] mq-deadline kyber bfq
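
In the scheduler file, the entry in square brackets is the active scheduler; the other entries are merely available. A minimal sketch for extracting it in a script follows. The helper name `active_scheduler` is hypothetical and is not part of `oc` or the Machine Config Operator.

```shell
# Print the active I/O scheduler (the bracketed entry) from a scheduler line.
# active_scheduler is a hypothetical helper name used only for illustration.
active_scheduler() {
  # $1: contents of /sys/block/<dev>/queue/scheduler,
  #     for example "[none] mq-deadline kyber bfq"
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}
```

On a node this could be called as `active_scheduler "$(cat /sys/block/nvme0n1/queue/scheduler)"`, with the device name adjusted to the platform.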

Comment 7 Micah Abbott 2020-12-04 16:02:20 UTC

I also tested this with 4.6.7 on GCP.

The PR switches the scheduler to bfq on *all* nodes during OS updates, including when switching to kernel-rt.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.7     True        False         2m5s    Cluster version is 4.6.7

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready    worker   34m   v1.19.0+1348ff8

$ cat machineConfigs/worker-realtime.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ oc apply -f machineConfigs/worker-realtime.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created
$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready                      worker   57m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready,SchedulingDisabled   worker   57m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready                      worker   61m   v1.19.0+1348ff8

$ oc debug nodes/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4
Starting pod/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
sh-4.4# watch cat /sys/block/sda/queue/scheduler
sh-4.4# cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
sh-4.4#
Removing debug pod ...
```
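
The `watch cat` step above can also be scripted. The following is a rough sketch, assuming a POSIX shell on the host; `wait_for_scheduler` and its arguments are hypothetical names, and the 60-second default timeout is an arbitrary choice.

```shell
# Poll a scheduler file until the wanted scheduler is active or a timeout expires.
# wait_for_scheduler is a hypothetical helper name used only for illustration.
# Usage: wait_for_scheduler FILE WANT [TIMEOUT_SECONDS]
wait_for_scheduler() {
  file=$1
  want=$2
  timeout=${3:-60}
  elapsed=0
  while :; do
    # The bracketed entry in the file is the active scheduler.
    case "$(cat "$file")" in
      *"[$want]"*) return 0 ;;
    esac
    # Give up once the timeout has been reached.
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```

For the GCP worker above, this would be `wait_for_scheduler /sys/block/sda/queue/scheduler bfq 300`, run from the `chroot /host` shell while the OS update is in progress.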

Comment 8 Mike Fiedler 2020-12-04 17:00:57 UTC

Verified on 4.6.7 promoted candidate.

Inside etcd pod on 4.6.6:

# cat /sys/class/block/nvme0n1/queue/scheduler
mq-deadline kyber [bfq] none

4.6.7:

# cat /sys/class/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
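
To repeat this check across every control-plane node rather than one pod, something like the sketch below may help. It assumes `oc` is logged in with permission to start debug pods, and that all masters share one root device name; `print_master_schedulers` and the `nvme0n1` default are assumptions for illustration (the GCP nodes in comment 7 use `sda`).

```shell
# Print "<node> <scheduler line>" for every control-plane node.
# print_master_schedulers is a hypothetical helper; the device name is an
# assumption and differs between platforms (nvme0n1 on AWS, sda on GCP above).
print_master_schedulers() {
  device=${1:-nvme0n1}
  for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
    # oc debug writes its "Starting pod ..." messages to stderr; drop them.
    line=$(oc debug "$node" -- chroot /host \
      cat "/sys/block/$device/queue/scheduler" 2>/dev/null)
    printf '%s %s\n' "$node" "$line"
  done
}
```

With the fix in place and no OS update pending, each line would be expected to show a bracketed entry other than bfq.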

Comment 10 errata-xmlrpc 2020-12-14 13:51:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.8 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5259

Comment 11 W. Trevor King 2021-04-05 17:48:05 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
