Bug 1900666 - Increased etcd fsync latency as of OCP 4.6
Summary: Increased etcd fsync latency as of OCP 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1894272
Depends On: 1899600
Blocks:
Reported: 2020-11-23 13:55 UTC by OpenShift BugZilla Robot
Modified: 2021-04-05 17:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-14 13:51:18 UTC
Target Upstream Version:
Embargoed:

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2251 | 0 | None | closed | [release-4.6] Bug 1900666: daemon: Only switch to bfq scheduler when we have an OS update | 2021-02-04 20:40:54 UTC
Red Hat Product Errata | RHSA-2020:5259 | 0 | None | None | None | 2020-12-14 13:51:42 UTC

Comment 3 Sam Batschelet 2020-11-30 17:09:16 UTC
*** Bug 1894272 has been marked as a duplicate of this bug. ***

Comment 5 Michael Nguyen 2020-12-04 14:23:00 UTC
Verified on 4.6.0-0.nightly-2020-12-04-033739.  The scheduler is set to none (not bfq) on master.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-04-033739   True        False         12m     Cluster version is 4.6.0-0.nightly-2020-12-04-033739
$ oc get nodes | grep master
ip-10-0-157-110.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
ip-10-0-161-30.us-west-2.compute.internal    Ready    master   37m   v1.19.0+1348ff8
ip-10-0-216-159.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
$ oc debug node/ip-10-0-157-110.us-west-2.compute.internal
Starting pod/ip-10-0-157-110us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# cat /sys/block/nvme0n1/queue/scheduler 
[none] mq-deadline kyber bfq
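
The kernel marks the active scheduler with square brackets, so the line above means `none` is in use and `bfq` is only available. A minimal sketch of extracting the active entry from such a line follows; the function name `active_scheduler` is hypothetical and not part of the MCO or any CLI.

```shell
# Print the active I/O scheduler from one line of
# /sys/block/<dev>/queue/scheduler. The active entry is the one
# wrapped in square brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# The two forms seen in this bug:
active_scheduler "[none] mq-deadline kyber bfq"   # prints: none
active_scheduler "mq-deadline kyber [bfq] none"   # prints: bfq
```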

Comment 7 Micah Abbott 2020-12-04 16:02:20 UTC
I also tested this with 4.6.7 on GCP

The PR switches the scheduler to bfq on *all* nodes while an OS update is in progress, including a switch to kernel-rt.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.7     True        False         2m5s    Cluster version is 4.6.7

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready    worker   34m   v1.19.0+1348ff8

$ cat machineConfigs/worker-realtime.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: "worker"
  name: 99-worker-kerneltype
spec:
  kernelType: realtime

$ oc apply -f machineConfigs/worker-realtime.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created
$ oc get nodes
NAME                                       STATUS                     ROLES    AGE   VERSION
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready                      master   67m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready                      worker   57m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready,SchedulingDisabled   worker   57m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready                      worker   61m   v1.19.0+1348ff8

$ oc debug nodes/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4
Starting pod/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none
sh-4.4# watch cat /sys/block/sda/queue/scheduler
sh-4.4# cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
sh-4.4#
Removing debug pod ...
```
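
Device names differ between platforms (`sda` here on GCP, `nvme0n1` on the AWS masters), so it can be easier to report every block device at once. This is a rough sketch only: `list_active_schedulers` is a hypothetical helper, and the optional root argument is an assumption added so it can be pointed at a different sysfs mount.

```shell
# Print "<device> <active scheduler>" for every block device under a
# sysfs root (default: /sys/block). The active scheduler is the entry
# shown in square brackets in queue/scheduler.
list_active_schedulers() {
    root="${1:-/sys/block}"
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue            # skip if the glob matched nothing
        dev="${f#"$root"/}"                # drop the root prefix
        dev="${dev%%/*}"                   # keep only the device name
        sched="$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$f")"
        # Fall back to "unknown" when no bracketed entry is present
        printf '%s %s\n' "$dev" "${sched:-unknown}"
    done
}
```

Run as `list_active_schedulers` after `chroot /host` in the debug pod. While the OS update is being applied, the node's disk should report `bfq`, as in the transcript above.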

Comment 8 Mike Fiedler 2020-12-04 17:00:57 UTC
Verified on 4.6.7 promoted candidate.

Inside etcd pod on 4.6.6:

# cat /sys/class/block/nvme0n1/queue/scheduler
mq-deadline kyber [bfq] none

4.6.7:

# cat /sys/class/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
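
The same comparison can be repeated across all masters from outside the cluster. The sketch below is an assumption-laden example, not the verification procedure used for this bug: `check_masters_not_bfq` is a hypothetical name, the default device `nvme0n1` matches the AWS nodes above but will differ elsewhere, and it reads the file through `oc debug` on the node rather than from inside the etcd pod.

```shell
# Return non-zero if any master node has bfq as the active scheduler
# on the given block device (default: nvme0n1, an assumption).
check_masters_not_bfq() {
    dev="${1:-nvme0n1}"
    rc=0
    for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
        line="$(oc debug "$node" -- chroot /host \
            cat "/sys/block/$dev/queue/scheduler" 2>/dev/null)"
        case "$line" in
            *"[bfq]"*) echo "$node: bfq is active"; rc=1 ;;
            *)         echo "$node: $line" ;;
        esac
    done
    return "$rc"
}
```

On a settled cluster with the fix, every master is expected to print a line with something other than `[bfq]` and the function returns 0.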

Comment 10 errata-xmlrpc 2020-12-14 13:51:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.8 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5259

Comment 11 W. Trevor King 2021-04-05 17:48:05 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

