Bug 1900666

Summary: Increased etcd fsync latency as of OCP 4.6
Product: OpenShift Container Platform
Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: urgent
Priority: high
Version: 4.6
CC: amurdaca, jeder, jhopper, jhou, kgarriso, miabbott, mifiedle, nelluri, oarribas, sbatsche, sdodson, wking, wlewis
Keywords: Performance, Regression, ServiceDeliveryBlocker
Target Release: 4.6.z
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2020-12-14 13:51:18 UTC
Bug Depends On: 1899600

Comment 3 Sam Batschelet 2020-11-30 17:09:16 UTC
*** Bug 1894272 has been marked as a duplicate of this bug. ***

Comment 5 Michael Nguyen 2020-12-04 14:23:00 UTC
Verified on 4.6.0-0.nightly-2020-12-04-033739. The master no longer runs with bfq as the active scheduler; the bracketed (active) entry in the output below is none.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-04-033739   True        False         12m     Cluster version is 4.6.0-0.nightly-2020-12-04-033739
$ oc get nodes | grep master
ip-10-0-157-110.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
ip-10-0-161-30.us-west-2.compute.internal    Ready    master   37m   v1.19.0+1348ff8
ip-10-0-216-159.us-west-2.compute.internal   Ready    master   37m   v1.19.0+1348ff8
$ oc debug node/ip-10-0-157-110.us-west-2.compute.internal
Starting pod/ip-10-0-157-110us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# cat /sys/block/nvme0n1/queue/scheduler 
[none] mq-deadline kyber bfq
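
The scheduler file lists every available scheduler and marks the active one with square brackets, so the check above can be scripted instead of read by eye. The following is a minimal sketch, not part of the fix; the function name `active_scheduler` is made up for illustration.

```shell
# Print the active I/O scheduler from one line of a sysfs scheduler file.
# The kernel marks the active entry with square brackets.
# active_scheduler is a hypothetical helper name, not an existing tool.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Sample lines copied from the transcripts in this bug.
active_scheduler '[none] mq-deadline kyber bfq'
active_scheduler 'mq-deadline kyber [bfq] none'
```

On a node, the same helper could be fed the live value, for example `active_scheduler "$(cat /sys/block/nvme0n1/queue/scheduler)"`.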

Comment 7 Micah Abbott 2020-12-04 16:02:20 UTC
I also tested this with 4.6.7 on GCP.

The PR switches the scheduler on *all* nodes during OS updates, including when switching to kernel-rt.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.7     True        False         2m5s    Cluster version is 4.6.7

$ oc get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready    master   40m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready    worker   30m   v1.19.0+1348ff8
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready    worker   34m   v1.19.0+1348ff8

$ cat machineConfigs/worker-realtime.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig                                                                                                    
metadata:                                                                                                                                                                                                                                     
  labels:                                                                                                              
    machineconfiguration.openshift.io/role: "worker"                                                                   
  name: 99-worker-kerneltype                                                                                           
spec:                                                                                                                  
  kernelType: realtime                                                                                                 

$ oc apply -f machineConfigs/worker-realtime.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-kerneltype created                        
$ oc get nodes                                                                                                                                                                    
NAME                                       STATUS                     ROLES    AGE   VERSION                           
ci-ln-rq1mq4b-f76d1-xf849-master-0         Ready                      master   67m   v1.19.0+1348ff8                   
ci-ln-rq1mq4b-f76d1-xf849-master-1         Ready                      master   67m   v1.19.0+1348ff8                   
ci-ln-rq1mq4b-f76d1-xf849-master-2         Ready                      master   67m   v1.19.0+1348ff8                   
ci-ln-rq1mq4b-f76d1-xf849-worker-b-tmklt   Ready                      worker   57m   v1.19.0+1348ff8                   
ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4   Ready,SchedulingDisabled   worker   57m   v1.19.0+1348ff8          
ci-ln-rq1mq4b-f76d1-xf849-worker-d-tccv4   Ready                      worker   61m   v1.19.0+1348ff8

$ oc debug nodes/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4                                                                                                                         
Starting pod/ci-ln-rq1mq4b-f76d1-xf849-worker-c-hqln4-debug ...
To use host binaries, run `chroot /host`                                                                                                                                                                                                      
Pod IP: 10.0.32.4                                                                                                      
If you don't see a command prompt, try pressing enter.                                                                 
sh-4.4# chroot /host                                                                                                                                                                                                                          
sh-4.4# cat /sys/block/sda/queue/scheduler                                                                             
[mq-deadline] kyber bfq none                                                                                           
sh-4.4# watch cat /sys/block/sda/queue/scheduler                                                                                                                                                                                              
sh-4.4# cat /sys/block/sda/queue/scheduler                                                                                                                                                                                                    
mq-deadline kyber [bfq] none                                                                                           
sh-4.4#                                                                                                                
Removing debug pod ...                                                                 
```
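
The transcript above uses `watch` to catch the moment the scheduler changes during the update. A scripted alternative is to poll the file until the bracketed entry matches the expected name. This is only a sketch under assumptions: the function name `wait_for_scheduler`, its argument order, and the default retry count and delay are all hypothetical.

```shell
# Poll a scheduler file until the active (bracketed) entry equals the
# expected name, or give up after a fixed number of attempts.
# Arguments (all hypothetical): FILE WANT [TRIES=60] [DELAY_SECONDS=5]
wait_for_scheduler() {
    file="$1"; want="$2"; tries="${3:-60}"; delay="${4:-5}"
    i=0
    got=""
    while [ "$i" -lt "$tries" ]; do
        # Extract the bracketed entry; an unreadable file yields an empty string.
        got="$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file" 2>/dev/null)"
        if [ "$got" = "$want" ]; then
            echo "active scheduler is $got"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $want (last seen: ${got:-unknown})" >&2
    return 1
}

# Example, run from the chroot on the node being updated:
#   wait_for_scheduler /sys/block/sda/queue/scheduler bfq
```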

Comment 8 Mike Fiedler 2020-12-04 17:00:57 UTC
Verified on 4.6.7 promoted candidate.

Inside etcd pod on 4.6.6:

# cat /sys/class/block/nvme0n1/queue/scheduler
mq-deadline kyber [bfq] none

4.6.7:

# cat /sys/class/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
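
The device name differs between the environments in this bug (`nvme0n1` in comments 5 and 8, `sda` on GCP in comment 7), so a check that walks every block device may be easier to reuse. A minimal sketch follows; the function name `list_schedulers` and its optional root argument are assumptions made for illustration and testing.

```shell
# Print "<device> <active scheduler>" for every block device under a sysfs root.
# The optional first argument (default /sys) is a hypothetical hook that lets
# the function be pointed at a fake tree for testing.
list_schedulers() {
    root="${1:-/sys}"
    for f in "$root"/block/*/queue/scheduler; do
        # Skip the unexpanded glob pattern and unreadable files.
        [ -r "$f" ] || continue
        dev="${f#"$root"/block/}"
        dev="${dev%%/*}"
        active="$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$f")"
        # Files without a bracketed entry are reported as "unknown".
        printf '%s %s\n' "$dev" "${active:-unknown}"
    done
}

# Example, run on the host (after chroot /host) or inside the etcd pod:
#   list_schedulers
```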

Comment 10 errata-xmlrpc 2020-12-14 13:51:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.8 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5259

Comment 11 W. Trevor King 2021-04-05 17:48:05 UTC
Removing UpgradeBlocker from this older bug to take it out of the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475