Bug 2074031 - Admins should be able to tune garbage collector aggressiveness (GOGC) for kube-apiserver if necessary
Summary: Admins should be able to tune garbage collector aggressiveness (GOGC) for kube-apiserver if necessary
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Ben Luddy
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-04-11 12:15 UTC by Abu Kashem
Modified: 2023-08-14 08:20 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:57 UTC
Target Upstream Version:
Embargoed:


Links
GitHub: openshift/cluster-kube-apiserver-operator pull 1359 (open): "Bug 2074031: Support tuning GOGC within a limited range." Last updated 2022-06-06 18:41:00 UTC
Red Hat Product Errata: RHSA-2022:5069. Last updated 2022-08-10 11:06:44 UTC

Description Abu Kashem 2022-04-11 12:15:09 UTC
With Go 1.18, kube-apiserver memory usage is high; see https://github.com/kubernetes/kubernetes/issues/108357.

It looks like 'GOGC=63' brings memory usage back to the original level, but that result is from the upstream 5K-node test. See https://github.com/kubernetes/kubernetes/issues/108357#issuecomment-1056901991

Is 'GOGC=63' suitable for OpenShift? We need to do some perf testing specific to OpenShift, find a suitable value for GOGC, and set it appropriately when we build the OpenShift kube-apiserver.
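
For context on what this knob does: GOGC is the Go runtime's garbage collection target percentage. The default of 100 means a collection is triggered when the heap has grown 100% over the live data from the previous cycle; a lower value such as 63 makes the collector run more often and keeps the heap smaller at the cost of extra CPU. A quick way to see this on any Go program is the runtime's gctrace output. The commands below are only an illustrative sketch (the build target is arbitrary and not part of this bug):

$ GODEBUG=gctrace=1 GOGC=100 go build ./... 2>&1 | grep -c '^gc '   # number of GC cycles at the default
$ GODEBUG=gctrace=1 GOGC=63  go build ./... 2>&1 | grep -c '^gc '   # lower GOGC => more frequent, smaller collections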

Comment 1 Abu Kashem 2022-04-11 12:21:32 UTC
I am fine if we go with 'GOGC=63', but I think we should run some sort of scale test with OpenShift and capture some data points in order for us to make this decision.

One option is to wait and see what the scale test run by the perf team (equivalent to the upstream 5K test) shows with and without GOGC defined.

Comment 2 Abu Kashem 2022-04-13 13:51:48 UTC
I think we should address this for all three apiservers (kube-apiserver, openshift-apiserver, and oauth-apiserver).

Comment 3 Ben Luddy 2022-05-26 14:39:08 UTC
Since the effect of the Go 1.18 changes is dependent on load characteristics and won't be uniform across all clusters, the current plan is to provide a knob for admins to tweak GOGC. There should also be a release note / upgrade checklist item so that admins can proactively tune their clusters if necessary.
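
As a purely hypothetical sketch of how an admin might use such a knob (the exact field path and key below are assumptions for illustration, not the implemented API; the real mechanism is whatever the linked cluster-kube-apiserver-operator pull 1359 ships), the override could be applied on the operator resource and rolled out as new static pods:

# Hypothetical example only. spec.unsupportedConfigOverrides does exist on the
# kubeapiserver operator resource, but the "gogc" key here is an assumption.
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"gogc":63}}}'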

Comment 7 Ke Wang 2022-06-24 15:01:16 UTC
Verification steps:

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-23-153912   True        False         3h57m   Cluster version is 4.11.0-0.nightly-2022-06-23-153912

Check the default GOGC setting for kube-apiserver,

$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=100

$ oc get pod kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -oyaml | grep -iA1 gogc
      - name: GOGC
        value: "100"
--
      - name: GOGC
        value: "100"
--
      - name: GOGC
        value: "100"

Change the default GOGC setting for kube-apiserver in the first terminal, 
$ oc get no
NAME                                  STATUS   ROLES    AGE   VERSION
kewang-2411op2-dnk84-master-0         Ready    master   41m   v1.24.0+284d62a
kewang-2411op2-dnk84-master-1         Ready    master   41m   v1.24.0+284d62a
kewang-2411op2-dnk84-master-2         Ready    master   41m   v1.24.0+284d62a
...

$ oc debug node/kewang-2411op2-dnk84-master-0
...
sh-4.4# chroot /host 
sh-4.4# cd /etc/kubernetes/manifests
sh-4.4# vi kube-apiserver-pod.yaml # replace GOGC value with 63 and save 
sh-4.4# mv kube-apiserver-pod.yaml .. # kubelet will shut down the kube-apiserver pod on this node
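
An equivalent non-interactive way to make the edit in the step above (before moving the file out) is a sed one-liner; this assumes the manifest carries the same GOGC env entries shown in the pod spec earlier and that no other env var in it uses the literal value "100":

sh-4.4# sed -i 's/value: "100"/value: "63"/g' kube-apiserver-pod.yaml  # change every GOGC entry from 100 to 63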

Open a second terminal and check the kube-apiserver pods; the pod on master-0 is gone because its manifest was moved away,

$ oc get po -n openshift-kube-apiserver -l apiserver
NAME                                           READY   STATUS    RESTARTS   AGE
kube-apiserver-kewang-2411op2-dnk84-master-1   5/5     Running   0          3h42m
kube-apiserver-kewang-2411op2-dnk84-master-2   5/5     Running   0          3h45m

Move kube-apiserver-pod.yaml back into the manifests directory in the first terminal,

Check the kube-apiserver pods in the second terminal after a while; once the manifest is back, the kube-apiserver pod on master-0 starts up again.
$ oc get po -n openshift-kube-apiserver -l apiserver
NAME                                           READY   STATUS    RESTARTS   AGE
kube-apiserver-kewang-2411op2-dnk84-master-0   3/5     Running   0          7s
kube-apiserver-kewang-2411op2-dnk84-master-1   5/5     Running   0          3h43m
kube-apiserver-kewang-2411op2-dnk84-master-2   5/5     Running   0          3h46m
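
Instead of polling manually, one option (a sketch, not part of the original steps) is to wait until all kube-apiserver pods report Ready before moving on to the next master:

$ oc wait --for=condition=Ready pod -l apiserver -n openshift-kube-apiserver --timeout=15m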

Apply the same steps to the other masters to change their GOGC setting, then check the results,
$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63

$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-1 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63

$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-2 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63
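
The same check can be run against all masters at once; a small sketch reusing the apiserver label selector from above:

$ for p in $(oc get pods -n openshift-kube-apiserver -l apiserver -o name); do echo -n "$p: "; oc exec -n openshift-kube-apiserver "$p" -- printenv GOGC; done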

Check the cluster operators,
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      40m     
baremetal                                  4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
cloud-controller-manager                   4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h43m   
cloud-credential                           4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h44m   
cluster-autoscaler                         4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
config-operator                            4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h42m   
console                                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h21m   
csi-snapshot-controller                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h42m   
dns                                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
etcd                                       4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h30m   
image-registry                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h24m   
ingress                                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h24m   
insights                                   4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h28m   
kube-apiserver                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h29m   
kube-controller-manager                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h38m   
kube-scheduler                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h37m   
kube-storage-version-migrator              4.11.0-0.nightly-2022-06-23-153912   True        False         False      56m     
machine-api                                4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h38m   
machine-approver                           4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
machine-config                             4.11.0-0.nightly-2022-06-23-153912   True        False         False      40m     
marketplace                                4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
monitoring                                 4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h22m   
network                                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h42m   
node-tuning                                4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
openshift-apiserver                        4.11.0-0.nightly-2022-06-23-153912   True        False         False      58m     
openshift-controller-manager               4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h37m   
openshift-samples                          4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h25m   
operator-lifecycle-manager                 4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h41m   
operator-lifecycle-manager-catalog         4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h42m   
operator-lifecycle-manager-packageserver   4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h26m   
service-ca                                 4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h42m   
storage                                    4.11.0-0.nightly-2022-06-23-153912   True        False         False      7h37m   

$ oc adm top node
NAME                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
kewang-2411op2-dnk84-master-0         1583m        21%    10589Mi         71%       
kewang-2411op2-dnk84-master-1         1338m        17%    9318Mi          62%       
kewang-2411op2-dnk84-master-2         688m         9%     4925Mi          33%       
kewang-2411op2-dnk84-worker-0-j2vfq   702m         20%    4115Mi          60%       
kewang-2411op2-dnk84-worker-0-v9vft   1724m        49%    4703Mi          68%

Based on the above, the cluster works well after the GOGC setting was changed, so moving the bug to VERIFIED.

Comment 8 errata-xmlrpc 2022-08-10 11:05:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

