Bug 2074031
| Summary: | Admins should be able to tune garbage collector aggressiveness (GOGC) for kube-apiserver if necessary | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abu Kashem <akashem> |
| Component: | kube-apiserver | Assignee: | Ben Luddy <bluddy> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.11 | CC: | aos-bugs, bluddy, mfojtik, xxia |
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:05:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Abu Kashem
2022-04-11 12:15:09 UTC
I am fine if we go with 'GOGC=63', but I think we should run some sort of scale test with OpenShift and capture some data points in order for us to make this decision. One option is to wait and see what the scale test run by the perf team (equivalent to upstream 5K) shows with and without GOGC defined. I think we should address this for all three apiservers.

Since the effect of the Go 1.18 changes depends on load characteristics and won't be uniform across all clusters, the current plan is to provide a knob for admins to tweak GOGC. There should also be a release note / upgrade checklist item so that admins can proactively tune their clusters if necessary.

Verification steps:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-23-153912 True False 3h57m Cluster version is 4.11.0-0.nightly-2022-06-23-153912
Check the default GOGC setting for kube-apiserver,
$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=100
$ oc get pod kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -oyaml | grep -iA1 gogc
- name: GOGC
value: "100"
--
- name: GOGC
value: "100"
--
- name: GOGC
value: "100"
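The per-pod checks above can be scripted. Here is a sketch using a hypothetical `extract_gogc` helper (not part of the bug report) that parses `printenv` output piped in on stdin; the `oc exec` pipeline is shown only in a comment since it needs a live cluster:

```shell
# Hypothetical helper (illustration only): pull the GOGC value out of
# `printenv` output read from stdin.
extract_gogc() {
  grep -i '^GOGC=' | cut -d= -f2
}

# Against a live cluster this would be fed by oc, e.g.:
#   oc exec "$POD" -n openshift-kube-apiserver -- printenv | extract_gogc
# Demonstrated here on captured output:
printf 'PATH=/usr/bin\nGOGC=100\n' | extract_gogc   # prints 100
```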
Change the default GOGC setting for kube-apiserver in the first terminal,
$ oc get no
NAME STATUS ROLES AGE VERSION
kewang-2411op2-dnk84-master-0 Ready master 41m v1.24.0+284d62a
kewang-2411op2-dnk84-master-1 Ready master 41m v1.24.0+284d62a
kewang-2411op2-dnk84-master-2 Ready master 41m v1.24.0+284d62a
...
$ oc debug node/kewang-2411op2-dnk84-master-0
...
sh-4.4# chroot /host
sh-4.4# cd /etc/kubernetes/manifests
sh-4.4# vi kube-apiserver-pod.yaml # replace GOGC value with 63 and save
sh-4.4# mv kube-apiserver-pod.yaml .. # kubelet will shut down kube-apiserver
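The interactive `vi` edit above can also be done non-interactively. A sketch, assuming GNU sed and the two-line `- name: GOGC` / `value: "..."` layout shown earlier (the function name is illustrative, not from the bug report):

```shell
# Sketch: rewrite every GOGC value in a manifest file in place.
# Assumes GNU sed and the `- name: GOGC` line followed directly by its
# `value: "..."` line; a YAML-aware tool would be safer for anything
# more complex.
tune_gogc() {  # usage: tune_gogc <manifest-file> <new-value>
  sed -i "/name: GOGC/{n;s/value: \"[0-9]*\"/value: \"$2\"/}" "$1"
}
```

For example, `tune_gogc kube-apiserver-pod.yaml 63` would replace each `value: "100"` under a `- name: GOGC` entry with `value: "63"`.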
Open a second terminal and check the kube-apiserver pods,
$ oc get po -n openshift-kube-apiserver -l apiserver
NAME READY STATUS RESTARTS AGE
kube-apiserver-kewang-2411op2-dnk84-master-1 5/5 Running 0 3h42m
kube-apiserver-kewang-2411op2-dnk84-master-2 5/5 Running 0 3h45m
Move the kube-apiserver-pod.yaml back to the manifests directory in the first terminal,
Check the kube-apiserver pods in the second terminal; after a while, the kube-apiserver pod is started up.
$ oc get po -n openshift-kube-apiserver -l apiserver
NAME READY STATUS RESTARTS AGE
kube-apiserver-kewang-2411op2-dnk84-master-0 3/5 Running 0 7s
kube-apiserver-kewang-2411op2-dnk84-master-1 5/5 Running 0 3h43m
kube-apiserver-kewang-2411op2-dnk84-master-2 5/5 Running 0 3h46m
Apply the same GOGC change to the other kube-apiservers, then check the results,
$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-0 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63
$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-1 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63
$ oc exec kube-apiserver-kewang-2411op2-dnk84-master-2 -n openshift-kube-apiserver -- printenv | grep -i gogc
GOGC=63
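For context on the value being verified here: GOGC is the percentage of heap growth over the live heap at which the Go runtime starts the next GC cycle, so lowering it from 100 to 63 collects more often and keeps the peak heap smaller at some CPU cost. A rough integer-arithmetic illustration (the function is a sketch, not part of the bug report):

```shell
# Rough illustration of GOGC semantics: the runtime aims to start the
# next GC once the heap reaches roughly live_heap * (1 + GOGC/100).
gc_target_mib() {  # usage: gc_target_mib <live-heap-mib> <gogc>
  echo $(( $1 + $1 * $2 / 100 ))
}

gc_target_mib 1000 100   # default GOGC=100 -> 2000
gc_target_mib 1000 63    # tuned GOGC=63   -> 1630
```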
Check the cluster operators,
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.nightly-2022-06-23-153912 True False False 40m
baremetal 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
cloud-controller-manager 4.11.0-0.nightly-2022-06-23-153912 True False False 7h43m
cloud-credential 4.11.0-0.nightly-2022-06-23-153912 True False False 7h44m
cluster-autoscaler 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
config-operator 4.11.0-0.nightly-2022-06-23-153912 True False False 7h42m
console 4.11.0-0.nightly-2022-06-23-153912 True False False 7h21m
csi-snapshot-controller 4.11.0-0.nightly-2022-06-23-153912 True False False 7h42m
dns 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
etcd 4.11.0-0.nightly-2022-06-23-153912 True False False 7h30m
image-registry 4.11.0-0.nightly-2022-06-23-153912 True False False 7h24m
ingress 4.11.0-0.nightly-2022-06-23-153912 True False False 7h24m
insights 4.11.0-0.nightly-2022-06-23-153912 True False False 7h28m
kube-apiserver 4.11.0-0.nightly-2022-06-23-153912 True False False 7h29m
kube-controller-manager 4.11.0-0.nightly-2022-06-23-153912 True False False 7h38m
kube-scheduler 4.11.0-0.nightly-2022-06-23-153912 True False False 7h37m
kube-storage-version-migrator 4.11.0-0.nightly-2022-06-23-153912 True False False 56m
machine-api 4.11.0-0.nightly-2022-06-23-153912 True False False 7h38m
machine-approver 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
machine-config 4.11.0-0.nightly-2022-06-23-153912 True False False 40m
marketplace 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
monitoring 4.11.0-0.nightly-2022-06-23-153912 True False False 7h22m
network 4.11.0-0.nightly-2022-06-23-153912 True False False 7h42m
node-tuning 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
openshift-apiserver 4.11.0-0.nightly-2022-06-23-153912 True False False 58m
openshift-controller-manager 4.11.0-0.nightly-2022-06-23-153912 True False False 7h37m
openshift-samples 4.11.0-0.nightly-2022-06-23-153912 True False False 7h25m
operator-lifecycle-manager 4.11.0-0.nightly-2022-06-23-153912 True False False 7h41m
operator-lifecycle-manager-catalog 4.11.0-0.nightly-2022-06-23-153912 True False False 7h42m
operator-lifecycle-manager-packageserver 4.11.0-0.nightly-2022-06-23-153912 True False False 7h26m
service-ca 4.11.0-0.nightly-2022-06-23-153912 True False False 7h42m
storage 4.11.0-0.nightly-2022-06-23-153912 True False False 7h37m
$ oc adm top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kewang-2411op2-dnk84-master-0 1583m 21% 10589Mi 71%
kewang-2411op2-dnk84-master-1 1338m 17% 9318Mi 62%
kewang-2411op2-dnk84-master-2 688m 9% 4925Mi 33%
kewang-2411op2-dnk84-worker-0-j2vfq 702m 20% 4115Mi 60%
kewang-2411op2-dnk84-worker-0-v9vft 1724m 49% 4703Mi 68%
Based on the above, the cluster works well after the GOGC setting is changed, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069