Bug 1888015
| Summary: | workaround kubelet graceful termination of static pods bug | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | kube-apiserver | Assignee: | David Eads <deads> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aos-bugs, kewang, mfojtik, wking, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1888026 (view as bug list) | Environment: | |
| Last Closed: | 2021-02-24 15:25:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1888026, 1888052 | | |
Description
David Eads
2020-10-13 19:55:25 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-17-034503   True        False         8h      Cluster version is 4.7.0-0.nightly-2020-10-17-034503

Connect to one master node:

$ oc debug node/ip-xx-x-195-122.us-east-2.compute.internal

- For kube-apiserver, change one container's requested memory size in kube-apiserver-pod.yaml. Before the change, check the current process ID of kube-apiserver:

sh-4.4# ps -ef | grep ' kube-apiserver '
root 415810 415772 99 08:45 ? 00:00:32 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.0.195.122 -v=8
sh-4.4# cd /etc/kubernetes/manifests
sh-4.4# vi kube-apiserver-pod.yaml   # changed "memory": "50Mi" to "55Mi"

A new kube-apiserver was started with a new process ID:

sh-4.4# ps -ef | grep ' kube-apiserver '
root 429872 429836 99 08:57 ? 00:00:01 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml --advertise-address=10.0.195.122 -v=8

In another terminal, check whether the kube-apiserver pod was restarted:

$ oc get pods -n openshift-kube-apiserver --show-labels -l apiserver
NAME                                                        READY   STATUS    RESTARTS   AGE   LABELS
...
kube-apiserver-ip-xx-x-195-122.us-east-2.compute.internal   5/5     Running   13         19s   apiserver=true,app=openshift-kube-apiserver,revision=7

- For etcd, before the change, check the current process IDs of etcd:

sh-4.4# ps -ef | grep ' etcd ' | grep -v grep | awk '{print $2}'
448696
448769
sh-4.4# vi etcd-pod.yaml   # changed "memory": "50Mi" to "55Mi"

A new etcd server was started with new process IDs:

sh-4.4# ps -ef | grep ' etcd ' | grep -v grep | awk '{print $2}'
452352
452414

In another terminal, check whether the etcd pod was restarted:

$ oc get pods -n openshift-etcd -l app=etcd --show-labels
...
etcd-ip-xx-x-195-122.us-east-2.compute.internal   3/3   Running   0   78s   app=etcd,etcd=true,k8s-app=etcd,revision=3

- For kube-controller-manager, before the change, check the current process ID of kube-controller-manager:

sh-4.4# ps -ef | grep ' kube-controller-manager ' | grep -v grep | awk '{print $2}'
2240
sh-4.4# vi kube-controller-manager-pod.yaml   # changed "memory": "50Mi" to "55Mi"

A new kube-controller-manager was started with a new process ID:

sh-4.4# ps -ef | grep ' kube-controller-manager ' | grep -v grep | awk '{print $2}'
464346

In another terminal, check whether the kube-controller-manager pod was restarted:

$ oc get pods -n openshift-kube-controller-manager --show-labels | grep kube-controller-manager
...
kube-controller-manager-ip-xx-x-195-122.us-east-2.compute.internal   2/4   Running   0   22s   app=kube-controller-manager,kube-controller-manager=true,revision=7

- For kube-scheduler, before the change, check the current process ID of kube-scheduler:

sh-4.4# ps -ef | grep ' kube-scheduler ' | grep -v grep | awk '{print $2}'
4083
sh-4.4# vi kube-scheduler-pod.yaml   # changed "memory": "50Mi" to "55Mi"

A new kube-scheduler was started with a new process ID:

sh-4.4# ps -ef | grep ' kube-scheduler ' | grep -v grep | awk '{print $2}'
531438

In another terminal, check whether the kube-scheduler pod was restarted:

$ oc get pods -n openshift-kube-scheduler --show-labels | grep kube-scheduler
...
openshift-kube-scheduler-ip-xx-x-195-122.us-east-2.compute.internal   1/2   Running   0   32s   app=openshift-kube-scheduler,revision=6,scheduler=true

Hi deads, please see my verification above. One question about the kube-apiserver termination and restart: the pod shows many RESTARTS; is this expected?

> $ oc get pods -n openshift-kube-apiserver --show-labels -l apiserver
> NAME                                                        READY   STATUS    RESTARTS   AGE   LABELS
> ...
> kube-apiserver-ip-xx-x-195-122.us-east-2.compute.internal   5/5     Running   13         19s   apiserver=true,app=openshift-kube-apiserver,revision=7
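As for whether those terminations were graceful: the signal the CI test described in the next comment looks for is a NonGracefulTermination event, and the same thing can be checked directly on a live cluster. A minimal sketch, not part of the original verification (it assumes the relevant events are still within their retention window):

$ oc get events -A --field-selector reason=NonGracefulTermination

No output means no non-graceful terminations were recorded; a matching event suggests kubelet or CRI-O did not give the static pod time to shut down cleanly.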
Ke Wang, it looks like you checked process IDs. This bug is verified by the pod's uid (under the pod YAML's metadata), not by process IDs.

First, I checked the "[sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully" test. It lives in the "origin" repo under test/extended/apiserver/graceful_termination.go and works only by inspecting the cluster's events:

for _, ev := range evs.Items {
	if ev.Reason != "NonGracefulTermination" {
		continue
	}
	t.Errorf("kube-apiserver reports a non-graceful termination: %#v. Probably kubelet or CRI-O is not giving the time to cleanly shut down. This can lead to connection refused and network I/O timeout errors in other components.", ev)
}

If it finds a NonGracefulTermination event, the case fails. The case does not create any test data, it only checks cluster data, so the fix can only be confirmed through a large number of CI runs showing that the failure frequency drops significantly.

Second, I checked the PR code. The pod YAML file it writes to the master now has the uid set:

pod.UID = uuid.NewUUID()
finalPodBytes := resourceread.WritePodV1OrDie(pod)
if err := ioutil.WriteFile(path.Join(resourceDir, podFileName), []byte(finalPodBytes), 0644); err != nil {

This can be proved as follows. On a 4.6.1 env, which does not include the fix, ssh to a master:

[root@ip-10-0-138-235 static-pod-resources]# cd /etc/kubernetes/static-pod-resources/
[root@ip-10-0-138-235 static-pod-resources]# grep -o -P '"uid":".*?"' kube-apiserver-pod-*/kube-apiserver-pod.yaml

Nothing is returned, meaning no uid is set. But on a 4.7.0-0.nightly-2020-10-23-004149 env, the written files do have a uid:

[root@ip-10-0-157-133 static-pod-resources]# grep -o -P '"uid":".*?"' kube-apiserver-pod-*/kube-apiserver-pod.yaml
kube-apiserver-pod-2/kube-apiserver-pod.yaml:"uid":"69adaa39-692a-4240-b65a-1a86fc35e6d9"
...
kube-apiserver-pod-8/kube-apiserver-pod.yaml:"uid":"9e2b87a6-0e37-4073-9984-afd1d4e8f803"

This is what the PR expects, so moving to VERIFIED.

Following Xingxing's verification above for kube-apiserver, I checked the other three PRs.

For etcd:

sh-4.4# cd /etc/kubernetes/static-pod-resources
sh-4.4# grep -o -P '"uid":".*?"' etcd-pod-*/etcd-pod.yaml
etcd-pod-3/etcd-pod.yaml:"uid":"54c077a5-2898-4117-80d6-576ad1220ed8"
etcd-pod-4/etcd-pod.yaml:"uid":"56b249f6-16a1-496f-8c47-63fd5182b9e5"

For kube-controller-manager:

sh-4.4# grep -o -P '"uid":".*?"' kube-controller-manager-pod*/kube-controller-manager-pod.yaml
kube-controller-manager-pod-3/kube-controller-manager-pod.yaml:"uid":"1318e7e5-b22b-43c2-859e-60d4d2463b51"
kube-controller-manager-pod-5/kube-controller-manager-pod.yaml:"uid":"9e253dad-6e22-4169-8bd2-b22e6509fa95"
kube-controller-manager-pod-6/kube-controller-manager-pod.yaml:"uid":"3adfbe1c-8256-4744-b249-599d071d3308"
kube-controller-manager-pod-7/kube-controller-manager-pod.yaml:"uid":"3ce9025d-7a94-413e-86a4-eb2b900c6dae"

For kube-scheduler:

sh-4.4# grep -o -P '"uid":".*?"' kube-scheduler-pod-*/kube-scheduler-pod.yaml
kube-scheduler-pod-5/kube-scheduler-pod.yaml:"uid":"4a745a27-3a0f-4a57-aa7c-896a24a92779"
kube-scheduler-pod-6/kube-scheduler-pod.yaml:"uid":"991535c4-6f74-4bd3-afa1-8f964ee7762c"
kube-scheduler-pod-7/kube-scheduler-pod.yaml:"uid":"a6e4a686-006a-4588-b1d5-25587f3ed6a8"

All have the uid set.
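The four per-component checks above can be restated as a single loop from the same debug shell. A minimal sketch, assuming the directory layout shown above (the loop itself is illustrative and was not part of the original verification):

sh-4.4# cd /etc/kubernetes/static-pod-resources
sh-4.4# for c in kube-apiserver etcd kube-controller-manager kube-scheduler; do
>   grep -o -P '"uid":".*?"' ${c}-pod-*/${c}-pod.yaml   # on a fixed cluster, each written revision's manifest carries a uid
> done

On a cluster without the fix the loop prints nothing; with the fix it prints one uid per written revision, as in the outputs above.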
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633