Bug 2109152
Summary: Kube-apiserver was down and could not recover

| Product: | OpenShift Container Platform | Reporter: | bzhai |
|---|---|---|---|
| Component: | kube-apiserver | Assignee: | Abu Kashem <akashem> |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.9 | CC: | akashem, bzvonar, dhellmann, dphillip, dpunia, ealcaniz, eparis, mfojtik, openshift-bugs-escalate, otuchfel, scuppett, sdodson, wlewis, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:53:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description bzhai 2022-07-20 14:22:08 UTC
The issue happened again today. Here are the timelines, fetched from the kubelet/crio logs.

The installer pod starts fine, and manifests are written out to the filesystem:

Jul 21 03:11:21 caotvdu102 bash[5905]: I0721 03:11:21.021419 5905 kubelet.go:2115] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102]
Jul 21 03:11:21 caotvdu102 bash[5905]: I0721 03:11:21.023228 5905 policy_static.go:98] "Allocate" pod="openshift-kube-apiserver/installer-44-caotvdu102" containerName="installer"
Jul 21 03:11:36 caotvdu102 bash[5905]: I0721 03:11:36.201974 5905 kubelet.go:2122] "SyncLoop UPDATE" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102]
Jul 21 03:11:36 caotvdu102 bash[5905]: I0721 03:11:36.846916 5905 kuberuntime_manager.go:484] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-apiserver/installer-44-caotvdu102"
Jul 21 03:11:40 caotvdu102 bash[5905]: I0721 03:11:40.938504 5905 kubelet.go:2122] "SyncLoop UPDATE" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102]
Jul 21 03:11:36 caotvdu102 bash[5847]: time="2022-07-21 03:11:36.848221192Z" level=info msg="Running pod sandbox: openshift-kube-apiserver/installer-44-caotvdu102/POD" id=c0af1175-4c29-4c15-9ea0-d8728d85d102 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jul 21 03:11:40 caotvdu102 bash[5847]: time="2022-07-21 03:11:40.957179176Z" level=info msg="Creating container: openshift-kube-apiserver/installer-44-caotvdu102/installer" id=892ed72d-f15e-4a85-b29d-3f9f00e32334 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jul 21 03:11:44 caotvdu102 bash[5847]: time="2022-07-21 03:11:44.979953567Z" level=info msg="Created container 258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94: openshift-kube-apiserver/installer-44-caotvdu102/installer" id=892ed72d-f15e-4a85-b29d-3f9f00e32334 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jul 21 03:11:44 caotvdu102 bash[5847]: time="2022-07-21 03:11:44.980944452Z" level=info msg="Starting container: 258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94" id=3672c118-eb18-420d-ad6b-c026258a2427 name=/runtime.v1alpha2.RuntimeService/StartContainer
Jul 21 03:11:45 caotvdu102 bash[5847]: time="2022-07-21 03:11:45.050459990Z" level=info msg="Started container" PID=209148 containerID=258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94 description=openshift-kube-apiserver/installer-44-caotvdu102/installer id=3672c118-eb18-420d-ad6b-c026258a2427 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=c94ac5f32700ab1371ddb6ac2c62b77d9818d916a839bd5cfdc022c3f1b8c1a3

The current kube-apiserver is shut down:

Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427038 5905 kubelet.go:2125] "SyncLoop REMOVE" source="file" pods=[openshift-kube-apiserver/kube-apiserver-caotvdu102]
Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427105 5905 kubelet.go:2115] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-caotvdu102]
Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427762 5905 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 containerName="kube-apiserver" containerID="cri-o://8a3c50611b4a9db88d02d6e0cfb99697b8d7108a788c2f42e2455583aad42ad2" gracePeriod=15
Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427803 5905 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 containerName="kube-apiserver-check-endpoints" containerID="cri-o://693be590487b24e8b0a37502536393ddbf29ce307d75c3c836e01813e5a027a1" gracePeriod=15
Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.189670410Z" level=info msg="Stopped container 693be590487b24e8b0a37502536393ddbf29ce307d75c3c836e01813e5a027a1: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=d415b96b-fc85-4ea3-8f4a-480f595eabc3 name=/runtime.v1alpha2.RuntimeService/StopContainer
Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.763237117Z" level=info msg="Stopped container e25115bc4cba45c842dfad945512846bfdc90f2e5f938e899a8962914882a556: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=2bf26335-cf65-4828-b3c0-2b5fe9a0ba13 name=/runtime.v1alpha2.RuntimeService/StopContainer
Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.980432439Z" level=info msg="Stopped container d59b952c1cbff58df04eef6d04d0e2ac5091ba9dddd137e20a4a733677c56224: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=9ce08168-8f59-414c-8fde-96cf8ed0fed7 name=/runtime.v1alpha2.RuntimeService/StopContainer
Jul 21 03:12:43 caotvdu102 bash[5847]: time="2022-07-21 03:12:43.582282235Z" level=info msg="Stopped container b2e7b33ed120a9b657d1cacf697793e797f99ee444d0cdeac8ac55b213af4b9a: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=c81912a9-23fc-4af6-a3b3-ab001c5555d1 name=/runtime.v1alpha2.RuntimeService/StopContainer
Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.115780302Z" level=info msg="Stopped container 8a3c50611b4a9db88d02d6e0cfb99697b8d7108a788c2f42e2455583aad42ad2: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=136d101e-d0b2-4343-88e3-b95798ea2491 name=/runtime.v1alpha2.RuntimeService/StopContainer

The startup-monitor pod is started:

Jul 21 03:12:44 caotvdu102 bash[5847]: time="2022-07-21 03:12:44.123394100Z" level=info msg="Running pod sandbox: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/POD" id=b3104916-276f-4a6c-8e5b-5bbfe1bfb0a8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.423191683Z" level=info msg="Ran pod sandbox f3b2b5fc2a74d7a07cdd1d4f7b869dcf734fcb89e1ca2594d37cb583623167c7 with infra container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/POD" id=b3104916-276f-4a6c-8e5b-5bbfe1bfb0a8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.674320243Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=6a7f9f4d-2a75-4ed6-90d9-80e60a9a49fd name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jul 21 03:12:53 caotvdu102 bash[5847]: time="2022-07-21 03:12:53.070909342Z" level=info msg="Created container ebdf77ed1264d4748c7e811b666717e0d0134c28313117c84fc9b6e88e3b274a: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=6a7f9f4d-2a75-4ed6-90d9-80e60a9a49fd name=/runtime.v1alpha2.RuntimeService/CreateContainer
Jul 21 03:12:54 caotvdu102 bash[5847]: time="2022-07-21 03:12:54.578591468Z" level=info msg="Started container" PID=234577 containerID=ebdf77ed1264d4748c7e811b666717e0d0134c28313117c84fc9b6e88e3b274a description=openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor id=bd14c18f-7c9d-4189-8bda-f337a0cc4de5 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=f3b2b5fc2a74d7a07cdd1d4f7b869dcf734fcb89e1ca2594d37cb583623167c7

But the kubelet never seems to start the new pod. The old containers shut down, but the pod sandbox cannot be deleted:

kubelet_logs_last_45m:Jul 21 03:12:45 caotvdu102 bash[5905]: E0721 03:12:45.767969 5905 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"6edc0f9b-1f95-4c9e-915d-3421ea3e28f3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [remove /var/run/netns/7e9d874f-41b9-408e-9be9-f3ae94704f04: device or resource busy]\"" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3

Attached the raw logs: apiserver_info_07210309.tar

From the bug description above, the kube-apiserver pod could not be removed due to "device or resource busy":
> kubelet_logs_last_45m:Jul 21 03:12:45 caotvdu102 bash[5905]: E0721 03:12:45.767969 5905 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"6edc0f9b-1f95-4c9e-915d-3421ea3e28f3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [remove /var/run/netns/7e9d874f-41b9-408e-9be9-f3ae94704f04: device or resource busy]\"" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3
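For anyone debugging the same symptom, whatever is keeping the network namespace mount busy can usually be identified from the affected node before resorting to a reboot. A minimal diagnostic sketch, not part of the original report (the netns path is taken from the error above, and standard node tools are assumed to be available):

$ oc debug node/caotvdu102 -- chroot /host sh -c '
  # Is the netns file still mounted, and by what?
  findmnt /var/run/netns/7e9d874f-41b9-408e-9be9-f3ae94704f04
  # Which processes still hold network namespaces?
  lsns -t net
  # Is a leftover sandbox still referencing the pod?
  crictl pods --name kube-apiserver-caotvdu102
'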
We don't see a way to reproduce it directly, so I ran a similar test to simulate the process under some load.

Before the testing:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-2022-12-01-184212 True False 4h33m Cluster version is 4.12.0-0.nightly-2022-12-01-184212
$ oc adm top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1131m 7% 13913Mi 23%
After adding some system load (one way to generate it is sketched below):
$ oc adm top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 9545m 61% 18671Mi 31%
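The report does not record how the extra load was generated; one possible way, illustrative only and assuming a debug shell on the master node, is to spin up a few CPU-burn loops:

$ oc debug node/kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal -- chroot /host sh -c '
  # Start six busy loops for ten minutes each to drive CPU usage up
  for i in 1 2 3 4 5 6; do
    timeout 600 sh -c "while :; do :; done" &
  done
  wait
'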
1. Set the installer pods to fail and back off during rollout by injecting an error:
$ oc patch kubeapiserver/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"installerErrorInjection":{"failPropability":1.0}}}}'
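To confirm the override landed before forcing a rollout, the operator spec can be read back (a quick check, not part of the original steps; "failPropability" is spelled exactly as in the command above):

$ oc get kubeapiserver/cluster -o jsonpath='{.spec.unsupportedConfigOverrides}'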
2. Force a rollout so the kube-apiserver redeploys, simulating the restart:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test1: '"$(date)"'" } ]'
Checking repeatedly, there are many installer retry pods failing and backing off (a scripted check follows the listing):
$ oc get po -n openshift-kube-apiserver -L revision
NAME READY STATUS RESTARTS AGE REVISION
apiserver-watcher-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1/1 Running 0 3h18m
installer-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h11m
installer-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3h8m
installer-7-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h3m
installer-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 179m
installer-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3m32s
installer-9-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3m9s
installer-9-retry-2-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 2m42s
installer-9-retry-3-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 2m2s
installer-9-retry-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 61s
kube-apiserver-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 5/5 Running 0 178m 8
revision-pruner-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h4m
revision-pruner-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 179m
revision-pruner-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m38s
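Instead of re-running the command by hand, the rollout can also be followed through the operator's per-node status. A small watch-loop sketch (nodeStatuses, currentRevision, targetRevision, and lastFailedRevision are standard fields in the KubeAPIServer operator status):

$ while sleep 10; do
    oc get kubeapiserver/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{": current="}{.currentRevision}{" target="}{.targetRevision}{" lastFailed="}{.lastFailedRevision}{"\n"}{end}'
done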
3. Remove the above injection:
$ oc patch kubeapiserver/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":null}}'
4. Force a rollout again to make the kube-apiserver stop and start:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test: '"$(date)"'" } ]'
$ oc get po -n openshift-kube-apiserver -L revision --show-labels
NAME READY STATUS RESTARTS AGE REVISION LABELS
apiserver-watcher-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1/1 Running 0 3h24m <none>
installer-10-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m42s app=installer
installer-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3h14m app=installer
installer-7-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h10m app=installer
installer-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h5m app=installer
installer-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 10m app=installer
installer-9-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 9m39s app=installer
installer-9-retry-2-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 9m12s app=installer
installer-9-retry-3-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 8m32s app=installer
installer-9-retry-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 7m31s app=installer
installer-9-retry-5-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 6m10s app=installer
kube-apiserver-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 5/5 Running 0 2m34s 10 apiserver=true,app=openshift-kube-apiserver,revision=10
revision-pruner-10-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m48s app=pruner
revision-pruner-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h10m app=pruner
revision-pruner-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h5m app=pruner
revision-pruner-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 10m app=pruner
We can see that the new installer-10 pod completed and the kube-apiserver restarted with new revision 10. The case has been automated and has worked well in our daily tests without hitting the same problem. Based on this, we can consider the bug fixed.
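As a final health check (not part of the original verification output), the kube-apiserver cluster operator can be checked for Available without Degraded, and the latest available revision compared against the one just rolled out:

$ oc get clusteroperator kube-apiserver
$ oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}'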
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399