Created attachment 1898297 [details] openshift-kube-apiserver kubelet crio logs Description of problem: Kube-apiserver was down and could not recover by its own. Our customer reported an issue that on their SNO node they could not use oc command which indicated the kube-apiserver was down somehow, from the kubelet/crio logs we could see when pod 'openshift-kube-apiserver/installer-42-caotvdu102/installer' and 'openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor' was trying to update the kube-apiserver to newer revision, crio took long time to create the container (BZ#2102366), eventually it got timeout, and in a edge case seems like startup-monitor pod was killed before all 6 kube-apiserver containers were created, and retry containers creation may have confused startup-monitor which eventually cause kube-apiserver down. Following are the major logs captured from kubelet/crio logs Jul 19 00:58:55 caotvdu102 bash[6003]: time="2022-07-19 00:58:55.437378365Z" level=info msg="Started container" PID=149559 containerID=c6589ab81e06526af8aa8dc751db4a607b6f4690542004d8ccc687298ac62a16 description=openshift-kube-apiserver/revision-pruner-42-caotvdu102/pruner id=9de5db15-1ae2-4403-9f79-0f0f3bac62fc name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=a9761c1a4b07de1c3d2abe7b55ada1fc6ff4425a0e92db79109fbf115eb6e490 Jul 19 00:59:03 caotvdu102 bash[6003]: time="2022-07-19 00:59:03.331629429Z" level=info msg="Created container 3e37f1219ed3d0da8151deea21c03d08e189830ae048ee8532a1b5e1da85645f: openshift-kube-apiserver/installer-42-caotvdu102/installer" id=395c2a6e-d51f-4721-a045-6d42671ff3d6 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 00:59:03 caotvdu102 bash[6003]: time="2022-07-19 00:59:03.391848935Z" level=info msg="Started container" PID=152665 containerID=3e37f1219ed3d0da8151deea21c03d08e189830ae048ee8532a1b5e1da85645f description=openshift-kube-apiserver/installer-42-caotvdu102/installer id=0f2f0d6e-8188-4a1f-befe-551eb7775554 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=0acceb55e54740ab58412a2580fab295ce34e4f29376ff5be710871ff51ca088 Jul 19 00:59:14 caotvdu102 bash[6124]: I0719 00:59:14.740901 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=0db10678-67fd-4d72-a192-d7d5efaebc21 containerName="kube-apiserver" containerID="cri-o://21677c115b6ec349fd2bb8eb5a8a9189db8ad3771ed912e7b27cfa9db085b346" gracePeriod=15 Jul 19 00:59:14 caotvdu102 bash[6124]: I0719 00:59:14.741235 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=0db10678-67fd-4d72-a192-d7d5efaebc21 containerName="kube-apiserver-check-endpoints" containerID="cri-o://2730582b69bd23227dcacea28af89cd6bc1a232ac747953fc667a652fcec78ec" gracePeriod=15 Jul 19 00:59:14 caotvdu102 bash[6124]: I0719 00:59:14.741284 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=0db10678-67fd-4d72-a192-d7d5efaebc21 containerName="kube-apiserver-insecure-readyz" containerID="cri-o://6a3d1b20a7f29769ff323e539e256b4feb5bc1d4e3f802fefede06baf7423aa9" gracePeriod=15 Jul 19 00:59:14 caotvdu102 bash[6124]: I0719 00:59:14.741591 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=0db10678-67fd-4d72-a192-d7d5efaebc21 containerName="kube-apiserver-cert-regeneration-controller" containerID="cri-o://fafafd9f98e807b2ed89b55fefd41aa1b05ff66a36a6cf92844a0e426695050d" gracePeriod=15 Jul 19 00:59:14 caotvdu102 bash[6124]: I0719 00:59:14.741679 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=0db10678-67fd-4d72-a192-d7d5efaebc21 containerName="kube-apiserver-cert-syncer" containerID="cri-o://862f2102dd1d47500ee6bfa086c192a9543e5806de330ad2f286ec2d2db1c1d4" gracePeriod=15 Jul 19 00:59:18 caotvdu102 bash[6003]: time="2022-07-19 00:59:18.587868499Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=d82660b8-610d-42f7-8378-fd2ac7003026 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 00:59:27 caotvdu102 bash[6003]: time="2022-07-19 00:59:27.797915924Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/setup" id=7b518fe1-e44d-4ee2-917f-4b40a8e91aca name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:00:08 caotvdu102 bash[6003]: time="2022-07-19 01:00:08.590559752Z" level=info msg="Created container 49584ac6cf4c91799721789ea06500971e1580cfafd76b2040db945fd22b32c3: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=d82660b8-610d-42f7-8378-fd2ac7003026 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:00:09 caotvdu102 bash[6003]: time="2022-07-19 01:00:09.885462395Z" level=info msg="Started container" PID=161216 containerID=49584ac6cf4c91799721789ea06500971e1580cfafd76b2040db945fd22b32c3 description=openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor id=ae06407c-b74f-4226-8be0-5c76f72c52cf name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=ee091694b8d31ba03c429c240c4b0747de9c606ea648b4f41590ab71aa3c7a08 Jul 19 01:00:20 caotvdu102 bash[6003]: time="2022-07-19 01:00:20.247837300Z" level=info msg="Created container 2fc99a0caf176a9bffcac74b37afce0d5f3f768aabc2fdfb2638441e9b6c616f: openshift-kube-apiserver/kube-apiserver-caotvdu102/setup" id=7b518fe1-e44d-4ee2-917f-4b40a8e91aca name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:00:20 caotvdu102 bash[6003]: time="2022-07-19 01:00:20.610191410Z" level=info msg="Started container" PID=164837 containerID=2fc99a0caf176a9bffcac74b37afce0d5f3f768aabc2fdfb2638441e9b6c616f description=openshift-kube-apiserver/kube-apiserver-caotvdu102/setup id=558c71bf-f91a-4a84-a4ff-37a10035554f name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:00:26 caotvdu102 bash[6003]: time="2022-07-19 01:00:26.823880169Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=89d4db48-f01f-4696-9a92-b14b5ba9973d name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:01:10 caotvdu102 bash[6003]: time="2022-07-19 01:01:10.989687221Z" level=info msg="Created container 4a4290644d7c7713aea404a45a0b17709d5f849dea749fbf26dade4a052ff710: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=89d4db48-f01f-4696-9a92-b14b5ba9973d name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:01:12 caotvdu102 bash[6003]: time="2022-07-19 01:01:12.113836679Z" level=info msg="Started container" PID=197104 containerID=4a4290644d7c7713aea404a45a0b17709d5f849dea749fbf26dade4a052ff710 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver id=0c3d613b-013a-43b6-9e3f-93ebbc74ba29 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:01:13 caotvdu102 bash[6003]: time="2022-07-19 01:01:13.197815327Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=345f908a-31d9-4b14-83b1-0cca992d8ecc name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:03:13 caotvdu102 bash[6003]: time="2022-07-19 01:03:13.206062556Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=d0b68452-69cc-44ca-9b7f-db7a4d13e12a name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:04:09 caotvdu102 bash[6003]: time="2022-07-19 01:04:09.703908235Z" level=info msg="Created container 22dec08fa5f466b5724cfeb23f99d0060816e1ac4b8e5769aad16bd50b7a7b9e: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=d0b68452-69cc-44ca-9b7f-db7a4d13e12a name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:04:09 caotvdu102 bash[6003]: time="2022-07-19 01:04:09.912221461Z" level=info msg="Started container" PID=273443 containerID=22dec08fa5f466b5724cfeb23f99d0060816e1ac4b8e5769aad16bd50b7a7b9e description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller id=751a03d0-2696-4a27-bfc0-712289cd3c6c name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:04:11 caotvdu102 bash[6003]: time="2022-07-19 01:04:11.985111965Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=b299f82d-148a-4ba9-a276-6f71c4031058 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:04:39 caotvdu102 bash[6003]: time="2022-07-19 01:04:39.105982559Z" level=info msg="Created container 38b3f13292885233a8db602bfffe97b04b44719d571d2bd3f8d8a950898d4adb: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=b299f82d-148a-4ba9-a276-6f71c4031058 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:04:39 caotvdu102 bash[6003]: time="2022-07-19 01:04:39.211985147Z" level=info msg="Started container" PID=308417 containerID=38b3f13292885233a8db602bfffe97b04b44719d571d2bd3f8d8a950898d4adb description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz id=16090dc4-38a7-4833-8d20-31f3667e6c3f name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:04:39 caotvdu102 bash[6003]: time="2022-07-19 01:04:39.267800073Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=3288c8f4-14b9-42f7-b3c6-dbf923d41b67 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:05:07 caotvdu102 bash[6003]: time="2022-07-19 01:05:07.639221339Z" level=info msg="Created container 16ab99e46f57d9d22bbc752d6403c0df80dc06bc4d0537ac5a82154bc1a00573: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=3288c8f4-14b9-42f7-b3c6-dbf923d41b67 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:05:07 caotvdu102 bash[6003]: time="2022-07-19 01:05:07.959308844Z" level=info msg="Started container" PID=322736 containerID=16ab99e46f57d9d22bbc752d6403c0df80dc06bc4d0537ac5a82154bc1a00573 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints id=6e695a02-884a-4f30-98d0-7ba1820c9884 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:05:09 caotvdu102 bash[6124]: I0719 01:05:09.701346 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102" podUID=ccc216b9-0e46-4610-ab56-9a9fd1df6253 containerName="startup-monitor" containerID="cri-o://49584ac6cf4c91799721789ea06500971e1580cfafd76b2040db945fd22b32c3" gracePeriod=5 Jul 19 01:05:09 caotvdu102 bash[6003]: time="2022-07-19 01:05:09.868887139Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=a42e4748-fbe1-46d3-86ea-5a5219c1396a name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:05:31 caotvdu102 bash[6003]: time="2022-07-19 01:05:31.199001663Z" level=info msg="Created container d8f17ff20c376b3d6996ff490b1af49f08ad717a8bc09993e913a4531deae21c: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=a42e4748-fbe1-46d3-86ea-5a5219c1396a name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:05:31 caotvdu102 bash[6003]: time="2022-07-19 01:05:31.574751087Z" level=info msg="Started container" PID=337459 containerID=d8f17ff20c376b3d6996ff490b1af49f08ad717a8bc09993e913a4531deae21c description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer id=a23c2f91-ed13-4808-897d-72e229b51a92 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fc2954cc6c6993d64eb9b88b3af24b6a1e1f444691a3fe7effc48d3ecd0c02a5 Jul 19 01:05:32 caotvdu102 bash[6124]: I0719 01:05:32.702653 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=aa690d6f-8561-4c1e-a920-dd8e36aecd77 containerName="kube-apiserver" containerID="cri-o://4a4290644d7c7713aea404a45a0b17709d5f849dea749fbf26dade4a052ff710" gracePeriod=15 Jul 19 01:05:32 caotvdu102 bash[6124]: I0719 01:05:32.702683 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=aa690d6f-8561-4c1e-a920-dd8e36aecd77 containerName="kube-apiserver-check-endpoints" containerID="cri-o://16ab99e46f57d9d22bbc752d6403c0df80dc06bc4d0537ac5a82154bc1a00573" gracePeriod=15 Jul 19 01:05:32 caotvdu102 bash[6124]: I0719 01:05:32.702726 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=aa690d6f-8561-4c1e-a920-dd8e36aecd77 containerName="kube-apiserver-insecure-readyz" containerID="cri-o://38b3f13292885233a8db602bfffe97b04b44719d571d2bd3f8d8a950898d4adb" gracePeriod=15 Jul 19 01:05:32 caotvdu102 bash[6124]: I0719 01:05:32.702753 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=aa690d6f-8561-4c1e-a920-dd8e36aecd77 containerName="kube-apiserver-cert-syncer" containerID="cri-o://d8f17ff20c376b3d6996ff490b1af49f08ad717a8bc09993e913a4531deae21c" gracePeriod=15 Jul 19 01:05:32 caotvdu102 bash[6124]: I0719 01:05:32.702776 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=aa690d6f-8561-4c1e-a920-dd8e36aecd77 containerName="kube-apiserver-cert-regeneration-controller" containerID="cri-o://22dec08fa5f466b5724cfeb23f99d0060816e1ac4b8e5769aad16bd50b7a7b9e" gracePeriod=15 Jul 19 01:06:02 caotvdu102 bash[6003]: time="2022-07-19 01:06:01.908178814Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/setup" id=bcee3fbc-8d6d-4493-968e-271978d120bb name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:06:52 caotvdu102 bash[6003]: time="2022-07-19 01:06:52.867054511Z" level=info msg="Created container 16b28a3cda8effe6dfbaf0037d3331ee1b6ca5dfd66ca75a903c196151d0551e: openshift-kube-apiserver/kube-apiserver-caotvdu102/setup" id=bcee3fbc-8d6d-4493-968e-271978d120bb name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:06:53 caotvdu102 bash[6003]: time="2022-07-19 01:06:53.045063572Z" level=info msg="Started container" PID=370434 containerID=16b28a3cda8effe6dfbaf0037d3331ee1b6ca5dfd66ca75a903c196151d0551e description=openshift-kube-apiserver/kube-apiserver-caotvdu102/setup id=171ab9f0-e180-4277-9c7e-a7937fbd7bc3 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:06:53 caotvdu102 bash[6003]: time="2022-07-19 01:06:53.491534564Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=5797d6b2-1ffc-4ae6-85a4-9671148efd34 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:08:02 caotvdu102 bash[6003]: time="2022-07-19 01:08:02.894368384Z" level=info msg="Created container be1fa650816c046b18a44ee13e14b4af0cfc8200e5bd76c20fb05b806a87ca2c: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=5797d6b2-1ffc-4ae6-85a4-9671148efd34 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:08:03 caotvdu102 bash[6003]: time="2022-07-19 01:08:03.484342038Z" level=info msg="Started container" PID=397995 containerID=be1fa650816c046b18a44ee13e14b4af0cfc8200e5bd76c20fb05b806a87ca2c description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver id=1dd49b47-669f-4871-81ef-662c0f4fbe9e name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:08:05 caotvdu102 bash[6003]: time="2022-07-19 01:08:05.680813816Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=91bdd0d7-1ce1-4d34-8c60-b8c33a61d8d2 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:10:05 caotvdu102 bash[6003]: time="2022-07-19 01:10:05.725816976Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=a85dfbbf-84df-4875-97cd-dd3965de83fe name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:11:41 caotvdu102 bash[6003]: time="2022-07-19 01:11:41.508509075Z" level=info msg="Created container 6f94dafa80759bbc8e286595ca184cd49a5ecc08fe8c1db6ef4ebef5171319a2: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=a85dfbbf-84df-4875-97cd-dd3965de83fe name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:11:41 caotvdu102 bash[6003]: time="2022-07-19 01:11:41.765920989Z" level=info msg="Started container" PID=471070 containerID=6f94dafa80759bbc8e286595ca184cd49a5ecc08fe8c1db6ef4ebef5171319a2 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller id=a32222c1-8522-4398-916a-8de2a076fe38 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:11:42 caotvdu102 bash[6003]: time="2022-07-19 01:11:42.483187695Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=31b9d194-ba26-485f-857b-325b571d7054 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:12:20 caotvdu102 bash[6003]: time="2022-07-19 01:12:20.614499524Z" level=info msg="Created container 150d71b6bab6e50b3d0311325e54fd93dcc08d7041041e142bfbc838e78200a6: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=31b9d194-ba26-485f-857b-325b571d7054 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:12:20 caotvdu102 bash[6003]: time="2022-07-19 01:12:20.687906711Z" level=info msg="Started container" PID=517942 containerID=150d71b6bab6e50b3d0311325e54fd93dcc08d7041041e142bfbc838e78200a6 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz id=70098466-b645-4c5f-a81f-2a07ebc8e12b name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:12:20 caotvdu102 bash[6003]: time="2022-07-19 01:12:20.783205493Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=fc78fd45-82e2-4ed1-9147-c0bfe91e9599 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:12:46 caotvdu102 bash[6003]: time="2022-07-19 01:12:46.886795485Z" level=info msg="Created container 56ad1d9ce65f5f0a08bd8bb528bbd95d8467e91079989aa6a82dff8adf99cfd1: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=fc78fd45-82e2-4ed1-9147-c0bfe91e9599 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:12:47 caotvdu102 bash[6003]: time="2022-07-19 01:12:47.994693793Z" level=info msg="Started container" PID=536120 containerID=56ad1d9ce65f5f0a08bd8bb528bbd95d8467e91079989aa6a82dff8adf99cfd1 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints id=9479ee1d-bd66-4f20-8b41-2f3b16dffd57 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:12:49 caotvdu102 bash[6003]: time="2022-07-19 01:12:49.255166078Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=e41ae56f-68fe-4fcd-9700-62406e529ff0 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:12:53 caotvdu102 bash[6003]: time="2022-07-19 01:12:53.464259342Z" level=info msg="Creating container: openshift-kube-apiserver/revision-pruner-43-caotvdu102/pruner" id=68bba509-6114-4528-bd68-740832b0640e name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:01 caotvdu102 bash[6003]: time="2022-07-19 01:13:01.200626370Z" level=info msg="Creating container: openshift-kube-apiserver/installer-43-caotvdu102/installer" id=23d593cd-2f30-41b2-8fe7-e0da0698418f name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:02 caotvdu102 bash[6003]: time="2022-07-19 01:13:02.942679713Z" level=info msg="Created container c80178b43464750f4b6be09d09b1e306e73402247f98f607c4d92556990d114e: openshift-kube-apiserver/revision-pruner-43-caotvdu102/pruner" id=68bba509-6114-4528-bd68-740832b0640e name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:02 caotvdu102 bash[6003]: time="2022-07-19 01:13:02.978808705Z" level=info msg="Started container" PID=551010 containerID=c80178b43464750f4b6be09d09b1e306e73402247f98f607c4d92556990d114e description=openshift-kube-apiserver/revision-pruner-43-caotvdu102/pruner id=6d339823-ce1b-4081-864b-5cd3ff98c7f1 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=a5cec1dd22c7820ad072e83f5167379b5d5b4db05138256cc72147d7427714c8 Jul 19 01:13:11 caotvdu102 bash[6003]: time="2022-07-19 01:13:11.776328112Z" level=info msg="Created container d6209ada479e3de00ed66d4492ca71cac0d24c4c91ff602ac927ddcba538b6c0: openshift-kube-apiserver/installer-43-caotvdu102/installer" id=23d593cd-2f30-41b2-8fe7-e0da0698418f name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:12 caotvdu102 bash[6003]: time="2022-07-19 01:13:12.404062389Z" level=info msg="Started container" PID=554386 containerID=d6209ada479e3de00ed66d4492ca71cac0d24c4c91ff602ac927ddcba538b6c0 description=openshift-kube-apiserver/installer-43-caotvdu102/installer id=22494bfb-25d8-4bd7-be2c-808c67a079de name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=da559079cec4d7b0190549321062a5b4fb4f18cbe5b964e85b8a8066ac195c2a Jul 19 01:13:21 caotvdu102 bash[6003]: time="2022-07-19 01:13:21.174437743Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=c573f5dd-cdd6-4c5b-ada8-aa386a400bce name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:21 caotvdu102 bash[6003]: time="2022-07-19 01:13:21.877525913Z" level=info msg="Created container 6eb9797c53528fe0fe938639a846aef3caf656c3717e10b557d50a3c1e5c45b0: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=e41ae56f-68fe-4fcd-9700-62406e529ff0 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:21 caotvdu102 bash[6003]: time="2022-07-19 01:13:21.918641718Z" level=info msg="Started container" PID=549782 containerID=6eb9797c53528fe0fe938639a846aef3caf656c3717e10b557d50a3c1e5c45b0 description=openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer id=843192fe-b89b-4fe9-affb-a50f70180697 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=65583d3ff15bae5a4ffc94cb2bdb49c53d0c89af6088e019c4f3249c230daf8f Jul 19 01:13:22 caotvdu102 bash[6124]: I0719 01:13:22.189028 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=19f03120-233b-4160-b74c-c890ea665453 containerName="kube-apiserver" containerID="cri-o://be1fa650816c046b18a44ee13e14b4af0cfc8200e5bd76c20fb05b806a87ca2c" gracePeriod=15 Jul 19 01:13:22 caotvdu102 bash[6124]: I0719 01:13:22.189109 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=19f03120-233b-4160-b74c-c890ea665453 containerName="kube-apiserver-cert-syncer" containerID="cri-o://6eb9797c53528fe0fe938639a846aef3caf656c3717e10b557d50a3c1e5c45b0" gracePeriod=15 Jul 19 01:13:22 caotvdu102 bash[6124]: I0719 01:13:22.189156 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=19f03120-233b-4160-b74c-c890ea665453 containerName="kube-apiserver-check-endpoints" containerID="cri-o://56ad1d9ce65f5f0a08bd8bb528bbd95d8467e91079989aa6a82dff8adf99cfd1" gracePeriod=15 Jul 19 01:13:22 caotvdu102 bash[6124]: I0719 01:13:22.189201 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=19f03120-233b-4160-b74c-c890ea665453 containerName="kube-apiserver-insecure-readyz" containerID="cri-o://150d71b6bab6e50b3d0311325e54fd93dcc08d7041041e142bfbc838e78200a6" gracePeriod=15 Jul 19 01:13:22 caotvdu102 bash[6124]: I0719 01:13:22.189291 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=19f03120-233b-4160-b74c-c890ea665453 containerName="kube-apiserver-cert-regeneration-controller" containerID="cri-o://6f94dafa80759bbc8e286595ca184cd49a5ecc08fe8c1db6ef4ebef5171319a2" gracePeriod=15 Jul 19 01:13:54 caotvdu102 bash[6003]: time="2022-07-19 01:13:54.575510406Z" level=info msg="Created container a50e84e9b3252f4b7fb789a94edd503d0d60659beccc9519de65bcb098b54df7: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=c573f5dd-cdd6-4c5b-ada8-aa386a400bce name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 19 01:13:54 caotvdu102 bash[6003]: time="2022-07-19 01:13:54.667200841Z" level=info msg="Started container" PID=564630 containerID=a50e84e9b3252f4b7fb789a94edd503d0d60659beccc9519de65bcb098b54df7 description=openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor id=ae2d8923-5d8b-484b-8f71-4c1bca40d0b0 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=0ee99c74f2553013d674a64a9adbf36de7271e38349c0f2aaf286d20660637f6 Jul 19 01:18:54 caotvdu102 bash[6124]: I0719 01:18:54.110589 6124 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102" podUID=424098dd-ce39-4370-9ffd-7c62d5aeedf7 containerName="startup-monitor" containerID="cri-o://a50e84e9b3252f4b7fb789a94edd503d0d60659beccc9519de65bcb098b54df7" gracePeriod=5 Jul 19 02:42:38 caotvdu102 bash[6124]: E0719 02:42:38.105559 6124 event.go:273] Unable to write event: '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-apiserver-startup-monitor-caotvdu102.170315f61bd7f0cd", GenerateName:"", Namespace:"openshift-kube-apiserver", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-apiserver", Name:"kube-apiserver-startup-monitor-caotvdu102", UID:"424098dd-ce39-4370-9ffd-7c62d5aeedf7", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{startup-monitor}"}, Reason:"Created", Message:"Created container startup-monitor", Source:v1.EventSource{Component:"kubelet", Host:"caotvdu102"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc0ad9f34a2513ccd, ext:474128165495660, loc:(*time.Location)(0x7a03ec0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc0ad9f34a2513ccd, ext:474128165495660, loc:(*time.Location)(0x7a03ec0)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Post "https://api-int.caotvdu102.exilis.caot.gic.ericsson.se:6443/api/v1/namespaces/openshift-kube-apiserver/events": dial tcp 10.114.74.102:6443: connect: connection refused'(may retry after sleeping) J Aslo visualized the data in spreadsheet: https://docs.google.com/spreadsheets/d/1TARun2gpQ7493gJ1MDvjnsI34HjoE3RguQR9jnIZjHI/edit?usp=sharing Container Creating container Created container Latency(Seconds) Started container Killing container openshift-kube-apiserver/installer-42-caotvdu102/installer 2022-07-19 0:58:55 2022-07-19 0:59:03 2022-07-19 0:59:03 openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor 2022-07-19 0:59:18 2022-07-19 1:00:08 2022-07-19 1:00:09 2022-07-19 1:05:09 openshift-kube-apiserver/kube-apiserver-caotvdu102/setup 2022-07-19 0:59:27 2022-07-19 1:00:20 2022-07-19 1:00:20 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver 2022-07-19 1:00:26 2022-07-19 1:01:10 2022-07-19 1:01:12 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer 2022-07-19 1:01:13 2022-07-19 1:05:31 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer(retry) 2022-07-19 1:05:09 2022-07-19 1:05:31 2022-07-19 1:05:31 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller 2022-07-19 1:03:13 2022-07-19 1:04:09 2022-07-19 1:04:09 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz 2022-07-19 1:04:11 2022-07-19 1:04:39 2022-07-19 1:04:39 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints 2022-07-19 1:04:39 2022-07-19 1:05:07 2022-07-19 1:05:07 2022-07-19 1:05:31 openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor 2022-07-19 0:59:18 2022-07-19 1:00:08 50 2022-07-19 1:00:09 2022-07-19 1:05:09 openshift-kube-apiserver/kube-apiserver-caotvdu102/setup 2022-07-19 1:06:01 2022-07-19 1:06:52 51 2022-07-19 1:06:53 2022-07-19 1:13:22 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver 2022-07-19 1:06:53 2022-07-19 1:08:02 69 2022-07-19 1:08:03 2022-07-19 1:13:22 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer 2022-07-19 1:08:05 2022-07-19 1:11:41 216 2022-07-19 1:11:41 2022-07-19 1:13:22 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer(retry) 2022-07-19 1:12:49 2022-07-19 1:13:21 32 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller 2022-07-19 1:10:05 2022-07-19 1:11:41 96 2022-07-19 1:11:41 2022-07-19 1:13:22 openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor(another one) 2022-07-19 1:13:21 2022-07-19 1:13:54 33 2022-07-19 1:18:54 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz 2022-07-19 1:11:42 2022-07-19 1:12:20 38 2022-07-19 1:12:20 2022-07-19 1:13:22 openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints 2022-07-19 1:12:20 2022-07-19 1:12:46 26 2022-07-19 1:12:47 2022-07-19 1:13:22 After that there was no retry any more and kube-appiserver was down. The raw journal logs can be found here: https://access.redhat.com/support/cases/#/case/03246567/discussion?attachmentId=a096R00002hMiKAQA0 Version-Release number of selected component (if applicable): 4.9.37 SNO How reproducible: Quite often on customers nodes. Many time they had to undeploy the workload and restart kubelet to recover. Steps to Reproduce: Not sure, but with logs above seems like it happened when there was a big lantency crio created containers. Actual results: kube-apiserver could not recover automatically. Expected results: kube-apiserver shall be recovered automatically. Additional info:
The issue happened again today. Here are the timelines fetch from kubelet/crio logs: Installer pod starts fine, manifests are written out to the fs Jul 21 03:11:21 caotvdu102 bash[5905]: I0721 03:11:21.021419 5905 kubelet.go:2115] "SyncLoop ADD" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102] Jul 21 03:11:21 caotvdu102 bash[5905]: I0721 03:11:21.023228 5905 policy_static.go:98] "Allocate" pod="openshift-kube-apiserver/installer-44-caotvdu102" containerName="installer" Jul 21 03:11:36 caotvdu102 bash[5905]: I0721 03:11:36.201974 5905 kubelet.go:2122] "SyncLoop UPDATE" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102] Jul 21 03:11:36 caotvdu102 bash[5905]: I0721 03:11:36.846916 5905 kuberuntime_manager.go:484] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-apiserver/installer-44-caotvdu102" Jul 21 03:11:40 caotvdu102 bash[5905]: I0721 03:11:40.938504 5905 kubelet.go:2122] "SyncLoop UPDATE" source="api" pods=[openshift-kube-apiserver/installer-44-caotvdu102] Jul 21 03:11:36 caotvdu102 bash[5847]: time="2022-07-21 03:11:36.848221192Z" level=info msg="Running pod sandbox: openshift-kube-apiserver/installer-44-caotvdu102/POD" id=c0af1175-4c29-4c15-9ea0-d8728d85d102 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox Jul 21 03:11:40 caotvdu102 bash[5847]: time="2022-07-21 03:11:40.957179176Z" level=info msg="Creating container: openshift-kube-apiserver/installer-44-caotvdu102/installer" id=892ed72d-f15e-4a85-b29d-3f9f00e32334 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 21 03:11:44 caotvdu102 bash[5847]: time="2022-07-21 03:11:44.979953567Z" level=info msg="Created container 258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94: openshift-kube-apiserver/installer-44-caotvdu102/installer" id=892ed72d-f15e-4a85-b29d-3f9f00e32334 name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 21 03:11:44 caotvdu102 bash[5847]: time="2022-07-21 03:11:44.980944452Z" level=info msg="Starting container: 258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94" id=3672c118-eb18-420d-ad6b-c026258a2427 name=/runtime.v1alpha2.RuntimeService/StartContainer Jul 21 03:11:45 caotvdu102 bash[5847]: time="2022-07-21 03:11:45.050459990Z" level=info msg="Started container" PID=209148 containerID=258f467f13299585a84dc05350764a28b941c74a315c1ba6920f63c8ee30ed94 description=openshift-kube-apiserver/installer-44-caotvdu102/installer id=3672c118-eb18-420d-ad6b-c026258a2427 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=c94ac5f32700ab1371ddb6ac2c62b77d9818d916a839bd5cfdc022c3f1b8c1a3 Current api-server is shutdown Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427038 5905 kubelet.go:2125] "SyncLoop REMOVE" source="file" pods=[openshift-kube-apiserver/kube-apiserver-caotvdu102] Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427105 5905 kubelet.go:2115] "SyncLoop ADD" source="file" pods=[openshift-kube-apiserver/kube-apiserver-caotvdu102] Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427762 5905 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 containerName="kube-apiserver" containerID="cri-o://8a3c50611b4a9db88d02d6e0cfb99697b8d7108a788c2f42e2455583aad42ad2" gracePeriod=15 Jul 21 03:12:38 caotvdu102 bash[5905]: I0721 03:12:38.427803 5905 kuberuntime_container.go:729] "Killing container with a grace period" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 containerName="kube-apiserver-check-endpoints" containerID="cri-o://693be590487b24e8b0a37502536393ddbf29ce307d75c3c836e01813e5a027a1" gracePeriod=15 Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.189670410Z" level=info msg="Stopped container 693be590487b24e8b0a37502536393ddbf29ce307d75c3c836e01813e5a027a1: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-check-endpoints" id=d415b96b-fc85-4ea3-8f4a-480f595eabc3 name=/runtime.v1alpha2.RuntimeService/StopContainer Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.763237117Z" level=info msg="Stopped container e25115bc4cba45c842dfad945512846bfdc90f2e5f938e899a8962914882a556: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-regeneration-controller" id=2bf26335-cf65-4828-b3c0-2b5fe9a0ba13 name=/runtime.v1alpha2.RuntimeService/StopContainer Jul 21 03:12:42 caotvdu102 bash[5847]: time="2022-07-21 03:12:42.980432439Z" level=info msg="Stopped container d59b952c1cbff58df04eef6d04d0e2ac5091ba9dddd137e20a4a733677c56224: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-insecure-readyz" id=9ce08168-8f59-414c-8fde-96cf8ed0fed7 name=/runtime.v1alpha2.RuntimeService/StopContainer Jul 21 03:12:43 caotvdu102 bash[5847]: time="2022-07-21 03:12:43.582282235Z" level=info msg="Stopped container b2e7b33ed120a9b657d1cacf697793e797f99ee444d0cdeac8ac55b213af4b9a: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver-cert-syncer" id=c81912a9-23fc-4af6-a3b3-ab001c5555d1 name=/runtime.v1alpha2.RuntimeService/StopContainer Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.115780302Z" level=info msg="Stopped container 8a3c50611b4a9db88d02d6e0cfb99697b8d7108a788c2f42e2455583aad42ad2: openshift-kube-apiserver/kube-apiserver-caotvdu102/kube-apiserver" id=136d101e-d0b2-4343-88e3-b95798ea2491 name=/runtime.v1alpha2.RuntimeService/StopContainer Started the monitor pod Jul 21 03:12:44 caotvdu102 bash[5847]: time="2022-07-21 03:12:44.123394100Z" level=info msg="Running pod sandbox: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/POD" id=b3104916-276f-4a6c-8e5b-5bbfe1bfb0a8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.423191683Z" level=info msg="Ran pod sandbox f3b2b5fc2a74d7a07cdd1d4f7b869dcf734fcb89e1ca2594d37cb583623167c7 with infra container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/POD" id=b3104916-276f-4a6c-8e5b-5bbfe1bfb0a8 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox Jul 21 03:12:45 caotvdu102 bash[5847]: time="2022-07-21 03:12:45.674320243Z" level=info msg="Creating container: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=6a7f9f4d-2a75-4ed6-90d9-80e60a9a49fd name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 21 03:12:53 caotvdu102 bash[5847]: time="2022-07-21 03:12:53.070909342Z" level=info msg="Created container ebdf77ed1264d4748c7e811b666717e0d0134c28313117c84fc9b6e88e3b274a: openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor" id=6a7f9f4d-2a75-4ed6-90d9-80e60a9a49fd name=/runtime.v1alpha2.RuntimeService/CreateContainer Jul 21 03:12:54 caotvdu102 bash[5847]: time="2022-07-21 03:12:54.578591468Z" level=info msg="Started container" PID=234577 containerID=ebdf77ed1264d4748c7e811b666717e0d0134c28313117c84fc9b6e88e3b274a description=openshift-kube-apiserver/kube-apiserver-startup-monitor-caotvdu102/startup-monitor id=bd14c18f-7c9d-4189-8bda-f337a0cc4de5 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=f3b2b5fc2a74d7a07cdd1d4f7b869dcf734fcb89e1ca2594d37cb583623167c7 But the kubelet never seems to start the new pod The old containers shutdown but cannot delete the pod kubelet_logs_last_45m:Jul 21 03:12:45 caotvdu102 bash[5905]: E0721 03:12:45.767969 5905 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"6edc0f9b-1f95-4c9e-915d-3421ea3e28f3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [remove /var/run/netns/7e9d874f-41b9-408e-9be9-f3ae94704f04: device or resource busy]\"" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 Attached the raw logs: apiserver_info_07210309.tar
From above bug description, kube-apiserver pod was unable to remove due to device or resource busy, > kubelet_logs_last_45m:Jul 21 03:12:45 caotvdu102 bash[5905]: E0721 03:12:45.767969 5905 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"6edc0f9b-1f95-4c9e-915d-3421ea3e28f3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [remove /var/run/netns/7e9d874f-41b9-408e-9be9-f3ae94704f04: device or resource busy]\"" pod="openshift-kube-apiserver/kube-apiserver-caotvdu102" podUID=6edc0f9b-1f95-4c9e-915d-3421ea3e28f3 we don't see a way to reproduce it. So I have to do some similar test to simulate the process with some load. Before the testing, $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2022-12-01-184212 True False 4h33m Cluster version is 4.12.0-0.nightly-2022-12-01-184212 $ oc adm top node NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1131m 7% 13913Mi 23% After added some system load, $ oc adm top node NAME CPU(cores) CPU% MEMORY(bytes) MEMORY% kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 9545m 61% 18671Mi 31% 1. Set the installer pods to fail and try backoff during rollout by injecting error: $ oc patch kubeapiserver/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":{"installerErrorInjection":{"failPropability":1.0}}}}' 2. Force rollout to let the kube-apiserver roll out to simulate the restart: oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test1: '"$(date)"'" } ]' Repeatedly check, There will be many retry installer pods failing and retrying backoff: $ oc get po -n openshift-kube-apiserver -L revision NAME READY STATUS RESTARTS AGE REVISION apiserver-watcher-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1/1 Running 0 3h18m installer-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h11m installer-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3h8m installer-7-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h3m installer-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 179m installer-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3m32s installer-9-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3m9s installer-9-retry-2-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 2m42s installer-9-retry-3-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 2m2s installer-9-retry-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 61s kube-apiserver-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 5/5 Running 0 178m 8 revision-pruner-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h4m revision-pruner-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 179m revision-pruner-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m38s 3. Remove above injection: $ oc patch kubeapiserver/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":null}}' 4. Force rollout again, make kube-apiserver stop and start: oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "xxia forced test: '"$(date)"'" } ]' $ oc get po -n openshift-kube-apiserver -L revision --show-labels NAME READY STATUS RESTARTS AGE REVISION LABELS apiserver-watcher-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 1/1 Running 0 3h24m <none> installer-10-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m42s app=installer installer-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 3h14m app=installer installer-7-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h10m app=installer installer-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h5m app=installer installer-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 10m app=installer installer-9-retry-1-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 9m39s app=installer installer-9-retry-2-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 9m12s app=installer installer-9-retry-3-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 8m32s app=installer installer-9-retry-4-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 7m31s app=installer installer-9-retry-5-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Error 0 6m10s app=installer kube-apiserver-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 5/5 Running 0 2m34s 10 apiserver=true,app=openshift-kube-apiserver,revision=10 revision-pruner-10-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3m48s app=pruner revision-pruner-7-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h10m app=pruner revision-pruner-8-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 3h5m app=pruner revision-pruner-9-kewang-0212gp1-mjz52-master-0.c.openshift-qe.internal 0/1 Completed 0 10m app=pruner We can see the new installer 10 pod completed and kube-apiserver restarted with new revision 10. The case has been automated and worked well in our daily tests, didn't hit the same problem. Based on this, we can consider the bug fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399