Description of problem:

The API server is still not restarting gracefully in all cases. A couple of patches have gone in, but the issue has migrated into the storage areas of the kubelet. Previously, the race condition was that the cgroup was being cleaned up while the pod was being killed. We need to investigate why the storage reconciler is cleaning up the pod.

https://bugzilla.redhat.com/show_bug.cgi?id=1882750
https://bugzilla.redhat.com/show_bug.cgi?id=1925262
https://bugzilla.redhat.com/show_bug.cgi?id=1927500

Search CI: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
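To illustrate the shape of that race (a background cleanup pass tearing down a pod's cgroup while the pod is still being killed), here is a minimal, self-contained Go sketch; the types and names are invented for illustration and this is not the kubelet's code:

// Illustration only: two concurrent paths touch the same pod resources.
// killGracefully expects the cgroup to survive until the grace period
// elapses, while an unsynchronized cleanup pass can remove it early.
package main

import (
	"fmt"
	"sync"
	"time"
)

type podResources struct {
	mu     sync.Mutex
	cgroup bool // true while the pod's cgroup still exists
}

func (p *podResources) killGracefully(grace time.Duration) {
	time.Sleep(grace) // wait out the grace period; the cgroup must still exist afterwards
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.cgroup {
		fmt.Println("race: cgroup was removed while the pod was still being killed")
		return
	}
	p.cgroup = false
	fmt.Println("clean shutdown: cgroup removed only after the grace period")
}

// reconcileCleanup stands in for a periodic reconciler that does not check
// whether the pod is still terminating before cleaning up its resources.
func (p *podResources) reconcileCleanup() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.cgroup = false
}

func main() {
	p := &podResources{cgroup: true}
	go p.reconcileCleanup() // races with the graceful kill below
	p.killGracefully(100 * time.Millisecond)
}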
Marking as urgent: this test affects 47% of failed 4.8 runs (15% of 4.8 runs), and it blocks promotion of nightly builds.
(In reply to Oleg Bulatov from comment #1)
> Marking as urgent: this test affects 47% of failed 4.8 runs (15% of 4.8
> runs), and it blocks promotion of nightly builds.

It's in 4.7 too: https://search.ci.openshift.org/?search=kubelet+terminates+kube-apiserver+gracefully&maxAge=12h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job
Nightlies should be promoting again... We reverted the crio and conmon bump.
Kubelet has received various patches to fix races with the termination, shutdown of nodes, and graceful termination of pods. One patch for readiness probes just landed a few days ago (https://github.com/openshift/kubernetes/pull/594). The 'kubelet terminates kube-apiserver gracefully' test is now getting a failed liveness probe. The graceful termination timeout is being honored. Hopefully the API server team can figure out the liveness probe failure. cri-o logs do not show any errors with networking.

Job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.8/1366716541797666816
Logs: https://gist.github.com/rphillips/d221c7260f75fa884fdf3a1a10809549

Mar 02 12:04:18.960050 ip-10-0-197-245 systemd[1]: Started crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope.
Mar 02 12:04:18.983818 ip-10-0-197-245 systemd[1]: Started libcontainer container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.
Mar 02 12:04:19.045404 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.045371413Z" level=info msg="Created container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=3335ec12-996b-45f3-a32c-f2820eecb2be name=/runtime.v1alpha2.RuntimeService/CreateContainer
Mar 02 12:04:19.045770 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.045743457Z" level=info msg="Starting container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" id=309a7ece-e186-4f13-9b9f-51d5d51e3e0f name=/runtime.v1alpha2.RuntimeService/StartContainer
Mar 02 12:04:19.061679 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.061643568Z" level=info msg="Started container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=309a7ece-e186-4f13-9b9f-51d5d51e3e0f name=/runtime.v1alpha2.RuntimeService/StartContainer
Mar 02 12:04:19.958000 ip-10-0-197-245 hyperkube[1440]: I0302 12:04:19.957460 1440 kubelet.go:1958] SyncLoop (PLEG): "kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba)", event: &pleg.PodLifecycleEvent{ID:"366daf38-eea4-49c5-b1ca-2a0a35bb79ba", Type:"ContainerStarted", Data:"af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}
Mar 02 12:08:33.155779 ip-10-0-197-245 hyperkube[1440]: I0302 12:08:33.155760 1440 kuberuntime_manager.go:660] Container "kube-apiserver" ({"cri-o" "af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}) of pod kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba): Container kube-apiserver failed liveness probe, will be restarted
Mar 02 12:08:33.156233 ip-10-0-197-245 hyperkube[1440]: I0302 12:08:33.156218 1440 kuberuntime_container.go:650] Killing container "cri-o://af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" with a 135 second grace period
Mar 02 12:08:33.156496 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:08:33.156460302Z" level=info msg="Stopping container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978 (timeout: 135s)" id=07f1f903-2a12-408a-a2a4-fd432d842c19 name=/runtime.v1alpha2.RuntimeService/StopContainer
Mar 02 12:10:48.169584 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:10:48.169531210Z" level=warning msg="Stopping container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978 with stop signal timed out: timeout reached after 135 seconds waiting for container process to exit"
Mar 02 12:10:48.226353 ip-10-0-197-245 systemd[1]: crio-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Succeeded.
Mar 02 12:10:48.227002 ip-10-0-197-245 systemd[1]: crio-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Consumed 2min 8.724s CPU time
Mar 02 12:10:48.238230 ip-10-0-197-245 systemd[1]: crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Succeeded.
Mar 02 12:10:48.238779 ip-10-0-197-245 systemd[1]: crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Consumed 335ms CPU time
Mar 02 12:10:48.406158 ip-10-0-197-245 hyperkube[1440]: I0302 12:10:48.406123 1440 kubelet.go:1958] SyncLoop (PLEG): "kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba)", event: &pleg.PodLifecycleEvent{ID:"366daf38-eea4-49c5-b1ca-2a0a35bb79ba", Type:"ContainerDied", Data:"af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}
Mar 02 12:10:48.616578 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:10:48.616534240Z" level=info msg="Stopped container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=07f1f903-2a12-408a-a2a4-fd432d842c19 name=/runtime.v1alpha2.RuntimeService/StopContainer
Mar 02 12:12:17.195483 ip-10-0-197-245 hyperkube[1440]: I0302 12:12:17.195460 1440 scope.go:95] [topologymanager] RemoveContainer - Container ID: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978
Mar 02 12:12:17.196989 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:12:17.196953197Z" level=info msg="Removing container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" id=53eaa261-655b-4249-abba-1f4cdc8ff439 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Mar 02 12:12:17.237336 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:12:17.236906452Z" level=info msg="Removed container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=53eaa261-655b-4249-abba-1f4cdc8ff439 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
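For context on the "135 second grace period" in the log above: the kubelet derives the grace period it passes to cri-o from the pod spec. A simplified Go sketch of that precedence (not the actual kubelet code; the fallback value is the API default for terminationGracePeriodSeconds when nothing is set):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// effectiveGracePeriod mirrors, in simplified form, the precedence applied
// when the kubelet kills a container: an explicit override (e.g. eviction),
// then the deletion grace period, then spec.terminationGracePeriodSeconds.
func effectiveGracePeriod(pod *corev1.Pod, override *int64) int64 {
	switch {
	case override != nil:
		return *override
	case pod.DeletionGracePeriodSeconds != nil:
		return *pod.DeletionGracePeriodSeconds
	case pod.Spec.TerminationGracePeriodSeconds != nil:
		return *pod.Spec.TerminationGracePeriodSeconds
	default:
		return 30 // API default for terminationGracePeriodSeconds when unset
	}
}

func main() {
	grace := int64(135) // what the kube-apiserver static pod requests here
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
		},
	}
	fmt.Printf("kubelet would stop the container with a %ds grace period\n",
		effectiveGracePeriod(pod, nil))
}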
*** Bug 1931624 has been marked as a duplicate of this bug. ***
Looked more into this today.

https://gist.github.com/rphillips/05c5593556e9b9ced007acb5f8e58026

The etcd-ip and kube-apiserver pods are racing to start. The apiserver pod dies about 12 seconds after starting, likely because the etcd-ip pod is still starting. Clayton mentioned in Slack that this flake happens before e2e starts, while the cluster is still installing. We could hard-fail the test after install and treat it as a flake pre-install.

Slack conversation: https://coreos.slack.com/archives/CK1AE4ZCK/p1614725382148500?thread_ts=1614694378.113500&cid=CK1AE4ZCK
I'm ok with high severity since this is no longer blocking post-install scenarios (at which point the apiserver MUST remain available).

Current actions:
1. Update the graceful termination test to detect that the liveness probe failed (in which case it's the apiserver team's responsibility to debug first).
2. Split the test into two parts like the SCC test: fail if the violation happens after the test suite starts, flake if it happens before, unless the apiserver team is interested in fixing this right now (i.e. this is important / delays startup). See the sketch below this comment.
3. Congratulate ourselves on getting 99% of this fixed - thank you node team, and thanks to the apiserver team for continued debugging.
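A hypothetical sketch of item 2 (the names and types here are invented, not the actual origin test code): classify a graceful-termination violation as a hard failure only if it happened after the suite started, and as a flake otherwise.

package main

import (
	"fmt"
	"time"
)

// Violation records one observed non-graceful kube-apiserver termination.
type Violation struct {
	Pod  string
	Time time.Time
}

// classify returns "fail" for violations observed after the suite started
// and "flake" for ones that happened during install, before the suite ran.
func classify(v Violation, suiteStart time.Time) string {
	if v.Time.After(suiteStart) {
		return "fail"
	}
	return "flake"
}

func main() {
	suiteStart := time.Now()
	early := Violation{Pod: "kube-apiserver-master-0", Time: suiteStart.Add(-10 * time.Minute)}
	late := Violation{Pod: "kube-apiserver-master-1", Time: suiteStart.Add(30 * time.Minute)}
	fmt.Println(classify(early, suiteStart), classify(late, suiteStart)) // flake fail
}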
*** Bug 1921157 has been marked as a duplicate of this bug. ***
*** Bug 1937826 has been marked as a duplicate of this bug. ***
Found a new problem - the earlier fix targeted the wrong code. Opening a fix for both issues.
Started seeing this failure in recent e2e-agnostic-upgrade test runs on MCO PRs.

Failing invariant: [sig-node] kubelet terminates kube-apiserver gracefully

Example job runs:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2457/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1372247927493234688
- https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2466/pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade/1372451441993584640
In case these instances help: two separate openshift/openshift-apiserver PRs hit this failure, along with the accompanying "* API remain available for new connections" failure, since this went ON_QA:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/191/pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade/1372350259400806400
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openshift-apiserver/192/pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade/1372548595563106304
Update: The kube-apiserver containers are being terminated about 90s after the machine starts shutting down, instead of the 2m15s we expect. We are missing logs past 90s, which implies either a catastrophic crash occurred or systemd isn't honoring the 2m15s (135s) grace period that cri-o sets on the systemd scope.

Three things are being investigated:
1. The journal stops right away, which prevents us from seeing the shutdown order or any issues - this is a dependency bug (all scopes should be terminated before the journal exits).
2. Whatever is causing the apiserver to stop.
3. If we aren't seeing all the apiserver logs show up, that implies we are getting SIGKILLed or something on disk is lost (unlikely); so if we *aren't* getting SIGKILLed, we may be losing pod logs in these sorts of shutdown cases.

The proposed workaround is to give the scopes dependencies that ensure they are stopped before the shutdown target is reached (which should fix the journal problem); a rough sketch of that direction follows this comment. In the meantime, I'm going to mark this condition as a flake pointing back to this bug (which is moved back to ASSIGNED) while we identify a workaround. We can't yet depend on graceful shutdown in the kubelet and need another solution.
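A rough sketch of the proposed workaround direction, assuming a cri-o-style transient scope created over the systemd D-Bus API via github.com/coreos/go-systemd (this is not cri-o's actual code; the unit name, PID, and ordering targets are illustrative):

package main

import (
	"context"
	"fmt"
	"os"

	systemd "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func main() {
	ctx := context.Background()
	conn, err := systemd.NewWithContext(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	props := []systemd.Property{
		systemd.PropDescription("example container scope"),
		// Stop ordering is the reverse of start ordering, so ordering the
		// scope After= journald means it is stopped while journald is still up.
		{Name: "After", Value: godbus.MakeVariant([]string{"systemd-journald.service"})},
		// Conflicts= plus Before= on shutdown.target is the usual pattern for
		// "stop this unit before the shutdown target is reached".
		{Name: "Before", Value: godbus.MakeVariant([]string{"shutdown.target"})},
		{Name: "Conflicts", Value: godbus.MakeVariant([]string{"shutdown.target"})},
		// A transient scope needs the PIDs it should contain; our own PID
		// stands in for the container's conmon PID here.
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{uint32(os.Getpid())})},
	}

	ch := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(ctx, "example-container.scope", "replace", props, ch); err != nil {
		panic(err)
	}
	fmt.Println("scope job result:", <-ch)
}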
Will be marking the "kubelet terminates kube-apiserver gracefully" invariant as flaky (https://github.com/openshift/origin/pull/25986) so that changes can merge while cri-o and kubelet find a short-term fix as described above.
*** Bug 1942740 has been marked as a duplicate of this bug. ***
The individual test is marked as flaky so it no longer fails jobs on its own, but it seems to cause *other* tests to fail too (that's why https://bugzilla.redhat.com/show_bug.cgi?id=1942740 was marked a duplicate of this one).
Adding the alert test-case name so Sippy can find this and folks don't open more iterations of bug 1942740 that end up getting closed as duplicates. That test-case can also fail for other reasons unrelated to this bug, but Sippy doesn't have provisions for that at the moment.
Seems that every recent 4.7->4.8 upgrade hits some variation of this. Some example jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377201792940511232
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377186186820325376
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377172439389179904
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377192655808630784
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377218187266887680
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377233268302155776

I don't believe we've had a successful 4.7->4.8 upgrade yet based on those jobs. Using AWS as an example:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade

Other platforms are generally worse. Is this the same problem as highlighted here?
*** Bug 1938353 has been marked as a duplicate of this bug. ***
*** This bug has been marked as a duplicate of bug 1952224 ***