Bug 1928946

Summary: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully ... continues
Product: OpenShift Container Platform Reporter: Ryan Phillips <rphillips>
Component: Node Assignee: Ryan Phillips <rphillips>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: adam.kaplan, aos-bugs, ccoleman, cewong, fgiudici, gmontero, jerzhang, jluhrsen, mankulka, mfojtik, mgugino, nagrawal, obulatov, pmuller, rgudimet, skumari, tsweeney, wking, xxia
Version: 4.8   
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: TechnicalReleaseBlocker tag-ci
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-api-machinery][Feature:APIServer][Late] kubelet terminates kube-apiserver gracefully [sig-arch] Check if alerts are firing during or after upgrade success
Last Closed: 2021-08-12 19:35:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1932097    

Description Ryan Phillips 2021-02-15 20:43:32 UTC
Description of problem:

The API server is still not restarting gracefully in all cases. A couple of patches have gone in, but the issue has migrated into the storage areas of the kubelet. Previously, the race condition was that the cgroup was being cleaned up while the pod was still being killed.

We need to investigate why the storage reconciler is cleaning up the pod.
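
For illustration only - this is not kubelet source, and the names (podResources, reconcile, killPod) are hypothetical - the suspected shape of the race is a background reconciler deleting per-pod state (the cgroup) while the kill path is still in the middle of a graceful termination:

// Illustrative sketch only -- NOT kubelet code. Names are hypothetical.
// It shows the shape of the suspected race: a background reconciler deletes
// per-pod state (e.g. the cgroup) while the kill path is still waiting out
// the graceful termination period.
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu           sync.Mutex
	podResources = map[string]bool{"kube-apiserver": true} // pod -> cgroup still present
)

// reconcile mimics the storage/cgroup reconciler cleaning up pods it believes
// are no longer running.
func reconcile() {
	mu.Lock()
	defer mu.Unlock()
	for pod := range podResources {
		delete(podResources, pod) // premature cleanup: the pod is still terminating
	}
}

// killPod mimics a graceful termination that still needs the pod's resources
// until the grace period has elapsed.
func killPod(pod string, grace time.Duration) error {
	time.Sleep(grace) // waiting for the container to exit
	mu.Lock()
	defer mu.Unlock()
	if !podResources[pod] {
		return fmt.Errorf("cgroup for %s was removed before termination finished", pod)
	}
	delete(podResources, pod)
	return nil
}

func main() {
	go func() {
		time.Sleep(10 * time.Millisecond)
		reconcile() // fires while killPod is still waiting
	}()
	if err := killPod("kube-apiserver", 50*time.Millisecond); err != nil {
		fmt.Println("race observed:", err)
	}
}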

https://bugzilla.redhat.com/show_bug.cgi?id=1882750
https://bugzilla.redhat.com/show_bug.cgi?id=1925262
https://bugzilla.redhat.com/show_bug.cgi?id=1927500

Search CI:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-api-machinery%5C%5D%5C%5BFeature%3AAPIServer%5C%5D%5C%5BLate%5C%5D+kubelet+terminates+kube-apiserver+gracefully

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Oleg Bulatov 2021-02-17 10:57:26 UTC
Marking as urgent: this test affects 47% of failed 4.8 runs (15% of 4.8 runs), and it blocks promotion of nightly builds.

Comment 2 jamo luhrsen 2021-02-18 19:03:41 UTC
(In reply to Oleg Bulatov from comment #1)
> Marking as urgent: this test affects 47% of failed 4.8 runs (15% of 4.8
> runs), and it blocks promotion of nightly builds.

it's in 4.7 too:
https://search.ci.openshift.org/?search=kubelet+terminates+kube-apiserver+gracefully&maxAge=12h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 3 Ryan Phillips 2021-02-19 15:41:02 UTC
Nightlies should be promoting again... We reverted the crio and conmon bump.

Comment 5 Ryan Phillips 2021-03-02 21:21:54 UTC
The kubelet has received various patches to fix races in pod termination, node shutdown, and graceful termination of pods. One patch for readiness probes landed just a few days ago (https://github.com/openshift/kubernetes/pull/594). The 'kubelet terminates kube-apiserver gracefully' test is now failing on a failed liveness probe, while the graceful termination timeout is being honored. Hopefully the API server team can figure out the liveness probe failure. cri-o logs do not show any networking errors.

job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.8/1366716541797666816

https://gist.github.com/rphillips/d221c7260f75fa884fdf3a1a10809549

Mar 02 12:04:18.960050 ip-10-0-197-245 systemd[1]: Started crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope.
Mar 02 12:04:18.983818 ip-10-0-197-245 systemd[1]: Started libcontainer container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.
Mar 02 12:04:19.045404 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.045371413Z" level=info msg="Created container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=3335ec12-996b-45f3-a32c-f2820eecb2be name=/runtime.v1alpha2.RuntimeService/CreateContainer
Mar 02 12:04:19.045770 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.045743457Z" level=info msg="Starting container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" id=309a7ece-e186-4f13-9b9f-51d5d51e3e0f name=/runtime.v1alpha2.RuntimeService/StartContainer
Mar 02 12:04:19.061679 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:04:19.061643568Z" level=info msg="Started container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=309a7ece-e186-4f13-9b9f-51d5d51e3e0f name=/runtime.v1alpha2.RuntimeService/StartContainer
Mar 02 12:04:19.958000 ip-10-0-197-245 hyperkube[1440]: I0302 12:04:19.957460    1440 kubelet.go:1958] SyncLoop (PLEG): "kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba)", event: &pleg.PodLifecycleEvent{ID:"366daf38-eea4-49c5-b1ca-2a0a35bb79ba", Type:"ContainerStarted", Data:"af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}
Mar 02 12:08:33.155779 ip-10-0-197-245 hyperkube[1440]: I0302 12:08:33.155760    1440 kuberuntime_manager.go:660] Container "kube-apiserver" ({"cri-o" "af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}) of pod kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba): Container kube-apiserver failed liveness probe, will be restarted
Mar 02 12:08:33.156233 ip-10-0-197-245 hyperkube[1440]: I0302 12:08:33.156218    1440 kuberuntime_container.go:650] Killing container "cri-o://af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" with a 135 second grace period
Mar 02 12:08:33.156496 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:08:33.156460302Z" level=info msg="Stopping container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978 (timeout: 135s)" id=07f1f903-2a12-408a-a2a4-fd432d842c19 name=/runtime.v1alpha2.RuntimeService/StopContainer
Mar 02 12:10:48.169584 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:10:48.169531210Z" level=warning msg="Stopping container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978 with stop signal timed out: timeout reached after 135 seconds waiting for container process to exit"
Mar 02 12:10:48.226353 ip-10-0-197-245 systemd[1]: crio-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Succeeded.
Mar 02 12:10:48.227002 ip-10-0-197-245 systemd[1]: crio-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Consumed 2min 8.724s CPU time
Mar 02 12:10:48.238230 ip-10-0-197-245 systemd[1]: crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Succeeded.
Mar 02 12:10:48.238779 ip-10-0-197-245 systemd[1]: crio-conmon-af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978.scope: Consumed 335ms CPU time
Mar 02 12:10:48.406158 ip-10-0-197-245 hyperkube[1440]: I0302 12:10:48.406123    1440 kubelet.go:1958] SyncLoop (PLEG): "kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal_openshift-kube-apiserver(366daf38-eea4-49c5-b1ca-2a0a35bb79ba)", event: &pleg.PodLifecycleEvent{ID:"366daf38-eea4-49c5-b1ca-2a0a35bb79ba", Type:"ContainerDied", Data:"af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978"}
Mar 02 12:10:48.616578 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:10:48.616534240Z" level=info msg="Stopped container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=07f1f903-2a12-408a-a2a4-fd432d842c19 name=/runtime.v1alpha2.RuntimeService/StopContainer
Mar 02 12:12:17.195483 ip-10-0-197-245 hyperkube[1440]: I0302 12:12:17.195460    1440 scope.go:95] [topologymanager] RemoveContainer - Container ID: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978
Mar 02 12:12:17.196989 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:12:17.196953197Z" level=info msg="Removing container: af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978" id=53eaa261-655b-4249-abba-1f4cdc8ff439 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Mar 02 12:12:17.237336 ip-10-0-197-245 crio[1404]: time="2021-03-02 12:12:17.236906452Z" level=info msg="Removed container af7c924b0c3e63a0d1f74c1b9a76d0fea1c8bbfed11e89b32eb7f92c87a82978: openshift-kube-apiserver/kube-apiserver-ip-10-0-197-245.us-west-1.compute.internal/kube-apiserver" id=53eaa261-655b-4249-abba-1f4cdc8ff439 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
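
As a quick sanity check on the journal excerpt above - a throwaway sketch, not part of any test suite - the gap between the "Stopping container" line at 12:08:33 and the "stop signal timed out" line at 12:10:48 matches the requested 135 second grace period exactly:

// Throwaway verification sketch: confirm from the journal excerpt above that
// the gap between "Stopping container" (12:08:33) and "stop signal timed out"
// (12:10:48) matches the 135s grace period the kubelet asked for.
package main

import (
	"fmt"
	"time"
)

func main() {
	const layout = "2006-01-02 15:04:05"
	stopIssued, _ := time.Parse(layout, "2021-03-02 12:08:33")
	stopTimedOut, _ := time.Parse(layout, "2021-03-02 12:10:48")
	grace := 135 * time.Second

	gap := stopTimedOut.Sub(stopIssued)
	fmt.Printf("gap=%v grace=%v honored=%v\n", gap, grace, gap >= grace)
	// Output: gap=2m15s grace=2m15s honored=true
}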

Comment 6 Ryan Phillips 2021-03-02 21:24:40 UTC
*** Bug 1931624 has been marked as a duplicate of this bug. ***

Comment 7 Ryan Phillips 2021-03-02 22:56:35 UTC
Looked more into this today.

https://gist.github.com/rphillips/05c5593556e9b9ced007acb5f8e58026

The etcd-ip and kube-apiserver pods are racing to start. The apiserver pod dies about 12 seconds after starting, likely because the etcd-ip pod is still starting.

Clayton mentioned in Slack that this flake happens before e2e starts, while the cluster is still installing. We could hard-fail on occurrences after install and only flake on pre-install occurrences.

Slack conversation: https://coreos.slack.com/archives/CK1AE4ZCK/p1614725382148500?thread_ts=1614694378.113500&cid=CK1AE4ZCK

Comment 8 Clayton Coleman 2021-03-02 23:10:07 UTC
I'm OK with high severity since this is no longer blocking post-install scenarios (at which point the apiserver MUST remain available).

Current actions:

1. Update the graceful termination test to detect that the liveness probe failed (in which case it's apiserver team's responsibility to debug first)
2. Split the test into two parts like the SCC test (fail on post-test-start occurrences, flake on pre-test-start occurrences), unless the apiserver team is interested in fixing it right now (i.e. this is important / delays startup); see the sketch after this list
3. Congratulate ourselves on getting 99% of this fixed - thank you node team, and thanks to apiserver team for continued debugging
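
To make item 2 concrete, here is a minimal hypothetical sketch (not the actual openshift/origin code; the names are invented) of classifying an ungraceful apiserver termination by when it happened relative to the start of the e2e suite: terminations during install become flakes, terminations after the suite starts become failures.

// Hypothetical sketch of the test split in item 2 above -- not the actual
// openshift/origin code. Events during install become flakes; events after
// the e2e suite has started become failures.
package main

import (
	"fmt"
	"time"
)

type Verdict string

const (
	Flake Verdict = "flake" // pre-test (install-time) termination: report but do not fail the job
	Fail  Verdict = "fail"  // post-test-start termination: the apiserver must stay available
)

func classify(eventTime, suiteStart time.Time) Verdict {
	if eventTime.Before(suiteStart) {
		return Flake
	}
	return Fail
}

func main() {
	suiteStart := time.Date(2021, 3, 2, 12, 30, 0, 0, time.UTC)
	installTimeEvent := time.Date(2021, 3, 2, 12, 4, 19, 0, time.UTC) // from the log above
	fmt.Println(classify(installTimeEvent, suiteStart))               // flake
	fmt.Println(classify(suiteStart.Add(10*time.Minute), suiteStart)) // fail
}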

Comment 9 Ryan Phillips 2021-03-03 18:15:34 UTC
*** Bug 1921157 has been marked as a duplicate of this bug. ***

Comment 10 Ryan Phillips 2021-03-11 17:06:30 UTC
*** Bug 1937826 has been marked as a duplicate of this bug. ***

Comment 11 Clayton Coleman 2021-03-15 17:23:01 UTC
Found a new problem; I fixed the wrong code. Opening a fix for both issues.

Comment 15 Clayton Coleman 2021-03-18 22:13:02 UTC
Update:

The kube-apiserver containers are being terminated about 90s after the machine starts shutting down, instead of the 2m15s we expect. We are missing logs past 90s, which implies that either a catastrophic crash occurred or systemd is not honoring the 2m15s (135s) grace period that cri-o sets on the systemd scope.

Three things are being investigated:

1. The journal stops right away, which prevents us from seeing the shutdown order or any issues - this is a dependency bug (all scopes should be terminated before the journal exits)
2. Whatever is causing the apiserver to stop
3. Whether we are seeing all of the apiserver logs: if not, that implies we are getting SIGKILLed or something on disk is being lost (unlikely); if we *aren't* getting SIGKILLed, we may be losing pod logs in these sorts of shutdown cases

The proposed workaround is to give the scopes systemd dependencies that ensure they are stopped before the shutdown target is reached (which should also fix the journal issue).
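
For illustration, a rough sketch of what that workaround could look like. This is an assumption about the mechanism, not the actual cri-o/conmon change: it assumes the conmon scope is created as a transient systemd unit over D-Bus (github.com/coreos/go-systemd/v22/dbus), that dependency properties can be attached at creation time, and the PID below is a placeholder. Ordering the scope After=systemd-journald.service should make systemd stop the scope before it stops journald on shutdown, so the container shutdown logs are not lost.

// Rough sketch of the proposed workaround -- NOT the actual cri-o/conmon
// change. Assumes the scope is created as a transient systemd unit over
// D-Bus and that a dependency property can be attached at creation time.
package main

import (
	"context"
	"log"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

func main() {
	ctx := context.Background()
	conn, err := systemddbus.NewSystemdConnectionContext(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	props := []systemddbus.Property{
		systemddbus.PropDescription("example conmon scope with shutdown ordering"),
		// Order this scope after journald so systemd stops the scope first on shutdown.
		{Name: "After", Value: godbus.MakeVariant([]string{"systemd-journald.service"})},
		// A transient scope must be told which PIDs it contains; 1234 is a placeholder.
		{Name: "PIDs", Value: godbus.MakeVariant([]uint32{1234})},
	}

	done := make(chan string, 1)
	if _, err := conn.StartTransientUnitContext(ctx, "example-conmon.scope", "fail", props, done); err != nil {
		log.Fatal(err)
	}
	log.Println("scope start job:", <-done)
}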

In the meantime, I'm going to mark this condition as a flake linked back to this bug (which is being moved back to ASSIGNED) while we identify a workaround.

We can't yet depend on graceful shutdown in kubelet and need another solution.

Comment 16 Clayton Coleman 2021-03-18 22:16:23 UTC
I will be marking the "kubelet terminates kube-apiserver gracefully" invariant as flaky (https://github.com/openshift/origin/pull/25986) so that changes can merge while cri-o and the kubelet find a short-term fix as described above.

Comment 17 Stefan Schimanski 2021-03-25 08:47:04 UTC
*** Bug 1942740 has been marked as a duplicate of this bug. ***

Comment 18 Petr Muller 2021-03-25 13:18:03 UTC
The individual test is marked as flaky, so it no longer fails jobs on its own, but it seems to cause *other* tests to fail too (that's why https://bugzilla.redhat.com/show_bug.cgi?id=1942740 was marked as a duplicate of this one).

Comment 19 W. Trevor King 2021-03-30 04:00:02 UTC
Adding the alert test-case name so Sippy can find this, and we don't have folks opening more iterations of bug 1942740 that end up getting closed as dups.  That test-case can also fail for other reasons unrelated to this bug, but Sippy doesn't have provisions for that at the moment.

Comment 20 Yu Qi Zhang 2021-03-31 18:10:04 UTC
Seems that every 4.7->4.8 upgrade hits some variation of this recently. Some example jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377201792940511232
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377186186820325376
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377172439389179904
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377192655808630784
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377218187266887680
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377233268302155776

I don't believe we've had a successful 4.7->4.8 upgrade yet based on those jobs. Using AWS as an example:

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade

Other platforms are generally worse. Is this the same problem as highlighted here?

Comment 21 Sascha Grunert 2021-05-14 07:08:37 UTC
*** Bug 1938353 has been marked as a duplicate of this bug. ***

Comment 22 Ryan Phillips 2021-08-12 19:35:44 UTC

*** This bug has been marked as a duplicate of bug 1952224 ***