Bug 1932097 - Apiserver liveness probe is marking it as unhealthy during normal shutdown
Summary: Apiserver liveness probe is marking it as unhealthy during normal shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Elana Hashman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1928946 1952224
Blocks: 1996846
 
Reported: 2021-02-23 22:43 UTC by Clayton Coleman
Modified: 2021-09-16 01:07 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: While a container is terminating, the liveness probe continues to run against it. Consequence: The liveness probe may fail during container shutdown, since the container is no longer running healthily as expected, causing the kubelet to kill the container during a normal shutdown. Fix: Stop probing containers during shutdown. Result: Liveness probes no longer fail while a container is terminating, so they no longer trigger a premature shutdown (see the sketch below the metadata fields).
Clone Of:
Environment:
Last Closed: 2021-07-27 22:47:39 UTC
Target Upstream Version:
Embargoed:
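
A minimal sketch of the idea behind the fix tracked in the pull requests below (not the actual kubelet code; the real prober worker operates on *v1.Pod, distinguishes liveness/readiness/startup probes, and records results in a cache - here a local Pod struct stands in for all of that):

package main

import (
    "fmt"
    "time"
)

// Minimal stand-in for the Kubernetes pod object; in the kubelet the
// deletion timestamp is set on the pod once graceful deletion begins.
type Pod struct {
    Name              string
    DeletionTimestamp *time.Time
}

// shouldProbe sketches the core of "stop probing a pod during graceful
// shutdown": once a pod is marked for deletion, liveness/startup probes
// are skipped so a failing probe cannot trigger a second, premature kill
// while the container is already shutting down cleanly.
func shouldProbe(pod *Pod) bool {
    return pod.DeletionTimestamp == nil
}

func main() {
    now := time.Now()
    running := &Pod{Name: "kube-apiserver"}
    terminating := &Pod{Name: "kube-apiserver", DeletionTimestamp: &now}

    fmt.Println(shouldProbe(running))     // true: keep probing the healthy pod
    fmt.Println(shouldProbe(terminating)) // false: skip probes during shutdown
}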




Links
Github kubernetes/kubernetes pull 98571 (closed): Stop probing a pod during graceful shutdown - last updated 2021-03-03 05:03:44 UTC
Github openshift/kubernetes pull 594 (closed): Bug 1932097: UPSTREAM: 98571: kubelet: Stop probing a pod during graceful shutdown - last updated 2021-03-03 05:03:43 UTC
Red Hat Product Errata RHSA-2021:2438 - last updated 2021-07-27 22:48:02 UTC

Description Clayton Coleman 2021-02-23 22:43:15 UTC
https://testgrid.k8s.io/redhat-openshift-informing#release-openshift-ocp-installer-e2e-aws-upgrade-4.8&include-filter-by-regex=Kubernetes%20APIs

Seeing about a 30% fail rate of 

[sig-api-machinery] Kubernetes APIs remain available for new connections

Feb 23 20:26:42.124: API "kubernetes-api-available-new-connections" was unreachable during disruption for at least 2s of 34m45s (0%):

Feb 23 20:18:03.262 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-ivyvzgrr-0b477.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 3.19.2.121:6443: connect: connection refused
Feb 23 20:18:04.212 - 1s    E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests
Feb 23 20:18:05.318 I kube-apiserver-new-connection kube-apiserver-new-connection started responding to GET requests

Deeper detail from the node log shows that one of the instances finishes shutting down and stops listening right as we get this error.

Feb 23 20:18:02.505 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-203-7.us-east-2.compute.internal node/ip-10-0-203-7 reason/TerminationMinimalShutdownDurationFinished The minimal shutdown duration of 1m10s finished
Feb 23 20:18:02.509 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-203-7.us-east-2.compute.internal node/ip-10-0-203-7 reason/TerminationStoppedServing Server has stopped listening
Feb 23 20:18:03.148 I ns/openshift-console-operator deployment/console-operator reason/OperatorStatusChanged Status for clusteroperator/console changed: Degraded message changed from "CustomRouteSyncDegraded: the server is currently unable to handle the request (delete routes.route.openshift.io console-custom)\nSyncLoopRefreshDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)" to "SyncLoopRefreshDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)" (2 times)
Feb 23 20:18:03.880 E kube-apiserver-reused-connection kube-apiserver-reused-connection started failing: Get "https://api.ci-op-ivyvzgrr-0b477.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 3.21.250.132:6443: connect: connection refused

This kind of looks like the load balancer didn't remove the kube-apiserver and kept sending traffic, and the connection didn't cleanly shut down - did something regress in the apiserver's connection handling?

Setting to high because we had *just* got green when this regressed.

Comment 1 Ryan Phillips 2021-02-24 18:13:51 UTC
This has been green for 5 test runs in a row... Let's see if it clears up.

Comment 2 Clayton Coleman 2021-02-24 20:00:23 UTC
Also happening in regular runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25918/pull-ci-openshift-origin-master-e2e-aws-csi/1364310688415092736 yesterday at 2pm EST.

   pod "pod-7961e5d1-283d-4e09-816f-995a9ffbfba0" was not deleted: Get "https://api.ci-op-7cbd82h9-2550a.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-fsgroupchangepolicy-7286/pods/pod-7961e5d1-283d-4e09-816f-995a9ffbfba0": dial tcp 3.209.36.244:6443: connect: connection refused
occurred

This error should never happen.

Comment 3 Clayton Coleman 2021-02-24 20:01:59 UTC
Happened 4 minutes ago:

https://search.ci.openshift.org/?search=6443%3A+connect%3A+connection+refused&maxAge=48h&context=1&type=junit&name=.*-aws-.*&maxMatches=1&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sdn/263/pull-ci-openshift-sdn-master-e2e-aws-upgrade/1364633756488437760

Feb 24 18:43:59.292 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-hhxdyjmp-550b5.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 54.200.130.251:6443: connect: connection refused
Feb 24 18:44:00.153 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

Comment 5 Elana Hashman 2021-02-25 00:42:03 UTC
OpenShift backport pending upstream merge.

Comment 8 Elana Hashman 2021-03-11 01:22:06 UTC
@Sunil I think this is a multi-part issue, and this was just one fix for it.

Comment 9 Elana Hashman 2021-03-11 01:48:36 UTC
Sorry, hit send too soon---

Adding a "Depends On" bug where we have been continuing to track down the root of the issue.

For this bug, what we found and what I submitted a fix for (hence, what should be verified) is that the apiserver, while gracefully terminating, is not killed due to liveness probe failures while it is in the process of deletion. Representative log entries are in https://bugzilla.redhat.com/show_bug.cgi?id=1932097#c4

This can be verified by looking at the node journal.
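
A rough illustration of that journal check (an assumption about the node environment, not a documented procedure: it assumes the kubelet runs as the systemd unit kubelet.service, as on RHCOS nodes, and that probe failures are logged with wording containing "probe" and "fail", which varies by kubelet version):

package main

import (
    "bufio"
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    // Dump the kubelet journal on the node (assumes the kubelet.service unit).
    out, err := exec.Command("journalctl", "-u", "kubelet.service", "--no-pager").Output()
    if err != nil {
        fmt.Println("journalctl failed:", err)
        return
    }

    // Surface any probe-failure lines that mention kube-apiserver; with the
    // fix in place, none should appear while the apiserver is gracefully
    // terminating. (Exact log wording differs across kubelet versions.)
    sc := bufio.NewScanner(strings.NewReader(string(out)))
    for sc.Scan() {
        l := strings.ToLower(sc.Text())
        if strings.Contains(l, "kube-apiserver") && strings.Contains(l, "probe") && strings.Contains(l, "fail") {
            fmt.Println(sc.Text())
        }
    }
}

(Simply reading the journalctl output for the kubelet unit around the termination window accomplishes the same thing.)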

Comment 10 Sunil Choudhary 2021-03-17 13:56:17 UTC
Thanks @Elana, I checked the journal logs while the apiserver was gracefully terminating and do not see it being killed.

Comment 13 errata-xmlrpc 2021-07-27 22:47:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

