Bug 1932097
Summary: | Apiserver liveness probe is marking it as unhealthy during normal shutdown | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
Component: | Node | Assignee: | Elana Hashman <ehashman> |
Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | urgent | CC: | aos-bugs, mfojtik, wking, xxia |
Version: | 4.8 | ||
Target Milestone: | --- | ||
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: While a container is terminating, the liveness probe continues to probe it.
Consequence: The liveness probe may fail during shutdown because the container is no longer running healthily, which causes the kubelet to kill the container during a normal shutdown.
Fix: Stop probing containers during shutdown.
Result: Liveness probes no longer run while a container is terminating, so probe failures no longer cause a premature shutdown.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:47:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1928946, 1952224 | ||
Bug Blocks: | 1996846 |
Description
Clayton Coleman
2021-02-23 22:43:15 UTC
This has been green for 5 test runs in a row... Let's see if it clears up.

Also happening in regular runs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25918/pull-ci-openshift-origin-master-e2e-aws-csi/1364310688415092736 yesterday at 2pm EST.

pod "pod-7961e5d1-283d-4e09-816f-995a9ffbfba0" was not deleted: Get "https://api.ci-op-7cbd82h9-2550a.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-fsgroupchangepolicy-7286/pods/pod-7961e5d1-283d-4e09-816f-995a9ffbfba0": dial tcp 3.209.36.244:6443: connect: connection refused occurred

This error should never happen.

Happened 4 minutes ago:
https://search.ci.openshift.org/?search=6443%3A+connect%3A+connection+refused&maxAge=48h&context=1&type=junit&name=.*-aws-.*&maxMatches=1&maxBytes=20971520&groupBy=job
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_sdn/263/pull-ci-openshift-sdn-master-e2e-aws-upgrade/1364633756488437760

Feb 24 18:43:59.292 E kube-apiserver-new-connection kube-apiserver-new-connection started failing: Get "https://api.ci-op-hhxdyjmp-550b5.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default": dial tcp 54.200.130.251:6443: connect: connection refused
Feb 24 18:44:00.153 E kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests

OpenShift backport pending upstream merge.

Looks like this is still an issue, as I still see connection refused failure instances.
https://search.ci.openshift.org/?search=6443%3A+connect%3A+connection+refused&maxAge=48h&context=1&type=junit&name=.*-aws-.*&maxMatches=1&maxBytes=20971520&groupBy=job
https://search.ci.openshift.org/?search=kubelet+terminates+kube-apiserver+gracefully&maxAge=48h&context=1&type=junit&name=.*-aws-.*&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=job

@Sunil I think this is a multi-part issue of which this was just one fix.

Sorry, hit send too soon. Adding a "depends on" for the bug where we have been continuing to track down the root of the issue.
For this bug, what we found, and what I submitted a fix for, is that the apiserver was being killed because of liveness probe failures while it was gracefully terminating. Hence, what should be verified is that while the apiserver is gracefully terminating, it is not killed during deletion because of liveness probe failures. Representative log entries are in https://bugzilla.redhat.com/show_bug.cgi?id=1932097#c4
This can be verified by looking at the node journal.

Thanks @Elana, I checked the journal logs while the apiserver was gracefully terminating and do not see it being killed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438
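
For readers who want to see the behavior in the Doc Text (liveness probes failing during a normal shutdown, and the fix of not probing a terminating container) in miniature, here is a small Python simulation. It is only an illustrative sketch, not the kubelet's code: `Container`, `ProbeWorker`, `simulate`, the `skip_when_terminating` flag, and the failure threshold of 3 are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Container:
    """Hypothetical stand-in for a container tracked by a node agent."""
    name: str
    terminating: bool = False      # set once graceful shutdown has begun
    killed_by_probe: bool = False  # set if the probe loop kills the container


@dataclass
class ProbeWorker:
    """Hypothetical liveness-probe loop (not the kubelet's real types)."""
    container: Container
    probe: Callable[[], bool]
    failure_threshold: int = 3          # assumed value for the example
    skip_when_terminating: bool = True  # False reproduces the reported bug
    failures: int = 0

    def tick(self) -> str:
        """Run one probe period and report what happened."""
        # The fix: do not probe a container that is already shutting down.
        if self.skip_when_terminating and self.container.terminating:
            return "skipped"
        if self.probe():
            self.failures = 0
            return "success"
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.container.killed_by_probe = True
            return "killed"
        return "failure"


def simulate(skip_when_terminating: bool) -> List[str]:
    """One healthy tick, then three ticks during graceful shutdown."""
    container = Container("apiserver")
    # A terminating server stops answering its health endpoint.
    worker = ProbeWorker(
        container,
        probe=lambda: not container.terminating,
        skip_when_terminating=skip_when_terminating,
    )
    results = [worker.tick()]
    container.terminating = True
    results += [worker.tick() for _ in range(3)]
    return results


if __name__ == "__main__":
    print("without fix:", simulate(skip_when_terminating=False))
    print("with fix:   ", simulate(skip_when_terminating=True))
```

Without the skip, the third failed probe during shutdown kills the container before it finishes terminating gracefully; with the skip, every probe period during shutdown is a no-op.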