Bug 1811801
| Summary: | /readyz should start reporting failure on shutdown initiation | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Abu Kashem <akashem> |
| Component: | openshift-apiserver | Assignee: | Abu Kashem <akashem> |
| Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | aos-bugs, kewang, mfojtik, sttts, xxia |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1811202 | Environment: | |
| Last Closed: | 2020-07-13 17:19:06 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1811202 | | |
Description
Abu Kashem
2020-03-09 19:59:22 UTC
Hi xxia, you can reproduce this issue as follows:

- Check the /readyz endpoint of the `kube-apiserver`; it should return ok.
- Roll out kube-apiserver, or send a SIGTERM signal to a kube-apiserver Pod directly.
- Check the /readyz endpoint of the `kube-apiserver` again; it will start returning an error approximately 70s after the kill signal is sent.

Expected result: with the fix in place, you should see /readyz reporting failure immediately after sending the kill signal to the Pod.

Please let me know if you need any help with this. Thanks!

I have verified that it is fixed in master, and double-checked that it was not yet fixed 3+ days ago:

- In one terminal:
  - exec into the kube-apiserver pod of master 0
  - execute: while true; do curl -k https://localhost:6443/readyz; done
  - printing: okokokokokokokok
- In the other terminal:
  - oc debug node/<master-0>
  - chroot /host
  - bash
  - ps aux | grep " kube-apiserver "
  - kill -INT <pid-from-previous-output>
- In the first terminal I see:

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/openshift.io-StartOAuthInformers ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[-]shutdown failed: reason withheld
healthz check failed

The above verifies kube-apiserver. Will verify this for openshift-apiserver later.

Verified with OCP build 4.5.0-0.nightly-2020-03-15-152626; details below.

- In one terminal:
  - exec into the openshift-apiserver pod of master 0:
    $ oc get pods -n openshift-apiserver -o wide
    $ oc rsh -n openshift-apiserver <openshift-apiserver pod name>
  - execute: while true; do curl -k https://localhost:8443/readyz; done

    sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
    okokokokokok ...

- In the other terminal:
  - oc debug node/<master-0>
  - chroot /host
  - bash
  - ps aux | grep "openshift-apiserver start"
  - kill -INT <pid-from-previous-output>
- In the first terminal we can see:

    sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
    okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137

The /readyz endpoint starts returning failure as soon as openshift-apiserver shutdown is initiated, but I am not sure the output is as expected. Can anyone confirm this?
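The manual polling used in the verifications above can be condensed into a small helper. The following is only a rough sketch under the same assumptions as the steps above: the script name, argument handling, and defaults are illustrative and not taken from the bug report. The namespace, pod, and port must be filled in for the target apiserver (6443 for kube-apiserver, 8443 for openshift-apiserver), and the shutdown is still triggered manually from a node debug shell with kill -INT.

```bash
#!/bin/bash
# Rough sketch (not from the bug report): poll /readyz inside an apiserver pod
# until it stops returning "ok", then report how long the flip took.
# Hypothetical usage: ./readyz-watch.sh <namespace> <pod> [port]
NAMESPACE=${1:?namespace, e.g. openshift-apiserver}
POD=${2:?apiserver pod name from "oc get pods -n <namespace>"}
PORT=${3:-8443}   # 8443 for openshift-apiserver, 6443 for kube-apiserver

start=$(date +%s)
while true; do
  out=$(oc rsh -n "$NAMESPACE" "$POD" \
        curl -k --silent --max-time 2 "https://localhost:${PORT}/readyz" 2>&1)
  if [ "$out" != "ok" ]; then
    echo "/readyz stopped returning ok after $(( $(date +%s) - start ))s:"
    echo "$out"
    break
  fi
  sleep 1
done
# In a second terminal, initiate shutdown as in the comments above
# (oc debug node/<master>, chroot /host, kill -INT <apiserver pid>).
# With the fix, the loop should break almost immediately after the signal;
# without it, only after roughly 70s.
```

The elapsed time printed is only meaningful if the kill signal is sent shortly after the loop starts, so start the script first and trigger the shutdown second, as in the comments above.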
(In reply to Ke Wang from comment #6)
> Verified with OCP build 4.5.0-0.nightly-2020-03-15-152626; details below.
...
> sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
> okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137
> but I am not sure the output is as expected

In a 4.5.0-0.nightly-2020-03-15-220309 env, I checked the following.

$ oc -n openshift-apiserver get po -o wide   # get the pod IP
apiserver-675d9fc545-76gh5   1/1   Running   7   4h21m   10.128.0.25   xxia03-x4l9k-m-0.c.openshift-qe.internal   <none> ...

In one terminal, enter the master:
$ oc debug no/xxia03-x4l9k-m-0.c.openshift-qe.internal
sh-4.2# chroot /host
sh-4.4# while true; do curl -k --silent --show-error https://10.128.0.25:8443/readyz ; done |& tee /tmp/xxia.log   # curl the pod IP

In another terminal:
$ oc rsh xxia03-x4l9k-m-0copenshift-qeinternal-debug
sh-4.2# chroot /host
sh-4.4# ps aux | grep "openshift-apiserver start"
root      582675 10.6  1.3 765608 200216 ?   Ssl  07:17   0:07 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
sh-4.4# kill -INT 582675

In the first terminal, checking the output after the above kill, I can immediately see:
okokokokokokokokokokokokokokokokokokokokokokokok
curl: (7) Failed to connect to 10.128.0.25 port 8443: Connection refused
curl: (7) Failed to connect to 10.128.0.25 port 8443: Connection refused
...

I think this meets "We expect /readyz to start returning failure as soon as apiserver shutdown is initiated...detect that /readyz is red".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
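For reference, the behavior verified here (/readyz starting to report failure as soon as shutdown is initiated, visible above as the "[-]shutdown failed" check) follows a common pattern: the readiness handler consults a flag that is set the moment the process receives SIGTERM/SIGINT, well before in-flight requests finish draining. The sketch below is only an illustration of that pattern in plain Go; it is not the actual kube-apiserver/openshift-apiserver implementation, which wires an equivalent shutdown check into its generic /readyz machinery.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Closed as soon as a termination signal arrives; the /readyz handler
	// consults it on every request.
	shutdown := make(chan struct{})

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigCh
		close(shutdown) // shutdown initiated: flip /readyz to failing right away
	}()

	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-shutdown:
			// Report failure immediately so load balancers and the kubelet stop
			// routing new traffic, even though the process may keep draining
			// in-flight requests for a while before it actually exits.
			http.Error(w, "shutdown in progress", http.StatusInternalServerError)
		default:
			fmt.Fprint(w, "ok")
		}
	})

	// A real apiserver would serve TLS, drain, and exit gracefully; this
	// sketch only demonstrates the readiness flip.
	_ = http.ListenAndServe(":8443", nil)
}
```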