Bug 1811801 - /readyz should start reporting failure on shutdown initiation
Summary: /readyz should start reporting failure on shutdown initiation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abu Kashem
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1811202
Reported: 2020-03-09 19:59 UTC by Abu Kashem
Modified: 2020-07-13 17:19 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1811202
Environment:
Last Closed: 2020-07-13 17:19:06 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-apiserver pull 80 0 None closed [release 4.5] Bug 1811801: /readyz should start returning failure on shutdown initiation 2021-02-03 00:58:04 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:19:41 UTC

Description Abu Kashem 2020-03-09 19:59:22 UTC
+++ This bug was initially created as a clone of Bug #1811202 +++

+++ This bug was initially created as a clone of Bug #1811169 +++

Description of problem:

Currently, /readyz starts reporting failure only after ShutdownDelayDuration elapses. The load balancers use /readyz for health checks, so they are not aware of the shutdown initiation until ShutdownDelayDuration has elapsed. This does not give them enough time to detect the shutdown and react to it.

We expect /readyz to start returning failure as soon as apiserver shutdown is initiated (SIGTERM received). This gives the load balancer a window (defined by ShutdownDelayDuration) to detect that /readyz is red and stop sending traffic to this server.


How reproducible:
Always


upstream PR: https://github.com/kubernetes/kubernetes/pull/88911

--- Additional comment from Abu Kashem on 2020-03-09 19:57:50 UTC ---

This is to take the upstream patch https://github.com/kubernetes/kubernetes/pull/88911 into openshift apiserver.

See: https://github.com/openshift/openshift-apiserver/pull/81

Comment 3 Abu Kashem 2020-03-12 19:37:15 UTC
Hi xxia,
You can reproduce this issue as follows:
- Check the /readyz endpoint of the `kube-apiserver`. It should return ok.
- Roll out kube-apiserver, or send a SIGTERM signal to a kube-apiserver Pod directly.
- Check the /readyz endpoint of the `kube-apiserver` again. Without the fix, it returns an error only about 70s after the kill signal is sent.

Expected result:
With the fix in place, you should see /readyz reporting failure immediately after sending the kill signal to the Pod.
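
The reproduction steps above can be scripted. The following is a minimal sketch and is not taken from this bug: `classify_readyz` and `watch_readyz` are hypothetical helper names, and the URL is whatever address the server under test listens on. It assumes that HTTP 200 means ready, that curl's `000` status (no HTTP response received) means the server is unreachable, and that any other status means not ready. It prints a timestamp on each state change so the delay after the kill signal can be read off.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the original report.

# Map an HTTP status code from /readyz to a coarse state.
#   200 -> ready, 000 -> unreachable (no HTTP response), anything else -> not-ready
classify_readyz() {
  case "$1" in
    200) echo "ready" ;;
    000) echo "unreachable" ;;
    *)   echo "not-ready" ;;
  esac
}

# Poll a /readyz URL and print a timestamped line whenever the state changes.
#   $1 = URL of the /readyz endpoint
#   $2 = polling interval in seconds (default: 1)
watch_readyz() {
  local url="$1" interval="${2:-1}" prev="" code state
  while true; do
    # -k skips TLS verification, as in the manual checks in this bug.
    # With -w '%{http_code}', curl prints 000 when it gets no HTTP response.
    code="$(curl -k -s -o /dev/null --max-time 2 -w '%{http_code}' "$url")"
    state="$(classify_readyz "$code")"
    if [ "$state" != "$prev" ]; then
      printf '%s %s (HTTP %s)\n' "$(date -u '+%H:%M:%S')" "$state" "$code"
      prev="$state"
    fi
    sleep "$interval"
  done
}
```

For example, run `watch_readyz https://localhost:6443/readyz` in one terminal, send the signal from another, and compare the time of the first line that is not `ready` with the time the signal was sent.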

Please let me know if you need any help with this.

Thanks!

Comment 4 Stefan Schimanski 2020-03-13 16:24:16 UTC
I have verified that it is fixed in master, and double-checked that it was not fixed 3+ days ago:

- in one terminal:
  - exec into kube-apiserver pod of master 0
  - execute: while true; do curl -k https://localhost:6443/readyz; done
  - printing: okokokokokokokok
- in other terminal:
  - oc debug node/<master-0>
  - chroot /host
  - bash
  - ps aux | grep " kube-apiserver "
  - kill -INT <pid-from-previous-output>
- in first terminal I see:

[+]ping ok
[+]log ok
[+]etcd ok
[+]poststarthook/openshift.io-startkubeinformers ok
[+]poststarthook/openshift.io-StartOAuthInformers ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/quota.openshift.io-clusterquotamapping ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-discovery-available ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/apiservice-wait-for-first-sync ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[-]shutdown failed: reason withheld
healthz check failed
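
When the response is a long list of checks like the one above, the failing entries can be picked out mechanically. This is a minimal sketch, assuming only the line format shown in the output above (`[+]` for passing checks, `[-]` for failing ones); `failed_checks` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the original report.

# Read per-check /readyz output on stdin and print the name of each failing check.
# Failing lines start with "[-]"; the check name runs up to the first space.
failed_checks() {
  sed -n 's/^\[-]\([^ ]*\).*/\1/p'
}
```

For example, `curl -k -s https://localhost:6443/readyz | failed_checks` would print `shutdown` for the output shown above, and nothing while the endpoint returns a plain `ok`.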

Comment 5 Xingxing Xia 2020-03-16 02:21:09 UTC
Above verifies kube-apiserver. Will verify this for openshift-apiserver later

Comment 6 Ke Wang 2020-03-16 06:05:01 UTC
Verified with OCP build 4.5.0-0.nightly-2020-03-15-152626; details below.

- in one terminal:
  - exec into openshift-apiserver pod of master 0
    $ oc get pods -n openshift-apiserver -o wide
    $ oc rsh -n openshift-apiserver <openshift-apiserver pod name>
  - execute: while true; do curl -k https://localhost:8443/readyz; done
    sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
    okokokokokok ...

- in other terminal:
  - oc debug node/<master-0>
  - chroot /host
  - bash
  - ps aux | grep "openshift-apiserver start"
  - kill -INT <pid-from-previous-output>
- in first terminal we can see:

sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137

The /readyz endpoint should start returning failure as soon as openshift-apiserver shutdown is initiated, but I am not sure whether this output is as expected. Can anyone confirm?

Comment 7 Xingxing Xia 2020-03-16 07:32:42 UTC
(In reply to Ke Wang from comment #6)
> Verified with OCP build 4.5.0-0.nightly-2020-03-15-152626, detail see below,
...
> sh-4.2# while true; do curl -k https://localhost:8443/readyz; done
> okokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokokcommand terminated with exit code 137
> but I am not sure if the output as expected
Checked in a 4.5.0-0.nightly-2020-03-15-220309 environment, as below.

$ oc -n openshift-apiserver get po -o wide # get pod IP
apiserver-675d9fc545-76gh5   1/1     Running   7          4h21m   10.128.0.25   xxia03-x4l9k-m-0.c.openshift-qe.internal   <none>
...

In one terminal, enter the master:
$ oc debug no/xxia03-x4l9k-m-0.c.openshift-qe.internal
sh-4.2# chroot /host
sh-4.4# while true; do curl -k --silent --show-error https://10.128.0.25:8443/readyz ; done |& tee /tmp/xxia.log # curl pod IP

In another terminal,
$ oc rsh xxia03-x4l9k-m-0copenshift-qeinternal-debug
sh-4.2# chroot /host
sh-4.4# ps aux | grep "openshift-apiserver start"
root      582675 10.6  1.3 765608 200216 ?       Ssl  07:17   0:07 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2
sh-4.4# kill -INT 582675

In the first terminal, immediately after the kill above, the output shows:
okokokokokokokokokokokokokokokokokokokokokokokok 
curl: (7) Failed to connect to 10.128.0.25 port 8443: Connection refused
curl: (7) Failed to connect to 10.128.0.25 port 8443: Connection refused
...

I think this meets "We expect /readyz to start returning failure as soon as apiserver shutdown is initiated...detect that /readyz is red"
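
The polling output in this bug takes two different shapes after the kill: a `[-]shutdown failed` response from /readyz (comment 4), and curl's `Connection refused` with no failing response visible before it (the output quoted in this comment). To tell the two apart in a captured log such as the one written by `tee` above, something like the following could be used. It is a minimal sketch under that assumption; `summarize_readyz_log` and its result strings are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of the original report.

# Inspect a captured polling log ($1) and report what kind of failure it contains.
#   readyz-reported-shutdown : a "[-]shutdown failed" line was returned by /readyz
#   connection-refused-only  : only curl connection errors were seen
#   no-failure-seen          : neither pattern appears in the log
summarize_readyz_log() {
  if grep -qF -- '[-]shutdown failed' "$1"; then
    echo "readyz-reported-shutdown"
  elif grep -qF -- 'Connection refused' "$1"; then
    echo "connection-refused-only"
  else
    echo "no-failure-seen"
  fi
}
```

For example, `summarize_readyz_log /tmp/xxia.log` on the log captured above would indicate whether a failing /readyz response was recorded before the connection errors began.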

Comment 9 errata-xmlrpc 2020-07-13 17:19:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

