Bug 1811202 - /readyz should start reporting failure on shutdown initiation
Summary: /readyz should start reporting failure on shutdown initiation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Abu Kashem
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On: 1811801
Blocks: 1821500 1821502 1821503
Reported: 2020-03-06 20:15 UTC by Abu Kashem
Modified: 2020-05-04 11:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1811169
Clones: 1811801
Environment:
Last Closed: 2020-05-04 11:45:36 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-apiserver pull 81 | 0 | None | closed | [release 4.4] Bug 1811202: /readyz should start returning failure on shutdown initiation | 2020-10-05 17:12:41 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:46:11 UTC

Description Abu Kashem 2020-03-06 20:15:08 UTC
+++ This bug was initially created as a clone of Bug #1811169 +++

Description of problem:

Currently, /readyz starts reporting failure only after ShutdownDelayDuration elapses. The load balancers use /readyz as their health check, so they are not aware of the shutdown initiation until ShutdownDelayDuration has elapsed. This does not give them enough time to detect the shutdown and react to it.

We expect /readyz to start returning failure as soon as apiserver shutdown is initiated (SIGTERM received). This gives the load balancer a window (defined by ShutdownDelayDuration) to detect that /readyz is red and stop sending traffic to this server.
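
A toy timeline model of the difference described above, written as a shell function. It is only a sketch and not apiserver code: the function name readyz_at, its arguments, and the 70-second delay used in the usage note are assumptions chosen for illustration.

```shell
#!/bin/sh
# Toy model (hypothetical, not apiserver code): print what /readyz would
# report at time $2, given SIGTERM at time $3 and a shutdown delay of $4
# seconds.
#   mode "old": /readyz turns red only after the delay has elapsed.
#   mode "new": /readyz turns red as soon as shutdown is initiated.
readyz_at() {
    mode=$1
    now=$2
    sigterm_at=$3
    delay=$4
    case $mode in
        old) red_from=$((sigterm_at + delay)) ;;
        new) red_from=$sigterm_at ;;
        *)   echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    if [ "$now" -ge "$red_from" ]; then
        echo failing
    else
        echo ok
    fi
}
```

With SIGTERM at t=10 and an assumed delay of 70 seconds, `readyz_at old 40 10 70` prints `ok` while `readyz_at new 40 10 70` prints `failing`. Under the old behaviour a health check at t=40 still sees a green server, whereas under the expected behaviour it already sees /readyz red and has the rest of the delay to react.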


How reproducible:
Always


upstream PR: https://github.com/kubernetes/kubernetes/pull/88911

Comment 1 Abu Kashem 2020-03-09 19:57:50 UTC
This bug tracks bringing the upstream patch https://github.com/kubernetes/kubernetes/pull/88911 into openshift-apiserver.

See: https://github.com/openshift/openshift-apiserver/pull/81

Comment 4 Ke Wang 2020-03-19 05:43:44 UTC
Verified on an OCP 4.4.0-0.nightly-2020-03-18-102708 environment; the steps are below.

$  oc -n openshift-apiserver get po -o wide # get pod IP
apiserver-74d496b787-b5s8v   1/1     Running   7          154m   10.....40   ip-10-...-193.us-east-2.compute.internal 
...

In one terminal, open a debug shell on the master node:
$ oc debug no/ip-10-..-..-193.us-east-2.compute.internal
Starting pod/ip-10-.-..-193us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10...193
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# while true; do curl -k --silent --show-error https://10.....40:8443/readyz ; done |& tee /tmp/ke.log
okokokokokokokokokokokokokokokokokokokokokokokokokokokokokok


In another terminal,
$ oc rsh ip-10-...-193us-east-2computeinternal-debug
sh-4.2# chroot /host
sh-4.4#  ps aux | grep "openshift-apiserver start"
root      325696  2.1  1.1 567144 196100 ?       Ssl  05:26   0:14 openshift-apiserver start --config=/var/run/configmaps/config/config.yaml -v=2

sh-4.4# kill -INT 325696

Back in the first terminal, the output changes immediately after the kill above:
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to 10...40:8443 
curl: (7) Failed to connect to 10.128.0.40 port 8443: Connection refused
curl: (7) Failed to connect to 10.128.0.40 port 8443: Connection refused
...

The /readyz endpoint starts returning failure as soon as openshift-apiserver shutdown is initiated, so a client polling it detects right away that /readyz is red.
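
The polling loop used above can also be written so that each probe is timestamped and labelled with its HTTP status, which makes it easier to tell a non-200 /readyz response from a refused connection. This is a minimal sketch: the function names classify_readyz and probe_readyz, the 2-second timeout, and the log path are assumptions, and the pod IP is left as a placeholder.

```shell
#!/bin/sh
# Map the HTTP status code reported by curl to a readiness state.
# curl prints 000 when no HTTP response was received
# (for example, connection refused).
classify_readyz() {
    case $1 in
        200) echo ready ;;
        000) echo unreachable ;;
        *)   echo not-ready ;;
    esac
}

# Probe /readyz once and print "<UTC time> <http code> <state>".
# $1 is the base URL, e.g. https://<pod IP>:8443 (placeholder).
probe_readyz() {
    code=$(curl -k --silent --output /dev/null --max-time 2 \
        --write-out '%{http_code}' "$1/readyz")
    printf '%s %s %s\n' "$(date -u +%H:%M:%S)" "$code" "$(classify_readyz "$code")"
}

# Usage (runs until interrupted):
#   while true; do probe_readyz "https://<pod IP>:8443"; sleep 1; done | tee /tmp/readyz.log
```

Each line of the log then carries the time of the probe, so the moment /readyz stops reporting ready can be compared with the moment the kill was issued.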

Comment 6 errata-xmlrpc 2020-05-04 11:45:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

