Bug 1949956 - kaso: add minreadyseconds to ensure we don't have an LB outage on kas
Summary: kaso: add minreadyseconds to ensure we don't have an LB outage on kas
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-15 13:26 UTC by Lukasz Szaszkiewicz
Modified: 2021-07-27 23:01 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:00:58 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 1091 0 None open add minreadyseconds to ensure we don't have an LB outage on kas 2021-04-15 13:28:26 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:01:12 UTC

Description Lukasz Szaszkiewicz 2021-04-15 13:26:48 UTC
// minReadySeconds is the time to wait between an operand becoming ready (all containers ready)
// and the start of the rollout onto the next node. This avoids a problem with an external load balancer that looks like:
// 1. For some reason we have only two instances running; maybe a liveness check blipped on one node. It doesn't matter why.
// 2. We bring down the instance on m0 to start a new revision.
// 3. At this point we have one instance running, on m1.
// 4. m0 starts up and goes ready, but the LB ready check just timed out and waits X seconds before checking again, so the LB still sees m0 as down.
// 5. We bring down the instance on m1 to start the new revision.
// 6. The LB thinks all backends are down and routes randomly.
// 7. No profit.
// Setting this field to 30s can prevent a kube-apiserver rollout from triggering the above flow on AWS.
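
The wait described above can be sketched as a small gating check. This is only an illustration in Python, not the operator's code: the function name `may_start_next_rollout` and its parameters are hypothetical, and the only value taken from this report is the 30s setting.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_READY_SECONDS = 30  # the value proposed in the description


def may_start_next_rollout(ready_since: Optional[datetime],
                           now: datetime,
                           min_ready_seconds: int = MIN_READY_SECONDS) -> bool:
    """Return True once the operand has been ready for at least min_ready_seconds.

    ready_since is the time the operand last became ready (all containers
    ready), or None if it is not ready at the moment. Hypothetical helper.
    """
    if ready_since is None:
        return False
    return now - ready_since >= timedelta(seconds=min_ready_seconds)


if __name__ == "__main__":
    ready = datetime(2021, 5, 24, 13, 2, 28)
    # 12s after becoming ready: the LB may not have seen the backend yet.
    print(may_start_next_rollout(ready, datetime(2021, 5, 24, 13, 2, 40)))
    # 56s after becoming ready: safe to move on to the next node.
    print(may_start_next_rollout(ready, datetime(2021, 5, 24, 13, 3, 24)))
```

With such a check in place, step 5 of the flow above cannot happen until the LB has had at least 30s to notice that m0 is back.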

Comment 2 Xingxing Xia 2021-05-24 14:20:13 UTC
Tested in 4.8.0-0.nightly-2021-05-21-233425 env:
$ oc get po -l apiserver --no-headers -L revision
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal   5/5   Running   0     24m   10
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal   5/5   Running   0     32m   10
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal   5/5   Running   0     39m   10

$ N=10

$ INSTALLER_PODS=`oc get po -l app=installer --no-headers -o name | grep "installer-$N" | grep -o '[^/]*$'`

$ KAS_PODS=`oc get po -l apiserver --no-headers -o name | grep -o '[^/]*$'`

Record the Ready transition timestamp of each kube-apiserver pod:
$ oc get po $KAS_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' > pods

Append the creation timestamp of each installer pod for revision $N:
$ oc get po $INSTALLER_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' >> pods

$ sort -k 2 pods
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z # created 56s after the previous kube-apiserver pod became Ready (13:02:28Z), i.e. more than 30s
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z # created 52s after the previous kube-apiserver pod became Ready (13:09:44Z), i.e. more than 30s
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z
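
The manual comparison above could also be done with a short script. The following is a rough sketch, assuming the `sort -k 2 pods` output format shown above (pod name, then an RFC 3339 UTC timestamp, optionally followed by a `#` comment); the function name `rollout_gaps` is made up for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

SAMPLE = """\
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z
"""


def rollout_gaps(lines):
    """Return (apiserver_pod, next_installer_pod, gap_seconds) tuples.

    A gap is the time between a kube-apiserver pod becoming Ready and the
    creation of the installer pod that follows it in timestamp order.
    """
    rows = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing annotations
        if not line:
            continue
        name, stamp = line.split()[:2]
        rows.append((name, datetime.strptime(stamp, FMT)))
    rows.sort(key=lambda row: row[1])

    gaps = []
    for (prev_name, prev_ts), (name, ts) in zip(rows, rows[1:]):
        if prev_name.startswith("kube-apiserver-") and name.startswith("installer-"):
            gaps.append((prev_name, name, int((ts - prev_ts).total_seconds())))
    return gaps


if __name__ == "__main__":
    for apiserver, installer, gap in rollout_gaps(SAMPLE.splitlines()):
        print(f"{installer}: {gap}s after {apiserver} became Ready, ok={gap >= 30}")
```

For the data in this comment the gaps are 56s and 52s, both above the 30s threshold.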

Comment 5 errata-xmlrpc 2021-07-27 23:00:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

