Bug 1949956

Summary: kaso: add minreadyseconds to ensure we don't have an LB outage on kas
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.8
Target Release: 4.8.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Lukasz Szaszkiewicz <lszaszki>
Assignee: Lukasz Szaszkiewicz <lszaszki>
QA Contact: Xingxing Xia <xxia>
CC: aos-bugs, kewang, mfojtik, xxia
Last Closed: 2021-07-27 23:00:58 UTC

Description Lukasz Szaszkiewicz 2021-04-15 13:26:48 UTC

// minReadySeconds is the time to wait between an operand becoming ready (all containers ready)
// and the start of the rollout onto the next node. This avoids a problem with an external load balancer that looks like this:
// 1. For some reason we have only two instances; maybe a liveness check blipped on one node. It doesn't matter why.
// 2. We bring down the instance on m0 to start a new revision.
// 3. At this point we have one instance running, on m1.
// 4. m0 starts up and goes ready, but the LB ready check has just timed out and is waiting for X seconds before it checks again, so the LB still considers m0 down.
// 5. We bring down the instance on m1 to start the new revision.
// 6. The LB thinks all backends are down and routes randomly.
// 7. No profit.
// Setting this field to 30s can prevent the kube-apiserver from triggering the above flow on AWS.
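
The gating rule described in the comment can be illustrated with a few lines of shell. This is only a sketch of the idea, not the operator's actual implementation: the function name may_start_next_rollout and its arguments are hypothetical, and all times are assumed to be plain epoch seconds.

```shell
# may_start_next_rollout READY_SINCE NOW [MIN_READY_SECONDS]
# Hypothetical helper: succeeds (exit 0) only if the operand has been ready
# for at least MIN_READY_SECONDS (default 30). All times are epoch seconds.
may_start_next_rollout() {
  ready_since=$1
  now=$2
  min_ready=${3:-30}
  # The rollout onto the next node is allowed only after the wait has elapsed.
  [ $(( now - ready_since )) -ge "$min_ready" ]
}
```

For example, `may_start_next_rollout 1000 1010 || echo "keep waiting"` reports that the next node must not be touched yet, because the operand has been ready for only 10 seconds.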

Comment 2 Xingxing Xia 2021-05-24 14:20:13 UTC

Tested in 4.8.0-0.nightly-2021-05-21-233425 env:
$ oc get po -l apiserver --no-headers -L revision
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal   5/5   Running   0     24m   10
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal   5/5   Running   0     32m   10
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal   5/5   Running   0     39m   10

$ N=10

$ INSTALLER_PODS=$(oc get po -l app=installer --no-headers -o name | grep "installer-$N" | grep -o '[^/]*$')

$ KAS_PODS=$(oc get po -l apiserver --no-headers -o name | grep -o '[^/]*$')

Record the time at which each kube-apiserver pod last became Ready:
$ oc get po $KAS_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' > pods

Append the creation timestamp of each revision-$N installer pod:
$ oc get po $INSTALLER_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' >> pods

$ sort -k 2 pods
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z # created 56s after the previous kube-apiserver pod became Ready, i.e. more than 30s
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z # created 52s after the previous kube-apiserver pod became Ready, i.e. more than 30s
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z
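
The gaps can also be computed mechanically instead of by eye. A minimal sketch, assuming GNU date (for `date -d`) and a file of "name timestamp" lines such as the `pods` file written above; the function name check_gaps and its output format are illustrative and not part of any OpenShift tooling.

```shell
# check_gaps FILE [MIN_SECONDS]
# Sorts "name timestamp" lines by timestamp and, for every installer-* line
# that directly follows a kube-apiserver-* line, prints the gap in seconds.
# Exits non-zero if any gap is shorter than MIN_SECONDS (default 30).
# Assumes GNU date and UTC timestamps in the form printed by oc.
check_gaps() {
  sort -k 2 "$1" | (
    min=${2:-30} rc=0 prev_name="" prev_ts=""
    while read -r name ts _rest; do
      case "$prev_name:$name" in
        kube-apiserver-*:installer-*)
          gap=$(( $(date -u -d "$ts" +%s) - $(date -u -d "$prev_ts" +%s) ))
          if [ "$gap" -ge "$min" ]; then
            echo "$name ${gap}s ok"
          else
            echo "$name ${gap}s too-short"
            rc=1
          fi
          ;;
      esac
      prev_name=$name prev_ts=$ts
    done
    exit "$rc"
  )
}
```

Running `check_gaps pods 30` against the data above would be expected to report gaps of 56s and 52s for the second and third installer pods. The first installer pod is skipped because no kube-apiserver line precedes it.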

Comment 5 errata-xmlrpc 2021-07-27 23:00:58 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438