1949956 – kaso: add minreadyseconds to ensure we don't have an LB outage on kas

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	kube-apiserver
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Lukasz Szaszkiewicz
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-15 13:26 UTC by Lukasz Szaszkiewicz
Modified:	2021-07-27 23:01 UTC
CC List:	4 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 23:00:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-kube-apiserver-operator pull 1091	0	None	open	add minreadyseconds to ensure we don't have an LB outage on kas	2021-04-15 13:28:26 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 23:01:12 UTC

Description Lukasz Szaszkiewicz 2021-04-15 13:26:48 UTC

// minReadySeconds is the time to wait between an operand becoming ready (all containers ready)
// and the start of the rollout onto the next node. This avoids the following problem with an external load balancer:
// 1. For some reason we have only two instances; maybe a liveness check blipped on one node. It doesn't matter why.
// 2. We bring down the instance on m0 to start a new revision.
// 3. At this point we have one instance running, on m1.
// 4. m0 starts up and goes ready, but the LB ready check against it has just timed out, so the LB waits X seconds before checking again and still treats m0 as down.
// 5. We bring down the instance on m1 to start the new revision.
// 6. The LB thinks all backends are down and routes randomly.
// 7. No profit.
// Setting this field to 30s can prevent the kube-apiserver from triggering the above flow on AWS.
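
The timing argument in the comment above can be sketched as a small model. This is only an illustration under simplifying assumptions: the function name lb_blind_window and the parameter lb_recheck_delay (the "X seconds" from step 4) are hypothetical, the 20s value is made up rather than a measured AWS figure, and the model assumes the operator takes m1 down exactly when the wait expires.

```python
from typing import Optional, Tuple


def lb_blind_window(min_ready_seconds: float,
                    lb_recheck_delay: float) -> Optional[Tuple[float, float]]:
    """Model steps 4-6 of the flow, with time 0 = the moment m0 becomes ready.

    Returns the (start, end) interval in seconds during which the LB sees no
    healthy backend, or None if there is no such interval.
    Both parameters are illustrative assumptions, not operator or AWS settings.
    """
    # The operator brings m1 down once it has waited min_ready_seconds.
    m1_down_at = min_ready_seconds
    # The LB only notices that m0 is healthy after its re-check delay.
    m0_seen_healthy_at = lb_recheck_delay
    if m1_down_at >= m0_seen_healthy_at:
        # m0 is already back in rotation before m1 goes away.
        return None
    return (m1_down_at, m0_seen_healthy_at)


if __name__ == "__main__":
    # No wait: m1 goes down while the LB still believes m0 is down.
    print(lb_blind_window(min_ready_seconds=0, lb_recheck_delay=20))
    # 30s wait: the LB has re-checked m0 before m1 is taken down.
    print(lb_blind_window(min_ready_seconds=30, lb_recheck_delay=20))
```

In this model the outage disappears whenever the wait is at least as long as the LB's re-check delay, which is the reasoning behind choosing a value such as 30s.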

Comment 2 Xingxing Xia 2021-05-24 14:20:13 UTC

Tested in 4.8.0-0.nightly-2021-05-21-233425 env:
$ oc get po -l apiserver --no-headers -L revision
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal   5/5   Running   0     24m   10
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal   5/5   Running   0     32m   10
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal   5/5   Running   0     39m   10

$ N=10

$ INSTALLER_PODS=`oc get po -l app=installer --no-headers -o name | grep "installer-$N" | grep -o '[^/]*$'`

$ KAS_PODS=`oc get po -l apiserver --no-headers -o name | grep -o '[^/]*$'`

Record the Ready timestamp of each kube-apiserver pod:
$ oc get po $KAS_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' > pods

Append the creation timestamp of each installer pod:
$ oc get po $INSTALLER_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' >> pods

$ sort -k 2 pods
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z # created 56s after the previous kube-apiserver pod became Ready, i.e. more than 30s
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z # created 52s after the previous kube-apiserver pod became Ready, i.e. more than 30s
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z
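
The manual comparison above could also be automated. The following is a minimal sketch, assuming the input lines have the "name timestamp" shape written to the pods file; the helper names parse_line and rollout_gaps are hypothetical, and the sample data is copied from the sorted output above.

```python
from datetime import datetime

MIN_READY_SECONDS = 30  # the value mentioned in the bug description

# Lines copied from the `sort -k 2 pods` output above.
SAMPLE = """\
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z
""".splitlines()


def parse_line(line):
    """Split 'name timestamp [trailing text]' into (name, datetime)."""
    name, stamp = line.split()[:2]
    return name, datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")


def rollout_gaps(lines):
    """For each installer pod, return (name, seconds since the most recent
    kube-apiserver Ready transition). Installers created before any
    kube-apiserver Ready event in the input are skipped."""
    events = sorted((parse_line(l) for l in lines if l.strip()),
                    key=lambda event: event[1])
    gaps = []
    last_ready = None
    for name, ts in events:
        if name.startswith("kube-apiserver-"):
            last_ready = ts
        elif name.startswith("installer-") and last_ready is not None:
            gaps.append((name, (ts - last_ready).total_seconds()))
    return gaps


if __name__ == "__main__":
    for name, gap in rollout_gaps(SAMPLE):
        verdict = "OK" if gap >= MIN_READY_SECONDS else "TOO SOON"
        print(f"{name}: {gap:.0f}s {verdict}")
```

For the sample data the two measured gaps are 56s and 52s, both above the 30s threshold.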

Comment 5 errata-xmlrpc 2021-07-27 23:00:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438