Bug 1949956

Summary: kaso: add minreadyseconds to ensure we don't have an LB outage on kas
Product: OpenShift Container Platform
Reporter: Lukasz Szaszkiewicz <lszaszki>
Component: kube-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: aos-bugs, kewang, mfojtik, xxia
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-07-27 23:00:58 UTC
Type: Bug

Description Lukasz Szaszkiewicz 2021-04-15 13:26:48 UTC
// minReadySeconds is the time to wait between an operand becoming ready (all containers ready)
// and starting the rollout onto the next node. This avoids a problem with an external load balancer that looks like this:
// 1. For some reason we have only two instances; maybe a liveness check blipped on one node. It doesn't matter why.
// 2. We bring down the instance on m0 to start a new revision.
// 3. At this point we have one instance running, on m1.
// 4. m0 starts up and goes ready, but the LB ready check just timed out and is waiting for X seconds, so the LB still considers m0 down.
// 5. We bring down the instance on m1 to start the new revision.
// 6. The LB thinks all backends are down and routes randomly.
// 7. No profit.
// Setting this field to 30s can prevent the kube-apiserver from triggering the above flow on AWS.

Comment 2 Xingxing Xia 2021-05-24 14:20:13 UTC
Tested in 4.8.0-0.nightly-2021-05-21-233425 env:
$ oc get po -l apiserver --no-headers -L revision
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal   5/5   Running   0     24m   10
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal   5/5   Running   0     32m   10
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal   5/5   Running   0     39m   10

$ N=10

$ INSTALLER_PODS=`oc get po -l app=installer --no-headers -o name | grep "installer-$N" | grep -o '[^/]*$'`

$ KAS_PODS=`oc get po -l apiserver --no-headers -o name | grep -o '[^/]*$'`

Record the Ready timestamp of each kube-apiserver pod:
$ oc get po $KAS_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' > pods

Append the creation timestamp of each installer pod:
$ oc get po $INSTALLER_PODS -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}' >> pods

$ sort -k 2 pods
installer-10-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T12:56:18Z
kube-apiserver-ip-10-0-222-145.us-east-2.compute.internal 2021-05-24T13:02:28Z
installer-10-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:03:24Z # the new installer pod was created more than 30s after the previous kube-apiserver pod became ready
kube-apiserver-ip-10-0-179-166.us-east-2.compute.internal 2021-05-24T13:09:44Z
installer-10-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:10:36Z # the new installer pod was created more than 30s after the previous kube-apiserver pod became ready
kube-apiserver-ip-10-0-159-135.us-east-2.compute.internal 2021-05-24T13:16:36Z

Comment 5 errata-xmlrpc 2021-07-27 23:00:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438