Bug 1959194 - Ingress controller should use minReadySeconds because otherwise it is disrupted during deployment updates
Summary: Ingress controller should use minReadySeconds because otherwise it is disrupted during deployment updates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Clayton Coleman
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-10 21:45 UTC by Clayton Coleman
Modified: 2022-08-04 22:32 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID                                            Status   Summary                                                         Last Updated
Github openshift cluster-ingress-operator pull 569   open     Bug 1959194: Ingress rollouts should specify minReadySeconds   2021-05-12 16:05:05 UTC
Red Hat Product Errata RHSA-2021:2438                None     None                                                            2021-07-27 23:08:11 UTC

Description Clayton Coleman 2021-05-10 21:45:51 UTC
Deployments with replicas=2 and maxUnavailable != 0 have a subtle behavior: the moment the deployment controller sees that the new pod is ready, it deletes the old pod. That delete propagates quickly, often faster than a load balancer can observe it.
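
For reference, a minimal sketch of the rollout parameters in play (illustrative values, not copied from the actual router deployment):

spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%   # != 0, so the controller may delete an old pod the instant a new pod reports Ready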

So if you have an LB that checks for readiness, in the default config you are at risk of the old pod being removed before the new pod is fully in rotation. By default we recommend 30s to bring ingress / api in and out of rotation (i.e., set (healthy/unhealthy threshold + 1) * interval to be < 30s), so setting minReadySeconds ensures consistency there. Experimentally, in the wild it takes about 30s for kube-proxy events to reach all nodes even under heavy iptables contention, so 30s works well for simply waiting long enough to ensure all nodes see the update when the endpoints change.
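
As a worked example with illustrative numbers (not taken from this bug): an LB health check with a 5s interval and a threshold of 4 needs

  (threshold + 1) * interval = (4 + 1) * 5s = 25s < 30s

to bring a target in or out of rotation, so a pod that must stay Ready for 30s is guaranteed to be in rotation before its predecessor can be deleted.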

The ingress controller is the only component that must make this change at this time, but any future component exposed via a service load balancer should follow in its footsteps. kube-apiserver currently mitigates a bug in AWS load balancers by waiting significantly longer; that is not necessary here because kube-proxy routes requests from other nodes (with https://github.com/openshift/cluster-ingress-operator/pull/609 going into 4.8), so any node behind the LB can still send traffic to the right target.
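
The change itself reduces to one added field on the router deployment; a minimal sketch matching the verification output in comment 2:

spec:
  minReadySeconds: 30   # a new pod must stay Ready for 30s before it counts as available and an old pod is deleted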

Comment 2 Hongan Li 2021-05-24 08:00:34 UTC
Verified with 4.8.0-0.nightly-2021-05-21-233425 and passed.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         5h18m   Cluster version is 4.8.0-0.nightly-2021-05-21-233425

$ oc -n openshift-ingress get deploy/router-default -oyaml
<---snip--->
spec:
  minReadySeconds: 30
  progressDeadlineSeconds: 600
  replicas: 2
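
One way to trigger an update for the next check (an assumption; the comment does not record how the rollout was started, and note that the ingress operator manages this deployment, so manual changes may be reverted):

$ oc -n openshift-ingress rollout restart deployment/router-default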

Check pod status during deployment updates (the remaining old pod should not start terminating until the new pods have been Ready for 30 seconds):
$ oc -n openshift-ingress get pod 
NAME                             READY   STATUS        RESTARTS   AGE
router-default-6467bf666-kjrmm   1/1     Running       0          29s
router-default-6467bf666-tbw7m   1/1     Running       0          29s
router-default-c4cdc666d-cc64l   0/1     Terminating   0          12m
router-default-c4cdc666d-fgzgq   1/1     Running       0          12m
...

$ oc -n openshift-ingress get pod 
NAME                             READY   STATUS        RESTARTS   AGE
router-default-6467bf666-kjrmm   1/1     Running       0          36s
router-default-6467bf666-tbw7m   1/1     Running       0          36s
router-default-c4cdc666d-cc64l   0/1     Terminating   0          12m
router-default-c4cdc666d-fgzgq   1/1     Terminating   0          12m

Comment 5 errata-xmlrpc 2021-07-27 23:07:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

