ExternalTrafficPolicy: Local results in disrupted connections when a pod shuts down (fixing this upstream requires significant work, and we have been trying since 4.3/4.4). ExternalTrafficPolicy: Cluster does not use service health checks, which means load balancers do not prefer nodes that actually have endpoints. Neither behavior is correct, but we have allowed both to stay that way upstream, and meanwhile we continue to disrupt all users on the platform via ingress and SLB. We need to:

a) articulate the upstream plan that resolves this "you have two crappy options" situation, and set a timeline
b) identify the minimal workaround that lets services "do the right thing" (roughly, the behavior all users want is PreferLocal: service health checks are in use, AND if no pods are on a node, traffic gets rerouted to a node that has one)

Bugs: https://bugzilla.redhat.com/show_bug.cgi?id=1929396 - apps behind load balancers are either slow (Cluster) or disrupted (Local). Using this as a summary bug to track all the work, since we keep forgetting the implications of this.
There is a workaround for DNS in kube-proxy: for a service with Local traffic policy, when no local endpoints exist, any endpoint is used instead, which is roughly the behavior we want. The workaround is currently keyed on the service name. It would be better to drive it with an annotation, applied consistently across DNS, ingress, and the service availability test.
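The fallback rule described above can be sketched in a few lines. This is a minimal illustration, not the actual kube-proxy code: the `Endpoint` type, the `Local` field, and `selectEndpoints` are all hypothetical stand-ins for the real proxier state in k8s.io/kubernetes/pkg/proxy.

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for a service endpoint; the real
// kube-proxy types live in k8s.io/kubernetes/pkg/proxy.
type Endpoint struct {
	IP    string
	Local bool // true if the backing pod runs on this node
}

// selectEndpoints sketches the "local with fallback" rule: prefer
// node-local endpoints, but if this node has none, fall back to the
// full cluster set instead of black-holing the traffic.
func selectEndpoints(all []Endpoint, fallback bool) []Endpoint {
	var local []Endpoint
	for _, ep := range all {
		if ep.Local {
			local = append(local, ep)
		}
	}
	if len(local) > 0 {
		return local // same as strict Local when local pods exist
	}
	if fallback {
		return all // workaround active: use any endpoint
	}
	return nil // strict Local: traffic arriving at this node is dropped
}

func main() {
	eps := []Endpoint{{IP: "10.0.0.5", Local: false}, {IP: "10.0.0.9", Local: false}}
	fmt.Println(len(selectEndpoints(eps, false))) // 0: strict Local drops the traffic
	fmt.Println(len(selectEndpoints(eps, true)))  // 2: fallback uses any endpoint
}
```

The key property is that the fallback only changes behavior when the local set is empty, so nodes with local pods still get the source-IP-preserving, no-extra-hop path.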
To clarify: we additionally cannot set ExternalTrafficPolicy to Cluster for ingress, because that would introduce an extra hop (SLBs only use the health check to filter the node set when ExternalTrafficPolicy is Local), and we cannot change the default behavior of either existing option on short notice without potentially impacting customer workloads.
Specifically, the workaround is https://github.com/openshift/sdn/pull/254
Verified with 4.8.0-0.nightly-2021-06-08-005718 and passed.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-005718   True        False         56m     Cluster version is 4.8.0-0.nightly-2021-06-08-005718

### LB service has the annotation by default
$ oc -n openshift-ingress get svc/router-default -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""

### Setting localWithFallback to "false" removes the annotation from the LB service:
spec:
  unsupportedConfigOverrides:
    localWithFallback: "false"

### NodePort service also has the annotation by default
$ oc -n openshift-ingress get svc/router-nodeport-nodeport -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438