Bug 1960284

Summary:	ExternalTrafficPolicy Local does not preserve connections correctly on shutdown, policy Cluster has significant performance cost
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Networking	Assignee:	Miciah Dashiel Butler Masters <mmasters>
Networking sub component:	router	QA Contact:	Hongan Li <hongli>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	amcdermo, aos-bugs, bbennett, cholman, sgreene, swasthan, wking
Version:	4.8
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2021-07-27 23:08:23 UTC	Type:	Bug

Description Clayton Coleman 2021-05-13 14:10:04 UTC

ExternalTrafficPolicy Local disrupts connections when the pod shuts down (fixing this requires significant work to push through upstream, which we have been trying to do since 4.3/4.4).  ExternalTrafficPolicy Cluster does not use service health checks (which means traffic is not preferentially routed to nodes that have a local pod).

Neither of these behaviors is correct, but we have allowed them to stay that way upstream, and we continue to disrupt all users on the platform via ingress and SLB.  We need to:

a) articulate the upstream plan that resolves this "you have two crappy options" situation and set a timeline
b) identify the minimal workaround that allows services to "do the right thing" (roughly I think the behavior all users want is PreferLocal where service health checks are in use AND if no pods are on a node we get rerouted).

Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1929396 - apps behind load balancers are either slow (Cluster) or disrupted (Local).

Using this as a summary bug to track all the work since we continue to forget implications of this.

Comment 1 Clayton Coleman 2021-05-13 16:23:30 UTC

There is a workaround for DNS in kube-proxy: for local-traffic services, when no local endpoints are available, it uses any endpoint, which is roughly the behavior we want.  The workaround is currently keyed on service name. It would be better to use an annotation for DNS, ingress, and the service availability test, and be consistent across all three.
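
A rough sketch of the "local with fallback" selection described above, in Python for illustration only (kube-proxy itself is not written in Python). The `Endpoint` type, the `select_endpoints` function, and the `fallback` flag are hypothetical names, not taken from the actual workaround; the sketch assumes that only ready endpoints are eligible.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical endpoint record: pod IP, hosting node, readiness."""
    ip: str
    node: str
    ready: bool = True


def select_endpoints(endpoints: List[Endpoint], local_node: str,
                     fallback: bool) -> List[Endpoint]:
    """Prefer ready endpoints on local_node.

    If there are none and fallback is enabled, return every ready
    endpoint in the cluster. With fallback disabled this mirrors plain
    Local behavior: no local endpoint means no endpoint at all.
    """
    ready = [ep for ep in endpoints if ep.ready]
    local = [ep for ep in ready if ep.node == local_node]
    if local:
        return local
    return ready if fallback else []


if __name__ == "__main__":
    eps = [Endpoint("10.0.0.1", "node-a"), Endpoint("10.0.0.2", "node-b")]
    # node-c hosts no pod, so traffic arriving there is rerouted.
    print(select_endpoints(eps, "node-c", fallback=True))
```

With `fallback=False` the same call returns an empty list, which corresponds to the disruption seen with plain Local when the last local pod goes away.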

Comment 2 Clayton Coleman 2021-05-13 16:25:10 UTC

To clarify: we also cannot set ExternalTrafficPolicy for ingress to Cluster, because that would result in an extra hop. SLBs only use the health check to filter the set of nodes when ExternalTrafficPolicy is Local, and we cannot change the default behavior of either existing option on short notice without potentially impacting customer workloads.

Comment 3 Clayton Coleman 2021-05-13 16:34:21 UTC

Specifically, the workaround is https://github.com/openshift/sdn/pull/254

Comment 5 Hongan Li 2021-06-08 09:13:28 UTC

Verified with 4.8.0-0.nightly-2021-06-08-005718 and passed.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-005718   True        False         56m     Cluster version is 4.8.0-0.nightly-2021-06-08-005718

### LB service has the annotation by default
$ oc -n openshift-ingress get svc/router-default -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""


### setting localWithFallback to "false" removes the annotation from the LB service
spec:
  unsupportedConfigOverrides:
    localWithFallback: "false"


### NodePort service also has the annotation by default
$ oc -n openshift-ingress get svc/router-nodeport-nodeport -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
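
For scripting this check, a minimal sketch that tests for the annotation on a service object already parsed into a Python dict (for example from `oc ... -ojson`). The helper name `has_local_with_fallback` is hypothetical, and the sketch assumes that the presence of the key is what matters, since the value shown above is an empty string.

```python
# Annotation key as it appears in the service output above.
LOCAL_WITH_FALLBACK = (
    "traffic-policy.network.alpha.openshift.io/local-with-fallback"
)


def has_local_with_fallback(service: dict) -> bool:
    """Return True if the service carries the local-with-fallback annotation.

    Assumption: only the presence of the key is checked; the value
    (an empty string in the output above) is ignored.
    """
    metadata = service.get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return LOCAL_WITH_FALLBACK in annotations


if __name__ == "__main__":
    svc = {"metadata": {"annotations": {LOCAL_WITH_FALLBACK: ""}}}
    print(has_local_with_fallback(svc))
```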

Comment 8 errata-xmlrpc 2021-07-27 23:08:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438