Bug 1960284 - ExternalTrafficPolicy Local does not preserve connections correctly on shutdown, policy Cluster has significant performance cost
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
Reported: 2021-05-13 14:10 UTC by Clayton Coleman
Modified: 2021-07-27 23:08 UTC (History)
Doc Type: If docs needed, set a value
Last Closed: 2021-07-27 23:08:23 UTC
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-ingress-operator pull 622 | 0 | None | closed | Bug 1960284: Set the "local-with-fallback" service annotation | 2021-06-07 04:13:29 UTC
Github | openshift sdn pull 310 | 0 | None | closed | Bug 1960284: Bump openshift/kubernetes for "local-with-fallback" | 2021-06-02 16:31:35 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:08:37 UTC

Description Clayton Coleman 2021-05-13 14:10:04 UTC
ExternalTrafficPolicy Local results in disrupted connections when the pod shuts down (fixing this requires significant work to push through upstream, which we have been trying to do since 4.3/4.4).  ExternalTrafficPolicy Cluster does not use service health checks (which means the load balancer doesn't prefer to route traffic to nodes that have a local endpoint).

Neither of these behaviors is correct, but we have allowed them to stay that way upstream, and we continue to disrupt all users on the platform via ingress and SLB.  We need to:

a) articulate the upstream plan that resolves this "you have two crappy options" situation and set a timeline
b) identify the minimal workaround that allows services to "do the right thing" (roughly, I think the behavior all users want is PreferLocal: service health checks are in use AND, if no pods are on a node, traffic gets rerouted).


https://bugzilla.redhat.com/show_bug.cgi?id=1929396 - apps behind load balancers are either slow (Cluster) or disrupted (Local).

Using this as a summary bug to track all the work, since we continue to forget the implications of this.

Comment 1 Clayton Coleman 2021-05-13 16:23:30 UTC
There is a workaround for DNS in kube-proxy: for services with the Local traffic policy, when no local endpoints are available it falls back to any endpoint, which is roughly the behavior we want.  The workaround is currently keyed on service name. It would be better to use an annotation for DNS, ingress, and the service availability test alike, so that they are consistent.
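
The fallback behavior described in this comment can be sketched roughly as follows. This is an illustration only, not the kube-proxy code from the workaround: the `Endpoint` type, the `select_endpoints` function, and the node names and IPs are hypothetical. The annotation key is the one shown on the router services in comment 5.

```python
from dataclasses import dataclass

# Annotation key as it appears on the router services in this bug.
LOCAL_WITH_FALLBACK = "traffic-policy.network.alpha.openshift.io/local-with-fallback"


@dataclass(frozen=True)
class Endpoint:
    ip: str
    node: str  # node the backing pod runs on (hypothetical field)


def select_endpoints(endpoints, this_node, policy, annotations):
    """Return the endpoints this node may forward external traffic to."""
    if policy == "Cluster":
        # Cluster policy: any endpoint in the cluster is eligible.
        return list(endpoints)
    # Local policy: prefer endpoints on this node.
    local = [ep for ep in endpoints if ep.node == this_node]
    if local:
        return local
    # No local endpoint: fall back to any endpoint only if the service opted in.
    # The annotation value is empty, so test for presence, not truthiness.
    if LOCAL_WITH_FALLBACK in annotations:
        return list(endpoints)
    # Plain Local policy: nothing to forward to from this node.
    return []


# Example inputs (hypothetical node names and pod IPs).
eps = [Endpoint("10.128.0.5", "node-a"), Endpoint("10.129.0.7", "node-b")]
print(select_endpoints(eps, "node-c", "Local", {LOCAL_WITH_FALLBACK: ""}))
print(select_endpoints(eps, "node-c", "Local", {}))
```

With the annotation present, a node with no local pod still has somewhere to send in-flight traffic, whereas plain Local leaves it with an empty endpoint list.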

Comment 2 Clayton Coleman 2021-05-13 16:25:10 UTC
To clarify: additionally, we cannot set ExternalTrafficPolicy for ingress to Cluster, because that would result in an extra hop (SLBs only use the health check to filter the node set when ExternalTrafficPolicy is Local), and we can't change the default behavior of either existing option on short notice without potentially impacting customer workloads.
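
The extra-hop concern can be illustrated with a small model of which nodes a load balancer keeps in rotation under each policy. This is a simplified sketch under stated assumptions, not how any particular cloud load balancer is implemented: `lb_targets`, `needs_extra_hop`, and the node names are hypothetical.

```python
def lb_targets(nodes, pod_nodes, policy):
    """Nodes an external load balancer would keep in rotation (simplified)."""
    if policy == "Local":
        # The service health check passes only on nodes with a local pod,
        # so the load balancer filters out every other node.
        return [n for n in nodes if n in pod_nodes]
    # Cluster policy: no per-service filtering, so every node stays in rotation.
    return list(nodes)


def needs_extra_hop(node, pod_nodes):
    """True if traffic landing on `node` must be forwarded to another node."""
    return node not in pod_nodes


# Example inputs (hypothetical): three nodes, router pod only on node-a.
nodes = ["node-a", "node-b", "node-c"]
pod_nodes = {"node-a"}
for policy in ("Local", "Cluster"):
    targets = lb_targets(nodes, pod_nodes, policy)
    hops = [n for n in targets if needs_extra_hop(n, pod_nodes)]
    print(policy, "targets:", targets, "extra hop via:", hops)
```

In this model, Cluster keeps nodes without a router pod in rotation, and any connection landing on one of them pays the additional forwarding hop.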

Comment 3 Clayton Coleman 2021-05-13 16:34:21 UTC
Specifically, the workaround is https://github.com/openshift/sdn/pull/254

Comment 5 Hongan Li 2021-06-08 09:13:28 UTC
Verified with 4.8.0-0.nightly-2021-06-08-005718 and passed.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-08-005718   True        False         56m     Cluster version is 4.8.0-0.nightly-2021-06-08-005718

### LB service has the annotation by default
$ oc -n openshift-ingress get svc/router-default -oyaml
apiVersion: v1
kind: Service
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""

### Setting localWithFallback to "false" removes the annotation from the LB service.
    localWithFallback: "false"

### NodePort service also has the annotation by default
$ oc -n openshift-ingress get svc/router-nodeport-nodeport -oyaml
apiVersion: v1
kind: Service
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
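
A check like the one above could be scripted along these lines. This is a minimal sketch, assuming the Service has been fetched as JSON and trimmed to the fields used here; `has_local_with_fallback` is a hypothetical helper, and the embedded manifest is an illustrative stand-in rather than real cluster output.

```python
import json

# Annotation key as shown in the verification output in this comment.
LOCAL_WITH_FALLBACK = "traffic-policy.network.alpha.openshift.io/local-with-fallback"


def has_local_with_fallback(service):
    """Return True if the Service object carries the fallback annotation."""
    annotations = service.get("metadata", {}).get("annotations") or {}
    # The annotation value is an empty string, so check presence of the key.
    return LOCAL_WITH_FALLBACK in annotations


# Illustrative stand-in for a trimmed Service manifest (not real cluster output).
manifest = json.loads("""
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "router-default",
    "namespace": "openshift-ingress",
    "annotations": {
      "traffic-policy.network.alpha.openshift.io/local-with-fallback": ""
    }
  }
}
""")

print(has_local_with_fallback(manifest))
```

Because the annotation's value is `""`, a truthiness test on the value would wrongly report it as absent, which is why the helper tests key membership.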

Comment 8 errata-xmlrpc 2021-07-27 23:08:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


