Bug 1912916

Summary: Set external traffic policy to cluster for IBM platform
Product: OpenShift Container Platform Reporter: Rudi Braun <rudi.braun>
Component: Routing Assignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED ERRATA QA Contact: Hongan Li <hongli>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.6 CC: amcdermo, aos-bugs, cewong, mmasters
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:50:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Rudi Braun 2021-01-05 15:29:16 UTC
Description of problem:

Looks like we've encountered a regression in 4.6 around the external traffic policy being set to "local". Whatever change we make, the operator seems to swap it back to "local". In previous versions, if this value was set to "cluster", it would stick without intervention from the operator.

In IBM's IKS, the load balancer implementation is created within the cluster: the LB places a VIP on one of the worker nodes and uses keepalived to maintain the VIP and ensure redundancy. This LB depends on the iptables rules that kube-proxy installs to send traffic from the VIP to the cluster.

With a policy of local, traffic is only sent to pods on the local node. Setting the policy back to Cluster (for the IBM platform) enables traffic to flow to all pods in the cluster.
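
To illustrate the difference described above, here is a minimal sketch that models which router pods can receive traffic arriving at the node holding the VIP. It is a simplified model, not kube-proxy's implementation; the function name, node names, and pod names are all hypothetical.

```python
def reachable_pods(policy, vip_node, pods_by_node):
    """Return the pods that can receive traffic arriving at vip_node.

    Simplified model (an assumption for illustration):
      - "Local":   only pods on the node that received the traffic.
      - "Cluster": pods on any node in the cluster.
    """
    if policy == "Local":
        return sorted(pods_by_node.get(vip_node, []))
    if policy == "Cluster":
        return sorted(pod for pods in pods_by_node.values() for pod in pods)
    raise ValueError(f"unknown external traffic policy: {policy!r}")


# Hypothetical cluster: the VIP sits on a worker that runs no router pod.
pods_by_node = {
    "worker-a": [],
    "worker-b": ["router-1"],
    "worker-c": ["router-2"],
}

local_targets = reachable_pods("Local", "worker-a", pods_by_node)
cluster_targets = reachable_pods("Cluster", "worker-a", pods_by_node)
print(local_targets)
print(cluster_targets)
```

In this model, a VIP placed on a worker without a local router pod has no targets under "Local", while "Cluster" reaches every router pod.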

Version-Release number of selected component (if applicable):
seems to be some time during 4.6

How reproducible:
easy to reproduce

Steps to Reproduce:
1. create an IBM IKS OpenShift 4.6 cluster
2. check external traffic policy on LB after provisioning
3. traffic policy will be set to local
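
Step 2 can be sketched as reading the spec.externalTrafficPolicy field from the Service's JSON representation (for example, as exported with "-o json"). The manifest below is a made-up example, the function name is hypothetical, and treating a missing field as "Cluster" is an assumption based on the Kubernetes default.

```python
import json


def external_traffic_policy(service_json):
    """Return the external traffic policy recorded in a Service manifest.

    Assumption: when the field is absent, fall back to "Cluster",
    which is the Kubernetes default for this field.
    """
    service = json.loads(service_json)
    return service.get("spec", {}).get("externalTrafficPolicy", "Cluster")


# Made-up manifest for illustration; the service name is an assumption.
example_manifest = json.dumps({
    "kind": "Service",
    "metadata": {"name": "router-default", "namespace": "openshift-ingress"},
    "spec": {"type": "LoadBalancer", "externalTrafficPolicy": "Local"},
})

observed_policy = external_traffic_policy(example_manifest)
print(observed_policy)
```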

Actual results:
traffic policy is set to local after cluster provisioning; any subsequent manipulations get overwritten by the operator

Expected results:
traffic policy is set to cluster on the IBM platform after cluster provisioning; any subsequent manipulations will be honored.

Additional info:

Comment 1 Miciah Dashiel Butler Masters 2021-01-05 15:40:30 UTC
(In reply to Rudi Braun from comment #0)
> Looks like we've encountered a regression in 4.6 around the external traffic
> policy being set to "local". Whatever change we make, the operator seems to
> swap it back to "local". In previous versions, if this value was set to
> "cluster", it would stick without intervention from the operator.

This may have been caused by https://github.com/openshift/cluster-ingress-operator/pull/482, which was reverted in https://github.com/openshift/cluster-ingress-operator/pull/507 to fix bug 1905490.  #482 shipped in 4.6.6, and #507 shipped in 4.6.9.  On what specific version are you seeing the issue?  

I agree though that if IBM Cloud needs "Cluster" external traffic policy, then the operator should set that (as per <https://github.com/openshift/cluster-ingress-operator/pull/516>).  Nothing but the operator should be modifying the service that the operator manages.
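
As a rough sketch of the behavior proposed here, the policy choice could be keyed on the platform type. This is not the operator's code (the operator is not written in Python); the function name is hypothetical, and the "IBMCloud" platform string and the "Local" default for other platforms are assumptions drawn from this discussion.

```python
def desired_external_traffic_policy(platform_type):
    """Pick the external traffic policy for the managed LoadBalancer service.

    Assumptions for illustration: IBM Cloud is identified by the string
    "IBMCloud", and every other platform keeps the "Local" policy.
    """
    if platform_type == "IBMCloud":
        return "Cluster"
    return "Local"


ibm_policy = desired_external_traffic_policy("IBMCloud")
other_policy = desired_external_traffic_policy("AWS")
print(ibm_policy, other_policy)
```

Keeping the decision in one place like this means the operator, rather than a manual edit to the service, owns the value on every platform.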

Comment 2 Rudi Braun 2021-01-05 15:48:22 UTC
We were testing against 4.6.6 when observing the issue, have not tried against a 4.6.9+ build.

Comment 4 Miciah Dashiel Butler Masters 2021-01-05 18:39:41 UTC
We'll try to get https://github.com/openshift/cluster-ingress-operator/pull/516 merged in time for the OCP 4.7.0 release so that the operator sets the "Cluster" external traffic policy on IBM Cloud.  

I gather that you are currently using some workaround to set the external traffic policy.  Do you want https://github.com/openshift/cluster-ingress-operator/pull/516 to be backported to 4.6.z in order to obviate the need for the workaround?  (A backport will require some manual conflict resolution, but I do not mind doing it.)

Comment 6 Rudi Braun 2021-01-06 18:54:49 UTC
We've given 4.6.9 a shot per the comment above about the revert, and it does look like we're seeing the original pre-4.6.6 behavior. I think if at some point you'd like to reintroduce that suspected change in 4.6, it would make sense to backport; however, I defer to your guys' best judgement. As things stand, we appear to be working ok in 4.6.9.

Comment 7 Cesar Wong 2021-01-15 23:15:58 UTC
Verified with 4.7.0-0.nightly-2021-01-15-194305

Comment 8 Hongan Li 2021-01-18 01:11:07 UTC
No regression on other cloud platforms; moving to verified.

Comment 11 errata-xmlrpc 2021-02-24 15:50:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633