Description of problem:
Looks like we've encountered a regression in 4.6 around the external traffic policy being set to "local". After any manual change, the operator seems to swap it back to "local". In previous versions, if this value was set to "cluster", it would stick without intervention from the operator.
In IBM's IKS, the load balancer implementation is created within the cluster: the LB places a VIP on one of the worker nodes and uses keepalived to maintain the VIP and ensure redundancy. This LB depends on the iptables rules that kube-proxy installs to send traffic from the VIP to the cluster.
With a policy of local, traffic is only sent to pods on the local node. Setting the policy back to Cluster (for the IBM platform) enables traffic to flow to all pods in the cluster.
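The difference between the two policies can be illustrated with a small model. This is only a sketch of the behavior described above, not kube-proxy's actual implementation; the function name, node names, and pod names are hypothetical.

```python
from typing import Dict, List


def eligible_pods(policy: str, vip_node: str,
                  pods_by_node: Dict[str, List[str]]) -> List[str]:
    """Return the pods that can receive traffic arriving at the VIP node.

    Simplified model: "Local" only considers pods on the node that
    currently holds the VIP, "Cluster" considers pods on every node.
    """
    if policy == "Local":
        return list(pods_by_node.get(vip_node, []))
    if policy == "Cluster":
        return [pod for node in sorted(pods_by_node)
                for pod in pods_by_node[node]]
    raise ValueError(f"unknown externalTrafficPolicy: {policy!r}")


# Hypothetical layout: keepalived placed the VIP on worker-1,
# but the router pods run on the other two workers.
pods = {
    "worker-1": [],
    "worker-2": ["router-a"],
    "worker-3": ["router-b"],
}

print(eligible_pods("Local", "worker-1", pods))    # []
print(eligible_pods("Cluster", "worker-1", pods))  # ['router-a', 'router-b']
```

In this layout the Local policy leaves the VIP node with no backend at all, which is why the in-cluster LB needs Cluster.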
Version-Release number of selected component (if applicable):
seems to be some time during 4.6
How reproducible:
easy to reproduce
Steps to Reproduce:
1. Create an IBM IKS OpenShift 4.6 cluster
2. Check the external traffic policy on the LB after provisioning
3. The traffic policy will be set to local
Actual results:
The traffic policy is set to local after cluster provisioning, and any subsequent manipulations get overwritten by the operator.
Expected results:
The traffic policy is set to cluster on the IBM platform after cluster provisioning, and any subsequent manipulations are honored.
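For step 2, one way to check the policy is to read it from the Service's JSON representation. The sketch below assumes the JSON comes from something like `oc get service router-default -n openshift-ingress -o json`; the service name and namespace are assumptions that do not appear in this report, and the sample document is made up for illustration.

```python
import json


def external_traffic_policy(service_json: str) -> str:
    """Extract spec.externalTrafficPolicy from a Service JSON document."""
    service = json.loads(service_json)
    # Kubernetes treats an unset field as "Cluster".
    return service.get("spec", {}).get("externalTrafficPolicy", "Cluster")


# Hypothetical, heavily trimmed Service object for illustration only.
sample = json.dumps({
    "kind": "Service",
    "metadata": {"name": "router-default", "namespace": "openshift-ingress"},
    "spec": {"type": "LoadBalancer", "externalTrafficPolicy": "Local"},
})

print(external_traffic_policy(sample))  # Local
```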
(In reply to Rudi Braun from comment #0)
> Looks like we've encountered some regression in 4.6 around external traffic
> policy being set to "local". Any manipulation, it seems the operator is just
> swapping to back to "local." In previous versions, if this value was set to
> "cluster" it would stick, without intervention from the operator.
This may have been caused by https://github.com/openshift/cluster-ingress-operator/pull/482, which was reverted in https://github.com/openshift/cluster-ingress-operator/pull/507 to fix bug 1905490. #482 shipped in 4.6.6, and #507 shipped in 4.6.9. On what specific version are you seeing the issue?
I agree though that if IBM Cloud needs "Cluster" external traffic policy, then the operator should set that (as per <https://github.com/openshift/cluster-ingress-operator/pull/516>). Nothing but the operator should be modifying the service that the operator manages.
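To illustrate why manual edits do not stick when the operator owns the service, here is a rough sketch of a reconcile step that picks the policy per platform. It is not the operator's code (which is written in Go); the function names and the "IBMCloud" platform string are assumptions made for the example.

```python
def desired_policy(platform: str) -> str:
    """Pick the external traffic policy for a platform (hypothetical rule)."""
    return "Cluster" if platform == "IBMCloud" else "Local"


def reconcile(service: dict, platform: str) -> bool:
    """Force the service back to the desired policy.

    Returns True when the service had to be changed, i.e. when a manual
    edit (or an outdated default) was overwritten.
    """
    want = desired_policy(platform)
    spec = service.setdefault("spec", {})
    if spec.get("externalTrafficPolicy") == want:
        return False
    spec["externalTrafficPolicy"] = want
    return True


# A manual change to "Local" on IBM Cloud is reverted on the next reconcile.
svc = {"spec": {"externalTrafficPolicy": "Local"}}
print(reconcile(svc, "IBMCloud"), svc["spec"]["externalTrafficPolicy"])  # True Cluster
```

With this shape, the fix is to change what the operator considers desired on a given platform rather than to edit the service by hand.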
We were testing against 4.6.6 when we observed the issue; we have not tried a 4.6.9+ build.
We'll try to get https://github.com/openshift/cluster-ingress-operator/pull/516 merged in time for the OCP 4.7.0 release so that the operator sets the "Cluster" external traffic policy on IBM Cloud.
I gather that you are currently using some workaround to set the external traffic policy. Do you want https://github.com/openshift/cluster-ingress-operator/pull/516 to be backported to 4.6.z in order to obviate the need for the workaround? (A backport will require some manual conflict resolution, but I do not mind doing it.)
We've given 4.6.9 a shot per the comment above about the revert, and it does look like we're seeing the original pre-4.6.6 behavior. If at some point you'd like to reintroduce that suspected change in 4.6, I think it would make sense to backport; however, I defer to your best judgement. As things stand, we appear to be working OK in 4.6.9.
Verified with 4.7.0-0.nightly-2021-01-15-194305
No regression on other cloud platforms; moving to verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.