Bug 2055470
Summary: Ingresscontroller LB scope change behaviour differs for different values of aws-load-balancer-internal annotation

Product: OpenShift Container Platform
Component: Networking
Networking sub component: router
Version: 4.10
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: ServiceDeliveryBlocker, Upgrades
Reporter: Ravi Trivedi <travi>
Assignee: aos-network-edge-staff <aos-network-edge-staff>
QA Contact: Arvind iyengar <aiyengar>
CC: aos-bugs, cblecker, hongli, mfisher, mmasters, nmalik, wking

Doc Type: Bug Fix
Doc Text:
Cause: The AWS cloud-provider implementation checks the "service.beta.kubernetes.io/aws-load-balancer-internal" service annotation to determine whether a service load-balancer (SLB) should be configured to be internal (as opposed to being public). The cloud-provider implementation recognizes both the value "0.0.0.0/0" and the value "true" as indicating that an SLB should be internal. The ingress operator in OpenShift 4.7 and earlier sets the value "0.0.0.0/0", and the ingress operator in OpenShift 4.8 and later sets the value "true" for services that the operator creates for internal SLBs. A service that was created on an older cluster might have the annotation value "0.0.0.0/0", which could cause comparisons that check for the "true" value to return the wrong result.
Consequence: When a cluster had an internal SLB that had been configured using the old annotation value and the cluster was upgraded to OpenShift 4.10, the ingress operator would report the Progressing=True clusteroperator status condition, preventing the upgrade from completing.
Fix: Logic was added to the ingress operator to normalize the service.beta.kubernetes.io/aws-load-balancer-internal service annotation for operator-managed services by replacing the value "0.0.0.0/0" with the value "true".
Result: The ingress operator no longer prevents upgrades of clusters with the "service.beta.kubernetes.io/aws-load-balancer-internal=0.0.0.0/0" annotation from completing.
Clones: 2056928, 2057518
Bug Blocks: 2056928
Last Closed: 2022-08-10 10:50:22 UTC
Type: Bug
Description
Ravi Trivedi 2022-02-17 03:52:33 UTC
Resetting the target-release field and the priority/severity/blocker- fields as Miciah had intended.

The PR merge made it into the "4.11.0-0.ci-2022-02-22-163446" image as of writing. Performing an upgrade test from 4.10.0-rc.3 to 4.11.0-0.ci-2022-02-22-163446, it is observed that the fix works as intended: the "service.beta.kubernetes.io/aws-load-balancer-internal" annotation reverts to "true" when the ingress operator gets upgraded during the process:

```
oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.3   True        False         27m     Cluster version is 4.10.0-rc.3

oc -n openshift-ingress edit service/router-default
service/router-default edited

oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0    <--------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-23T05:35:06Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup

oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-22-163446 --allow-explicit-upgrade=true --force
Updating to release image registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-22-163446

oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.3   True        True          2m56s   Working towards 4.11.0-0.ci-2022-02-22-163446: 95 of 773 done (12% complete)

oc get co ingress
NAME      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.10.0-rc.3   True        True          False      54m     ingresscontroller "default" is progressing: ScopeChanged: The IngressController scope was changed from "External" to "Internal". To effectuate this change, you must delete the service: `oc -n openshift-ingress delete svc/router-default`; the service load-balancer will then be deprovisioned and a new one created. This will most likely cause the new load-balancer to have a different host name and IP address from the old one's. Alternatively, you can revert the change to the IngressController: `oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'.
insights   4.10.0-rc.3   True       False         False      53m
```

Post upgrade:

```
oc get co ingress
NAME      VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.ci-2022-02-22-163446   True        False         False      88m

oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"    <--------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-23T05:35:06Z"
```

Based on the above outcome, marking this BZ as "verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
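The normalization described in the Doc Text can be sketched as follows. This is an illustrative Go snippet, not the ingress operator's actual code: the function name `normalizeInternalAnnotation` is hypothetical, and the real logic in openshift/cluster-ingress-operator operates on a Kubernetes `Service` object rather than a bare map. It shows the essential step: rewriting the legacy "0.0.0.0/0" value to "true" so that later comparisons against "true" succeed.

```go
package main

import "fmt"

// Annotation key checked by the AWS cloud-provider implementation to decide
// whether a service load-balancer should be internal.
const internalLBAnnotation = "service.beta.kubernetes.io/aws-load-balancer-internal"

// normalizeInternalAnnotation (hypothetical name) rewrites the legacy
// "0.0.0.0/0" value set by OpenShift 4.7 and earlier to the "true" value used
// by 4.8 and later. It reports whether a change was made.
func normalizeInternalAnnotation(annotations map[string]string) bool {
	if annotations[internalLBAnnotation] == "0.0.0.0/0" {
		annotations[internalLBAnnotation] = "true"
		return true
	}
	return false
}

func main() {
	// A service created on an older cluster carries the legacy value.
	annotations := map[string]string{internalLBAnnotation: "0.0.0.0/0"}
	fmt.Println(normalizeInternalAnnotation(annotations)) // true: legacy value rewritten
	fmt.Println(annotations[internalLBAnnotation])        // "true"
	fmt.Println(normalizeInternalAnnotation(annotations)) // false: already normalized, no-op
}
```

Because both values mean "internal" to the cloud provider, this rewrite changes nothing about the provisioned load-balancer; it only makes the operator's string comparison stop misreporting a scope change.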