Description of problem:

For managed clusters, upgrades from 4.9 to 4.10 got stuck on AWS clusters with the message below:

~~~
ingress   4.10.0-rc.1   True   True   False   254d   ingresscontroller "default" is progressing: ScopeChanged: The IngressController scope was changed from "External" to "Internal".  To effectuate this change, you must delete the service: `oc -n openshift-ingress delete svc/router-default`; the service load-balancer will then be deprovisioned and a new one created.  This will most likely cause the new load-balancer to have a different host name and IP address from the old one's.  Alternatively, you can revert the change to the IngressController: `oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'.
~~~

In order to reproduce the issue, we went through BZ https://bugzilla.redhat.com/show_bug.cgi?id=2035193 and the enhancement https://github.com/openshift/enhancements/blob/master/enhancements/ingress/mutable-publishing-scope.md#proposal to understand the working details better. Based on the enhancement, it is expected that the Ingress ClusterOperator gets stuck in Progressing state when the scope is changed from "External" to "Internal" before a 4.9 to 4.10 upgrade of an AWS cluster. However, what was observed is that the above message asking to manually delete the router-default service occurred only when the router-default service had the following annotation:

~~~
service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
~~~

By default, for 4.8+ clusters, the value of this annotation is:

~~~
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
~~~

Thus, clusters installed at version 4.7 or earlier kept the annotation value "0.0.0.0/0" and got stuck in the upgrade from 4.9 to 4.10, whereas a cluster installed at version 4.9 had the annotation value "true" and finished the upgrade without getting stuck. This bugzilla is raised to track this difference in behaviour between long-running clusters (installed at 4.7 or earlier) and recently installed clusters (4.8+).

OpenShift release version: 4.10

Cluster Platform: AWS

How reproducible: Always

Steps to Reproduce (in detail):

Scenario 1:
1. Create an AWS cluster at version 4.9.
2. Change the LB scope to "Internal" (the router-default service annotation should then be `service.beta.kubernetes.io/aws-load-balancer-internal: "true"`). Use the command `oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'`.
3. Initiate the upgrade to 4.10. The upgrade is expected to finish without getting stuck.

Scenario 2:
1. Create an AWS cluster at version 4.9.
2. Change the LB scope to "Internal" (the router-default service annotation should then be `service.beta.kubernetes.io/aws-load-balancer-internal: "true"`). Use the command `oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"type":"LoadBalancerService","loadBalancer":{"scope":"Internal"}}}}'`.
3. Edit the router-default service and change the annotation from `service.beta.kubernetes.io/aws-load-balancer-internal: "true"` to `service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0`.
4. Initiate the upgrade to 4.10. The upgrade is expected to get stuck, emitting the above message asking to manually delete the router-default service.

Actual results:
- Upgrades from 4.9 to 4.10 on AWS clusters behave differently depending on the value of the `service.beta.kubernetes.io/aws-load-balancer-internal` annotation on the router-default service.

Expected results:
- Upgrades from 4.9 to 4.10 on AWS clusters should behave the same regardless of the value of the `service.beta.kubernetes.io/aws-load-balancer-internal` annotation on the router-default service, considering that the default value of this annotation was changed from "0.0.0.0/0" to "true" in OCP 4.8.

Impact of the problem:
- Impact is high, as this would block a lot of 4.9 to 4.10 upgrades across the managed clusters fleet. Manual intervention to delete the router-default service is not scalable.
- Per https://github.com/openshift/enhancements/blob/master/enhancements/ingress/mutable-publishing-scope.md#as-the-provider-of-a-managed-service-i-want-to-automate-changing-an-ingresscontrollers-scope, a workaround exists: annotate the default ingresscontroller so that the router-default service is deleted automatically. This was tested and worked as expected (see the sketch at the end of this report).

Additional info:

Tests were mostly done on managed OSD clusters, which also have https://github.com/openshift/cloud-ingress-operator deployed. However, this operator does not manage the `service.beta.kubernetes.io/aws-load-balancer-internal` annotation.

The following tests were done:
1. Create a 4.9 AWS cluster and upgrade to 4.10 after changing the scope to internal via the OCM console (annotation value "true" in 4.9) = UPGRADE PASSED
2. Create a 4.7 AWS cluster and upgrade to 4.10 after changing the scope to internal via the OCM console (annotation value "0.0.0.0/0" in 4.7-4.9) = UPGRADE STUCK
3. Create a 4.9 AWS cluster and upgrade to 4.10 after changing the scope to internal via a manual patch of ingresscontrollers/default (annotation value "true" in 4.9) = UPGRADE PASSED
4. Create a 4.9 AWS cluster and upgrade to 4.10 after changing the scope to internal and manually changing the annotation value to "0.0.0.0/0" = UPGRADE STUCK
5. Create a 4.9 AWS cluster and upgrade to 4.10 after changing the scope to internal and adding the ingress.operator.openshift.io/auto-delete-load-balancer= annotation to the default ingresscontroller = UPGRADE PASSED
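For reference, a minimal sketch of both operations from the CLI. The jsonpath query and the annotate command forms below are assumed from standard oc/kubectl usage and the enhancement linked above; they are illustrations, not steps taken from the tests:

~~~
# Check the current value of the internal-LB annotation on the
# router-default service (dots in the annotation key are escaped
# with '\.' in jsonpath):
oc -n openshift-ingress get service/router-default \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-internal}'

# Workaround from the enhancement: opt the default ingresscontroller
# in to automatic deletion of the load-balancer service on scope
# change (the annotation takes an empty value):
oc -n openshift-ingress-operator annotate ingresscontrollers/default \
  ingress.operator.openshift.io/auto-delete-load-balancer=
~~~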
Resetting the target-release field, and the pri/severity/blocker- fields, as Miciah had intended.
The merged PR made it into the "4.11.0-0.ci-2022-02-22-163446" image as of this writing. Performing an upgrade test from 4.10.0-rc.3 to 4.11.0-0.ci-2022-02-22-163446, it is observed that the fix works as intended: the "service.beta.kubernetes.io/aws-load-balancer-internal" annotation reverts to "true" when the ingress operator gets upgraded during the process:

-------
oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.3   True        False         27m     Cluster version is 4.10.0-rc.3

oc -n openshift-ingress edit service/router-default
service/router-default edited

oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0    <--------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-23T05:35:06Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup

oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-22-163446 --allow-explicit-upgrade=true --force
Updating to release image registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-02-22-163446

oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-rc.3   True        True          2m56s   Working towards 4.11.0-0.ci-2022-02-22-163446: 95 of 773 done (12% complete)

oc get co ingress
NAME       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress    4.10.0-rc.3   True        True          False      54m     ingresscontroller "default" is progressing: ScopeChanged: The IngressController scope was changed from "External" to "Internal".  To effectuate this change, you must delete the service: `oc -n openshift-ingress delete svc/router-default`; the service load-balancer will then be deprovisioned and a new one created.  This will most likely cause the new load-balancer to have a different host name and IP address from the old one's.  Alternatively, you can revert the change to the IngressController: `oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'.
insights   4.10.0-rc.3   True        False         False      53m

Post upgrade:

oc get co ingress
NAME      VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.11.0-0.ci-2022-02-22-163446   True        False         False      88m

oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"    <--------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-23T05:35:06Z"
--------

Based on the above outcome, marking this BZ as "verified".
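For anyone reproducing this verification, the interactive edit step in the walkthrough above can also be done non-interactively. This is a sketch assuming standard `oc annotate` semantics, not a step from the original verification:

~~~
# Set the legacy annotation value on the router-default service
# (--overwrite is required because the annotation already exists):
oc -n openshift-ingress annotate service/router-default --overwrite \
  service.beta.kubernetes.io/aws-load-balancer-internal=0.0.0.0/0
~~~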
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069