Bug 2056928
| Summary: | Ingresscontroller LB scope change behaviour differs for different values of aws-load-balancer-internal annotation | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Ravi Trivedi <travi> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | urgent | CC: | aiyengar, aos-bugs, aos-network-edge-staff, cblecker, hongli, mfisher, mifiedle, mmasters, nmalik, wking |
| Version: | 4.10 | Keywords: | ServiceDeliveryBlocker, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |

Doc Text:
Cause: The AWS cloud-provider implementation checks the "service.beta.kubernetes.io/aws-load-balancer-internal" service annotation to determine whether a service load-balancer (SLB) should be configured to be internal (as opposed to being public). The cloud-provider implementation recognizes both the value "0.0.0.0/0" and the value "true" as indicating that an SLB should be internal. The ingress operator in OpenShift 4.7 and earlier sets the value "0.0.0.0/0", and the ingress operator in OpenShift 4.8 and later sets the value "true" for services that the operator creates for internal SLBs. A service that was created on an older cluster might have the annotation value "0.0.0.0/0", which could cause comparisons that check for the "true" value to return the wrong result.
Consequence: When a cluster had an internal SLB that had been configured using the old annotation value and the cluster was upgraded to OpenShift 4.10, the ingress operator would report the Progressing=True clusteroperator status condition, preventing the upgrade from completing.
Fix: Logic was added to the ingress operator to normalize the service.beta.kubernetes.io/aws-load-balancer-internal service annotation for operator-managed services by replacing the value "0.0.0.0/0" with the value "true" (see the illustrative sketch after the metadata table below).
Result: The ingress operator no longer prevents upgrades of clusters with the "service.beta.kubernetes.io/aws-load-balancer-internal=0.0.0.0/0" annotation from completing.

| Story Points: | --- | | |
| Clone Of: | 2055470 | Environment: | |
| Last Closed: | 2022-03-10 16:44:19 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2055470 | | |
| Bug Blocks: | 2057518 | | |
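The Cause and Fix described in the Doc Text above come down to a single string comparison: the AWS cloud provider accepts both "0.0.0.0/0" and "true" as meaning "internal", while newer operator code compares the annotation against "true" only. Below is a minimal Go sketch of the normalization idea, with hypothetical helper and constant names; the actual change lives in the cluster-ingress-operator (the log excerpt in the last comment points at ingress/load_balancer_service.go) and may differ in detail.

```go
package main

import "fmt"

const (
	// Annotation the AWS cloud provider checks to decide whether a service
	// load balancer should be internal rather than public.
	awsInternalLBAnnotation = "service.beta.kubernetes.io/aws-load-balancer-internal"

	// Value written by the ingress operator in OpenShift 4.7 and earlier.
	legacyInternalValue = "0.0.0.0/0"
	// Value written by the ingress operator in OpenShift 4.8 and later.
	currentInternalValue = "true"
)

// normalizeInternalAnnotation is a hypothetical helper illustrating the fix:
// it rewrites the legacy "0.0.0.0/0" value to "true" on operator-managed
// services so that later comparisons against "true" give the right answer on
// clusters upgraded from 4.7 or earlier. It reports whether the map changed.
func normalizeInternalAnnotation(annotations map[string]string) bool {
	if annotations[awsInternalLBAnnotation] == legacyInternalValue {
		annotations[awsInternalLBAnnotation] = currentInternalValue
		return true
	}
	return false
}

func main() {
	// Annotations as they might look on a router-default service created by 4.7.
	svcAnnotations := map[string]string{
		awsInternalLBAnnotation: legacyInternalValue,
	}

	// Without normalization, a check for the "true" value wrongly reports "not internal".
	fmt.Println("before:", svcAnnotations[awsInternalLBAnnotation] == "true")

	if normalizeInternalAnnotation(svcAnnotations) {
		fmt.Printf("normalized %s: %q\n", awsInternalLBAnnotation, svcAnnotations[awsInternalLBAnnotation])
	}

	// After normalization the same check agrees with the cloud provider's view.
	fmt.Println("after:", svcAnnotations[awsInternalLBAnnotation] == "true")
}
```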
Moving to MODIFIED. No 4.10 nightly includes the fix for this yet.

Verified in the "4.10.0-0.nightly-2022-02-24-034852" release version. Testing an upgrade from 4.9.23 to 4.10.0-0.nightly-2022-02-24-034852, it is observed that the patch works as intended and the upgrade completes successfully:
--------
oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.23 True False 5m57s Cluster version is 4.9.23
oc -n openshift-ingress edit service/router-default
service/router-default edited
oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0   <-------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-24T06:26:49Z"
oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-24-034852 --allow-explicit-upgrade=true --force
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-02-24-034852
Post upgrade:
oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-02-24-034852 True False 9m52s Cluster version is 4.10.0-0.nightly-2022-02-24-034852
oc get co ingress
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
ingress 4.10.0-0.nightly-2022-02-24-034852 True False False 66m
oc -n openshift-ingress get service/router-default -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"   <-----------
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""
  creationTimestamp: "2022-02-24T06:26:49Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
--------
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Tested with a cluster launched by cluster-bot (launch openshift/cluster-ingress-operator#705 aws), and the PR works as expected. After changing the scope to internal and manually changing the annotation value to "0.0.0.0/0", the ingress operator updates the annotation to "true" immediately.

$ oc -n openshift-ingress annotate svc/router-default service.beta.kubernetes.io/aws-load-balancer-internal="0.0.0.0/0" --overwrite
service/router-default annotated

$ oc -n openshift-ingress get svc/router-default -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "5"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "4"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: '*'
    traffic-policy.network.alpha.openshift.io/local-with-fallback: ""

### logs of ingress-operator
2022-02-23T04:57:26.536Z INFO operator.ingress_controller ingress/load_balancer_service.go:294 normalized annotation {"namespace": "openshift-ingress", "name": "router-default", "annotation": "service.beta.kubernetes.io/aws-load-balancer-internal", "old": "0.0.0.0/0", "new": "true"}

$ oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-02-23-041216-ci-ln-bqsc5qk-latest    True        False         20m     Cluster version is 4.10.0-0.ci.test-2022-02-23-041216-ci-ln-bqsc5qk-latest
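For completeness, the same inspection the `oc get -o yaml` commands above perform can be done programmatically. The following client-go sketch is hypothetical (not part of the operator or the verification steps): it reads the router-default service and reports which form of the annotation it carries, assuming a kubeconfig at the default location and sufficient permissions.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config); adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// The operator-managed service for the default ingresscontroller.
	svc, err := client.CoreV1().Services("openshift-ingress").Get(context.TODO(), "router-default", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	const key = "service.beta.kubernetes.io/aws-load-balancer-internal"
	switch svc.Annotations[key] {
	case "0.0.0.0/0":
		fmt.Println("legacy value present; a fixed ingress operator will normalize it to \"true\"")
	case "true":
		fmt.Println("annotation already uses the current \"true\" value")
	default:
		fmt.Println("service is not annotated as internal")
	}
}
```

On a cluster that was installed before 4.8 and has not yet been reconciled by a fixed operator, the first case is the expected output.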