1904582 – All application traffic broken due to unexpected load balancer change on 4.6.4 -> 4.6.6 upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Miciah Dashiel Butler Masters
QA Contact:	Hongan Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1904594

Reported:	2020-12-04 20:22 UTC by Ben Browning
Modified:	2022-08-04 22:30 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:38:11 UTC
Target Upstream Version:
Embargoed:

Attachments
ingresscontroller yaml after upgrade (3.69 KB, text/plain) 2020-12-04 20:25 UTC, Ben Browning
ingress operator logs after upgrade (29.99 KB, text/plain) 2020-12-04 20:25 UTC, Ben Browning

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-ingress-operator pull 502	0	None	closed	Bug 1904582: Assume ingresscontroller is external absent status	2021-02-15 06:01:23 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:40:03 UTC

Description Ben Browning 2020-12-04 20:22:07 UTC

Description of problem:

When I upgraded my long-lived OpenShift cluster (which started life as 4.1.x) from 4.6.4 to 4.6.6, the OpenShift Console and all application URLs (anything in the *.apps.<my-cluster> domain) stopped working. Digging into the AWS cloud console, I found that the *.apps DNS entry was pointed at the internal AWS load balancer upon upgrading to 4.6.6. Prior to that upgrade, the DNS entry pointed to the external AWS load balancer.

As a result, none of the cluster's application traffic was reachable from outside the cluster: all the app domain names now resolved to internal 10.*.*.* IP addresses instead of publicly routable ones, as they had on 4.6.4.
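
A quick way to confirm this symptom is to check whether the addresses behind an apps hostname fall in a private range. The following is a minimal sketch using only the Python standard library; the function names `resolve` and `classify_addresses` are illustrative and not part of any OpenShift tooling.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def classify_addresses(addresses):
    """Split addresses into private (e.g. 10.x.x.x) and public ones."""
    result = {"private": [], "public": []}
    for addr in addresses:
        # is_private is True for RFC 1918 ranges such as 10.0.0.0/8, and
        # also for loopback, link-local, and other non-global ranges.
        key = "private" if ipaddress.ip_address(addr).is_private else "public"
        result[key].append(addr)
    return result
```

Calling `classify_addresses(resolve(host))` with a route hostname under the cluster's apps domain would be expected to list only private addresses on an affected cluster when run from outside it, assuming the resolver returns the same records that external clients see.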


Version-Release number of selected component (if applicable):

OCP 4.6.4 -> 4.6.6 upgrade


How reproducible:

Every time


Steps to Reproduce:
1. Create an OpenShift 4.1.x cluster in a public cloud provider (I used AWS).
2. Upgrade this cluster incrementally until you get to 4.6.4.
3. Deploy any application with a corresponding OpenShift Route under the default apps domain. Verify the application and route work.
4. Upgrade the cluster to 4.6.6.

Actual results:

The OpenShift Console as well as the application will no longer be accessible over the internet.

Expected results:

The OpenShift Console and application should still be accessible over the internet.

Additional info:

Comment 1 Ben Browning 2020-12-04 20:25:08 UTC

Created attachment 1736517
ingresscontroller yaml after upgrade

Comment 2 Ben Browning 2020-12-04 20:25:54 UTC

Created attachment 1736519
ingress operator logs after upgrade

Comment 3 Clayton Coleman 2020-12-04 21:08:57 UTC

When we have a fix for this, we should make sure to normalize the data and add a test case that verifies future clusters cannot regress.  It also highlights a weakness in our testing regime - this was only accidentally caught by someone upgrading a personal cluster.

Comment 4 Miciah Dashiel Butler Masters 2020-12-04 21:15:12 UTC

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  
  Clusters on AWS that are upgraded from OpenShift 4.1 to OpenShift 4.6.6.

  All edges leading to 4.6.6 must be blocked.

What is the impact?  Is it serious enough to warrant blocking edges?

  If the cluster's default ingresscontroller was created on OpenShift 4.1, uses the LoadBalancerService endpoint publishing strategy, and has not been recreated on a later OpenShift version, then upgrading to OpenShift 4.6.6 causes the ingresscontroller's load balancer's scope to be changed to internal.  All routes (including the Console and OAuth routes) then become inaccessible from outside the cluster's VPN.

  This is serious enough to block edges.  

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Remediation requires either deleting and recreating or patching the default ingresscontroller:

    oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'

  If the administrator does not have a valid OAuth token, the administrator must perform the remediation from inside the cluster's VPN in order to get a new OAuth token.
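
  For scripted remediation, the patch above can be assembled programmatically so that the JSON quoting cannot go wrong. This is only a sketch: the helper name `build_scope_patch_command` is hypothetical, and the patch body, namespace, and resource name are taken from the command quoted in this comment.

```python
import json


def build_scope_patch_command(scope="External"):
    """Return the argv list for the merge patch quoted in this comment."""
    if scope not in ("External", "Internal"):
        raise ValueError(f"unexpected scope: {scope!r}")
    patch = {
        "spec": {
            "endpointPublishingStrategy": {
                "loadBalancer": {"scope": scope},
            }
        }
    }
    return [
        "oc", "-n", "openshift-ingress-operator",
        "patch", "ingresscontrollers/default",
        "--type=merge",
        # Compact separators reproduce the JSON exactly as written above.
        "--patch=" + json.dumps(patch, separators=(",", ":")),
    ]
```

  The returned list can be passed to `subprocess.run(cmd, check=True)` on a host where `oc` is already logged in to the cluster; the sketch itself does not run anything.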

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
 
  Yes, this is a regression.

Comment 5 W. Trevor King 2020-12-04 21:23:28 UTC

[1] is blocking * -> 4.6.6 and * -> 4.6.7 to keep folks off the impacted releases.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/577

Comment 7 Hongan Li 2020-12-08 10:03:52 UTC

Verified by upgrading a cluster from 4.1.41 -> 4.2.36 -> 4.3.40 -> 4.5.22 -> 4.6.4 -> 4.7.0-0.nightly-2020-12-07-232943; the upgrade passed. The LB service is still external even though status.endpointPublishingStrategy.loadBalancer is nil.


# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)                      AGE
router-default            LoadBalancer   172.30.61.1    a0c4dd66838f011eb999202803dda8bb-1181204001.us-east-2.elb.amazonaws.com   80:31612/TCP,443:30343/TCP   9h

# oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-12-08T00:55:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "267207"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 0bd4f3aa-38f0-11eb-9992-02803dda8bb2
spec: {}
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-12-08T02:14:06Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-12-08T09:50:34Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2020-12-08T09:43:57Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.hongli-41upg.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
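
The behaviour verified here matches the title of the linked pull request, "Assume ingresscontroller is external absent status". A minimal sketch of that defaulting rule follows, applied to the ingresscontroller as a plain dict (for example, parsed from `-o json` output). It is an illustration of the rule as described in this bug, not the operator's actual Go code, and the function name `effective_scope` is hypothetical.

```python
def effective_scope(ingresscontroller):
    """Return the load balancer scope to assume for an ingresscontroller.

    Assumption (from the PR title only): when
    status.endpointPublishingStrategy.loadBalancer is absent, the
    ingresscontroller is treated as External.
    """
    status = ingresscontroller.get("status") or {}
    strategy = status.get("endpointPublishingStrategy") or {}
    if strategy.get("type") != "LoadBalancerService":
        return None  # no managed load balancer service to reason about
    load_balancer = strategy.get("loadBalancer") or {}
    return load_balancer.get("scope") or "External"
```

For the status shown above, where `endpointPublishingStrategy` contains only `type: LoadBalancerService`, this sketch yields `"External"`, which agrees with the public ELB hostname reported for the `router-default` service.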

Comment 10 errata-xmlrpc 2021-02-24 15:38:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 11 W. Trevor King 2021-04-05 17:46:33 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
