Bug 1904582 - All application traffic broken due to unexpected load balancer change on 4.6.4 -> 4.6.6 upgrade
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.7.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
Depends On:
Blocks: 1904594
Reported: 2020-12-04 20:22 UTC by Ben Browning
Modified: 2022-08-04 22:30 UTC (History)
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-02-24 15:38:11 UTC
Target Upstream Version:

Attachments
ingresscontroller yaml after upgrade (3.69 KB, text/plain)
2020-12-04 20:25 UTC, Ben Browning
ingress operator logs after upgrade (29.99 KB, text/plain)
2020-12-04 20:25 UTC, Ben Browning

System ID Private Priority Status Summary Last Updated
Github openshift cluster-ingress-operator pull 502 0 None closed Bug 1904582: Assume ingresscontroller is external absent status 2021-02-15 06:01:23 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:40:03 UTC

Description Ben Browning 2020-12-04 20:22:07 UTC
Description of problem:

When upgrading my long-lived OpenShift cluster (started life as 4.1.x) from 4.6.4 to 4.6.6, the OpenShift Console and all application URLs (anything in the *.apps.<my-cluster> domain) stopped working. Digging into the AWS cloud console, it appears the *.apps DNS entry got pointed to the internal AWS load balancer upon upgrading to 4.6.6. Prior to that upgrade, the DNS entry pointed to the external AWS load balancer.

As a result, none of the cluster's application traffic was reachable from outside the cluster: all the app domain names now resolved to internal 10.*.*.* addresses instead of publicly routable ones, as they did on 4.6.4.
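One quick way to spot this symptom is to resolve a route hostname and check whether the answers are RFC 1918 (private) addresses. This is a minimal sketch; the helper name `is_internal_ip` is illustrative, not part of any OpenShift tooling:

```shell
# Hypothetical helper: classify an address as RFC 1918 (private) or public.
# If *.apps names resolve to private addresses, the DNS entry is pointing
# at the internal load balancer, as described above.
is_internal_ip() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Against a live cluster (requires dig and external DNS access):
#   dig +short "console-openshift-console.apps.<my-cluster>" | while read -r ip; do
#     is_internal_ip "$ip" && echo "AFFECTED: $ip is private" || echo "ok: $ip"
#   done
```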

Version-Release number of selected component (if applicable):

OCP 4.6.4 -> 4.6.6 upgrade

How reproducible:

Every time

Steps to Reproduce:
1. Create an OpenShift 4.1.x cluster in a public cloud provider (I used AWS).
2. Upgrade this cluster incrementally until you reach 4.6.4.
3. Deploy any application with a corresponding OpenShift Route under the default apps domain. Verify the application and route work.
4. Upgrade the cluster to 4.6.6.

Actual results:

The OpenShift Console as well as the application will no longer be accessible over the internet.

Expected results:

The OpenShift Console and application should still be accessible over the internet.

Additional info:

Comment 1 Ben Browning 2020-12-04 20:25:08 UTC
Created attachment 1736517 [details]
ingresscontroller yaml after upgrade

Comment 2 Ben Browning 2020-12-04 20:25:54 UTC
Created attachment 1736519 [details]
ingress operator logs after upgrade

Comment 3 Clayton Coleman 2020-12-04 21:08:57 UTC
When we have a fix for this, we should make sure to normalize the data and add a test case that verifies future clusters cannot regress.  It also highlights a weakness in our testing regime - this was only accidentally caught by someone upgrading a personal cluster.

Comment 4 Miciah Dashiel Butler Masters 2020-12-04 21:15:12 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  Clusters on AWS that are upgraded from OpenShift 4.1 to OpenShift 4.6.6.

  All edges leading to 4.6.6 must be blocked.

What is the impact?  Is it serious enough to warrant blocking edges?

  If the cluster's default ingresscontroller was created on OpenShift 4.1, uses the LoadBalancerService endpoint publishing strategy, and has not been recreated on a later OpenShift version, then upgrading to OpenShift 4.6.6 causes the ingresscontroller's load balancer's scope to be changed to internal.  All routes (including the Console and OAuth routes) then become inaccessible from outside the cluster's VPN.

  This is serious enough to block edges.  

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Remediation requires either deleting and recreating or patching the default ingresscontroller:

    oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'

  If the administrator does not have a valid OAuth token, the administrator must perform the remediation from inside the cluster's VPN in order to get a new OAuth token.
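  An administrator can check whether a cluster is exposed before patching by inspecting the recorded load-balancer scope. This is a sketch under the assumption (from the description above) that an empty scope in status, or a scope already flipped to Internal, marks a cluster needing the remediation; the function name is illustrative:

```shell
# Hypothetical check for whether the remediation above is needed.
# Affected clusters report no scope in
# status.endpointPublishingStrategy.loadBalancer (or it was flipped to
# Internal by the upgrade), so either value is suspect.
scope_needs_patch() {
  # $1 is the value of status.endpointPublishingStrategy.loadBalancer.scope
  [ -z "$1" ] || [ "$1" = "Internal" ]
}

# Against a live cluster:
#   scope=$(oc -n openshift-ingress-operator get ingresscontrollers/default \
#     -o jsonpath='{.status.endpointPublishingStrategy.loadBalancer.scope}')
#   scope_needs_patch "$scope" && echo "apply the patch above"
```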

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  Yes, this is a regression.

Comment 5 W. Trevor King 2020-12-04 21:23:28 UTC
[1] is blocking * -> 4.6.6 and * -> 4.6.7 to keep folks off the impacted releases.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/577

Comment 7 Hongan Li 2020-12-08 10:03:52 UTC
Verified by upgrading a cluster from 4.1.41 -> 4.2.36 -> 4.3.40 -> 4.5.22 -> 4.6.4 -> 4.7.0-0.nightly-2020-12-07-232943; the upgrade passed, and the LB service is still external even though status.endpointPublishingStrategy.loadBalancer is nil.

# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)                      AGE
router-default            LoadBalancer    a0c4dd66838f011eb999202803dda8bb-1181204001.us-east-2.elb.amazonaws.com   80:31612/TCP,443:30343/TCP   9h

# oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-12-08T00:55:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "267207"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 0bd4f3aa-38f0-11eb-9992-02803dda8bb2
spec: {}
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-12-08T02:14:06Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-12-08T09:50:34Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2020-12-08T09:43:57Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.hongli-41upg.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
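The verified behavior above (the LB stays external when status.endpointPublishingStrategy.loadBalancer is nil) matches the linked fix, cluster-ingress-operator PR 502 ("Assume ingresscontroller is external absent status"). Its semantics can be sketched as follows; the function name is illustrative, not taken from the operator's code:

```shell
# Sketch of the fix's semantics: when the recorded scope is absent,
# assume the pre-4.6.6 default of External rather than Internal.
effective_scope() {
  if [ -z "$1" ]; then
    echo "External"   # nil status: assume external, per PR 502
  else
    echo "$1"         # an explicit scope is honored as-is
  fi
}
```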

Comment 10 errata-xmlrpc 2021-02-24 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 11 W. Trevor King 2021-04-05 17:46:33 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
