Bug 1904594 - All application traffic broken due to unexpected load balancer change on 4.6.4 -> 4.6.6 upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Routing
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On: 1904582
Blocks:
 
Reported: 2020-12-04 21:29 UTC by OpenShift BugZilla Robot
Modified: 2020-12-14 13:51 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-14 13:51:28 UTC
Target Upstream Version:




Links:
- GitHub: openshift/cluster-ingress-operator pull 503 (closed): "[release-4.6] Bug 1904594: Assume ingresscontroller is external absent status" (last updated 2021-01-12 13:25:42 UTC)
- Red Hat Product Errata: RHSA-2020:5259 (last updated 2020-12-14 13:51:41 UTC)

Description OpenShift BugZilla Robot 2020-12-04 21:29:49 UTC
+++ This bug was initially created as a clone of Bug #1904582 +++

Description of problem:

When upgrading my long-lived OpenShift cluster (it started life as 4.1.x) from 4.6.4 to 4.6.6, the OpenShift Console and all application URLs (anything in the *.apps.<my-cluster> domain) stopped working. Digging into the AWS cloud console, it appears the *.apps DNS entry was pointed at the internal AWS load balancer during the upgrade to 4.6.6. Prior to that upgrade, the DNS entry pointed at the external AWS load balancer.

As a result, none of the cluster's application traffic was reachable from outside the cluster: all of the app domain names now resolved to internal 10.*.*.* IP addresses instead of the publicly routable ones they resolved to on 4.6.4.


Version-Release number of selected component (if applicable):

OCP 4.6.4 -> 4.6.6 upgrade


How reproducible:

Every time


Steps to Reproduce:
1. Create an OpenShift 4.1.x cluster in a public cloud provider (I used AWS).
2. Upgrade this cluster incrementally until you reach 4.6.4.
3. Deploy any application with a corresponding OpenShift Route under the default apps domain. Verify the application and route work.
4. Upgrade the cluster to 4.6.6.

Actual results:

The OpenShift Console as well as the application will no longer be accessible over the internet.

Expected results:

The OpenShift Console and application should still be accessible over the internet.

Additional info:

--- Additional comment from bbrownin@redhat.com on 2020-12-04 20:25:08 UTC ---

Created attachment 1736517 [details]
ingresscontroller yaml after upgrade

--- Additional comment from bbrownin@redhat.com on 2020-12-04 20:25:54 UTC ---

Created attachment 1736519 [details]
ingress operator logs after upgrade

--- Additional comment from ccoleman@redhat.com on 2020-12-04 21:08:57 UTC ---

When we have a fix for this, we should make sure to normalize the data and add a test case that verifies future clusters cannot regress.  It also highlights a weakness in our testing regime - this was only accidentally caught by someone upgrading a personal cluster.

--- Additional comment from mmasters@redhat.com on 2020-12-04 21:15:12 UTC ---

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  
  Clusters on AWS that are upgraded from OpenShift 4.1 to OpenShift 4.6.6.

  All edges leading to 4.6.6 must be blocked.

What is the impact?  Is it serious enough to warrant blocking edges?

  If the cluster's default ingresscontroller was created on OpenShift 4.1, uses the LoadBalancerService endpoint publishing strategy, and has not been recreated on a later OpenShift version, then upgrading to OpenShift 4.6.6 causes the ingresscontroller's load balancer's scope to be changed to internal.  All routes (including the Console and OAuth routes) then become inaccessible from outside the cluster's VPC.

  This is serious enough to block edges.  

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Remediation requires either deleting and recreating or patching the default ingresscontroller:

    oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'

  If the administrator does not have a valid OAuth token, the administrator must perform the remediation from inside the cluster's VPC (or over a VPN into it) in order to obtain a new OAuth token.
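For anyone applying the remediation from automation rather than the CLI, the merge-patch body passed to `oc patch` above can be built programmatically instead of hand-writing JSON. A sketch using plain encoding/json; the struct mirrors only the fields the patch touches, not the full IngressController API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scopePatch mirrors just the spec fields the remediation patch sets.
type scopePatch struct {
	Spec struct {
		EndpointPublishingStrategy struct {
			LoadBalancer struct {
				Scope string `json:"scope"`
			} `json:"loadBalancer"`
		} `json:"endpointPublishingStrategy"`
	} `json:"spec"`
}

// externalScopePatch returns the JSON merge patch that forces the
// default ingresscontroller's load balancer scope back to External.
func externalScopePatch() ([]byte, error) {
	var p scopePatch
	p.Spec.EndpointPublishingStrategy.LoadBalancer.Scope = "External"
	return json.Marshal(&p)
}

func main() {
	b, _ := externalScopePatch()
	fmt.Println(string(b))
	// {"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}
}
```

The emitted JSON is byte-for-byte the `--patch` argument shown in the oc command above, so the same bytes can be sent as a merge patch through any Kubernetes client.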

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
 
  Yes, this is a regression.

--- Additional comment from wking@redhat.com on 2020-12-04 21:23:28 UTC ---

[1] is blocking * -> 4.6.6 and * -> 4.6.7 to keep folks off the impacted releases.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/577
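The title of the linked fix (PR 503, "Assume ingresscontroller is external absent status") indicates the direction of the change: when an old ingresscontroller's status does not record a load balancer scope, the operator must default to External, matching how those load balancers were actually provisioned, instead of treating the absent value as Internal. A simplified, hypothetical sketch of that defaulting rule; the type names here are illustrative, not the actual operator code:

```go
package main

import "fmt"

// LoadBalancerStrategy records the load balancer's scope,
// "External" or "Internal".
type LoadBalancerStrategy struct {
	Scope string
}

// EndpointPublishingStrategy is a cut-down stand-in for the status
// field; LoadBalancer is nil on clusters created before the field
// existed (e.g. clusters born on OpenShift 4.1).
type EndpointPublishingStrategy struct {
	Type         string
	LoadBalancer *LoadBalancerStrategy
}

// effectiveScope returns the scope the operator should act on.
// An absent or empty recorded scope is treated as External, since
// pre-4.6 clusters never wrote it but provisioned external LBs.
func effectiveScope(status *EndpointPublishingStrategy) string {
	if status == nil || status.LoadBalancer == nil || status.LoadBalancer.Scope == "" {
		return "External"
	}
	return status.LoadBalancer.Scope
}

func main() {
	// Status as written by a 4.1-era cluster: no loadBalancer block.
	old := &EndpointPublishingStrategy{Type: "LoadBalancerService"}
	fmt.Println(effectiveScope(old)) // External
}
```

The regression was, in effect, the opposite defaulting choice: 4.6.6 interpreted the missing status as Internal and rescoped the live load balancer accordingly.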

Comment 3 Hongan Li 2020-12-09 08:27:57 UTC
Verified by upgrading a cluster on AWS from 4.1.41 -> 4.2.36 -> 4.3.40 -> 4.4.31 -> 4.5.22 to 4.6.0-0.nightly-2020-12-08-021151; the upgrade passed. The LB service remains external even though status.endpointPublishingStrategy.loadBalancer is nil.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-12-08-021151   True        False         24m     Cluster version is 4.6.0-0.nightly-2020-12-08-021151

# oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-12-08T22:29:10Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "252361"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: cab7d992-39a4-11eb-b5f7-0a1fac23787e
spec: {}
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-12-09T03:34:51Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-12-09T01:17:15Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-12-09T01:17:16Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-12-09T01:17:16Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-12-09T01:17:16Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-12-09T01:17:16Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-12-09T01:17:16Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-12-09T03:34:51Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-12-09T07:18:17Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2020-12-09T07:18:17Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-12-09T07:18:17Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-12-09T07:47:07Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  domain: apps.hongli-41upg.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

Comment 4 Hongan Li 2020-12-09 08:29:15 UTC
# oc -n openshift-ingress get svc/router-default
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP                                                              PORT(S)                      AGE
router-default   LoadBalancer   172.30.101.66   acb14c57339a411ebb5f70a1fac23787-760536154.us-east-2.elb.amazonaws.com   80:30232/TCP,443:32319/TCP   9h

Comment 6 errata-xmlrpc 2020-12-14 13:51:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.8 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5259

