Bug 1904582

Summary: All application traffic broken due to unexpected load balancer change on 4.6.4 -> 4.6.6 upgrade
Product: OpenShift Container Platform
Reporter: Ben Browning <bbrownin>
Component: Networking
Sub component: router
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: afield, aos-bugs, dmoessne, jhou, jnordell, mmasters, nmalik, sdodson, sreber, wking
Version: 4.6.z
Keywords: Regression
Target Release: 4.7.0   
Last Closed: 2021-02-24 15:38:11 UTC
Type: Bug
Bug Blocks: 1904594    
Attachments:
  ingresscontroller yaml after upgrade (flags: none)
  ingress operator logs after upgrade (flags: none)

Description Ben Browning 2020-12-04 20:22:07 UTC
Description of problem:

When upgrading my long-lived OpenShift cluster (it started life as 4.1.x) from 4.6.4 to 4.6.6, the OpenShift Console and all application URLs (anything in the *.apps.<my-cluster> domain) stopped working. Digging into the AWS cloud console, it appears the *.apps DNS entry was repointed at the internal AWS load balancer upon upgrading to 4.6.6; prior to that upgrade, the DNS entry pointed at the external AWS load balancer.

As a result, none of the cluster's application traffic was reachable from outside the cluster: all of the app domain names now resolved to internal 10.*.*.* IP addresses instead of the publicly routable addresses they resolved to on 4.6.4.
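
A rough way to confirm the symptom from the CLI (a sketch; the hostname below is a placeholder for any route under your own cluster's apps domain):

    # Do apps hostnames now resolve to private 10.x addresses?
    dig +short console-openshift-console.apps.<my-cluster-domain>

    # Was the router's LoadBalancer service switched to the internal-only annotation?
    oc -n openshift-ingress get svc router-default -o yaml | grep aws-load-balancer-internal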


Version-Release number of selected component (if applicable):

OCP 4.6.4 -> 4.6.6 upgrade


How reproducible:

Every time


Steps to Reproduce:
1. Create an OpenShift 4.1.x cluster in a public cloud provider (I used AWS).
2. Upgrade this cluster incrementally until you reach 4.6.4.
3. Deploy any application with a corresponding OpenShift Route under the default apps domain. Verify that the application and route work (see the check sketched after these steps).
4. Upgrade the cluster to 4.6.6.
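
A minimal check for the verification in step 3 (a sketch; <my-route> is a placeholder for whatever route you created):

    # The route host should resolve to public addresses and answer over the internet.
    HOST=$(oc get route <my-route> -o jsonpath='{.spec.host}')
    dig +short "$HOST"
    curl -sko /dev/null -w '%{http_code}\n' "https://$HOST"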

Actual results:

The OpenShift Console as well as the application will no longer be accessible over the internet.

Expected results:

The OpenShift Console and application should still be accessible over the internet.

Additional info:

Comment 1 Ben Browning 2020-12-04 20:25:08 UTC
Created attachment 1736517 [details]
ingresscontroller yaml after upgrade

Comment 2 Ben Browning 2020-12-04 20:25:54 UTC
Created attachment 1736519 [details]
ingress operator logs after upgrade

Comment 3 Clayton Coleman 2020-12-04 21:08:57 UTC
When we have a fix for this, we should make sure to normalize the data and add a test case that verifies future clusters cannot regress.  It also highlights a weakness in our testing regime - this was only accidentally caught by someone upgrading a personal cluster.

Comment 4 Miciah Dashiel Butler Masters 2020-12-04 21:15:12 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  
  Clusters on AWS that are upgraded from OpenShift 4.1 to OpenShift 4.6.6.

  All edges leading to 4.6.6 must be blocked.

What is the impact?  Is it serious enough to warrant blocking edges?

  If the cluster's default ingresscontroller was created on OpenShift 4.1, uses the LoadBalancerService endpoint publishing strategy, and has not been recreated on a later OpenShift version, then upgrading to OpenShift 4.6.6 causes the ingresscontroller's load balancer's scope to be changed to internal.  All routes (including the Console and OAuth routes) then become inaccessible from outside the cluster's VPN.

  This is serious enough to block edges.  
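
  A rough way to tell whether a given cluster's default ingresscontroller is in the affected 4.1-era state (a sketch, assuming such ingresscontrollers carry no explicit loadBalancer scope under their endpoint publishing strategy):

    # A strategy showing only "type: LoadBalancerService", with no loadBalancer
    # scope underneath, is the 4.1-era shape described above.
    oc -n openshift-ingress-operator get ingresscontroller/default -o yaml | grep -A 3 endpointPublishingStrategy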

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Remediation requires either deleting and recreating or patching the default ingresscontroller:

    oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'

  If the administrator does not have a valid OAuth token, they must perform the remediation from inside the cluster's VPN in order to obtain a new OAuth token.
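
  The delete-and-recreate alternative mentioned above would look roughly like the following (a sketch; deleting the default ingresscontroller briefly disrupts all ingress while the operator recreates it and provisions a new load balancer and wildcard DNS record):

    oc -n openshift-ingress-operator delete ingresscontrollers/default
    # The ingress operator recreates the default ingresscontroller automatically;
    # watch for the replacement service to receive an external hostname.
    oc -n openshift-ingress get svc router-default -w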

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
 
  Yes, this is a regression.

Comment 5 W. Trevor King 2020-12-04 21:23:28 UTC
[1] is blocking * -> 4.6.6 and * -> 4.6.7 to keep folks off the impacted releases.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/577

Comment 7 Hongan Li 2020-12-08 10:03:52 UTC
Verified by upgrading a cluster 4.1.41 -> 4.2.36 -> 4.3.40 -> 4.5.22 -> 4.6.4 -> 4.7.0-0.nightly-2020-12-07-232943; the upgrade passed, and the LB service remains external even though status.endpointPublishingStrategy.loadBalancer is nil.


# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)                      AGE
router-default            LoadBalancer   172.30.61.1    a0c4dd66838f011eb999202803dda8bb-1181204001.us-east-2.elb.amazonaws.com   80:31612/TCP,443:30343/TCP   9h

# oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-12-08T00:55:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "267207"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 0bd4f3aa-38f0-11eb-9992-02803dda8bb2
spec: {}
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-12-08T02:14:06Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-12-08T09:50:34Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2020-12-08T09:43:57Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.hongli-41upg.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

Comment 10 errata-xmlrpc 2021-02-24 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 11 W. Trevor King 2021-04-05 17:46:33 UTC
Removing the UpgradeBlocker keyword from this older bug to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475