Bug 1904582

Summary: All application traffic broken due to unexpected load balancer change on 4.6.4 -> 4.6.6 upgrade
Product: OpenShift Container Platform
Reporter: Ben Browning <bbrownin>
Component: Networking
Sub component: router
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: afield, aos-bugs, dmoessne, jhou, jnordell, mmasters, nmalik, sdodson, sreber, wking
Version: 4.6.z
Keywords: Regression
Target Release: 4.7.0   
Last Closed: 2021-02-24 15:38:11 UTC
Type: Bug
Bug Blocks: 1904594    
Attachments:
  ingresscontroller yaml after upgrade (flags: none)
  ingress operator logs after upgrade (flags: none)

Description Ben Browning 2020-12-04 20:22:07 UTC
Description of problem:

When upgrading my long-lived OpenShift cluster (it started life as 4.1.x) from 4.6.4 to 4.6.6, the OpenShift Console and all application URLs (anything in the *.apps.<my-cluster> domain) stopped working. Digging into the AWS cloud console, it appears the *.apps DNS entry was repointed at the internal AWS load balancer upon upgrading to 4.6.6; prior to that upgrade, the DNS entry pointed at the external AWS load balancer.

As a result, none of the cluster's application traffic was reachable from outside the cluster: all of the app domain names now resolved to internal 10.*.*.* IP addresses instead of the publicly routable addresses they resolved to on 4.6.4.
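
A rough way to confirm the symptom from the CLI (a sketch; the hostname below is a placeholder for any route under your own cluster's apps domain):

    # Do apps hostnames now resolve to private 10.x addresses?
    dig +short console-openshift-console.apps.<my-cluster-domain>

    # Was the router's LoadBalancer service switched to the internal-only annotation?
    oc -n openshift-ingress get svc router-default -o yaml | grep aws-load-balancer-internal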


Version-Release number of selected component (if applicable):

OCP 4.6.4 -> 4.6.6 upgrade


How reproducible:

Every time


Steps to Reproduce:
1. Create an OpenShift 4.1.x cluster in a public cloud provider (I used AWS).
2. Upgrade this cluster incrementally until you reach 4.6.4.
3. Deploy any application with a corresponding OpenShift Route under the default apps domain. Verify that the application and route work (see the check sketched after these steps).
4. Upgrade the cluster to 4.6.6.
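
A minimal check for the verification in step 3 (a sketch; <my-route> is a placeholder for whatever route you created):

    # The route host should resolve to public addresses and answer over the internet.
    HOST=$(oc get route <my-route> -o jsonpath='{.spec.host}')
    dig +short "$HOST"
    curl -sko /dev/null -w '%{http_code}\n' "https://$HOST"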

Actual results:

The OpenShift Console as well as the application will no longer be accessible over the internet.

Expected results:

The OpenShift Console and application should still be accessible over the internet.

Additional info:

Comment 1 Ben Browning 2020-12-04 20:25:08 UTC
Created attachment 1736517 [details]
ingresscontroller yaml after upgrade

Comment 2 Ben Browning 2020-12-04 20:25:54 UTC
Created attachment 1736519 [details]
ingress operator logs after upgrade

Comment 3 Clayton Coleman 2020-12-04 21:08:57 UTC
When we have a fix for this, we should make sure to normalize the data and add a test case that verifies future clusters cannot regress.  It also highlights a weakness in our testing regime - this was only accidentally caught by someone upgrading a personal cluster.

Comment 4 Miciah Dashiel Butler Masters 2020-12-04 21:15:12 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  
  Clusters on AWS that are upgraded from OpenShift 4.1 to OpenShift 4.6.6.

  All edges leading to 4.6.6 must be blocked.

What is the impact?  Is it serious enough to warrant blocking edges?

  If the cluster's default ingresscontroller was created on OpenShift 4.1, uses the LoadBalancerService endpoint publishing strategy, and has not been recreated on a later OpenShift version, then upgrading to OpenShift 4.6.6 causes the ingresscontroller's load balancer's scope to be changed to internal.  All routes (including the Console and OAuth routes) then become inaccessible from outside the cluster's VPN.

  This is serious enough to block edges.  
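
  A rough way to tell whether a given cluster's default ingresscontroller is in the affected 4.1-era state (a sketch, assuming such ingresscontrollers carry no explicit loadBalancer scope under their endpoint publishing strategy):

    # A strategy showing only "type: LoadBalancerService", with no loadBalancer
    # scope underneath, is the 4.1-era shape described above.
    oc -n openshift-ingress-operator get ingresscontroller/default -o yaml | grep -A 3 endpointPublishingStrategy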

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Remediation requires either deleting and recreating or patching the default ingresscontroller:

    oc -n openshift-ingress-operator patch ingresscontrollers/default --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"External"}}}}'

  If the administrator does not have a valid OAuth token, they must perform the remediation from inside the cluster's VPN in order to obtain a new OAuth token.
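
  The delete-and-recreate alternative mentioned above would look roughly like the following (a sketch; deleting the default ingresscontroller briefly disrupts all ingress while the operator recreates it and provisions a new load balancer and wildcard DNS record):

    oc -n openshift-ingress-operator delete ingresscontrollers/default
    # The ingress operator recreates the default ingresscontroller automatically;
    # watch for the replacement service to receive an external hostname.
    oc -n openshift-ingress get svc router-default -w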

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
 
  Yes, this is a regression.

Comment 5 W. Trevor King 2020-12-04 21:23:28 UTC
[1] is blocking * -> 4.6.6 and * -> 4.6.7 to keep folks off the impacted releases.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/577

Comment 7 Hongan Li 2020-12-08 10:03:52 UTC
Verified by upgrading a cluster 4.1.41 -> 4.2.36 -> 4.3.40 -> 4.5.22 -> 4.6.4 -> 4.7.0-0.nightly-2020-12-07-232943; the upgrade passed, and the LB service remains external even though status.endpointPublishingStrategy.loadBalancer is nil.


# oc -n openshift-ingress get svc
NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP                                                               PORT(S)                      AGE
router-default            LoadBalancer   172.30.61.1    a0c4dd66838f011eb999202803dda8bb-1181204001.us-east-2.elb.amazonaws.com   80:31612/TCP,443:30343/TCP   9h

# oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-12-08T00:55:20Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "267207"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 0bd4f3aa-38f0-11eb-9992-02803dda8bb2
spec: {}
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2020-12-08T02:14:05Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2020-12-08T02:14:06Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-12-08T05:43:39Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2020-12-08T08:01:16Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2020-12-08T09:50:34Z"
    message: All replicas are available
    reason: DeploymentReplicasAvailable
    status: "True"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2020-12-08T09:43:57Z"
    status: "True"
    type: PodsScheduled
  - lastTransitionTime: "2020-12-08T09:59:23Z"
    message: Canary route checks for the default ingress controller are successful
    reason: CanaryChecksSucceeding
    status: "True"
    type: CanaryChecksSucceeding
  domain: apps.hongli-41upg.qe.devcluster.openshift.com
  endpointPublishingStrategy:
    type: LoadBalancerService
  observedGeneration: 1
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

Comment 10 errata-xmlrpc 2021-02-24 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 11 W. Trevor King 2021-04-05 17:46:33 UTC
Removing the UpgradeBlocker keyword from this older bug to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475