Bug 2029034 - enabling ExternalCloudProvider leads to inoperative cluster
Summary: enabling ExternalCloudProvider leads to inoperative cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Pierre Prinetti
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-12-04 09:14 UTC by rlobillo
Modified: 2022-03-10 16:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Fixes an error where workers may be deleted or enter a failed state when making changes to a machine object. Changing a machine object is not supported. With this fix, changes to the machine object will be safely ignored.
* Fixes an error where kubelet-serving CSRs were not approved if the machine had additional ports or IP addresses.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:31:36 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather (9.00 MB, application/gzip), 2021-12-04 09:14 UTC, rlobillo


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift cluster-api-provider-openstack pull 210 | 0 | None | Merged | Bug 2022627: Fix nodelink and CSR approval when a machine has multiple addresses | 2021-12-08 16:44:02 UTC
Red Hat Product Errata RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:31:53 UTC

Description rlobillo 2021-12-04 09:14:27 UTC
Created attachment 1844682: must-gather

Description of problem: Enabling ExternalCloudProvider on a healthy 4.10 cluster on top of OSP16.1 causes some clusterOperators to become unavailable or degraded.


Version-Release number of selected component (if applicable): 4.10.0-0.nightly-2021-12-02-033910 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0) with manila and SSL encryption enabled.


How reproducible:

1. Install OCP 4.10 with IPI successfully:

$ tail ostest/.openshift_install.log 
time="2021-12-02T18:53:18Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/ostest/auth/kubeconfig'"
time="2021-12-02T18:53:18Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ostest.shiftstack.com"
time="2021-12-02T18:53:18Z" level=info msg="Login to the console with user: \"kubeadmin\", and password: \"SpEFF-zx7QT-iTYvv-I7mhn\""
time="2021-12-02T18:53:18Z" level=debug msg="Time elapsed per stage:"
time="2021-12-02T18:53:18Z" level=debug msg="                  : 2m0s"
time="2021-12-02T18:53:18Z" level=debug msg="Bootstrap Complete: 29m34s"
time="2021-12-02T18:53:18Z" level=debug msg="               API: 11m37s"
time="2021-12-02T18:53:18Z" level=debug msg=" Bootstrap Destroy: 42s"
time="2021-12-02T18:53:18Z" level=debug msg=" Cluster Operators: 16m39s"
time="2021-12-02T18:53:18Z" level=info msg="Time elapsed: 56m6s"


$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
baremetal                                  4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
cloud-controller-manager                   4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
cloud-credential                           4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
cluster-api                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
cluster-autoscaler                         4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
config-operator                            4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
console                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
csi-snapshot-controller                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
dns                                        4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
etcd                                       4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
image-registry                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
ingress                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
insights                                   4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
kube-apiserver                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
kube-controller-manager                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
kube-scheduler                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
kube-storage-version-migrator              4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
machine-api                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
machine-approver                           4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
machine-config                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
marketplace                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
monitoring                                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
network                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
node-tuning                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
openshift-apiserver                        4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
openshift-controller-manager               4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
openshift-samples                          4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
operator-lifecycle-manager                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
service-ca                                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     
storage                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      14h     

2. Enable the external CCM by adding the ExternalCloudProvider feature gate:

$ oc get featureGate -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: FeatureGate
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2021-12-02T18:18:57Z"
    generation: 2
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 78b3d110-635c-4fdf-b628-b20a1069e8f7
    resourceVersion: "298866"
    uid: 7b9e8597-0186-4c3d-88fa-60856b46629d
  spec:
    customNoUpgrade:
      enabled:
      - ExternalCloudProvider
    featureSet: CustomNoUpgrade
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
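
For reference, a spec like the one above could be applied with a merge patch. This is a minimal sketch, not necessarily how the feature gate was set in this report: the helper name `feature_gate_patch` is hypothetical, and the `oc patch` invocation is shown only as a comment. Note that the CustomNoUpgrade feature set cannot be undone and prevents upgrades.

```shell
# Hypothetical helper: print a JSON merge patch that enables one feature gate
# through the CustomNoUpgrade feature set.
feature_gate_patch() {
  printf '{"spec":{"featureSet":"CustomNoUpgrade","customNoUpgrade":{"enabled":["%s"]}}}\n' "$1"
}

# Possible usage against a live cluster (not executed here):
#   oc patch featuregate cluster --type merge -p "$(feature_gate_patch ExternalCloudProvider)"
```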

3. The external CCM pods are deployed:

$ oc get pods -n openshift-cloud-controller-manager
NAME                                                 READY   STATUS    RESTARTS   AGE
openstack-cloud-controller-manager-c698f7f49-fjw6w   1/1     Running   0          166m
openstack-cloud-controller-manager-c698f7f49-kpqrc   1/1     Running   0          162m


However, the cluster becomes unhealthy and inoperative:

$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-0.nightly-2021-12-02-033910   False       False         True       166m    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ostest.shiftstack.com/healthz": EOF
baremetal                                  4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
cloud-controller-manager                   4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
cloud-credential                           4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
cluster-api                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
cluster-autoscaler                         4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
config-operator                            4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
console                                    4.10.0-0.nightly-2021-12-02-033910   False       False         False      166m    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ostest.shiftstack.com): Get "https://console-openshift-console.apps.ostest.shiftstack.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
csi-snapshot-controller                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
dns                                        4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
etcd                                       4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
image-registry                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
ingress                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
insights                                   4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
kube-apiserver                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
kube-controller-manager                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
kube-scheduler                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
kube-storage-version-migrator              4.10.0-0.nightly-2021-12-02-033910   True        False         False      169m    
machine-api                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
machine-approver                           4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
machine-config                             4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
marketplace                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
monitoring                                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
network                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
node-tuning                                4.10.0-0.nightly-2021-12-02-033910   True        False         False      163m    
openshift-apiserver                        4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
openshift-controller-manager               4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
openshift-samples                          4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
operator-lifecycle-manager                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
service-ca                                 4.10.0-0.nightly-2021-12-02-033910   True        False         False      18h     
storage                                    4.10.0-0.nightly-2021-12-02-033910   True        False         False      17h     
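
With this many operators, the unhealthy rows are easy to miss. A minimal sketch that filters tabular output like the listing above, assuming the column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED; the function name `unhealthy_operators` is hypothetical:

```shell
# Hypothetical helper: read `oc get co` output on stdin and print the names of
# operators that are not Available or that are Degraded.
unhealthy_operators() {
  # Skip the header row and blank lines; $3 is AVAILABLE, $5 is DEGRADED.
  awk 'NR > 1 && NF >= 5 && ($3 != "True" || $5 != "False") { print $1 }'
}

# Possible usage against a live cluster (not executed here):
#   oc get co | unhealthy_operators
```

Applied to the listing above, this would print authentication and console.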


$ oc get co/authentication -o json | jq -r '.status.conditions[] | select (.type=="Degraded")'
{
  "lastTransitionTime": "2021-12-03T09:42:18Z",
  "message": "OAuthServerRouteEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.ostest.shiftstack.com/healthz\": EOF",
  "reason": "OAuthServerRouteEndpointAccessibleController_SyncError",
  "status": "True",
  "type": "Degraded"
}

$ oc logs -n openshift-authentication-operator -l app=authentication-operator| tail
E1203 11:39:22.050357       1 base_controller.go:272] OAuthServerRouteEndpointAccessibleController reconciliation failed: Get "https://oauth-openshift.apps.ostest.shiftstack.com/healthz": EOF
E1203 11:39:22.327328       1 base_controller.go:272] OAuthServerRouteEndpointAccessibleController reconciliation failed: Get "https://oauth-openshift.apps.ostest.shiftstack.com/healthz": EOF

$ oc rsh -n openshift-authentication-operator $(oc get pod -n openshift-authentication-operator -l app=authentication-operator -o NAME) curl -k https://oauth-openshift.apps.ostest.shiftstack.com/healthz
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to oauth-openshift.apps.ostest.shiftstack.com:443

The same command works on a cluster using the in-tree cloud provider:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-12-02-155018   True        False         14h     Cluster version is 4.9.0-0.nightly-2021-12-02-155018

$ oc rsh -n openshift-authentication-operator $(oc get pod -n openshift-authentication-operator -l app=authentication-operator -o NAME) curl -k https://oauth-openshift.apps.ostest.shiftstack.com/healthz
ok



Actual results: The cluster is inoperative after enabling the Technology Preview (TP) feature.


Expected results: The TP feature is enabled successfully and the cluster remains healthy.


Additional info:
 - Must-gather attached.
 - install-config.yaml attached.
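
The Doc Text for this bug mentions kubelet-serving CSRs that were not approved. A minimal sketch for spotting such requests, assuming the `oc get csr` column order NAME, AGE, SIGNERNAME, REQUESTOR, REQUESTEDDURATION, CONDITION; the function name `pending_serving_csrs` is hypothetical, and no CSR output was captured in this report:

```shell
# Hypothetical helper: read `oc get csr` output on stdin and print the names of
# kubelet-serving CSRs whose condition is still Pending.
pending_serving_csrs() {
  # $3 is SIGNERNAME; the last field is CONDITION.
  awk 'NR > 1 && $3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { print $1 }'
}

# Possible usage against a live cluster (not executed here):
#   oc get csr | pending_serving_csrs
```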

Comment 5 rlobillo 2021-12-09 11:43:33 UTC
Verified on 4.10.0-0.nightly-2021-12-06-201335 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0)

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         10m     Cluster version is 4.10.0-0.nightly-2021-12-06-201335                                                       

$ oc get featureGate cluster -o yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2021-12-09T10:40:57Z"
  generation: 1
  name: cluster
  resourceVersion: "1420"
  uid: dcfaf925-591b-4628-a9a6-0a104d2afa74
spec:
  customNoUpgrade:
    enabled:
    - ExternalCloudProvider
  featureSet: CustomNoUpgrade

Comment 10 errata-xmlrpc 2022-03-10 16:31:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

