Description of problem:
- CSO is degraded during an upgrade because vsphere-problem-detector doesn't trust the vCenter CA.

Version-Release number of selected component (if applicable):
- RHOCP 4.7.z
- CSO 4.7.z

How reproducible:
- Every time

Steps to Reproduce:
- Install an OCP 4.7.z cluster with vSphere platform integration.
- In the install-config.yaml file, add the vCenter CA cert in additionalTrustBundle.
- Once the install is complete, check if CSO is degraded.
- Confirm if cm/cloud-provider-config has 'insecure-flag = "1"' set in it.
- If it is set, remove the flag and wait for CSO to transition to a degraded state.

Steps to Reproduce with an upgrade:
- Confirm that the vCenter CA cert is imported in OCP.
- If cm/cloud-provider-config has 'insecure-flag = "1"' set, remove it.
- Start the upgrade (4.6.z --> 4.7.z).
- The cluster upgrade will stall while upgrading CSO (degraded). It will only progress once 'insecure-flag = "1"' is set again in the CM.

Actual results:
- The `vsphere-problem-detector-operator` pod in the openshift-cluster-storage-operator project does not appear to mount any custom CA configmap.
- The CSO remains degraded with the below error message:
~~~
VSphereProblemDetectorControllerDegraded: failed to connect to <vCenterFQDN>: Post "https://<vCenterFQDN>/sdk": x509: certificate signed by unknown authority
~~~

Expected results:
- The `vsphere-problem-detector-operator` pod in the openshift-cluster-storage-operator project should mount a CA configmap to allow direct communication with vCenter.
- If not via a configmap, there should be built-in logic to trust the vCenter CA.

Additional info:
- cloud-provider-config did NOT include `insecure-flag = "1"` previously, when the cluster was working correctly (creating/attaching PVCs via vCenter) on v4.6.
- Setting 'insecure-flag = "1"' in cloud-provider-config is considered a workaround, not a fix.
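For reference, the vCenter CA is supplied to the installer through the additionalTrustBundle field in install-config.yaml; a minimal sketch with placeholder certificate contents:

~~~
# install-config.yaml (fragment) -- additionalTrustBundle carries the
# private CA chain that signed the vCenter server certificate.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <PEM-encoded vCenter root CA certificate>
  -----END CERTIFICATE-----
~~~

The installer copies this bundle into the user-ca-bundle configmap in the openshift-config namespace.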
This should be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1959546; the cluster won't become Degraded, it should just emit an alert. Still, if vsphere-problem-detector cannot connect to the vSphere API, it most probably means that the rest of OCP cannot connect there either and the cluster *is* broken. Storage and the Machine API will not work, and who knows what else breaks. Cluster admins should make sure that the TLS certificates and their configuration (or insecure-flag = "1") are set correctly.

*** This bug has been marked as a duplicate of bug 1959546 ***
Once the update completes, `insecure-flag = "1"` can be removed from the cloud provider config again and PVCs can be created/attached (because the components that do these tasks correctly mount/utilise the custom CAs injected using the additionalTrustBundle parameter, just as they did in v4.6). However, the vsphere-problem-detector pods do not mount/utilise that custom CA, so the storage ClusterOperator returns to Degraded.

In our case, the TLS certificates for the vCenter ARE set correctly, and the necessary custom CA is included in additionalTrustBundle/user-ca-bundle, which is mounted and utilised by most of the OpenShift components that talk to vCenter. Unfortunately, not by vsphere-problem-detector. Simply adding insecure-flag to the cloud provider config isn't a particularly great workaround, because it means the connection made with the vCenter service account can be intercepted via a man-in-the-middle attack.
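For context, the workaround in question is the insecure-flag entry in the vSphere cloud provider INI carried by cm/cloud-provider-config; a sketch of the relevant fragment, with surrounding keys elided (the [Global] section is where the standard vSphere cloud provider config places this key):

~~~
# "config" data key of cm/cloud-provider-config (openshift-config namespace), INI format
[Global]
insecure-flag = "1"   ; disables vCenter TLS verification -- a workaround, not a fix
~~~

Removing this line restores TLS verification, which is what exposes the missing CA trust in vsphere-problem-detector.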
I'd propose that this is NOT a duplicate of 1959546. The credentials (username, password, and CA certificate) in our case ARE valid. But the vsphere-problem-detector pods do not mount/utilise the additionalTrustBundle/user-ca-bundle ...
Did you do anything special to make the certs available in the KCM pods, or did you just follow the process documented at https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#installation-adding-vcenter-root-certificates_installing-vsphere-installer-provisioned ?
At the time the cluster in question was installed (on 4.5 originally), I referenced the user-ca-bundle (autocreated by the install from additionalTrustBundle) in a clusterwide proxy object:

~~~
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  trustedCA:
    name: user-ca-bundle
~~~

(procedure from https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html )

Re-reading the docs page ( https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html#certificate-injection-using-operators_configuring-a-custom-pki ), I might experiment in a dev cluster with adding a configmap with the label config.openshift.io/inject-trusted-cabundle="true" to the openshift-cluster-storage-operator project... it might work... most of the other openshift-* projects have one, but the openshift-cluster-storage-operator project does not. I'll report back when I've tested this...
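The experiment described above amounts to creating an empty ConfigMap carrying the injection label; a sketch, where the name vsphere-problem-detector-trusted-ca is my own placeholder, not an actual resource:

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: vsphere-problem-detector-trusted-ca   # placeholder name
  namespace: openshift-cluster-storage-operator
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
# Once labelled, the network operator populates data["ca-bundle.crt"] with
# the well-known CAs plus the user-ca-bundle contents.
~~~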
Alright - thanks for updates. I am going to reopen this bug and treat it as different from other BZ.
So, adding a configmap with the label config.openshift.io/inject-trusted-cabundle="true" does result in the CNO automatically populating that configmap with the correct set of CA certs (the "well-known" CAs with any custom CA certs from "user-ca-bundle" concatenated). However, the vsphere-problem-detector-operator pod doesn't mount the configmap, so this alone doesn't really fix the problem. If you oc rsh into the vsphere-problem-detector-operator pod, you can view the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem it uses; this ONLY includes the "well-known" CAs. This is why trying to talk to a vCenter with a cert signed by a private CA gets the "x509: certificate signed by unknown authority" error.

For comparison, an operator that handles the custom CAs correctly is openshift-machine-api: inside the openshift-machine-api project there is a configmap "mao-trusted-ca" which has the config.openshift.io/inject-trusted-cabundle label, so it is auto-populated with the correct CAs. Also, the "machine-controller" container in the machine-api-controllers pod mounts that configmap such that its /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem file is replaced with the correct set of CAs; this is why it can talk to vCenter to scale machines even without insecure-flag being set.

The desired behaviour needed to fix this bug in vsphere-problem-detector is (by whatever mechanism) for the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem in the vsphere-problem-detector-operator pod to include both the "well-known" CA certs and the contents of user-ca-bundle; this will let it talk to a vCenter with a private CA.
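Following the openshift-machine-api pattern described above, the fix would roughly mean mounting such an injected ConfigMap over the pod's extracted trust bundle; a hedged sketch of the Deployment fragment (names are illustrative, not the actual shipped manifest):

~~~
# fragment of a Deployment pod spec -- illustrative only
spec:
  containers:
  - name: vsphere-problem-detector-operator
    volumeMounts:
    - name: trusted-ca
      mountPath: /etc/pki/ca-trust/extracted/pem
      readOnly: true
  volumes:
  - name: trusted-ca
    configMap:
      name: vsphere-problem-detector-trusted-ca   # placeholder; must carry the injection label
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
~~~

The items mapping renames the injected ca-bundle.crt key to tls-ca-bundle.pem, so the merged bundle replaces the file the pod already reads, mirroring what the machine-controller container does.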
Fix: https://github.com/openshift/cluster-storage-operator/pull/178

Somewhat related fix to the CSI driver (tech preview for now): https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/28
Hi, I installed clusters with additionalTrustBundle and a proxy env on both 4.8 and 4.9.

In OCP 4.8 (without the fix), after removing `insecure-flag = "1"` in cm/cloud-provider-config, I got the following error in the storage CO (the storage CO doesn't become degraded due to another bug fix):

~~~
- lastTransitionTime: "2021-07-30T05:39:47Z"
  message: 'VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": proxyconnect tcp: x509: certificate signed by unknown authority'
  reason: AsExpected
  status: "True"
  type: Available
~~~

In 4.9.0-0.nightly-2021-07-29-103526 (with the fix), after removing `insecure-flag = "1"`, the storage CO doesn't report such a message. I also checked inside the vsphere-problem-detector-operator pod: the additional CA is added in /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem.

@Jan and @gellner, before changing the status to "Verified", could you help confirm if my verification is OK from your side?
From my side it looks OK.
Changed status to "Verified" per the test result above, and also checked with the vSphere CSI driver.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759