Bug 1969719

Summary: vsphere-problem-detector cannot connect to vCenter API over https
Product: OpenShift Container Platform
Reporter: Siddhant More <simore>
Component: Storage
Assignee: Jan Safranek <jsafrane>
Storage sub component: Operators
QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, gellner, hekumar, jsafrane
Version: 4.7
Keywords: Reopened
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2034270
Environment:
Last Closed: 2021-10-18 17:33:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 2034270

Description Siddhant More 2021-06-09 06:25:10 UTC
Description of problem:
- CSO is degraded during an upgrade because vsphere-problem-detector doesn't trust the vCenter CA.

Version-Release number of selected component (if applicable):
- RHOCP 4.7.z 
- CSO 4.7.z

How reproducible:
- Every time 

Steps to Reproduce:
- Install an OCP 4.7.z cluster with vSphere platform integration.
- In the install-config.yaml file, add the vCenter CA cert in additionalTrustBundle.
- Once the install is complete, check if CSO is degraded.
- Confirm if cm/cloud-provider-config has 'insecure-flag = "1"' set in it. 
- If it is set, remove the flag and wait for CSO to transition to a degraded state. 
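
The insecure-flag check in the steps above can be scripted. This is a minimal Python sketch, assuming the contents of cm/cloud-provider-config have already been saved to a string. The sample config (section names, secret names, server name) and the helper name insecure_flag_enabled are illustrative assumptions; only the insecure-flag line comes from this report.

```python
import configparser

# Hypothetical sample of the INI-style vSphere cloud provider config.
# Only the insecure-flag line is taken from this bug report.
SAMPLE_CONFIG = """\
[Global]
secret-name = "vsphere-creds"
secret-namespace = "kube-system"
insecure-flag = "1"

[Workspace]
server = "vcenter.example.com"
"""


def insecure_flag_enabled(config_text: str) -> bool:
    """Return True if [Global] insecure-flag is set, i.e. TLS verification is off."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # A missing section or key is treated as "flag not set".
    value = parser.get("Global", "insecure-flag", fallback="0")
    # Values in this config style are usually quoted, so strip the quotes.
    # Treating "true" as enabled is an assumption of this sketch.
    return value.strip().strip('"').lower() in ("1", "true")


if __name__ == "__main__":
    print("insecure-flag set:", insecure_flag_enabled(SAMPLE_CONFIG))
```

If the helper returns False and the vCenter certificate is signed by a private CA, the failure described under "Actual results" would be expected on an affected version.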


Steps to Reproduce with an upgrade: 
- Confirm that the vCenter CA cert is imported in OCP. 
- If the cm/cloud-provider-config has 'insecure-flag = "1"' set, remove it. 
- Start the upgrade. (4.6.z --> 4.7.z)
- The cluster upgrade stalls while upgrading CSO (degraded). It only progresses once 'insecure-flag = "1"' is set again in the CM.

Actual results:
- The `vsphere-problem-detector-operator` pod in openshift-cluster-storage-operator project does not appear to mount any custom CA configmap
- The CSO remains degraded with the below error message:
VSphereProblemDetectorControllerDegraded: failed to connect to <vCenterFQDN>: Post "https://<vCenterFQDN>/sdk": x509: certificate signed by unknown authority

Expected results:
- The `vsphere-problem-detector-operator` pod in openshift-cluster-storage-operator project should mount a CA configmap for it to allow direct communication to vCenter.
- If not via a configmap, there should be inbuilt logic to trust the vCenter CA. 

Additional info:
- The cloud-provider-config did NOT include `insecure-flag = "1"` previously when the cluster was working correctly (creating/attaching PVCs using vCenter) on v4.6.
- Setting 'insecure-flag = "1"' in cloud-provider-config is a workaround, not a fix.

Comment 2 Jan Safranek 2021-06-10 12:34:21 UTC
This should be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1959546, the cluster won't get Degraded, it should just emit an alert.

Still, if vsphere-problem-detector cannot connect to the vSphere API, the rest of OCP most probably cannot connect to it either, and the cluster *is* broken. Storage and Machine API will not work, and who knows what else breaks. Cluster admins should make sure that TLS certificates and their config (or insecure-flag = "1") are set correctly.

*** This bug has been marked as a duplicate of bug 1959546 ***

Comment 3 gellner 2021-06-10 16:20:46 UTC
Once the update completes, `insecure-flag = "1"` can be removed from the cp config again and PVCs can be created/attached (because the components that do these tasks correctly mount/utilise the custom CAs injected using the additionalTrustBundle parameter, just as they did in v4.6). However, the vsphere-problem-detector pods do not mount/utilise that custom CA, so the storage co returns to Degraded.

In our case, the TLS certificates for the vCenter ARE set correctly, and the necessary custom CA is included in the additionalTrustBundle/user-ca-bundle, which is mounted and utilised by most of the OpenShift components that talk to vCenter. Unfortunately, not by vsphere-problem-detector.

Simply adding insecure-flag to the cp config isn't a particularly great workaround, because it means that the credentials of the vCenter service account can be intercepted by a man-in-the-middle attack.
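
To illustrate the trade-off in general terms, here is a minimal Python sketch of the two ways a TLS client can be made to accept a privately signed server. It is not the operator's code (vsphere-problem-detector is not written in Python); the function name build_tls_context and its parameters are hypothetical. Only the standard library ssl module is used.

```python
import ssl
from typing import Optional


def build_tls_context(extra_ca_file: Optional[str] = None,
                      insecure: bool = False) -> ssl.SSLContext:
    """Build a client TLS context that starts from the system ("well-known") CAs."""
    ctx = ssl.create_default_context()  # verification and hostname checks are on
    if insecure:
        # Rough analogue of insecure-flag = "1": the server certificate is not
        # verified at all, so a man-in-the-middle can impersonate the server.
        ctx.check_hostname = False  # must be disabled before verify_mode
        ctx.verify_mode = ssl.CERT_NONE
    elif extra_ca_file is not None:
        # Preferred: keep verification on and additionally trust the private CA.
        ctx.load_verify_locations(cafile=extra_ca_file)
    return ctx


if __name__ == "__main__":
    print(build_tls_context().verify_mode)               # verification required
    print(build_tls_context(insecure=True).verify_mode)  # verification disabled
```

With the insecure variant the client would send the service account credentials to whoever answers on the vCenter address, which is why adding the CA to the trust store is the preferable fix.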

Comment 4 gellner 2021-06-10 16:25:03 UTC
I'd propose that this is NOT a duplicate of 1959546.

The credentials (username, password and CA certificate) in our case ARE valid. But the vsphere-problem-detector pods do not mount/utilise the additionalTrustBundle/user-ca-bundle ...

Comment 5 Hemant Kumar 2021-06-10 16:59:44 UTC
Did you do anything special to make the certs available in KCM pods or did you just follow the process documented in - https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#installation-adding-vcenter-root-certificates_installing-vsphere-installer-provisioned ?

Comment 6 gellner 2021-06-10 17:17:28 UTC
At the time the cluster in question was installed (originally on 4.5), I referenced the user-ca-bundle (auto-created by the installer from additionalTrustBundle) in a cluster-wide proxy object:

apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  trustedCA:
    name: user-ca-bundle

(procedure from https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html )

Re-reading the docs page ( https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html#certificate-injection-using-operators_configuring-a-custom-pki ) , I might experiment in a dev cluster with adding a config map with the label config.openshift.io/inject-trusted-cabundle="true" to the openshift-cluster-storage-operator project... it might work... most of the other openshift-* projects have one. But openshift-cluster-storage-operator project does not ...

I'll report back when I've tested this...

Comment 7 Hemant Kumar 2021-06-10 17:49:50 UTC
Alright - thanks for the updates. I am going to reopen this bug and treat it as different from the other BZ.

Comment 8 gellner 2021-06-11 10:18:53 UTC
So, adding a configmap with the label config.openshift.io/inject-trusted-cabundle="true" does result in the CNO automatically populating that configmap with the correct set of CA certs (the "well-known" CAs with any custom CA certs from "user-ca-bundle" concatenated).
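
As a rough illustration of what "concatenated" means here, this Python sketch merges two PEM bundles into one. It is an assumption-laden sketch, not the injection logic used by the cluster: the function name merge_bundles is hypothetical, the de-duplication step is my own addition, and the certificate bodies are placeholders rather than real certificates.

```python
import re

# Matches one PEM certificate block, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL)


def merge_bundles(*bundles: str) -> str:
    """Concatenate PEM bundles in order, skipping certificates already seen."""
    seen = set()
    merged = []
    for bundle in bundles:
        for cert in PEM_CERT.findall(bundle):
            key = "".join(cert.split())  # ignore whitespace differences
            if key not in seen:
                seen.add(key)
                merged.append(cert.strip())
    return "\n".join(merged) + "\n"


def fake_cert(label: str) -> str:
    """Build a placeholder PEM block; the body is NOT a real certificate."""
    return f"-----BEGIN CERTIFICATE-----\n{label}\n-----END CERTIFICATE-----\n"


if __name__ == "__main__":
    well_known = fake_cert("WELLKNOWN1") + fake_cert("WELLKNOWN2")
    user_ca_bundle = fake_cert("CUSTOMCA")
    print(merge_bundles(well_known, user_ca_bundle))
```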

However, the vsphere-problem-detector-operator pod doesn't mount the configmap so this alone doesn't really fix the problem.

If you oc rsh into the vsphere-problem-detector-operator pod, you can view the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem it uses; this ONLY includes the "well-known" CAs. This is why trying to talk to a vCenter with a private CA signed cert gets the "x509: certificate signed by unknown authority" error.
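
Checking by eye whether a large bundle contains one particular CA is error-prone. The following Python sketch shows one way to do the comparison, assuming the bundle and the custom CA have both been copied out as PEM text. The helper names are hypothetical, and the demo uses placeholder payloads instead of real certificates.

```python
import base64
import hashlib
import re

# Captures the base64 body of each PEM certificate block.
PEM_BODY = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+(.*?)\s+-----END CERTIFICATE-----",
    re.DOTALL)


def cert_fingerprints(pem_text: str) -> set:
    """Return the SHA-256 fingerprints of all certificates in a PEM string."""
    prints = set()
    for body in PEM_BODY.findall(pem_text):
        der = base64.b64decode("".join(body.split()))
        prints.add(hashlib.sha256(der).hexdigest())
    return prints


def bundle_contains(bundle_pem: str, ca_pem: str) -> bool:
    """True if every certificate in ca_pem is also present in bundle_pem."""
    wanted = cert_fingerprints(ca_pem)
    return bool(wanted) and wanted <= cert_fingerprints(bundle_pem)


def placeholder_pem(payload: bytes) -> str:
    """Wrap arbitrary bytes in PEM markers; for demonstration only."""
    body = base64.b64encode(payload).decode("ascii")
    return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"


if __name__ == "__main__":
    well_known = placeholder_pem(b"well-known CA")
    private_ca = placeholder_pem(b"private vCenter CA")
    print(bundle_contains(well_known, private_ca))               # bundle without the CA
    print(bundle_contains(well_known + private_ca, private_ca))  # bundle with the CA
```

Run against the tls-ca-bundle.pem from an affected pod and the vCenter CA from user-ca-bundle, a False result would match the behaviour described above.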

For comparison, an operator that handles the custom CAs correctly is openshift-machine-api - inside openshift-machine-api project, there is a configmap "mao-trusted-ca" which has the config.openshift.io/inject-trusted-cabundle label, so is auto-populated with the correct CAs. Also the "machine-controller" container in the machine-api-controllers pod mounts that configmap such that its /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem file is replaced with the correct set of CAs - this is why it can talk to vCenter to scale machines even without insecure-flag being set...

The desired behaviour needed to fix this bug in vsphere-problem-detector is (by whatever mechanism) for the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem in the vsphere-problem-detector-operator pod to include both the "well-known" CA certs, and the contents of user-ca-bundle - this will let it talk to a vCenter with a private CA.

Comment 9 Jan Safranek 2021-06-18 13:30:58 UTC
Fix: https://github.com/openshift/cluster-storage-operator/pull/178
Somewhat related fix to the CSI driver (tech preview for now): https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/28

Comment 11 Wei Duan 2021-07-30 09:05:21 UTC

I installed a cluster with additionalTrustBundle and a proxy environment on both 4.8 and 4.9:
In OCP 4.8 (without the fix), after removing `insecure-flag = "1"` in cm/cloud-provider-config, I got the following error in the storage CO (thanks to another bug fix, the storage CO no longer becomes degraded):
  - lastTransitionTime: "2021-07-30T05:39:47Z"
    message: 'VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com:
      Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": proxyconnect tcp:
      x509: certificate signed by unknown authority'
    reason: AsExpected
    status: "True"
    type: Available
In 4.9.0-0.nightly-2021-07-29-103526 (with the fix), after removing `insecure-flag = "1"`, the storage CO doesn't report such a message. I also checked inside the vsphere-problem-detector-operator pod: the additional CA has been added to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem.

@Jan and @gellner, before changing the status to "Verified", could you help confirm if my verification is ok from your side?

Comment 12 Jan Safranek 2021-08-02 08:46:34 UTC
From my side it looks OK.

Comment 13 Wei Duan 2021-08-05 00:43:26 UTC
Changed status to "Verified" based on the test result above; also checked with the vSphere CSI driver.

Comment 16 errata-xmlrpc 2021-10-18 17:33:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.