Description of problem:
- CSO is degraded during an upgrade because vsphere-problem-detector doesn't trust the vCenter CA.

Version-Release number of selected component (if applicable):
- RHOCP 4.7.z
- CSO 4.7.z

How reproducible:
- Every time

Steps to Reproduce:
- Install an OCP 4.7.z cluster with vSphere platform integration.
- In the install-config.yaml file, add the vCenter CA cert in additionalTrustBundle.
- Once the install is complete, check if CSO is degraded.
- Confirm if cm/cloud-provider-config has 'insecure-flag = "1"' set in it.
- If it is set, remove the flag and wait for CSO to transition to a degraded state.

Steps to Reproduce with an upgrade:
- Confirm that the vCenter CA cert is imported in OCP.
- If cm/cloud-provider-config has 'insecure-flag = "1"' set, remove it.
- Start the upgrade (4.6.z --> 4.7.z).
- The cluster upgrade will stall while upgrading CSO (degraded). It will only progress once 'insecure-flag = "1"' is set again in the CM.

Actual results:
- The `vsphere-problem-detector-operator` pod in the openshift-cluster-storage-operator project does not appear to mount any custom CA configmap.
- The CSO remains degraded with the below error message:
~~~
VSphereProblemDetectorControllerDegraded: failed to connect to <vCenterFQDN>: Post "https://<vCenterFQDN>/sdk": x509: certificate signed by unknown authority
~~~

Expected results:
- The `vsphere-problem-detector-operator` pod in the openshift-cluster-storage-operator project should mount a CA configmap to allow direct communication with vCenter.
- If not via a configmap, there should be built-in logic to trust the vCenter CA.

Additional info:
- cloud-provider-config did NOT include `insecure-flag = "1"` previously, when the cluster was working correctly (creating/attaching PVCs via vCenter) on v4.6.
- Setting 'insecure-flag = "1"' in cloud-provider-config is considered a workaround, not a fix.
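For reference, the vCenter CA is supplied to the installer through the additionalTrustBundle field in install-config.yaml; a minimal sketch with placeholder certificate contents:

~~~
# install-config.yaml (fragment) -- additionalTrustBundle carries the
# private CA chain that signed the vCenter server certificate.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <PEM-encoded vCenter root CA certificate>
  -----END CERTIFICATE-----
~~~

The installer copies this bundle into the user-ca-bundle configmap in the openshift-config namespace.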
This should be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1959546; the cluster won't become Degraded, it should just emit an alert. Still, if vsphere-problem-detector cannot connect to the vSphere API, it most probably means that the rest of OCP cannot connect there either and the cluster *is* broken. Storage and the Machine API will not work, and who knows what else breaks. Cluster admins should make sure that the TLS certificates and their configuration (or insecure-flag = "1") are set correctly.

*** This bug has been marked as a duplicate of bug 1959546 ***
Once the update completes, `insecure-flag = "1"` can be removed from the cloud provider config again and PVCs can be created/attached (because the components that do these tasks correctly mount/utilise the custom CAs injected using the additionalTrustBundle parameter, just as they did in v4.6). However, the vsphere-problem-detector pods do not mount/utilise that custom CA, so the storage ClusterOperator returns to Degraded.

In our case, the TLS certificates for the vCenter ARE set correctly, and the necessary custom CA is included in additionalTrustBundle/user-ca-bundle, which is mounted and utilised by most of the OpenShift components that talk to vCenter. Unfortunately, not by vsphere-problem-detector. Simply adding insecure-flag to the cloud provider config isn't a particularly great workaround, because it means the connection made with the vCenter service account can be intercepted via a man-in-the-middle attack.
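For context, the workaround in question is the insecure-flag entry in the vSphere cloud provider INI carried by cm/cloud-provider-config; a sketch of the relevant fragment, with surrounding keys elided (the [Global] section is where the standard vSphere cloud provider config places this key):

~~~
# "config" data key of cm/cloud-provider-config (openshift-config namespace), INI format
[Global]
insecure-flag = "1"   ; disables vCenter TLS verification -- a workaround, not a fix
~~~

Removing this line restores TLS verification, which is what exposes the missing CA trust in vsphere-problem-detector.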
I'd propose that this is NOT a duplicate of 1959546. The credentials (username, password, and CA certificate) in our case ARE valid. But the vsphere-problem-detector pods do not mount/utilise the additionalTrustBundle/user-ca-bundle ...
Did you do anything special to make the certs available in the KCM pods, or did you just follow the process documented at https://docs.openshift.com/container-platform/4.5/installing/installing_vsphere/installing-vsphere-installer-provisioned.html#installation-adding-vcenter-root-certificates_installing-vsphere-installer-provisioned ?
At the time the cluster in question was installed (on 4.5 originally), I referenced the user-ca-bundle (autocreated by the install from additionalTrustBundle) in a clusterwide proxy object:

~~~
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  trustedCA:
    name: user-ca-bundle
~~~

(procedure from https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html )

Re-reading the docs page ( https://docs.openshift.com/container-platform/4.7/networking/configuring-a-custom-pki.html#certificate-injection-using-operators_configuring-a-custom-pki ), I might experiment in a dev cluster with adding a configmap with the label config.openshift.io/inject-trusted-cabundle="true" to the openshift-cluster-storage-operator project... it might work... most of the other openshift-* projects have one, but the openshift-cluster-storage-operator project does not. I'll report back when I've tested this...
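The experiment described above amounts to creating an empty ConfigMap carrying the injection label; a sketch, where the name vsphere-problem-detector-trusted-ca is my own placeholder, not an actual resource:

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: vsphere-problem-detector-trusted-ca   # placeholder name
  namespace: openshift-cluster-storage-operator
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
# Once labelled, the network operator populates data["ca-bundle.crt"] with
# the well-known CAs plus the user-ca-bundle contents.
~~~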
Alright - thanks for updates. I am going to reopen this bug and treat it as different from other BZ.
So, adding a configmap with the label config.openshift.io/inject-trusted-cabundle="true" does result in the CNO automatically populating that configmap with the correct set of CA certs (the "well-known" CAs with any custom CA certs from "user-ca-bundle" concatenated). However, the vsphere-problem-detector-operator pod doesn't mount the configmap, so this alone doesn't really fix the problem. If you oc rsh into the vsphere-problem-detector-operator pod, you can view the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem it uses; this ONLY includes the "well-known" CAs. This is why trying to talk to a vCenter with a cert signed by a private CA gets the "x509: certificate signed by unknown authority" error.

For comparison, an operator that handles the custom CAs correctly is openshift-machine-api: inside the openshift-machine-api project there is a configmap "mao-trusted-ca" which has the config.openshift.io/inject-trusted-cabundle label, so it is auto-populated with the correct CAs. Also, the "machine-controller" container in the machine-api-controllers pod mounts that configmap such that its /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem file is replaced with the correct set of CAs; this is why it can talk to vCenter to scale machines even without insecure-flag being set.

The desired behaviour needed to fix this bug in vsphere-problem-detector is (by whatever mechanism) for the /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem in the vsphere-problem-detector-operator pod to include both the "well-known" CA certs and the contents of user-ca-bundle; this will let it talk to a vCenter with a private CA.
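Following the openshift-machine-api pattern described above, the fix would roughly mean mounting such an injected ConfigMap over the pod's extracted trust bundle; a hedged sketch of the Deployment fragment (names are illustrative, not the actual shipped manifest):

~~~
# fragment of a Deployment pod spec -- illustrative only
spec:
  containers:
  - name: vsphere-problem-detector-operator
    volumeMounts:
    - name: trusted-ca
      mountPath: /etc/pki/ca-trust/extracted/pem
      readOnly: true
  volumes:
  - name: trusted-ca
    configMap:
      name: vsphere-problem-detector-trusted-ca   # placeholder; must carry the injection label
      items:
      - key: ca-bundle.crt
        path: tls-ca-bundle.pem
~~~

The items mapping renames the injected ca-bundle.crt key to tls-ca-bundle.pem, so the merged bundle replaces the file the pod already reads, mirroring what the machine-controller container does.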
Fix: https://github.com/openshift/cluster-storage-operator/pull/178

Somewhat related fix to the CSI driver (tech preview for now): https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/28
Hi, I installed clusters with additionalTrustBundle and a proxy env on both 4.8 and 4.9.

In OCP 4.8 (without the fix), after removing `insecure-flag = "1"` in cm/cloud-provider-config, I got the following error in the storage CO (the storage CO doesn't become degraded due to another bug fix):

~~~
- lastTransitionTime: "2021-07-30T05:39:47Z"
  message: 'VSphereProblemDetectorControllerAvailable: failed to connect to vcenter.sddc-44-236-21-251.vmwarevmc.com: Post "https://vcenter.sddc-44-236-21-251.vmwarevmc.com/sdk": proxyconnect tcp: x509: certificate signed by unknown authority'
  reason: AsExpected
  status: "True"
  type: Available
~~~

In 4.9.0-0.nightly-2021-07-29-103526 (with the fix), after removing `insecure-flag = "1"`, the storage CO doesn't report such a message. I also checked inside the vsphere-problem-detector-operator pod: the additional CA is added in /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem.

@Jan and @gellner, before changing the status to "Verified", could you help confirm if my verification is OK from your side?
From my side it looks OK.
Changed status to "Verified" per the test result above, and also checked with the vSphere CSI driver.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759