Bug 1797123

Summary: Cluster-version operator loads proxy certs from the trustedCA source, and so is vulnerable to data-entry errors
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: W. Trevor King <wking>
Assignee: W. Trevor King <wking>
QA Contact: liujia <jiajliu>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.4
Target Release: 4.6.0
Target Milestone: ---
Keywords: Reopened
CC: anowak, aos-bugs, jokerman
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator used to load trusted CAs from the ConfigMap referenced by the Proxy configuration's trustedCA property.
Consequence: The referenced ConfigMap is user-maintained, so a user setting corrupted certificates would break the cluster-version operator's access to the proxy.
Fix: The cluster-version operator now loads trusted CAs from openshift-config-managed/trusted-ca-bundle, which the network operator populates when the Proxy configuration's referenced trustedCA ConfigMap is valid.
Result: If a user corrupts the referenced trustedCA ConfigMap, the network operator will not copy the corrupted content into openshift-config-managed/trusted-ca-bundle, and the cluster-version operator will continue to connect to the proxy using the previously valid certificates.
Story Points: ---
Last Closed: 2020-10-27 15:55:05 UTC
Type: Bug

Description W. Trevor King 2020-01-31 23:57:25 UTC
The API docs [1] recommend avoiding trustedCA (unless you happen to be the "proxy validator") and instead pulling the trust bundle from the openshift-config-managed namespace.  This protects against cluster-admin data-entry errors, because openshift-config-managed/trusted-ca-bundle will remain unchanged until the proxy validator sees a config in openshift-config/$trustedCA that it considers valid.  We should pivot the CVO to load the proxy certs from trusted-ca-bundle instead.

[1]: https://github.com/openshift/api/blob/f2a771e1a90ceb4e65f1ca2c8b11fc1ac6a66da8/config/v1/types_proxy.go#L44-L52
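
For illustration, one quick way to compare the user-maintained source with the post-validation managed copy (assuming the Proxy's trustedCA references a ConfigMap named user-ca-bundle; the name will vary per cluster):

$ oc get -o jsonpath='{.spec.trustedCA.name}{"\n"}' proxy cluster
$ oc -n openshift-config get -o json configmap user-ca-bundle | jq -r '.data["ca-bundle.crt"]' | head -n2
$ oc -n openshift-config-managed get -o json configmap trusted-ca-bundle | jq -r '.data["ca-bundle.crt"]' | head -n2

The managed copy only changes after the proxy validator accepts the user-maintained source, which is what makes it the safer place for the CVO to read from.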

Comment 1 W. Trevor King 2020-02-01 00:06:36 UTC
Some discussion around the MCO making a similar pivot to the post-validator trusted-ca-bundle in bug 1784201.

Comment 2 W. Trevor King 2020-07-10 19:43:52 UTC
Lala and Ben were going to take a look, but don't seem to have had time yet (although Abinav approved it 3d ago [1]).  So we're still waiting on review, which is unlikely to happen today.  Adding UpcomingSprint.

[1]: https://github.com/openshift/cluster-version-operator/pull/311#issuecomment-655024387

Comment 5 W. Trevor King 2020-08-01 05:33:57 UTC
ON_QA, but might get kicked back to me if verification fails, so adding UpcomingSprint.  Hopefully the last time for this bug :)

Comment 6 liujia 2020-08-04 08:22:55 UTC
Hi W. Trevor King

Could you help provide some background information for this bug? I'm not sure how I should verify it; reproduction steps would also be welcome :)

Comment 7 W. Trevor King 2020-08-04 22:23:04 UTC
Two things that would be useful to check:

* Narrowly, for this bug:
  1. Install a cluster with Proxy configured appropriately, including a valid trustedCA.
  2. Edit the trustedCA-referenced ConfigMap to inject some garbage, non-PEM content.
  3. Before this change, that should break the CVO's attempts to connect to the configured upstream (Cincinnati) service, and also its ability to connect to external HTTPS signature stores.  With this change, the CVO will continue to connect to both locations using the previous, good trustedCA content.

     The network operator, which is in charge of copying the trustedCA-referenced ConfigMap into openshift-config-managed/trusted-ca-bundle, will ideally complain about the new, broken trustedCA-referenced ConfigMap.  If it doesn't, probably file a bug against them.  A rough shell sketch of this check appears after this list.

* More broadly, as a workaround for bug 1773419, you can:
  1. Set up a Cincinnati service using a TLS certificate signed by a non-standard CA.
  2. Configure Proxy with a trustedCA-referenced ConfigMap that includes the non-standard CA.
  3. Update your ClusterVersion.spec.upstream to point at your service.
  4. Before this change, that should break the CVO's attempts to connect to the configured upstream, because without httpsProxy set in Proxy, the CVO would ignore the configured trustedCA.  With this change, the CVO will connect to the configured upstream, trusting its non-standard-CA-signed certificate.

     The same handling applies to HTTPS signature fetches, but because the base URIs for those are not easily configurable, it's a bit harder to test.  If you feel so inclined, you could verify this angle by configuring networking to route all outgoing HTTPS traffic through a proxy with a non-standard-CA-signed certificate, and continue to leave httpsProxy unset.
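
Here's a rough shell sketch of the narrow check above (steps 2 and 3), assuming the trustedCA-referenced ConfigMap is named user-ca-bundle and httpsProxy is set on the Proxy config; exact names and output will vary:

$ # Step 2: inject garbage, non-PEM content into the user-maintained ConfigMap.
$ oc -n openshift-config patch configmap user-ca-bundle --type json -p '[{"op": "replace", "path": "/data/ca-bundle.crt", "value": "this is not a certificate"}]'
$ # Step 3: the managed copy should remain unchanged (old PEM content, not the garbage)...
$ oc -n openshift-config-managed get -o json configmap trusted-ca-bundle | jq -r '.data["ca-bundle.crt"]' | head -n2
$ # ...and the CVO conditions should stay healthy, with no x509 errors on RetrievedUpdates.
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort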

Comment 8 W. Trevor King 2020-08-04 22:25:20 UTC
Ah, one thing left out of comment 7 for the narrow case: you'll want to set httpsProxy, at least for the "before this change" case, because without it the CVO will completely ignore the trustedCA-referenced ConfigMap.
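
For example, with a hypothetical proxy endpoint:

$ oc patch proxy cluster --type json -p '[{"op": "add", "path": "/spec/httpsProxy", "value": "http://proxy.example.com:3128"}]'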

Comment 14 W. Trevor King 2020-08-21 22:25:26 UTC
Not clear to me what went wrong in the latest verification attempt.  Will continue to look next sprint.

Comment 15 W. Trevor King 2020-08-22 00:04:29 UTC
Pointing a recent build at the Kube API as an "upstream" (because it's self-signed, and we can distinguish between "X.509 failure" and "not Cincy JSON" in the error message):

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.6.0-0.ci-2020-08-20-163422
$ oc get -o jsonpath='{.spec.trustedCA}{"\n"}' proxy cluster
map[name:]
$ yaml2json < "${KUBECONFIG}" | jq -r '.clusters[0].cluster.server'
https://api.ci-ln-5fqp6qt-f76d1.origin-ci-int-gce.dev.openshift.com:6443
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "whatever"},{"op": "add", "path": "/spec/upstream", "value": "https://api.ci-ln-5fqp6qt-f76d1.origin-ci-int-gce.dev.openshift.com:6443"}]'
$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-08-21T22:11:03Z RetrievedUpdates=False RemoteFailed: Unable to retrieve available updates: Get "https://api.ci-ln-5fqp6qt-f76d1.origin-ci-int-gce.dev.openshift.com:6443?arch=amd64&channel=whatever&id=662fb0a6-04c7-4363-8c11-8c6fd9d2d5e8&version=4.6.0-0.ci-2020-08-20-163422": x509: certificate signed by unknown authority
2020-08-21T22:50:03Z Failing=False : 
2020-08-21T22:50:33Z Available=True : Done applying 4.6.0-0.ci-2020-08-20-163422
2020-08-21T22:50:33Z Progressing=False : Cluster version is 4.6.0-0.ci-2020-08-20-163422

So "x509: certificate signed by unknown authority".  Good.  Now fill in the Proxy trustedCA.

$ yaml2json < "${KUBECONFIG}" | jq -r '.clusters[0].cluster["certificate-authority-data"]' | base64 -d >ca-bundle.crt
$ head -n2 ca-bundle.crt
-----BEGIN CERTIFICATE-----
MIIDkjCCAnqgAwIBAgIIWm+b1NvzbOYwDQYJKoZIhvcNAQELBQAwJjEkMCIGA1UE
$ oc -n openshift-config create configmap user-ca-bundle --from-file=ca-bundle.crt
$ oc patch proxy cluster --type json -p '[{"op": "add", "path": "/spec/trustedCA/name", "value": "user-ca-bundle"}]'
$ sleep 20  # or whatever, waiting for the network operator to populate the trusted bundle
$ diff -u ca-bundle.crt <(oc -n openshift-config-managed get -o json configmap trusted-ca-bundle | jq -r '.data["ca-bundle.crt"]') | head -n7
--- ca-bundle.crt	2020-08-21 15:58:33.146533690 -0700
+++ /dev/fd/63	2020-08-21 16:18:40.399523102 -0700
@@ -98,3 +98,3754 @@
 1eDPBGGc2pxk2eshDeX4THjpzF+GWksGmYc+5Az6+Qd7ImYDKReFnbPQz3OIDcq+
 egBKR65U
 -----END CERTIFICATE-----
+# ACCVRAIZ1

But I was still seeing the "certificate signed by unknown authority".  Poking around locally, I think the issue is cvo#441 (fixup PR).  With that in place, the flow above results in:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-08-21T22:11:03Z RetrievedUpdates=False ResponseFailed: Unable to retrieve available updates: unexpected HTTP status: 403 Forbidden
...

Which makes sense, because the CVO making an upstream request is hitting the Kube API server, which says "who are you?".  Anyhow, it shows that the CVO trusts the X.509 cert guarding the Kube API endpoint.

Comment 17 liujia 2020-08-27 02:44:40 UTC
Verified on 4.6.0-0.nightly-2020-08-25-222652; it works well for both scenarios.

Comment 19 errata-xmlrpc 2020-10-27 15:55:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 20 W. Trevor King 2021-01-21 19:36:43 UTC
*** Bug 1918816 has been marked as a duplicate of this bug. ***