Description of problem: Metrics endpoint is not using TLS to encrypt traffic. Version-Release number of selected component (if applicable): 4.4 (possibly also earlier versions) How reproducible: Always Steps to Reproduce: 1. Start a cluster 2. Go to prometheus UI 3. Check connection schema for this component Actual results: Metrics are exposed over HTTP connection Expected results: Metrics are exposed over HTTPS connection Additional info: API server operator ServiceMonitor definition can be used as a template on how to fix this issue: https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/manifests/0000_90_openshift-apiserver-operator_03_servicemonitor.yaml
What's the impact of this, is this a requirement in a specific version of the product?
This is tied to a 4.5 epic for monitoring team, so 4.5 TR is fine.
After fixing please remove your component from an exclusion list in e2e tests at https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L253-L268
Retitled, because serving over HTTPS does not protect from leaking metrics. The CVO is still happy to serve the metrics content to all clients, regardless of HTTP vs. HTTPS. With HTTPS, clients gain the ability to verify that they are talking to the CVO (assuming no chain-of-trust compromises). But unless we also teach the CVO to verify client certificates, we aren't protecting against leaks. My understanding is that verified client certificates are out of scope for this bug, so I've retitled it, but please correct me if I'm wrong. Having something in the bug about the threat model we want to address would be good, since I'm just assuming it's "attacker pretends to be the CVO, somehow middlemans the metrics service, and serves bogus metrics to Prometheus scrapers".
I'm optimistic that we'll land the referenced PR this sprint.
I agree. Sorry, I had misused the UpcomingSprint keyword.
The pr is not landed in the latest v4.6 nightly build 4.6.0-0.nightly-2020-06-12-084204.
Version: 4.6.0-0.nightly-2020-06-15-214351 In a fresh installed cluster, the schema in servicemonitor is updated to "https". But there are both "http" and "https" targets on prometheus console. From the bug's description, it meant to update original "http" schema with "https" schema. And there are not other targets with both http and https schema, so reopen the bug. Please correct me if my understanding is wrong here. # ./oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.6.0-0.nightly-2020-06-15-214351 True False 95m Cluster version is 4.6.0-0.nightly-2020-06-15-214351 # ./oc -n openshift-cluster-version get servicemonitor cluster-version-operator -ojson|jq .spec.endpoints[1].scheme "https" Endpoint State http://10.0.218.217:9099/metrics UP Endpoint State https://10.0.218.217:9099/metrics UP
[1] is AWS CI for 4.6.0-0.nightly-2020-06-15-214351, and it has: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1272648161621446656/artifacts/e2e-aws/must-gather.tar | tar -xOz ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-3380e948730d010362a635dac91ae2ecd41b02ce4a03c2e099b615d8c26b6bae/namespaces/openshift-cluster-version/monitoring.coreos.com/servicemonitors/cluster-version-operator.yaml | yaml2json | jq .spec.endpoints [ { "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token" }, { "tlsConfig": { "serverName": "cluster-version-operator.openshift-cluster-version.svc", "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt" }, "interval": "30s", "port": "metrics", "scheme": "https" } ] I'm not sure why the Prometheus console would show both HTTP and HTTPS under those conditions. But we do expect to *serve* metrics over both HTTP and HTTPS for some transition period (e.g. to support 4.5 -> 4.6 updates, since 4.5 Prometheus expects access over HTTP and 4.6 CVO will run while 4.5 Prometheus is running during part of all 4.5 -> 4.6 updates. For more in this space, see comment 4. The bug should just be about "is the 4.6 Prometheus scraper using HTTPS"; whether the CVO responds to HTTP as well is orthogonal. [1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1272648161621446656
About the issue that both http and https endpoints shows in Prometheus UI[1], after confirmed with monitoring qe @juzhao, and according to our test, it should be still cvo issue which involved in https://github.com/openshift/cluster-version-operator/pull/358/files#diff-9cebe4481f6ba21cc13df1cfa2682b79. Current/wrong format: # ./oc -n openshift-cluster-version get servicemonitor/cluster-version-operator -oyaml ... spec: endpoints: - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token - interval: 30s port: metrics scheme: https Correct format should be: ... spec: endpoints: - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token interval: 30s port: metrics scheme: https I tried to edit servicemonitor to correct format, and then refresh the targets page on Prometheus UI, after several seconds, it would show https endpoint only(expected). So assign the bug back for further fix. [1] https://prometheus-k8s-openshift-monitoring.apps.jliu-46.qe.devcluster.openshift.com/targets
(In reply to liujia from comment #12) > About the issue that both http and https endpoints shows in Prometheus ... Created https://github.com/openshift/cluster-version-operator/pull/385 to address.
I've filed two origin PR that need monitoring-team approval. The order they need to land in is: 1. origin#25134 (blocked on monitoring-team approval) 2. cvo#385 (blocked on origin#25134) 4. origin#25135 (blocked on monitoring-team approval and cvo#385)
Assigning a needinfo to help flag the monitoring-team approval of the origin PRs.
reviewed PRs in origin, all looks good
Version:4.6.0-0.nightly-2020-06-23-033310 For a fresh installed cluster: 1) The schema in servicemonitor is "https"(expected). # ./oc -n openshift-cluster-version get servicemonitor cluster-version-operator -ojson|jq .spec.endpoints[].scheme "https" 2) There is only https endpoint targets UI(expected). https://10.0.166.236:9099/metrics 3) Curl above endpoint in cvo pod without token successfully.(expected result should be rejected.) # ./oc -n openshift-cluster-version exec cluster-version-operator-6ddbd8cf48-htg92 -- curl -k https://10.0.166.236:9099/metrics % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0# HELP cluster_installer Reports info about the installation process and, if applicable, the install tool. # TYPE cluster_installer gauge cluster_installer{invoker="user",type="openshift-install",version="v4.6.0"} 1 ... Take mco for example, the connect without token should be rejected. # ./oc -n openshift-machine-config-operator exec machine-config-operator-6b4b577d49-gnqd7 -- curl -k https://10.129.0.26:8443/metrics % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 240 100 240 0 0 5633 0 --:--:-- --:--:-- --:--:-- 5714 { "kind": "Status", "apiVersion": "v1", "metadata": { }, "status": "Failure", "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"", "reason": "Forbidden", "details": { }, "code": 403 }
> 3) Curl above endpoint in cvo pod without token successfully.(expected > result should be rejected.) > ... > Take mco for example, the connect without token should be rejected. > # ./oc -n openshift-machine-config-operator exec > machine-config-operator-6b4b577d49-gnqd7 -- curl -k > https://10.129.0.26:8443/metrics > ... > "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"", I did not expect us to reject anonymous users, see my guess at the threat model in comment 4. How does one accomplish this? Does it matter that the CVO pods are 'hostNetwork: true' [1] (although the motivation there is API access, not metrics hosting [2])? Is token-gated access a requirement for this feature request? Pawel, is there a monitoring-side threat model that covers the aspects you're looking to see addressed? [1]: https://github.com/openshift/cluster-version-operator/blob/2b849292e49f1365ce1c84541194b736b76d7b18/install/0000_00_cluster-version-operator_03_deployment.yaml#L56 [2]: https://github.com/openshift/cluster-version-operator/commit/68b74c698c4321adbb5c595ee4a0bdb68d084552
I was going to do a bit more research but to add onto wking 's points above, I tested this and found at least one other case, openshift-machine-config-operator/machine-config-daemon, of being able to access without a token. I do not readily see a difference in how our metrics are now configured and any of the others I have checked.
I'm not an expert on matrix. I did the test against machine-config-operator and console operator, both of them can not access endpoint without token. So i think cvo is not expected at this point. @Pawel Krupa For this bug, could u help confirm if "cvo should not access ep without token" an expect result? > 3) Curl above endpoint in cvo pod without token successfully.(expected > result should be rejected.)
(In reply to liujia from comment #21) > I'm not an expert on matrix. I did the test against machine-config-operator > and console operator, both of them can not access endpoint without token. So > i think cvo is not expected at this point. > > @Pawel Krupa For this bug, could u help confirm if "cvo should not access ep > without token" an expect result? > > > 3) Curl above endpoint in cvo pod without token successfully.(expected > > result should be rejected.) Note also that we discussed in slack thread https://coreos.slack.com/archives/CEGKQ43CP/p1593117184028100 where @pkrupa referenced scope is covered in https://issues.redhat.com/browse/MON-417 which does not require a token.
According to comment 18 and comment 22, move the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196