1809195 – Metrics over HTTPS so it's harder for attackers to spoof

Bug 1809195 - Metrics over HTTPS so it's harder for attackers to spoof

Summary: Metrics over HTTPS so it's harder for attackers to spoof

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Jack Ottofaro
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:	1834568
Blocks:
TreeView+	depends on / blocked

Reported:	2020-03-02 14:59 UTC by Pawel Krupa
Modified:	2021-04-25 07:48 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Currently CVO serves metrics over HTTP. Consequence: Since CVO uses HTTP and not HTTPS it is subject to "man-in-the-middle" attacks and the metrics data is not encrypted. Fix: CVO serves metrics over HTTPS. Result: CVO metrics data is encrypted using HTTPS/TLS connection. CVO verifies client's certificate to prevent "man-in-the-middle" attacks.
Clone Of:
Environment:
Last Closed:	2020-10-27 15:56:15 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 358	None	closed	Bug 1809195: Send CVO metrics over https	2021-01-13 02:31:04 UTC
Github	openshift cluster-version-operator pull 385	None	closed	Bug 1809195: Remove '-' from servicemonitor spec	2021-01-13 02:31:04 UTC
Github	openshift origin pull 25134	None	closed	Bug 1809195: test/extended/prometheus: During transition, allow HTTP(S) for CVO	2021-01-13 02:31:04 UTC
Github	openshift origin pull 25135	None	closed	Bug 1809195: test/extended/prometheus: Require HTTPS for the CVO	2021-01-13 02:31:04 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 15:56:16 UTC

Description Pawel Krupa 2020-03-02 14:59:22 UTC

Description of problem:
Metrics endpoint is not using TLS to encrypt traffic.

Version-Release number of selected component (if applicable):
4.4 (possibly also earlier versions)

How reproducible:
Always

Steps to Reproduce:
1. Start a cluster
2. Go to prometheus UI
3. Check connection schema for this component

Actual results:
Metrics are exposed over HTTP connection

Expected results:
Metrics are exposed over HTTPS connection

Additional info:
API server operator ServiceMonitor definition can be used as a template on how to fix this issue: https://github.com/openshift/cluster-openshift-apiserver-operator/blob/master/manifests/0000_90_openshift-apiserver-operator_03_servicemonitor.yaml

Comment 1 Scott Dodson 2020-03-02 17:32:00 UTC

What's the impact of this, is this a requirement in a specific version of the product?

Comment 2 Scott Dodson 2020-03-03 15:05:57 UTC

This is tied to a 4.5 epic for monitoring team, so 4.5 TR is fine.

Comment 3 Pawel Krupa 2020-04-06 11:41:44 UTC

After fixing please remove your component from an exclusion list in e2e tests at https://github.com/openshift/origin/blob/master/test/extended/prometheus/prometheus.go#L253-L268

Comment 4 W. Trevor King 2020-05-11 16:28:21 UTC

Retitled, because serving over HTTPS does not protect from leaking metrics.  The CVO is still happy to serve the metrics content to all clients, regardless of HTTP vs. HTTPS.  With HTTPS, clients gain the ability to verify that they are talking to the CVO (assuming no chain-of-trust compromises).  But unless we also teach the CVO to verify client certificates, we aren't protecting against leaks.  My understanding is that verified client certificates are out of scope for this bug, so I've retitled it, but please correct me if I'm wrong.  Having something in the bug about the threat model we want to address would be good, since I'm just assuming it's "attacker pretends to be the CVO, somehow middlemans the metrics service, and serves bogus metrics to Prometheus scrapers".

Comment 5 W. Trevor King 2020-05-19 04:55:22 UTC

I'm optimistic that we'll land the referenced PR this sprint.

Comment 6 Jack Ottofaro 2020-05-20 16:44:04 UTC

I agree. Sorry, I had misused the UpcomingSprint keyword.

Comment 9 liujia 2020-06-15 08:22:20 UTC

The pr is not landed in the latest v4.6 nightly build 4.6.0-0.nightly-2020-06-12-084204.

Comment 10 liujia 2020-06-16 03:56:53 UTC

Version: 4.6.0-0.nightly-2020-06-15-214351

In a fresh installed cluster, the schema in servicemonitor is updated to "https". But there are both "http" and "https" targets on prometheus console. From the bug's description, it meant to update original "http" schema with "https" schema. And there are not other targets with both http and https schema, so reopen the bug. Please correct me if my understanding is wrong here.

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-15-214351   True        False         95m     Cluster version is 4.6.0-0.nightly-2020-06-15-214351

# ./oc -n openshift-cluster-version get servicemonitor cluster-version-operator -ojson|jq .spec.endpoints[1].scheme
"https"


Endpoint	State	
http://10.0.218.217:9099/metrics UP


Endpoint	State	
https://10.0.218.217:9099/metrics UP

Comment 11 W. Trevor King 2020-06-16 04:12:10 UTC

[1] is AWS CI for 4.6.0-0.nightly-2020-06-15-214351, and it has:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1272648161621446656/artifacts/e2e-aws/must-gather.tar | tar -xOz ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-3380e948730d010362a635dac91ae2ecd41b02ce4a03c2e099b615d8c26b6bae/namespaces/openshift-cluster-version/monitoring.coreos.com/servicemonitors/cluster-version-operator.yaml | yaml2json | jq .spec.endpoints
  [
    {
      "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token"
    },
    {
      "tlsConfig": {
        "serverName": "cluster-version-operator.openshift-cluster-version.svc",
        "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt"
      },
      "interval": "30s",
      "port": "metrics",
      "scheme": "https"
    }
  ]

I'm not sure why the Prometheus console would show both HTTP and HTTPS under those conditions.  But we do expect to *serve* metrics over both HTTP and HTTPS for some transition period (e.g. to support 4.5 -> 4.6 updates, since 4.5 Prometheus expects access over HTTP and 4.6 CVO will run while 4.5 Prometheus is running during part of all 4.5 -> 4.6 updates.  For more in this space, see comment 4.  The bug should just be about "is the 4.6 Prometheus scraper using HTTPS"; whether the CVO responds to HTTP as well is orthogonal.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.6/1272648161621446656

Comment 12 liujia 2020-06-16 07:32:51 UTC

About the issue that both http and https endpoints shows in Prometheus UI[1], after confirmed with monitoring qe @juzhao, and according to our test, it should be still cvo issue which involved in https://github.com/openshift/cluster-version-operator/pull/358/files#diff-9cebe4481f6ba21cc13df1cfa2682b79.

Current/wrong format:
# ./oc -n openshift-cluster-version get servicemonitor/cluster-version-operator -oyaml
...
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
  - interval: 30s
    port: metrics
    scheme: https

Correct format should be:
...
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    port: metrics
    scheme: https

I tried to edit servicemonitor to correct format, and then refresh the targets page on Prometheus UI, after several seconds, it would show https endpoint only(expected).
So assign the bug back for further fix.

[1] https://prometheus-k8s-openshift-monitoring.apps.jliu-46.qe.devcluster.openshift.com/targets

Comment 13 Jack Ottofaro 2020-06-16 14:47:10 UTC

(In reply to liujia from comment #12)
> About the issue that both http and https endpoints shows in Prometheus
...

Created https://github.com/openshift/cluster-version-operator/pull/385 to address.

Comment 14 W. Trevor King 2020-06-17 04:00:47 UTC

I've filed two origin PR that need monitoring-team approval.  The order they need to land in is:

1. origin#25134 (blocked on monitoring-team approval)
2. cvo#385 (blocked on origin#25134)
4. origin#25135 (blocked on monitoring-team approval and cvo#385)

Comment 15 W. Trevor King 2020-06-17 04:01:52 UTC

Assigning a needinfo to help flag the monitoring-team approval of the origin PRs.

Comment 16 Pawel Krupa 2020-06-17 10:58:16 UTC

reviewed PRs in origin, all looks good

Comment 18 liujia 2020-06-23 07:21:38 UTC

Version:4.6.0-0.nightly-2020-06-23-033310

For a fresh installed cluster:
1) The schema in servicemonitor is "https"(expected).
# ./oc -n openshift-cluster-version get servicemonitor cluster-version-operator -ojson|jq .spec.endpoints[].scheme
"https"

2) There is only https endpoint targets UI(expected).
https://10.0.166.236:9099/metrics

3) Curl above endpoint in cvo pod without token successfully.(expected result should be rejected.)
# ./oc -n openshift-cluster-version exec cluster-version-operator-6ddbd8cf48-htg92 -- curl -k https://10.0.166.236:9099/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0# HELP cluster_installer Reports info about the installation process and, if applicable, the install tool.
# TYPE cluster_installer gauge
cluster_installer{invoker="user",type="openshift-install",version="v4.6.0"} 1
...

Take mco for example, the connect without token should be rejected.
# ./oc -n openshift-machine-config-operator exec machine-config-operator-6b4b577d49-gnqd7 -- curl -k https://10.129.0.26:8443/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240  100   240    0     0   5633      0 --:--:-- --:--:-- --:--:--  5714
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
    
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {
    
  },
  "code": 403
}

Comment 19 W. Trevor King 2020-06-24 05:51:49 UTC

> 3) Curl above endpoint in cvo pod without token successfully.(expected
> result should be rejected.)
> ...
> Take mco for example, the connect without token should be rejected.
> # ./oc -n openshift-machine-config-operator exec
> machine-config-operator-6b4b577d49-gnqd7 -- curl -k
> https://10.129.0.26:8443/metrics
> ...
>   "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",

I did not expect us to reject anonymous users, see my guess at the threat model in comment 4.  How does one accomplish this?  Does it matter that the CVO pods are 'hostNetwork: true' [1] (although the motivation there is API access, not metrics hosting [2])?  Is token-gated access a requirement for this feature request?  Pawel, is there a monitoring-side threat model that covers the aspects you're looking to see addressed?

[1]: https://github.com/openshift/cluster-version-operator/blob/2b849292e49f1365ce1c84541194b736b76d7b18/install/0000_00_cluster-version-operator_03_deployment.yaml#L56
[2]: https://github.com/openshift/cluster-version-operator/commit/68b74c698c4321adbb5c595ee4a0bdb68d084552

Comment 20 Jack Ottofaro 2020-06-24 13:01:49 UTC

I was going to do a bit more research but to add onto wking 's points above, I tested this and found at least one other case, openshift-machine-config-operator/machine-config-daemon, of being able to access without a token. I do not readily see a difference in how our metrics are now configured and any of the others I have checked.

Comment 21 liujia 2020-06-28 01:47:58 UTC

I'm not an expert on matrix. I did the test against machine-config-operator and console operator, both of them can not access endpoint without token. So i think cvo is not expected at this point. 

@Pawel Krupa For this bug, could u help confirm if "cvo should not access ep without token" an expect result?

> 3) Curl above endpoint in cvo pod without token successfully.(expected
> result should be rejected.)

Comment 22 Jack Ottofaro 2020-06-29 21:21:30 UTC

(In reply to liujia from comment #21)
> I'm not an expert on matrix. I did the test against machine-config-operator
> and console operator, both of them can not access endpoint without token. So
> i think cvo is not expected at this point. 
> 
> @Pawel Krupa For this bug, could u help confirm if "cvo should not access ep
> without token" an expect result?
> 
> > 3) Curl above endpoint in cvo pod without token successfully.(expected
> > result should be rejected.)

Note also that we discussed in slack thread https://coreos.slack.com/archives/CEGKQ43CP/p1593117184028100 where @pkrupa referenced scope is covered in https://issues.redhat.com/browse/MON-417 which does not require a token.

Comment 23 liujia 2020-06-30 01:30:42 UTC

According to comment 18 and comment 22, move the bug to verified.

Comment 25 errata-xmlrpc 2020-10-27 15:56:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.