Bug 1913386
Summary: Users can see metrics of namespaces for which they don't have rights when monitoring their own services with Prometheus user workloads
Product: OpenShift Container Platform
Component: Monitoring
Reporter: German Parente <gparente>
Assignee: Simon Pasquier <spasquie>
QA Contact: hongyan li <hongyli>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.6
Target Release: 4.7.0
Target Milestone: ---
Keywords: UpcomingSprint
Hardware: Unspecified
OS: Unspecified
CC: alegrand, anpicker, erooth, hongyli, kakkoyun, lcosic, mas-hatada, mfuruta, pkrupa, rh-container, spacquie, surbania
Doc Type: Bug Fix
Doc Text:
Cause: The user-workload monitoring Prometheus uses kube-rbac-proxy to prevent requests from reaching the /metrics endpoint unless the request is authenticated and authorized to perform a GET on /metrics. Because of the way kube-rbac-proxy was configured, it was possible for authenticated requests without elevated permissions to access the /api/v1/query and /api/v1/query_range endpoints of Prometheus.
Consequence: Any user with access to a regular service account's token could query the /api/v1/query and /api/v1/query_range endpoints and read metrics from any monitored target.
Fix: kube-rbac-proxy is configured to allow requests to the /metrics endpoint only.
Result: Authenticated requests without cluster-wide permission on /metrics that try to query the /api/v1/query and /api/v1/query_range endpoints get a 404 status code.
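The fix described in the Doc Text can be sketched as a kube-rbac-proxy invocation. This is an illustrative config fragment, not the actual cluster-monitoring-operator manifest: the listen address, upstream, and exact flag set are assumptions, though kube-rbac-proxy does provide path-restriction flags of this kind.

```shell
# Illustrative kube-rbac-proxy sidecar configuration (hedged sketch, not the
# shipped OpenShift manifest): only /metrics is proxied to the upstream
# Prometheus. Requests for any other path, such as /api/v1/query or
# /api/v1/query_range, are answered by the proxy with 404, which matches
# the "Result" described above.
kube-rbac-proxy \
  --secure-listen-address=0.0.0.0:9091 \
  --upstream=http://127.0.0.1:9090 \
  --allow-paths=/metrics
```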
Story Points: ---
: 1941634 (view as bug list)
Last Closed: 2021-02-24 15:50:41 UTC
Type: Bug
Regression: ---
Bug Blocks: 1941634
Description
German Parente
2021-01-06 16:00:29 UTC
reproduced the issue on payload 4.7.0-0.nightly-2021-01-14-014511

Hello Simon Pasquier,

Yesterday I had a regular TAM conference call with the NEC OCP team and I've confirmed their concern as below. Would you please take a look and triage this sooner?

Feedback from NEC:
~~~
For now, unprivileged users can call the Prometheus API for metrics, but they can also view metrics for other users. This should be seen as a kind of security issue, and NEC believes it needs to be fixed sooner. Does RH have an ETA?
One of NEC's end customers, a major Japanese bank, wants to use this feature to monitor their workloads, but due to this type of security issue, it cannot be put into production. This is because compliance requires strict security requirements.
~~~

I am grateful for your help and support. Thank you, BR, Masaki

Verified with payload 4.7.0-0.nightly-2021-02-03-165316:
- login with kubeadmin
- create testuser-0
- create one project ns1 and deploy prometheus-example-app
- get the token of user testuser-0

#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/metrics'
Forbidden (user=testuser-0, verb=get, resource=, subresource=)

#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up'
404 page not found

#oc adm policy add-cluster-role-to-user cluster-monitoring-operator testuser-0
#oc adm policy add-role-to-user view testuser-0 -n ns1

#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/metrics'
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.8409e-05
go_gc_duration_seconds{quantile="0.25"} 5.938e-05
--------------------

#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up'
404 page not found

Hi,

It seems that Comment 9's verification is wrong. The problem was reported for the Prometheus running in openshift-user-workload-monitoring, but this verification tested the Prometheus running in openshift-monitoring. Why did Red Hat verify in a different way from Comment 1?

@Masaki Thanks for the sharp eyes. Assigning back to QA for additional verification.

@Masaki and @Simon I don't think there is anything wrong with the verification; I did verify the Prometheus running in openshift-user-workload-monitoring. I checked the service https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091 under different situations with different tokens. The part 'oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H' may be confusing, but the pod used to run curl doesn't matter at all.

Hi Hongyan-san,
Thank you for commenting.
> The part 'oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H' may be confusing, but the pod used to run curl doesn't matter at all.
Yes, the above is right.
But I have another question.
The problem this bugzilla reports is that user1 can see metrics of an application deployed by user2.
But in your verification there is only one user (testuser-0). So how did you check that the original problem was fixed?
And is 'go_gc_duration_seconds' really a metric of your application? It is a metric of prometheus-k8s-0 itself, I think. So your verification checks whether user1 can see metrics of prometheus-k8s-0 itself; it does not check whether user1 can see metrics of an application deployed by user2. And I still wonder: why did Red Hat verify in a different way from Comment 1?

Test with payload 4.7.0-0.nightly-2021-02-17-130606

Login with cluster-admin, create two projects and deploy prometheus-example-app in each:

#oc new-project ns1
#oc get pod -n ns1
NAME READY STATUS RESTARTS AGE
prometheus-example-app-7c887b8bb-kc6xh 1/1 Running 0 19s
#oc new-project ns2
#oc get pod -n ns2
NAME READY STATUS RESTARTS AGE
prometheus-example-app-7c887b8bb-kvbfw 1/1 Running 0 8m24s
#oc policy add-role-to-user admin testuser-1 -n ns1
#oc policy add-role-to-user admin testuser-2 -n ns2
-------------------------------------------------------------------
#oc login -u testuser-1 -p secret
Login successful.
You have one project on this server: "ns1"
----------
#oc login -u testuser-2 -p secret
Login successful.
You have one project on this server: "ns2"
-----get token of testuser-2
#token=`oc whoami -t`
#oc run curl --image=curlimages/curl --command -- sleep 3600
#oc rsh curl
#curl -k -H "Authorization: Bearer $token https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up
The above curl command gets nothing.
-------------------------------------------------------------------
Login with cluster-admin:
#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up'
#oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up'
Both of the above commands get the message '404 page not found'.

Update to comment 17:

#oc login -u testuser-2 -p secret
Login successful.
You have one project on this server: "ns2"
-----get token of testuser-2
#token=`oc whoami -t`
#oc run curl --image=curlimages/curl --command -- sleep 3600
#oc rsh curl
#curl -k -H "Authorization: Bearer $token" https://prometheus-user-workload.openshift-user-workload-monitoring.svc.cluster.local:9091/api/v1/query?query=up
Get the message "404 page not found". This is expected according to the fix described in the Doc Text field: authenticated requests without cluster-wide permission on /metrics trying to query the /api/v1/query and /api/v1/query_range endpoints get a 404 status code.

Thank you for testing, but unfortunately the verification was still not enough... Yes, you checked that user1 cannot see metrics of an application deployed by user2. But what about whether user1 can see metrics of an application deployed by *user1*? As far as we tested with OCP 4.7-rc.0, user1 cannot see even the metrics of an application deployed by user1...
$ ./oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/metrics'
Forbidden (user=user1, verb=get, resource=, subresource=)
$ ./oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/api/v1/query?query=up'
404 page not found

It's a regression. You verified that the original bug was fixed, but you should also verify whether other related functions still work correctly.

Dear Simon,
As I mentioned in Comment 19, it seems there is a regression. Could you check it?

testuser-2 can't see metrics of project ns2; even when given an admin/view role, the Forbidden message is expected, because the admin and view roles have no rule of the form:
- nonResourceURLs:
  - /metrics
  verbs:
  - get

# oc get clusterroles view -oyaml|grep -i nonresource -A6
# oc get clusterroles admin -oyaml|grep -i nonresource -A6
# oc get clusterroles cluster-monitoring-operator -oyaml|grep -i nonresource -A6
- nonResourceURLs:
  - /metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:

#oc policy add-role-to-user admin testuser-2 -n ns2
$./oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/metrics'
Forbidden (user=user1, verb=get, resource=, subresource=)

#oc adm policy add-cluster-role-to-user cluster-monitoring-operator testuser-2
oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-user-workload.openshift-user-workload-monitoring.svc:9091/metrics'
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.0286e-05
go_gc_duration_seconds{quantile="0.25"} 0.000156758
go_gc_duration_seconds{quantile="0.5"} 0.00018472
go_gc_duration_seconds{quantile="0.75"} 0.000216007
go_gc_duration_seconds{quantile="1"} 0.000617816
go_gc_duration_seconds_sum 0.013435326

We did verify all the related functions, and Comment 9 includes the test steps. There is no regression here.

???
The regression I pointed out is that a user became unable to see even the metrics of an application deployed by that user.
Where did you test this?
> testuser-2 can't see metrics of project ns2,
If this is right, it is a regression, isn't it?
And I pointed out that 'go_gc_duration_seconds' is not a metric of your application; it is a metric of prometheus-k8s-0 itself.
Why do you insist on checking it?
Non-admin users don't need to check a metric of prometheus-k8s-0 itself. They want to check the metrics of their applications.
What kind of test do you want to do?
If a user cannot see even the metrics of an application deployed by that user, user-workload monitoring is no longer useful, is it?
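The RBAC point debated in the comments above (access to /metrics is governed by a nonResourceURLs rule, which the admin and view roles lack) can be sketched as a standalone manifest. This is an illustrative example, not a role shipped with OpenShift: `example-metrics-reader` is a hypothetical name, and in the bug the equivalent rule comes from the cluster-monitoring-operator ClusterRole.

```shell
# Write an illustrative ClusterRole that grants GET on the /metrics
# non-resource URL -- the permission kube-rbac-proxy checks for.
# 'example-metrics-reader' is a hypothetical name, not a built-in role.
cat > metrics-reader.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
EOF

# Against a live cluster one would then apply and bind it, e.g.:
#   oc apply -f metrics-reader.yaml
#   oc adm policy add-cluster-role-to-user example-metrics-reader testuser-2
```

Binding such a role only opens the /metrics endpoint of the Prometheus instance itself; as the later comments establish, application metrics are meant to be read through thanos-querier instead.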
(In reply to Masaki Hatada from comment #14)
> But in your verification there is only one user (testuser-0). So how did you check if the original problem was fixed?

I created one user because the user can't query data about prometheus-example-app even in its own project, that is, the project where the user has the admin/view role. It's unnecessary to create two users, because /api/v1/query is forbidden in any situation.

(In reply to Masaki Hatada from comment #23)
> If user cannot see even metrics of application deployed by user itself, user-workload monitoring is no longer useful, isn't it?

A user being unable to see metrics of his own application from prometheus-user-workload is not a regression: testuser-2 can't see metrics of project ns2 from prometheus-user-workload by design. You can only see user metrics from thanos-querier. With the following command you can get metrics of the user app, and we have tested that a user can only query metrics of his own app and can't query metrics of other users' apps. This part has nothing to do with the bug.

#oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up'

(In reply to hongyan li from comment #26)
> You can only see user metrics from thanos-querier.

Hummm, this is new information... OK, we will test. But could you include this in your verification result? It's the most important information for this verification. And, if your opinion (that we need to access thanos-querier to get metrics of user applications) is right, https://access.redhat.com/solutions/5151831 should be updated. Who will update this?

Test results for thanos-querier

Login with cluster-admin, create two projects and deploy prometheus-example-app in each:

#oc new-project ns1
#oc get pod -n ns1
NAME READY STATUS RESTARTS AGE
prometheus-example-app-7c887b8bb-kc6xh 1/1 Running 0 19s
#oc new-project ns2
#oc get pod -n ns2
NAME READY STATUS RESTARTS AGE
prometheus-example-app-7c887b8bb-kvbfw 1/1 Running 0 8m24s
#oc policy add-role-to-user admin testuser-1 -n ns1
#oc policy add-role-to-user admin testuser-2 -n ns2
-------------------------------------------------------------------
#oc login -u testuser-1 -p secret
Login successful.
You have one project on this server: "ns1"
----------
#oc login -u testuser-2 -p secret
Login successful.
You have one project on this server: "ns2"
-----get token of testuser-2
#token=`oc whoami -t`
-------------------------------------------------------------------
Login with cluster-admin:

$ oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up&namespace=ns2'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","endpoint":"web","instance":"10.129.2.42:8080","job":"prometheus-example-app","namespace":"ns2","pod":"prometheus-example-app-7c887b8bb-kvbfw","prometheus":"openshift-user-workload-monitoring/user-workload","service":"prometheus-example-app"},"value":[1613628023.118,"1"]}]}}

$ oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up&namespace=ns1'
Forbidden (user=testuser-2, verb=get, resource=pods, subresource=)

$ oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up'
Bad Request. The request or configuration is malformed.
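The access pattern demonstrated by the test results above — query thanos-querier on port 9092 with a namespace parameter, rather than hitting prometheus-user-workload directly — can be sketched as a small helper. The service hostname and port come from the commands above; `query_user_metrics` is a hypothetical helper name and the token value is a placeholder (on a real cluster it would come from `oc whoami -t`).

```shell
# Build the tenancy-aware Thanos Querier query command used in the tests above.
# query_user_metrics is a hypothetical helper; the token argument is a placeholder.
query_user_metrics() {
  local token="$1" namespace="$2" query="$3"
  local base="https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query"
  # The namespace parameter is required on this tenancy endpoint; omitting it
  # yields "Bad Request" as shown in the test results above.
  printf "curl -k -H 'Authorization: Bearer %s' '%s?query=%s&namespace=%s'\n" \
    "$token" "$base" "$query" "$namespace"
}

# Print the command a user would run (does not contact a cluster).
query_user_metrics "sha256~EXAMPLE" "ns2" "up"
```

The helper only prints the curl invocation, so it can be inspected offline; against a live cluster the printed command would return metrics only for namespaces where the token's user holds a role such as view or admin.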
(In reply to Masaki Hatada from comment #28)
> https://access.redhat.com/solutions/5151831 should be updated. Who will update this?

I don't know who should update this. I am asking in the Slack channel and will post here if I get the answer.

Yes, I confirm that to access user metrics, you need to go through the Thanos Querier API. The prometheus-user-workload.openshift-user-workload-monitoring.svc service only exists to expose Prometheus internal metrics. I'll follow up with the person that wrote the article to remove the confusion. Thanks for your patience.

Thank you for updating and testing multiple times. We got the same result as Comment 29. Our concerns are gone.

$ ./oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up&namespace=ns1'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","endpoint":"web","instance":"10.130.0.5:8080","job":"prometheus-example-app","namespace":"ns1","pod":"prometheus-example-app-7f8f8b8d4-cqcxt","prometheus":"openshift-user-workload-monitoring/user-workload","service":"prometheus-example-app"},"value":[1613630574.467,"1"]}]}}

$ ./oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?query=up&namespace=ns2'
Forbidden (user=user1, verb=get, resource=pods, subresource=)

> I'll follow up with the person that wrote the article to remove the confusion.

We are looking forward to this update.

Hi, I will update the article to show that the query has to be done using the thanos-querier service or its exposed route.
Regards, German.

@German and @Masaki I find I have access and have drafted a solution. I don't know the update process; you can delete it if it doesn't make sense. https://access.redhat.com/solutions/5151831/moderation

@Hongyan, thanks a lot. I cannot see your former link. I will contact you on IRC or Slack.
Regards, German.

Thanks a lot to Hongyan. The article has been modified with the right information.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633