Bug 1986829

Summary: [AUTH-20] Make prometheus authenticate with a certificate while scraping the cluster's core components metrics
Product: OpenShift Container Platform
Reporter: Standa Laznicka <slaznick>
Component: apiserver-auth
Assignee: Standa Laznicka <slaznick>
Status: CLOSED ERRATA
QA Contact: Rahul Gangwar <rgangwar>
Severity: medium
Docs Contact:
Priority: high
Version: 4.9
CC: aos-bugs, kewang, liyao, mfojtik, surbania, xxia
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:42:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Standa Laznicka 2021-07-28 12:03:48 UTC
Description of problem:
https://github.com/openshift/cluster-monitoring-operator/pull/1282 made it possible for the metrics scraper to authenticate with a client certificate, which removes one TokenReview call to the kube-apiserver per scrape (usually once every 30s per scraped component).

The core components and operators should use this capability to lower the API server load and to make it possible to scrape metrics even when the kube-apiserver is down (only if the contacted component uses static authorization for its /metrics endpoint, though).

Comment 3 Rahul Gangwar 2021-08-11 08:02:42 UTC
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-07-175228   True        False         2m10s   Cluster version is 4.9.0-0.nightly-2021-08-07-175228

Checked the metrics client certificate secret:
oc get secret -n openshift-monitoring
 
metrics-client-certs                             Opaque                                2      22m


oc get csr
system:openshift:openshift-monitoring-gnqcs      30s   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            Approved,Issued


Checked the metrics client certificate secret again to confirm that a new certificate was issued:
oc get secret -n openshift-monitoring
 
metrics-client-certs                             Opaque                                2      2m30s

Gathered Prometheus metrics with curl and the client certificate for the operators below:
openshift-apiserver-operator
openshift-kube-apiserver-operator
openshift-kube-controller-manager-operator
openshift-kube-storage-version-migrator-operator

For example:
oc rsh -n openshift-apiserver-operator openshift-apiserver-operator-7f7cd7d86c-5bm49 curl -k --key /tmp/tls.key --cert /tmp/tls.crt https://localhost:8443/metrics > /tmp/metrics.txt

The curl commands succeeded, and the resulting /tmp/metrics.txt files were not empty.
 
Inspected the certificate with openssl to check the user in the subject CN; it is the prometheus-k8s service account.
openssl x509 -in tls.crt -noout -text|grep CN
 
Issuer: CN=kube-csr-signer_@1628567334
        Subject: CN=system:serviceaccount:openshift-monitoring:prometheus-k8s
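The CN check above can be scripted so that it gives a pass/fail result instead of a visual grep. This is only a sketch: the helper names cert_subject_cn and cert_is_for_user are made up for illustration, and the sed expression assumes the CN is the first (or only) component of the subject when printed in RFC2253 form, as it is for the certificate shown here.

```shell
# cert_subject_cn CERT_FILE
# Prints the CN of the certificate subject. -nameopt RFC2253 is used so
# the subject is printed in a compact "CN=..." form.
# Assumption: CN is the first component of the printed subject.
cert_subject_cn() {
    openssl x509 -in "$1" -noout -subject -nameopt RFC2253 |
        sed -n 's/^subject= *CN=\([^,]*\).*$/\1/p'
}

# cert_is_for_user CERT_FILE EXPECTED_CN
# Returns 0 if the subject CN equals EXPECTED_CN, non-zero otherwise.
cert_is_for_user() {
    [ "$(cert_subject_cn "$1")" = "$2" ]
}
```

Example use against the extracted certificate:
cert_is_for_user tls.crt system:serviceaccount:openshift-monitoring:prometheus-k8s && echo "CN OK"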
 
 
oc get pod -n openshift-kube-apiserver -l apiserver --show-labels
NAME                                                READY   STATUS    RESTARTS   AGE   LABELS
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-0   5/5     Running   0          25m   apiserver=true,app=openshift-kube-apiserver,revision=5
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-1   5/5     Running   0          32m   apiserver=true,app=openshift-kube-apiserver,revision=5
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-2   5/5     Running   0          29m   apiserver=true,app=openshift-kube-apiserver,revision=5

Changed the audit profile from Default to WriteRequestBodies in apiserver/cluster and waited for the kube-apiserver pods to restart on a new revision.
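The comment does not show how the audit profile was changed. One way to do it, assuming the profile lives in spec.audit.profile of the cluster-scoped APIServer config object, is a merge patch. The helper names audit_profile_patch and set_audit_profile are hypothetical and used only for this sketch.

```shell
# audit_profile_patch PROFILE
# Prints a JSON merge patch that sets spec.audit.profile to PROFILE.
audit_profile_patch() {
    printf '{"spec":{"audit":{"profile":"%s"}}}\n' "$1"
}

# set_audit_profile PROFILE
# Applies the patch to apiserver/cluster. Needs a logged-in oc session
# with enough rights to edit the cluster config (assumed, not shown in
# the original comment).
set_audit_profile() {
    oc patch apiserver cluster --type=merge -p "$(audit_profile_patch "$1")"
}
```

Calling set_audit_profile WriteRequestBodies should trigger a new kube-apiserver revision; the rollout can be followed with the same oc get pod command used in this comment by watching the revision label change.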
 
oc get pod -n openshift-kube-apiserver -l apiserver --show-labels
NAME                                                READY   STATUS    RESTARTS   AGE     LABELS
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-0   5/5     Running   0          95s     apiserver=true,app=openshift-kube-apiserver,revision=6
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-1   5/5     Running   0          8m18s   apiserver=true,app=openshift-kube-apiserver,revision=6
kube-apiserver-ci-ln-qvmriyb-f76d1-dt7gb-master-2   5/5     Running   0          5m5s    apiserver=true,app=openshift-kube-apiserver,revision=6
 
After the kube-apiserver restart, waited 15 minutes, then logged in to each master node and gathered the audit logs.
 
oc debug node/ci-ln-qvmriyb-f76d1-dt7gb-master-2 -T -- chroot /host grep '"requestURI":"/apis/authentication.k8s.io/v1/tokenreviews"' /var/log/kube-apiserver/audit.log > /tmp/all_tokenreviews_requests.log
 
grep '"status":{"authenticated":true,"user":{"username":"system:serviceaccount:openshift-monitoring:prometheus-k8s"' /tmp/all_tokenreviews_requests.log > /tmp/all_tokenreviews_for_serviceaccount_prometheus-k8s.log
 
jq '.user.username' /tmp/all_tokenreviews_for_serviceaccount_prometheus-k8s.log > /tmp/all_users_that_make_traffic_to_check_token_of_serviceaccount_prometheus-k8s.log
 
sort /tmp/all_users_that_make_traffic_to_check_token_of_serviceaccount_prometheus-k8s.log | uniq -c | sort -rh>/tmp/users.txt
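The grep/jq/sort steps above can also be collapsed into a single jq filter that selects on parsed JSON fields rather than on raw substrings. This is a sketch, not the procedure that was actually run: count_reviewers is a made-up helper name, it uses sort -rn instead of sort -rh, and it assumes each audit log line is one JSON event whose responseObject carries the TokenReview status (which is what the WriteRequestBodies profile is expected to record).

```shell
TOKENREVIEW_URI="/apis/authentication.k8s.io/v1/tokenreviews"
PROMETHEUS_SA="system:serviceaccount:openshift-monitoring:prometheus-k8s"

# count_reviewers [AUDIT_LOG...]
# For every TokenReview event that successfully authenticated the
# prometheus-k8s service account, prints the requesting user, counted
# and sorted by number of requests (highest first).
# Reads stdin when no file is given.
count_reviewers() {
    jq -r --arg uri "$TOKENREVIEW_URI" --arg sa "$PROMETHEUS_SA" '
        select(.requestURI == $uri
               and .responseObject.status.authenticated == true
               and .responseObject.status.user.username == $sa)
        | .user.username' "$@" |
        sort | uniq -c | sort -rn
}
```

For example, count_reviewers /tmp/all_tokenreviews_requests.log > /tmp/users.txt would produce a file comparable to the one built above, with unquoted user names.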
 
 
Checked that no token validation requests were sent to the kube-apiserver by the users below; the following loop is expected to print nothing.
 
for i in kube-apiserver openshift-apiserver openshift-controller-manager kube-scheduler kubelet node-exporter kube-controller-manager etcd; do grep "$i" /tmp/users.txt;done;
 
1 "system:serviceaccount:openshift-controller-manager:openshift-controller-manager-sa"
4 "system:kube-scheduler"
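The same check can be expressed as a function with an exit status, which is easier to use in an automated test than reading the loop output. A minimal sketch, assuming the counts file has the format shown above; check_no_core_reviewers and CORE_COMPONENTS are names invented for this example.

```shell
# Components checked by the loop in this comment.
CORE_COMPONENTS="kube-apiserver openshift-apiserver openshift-controller-manager kube-scheduler kubelet node-exporter kube-controller-manager etcd"

# check_no_core_reviewers FILE
# Prints every line of FILE that mentions one of the core components.
# Returns 1 if at least one line matched, 0 if none did.
check_no_core_reviewers() {
    file=$1
    found=0
    for component in $CORE_COMPONENTS; do
        if grep -- "$component" "$file"; then
            found=1
        fi
    done
    return "$found"
}
```

For example: check_no_core_reviewers /tmp/users.txt && echo PASS || echo FAIL. With the two lines of output shown above, this would report FAIL.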
 
 
Some targets still send TokenReview requests for the prometheus SA; filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1991900
Also, metrics cannot be gathered when the kube-apiserver is made unavailable; filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1990281

Comment 6 errata-xmlrpc 2021-10-18 17:42:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759