Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2091211

Summary: prometheus adapter tokens are not getting validated, causing load on the API server
Product: OpenShift Container Platform
Reporter: Abdoul Djire <adjire>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: high
Priority: high
Docs Contact:
Version: 4.10
CC: akashem, annelson, anpicker, asheth, dgrisonn, diflores, eglottma, hasun, hgomes, jolee, mfojtik, ngirard, pducai, spasquie, wking, xxia
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-06 14:26:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Abdoul Djire 2022-05-27 20:51:20 UTC
Description of problem:

After upgrading from 4.8, `oc` commands respond slowly and either do not complete or time out.
API calls were delayed and returned 429 errors when running the `oc` command with debug output.
We found errors related to the metrics API service in the kube-apiserver logs and a flood of "too many requests" messages in the KCM logs:
~~~
kube-controller-manager-pp101-qv9rg-master-2/cluster-policy-controller/cluster-policy-controller/logs/current.log:2022-05-20T20:02:33.302232777Z E0520 20:02:33.302161       1 reconciliation_controller.go:167] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server has received too many requests and has asked us to try again later
kube-controller-manager-pp101-qv9rg-master-2/cluster-policy-controller/cluster-policy-controller/logs/current.log:2022-05-20T20:03:14.335420305Z E0520 20:03:14.335351       1 reconciliation_controller.go:167] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server has received too many requests and has asked us to try again later
~~~
However, after exporting the localhost kubeconfig on one of the master nodes, the oc commands executed quickly.

Also, after scaling down the prometheus-adapter deployment in the openshift-monitoring namespace to 0 replicas, the oc commands started responding quickly.
It seems the Prometheus adapter is making too many requests, which fail with invalid bearer token errors and result in 429 errors while executing the `oc` command.
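
For reference, a minimal sketch of that scale-down (deployment and namespace names as reported above; note that the cluster-monitoring-operator normally manages this deployment and may reconcile the replica count back):
~~~
# Scale the adapter down for troubleshooting:
oc -n openshift-monitoring scale deployment prometheus-adapter --replicas=0

# Confirm no adapter pods are left serving the metrics API:
oc -n openshift-monitoring get pods | grep prometheus-adapter
~~~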

We tried deleting the prometheus-adapter-token secrets; the adapter pods restarted, but the issue is still present.
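
A rough sketch of that step, assuming the generated secrets follow the usual prometheus-adapter-token-<suffix> naming (list them first, since the suffix is random):
~~~
# List the service-account token secrets for the adapter:
oc -n openshift-monitoring get secrets | grep prometheus-adapter-token

# Delete them (the token controller recreates them) and restart the adapter pods:
oc -n openshift-monitoring delete secret <prometheus-adapter-token-xxxxx>
oc -n openshift-monitoring rollout restart deployment prometheus-adapter
~~~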



How reproducible: Not sure

Actual results:


Expected results:
oc commands should run quickly and not time out

Additional info:

Comment 1 Haoyu Sun 2022-05-30 15:37:35 UTC
Hello Abdoul,

In the must-gather archives, we see that the customer is currently running OCP 4.10.10 and that the logs contain many invalid-certificate errors. We advise upgrading the cluster to version 4.10.14 so that we can eliminate most of the certificate-related errors. Even if the API server overload persists, it will make further investigation much easier without this noise.

Could you please upgrade the cluster to version 4.10.14, collect a new must-gather, and share it on the support case?
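
For reference, a rough sketch of those steps (the exact upgrade path depends on the cluster's update channel):
~~~
# Check available updates and move to 4.10.14:
oc adm upgrade
oc adm upgrade --to=4.10.14

# After the upgrade settles, collect a fresh must-gather:
oc adm must-gather --dest-dir=./must-gather-4.10.14
~~~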

Best regards,

Haoyu

Comment 5 Haoyu Sun 2022-05-31 10:37:51 UTC
Hello Abdoul, 

Could you please ask the customer to share some screenshots of the graphs whose titles start with "Priority & Fairness:" in the OpenShift web console, under Observe -> Dashboards (dashboard: API Performance, apiserver: kube-apiserver)?
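
If screenshots are difficult to obtain, the data behind those panels can also be queried directly under Observe -> Metrics; a sketch of the kind of queries involved (these are the upstream API Priority and Fairness metrics, so exact labels may vary by version):
~~~
# Requests rejected by API Priority and Fairness, by priority level and reason:
sum by (priority_level, reason) (rate(apiserver_flowcontrol_rejected_requests_total[5m]))

# Current in-flight requests per priority level:
sum by (priority_level) (apiserver_flowcontrol_current_inflight_requests)
~~~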

Thank you.

--------------

Hello Diego,

Yes, the issue you mentioned is the same: request throttling on the metrics API. We can track them in this same bug record.

Comment 15 Abu Kashem 2022-06-02 17:48:27 UTC
highlighting what we have found so far:

kube-apiserver log (from the must-gather we looked at) has the following error:
> E0526 15:02:52.380171      16 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

This implies that kube-apiserver received a 429 status code from the 'prometheus adapter'. 
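One quick way to confirm which backend serves that group/version (a sketch; on a default install the APIService should point at the prometheus-adapter service in openshift-monitoring):
~~~
# Show the service backing the metrics API and its Available condition:
oc get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.spec.service.namespace}/{.spec.service.name}{"\n"}{.status.conditions[?(@.type=="Available")]}{"\n"}'
~~~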

We looked at the 'prometheus adapter' pod log file; it was practically empty, probably because the must-gather was captured after the prometheus adapter was scaled down?

the audit logs for Ford did not show any 429 status code.

so we want to have a fresh data capture - https://bugzilla.redhat.com/show_bug.cgi?id=2091211#c14
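
For the fresh capture, one way to look for 429s in the audit logs on a live cluster (a sketch; the grep patterns assume the standard JSON audit format):
~~~
# Stream the kube-apiserver audit logs from the masters and filter for
# 429 responses to metrics.k8s.io requests:
oc adm node-logs --role=master --path=kube-apiserver/audit.log \
  | grep '"code":429' | grep 'metrics.k8s.io' | head
~~~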

Comment 20 Abu Kashem 2022-06-04 14:24:16 UTC
> FYI new must-gather is attached in case 03223955 but it's too big to be attached to BZ. Please check.

Can you share the link here for the must-gather and audit logs?
Were you able to take a prometheus dump?
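
In case it helps, one rough way to capture a Prometheus data dump (a sketch only; the pod name, container name, and /prometheus data path are the usual defaults for the in-cluster prometheus-k8s pods, and the copy can be large):
~~~
# Copy the TSDB data directory from one prometheus replica:
oc -n openshift-monitoring cp prometheus-k8s-0:/prometheus ./prometheus-k8s-0-data -c prometheus
~~~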

Comment 23 Abu Kashem 2022-06-07 17:06:59 UTC
> must-gather.local.384055102101016969

Is this the right must-gather to look at? I went through the case and saw that to be the latest. I pulled it down. Please make sure you share a link to the attachment from the case so there is no confusion. Also, the Salesforce link does not work for us.


kas log:
> E0603 21:02:05.930626      17 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

I still see the 429 error coming from the prometheus adapter. But I did not see any logs from the prometheus adapter at all. Can you double-check the must-gather?

Comment 24 Abu Kashem 2022-06-07 17:22:02 UTC
> $  cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | head -n 1


first entry: 2022-06-02T05:06:16.179517373Z E0602 05:06:16.179426 

> $  cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | tail -n 1


last entry: 2022-06-03T21:08:31.188648222Z E0603 21:08:31.188543 

> $  cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | wc -l
> 1833



So the kas logs show that kube-apiserver is consistently getting 429 errors from the prometheus adapter.
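
To check whether this is isolated to a single master, the same grep can be repeated across all kube-apiserver logs in the must-gather (a sketch using the same paths as above):
~~~
for f in logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-*/kube-apiserver/kube-apiserver/logs/current.log; do
  echo "$f: $(grep 'v1beta1.metrics.k8s.io' "$f" | grep -c '429, Body: Too many requests, please try again later')"
done
~~~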

Comment 25 Abu Kashem 2022-06-07 17:28:22 UTC
 - there are lots of "http: proxy error: context canceled" errors in the openshift-monitoring namespace; not sure if they are related
 - lots of "Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid" errors in the openshift-monitoring namespace
 - "x509: certificate has expired or is not yet valid" errors

I am not sure if they are related, but it would be nice to get these errors looked at and ruled out.
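
To help rule the certificate errors in or out, the adapter's serving certificate and the aggregator client CA could be checked for validity; a sketch, assuming the serving-cert secret is named prometheus-adapter-tls (worth verifying the secret name first):
~~~
# Serving certificate presented by the adapter (secret name is an assumption):
oc -n openshift-monitoring get secret prometheus-adapter-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# requestheader/client CA bundle used by aggregated API servers
# (openssl shows only the first certificate if the bundle contains several):
oc -n kube-system get configmap extension-apiserver-authentication \
  -o jsonpath='{.data.requestheader-client-ca-file}' | openssl x509 -noout -dates
~~~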

Comment 26 Haoyu Sun 2022-06-08 08:00:50 UTC
The certificate expiration is not related to the HTTP 429 errors; it is related to authentication.

Can we confirm that the throttling happens only on the prometheus-adapter side and is not related to priority and fairness settings?

Comment 28 Abu Kashem 2022-06-09 15:32:50 UTC
hasun

> Can we confirm that the throttling happens only on the prometheus-adapter side and is not related to priority and fairness settings?

Yes, only on the prometheus-adapter side. I see the following error in the kas log:

> E0603 21:02:05.930626      17 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

This log tells me that kube-apiserver got a 429 from the aggregated apiserver for "v1beta1.metrics.k8s.io", which is the prometheus adapter, no? I did not see similar errors from any other aggregated API servers.



ngirard

> Even with the pods scaled down, we are still seeing the 429 when just trying to get the api model even though svc has no endpoints for the v1beta1.metrics.k8s.io apiservice.

good finding!

- can you run the kube-apiserver at the "Debug" log level (--v=4) and then share a new must-gather with us (with the prometheus adapter scaled down)? (see the sketch below)
- can you capture a prometheus data dump?
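
For the first item, a sketch of one way to raise the kube-apiserver log level through its operator (the operator rolls out new static pods, so allow a few minutes; remember to revert afterwards):
~~~
# Set the kube-apiserver operator log level to Debug (roughly --v=4):
oc patch kubeapiserver/cluster --type merge -p '{"spec":{"logLevel":"Debug"}}'

# Reproduce the issue, then collect a new must-gather:
oc adm must-gather --dest-dir=./must-gather-debug

# Revert when done:
oc patch kubeapiserver/cluster --type merge -p '{"spec":{"logLevel":"Normal"}}'
~~~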

Comment 29 Neil Girard 2022-06-09 18:48:37 UTC
Hello,

I have requested the new logs from the customer.  I am waiting for them to recreate the issue.

Also, I have already collected prometheus data if you would like to look at it.  I've reviewed some of the metrics and see a few things I find odd, but I honestly am not good at reading the apiserver_flowcontrol metrics.  I also had the customer run oc commands to capture the 429s.  One thing worth looking at is the constant growth of apiserver_current_inflight_requests{namespace="openshift-monitoring"} for the prometheus-adapter pods.
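
For reference, the kind of query being referred to (a sketch; the namespace label comes from the scrape target, i.e. the adapter's own /metrics endpoint, and the available labels may vary):
~~~
# In-flight requests reported by the prometheus-adapter itself, split by pod and request kind:
sum by (pod, request_kind) (apiserver_current_inflight_requests{namespace="openshift-monitoring"})
~~~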

Thanks,
Neil Girard

Comment 38 Simon Pasquier 2022-09-06 14:26:53 UTC

*** This bug has been marked as a duplicate of bug 2091902 ***