Bug 2091211
| Summary: | prometheus adapter tokens are not getting validated and causing the load on API server | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abdoul Djire <adjire> |
| Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.10 | CC: | akashem, annelson, anpicker, asheth, dgrisonn, diflores, eglottma, hasun, hgomes, jolee, mfojtik, ngirard, pducai, spasquie, wking, xxia |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-06 14:26:53 UTC | Type: | Bug |
| Regression: | --- | | |
Description
Abdoul Djire
2022-05-27 20:51:20 UTC
Hello Abdoul,

In the must-gather archives we see that the customer is currently running OCP 4.10.10 and that the logs contain many invalid-certificate errors. We advise upgrading the cluster to version 4.10.14 so that most of the certificate-related errors are eliminated. Even if the API server overload persists, further investigation will be much easier without this noise. Could you please upgrade the cluster to version 4.10.14, collect a new must-gather, and share it on the support case?

Best regards,
Haoyu

--------------

Hello Abdoul,

Could you please ask the customer to share screenshots of the graphs whose titles start with "Priority & Fairness: " in the OpenShift web UI, under Observe -> Dashboards (dashboard: API Performance, apiserver: kube-apiserver)? Thank you.

--------------

Hello Diego,

Yes, the issue you mentioned is the same: request throttling on the metrics API. We can track both in this same bug record. Highlighting what we have found so far, the kube-apiserver log (from the must-gather we looked at) has the following error:

> E0526 15:02:52.380171 16 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

This implies that kube-apiserver received a 429 status code from the prometheus-adapter. We looked at the prometheus-adapter pod log file and it was practically empty - probably because the must-gather was captured after the prometheus-adapter was scaled down? The audit logs for Ford did not show any 429 status code, so we want a fresh data capture - https://bugzilla.redhat.com/show_bug.cgi?id=2091211#c14

--------------

> FYI new must-gather is attached in case 03223955 but it's too big to be attached to BZ. Please check.
Can you share the link here for the must-gather and audit logs?
Were you able to take a Prometheus dump?
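Since the 429s appear to come from the aggregated metrics API rather than from kube-apiserver itself, one quick sanity check is to look at the APIService and hit it directly. This is only a minimal sketch, assuming the default v1beta1.metrics.k8s.io APIService and prometheus-adapter service in openshift-monitoring (names may differ on a customized cluster):

```shell
# Look at the APIService that prometheus-adapter backs and its endpoints.
# Names below assume a default OCP 4.10 monitoring stack.
oc get apiservice v1beta1.metrics.k8s.io -o yaml
oc -n openshift-monitoring get endpoints prometheus-adapter

# Hit the aggregated metrics API directly through kube-apiserver; compare
# any 429 seen here with the Priority & Fairness dashboards to tell whether
# the rejection comes from the adapter or from APF in kube-apiserver.
oc get --raw /apis/metrics.k8s.io/v1beta1/nodes
```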
--------------

> must-gather.local.384055102101016969

Is this the right must-gather to look at? I went through the case and saw that it was the latest, so I pulled it down. Please make sure you share a link to the attachment from the case so there is no confusion; also, the Salesforce link does not work for us.

kube-apiserver (kas) log:

> E0603 21:02:05.930626 17 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

I still see the 429 error coming from the prometheus-adapter, but I did not see any logs from the prometheus-adapter at all. Can you double check the must-gather?

> $ cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | head -n 1

First entry: 2022-06-02T05:06:16.179517373Z E0602 05:06:16.179426

> $ cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | tail -n 1

Last entry: 2022-06-03T21:08:31.188648222Z E0603 21:08:31.188543

> $ cat logs/namespaces/openshift-kube-apiserver/pods/kube-apiserver-pp101-qv9rg-master-0/kube-apiserver/kube-apiserver/logs/current.log | grep 'v1beta1.metrics.k8s.io' | grep '429, Body: Too many requests, please try again later' | wc -l
> 1833

So the kas logs show that kube-apiserver is consistently getting 429 errors from the prometheus-adapter.

- There are lots of "http: proxy error: context canceled" errors in the openshift-monitoring namespace; not sure if they are related.
- There are also lots of "Unable to authenticate the request due to an error: x509: certificate has expired or is not yet valid" errors in the openshift-monitoring namespace.

I am not sure if these are related, but it would be nice to get them looked at and ruled out.

--------------

The certificate expiration is not related to the HTTP 429 errors; it is related to authentication. Can we confirm that the throttling happens only on the prometheus-adapter side and is not related to priority and fairness settings?

--------------

hasun

> Can we confirm that the throttling happens only on the prometheus-adapter side, not related to priority and fairness settings?

Yes, only on the prometheus-adapter side. I see the following error in the kas log:

> E0603 21:02:05.930626 17 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 429, Body: Too many requests, please try again later.

This log tells me that kube-apiserver got a 429 from the aggregated apiserver for "v1beta1.metrics.k8s.io", which is the prometheus-adapter, no? I did not see similar errors from any other aggregated API server.

--------------

ngirard

> Even with the pods scaled down, we are still seeing the 429 when just trying to get the api model even though svc has no endpoints for the v1beta1.metrics.k8s.io apiservice.

Good finding!

- Can you run the kube-apiserver at "Debug" log level (--v=4) and then share a new must-gather with us (with the prometheus-adapter scaled down)?
- Can you capture a Prometheus data dump?
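A rough sketch of how those two captures could be set up, assuming the default kubeapiserver operator resource and the prometheus-adapter deployment name in openshift-monitoring; this is only an illustration, not a verified procedure for the customer cluster:

```shell
# Raise kube-apiserver verbosity via its operator; "Debug" should map to
# the requested -v=4 (valid levels: Normal, Debug, Trace, TraceAll). The
# operator rolls out new static pods, so this takes a few minutes.
oc patch kubeapiserver/cluster --type merge -p '{"spec":{"logLevel":"Debug"}}'

# Scale the adapter down for the capture; the cluster-monitoring operator
# may reconcile it back up, so treat this as a temporary test state.
oc -n openshift-monitoring scale deployment prometheus-adapter --replicas=0

# In the fresh must-gather, count the 429s per kube-apiserver pod
# (same path layout as the greps above).
grep -c 'ResponseCode: 429, Body: Too many requests' \
  logs/namespaces/openshift-kube-apiserver/pods/*/kube-apiserver/kube-apiserver/logs/current.log
```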
--------------

Hello,

I have requested the new logs from the customer and am waiting for them to recreate the issue.

Also, I have already collected Prometheus data if you would like to look at it. I've reviewed some of the metrics and see a few things I find odd, but I honestly am not good at reading the apiserver_flowcontrol metrics. I also had the customer run oc commands to capture the 429s. One thing worth looking at is the constant growth for the prometheus-adapter pods in apiserver_current_inflight_requests{namespace="openshift-monitoring"}.
Thanks,
Neil Girard
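For anyone picking this up, here is a sketch of how those metrics could be queried against the in-cluster monitoring stack; the thanos-querier route, token permissions, and label sets are assumed from a default OCP 4.10 setup:

```shell
# Query the in-cluster monitoring stack through the thanos-querier route;
# requires a token with cluster-monitoring-view (or cluster-admin) access.
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)

# In-flight requests reported by the prometheus-adapter pods
# (the series Neil mentions above).
curl -skG -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=apiserver_current_inflight_requests{namespace="openshift-monitoring"}' \
  "https://$HOST/api/v1/query"

# Requests rejected by kube-apiserver priority & fairness, to rule APF in
# or out as the source of the 429s.
curl -skG -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=sum by (flow_schema, reason) (rate(apiserver_flowcontrol_rejected_requests_total[5m]))' \
  "https://$HOST/api/v1/query"
```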
*** This bug has been marked as a duplicate of bug 2091902 ***