Bug 1887745
| Summary: | API server is throwing 5xx error code for 42.11% of requests for LIST events | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Rutvik <rkshirsa> |
| Component: | openshift-controller-manager | Assignee: | Otávio Fernandes <olemefer> |
| openshift-controller-manager sub component: | controller-manager | QA Contact: | wewang <wewang> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | adam.kaplan, aos-bugs, mfojtik, olemefer, openshift-bugs-escalate, rcarrier, rkshirsa, rsandu, scuppett, xxia |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | devex | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:25:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1935362 | | |
Description
Rutvik
2020-10-13 09:16:37 UTC
Looks like the unidling-controller lists events over the whole cluster: /api/v1/events. This times out after a minute. The controller should not do that, and it is expected to fail eventually on big clusters. Is this a persistent issue? Events are supposed to be pruned approximately an hour after they were fired, so when deployments are stable the event count on the cluster should stay manageable for the apiserver. Are there particular events that are firing repeatedly?

Sharing the latest observations from the customer:

---
There was a job that had been failing for the last 12 days and generated 6230 job objects while failing. It was also permanently allocating all resources within the quota of the sandbox namespace, since the job in question requested 2 GB of RAM for each job/pod (not very efficient when only curl was being executed). The CronJob itself ran every minute, and each execution left behind 1 job object plus 4 crash-looping pods that filled up the 8 GB RAM quota for the namespace. All of the above was putting unwanted and unneeded stress on our OpenShift platform. "oc get events" for this namespace was stuck for about 2 minutes before returning results because the query was so large with all of the quota error messages; there were thousands of error events.
---

I then checked the SOS report again for the controller and etcd logs to see whether they recorded any failures caused by this faulty job.

Controller logs:

---
E1020 07:57:07.483445 1 job_controller.go:789] pods "pi-1603116120-kmlgv" is forbidden: exceeded quota: quota-svc-its-sandbox4-test, requested: limits.cpu=500m,limits.memory=2000Mi, used: limits.cpu=2,limits.memory=8000Mi, limited: limits.cpu=2,limits.memory=8Gi
---

---
# cat master-logs_controllers_controllers | grep -i quota-svc-its-sandbox4-test | wc -l
75897
---

etcd logs:

---
2020-10-17 01:58:00.030504 W | etcdserver: request "header:<ID:16547760597330720220 username:\"abc.com\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" mod_revision:1042719945 > success:<request_put:<key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" value_size:1760 >> failure:<>>" with result "size:22" took too long (228.984198ms) to execute
---

They have deleted the job, and the API error alerts have now stopped. We will keep monitoring the situation. Could such a failure in the job impact api<->etcd performance and end up triggering these alerts?
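For reference, one quick way to quantify the kind of per-namespace event flood described above is to count the events and group them by reason. This is only an illustrative sketch, not a command from the report; the namespace name is a placeholder.

```
# Count the events in the affected namespace (placeholder name, not from the report).
oc get events -n <sandbox-namespace> --no-headers | wc -l

# Group events by reason to see which failure dominates (e.g. quota rejections).
oc get events -n <sandbox-namespace> -o jsonpath='{.items[*].reason}' \
  | tr ' ' '\n' | sort | uniq -c | sort -rn
```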
> Could such a failure in the job impact api<->etcd performance and end up triggering these alerts?

Absolutely. Every one of these job failures was creating multiple events. The unidling controller basically does "oc get events" across the whole cluster in each reconciliation, and with sufficiently high event counts that can lead to timeouts or out-of-memory errors. In this case, 1 failing job per minute plus 4 crash-looping pods per minute amounts to at least 300 events per hour (perhaps an order of magnitude more).
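As a rough sketch (not taken from the original report), the scale of that cluster-wide LIST and how close it is to the one-minute timeout can be gauged with standard oc commands:

```
# Rough size of the cluster-wide event list the controller has to retrieve.
oc get events --all-namespaces --no-headers | wc -l

# Time the raw LIST against the path mentioned above; anything approaching
# 60s is in the range where the controller's request starts timing out.
time oc get --raw /api/v1/events > /dev/null
```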
The right way to work around this issue is to determine why there are so many events on the cluster - which in this case was a CronJob run amok.
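As an illustration only (the CronJob and namespace names below are placeholders, not values from this report), tracking down and stopping the event source could look roughly like this:

```
# Find which namespaces are generating the most events.
oc get events --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}' \
  | sort | uniq -c | sort -rn | head

# Suspend the offending CronJob so it stops creating failing jobs and pods,
# then clean up the jobs it left behind (placeholder names).
oc patch cronjob <cronjob-name> -n <namespace> -p '{"spec":{"suspend":true}}'
oc delete jobs --all -n <namespace>
```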
After PR #145 and the comments here, removing the "needinfo" flag.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633