Description of problem:
We have come across another cluster where an issue similar to https://bugzilla.redhat.com/show_bug.cgi?id=1748434 has been observed. Since that BZ was closed, we thought of opening a new one to understand our course of action for fixing this on v3.11.

---------
apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/system:serviceaccount:openshift-infra:unidling-controller",cluster="abc",code="500",endpoint="https",instance="<lb_ip>:443",job="apiserver",namespace="default",resource="events",scope="cluster",service="kubernetes",verb="LIST"} 2772

apiserver_request_count{client="openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0/system:serviceaccount:openshift-infra:unidling-controller",cluster="abc",code="504",contentType="cluster",endpoint="https",instance="lb_ip:443",job="apiserver",namespace="default",resource="events",service="kubernetes",verb="LIST"} 2783

apiserver_request_count{client="felix-stats/v0.0.0 (linux/amd64) kubernetes/$Format",cluster="abc",code="500",contentType="application/json",endpoint="https",instance="lb_ip:443",job="apiserver",namespace="default",resource="events",scope="cluster",service="kubernetes",verb="LIST"} 1
---------

Also, the errors below were seen in the API logs:

--------
E1012 16:29:41.699881       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1012 16:29:41.699929       1 writers.go:149] apiserver was unable to write a JSON response: http: Handler timeout
E1012 16:29:41.701173       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
--------

Version-Release number of selected component (if applicable):
v3.11.146

How reproducible:
OpenshiftAPIErrorsHigh warnings are still noticed.

Actual results:
KubeAPIErrorsHigh firing due to HTTP 500 and 504 errors reported in metrics.

Expected results:
The alert should not be triggered, especially when the API is working fine.
Looks like the unidling-controller lists events over the whole cluster: /api/v1/events. This times out after a minute. The controller should not do that; it is expected to fail eventually on big clusters.
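For context, the safer pattern for listing a very large collection is chunked (paginated) LIST requests using the API's limit/continue semantics, rather than one full-cluster LIST that can time out. The following is a minimal sketch of that continue-token loop; list_in_chunks and the fake backend are illustrative names, not the actual unidling-controller code.

```python
def list_in_chunks(list_page, limit=500):
    """Collect all items via paginated LIST calls.

    list_page(limit, continue_token) -> (items, next_token) mimics the
    Kubernetes LIST API's limit/continue semantics (hypothetical helper).
    """
    items, token = [], None
    while True:
        page, token = list_page(limit, token)
        items.extend(page)
        if not token:  # an empty continue token means the list is complete
            return items

# Fake backend standing in for the apiserver: 1200 events, served 500 at a time.
events = [f"event-{i}" for i in range(1200)]

def fake_list_page(limit, token):
    start = int(token or 0)
    end = min(start + limit, len(events))
    next_token = str(end) if end < len(events) else None
    return events[start:end], next_token

all_events = list_in_chunks(fake_list_page)
print(len(all_events))  # 1200
```

Each individual request stays small and fast, so a single slow full LIST never has to complete within the one-minute timeout.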
Is this a persistent issue? Events are supposed to be pruned approximately an hour after they fire. When deployments are stable, the event count on the cluster should be manageable for the apiserver. Are there particular events that are firing repeatedly?
Sharing the latest observations from the customer:

---
There was a job which had been failing for the last 12 days and generated 6230 job objects while failing. It was also permanently consuming all resources within their quota in the sandbox namespace, as the job in question was requesting 2GB of RAM for each job/pod (not very efficient when only curl was being executed). The CronJob itself was executed every minute, and each execution left behind 1 job object plus 4 crash-looping pods that filled up the 8 GB RAM quota for the namespace. All of the above was causing unwanted and unneeded stress on our OpenShift platform. "oc get events" for this namespace was stuck for about 2 minutes before returning results because of how huge the query was with all those quota error messages; there were thousands of events with errors.
---

Then I checked the SOS report again for the controller and etcd logs to see if they recorded any failures due to this faulty job.

---
E1020 07:57:07.483445       1 job_controller.go:789] pods "pi-1603116120-kmlgv" is forbidden: exceeded quota: quota-svc-its-sandbox4-test, requested: limits.cpu=500m,limits.memory=2000Mi, used: limits.cpu=2,limits.memory=8000Mi, limited: limits.cpu=2,limits.memory=8Gi
---

# cat master-logs_controllers_controllers | grep -i quota-svc-its-sandbox4-test | wc -l
75897

$ etcd_logs
---
2020-10-17 01:58:00.030504 W | etcdserver: request "header:<ID:16547760597330720220 username:\"abc.com\" auth_revision:1 > txn:<compare:<target:MOD key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" mod_revision:1042719945 > success:<request_put:<key:\"/openshift.io/clusterresourcequotas/quota-svc-its-sandbox4-test\" value_size:1760 >> failure:<>>" with result "size:22" took too long (228.984198ms) to execute
---

They have deleted the job, and the API error alerts have now stopped. We will keep monitoring the situation. Could such a failing job impact api<->etcd performance and end up triggering these alerts?
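The numbers in the job_controller quota error are internally consistent; a quick sanity check (all values taken from the log line above, the headroom calculation is my own):

```python
# Values from the "exceeded quota" error above.
used_memory_mi = 8000          # limits.memory used: 8000Mi (4 pods x 2000Mi)
limit_memory_gi = 8            # limits.memory limited: 8Gi
requested_memory_mi = 2000     # each new job pod requests 2000Mi

# 8Gi = 8 * 1024Mi = 8192Mi, so 8000Mi used leaves only 192Mi of headroom.
headroom_mi = limit_memory_gi * 1024 - used_memory_mi
print(headroom_mi)                        # 192

# The next 2000Mi pod cannot fit, hence every new pod is forbidden,
# generating a fresh failure (and events) on every CronJob run.
print(requested_memory_mi > headroom_mi)  # True
```

This is why the failure repeated indefinitely: the quota never freed up, so every minute's run was rejected and logged again.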
> Could such failure in the job create an impact on api<->etcd performance which then end up triggering such alerts?

Absolutely. Every one of these job failures was creating multiple events. The unidling controller essentially does an "oc get events" over the whole cluster in each reconciliation, and with sufficiently high event counts that can lead to timeouts or out-of-memory errors. In this case, 1 failing job per minute with 4 crash-looping pods per minute is at least 300 events per hour (perhaps an order of magnitude more). The right way to work around this issue is to determine why there are so many events on the cluster; in this case, it was a CronJob run amok.
After PR #145 and the comments here, removing the "needinfo" flag.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633