Description of problem:
A large number of "Throttling request" messages floods the kube-apiserver operator log in a short time.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-13-221558
OCP install: IPI on AWS

How reproducible:
Always

Steps to Reproduce:
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-745f6658c8-c9ggm | grep 'Throttling request' | wc -l
7030

Actual results:
1. The kube-apiserver operator log is filled with lots of throttling requests, see below,
...
I0514 09:21:22.546693 1 request.go:621] Throttling request took 1.787306428s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/configmaps/config
I0514 09:21:23.546696 1 request.go:621] Throttling request took 1.795265499s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-...-187.us-east-2.compute.internal
I0514 09:21:24.746697 1 request.go:621] Throttling request took 1.790598242s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/serviceaccounts/localhost-recovery-client
I0514 09:21:25.746707 1 request.go:621] Throttling request took 1.595281175s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-config-managed/secrets?labelSelector=encryption.apiserver.operator.openshift.io%2Fcomponent%3Dopenshift-kube-apiserver
I0514 09:21:26.746713 1 request.go:621] Throttling request took 1.593812045s, request: GET:https://....0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods?labelSelector=apiserver%3Dtrue
...
This seriously hampers log reading.

Expected results:
Reduce the throttling of requests so they do not pollute the kube-apiserver operator log.

Additional info:
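For triage, one way to gauge how bad the client-side throttling delays get is to pull out the reported durations and sort them; a minimal sketch (the deploy/kube-apiserver-operator log target and the awk field position are assumptions based on the log format above):

$ oc logs -n openshift-kube-apiserver-operator deploy/kube-apiserver-operator \
    | grep 'Throttling request' \
    | awk '{print $8}' | tr -d ',' | sort -n | tail -5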
Discussed with xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; it seems reading keys took a long time.

$ oc logs -n openshift-etcd etcd-member-6fmwc-m-0.c.openshift-qe.internal -c etcd-member | grep -n 'took too long' | wc -l
301
...
77:2020-05-14 09:24:41.420117 W | etcdserver: request "header:<ID:2425876772913985041 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" value_size:3098 >> failure:<>>" with result "size:16" took too long (215.750159ms) to execute
278:2020-05-14 09:24:41.420141 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca/configmap-cabundle-injector-lock\" " with result "range_response_count:1 size:472" took too long (330.950256ms) to execute
279:2020-05-14 09:24:41.420332 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-scheduler/installer-4-kewang-6fmwc-m-1.c.openshift-qe.internal\" " with result "range_response_count:1 size:1587" took too long (351.36024ms) to execute
280:2020-05-14 09:24:41.420580 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/kube-state-metrics\" " with result "range_response_count:1 size:2692" took too long (573.790763ms) to execute
281:2020-05-14 09:24:41.420658 W | etcdserver: read-only range request "key:\"/kubernetes.io/minions/\" range_end:\"/kubernetes.io/minions0\" " with result "range_response_count:5 size:25883" took too long (595.191608ms) to execute
282:2020-05-14 09:24:41.420729 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-sdn/openshift-network-controller\" " with result "range_response_count:1 size:454" took too long (576.390566ms) to execute
...
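To rank the slow etcd requests by their reported latency, something along these lines can be used (a sketch; the sed pattern matches the "took too long (...) to execute" format above, and a plain numeric sort is only meaningful while all durations share the ms unit):

$ oc logs -n openshift-etcd etcd-member-6fmwc-m-0.c.openshift-qe.internal -c etcd-member \
    | sed -n 's/.*took too long (\(.*\)) to execute.*/\1/p' \
    | sort -n | tail -5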
CC: mifiedle
Please ignore Comment 1, it's not from this bug's environment. Discussed with xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; it seems reading keys took a long time, see below,

$ etcd_pod=$(oc get pods -n openshift-etcd | grep etcd | head -1 | cut -d " " -f1)
$ oc logs -n openshift-etcd $etcd_pod -c etcd | grep -n 'took too long' | wc -l
423

Pasted a snippet of the log here,
...
681:2020-05-14 03:11:44.599784 W | etcdserver: read-only range request "key:\"/kubernetes.io/services/endpoints/openshift-authentication/oauth-openshift\" " with result "range_response_count:1 size:525" took too long (236.278694ms) to execute
682:2020-05-14 03:11:44.599945 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/openshift-state-metrics\" " with result "range_response_count:1 size:4866" took too long (271.452897ms) to execute
683:2020-05-14 03:11:44.600272 W | etcdserver: read-only range request "key:\"/kubernetes.io/cloudcredential.openshift.io/credentialsrequests/openshift-cloud-credential-operator/openshift-image-registry-openstack\" " with result "range_response_count:1 size:1601" took too long (203.700944ms) to execute
684:2020-05-14 03:11:44.600589 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-kube-apiserver/user-serving-cert-003\" " with result "range_response_count:0 size:6" took too long (235.389736ms) to execute
...

In the openshift-etcd-operator pod, found lots of the following errors,

$ oc logs -n openshift-etcd-operator etcd-operator-ff88476b7-tffgl | grep 'transport is closing' | wc -l
27477
I0514 03:16:58.171427 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
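To check whether that gRPC connection churn is constant or bursty, the log timestamps can be bucketed per minute; a rough sketch (the awk field position is an assumption based on the klog line format above):

$ oc logs -n openshift-etcd-operator etcd-operator-ff88476b7-tffgl \
    | grep 'transport is closing' \
    | awk '{print substr($2, 1, 5)}' | uniq -c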
Hit the bug again. We have one upgrade env, OCP 4.3.24 -> 4.4.7 -> 4.5.0 nightly. This caused bad performance of the system.

oc logs -n openshift-apiserver-operator openshift-apiserver-operator-67d856d6fd-tfpzb
also shows many "Throttling request took 1.787306428s" messages.

When we run the below command line,
while true; do oc get oauthclients.oauth.openshift.io; date; echo; sleep 1; done
it will intermittently hit:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)

The authentication operator reports many related errors:
$ oc logs -n openshift-authentication-operator authentication-operator-66c7d588dd-76dtn | grep 'the server is currently unable to handle the request' | wc -l
1637
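Since oauthclients.oauth.openshift.io is served by an aggregated API server, it may help to check which APIService is flapping to Unavailable; a quick sketch (filtering on the AVAILABLE column of the default table output, and v1.oauth.openshift.io being the expected APIService name for that group/version is an assumption):

$ oc get apiservices | grep -v True
$ oc get apiservice v1.oauth.openshift.io -o yaml | grep -A 5 'conditions:'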
Please don't hijack BZs with different topics; start a new BZ instead. Here in comment #7, this is obviously about etcd performance, not about throttling events in some operator log.
Found the right bug, bug 1825219, to track this issue.
Hello team,

Observed the same issue in a 4.6 cluster. Noticed high CPU usage on the nodes. Following is a snippet of the TOP command output and the etcd log on a master node.

1] TOP output
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2427 root 20 0 6269864 494828 64316 S 181.8 0.8 848:07.26 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote+
3671 nfsnobo+ 20 0 721912 42556 11236 S 68.2 0.1 86:18.39 /bin/node_exporter --web.listen-address=127.0.0.1:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/host/root --no-collector.wifi --no-collector.+
25168 root 20 0 2672940 420468 30672 S 54.5 0.6 1205:42 /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elast+
3073278 core 20 0 71620 6840 4712 R 22.7 0.0 0:00.09 top -bc
1 root 20 0 267676 26908 9052 S 18.2 0.0 176:06.21 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
3072973 core 20 0 101548 17308 8132 S 13.6 0.0 0:00.78 /usr/lib/systemd/systemd --user
1499 openvsw+ 10 -10 1247940 98004 34688 S 9.1 0.1 57:47.21 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openv+
2377 root 20 0 9706628 459056 35280 S 9.1 0.7 81:03.56 /usr/bin/crio --enable-metrics=true --metrics-port=9537
26747 1004800+ 20 0 1216088 101020 53224 S 9.1 0.2 4:15.34 node ./dist/server.js
957 root 20 0 320776 172580 127028 S 4.5 0.3 25:39.76 /usr/lib/systemd/systemd-journald
1368 root 20 0 223872 38252 36444 S 4.5 0.1 0:05.26 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
1408 openvsw+ 10 -10 81796 10256 4748 S 4.5 0.0 4:11.29 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private+
643038 1002660+ 20 0 4639344 209704 63028 S 4.5 0.3 1:17.29 /opt/rh/rh-dotnet21/root/usr/bin/dotnet ./app/publish/PapComprobanteApi.dll
```````````````````````````````````````````````````````````````````````````````````````````````````````````````

2] etcd log
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/dev-cm-emailrecaudador/cm-batch-mailrecaudador-9-2wwm2\" " with result "range_response_count:1 size:5540" took too long (433.202098ms) to execute
2021-04-17 16:36:23.348229 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-lxmaster01\" " with result "range_response_count:1 size:15220" took too long (433.617051ms) to execute
2021-04-17 16:36:23.348472 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/kube-system/certificate-controller\" " with result "range_response_count:1 size:368" took too long (433.980821ms) to execute
2021-04-17 16:36:23.348759 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/dev-prueba-variable/legajos-okd-1-hqp25.1676a5a82c0c66dd\" " with result "range_response_count:1 size:926" took too long (454.855935ms) to execute
2021-04-17 16:36:23.348970 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterrolebindings/system:node-proxiers\" " with result "range_response_count:1 size:397" took too long (469.170154ms) to execute
```````````````````````````````````````````````````````````````````````````````````````````````````````````````
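To confirm whether the CPU pressure is sustained across the control-plane nodes rather than a momentary spike, the metrics API can be queried directly (assumes cluster monitoring is running so the metrics are available):

$ oc adm top nodes
$ oc adm top pods -n openshift-etcd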
Hi ksalunkh,

I tried the latest 4.6 nightly and gave the cluster some workload; after a while I checked etcd and the cluster and no longer found this issue.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-05-08-154328   True        False         4h25m   Cluster version is 4.6.0-0.nightly-2021-05-08-154328

Added 60 pods in client-server mode,
$ oc get po -A | grep -c clusterbuster
60

If you think this is still an issue, you might open a new bug on the etcd component and attach the must-gather.
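For anyone re-verifying, a per-member count of slow etcd requests can be taken along these lines (the app=etcd label selector is an assumption about the pod labels in the openshift-etcd namespace):

$ for p in $(oc get pods -n openshift-etcd -l app=etcd -o name); do
    echo "$p: $(oc logs -n openshift-etcd $p -c etcd | grep -c 'took too long')"
  done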