Bug 1835686
Summary: | Bad performance indicated by throttling requests in pod logs degrades authentication to openshift-apiserver | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Ke Wang <kewang>
Component: | kube-apiserver | Assignee: | Stefan Schimanski <sttts>
Status: | CLOSED WONTFIX | QA Contact: | Xingxing Xia <xxia>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.5 | CC: | aos-bugs, juzhao, kewang, ksalunkh, mfojtik
Target Milestone: | --- | Keywords: | Performance, Reopened, TestBlocker
Target Release: | 4.5.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-06-05 07:45:48 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Ke Wang
2020-05-14 10:18:18 UTC
Discussed with Xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; reading keys seems to take long:

$ oc logs -n openshift-etcd etcd-member-6fmwc-m-0.c.openshift-qe.internal -c etcd-member | grep -n 'took too long' | wc -l
301

...
77:2020-05-14 09:24:41.420117 W | etcdserver: request "header:<ID:2425876772913985041 username:\"etcd\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" mod_revision:0 > success:<request_put:<key:\"/kubernetes.io/secrets/openshift-kube-controller-manager/csr-signer-4\" value_size:3098 >> failure:<>>" with result "size:16" took too long (215.750159ms) to execute
278:2020-05-14 09:24:41.420141 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca/configmap-cabundle-injector-lock\" " with result "range_response_count:1 size:472" took too long (330.950256ms) to execute
279:2020-05-14 09:24:41.420332 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-scheduler/installer-4-kewang-6fmwc-m-1.c.openshift-qe.internal\" " with result "range_response_count:1 size:1587" took too long (351.36024ms) to execute
280:2020-05-14 09:24:41.420580 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/kube-state-metrics\" " with result "range_response_count:1 size:2692" took too long (573.790763ms) to execute
281:2020-05-14 09:24:41.420658 W | etcdserver: read-only range request "key:\"/kubernetes.io/minions/\" range_end:\"/kubernetes.io/minions0\" " with result "range_response_count:5 size:25883" took too long (595.191608ms) to execute
282:2020-05-14 09:24:41.420729 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-sdn/openshift-network-controller\" " with result "range_response_count:1 size:454" took too long (576.390566ms) to execute
...

CC: mifiedle

Please ignore Comment 1, it's not for this bug's environment.
Discussed with Xingxing, this could be a performance issue of etcd. Did a quick check on the etcd pod; reading keys seems to take long, see below:

$ etcd_pod=$(oc get pods -n openshift-etcd | grep etcd | head -1 | cut -d " " -f1)
$ oc logs -n openshift-etcd $etcd_pod -c etcd | grep -n 'took too long' | wc -l
423

Pasted snipped log here:
...
681:2020-05-14 03:11:44.599784 W | etcdserver: read-only range request "key:\"/kubernetes.io/services/endpoints/openshift-authentication/oauth-openshift\" " with result "range_response_count:1 size:525" took too long (236.278694ms) to execute
682:2020-05-14 03:11:44.599945 W | etcdserver: read-only range request "key:\"/kubernetes.io/deployments/openshift-monitoring/openshift-state-metrics\" " with result "range_response_count:1 size:4866" took too long (271.452897ms) to execute
683:2020-05-14 03:11:44.600272 W | etcdserver: read-only range request "key:\"/kubernetes.io/cloudcredential.openshift.io/credentialsrequests/openshift-cloud-credential-operator/openshift-image-registry-openstack\" " with result "range_response_count:1 size:1601" took too long (203.700944ms) to execute
684:2020-05-14 03:11:44.600589 W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-kube-apiserver/user-serving-cert-003\" " with result "range_response_count:0 size:6" took too long (235.389736ms) to execute
...

In the openshift-etcd-operator pod, found many of the following errors:

$ oc logs -n openshift-etcd-operator etcd-operator-ff88476b7-tffgl | grep 'transport is closing' | wc -l
27477

I0514 03:16:58.171427 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

Hit the bug again; we have one upgrade env, OCP 4.3.24 -> 4.4.7 -> 4.5.0 nightly. This caused bad performance of the system.
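A minimal sketch for triaging these warnings: extract the reported latency from each "took too long" line and count the ones above a threshold. This is not from the report itself; the two sample log lines in the heredoc are made up, and in practice the input would be piped in from `oc logs -n openshift-etcd $etcd_pod -c etcd`.

```shell
#!/bin/sh
# Count etcd "took too long" warnings whose reported latency exceeds a
# threshold (in ms). The heredoc below holds two hypothetical sample
# lines; feed real `oc logs` output on stdin instead.
threshold_ms=300

slow_count() {
  # sed pulls the number out of "took too long (NNN.NNNms)";
  # awk counts the values above the threshold.
  grep 'took too long' |
    sed -n 's/.*took too long (\([0-9.]*\)ms).*/\1/p' |
    awk -v t="$1" '$1 > t { n++ } END { print n + 0 }'
}

slow_count "$threshold_ms" <<'EOF'
2021-04-17 16:36:23.348229 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/demo/a\" " took too long (433.617051ms) to execute
2021-04-17 16:36:23.348472 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/demo/b\" " took too long (210.123456ms) to execute
EOF
```

awk does the numeric comparison because the latencies are fractional; a plain grep on leading digits would not handle the decimal values correctly.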
oc logs -n openshift-apiserver-operator openshift-apiserver-operator-67d856d6fd-tfpzb also shows many:

Throttling request took 1.787306428s

When we test by running the command line below:

while true; do oc get oauthclients.oauth.openshift.io; date; echo; sleep 1; done

it will intermittently hit:

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)

The authentication operator reports many related errors:

$ oc logs -n openshift-authentication-operator authentication-operator-66c7d588dd-76dtn | grep 'the server is currently unable to handle the request' | wc -l
1637

Please don't hijack BZs with different topics. Start a new BZ instead. Here in #7 this is obviously about etcd performance, not about throttling events in some operator log. Found bug 1825219, which tracks this issue.

Hello team, observed the same issue in a 4.6 cluster. Noticed high CPU usage on nodes. Following is a snippet of the top output and etcd logs on a master node.
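The intermittent failures above can be quantified with a small probe loop instead of eyeballing the while-true output. A sketch only: `probe` here is a hypothetical stub that fails on every third attempt so the script runs standalone; against a real cluster it would wrap `oc get oauthclients.oauth.openshift.io`.

```shell
#!/bin/sh
# Run a probe N times and report the failure rate.
attempts=9
fail=0
i=0

# Hypothetical stub: fails on every third call. For real use, replace
# the body with: oc get oauthclients.oauth.openshift.io >/dev/null 2>&1
probe() {
  [ $(( $1 % 3 )) -ne 0 ]
}

while [ "$i" -lt "$attempts" ]; do
  i=$((i + 1))
  probe "$i" || fail=$((fail + 1))
done

echo "$fail/$attempts requests failed"
```

A failure rate well above zero over a few minutes is what distinguishes the degraded state reported here from ordinary transient blips.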
1] top output
```````````````````````````````````````````````````````````
PID     USER     PR  NI  VIRT    RES    SHR    S %CPU  %MEM TIME+     COMMAND
2427    root     20  0   6269864 494828 64316  S 181.8 0.8  848:07.26 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote+
3671    nfsnobo+ 20  0   721912  42556  11236  S 68.2  0.1  86:18.39  /bin/node_exporter --web.listen-address=127.0.0.1:9100 --path.procfs=/host/proc --path.sysfs=/host/sys --path.rootfs=/host/root --no-collector.wifi --no-collector.+
25168   root     20  0   2672940 420468 30672  S 54.5  0.6  1205:42   /usr/bin/ruby /usr/local/bin/fluentd --suppress-config-dump --no-supervisor -r /usr/local/share/gems/gems/fluent-plugin-elasticsearch-4.1.1/lib/fluent/plugin/elast+
3073278 core     20  0   71620   6840   4712   R 22.7  0.0  0:00.09   top -bc
1       root     20  0   267676  26908  9052   S 18.2  0.0  176:06.21 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
3072973 core     20  0   101548  17308  8132   S 13.6  0.0  0:00.78   /usr/lib/systemd/systemd --user
1499    openvsw+ 10  -10 1247940 98004  34688  S 9.1   0.1  57:47.21  ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openv+
2377    root     20  0   9706628 459056 35280  S 9.1   0.7  81:03.56  /usr/bin/crio --enable-metrics=true --metrics-port=9537
26747   1004800+ 20  0   1216088 101020 53224  S 9.1   0.2  4:15.34   node ./dist/server.js
957     root     20  0   320776  172580 127028 S 4.5   0.3  25:39.76  /usr/lib/systemd/systemd-journald
1368    root     20  0   223872  38252  36444  S 4.5   0.1  0:05.26   /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
1408    openvsw+ 10  -10 81796   10256  4748   S 4.5   0.0  4:11.29   ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private+
643038  1002660+ 20  0   4639344 209704 63028  S 4.5   0.3  1:17.29   /opt/rh/rh-dotnet21/root/usr/bin/dotnet ./app/publish/PapComprobanteApi.dll
```````````````````````````````````````````````````````````

2] etcd log
```````````````````````````````````````````````````````````
W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/dev-cm-emailrecaudador/cm-batch-mailrecaudador-9-2wwm2\" " with result "range_response_count:1 size:5540" took too long (433.202098ms) to execute
2021-04-17 16:36:23.348229 W | etcdserver: read-only range request "key:\"/kubernetes.io/pods/openshift-kube-controller-manager/kube-controller-manager-lxmaster01\" " with result "range_response_count:1 size:15220" took too long (433.617051ms) to execute
2021-04-17 16:36:23.348472 W | etcdserver: read-only range request "key:\"/kubernetes.io/serviceaccounts/kube-system/certificate-controller\" " with result "range_response_count:1 size:368" took too long (433.980821ms) to execute
2021-04-17 16:36:23.348759 W | etcdserver: read-only range request "key:\"/kubernetes.io/events/dev-prueba-variable/legajos-okd-1-hqp25.1676a5a82c0c66dd\" " with result "range_response_count:1 size:926" took too long (454.855935ms) to execute
2021-04-17 16:36:23.348970 W | etcdserver: read-only range request "key:\"/kubernetes.io/clusterrolebindings/system:node-proxiers\" " with result "range_response_count:1 size:397" took too long (469.170154ms) to execute
```````````````````````````````````````````````````````````

Hi ksalunkh, I tried the latest 4.6 nightly and gave some workload to the cluster; after a while, checked etcd and the cluster, and no longer found this issue.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-05-08-154328   True        False         4h25m   Cluster version is 4.6.0-0.nightly-2021-05-08-154328

Added 60 pods of client-server mode:

$ oc get po -A | grep -c clusterbuster
60

If you think this is still an issue, you might open a new bug on the etcd component and attach the must-gather.