Description of problem:
On multiple clusters installed on Azure with IPI, we are seeing timeouts and slowness with etcd.

Version-Release number of selected component (if applicable):
Seen on 4.2 and 4.3

How reproducible:
Unconfirmed

Steps to Reproduce:
Install an IPI cluster on Azure

Actual results:
etcd cluster "etcd": member communication with xxxxxxxxxxxxxxxx is taking 0.1979733333333334s on etcd instance 10.x.x.x:9979.
2020-02-12 11:35:21.017555 W | rafthttp: health check for peer xxxxxxxxxxxxxxxx could not connect: dial tcp 10.x.x.x:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-02-12 11:35:45.807396 W | rafthttp: health check for peer xxxxxxxxxxxxxxxx could not connect: dial tcp: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")

Expected results:
Successful health checks

Additional info:
Both clusters where this was seen had clock syncing problems, but those were resolved.
Both clusters use premium disks.
Rebooting an affected master seems to fix the issue for a period of time.
I also observed some messages like this during the outage:

embed: rejected connection from "10.0.0.6:51769" (error "EOF", ServerName etcd-0.cluster.example.com)

And even when the cluster is stable and working (with all etcd members healthy), there are a lot of these messages:

W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 105.137408ms)
W | etcdserver: server is likely overloaded
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/scheduler-kubeconfig\" " with result "range_response_count:1 size:867" took too long (403.657982ms) to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/kube-node-lease/cluster-qjrwn-worker-brazilsouth-fvq4d\" " with result "range_response_count:1 size:419" took too long (394.615802ms) to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/kube-node-lease/cluster-qjrwn-worker-brazilsouth-76bhv\" " with result "range_response_count:1 size:420" took too long (391.729576ms) to execute
[...]
etcdserver: read-only range request "key:..." took too long (310.0123213ms) to execute

This is strange, because it is currently a small cluster and the etcd data sits on "Premium SSD" disks.

Etcd member info:

+-----------------------+------------------+---------+---------+-----------+-----------+------------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.0.5:2379 | 40dba90b906e037b | 3.3.10  | 76 MB   | true      | 50        | 9727909    |
| https://10.0.0.6:2379 | d55695151c4d4f0c | 3.3.10  | 76 MB   | false     | 50        | 9727909    |
| https://10.0.0.7:2379 | e0af0a8719adc3f7 | 3.3.10  | 75 MB   | false     | 50        | 9727909    |
+-----------------------+------------------+---------+---------+-----------+-----------+------------+

Also, during an outage we verified that the following connections were not working (tested with curl and confirmed in log messages):

* Pod master1/kube-apiserver could not connect to master2/apiserver;
* Pod master2/kube-apiserver could not connect to master1/apiserver;
* Pod infra/prometheus-k8s-1 could not connect to master0/apiserver;
* The connections were failing with errors like "dial: tcp 10.128.0.22:8443: route to host".

In other words, some servers could not communicate with each other, returning a "route to host" error. Rebooting fixed the issue and curl between the servers started working again (see the sketch of the check below).
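For reference, the connectivity check was roughly of this form (a sketch only; the pod IP and port are taken from the error message above, and the /healthz path is an assumption):

# From one master node (or from inside its kube-apiserver pod), probe the peer apiserver pod IP:
curl -kv https://10.128.0.22:8443/healthz
# During the outage this failed with the "route to host" error quoted above;
# after rebooting the node, the same curl succeeded.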
Hi, I suspect this is not an etcd bug at all; it looks mostly like a networking (or kernel-space?) issue, so I'm moving it to the networking team. I'll upload some output from crio.ERROR in /tmp.
I also see:

91:2020-02-14 14:29:03.851645 W | rafthttp: health check for peer 7c0f2e5d5ac02df2 could not connect: dial tcp 10.0.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
92:2020-02-14 14:29:03.851689 W | rafthttp: health check for peer 7c0f2e5d5ac02df2 could not connect: dial tcp 10.0.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")

on 4.3.0-0.nightly-2020-02-13-214539 on Azure.

Component: Networking
Sub-component: OpenShiftSDN
Also on 4.3.0-0.nightly-2020-02-13-214539 on Azure "westus":

572:2020-02-14 14:42:43.975749 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "range_response_count:1 size:474" took too long (1.2125509s) to execute
573:2020-02-14 14:42:43.975876 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-sdn/openshift-network-controller\" " with result "range_response_count:1 size:442" took too long (2.1964767s) to execute
574:2020-02-14 14:42:43.976202 W | etcdserver: read-only range request "key:\"/kubernetes.io/config.openshift.io/clusteroperators/openshift-samples\" " with result "range_response_count:1 size:903" took too long (755.3249ms) to execute
575:2020-02-14 14:42:43.976932 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:5" took too long (102.5986ms) to execute
Thanks for the confirmation, Ross! As a quick note, I had the customer gather "top" and "free" output and confirmed the VM itself is not experiencing any resource contention at all.
I would be highly surprised if this was not disk I/O latency masquerading as a networking problem. If that is the case, the issue is known [1]. Can you provide the results of the following Prometheus queries?

# backend commit
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# fsync
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# peer round trip
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1798785
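If it helps, one way to pull these numbers (besides pasting the queries into the Prometheus UI behind the prometheus-k8s route in openshift-monitoring) is the Prometheus HTTP API via a port-forward; a rough sketch, assuming the default prometheus-k8s-0 pod name from the monitoring stack:

# Forward the Prometheus API port from one of the prometheus-k8s pods
oc -n openshift-monitoring port-forward pod/prometheus-k8s-0 9090:9090 &

# 99th percentile WAL fsync latency; repeat with the backend commit and peer round trip queries above
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'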
Hi,

The customer is using Premium SSDs (https://docs.microsoft.com/sv-se/azure/virtual-machines/linux/disks-types) -- would that be sufficient?

These queries would be run in the Prometheus running on the cluster, right?
> The customer is using Premium SSDs (https://docs.microsoft.com/sv-se/azure/virtual-machines/linux/disks-types) -- would that be sufficient?
> These queries would be run in the Prometheus running on the cluster, right?

I am pretty sure Premium_LRS ("Premium SSD") is also affected. Yes, these queries would be run on the cluster.
Greg has attached the output to the bz in a private comment. Is there a workaround or is it just, "use the best disk possible"?
@jdesousa ACK, I'll discuss with the account team and send the request to the customer.
(In reply to Caden Marchese from comment #55)
> rafthttp issue shown below:

There's a workaround to move past the raft issue in: https://bugzilla.redhat.com/show_bug.cgi?id=1765609
For the accelerated networking piece, I created the following PR to enable it on masters:

https://github.com/openshift/installer/pull/3197
https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-cli

According to Microsoft, enabling this does not increase costs, and if the OS does not support it for some reason, it falls back to being disabled.
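As a side note, on an existing cluster the setting can be checked (and, with the VM stopped/deallocated, toggled) per NIC with the Azure CLI; a rough sketch, with <nic-name> and <resource-group> as placeholders for the master's NIC and resource group:

# Check whether accelerated networking is enabled on a master's NIC
az network nic show --name <nic-name> --resource-group <resource-group> \
  --query enableAcceleratedNetworking

# Enable it (the VM generally needs to be stopped/deallocated first)
az network nic update --name <nic-name> --resource-group <resource-group> \
  --accelerated-networking true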
Setting Target Release to the current development branch (4.5.0) for investigation. Where fixes are created and backports are requested/required, clones will be created for those releases once identified.
Stephen, after the investigation done while troubleshooting this case, it seems that without these changes OpenShift won't perform properly on Azure at all. Please consider including these minor changes in an upcoming release rather than postponing them to 4.5.0 for investigation. Refer to all the previous updates to see how badly etcd performs on Azure even with premium disks: even without any workload, etcd running inside OpenShift is losing quorum and moving into leader elections.
Raft runtime changes landed in 4.4 and were backported to 4.3; they will ship in the next 4.3 z-stream and in 4.4 GA: https://bugzilla.redhat.com/show_bug.cgi?id=1806700

This bug is probably a dupe of bug 1806700.
Thanks for confirming, Sam. So, what is Stephen delaying to 4.5.0?