Bug 1802768 - Azure IPI rafthttp dial tcp i/o timeout prober ROUND_TRIPPER_RAFT_MESSAGE
Summary: Azure IPI rafthttp dial tcp i/o timeout prober ROUND_TRIPPER_RAFT_MESSAGE
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-13 19:39 UTC by Steven Walter
Modified: 2023-12-15 17:20 UTC (History)
CC List: 44 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-19 14:31:35 UTC
Target Upstream Version:
Embargoed:
scuppett: needinfo-



Description Steven Walter 2020-02-13 19:39:45 UTC
Description of problem:
On multiple clusters installed on Azure with IPI, seeing timeouts and slowness with etcd.

Version-Release number of selected component (if applicable):
Seen on 4.2 and 4.3

How reproducible:
Unconfirmed

Steps to Reproduce:
Install IPI cluster on Azure

Actual results:

etcd cluster "etcd": member communication with xxxxxxxxxxxxxxxx is taking 0.1979733333333334s on etcd instance 10.x.x.x:9979.
2020-02-12 11:35:21.017555 W | rafthttp: health check for peer xxxxxxxxxxxxxxxx could not connect: dial tcp 10.x.x.x:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-02-12 11:35:45.807396 W | rafthttp: health check for peer xxxxxxxxxxxxxxxx could not connect: dial tcp: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")

Expected results:
Successful health checks

Additional info:
Both clusters where this was seen had clock-syncing problems, but those were resolved.
The clusters have premium disks.
It seems that rebooting an affected master fixes it for a period of time.
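
For anyone triaging this, a minimal sketch of the kind of checks that can rule out clock skew and basic peer reachability from a master node (the peer IP is an example):

  # confirm the node clock is in sync (chrony is the default on RHCOS)
  chronyc tracking

  # confirm the etcd peer port on another master is reachable (example IP);
  # a TLS/certificate error here still proves the dial succeeded, unlike an i/o timeout
  curl -kv https://10.0.0.6:2380 --connect-timeout 5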

Comment 2 Hugo Cisneiros (Eitch) 2020-02-13 20:15:18 UTC
I also observed some messages like this during the outage:

embed: rejected connection from "10.0.0.6:51769" (error "EOF", ServerName etcd-0.cluster.example.com)

And even when the cluster is stable and working (with all etcd members healthy), there are a lot of these messages:

W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 105.137408ms)
W | etcdserver: server is likely overloaded
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/scheduler-kubeconfig\" " with result "range_response_count:1 size:867" took too long (403.657982ms) to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/kube-node-lease/cluster-qjrwn-worker-brazilsouth-fvq4d\" " with result "range_response_count:1 size:419" took too long (394.615802ms) to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/kube-node-lease/cluster-qjrwn-worker-brazilsouth-76bhv\" " with result "range_response_count:1 size:420" took too long (391.729576ms) to execute
[...]

etcdserver: read-only range request "key:..." took too long (310.0123213ms) to execute


This is strange because it's currently a small cluster (etcd data is on "Premium SSD" disks). Etcd member info:

+--------------------------+------------------+---------+---------+-----------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://10.0.0.5:2379    | 40dba90b906e037b |  3.3.10 |   76 MB |      true |        50 |    9727909 |
| https://10.0.0.6:2379    | d55695151c4d4f0c |  3.3.10 |   76 MB |     false |        50 |    9727909 |
| https://10.0.0.7:2379    | e0af0a8719adc3f7 |  3.3.10 |   75 MB |     false |        50 |    9727909 |
+--------------------------+------------------+---------+---------+-----------+-----------+------------+
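
For reference, a sketch of the kind of command used to gather the table above; the endpoints and certificate paths are examples and will differ per cluster:

  ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
    --endpoints=https://10.0.0.5:2379,https://10.0.0.6:2379,https://10.0.0.7:2379 \
    --cacert=/etc/ssl/etcd/ca.crt \
    --cert=/etc/ssl/etcd/peer.crt \
    --key=/etc/ssl/etcd/peer.key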

Also, during an outage, we confirmed that these connections weren't working (verified with curl and by checking the log messages):

* Pod master1/kube-apiserver couldn't connect to master2/apiserver;
* Pod master2/kube-apiserver couldn't connect to master1/apiserver;
* Pod infra/prometheus-k8s-1 couldn't connect to master0/apiserver;
* The connections were failing with errors such as: "dial tcp 10.128.0.22:8443: no route to host".

Which translates to: some servers couldn't communicate with each other, returning a "no route to host" error. Rebooting fixed the issue, and curl between the servers began to work again.
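
For reference, a minimal sketch of the kind of connectivity check that surfaced those failures (the pod IP and port are taken from the example error above):

  # run from a master node or another pod; "no route to host" here points at
  # host/SDN networking rather than at the apiserver process itself
  curl -kv https://10.128.0.22:8443/healthz --connect-timeout 5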

Comment 4 Steven Walter 2020-02-14 17:00:15 UTC
Hi,
I suspect this is not an etcd bug at all; it looks mostly like a networking (or kernel-space?) issue, so I'm moving this to the networking team.
I'll be uploading some output from crio.ERROR in /tmp.

Comment 8 Ross Brattain 2020-02-14 19:31:47 UTC
I also see 

91:2020-02-14 14:29:03.851645 W | rafthttp: health check for peer 7c0f2e5d5ac02df2 could not connect: dial tcp 10.0.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
92:2020-02-14 14:29:03.851689 W | rafthttp: health check for peer 7c0f2e5d5ac02df2 could not connect: dial tcp 10.0.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")


on 4.3.0-0.nightly-2020-02-13-214539 on Azure.

component: networking  sub-component: OpenShiftSDN

Comment 9 Ross Brattain 2020-02-14 19:34:32 UTC
also on 4.3.0-0.nightly-2020-02-13-214539 on Azure "westus"

572:2020-02-14 14:42:43.975749 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-service-ca-operator/service-ca-operator-lock\" " with result "range_response_count:1 size:474" took too long (1.2125509s) to execute
573:2020-02-14 14:42:43.975876 W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-sdn/openshift-network-controller\" " with result "range_response_count:1 size:442" took too long (2.1964767s) to execute
574:2020-02-14 14:42:43.976202 W | etcdserver: read-only range request "key:\"/kubernetes.io/config.openshift.io/clusteroperators/openshift-samples\" " with result "range_response_count:1 size:903" took too long (755.3249ms) to execute
575:2020-02-14 14:42:43.976932 W | etcdserver: read-only range request "key:\"kubernetes.io/health\" " with result "range_response_count:0 size:5" took too long (102.5986ms) to execute

Comment 10 Steven Walter 2020-02-14 20:01:42 UTC
Thanks for the confirmation, Ross!
As a quick note, I had the customer gather "top" and "free" output and confirmed that the VM itself is not experiencing any resource contention.

Comment 11 Sam Batschelet 2020-02-14 20:23:51 UTC
I would be highly surprised if this was not a result of disk I/O latency masquerading as a networking problem. If that is the case, the issue is known [1].

Can you provide the results of the following Prometheus queries?

# backend_commit
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))

# fsync
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# peer round trip
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1798785
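
One way to run these from the CLI against the in-cluster Prometheus (a sketch; the route name and namespace assume a default openshift-monitoring setup):

  PROM=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
  TOKEN=$(oc whoami -t)
  # repeat for each of the three queries above
  curl -sk -H "Authorization: Bearer ${TOKEN}" "https://${PROM}/api/v1/query" \
    --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'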

Comment 12 Steven Walter 2020-02-14 20:42:47 UTC
Hi,
Customer is using Premium SSDs https://docs.microsoft.com/sv-se/azure/virtual-machines/linux/disks-types -- would that be sufficient?
These queries would be run in Prometheus running on the cluster, right?

Comment 13 Sam Batschelet 2020-02-14 20:53:38 UTC
> Customer is using Premium SSDs https://docs.microsoft.com/sv-se/azure/virtual-machines/linux/disks-types -- would that be sufficient?
> These queries would be run in Prometheus running on the cluster, right?

I am pretty sure Premium_LRS "Premium SSD" is also affected.

These queries would be run on the cluster, yes.
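
As a rough way to measure whether the disk itself can keep up with etcd's fsync requirements, something like the commonly referenced fio test can be run on a master against the etcd data disk (a sketch; the directory and sizes are assumptions, and the test files should be cleaned up afterwards):

  # measure synced-write latency where /var/lib/etcd lives; etcd guidance is that
  # the 99th percentile fdatasync latency should stay well below 10ms
  mkdir -p /var/lib/etcd/fio-test
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-test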

Comment 15 Steven Walter 2020-02-14 21:14:58 UTC
Greg has attached the output to the bz in a private comment.
Is there a workaround, or is it just "use the best disk possible"?

Comment 41 Sergio G. 2020-02-17 11:35:19 UTC
@jdesousa ACK, I'll discuss with the account team and will send the request to the customer

Comment 58 Steven Walter 2020-02-18 17:59:47 UTC
(In reply to Caden Marchese from comment #55)
> rafthttp issue shown below:
> 
There's a workaround to move past the raft issue in:
https://bugzilla.redhat.com/show_bug.cgi?id=1765609

Comment 96 Ryan Howe 2020-02-27 02:00:42 UTC
For the accelerated networking piece, I created the following PR to enable this on masters:

  https://github.com/openshift/installer/pull/3197
  https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-cli

Enabling this does not increase costs, and according to Microsoft, if the OS does not support it for some reason, it will fall back to being disabled.
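
For anyone who wants to check or flip this on an existing cluster while the PR is pending, a sketch using the Azure CLI (the resource group and NIC names are placeholders, and the attached VM may need to be deallocated first depending on size/OS support):

  # check whether accelerated networking is already enabled on a master NIC
  az network nic show -g <resource-group> -n <master-nic> --query enableAcceleratedNetworking

  # enable it
  az network nic update -g <resource-group> -n <master-nic> --accelerated-networking true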

Comment 97 Stephen Cuppett 2020-03-06 13:03:29 UTC
Setting the Target Release to the current development branch (4.5.0) for investigation. Where fixes are created and backports are requested/required (if any), clones will be created for those releases once identified.

Comment 98 Sergio G. 2020-03-17 11:48:15 UTC
Stephen, after the investigation done during the troubleshooting of this case, it seems that without these changes OpenShift won't perform properly on Azure at all. Please consider including the minor changes in an upcoming release rather than postponing them to 4.5.0 for investigation.

Refer to all the previous updates to understand how badly etcd performs on Azure even with premium disks: even without any workload, etcd running inside OpenShift is losing quorum and moving into leader elections.

Comment 99 Sam Batschelet 2020-03-17 13:23:58 UTC
Raft runtime changes landed in 4.4 and were backported to 4.3; they will ship in the next 4.3 z-stream and in the 4.4 GA.

https://bugzilla.redhat.com/show_bug.cgi?id=1806700

This bug is probably a duplicate of bug 1806700.

Comment 100 Sergio G. 2020-03-18 07:20:25 UTC
Thanks for confirming, Sam. So, what is Stephen deferring to 4.5.0?

