Bug 1415839 - etcd traffic increased at least by a factor of 4 due to increase in quorum reads
Summary: etcd traffic increased at least by a factor of 4 due to increase in quorum reads
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod   
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jordan Liggitt
QA Contact: Mike Fiedler
URL:
Whiteboard: aos-scalability-35
Keywords: OpsBlocker
Duplicates: 1570183
Depends On:
Blocks:
Reported: 2017-01-23 21:50 UTC by ihorvath
Modified: 2018-12-15 14:35 UTC (History)
CC List: 30 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 05:17:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
AWS stats that show the increase after the upgrade (161.78 KB, image/jpeg)
2017-01-23 21:50 UTC, ihorvath
IOPS graph from our zabbix monitoring (214.36 KB, image/jpeg)
2017-01-23 21:55 UTC, ihorvath


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1716 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.6 RPM Release Advisory 2017-08-10 09:02:50 UTC
Red Hat Knowledge Base (Solution) 2916381 None None None 2017-09-23 05:32 UTC

Description ihorvath 2017-01-23 21:50:11 UTC
Created attachment 1243774
AWS stats that show the increase after the upgrade

Description of problem:
After the OCP 3.4 upgrade on the preview cluster, we noticed that etcd IOPS increased significantly. In some graphs the increase appears to be four- to fivefold.

Version-Release number of selected component (if applicable):
oc v3.4.0.39
kubernetes v1.4.0+776c994


How reproducible:
100%

Steps to Reproduce:
1. Upgrade cluster to 3.4 from 3.3
2. Have 100 nodes running

Actual results:
etcd IOPS is crushing the io1-type drive attached to the master instances, with IOPS hitting 500 and up.

Expected results:
In 3.3 on this cluster with the same 100 nodes running we had about 100-120 iops.
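
As a quick sanity check on the "4 or 5 fold" estimate, the ratio can be worked out from the figures quoted in this report (100-120 IOPS on 3.3, 500 and up on 3.4). This is only an illustrative sketch; `increase_factor` is a hypothetical helper name, not part of any tool mentioned here.

```python
def increase_factor(before_iops: float, after_iops: float) -> float:
    """Return how many times larger the new IOPS figure is than the old one."""
    if before_iops <= 0:
        raise ValueError("before_iops must be positive")
    return after_iops / before_iops


# Figures quoted in this report: 100-120 IOPS on 3.3, 500 and up on 3.4.
low = increase_factor(120, 500)   # against the upper end of the old range
high = increase_factor(100, 500)  # against the lower end of the old range
print(f"increase: {low:.1f}x to {high:.1f}x")
```

Since 500 is the floor of the observed 3.4 numbers, the real factor may be higher than this range suggests.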

Additional info:

Comment 1 ihorvath 2017-01-23 21:55 UTC
Created attachment 1243775
IOPS graph from our zabbix monitoring

Comment 3 Timothy St. Clair 2017-01-24 20:23:05 UTC
We have noticed in 3.4 that the traffic pattern is much noisier.

To ease the IOPS load, could you try tuning the snapshot interval?
https://coreos.com/etcd/docs/latest/tuning.html#snapshot-tuning

Comment 4 Scott Dodson 2017-01-24 20:36:57 UTC
It'd be nice to understand the etcd package versions on all clusters where you have data. Do 3.3 clusters running etcd-3.x exhibit the same problems?

Comment 5 ihorvath 2017-01-24 21:22:50 UTC
Besides Wesley's test cluster, we do not have any cluster that combines etcd 3.0.x with oc 3.3.x.x.

We only have etcd 2.3.7 with oc 3.3.x.x or etcd 3.0.15 with oc 3.4.x.x.


We are ready to try workarounds, such as the snapshot tuning you mentioned and creating new io1 volumes with higher IOPS limits. We would still like to know whether etcd is working as it should and we need to plan around this behavior, or whether this is a problem that will cause other issues. Can the snapshot rate be made dynamic, so that when etcd detects multiple sync failures between members it slows down snapshotting? Maybe that's not a good idea for some reason that we are not aware of right now.

Comment 6 Timothy St. Clair 2017-01-24 22:08:32 UTC
We have validated etcd 3.0.15 against OpenShift 3.4 a number of times in much larger-scale environments, but those systems had dedicated hardware, and we always recommend SSDs where possible.

A general increase in iops is not a major concern so long as you can meet the hardware recommendations: 
https://github.com/coreos/etcd/blob/master/Documentation/op-guide/hardware.md#hardware-recommendations

Also fwiw in our testing, we had our etcd cluster split out externally from our masters.

Comment 7 Timothy St. Clair 2017-01-24 22:23:38 UTC
Are you following the recommended hw configurations listed above?  

Are you having leader election issues?  

Are you having write failures?  

Or is this just a notification on the change?

Comment 8 ihorvath 2017-01-25 14:29:52 UTC
Apparently we are not following the recommended specs as closely as we thought we were, at least as far as the IOPS numbers on the hardware recommendations page go.

Not sure if it's leader election related, but we see errors such as this often:
etcdserver: request timed out, possibly due to previous leader failure
and
failed to send out heartbeat on time (exceeded the 500ms timeout for 4.352732302s)
server is likely overloaded

I don't see write failures, but everything is complaining about
sync duration of 15.10874125s, expected less than 1s
but it seems to finish eventually.
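
To get a feel for how often these warnings occur and how bad they get, the log lines can be tallied with a small script. The sketch below is a rough aid, not an official tool: the regular expressions are modelled only on the two messages quoted above (the exact wording may differ between etcd versions), they only handle durations printed in seconds, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns modelled on the warnings quoted in this comment.
# Assumption: durations are printed in seconds (e.g. "15.10874125s").
SYNC_RE = re.compile(r"sync duration of ([0-9.]+)s, expected less than ([0-9.]+)s")
HEARTBEAT_RE = re.compile(r"exceeded the (\d+)ms timeout for ([0-9.]+)s")


def summarize(lines):
    """Count slow-sync and late-heartbeat warnings and record the worst of each."""
    sync = []
    heartbeat = []
    for line in lines:
        match = SYNC_RE.search(line)
        if match:
            sync.append(float(match.group(1)))
            continue
        match = HEARTBEAT_RE.search(line)
        if match:
            heartbeat.append(float(match.group(2)))
    return {
        "slow_syncs": len(sync),
        "worst_sync_s": max(sync, default=0.0),
        "late_heartbeats": len(heartbeat),
        "worst_heartbeat_overrun_s": max(heartbeat, default=0.0),
    }


sample = [
    "failed to send out heartbeat on time (exceeded the 500ms timeout for 4.352732302s)",
    "server is likely overloaded",
    "sync duration of 15.10874125s, expected less than 1s",
]
print(summarize(sample))
```

In practice the input would be the etcd service log rather than the three sample lines used here.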

Anyway, as I said earlier, from an outsider's point of view, this looks like a regression, and we wanted to know if this is expected with the new etcd version and we just have to tailor the cloud provider resources to match it better for the cluster's needs.

Comment 9 Timothy St. Clair 2017-01-25 14:53:38 UTC
> but everything is complaining about sync duration of 15.10874125s

That's basically a write issue: the sync is taking much longer than it should.

> Anyway, as I said earlier, from an outsider's point of view, this looks like a regression, and we wanted to know if this is expected with the new etcd version and we just have to tailor the cloud provider resources to match it better for the cluster's needs.

filed: https://github.com/coreos/etcd/issues/7232

Comment 11 Timothy St. Clair 2017-01-25 19:51:33 UTC
Could you please try updating the snapshot interval and following the guidelines and report back if you are seeing any long sync durations.  

If there is a regression, it does not look like it is in upstream etcd, judging from the aforementioned issue.

Comment 12 ihorvath 2017-01-25 20:06:10 UTC
(In reply to Timothy St. Clair from comment #11)
> Could you please try updating the snapshot interval and following the
> guidelines and report back if you are seeing any long sync durations.  
> 
> If there is a regression, it's not looking like upstream etcd from the afore
> mentioned issue.

We are in the process of creating new volumes for etcd data, with increased iops limits. Will report back on results once it's deployed and running.

Comment 16 Timothy St. Clair 2017-01-31 20:41:49 UTC
Would it be possible to see the results of setting the snapshot count to 5000? - https://coreos.com/etcd/docs/latest/tuning.html#snapshot-tuning
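
For anyone trying this, the snapshot count can be set with etcd's `--snapshot-count` flag or the equivalent `ETCD_SNAPSHOT_COUNT` environment variable. The sketch below shows one way to set that variable in env-style configuration text. It rests on assumptions: that the etcd service reads such a file (commonly `/etc/etcd/etcd.conf` on RPM installs, which may not match your setup), that the sample file content is representative, and `set_env_option` is a hypothetical helper, not part of etcd or OpenShift.

```python
def set_env_option(conf_text: str, key: str, value: str) -> str:
    """Set KEY=value in env-style config text.

    Replaces the first existing (or commented-out) entry for the key,
    or appends a new line if the key is not present.
    """
    new_line = f"{key}={value}"
    out = []
    replaced = False
    for line in conf_text.splitlines():
        # Treat "#KEY=..." and "KEY=..." the same when looking for the entry.
        stripped = line.lstrip("# ").strip()
        if stripped.startswith(key + "=") and not replaced:
            out.append(new_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Made-up sample content for illustration only.
sample = 'ETCD_NAME=default\n#ETCD_SNAPSHOT_COUNT="10000"\n'
print(set_env_option(sample, "ETCD_SNAPSHOT_COUNT", "5000"))
```

etcd has to be restarted for a changed snapshot count to take effect, so on a multi-member cluster the members are best restarted one at a time.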

Comment 20 Timothy St. Clair 2017-02-08 17:04:18 UTC
In debugging with etcd debug logs, we see an obscene number of quorum reads, each of which causes a write.

The biggest counts are:
endpoints controller
secrets
serviceaccounts
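
A tally like the one above could be produced roughly as follows. This is only a sketch under stated assumptions: it takes etcd v2 request paths as input (the v2 keys API marks quorum reads with `quorum=true`), it assumes keys live under a `kubernetes.io` prefix, the sample paths are made up for illustration, and `quorum_reads_by_resource` is a hypothetical helper. Extracting the request paths from the actual debug log is left out, since the log format is not shown in this report.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def quorum_reads_by_resource(request_urls, prefix="/v2/keys/kubernetes.io/"):
    """Count quorum reads per top-level resource under the given key prefix."""
    counts = Counter()
    for url in request_urls:
        parts = urlsplit(url)
        # Only requests that explicitly ask for a quorum read are counted.
        if parse_qs(parts.query).get("quorum") != ["true"]:
            continue
        if not parts.path.startswith(prefix):
            continue
        resource = parts.path[len(prefix):].split("/", 1)[0]
        counts[resource] += 1
    return counts


# Hypothetical request paths, not taken from the real logs.
sample = [
    "/v2/keys/kubernetes.io/secrets/ns1/token-a?quorum=true",
    "/v2/keys/kubernetes.io/secrets/ns2/token-b?quorum=true",
    "/v2/keys/kubernetes.io/serviceaccounts/ns1/default?quorum=true",
    "/v2/keys/kubernetes.io/secrets/ns1/token-a",
]
for resource, count in quorum_reads_by_resource(sample).most_common():
    print(resource, count)
```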

Comment 21 Timothy St. Clair 2017-02-08 17:59:37 UTC
Root-caused: https://github.com/kubernetes/kubernetes/issues/41143

but the fix needs to be thought through.

Comment 22 Timothy St. Clair 2017-02-08 20:02:56 UTC
The temporary workaround is to ensure your etcd instance is on a high-IOPS SSD where possible, and to monitor your IOPS rate.

Comment 25 Eric Paris 2017-02-14 18:30:06 UTC
Marking as upcoming release. We aren't going to fix this in 3.5, but will hopefully do better in 3.6 (after the rebase and the etcd3 switch).

Comment 27 Andy Goldstein 2017-03-22 19:42:47 UTC
Still waiting on 1.6 rebase

Comment 28 Andy Goldstein 2017-04-17 14:17:36 UTC
Still waiting on 1.6 rebase

Comment 29 Andy Goldstein 2017-05-02 18:32:18 UTC
1.6.1 is in, moving to MODIFIED

Comment 30 Mike Fiedler 2017-05-02 19:29:51 UTC
While this bug can be tested beforehand, the full story requires OCP 3.6 installing in v3 storage/client mode and having migration for pre-3.6 etcd stores.

Comment 31 Aleksandar Kostadinov 2017-05-29 09:47:29 UTC
What is the final resolution? Reading GitHub, it sounds like an upgrade to etcd 3.1 [1].

Or did we implement any other additional fixes?

[1] https://github.com/kubernetes/kubernetes/issues/41143

Comment 43 DeShuai Ma 2017-06-05 07:39:53 UTC
Could you help verify the bug? Thanks.

Comment 44 Mike Fiedler 2017-06-05 12:30:27 UTC
Yes, I'll take QA on this.

Comment 52 Mike Fiedler 2017-07-06 13:52:21 UTC
Verified on 3.6.122

Comment 54 errata-xmlrpc 2017-08-10 05:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Comment 57 Ryan Howe 2018-04-26 21:14:26 UTC
*** Bug 1570183 has been marked as a duplicate of this bug. ***

