Created attachment 1243774 [details]
AWS stats that show the increase after the upgrade
Description of problem:
After the OCP 3.4 upgrade on the preview cluster, we noticed that etcd IOPS increased significantly. In some graphs the increase appears to be four- or five-fold.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Upgrade cluster to 3.4 from 3.3
2. Have 100 nodes running
etcd IOPS is crushing the io1 type drives attached to the master instances, with IOPS hitting 500 and up.
In 3.3 on this cluster with the same 100 nodes running we had about 100-120 iops.
Created attachment 1243775 [details]
IOPS graph from our zabbix monitoring
We have noticed that in 3.4 the traffic pattern is much noisier.
To ease the IOPS, could you try tuning the snapshot interval?
It'd be nice to understand the etcd package versions on all clusters where you have data. Do 3.3 clusters running etcd-3.x exhibit the same problems?
Besides Wesley's test cluster, we do not have any cluster that combines etcd 3.0.x with oc 3.3.x.x.
We only have etcd 2.3.7 with oc 3.3.x.x or etcd 3.0.15 with oc 3.4.x.x.
We are ready to try workaround solutions, such as the snapshot tuning you mentioned and creating new io1 instances with higher IOPS limits. We would still like to know whether etcd is working as it should and we need to plan for working with it as is, or whether this is a problem that will cause other issues. Can we make it dynamic, so that when it detects multiple sync failures between etcd members it slows down the rate of snapshotting? Maybe that's not a good idea for some reason that we are not aware of right now.
We have validated etcd 3.0.15 against OpenShift 3.4 a number of times in much larger scale environments, but those systems had dedicated hardware, and we always recommend SSDs where possible.
A general increase in iops is not a major concern so long as you can meet the hardware recommendations:
Also fwiw in our testing, we had our etcd cluster split out externally from our masters.
Are you following the recommended hw configurations listed above?
Are you having leader election issues?
Are you having write failures?
Or is this just a notification on the change?
Apparently we are not following the recommended specs as closely as we thought we were, at least as far as the IOPS numbers on the hardware recommendation page are concerned.
Not sure if it's leader election related, but we see errors such as this often:
etcdserver: request timed out, possibly due to previous leader failure
failed to send out heartbeat on time (exceeded the 500ms timeout for 4.352732302s)
server is likely overloaded
Write failures I don't see, but everything is complaining about
sync duration of 15.10874125s, expected less than 1s
but it seems to finish eventually.
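The messages quoted above can be pulled out of the etcd log mechanically to see how often, and by how much, syncs and heartbeats overrun. A minimal Python sketch, assuming the log lines carry the same wording as the samples in this comment; the function name `find_slow_events` and the one-second threshold are illustrative and not part of etcd.

```python
import re

# Patterns copied from the messages quoted in this comment; other etcd
# versions may word them differently (assumption).
SYNC_RE = re.compile(r"sync duration of ([0-9.]+)(ms|s), expected less than")
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the \S+ timeout for ([0-9.]+)(ms|s)\)"
)


def _seconds(value, unit):
    """Convert a duration such as ('200', 'ms') or ('15.1', 's') to seconds."""
    return float(value) / 1000.0 if unit == "ms" else float(value)


def find_slow_events(lines, min_seconds=1.0):
    """Return (kind, seconds) for each sync or heartbeat overrun >= min_seconds."""
    events = []
    for line in lines:
        for kind, pattern in (("sync", SYNC_RE), ("heartbeat", HEARTBEAT_RE)):
            match = pattern.search(line)
            if match:
                seconds = _seconds(match.group(1), match.group(2))
                if seconds >= min_seconds:
                    events.append((kind, seconds))
    return events
```

Feeding it the lines of the etcd log, from wherever that log is collected, gives a list that can be counted or plotted against the IOPS graphs.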
Anyway, as I said earlier, from an outsider's point of view, this looks like a regression, and we wanted to know if this is expected with the new etcd version and we just have to tailor the cloud provider resources to match it better for the cluster's needs.
> but everything is complaining about sync duration of 15.10874125s
That's basically a write issue, and it's taking much longer to perform the sync than it should.
> Anyway, as I said earlier, from an outsider's point of view, this looks like a regression, and we wanted to know if this is expected with the new etcd version and we just have to tailor the cloud provider resources to match it better for the cluster's needs.
Could you please try updating the snapshot interval and following the guidelines, and report back if you are seeing any long sync durations?
If there is a regression, it's not looking like upstream etcd from the aforementioned issue.
(In reply to Timothy St. Clair from comment #11)
> Could you please try updating the snapshot interval and following the
> guidelines and report back if you are seeing any long sync durations.
> If there is a regression, it's not looking like upstream etcd from the afore
> mentioned issue.
We are in the process of creating new volumes for etcd data, with increased iops limits. Will report back on results once it's deployed and running.
Would it be possible to see the results of setting the snapshot count to 5000? - https://coreos.com/etcd/docs/latest/tuning.html#snapshot-tuning
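For a rough sense of what the snapshot count does: etcd takes a snapshot after that many committed changes, so the time between snapshots is roughly the count divided by the write rate. A small Python sketch of that arithmetic; the write rate of 50 per second is a made-up figure, not a measurement from this cluster, and 10000 is included only for comparison.

```python
def seconds_between_snapshots(snapshot_count, writes_per_second):
    """Approximate time between etcd snapshots at a steady write rate."""
    if writes_per_second <= 0:
        raise ValueError("writes_per_second must be positive")
    return snapshot_count / writes_per_second


# Hypothetical steady rate of 50 committed writes per second (assumption).
for count in (5000, 10000):
    print(count, seconds_between_snapshots(count, 50.0))
```

Plugging in the cluster's real write rate would show how often a given setting triggers a snapshot.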
In debugging w/etcd-debug logs we see an obscene number of quorum reads, each of which causes a write.
biggest counts are:
but fix needs to be thought through.
The temporary workaround is to ensure your etcd instance is on a high-IOPS SSD drive where possible, and to monitor your IOPS rate.
Marking for the upcoming release. We aren't going to fix this in 3.5, but will hopefully do better in 3.6 (after the rebase and etcd3 switch).
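One way to watch the IOPS rate from the host itself is to sample the completed read and write counters in `/proc/diskstats` twice and divide the difference by the interval. A minimal Python sketch for Linux; the helper names are illustrative, and the caller has to supply the name of whichever block device backs the etcd data directory.

```python
import time


def read_io_counts(device, path="/proc/diskstats"):
    """Return (reads_completed, writes_completed) for a block device."""
    with open(path) as stats:
        for line in stats:
            fields = line.split()
            # Layout: major minor name reads_completed reads_merged
            # sectors_read ms_reading writes_completed ...
            if len(fields) > 7 and fields[2] == device:
                return int(fields[3]), int(fields[7])
    raise ValueError("device not found: " + device)


def iops(before, after, interval_seconds):
    """Combined read and write operations per second between two samples."""
    delta = (after[0] - before[0]) + (after[1] - before[1])
    return delta / interval_seconds


def measure_iops(device, interval_seconds=10.0):
    """Sample the device twice, interval_seconds apart, and return its IOPS."""
    first = read_io_counts(device)
    time.sleep(interval_seconds)
    second = read_io_counts(device)
    return iops(first, second, interval_seconds)
```

Calling `measure_iops` with the etcd device name on a master gives a host-side number to compare against the provisioned limit of the volume; the cloud provider may count operations differently from the kernel.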
Still waiting on 1.6 rebase
1.6.1 is in, moving to MODIFIED
While this bug can be tested prior, the full story requires OCP 3.6 installing in v3 storage/client mode and having migration for pre-3.6 etcd stores.
What is the final resolution? Reading GitHub, it sounds like an upgrade to etcd 3.1.
Or did we implement any other additional fixes?
Could you help verify the bug? thanks
Yes, I'll take QA on this.
Verified on 3.6.122
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
*** Bug 1570183 has been marked as a duplicate of this bug. ***