Bug 1250310
Summary: | CPU usage of etcd is too high after setting up etcd cluster | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Gaoyun Pei <gpei> |
Component: | Installer | Assignee: | Scott Dodson <sdodson> |
Status: | CLOSED WORKSFORME | QA Contact: | Ma xiaoqiang <xiama> |
Severity: | medium | Docs Contact: | |
Priority: | medium | | |
Version: | 3.0.0 | CC: | eparis, gpei, jchaloup, jokerman, libra-bugs, libra-onpremise-devel, matt, mmccomas, tstclair, xtian |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2015-09-10 17:07:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | 1250707 | | |
Bug Blocks: | | | |
Description
Gaoyun Pei 2015-08-05 06:25:16 UTC
Tim, have you seen anything like this? Here's the etcd.conf template we're using; are there other tuning changes we should make? https://github.com/openshift/openshift-ansible/blob/master/roles/etcd/templates/etcd.conf.j2

@Scott, with our wide-open raw k8s env we do not see this type of behavior, so I have several questions:

1. Are you using an external etcd now, i.e. using the actual unit files vs. your previously bundled version? If so, GOMAXPROCS is set in that unit file, but that doesn't explain the overload.
2. Are you starting from a clean etcd baseline, i.e. wiped the entire contents of /var/lib/etcd ...?
3. Can you scrape the API /metrics endpoints to find out which offending calls are eating all the bandwidth? Writes will now be much more expensive, and if there are a lot of writes in OpenShift that don't exist in raw k8s then there will be an issue, especially if there are overlapping list operations along with writes.

Appears to be TLS on peer connections. You may need an etcd bump - https://github.com/coreos/etcd/issues/2539

<tstclair> yichengq_ ping.. Do you regularly test with tls on peer connections? We are seeing a huge perf hit on peer tls. https://bugzilla.redhat.com/show_bug.cgi?id=1250310
<yichengq_> tstclair: 2.0 could have this problem because transport layer is not optimized
<yichengq_> tstclair: we have improved it at 2.1

We are running etcd version 2.1.1+git on our cluster; I will try to reconfigure our setup and verify.

Removing TLS from peer connections reduced the CPU usage to around 2%. Upgrading to 2.1 and re-enabling TLS shows CPU usage staying around 2% as well.

Partially remediated by https://github.com/openshift/openshift-ansible/pull/427. On my three VMs this cut usage in half, but that's on a cluster that's pretty much idle, and I'm not sure how well this remediation scales to an active cluster. Setting this ON_QA to get feedback from QA as to how much it improves the situation in their testing. I don't see this as a fix, however.

After setting up an env using the new openshift-ansible, the CPU usage taken by etcd is reduced a lot.

[root@etcd-1 ~]# ps aux | grep etcd
etcd 8800 12.9 0.5 26608 22716 ? Ssl 13:55 0:16 /usr/bin/etcd
[root@etcd-2 ~]# ps aux | grep etcd
etcd 6414 14.9 0.8 36408 32048 ? Ssl 13:55 0:17 /usr/bin/etcd
[root@etcd-3 ~]# ps aux | grep etcd
etcd 6531 22.4 0.5 27696 23192 ? Ssl 13:55 0:23 /usr/bin/etcd

I made a small comparison between the new env and the old one: create projects concurrently and get the average creation time of each project.

- Creating 30 projects concurrently: 16.309s (new) vs. 17.0467s (old)
- Creating 100 projects concurrently: 48.8168s (new) vs. 45.2532s (old); however, 29 requests failed on the old env due to TLS handshake timeout, while all 100 requests succeeded on the new env.

During the test, the CPU usage of the leader etcd in the new env topped out at 36.5% and started dropping when the requests finished, while the leader etcd in the old env once hit 107% CPU usage. Overall, the new env works better than the old one in my testing.

From the QE side, this issue mainly affects full functional testing when the three etcd servers are installed on master/node1/node2 (this is done to save instance usage on OpenStack). Builds or deployments would sometimes fail because etcd takes too much CPU resource on the nodes. I'd prefer to mark this as verified if etcd really works well during the next round of testing.

Gaoyun, OK, if you make it through the full functional suite we can mark this as verified.
-- Scott

*** This bug has been marked as a duplicate of bug 1250707 ***

During OSE-3.0.2 full functional testing, QE didn't encounter the same issue that happened in OSE-3.0.1. No build or deployment failed due to slow system performance. QE monitored the etcd CPU usage from beginning to end; the leader etcd topped out at 34.2% CPU and the follower etcd topped out at 18.2% CPU. It turns out the tuning to etcd is an acceptable workaround for OSE-3.0.x with etcd 2.0, so I'll change this to WORKSFORME. etcd version: etcd-2.0.13-2.el7.x86_64. Thanks.

I believe etcd 2.1.1 is already in the RHEL-Extras channel.

Hi Gaoyun,

can you test the same with etcd-2.1.1 [1] to confirm there is no regression? Or was it already tested with 2.1.1?

[1] https://brewweb.devel.redhat.com/buildinfo?buildID=452521

Jan

Hi Jan,

Thanks for providing the etcd-2.1.1 package. I tried building an ose-3.0.2 env with an etcd-2.1.1 cluster today. The CPU usage of etcd is around 1%~3%, which is much reduced. QE will start using etcd-2.1.1 in ose/aep from the next round of testing.

(In reply to Jan Chaloupka from comment #14)
> Hi Gaoyun,
>
> can you test the same for etcd-2.1.1 [1] if there is no regression? Or was
> it already tested with 2.1.1?
>
> [1] https://brewweb.devel.redhat.com/buildinfo?buildID=452521
>
> Jan
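For anyone reproducing the diagnosis and verification described above, here is a minimal shell sketch of the checks discussed in the thread: confirming whether peer traffic is configured for TLS, scraping the etcd /metrics endpoint, and sampling etcd CPU usage while a test run is in progress. The hostname, certificate paths, config file location, and sampling interval are illustrative assumptions, not values taken from this cluster.

    # Check whether peer connections are configured for TLS: an https:// peer
    # URL plus ETCD_PEER_* cert settings means peer traffic is encrypted.
    # (Assumes the /etc/etcd/etcd.conf layout laid down by openshift-ansible.)
    grep -E '^ETCD_(LISTEN_PEER_URLS|INITIAL_ADVERTISE_PEER_URLS|PEER_(CA|CERT|KEY)_FILE)' /etc/etcd/etcd.conf

    # Scrape the metrics endpoint suggested in the thread to see which calls
    # dominate; the host and CA path are placeholders, and --cert/--key would
    # be needed as well if client certificate auth is enforced.
    curl -s --cacert /etc/etcd/ca.crt https://etcd-1.example.com:2379/metrics

    # Record the installed etcd version and sample per-process CPU usage
    # every 5 seconds while the functional suite runs.
    rpm -q etcd
    while sleep 5; do ps -o %cpu=,rss=,cmd= -C etcd; done | tee /tmp/etcd-cpu.log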