Bug 1290839 - Performance Degradation With 3 node etcd cluster all in vms on same system
Summary: Performance Degradation With 3 node etcd cluster all in vms on same system
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: etcd
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jan Chaloupka
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-11 15:35 UTC by Devan Goodwin
Modified: 2017-07-24 13:29 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-24 13:29:49 UTC
Target Upstream Version:
Embargoed:


Attachments
Vagrantfile used to bring up this environment. (5.86 KB, text/plain), 2015-12-11 15:35 UTC, Devan Goodwin
etcd.conf from one of the affected cluster members (1.02 KB, text/plain), 2015-12-11 15:35 UTC, Devan Goodwin

Description Devan Goodwin 2015-12-11 15:35:04 UTC
Created attachment 1104731 [details]
Vagrantfile used to bring up this environment.

Description of problem:

This issue surfaced in attempted local testing of an HA OpenShift environment with 3 OpenShift masters each running a separate etcd process, 3 nodes, and a load balancer (7 VMs in total). Each VM is using a virtual disk backed by the same physical disk.

The Vagrant box is the RHEL 7.1 CDK image.

Vagrantfile attached.

Version-Release number of selected component (if applicable): etcd-2.1.1-2.el7.x86_64


How reproducible: Very


Steps to Reproduce:

Perform openshift-ansible installation for an HA environment with this VM configuration.

Actual results:

The atomic-openshift-master-api service makes a number of raw etcd requests on first startup. These are very slow (~2s each in this case), and the service is still working through them when it hits the 90-second systemd service startup timeout, which kills the process and thus the install.

Example from logs:

Dec 08 12:15:45 master1.aos.example.com atomic-openshift-master-api[26581]: [2.316744162s] [2.316744162s] END
Dec 08 12:15:48 master1.aos.example.com atomic-openshift-master-api[26581]: I1208 12:15:48.242100   26581 trace.go:57] Trace "Update *api.ClusterPolicyBinding" (started 2015-12-08 12:15:45.757181437 -0500 EST):
Dec 08 12:15:48 master1.aos.example.com atomic-openshift-master-api[26581]: [2.484902084s] [2.484902084s] END
Dec 08 12:15:49 master1.aos.example.com atomic-openshift-master-api[26581]: I1208 12:15:49.861606   26581 trace.go:57] Trace "Update *api.ClusterPolicyBinding" (started 2015-12-08 12:15:48.246875312 -0500 EST):
Dec 08 12:15:49 master1.aos.example.com atomic-openshift-master-api[26581]: [1.614713007s] [1.614713007s] END
Dec 08 12:15:52 master1.aos.example.com atomic-openshift-master-api[26581]: I1208 12:15:52.579077   26581 trace.go:57] Trace "Update *api.ClusterPolicyBinding" (started 2015-12-08 12:15:49.866518612 -0500 EST):
Dec 08 12:15:52 master1.aos.example.com atomic-openshift-master-api[26581]: [2.712543775s] [2.712543775s] END
Dec 08 12:15:55 master1.aos.example.com atomic-openshift-master-api[26581]: I1208 12:15:55.054774   26581 trace.go:57] Trace "Update *api.ClusterPolicyBinding" (started 2015-12-08 12:15:52.583697062 -0500 EST):
Dec 08 12:15:55 master1.aos.example.com atomic-openshift-master-api[26581]: [2.471049882s] [2.471049882s] END
Dec 08 12:15:56 master1.aos.example.com systemd[1]: atomic-openshift-master-api.service start operation timed out. Terminating.
Dec 08 12:15:56 master1.aos.example.com systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 08 12:15:56 master1.aos.example.com systemd[1]: Failed to start Atomic OpenShift Master API.
Dec 08 12:15:56 master1.aos.example.com systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Dec 08 12:15:56 master1.aos.example.com systemd[1]: atomic-openshift-master-api.service failed.
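As a workaround for the startup kill (not a fix for the underlying etcd latency), the 90-second default could presumably be raised with a systemd drop-in; this is only a sketch, assuming the stock unit name shown in the logs above:

```ini
# /etc/systemd/system/atomic-openshift-master-api.service.d/timeout.conf
# Hypothetical drop-in: extends the start timeout so the slow etcd
# requests do not get the API server killed mid-startup.
[Service]
TimeoutStartSec=300
```

This would be followed by `systemctl daemon-reload` and a restart of the service.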


The problem is reproducible with raw etcd curl commands on the affected VMs:

Ping reports 0.3 ms or less between the VMs.

Write:

(root@master1 ~) $ time curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key -H "Content-Type: application/json" https://192.168.133.10:2379/v2/keys/hello -XPUT -d value="world 2"
{"action":"set","node":{"key":"/hello","value":"","modifiedIndex":78,"createdIndex":78},"prevNode":{"key":"/hello","value":"","modifiedIndex":77,"createdIndex":77}}
0.03s user 0.02s system 3% cpu 1.509 total

Read:

(root@master1 ~) $ time curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key -H "Content-Type: application/json" https://192.168.133.10:2379/v2/keys/hello
{"action":"get","node":{"key":"/hello","value":"","modifiedIndex":79,"createdIndex":79}}
0.03s user 0.03s system 84% cpu 0.064 total

Quorum read:

(root@master1 ~) $ time curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key -H "Content-Type: application/json" "https://192.168.133.10:2379/v2/keys/hello?quorum=true"
{"action":"get","node":{"key":"/hello","value":"","modifiedIndex":79,"createdIndex":79}}
0.04s user 0.03s system 5% cpu 1.217 total


These commands were run on the same system as the URL being hit; if I point them at another node in the cluster, the time jumps to exactly the ~2s we're seeing at service startup.
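To gather more than a one-off sample per node, curl's built-in timing output can be looped; a minimal sketch reusing the cert paths from the commands above (the URL would be swapped for each cluster member being measured, and the key name is made up):

```shell
# Hypothetical latency sampler: 10 writes against one member, printing
# only curl's total time per request via the -w write-out variable.
ETCD_URL="https://192.168.133.10:2379/v2/keys/bench"
for i in $(seq 1 10); do
  curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt \
       --key /etc/etcd/peer.key -s -o /dev/null \
       -w '%{time_total}\n' \
       -XPUT -d value="sample-$i" "$ETCD_URL"
done
```

Sorting the resulting ten numbers makes it easy to see whether the ~2s remote-write latency is consistent or just an occasional spike.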


In case it's relevant, disk read speeds within the guest:


$ hdparm -Tt /dev/vda2

/dev/vda2:
 Timing cached reads:   12540 MB in  2.00 seconds = 6273.90 MB/sec
 Timing buffered disk reads:  16 MB in  3.51 seconds =   4.56 MB/sec

And on the host: (same disk)

$ sudo hdparm -Tt /dev/sdb1

/dev/sdb1:
 Timing cached reads:   26900 MB in  2.00 seconds = 13463.35 MB/sec
 Timing buffered disk reads: 314 MB in  3.01 seconds = 104.39 MB/sec
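Since etcd fsyncs its write-ahead log before acknowledging writes (a general property of etcd, not something stated in this report), the buffered-read figures from hdparm may only tell part of the story; synchronous write latency on the shared disk likely matters more. A rough probe, assuming GNU dd's `oflag=dsync` and a scratch path:

```shell
# Hypothetical probe: time 100 small synchronous writes, roughly
# mimicking the per-commit fsync pattern of etcd's WAL.
dd if=/dev/zero of=/tmp/fsync-probe bs=512 count=100 oflag=dsync
rm -f /tmp/fsync-probe
```

Running this inside each guest and on the host would show how much of the write latency is the shared physical disk rather than the network or etcd itself.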


etcd.conf the installer generates is also attached.


NOTE: I am not sure if this is a bug or just poor VM configuration. We do have a report of almost exactly the same behavior when running etcd in 3 Docker containers on the same host (via Jordan Liggitt).

Comment 1 Devan Goodwin 2015-12-11 15:35:35 UTC
Created attachment 1104732 [details]
etcd.conf from one of the affected cluster members

Comment 2 Timothy St. Clair 2015-12-11 15:51:03 UTC
This seems odd to me in light of https://bugzilla.redhat.com/show_bug.cgi?id=1250707

@rrati, could you do a quick verification on our cluster?

Also we've verified this numerous times using etcdctl - 

# etcdctl -C https://ose3-master1.ansible.test.example.com:2379,https://ose3-master2.ansible.test.example.com:2379,https://ose3-master3.ansible.test.example.com:2379 --ca-file /etc/etcd/ca.crt --cert-file=/etc/openshift/master/master.etcd-client.crt --key-file=/etc/openshift/master/master.etcd-client.key cluster-health 

# etcdctl -C https://ose3-master1.ansible.test.example.com:2379,https://ose3-master2.ansible.test.example.com:2379,https://ose3-master3.ansible.test.example.com:2379 --ca-file /etc/etcd/ca.crt --cert-file=/etc/openshift/master/master.etcd-client.crt --key-file=/etc/openshift/master/master.etcd-client.key member list
 
+ getting & setting keys, etc.

Comment 5 Andy Goldstein 2016-06-24 13:50:37 UTC
Is this still an issue?

