Bug 1431222 - etcd 3.0.17 creates high load when configured with TLS
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: etcd
Version: 25
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: high
Assigned To: Jan Chaloupka
QA Contact: Fedora Extras Quality Assurance
Reported: 2017-03-10 11:18 EST by Spyros Trigazis
Modified: 2017-03-22 15:49 EDT
Last Closed: 2017-03-22 15:49:38 EDT
Type: Bug

Attachments:
notes from irc chat with clayton (3.35 KB, text/plain)
2017-03-10 16:20 EST, Dusty Mabe
Description Spyros Trigazis 2017-03-10 11:18:18 EST
Description of problem:
If etcd is configured with TLS, it creates a massive amount of load on the host.
If the host is powerful enough and you don't put actual load on etcd, you
might not notice.

Version-Release number of selected component (if applicable):
3.0.17

How reproducible:
Always


Steps to Reproduce:
1. On a Fedora Atomic host with etcd 3.0.17
2. Create TLS certificates: a CA, a cert and a key.
3. Configure etcd as follows:
ETCD_NAME=<ip>
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
ETCD_LISTEN_CLIENT_URLS=https://<ip>:2379
ETCD_LISTEN_PEER_URLS=https://<ip>:2380

ETCD_ADVERTISE_CLIENT_URLS=https://<ip>:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://<ip>:2380
ETCD_DISCOVERY=https://discovery.etcd.io/8765108671b4ee96a9efaf0f6714164a
ETCD_TRUSTED_CA_FILE=/srv/kubernetes/ca.crt
ETCD_CERT_FILE=/srv/kubernetes/server.crt
ETCD_KEY_FILE=/srv/kubernetes/server.key
ETCD_PEER_TRUSTED_CA_FILE=/srv/kubernetes/ca.crt
ETCD_PEER_CERT_FILE=/srv/kubernetes/server.crt
ETCD_PEER_KEY_FILE=/srv/kubernetes/server.key

4. Start etcd
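
Before starting etcd in step 4, it may help to rule out plain path or permission mistakes in the TLS settings. The following is only a sketch: `check_etcd_tls_files` is a hypothetical helper (not part of etcd) that reads an environment file like the one in step 3 and reports whether each `ETCD_*_FILE` entry points at a readable file. It assumes plain `KEY=VALUE` lines without quoting.

```shell
# Hypothetical helper: check that every ETCD_*_FILE entry in an etcd
# environment file refers to a file the current user can read.
# Assumes simple KEY=VALUE lines (no quotes, no variable expansion).
check_etcd_tls_files() {
    local conf=$1 key value status=0
    while IFS='=' read -r key value; do
        case $key in
            ETCD_*_FILE)
                if [ -r "$value" ]; then
                    echo "ok      $key=$value"
                else
                    echo "missing $key=$value"
                    status=1
                fi
                ;;
        esac
    done < "$conf"
    # Non-zero exit status if at least one file was not readable.
    return "$status"
}

# Example (not run here): check_etcd_tls_files /etc/etcd/etcd.conf
```

Run it as the user etcd runs as, since root can read files regardless of their mode. A clean result here does not rule out the load problem described in this bug; it only excludes a misconfigured path.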

Actual results:
A lot of:
Mar 09 14:13:02 te-kvz2ov3ur-0-dovsiocfsl5c-swarm-master-bxb4bk5pthcy.novalocal etcd[2038]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.

See [1].

Expected results:
etcd runs without the above error and with reasonable load.


Additional info:
Here is the fun part: using etcd from the docker image registry.fedoraproject.org/f25/etcd (3.0.17) doesn't work either.
But, using gcr.io/google_containers/etcd:3.0.17 or quay.io/coreos/etcd:v3.1.1 works fine.

etcd 3.0.15 in fedora doesn't suffer from this issue. In [2], etcd works fine.

[1] http://logs.openstack.org/02/443002/2/check/gate-functional-dsvm-magnum-swarm-ubuntu-xenial/50d25b2/logs/cluster-nodes/master-test_start_stop_container_from_api-172.24.5.8/etcd.txt.gz
[2] https://kojipkgs.fedoraproject.org/compose/twoweek/Fedora-Atomic-25-20170205.0/compose/CloudImages/x86_64/images/
Comment 1 Dusty Mabe 2017-03-10 16:20 EST
Created attachment 1262080 [details]
notes from irc chat with clayton

I chatted with clayton about this and he's not sure what exactly it could be. I copied the notes from the IRC log here in case our conversation triggers something in other people's minds.
Comment 2 Spyros Trigazis 2017-03-11 12:31:35 EST
Detailed steps to reproduce. Can someone try to reproduce?

1. Boot a fresh atomic host with etcd 3.0.17. I used [1].
2. Configure etcd with TLS like so:

HOST_IP=<YOUR host ip here>
cert_dir=/srv/etcd
mkdir -p ${cert_dir}
cd ${cert_dir}

openssl genrsa -out ca-key.pem 2048

openssl req -x509 -new -nodes -key ca-key.pem -days 10000 -out ca.pem -subj "/CN=kube-ca"

cat > openssl.cnf <<EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[alt_names]
IP.1 = ${HOST_IP}
IP.2 = 127.0.0.1
EOF

openssl genrsa -out apiserver-key.pem 2048
openssl req -new -key apiserver-key.pem -out apiserver.csr -subj "/CN=kube-apiserver" -config openssl.cnf

openssl x509 -req -in apiserver.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out apiserver.pem -days 365 -extensions v3_req -extfile openssl.cnf

groupadd kube_etcd
usermod -a -G kube_etcd etcd
usermod -a -G kube_etcd kube
SERVER_KEY=$cert_dir/apiserver-key.pem
chmod 550 "${cert_dir}"
chown -R etcd:kube_etcd "${cert_dir}"
chmod 440 $SERVER_KEY

ETCD_DISCOVERY=$(curl -w "\n" 'https://discovery.etcd.io/new?size=1')

cat > /etc/etcd/etcd.conf <<EOF
ETCD_NAME=${HOST_IP}
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
ETCD_LISTEN_CLIENT_URLS=https://${HOST_IP}:2379
ETCD_LISTEN_PEER_URLS=https://${HOST_IP}:2380

ETCD_ADVERTISE_CLIENT_URLS=https://${HOST_IP}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${HOST_IP}:2380
ETCD_DISCOVERY=${ETCD_DISCOVERY}
ETCD_TRUSTED_CA_FILE=${cert_dir}/ca.pem
ETCD_CERT_FILE=${cert_dir}/apiserver.pem
ETCD_KEY_FILE=${cert_dir}/apiserver-key.pem
ETCD_PEER_TRUSTED_CA_FILE=${cert_dir}/ca.pem
ETCD_PEER_CERT_FILE=${cert_dir}/apiserver.pem
ETCD_PEER_KEY_FILE=${cert_dir}/apiserver-key.pem
EOF

3. Start etcd
systemctl enable etcd
systemctl start etcd

4. Check the logs
journalctl -u etcd

You will see a lot of:
etcd[2386]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
etcd[2386]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
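
To put a number on "a lot", the error lines can be counted per minute. This is a rough sketch: `errors_per_minute` is a hypothetical helper, and it assumes the default journalctl line format shown in the description (month, day, HH:MM:SS, host, then the message).

```shell
# Hypothetical helper: read journal lines on stdin and print, for each
# minute, how many http2Client.notifyError lines were logged.
# Assumes lines start with "Mon DD HH:MM:SS" and arrive in time order.
errors_per_minute() {
    awk '/http2Client\.notifyError/ {
        key = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Mar 09 14:13"
        if (key != prev) {
            if (prev != "") print prev, n
            prev = key
            n = 0
        }
        n++
    }
    END { if (prev != "") print prev, n }'
}

# Example (not run here):
#   journalctl -u etcd --no-pager | errors_per_minute
```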

[1] https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-25-20170228.0/CloudImages/x86_64/images/Fedora-Atomic-25-20170228.0.x86_64.qcow2

I don't see a link between the bug and whether we are using kolla or magnum.

I'm not sure what this means:
do they have a client using the elliptic p224 curve that we compile out?

FYI, nothing was hammering the etcd server. When I saw the error, the obvious move was to try to identify what was creating the load.
Comment 3 Spyros Trigazis 2017-03-11 12:34:27 EST
Check the load with top. This is on a vm with 2 cores and 4GB ram.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
2386 etcd      20   0 10.254g 122308  12708 R 183.0  3.3  26:51.00 etcd
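
For anyone comparing hosts, the %CPU figure can be pulled out of top's batch output. A minimal sketch, assuming the default column layout shown above (%CPU is the 9th field, COMMAND is the last); `etcd_cpu_from_top` is a hypothetical helper name.

```shell
# Hypothetical helper: read `top -b` output on stdin and print the %CPU
# value of every process whose COMMAND column is exactly "etcd".
# Assumes top's default columns: PID USER PR NI VIRT RES SHR S %CPU ...
etcd_cpu_from_top() {
    awk '$NF == "etcd" { print $9 }'
}

# Example (not run here):
#   top -b -n 1 | etcd_cpu_from_top
```

On the 2-core VM above, 183.0 means etcd is keeping close to both cores busy, since top reports per-core percentages by default.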
Comment 4 Spyros Trigazis 2017-03-11 15:00:07 EST
A faster way to reproduce:

docker run --rm --env HOST_IP=<YOUR HOST IP HERE> --net host -p 2379 -p 2380 --name etcd strigazi/test-etcd

You can build the image yourself from this repo:

https://github.com/strigazi/test-etcd

The image is based on registry.fedoraproject.org/f25/etcd . When I was writing these lines, the image had etcd 3.0.17 built with golang 1.7.4.
Comment 5 Spyros Trigazis 2017-03-11 21:34:10 EST
I think I found the issue. It comes from golang 1.7 and it's fixed in etcd here [1]. The bad thing is that the fix is included only in etcd v3.1.x.

etcd v3.0.17 that works, as I mentioned, from gcr.io/google_containers/etcd:3.0.17 is built with go 1.6.4.

Here is what I built and tested, by golang and etcd version (Yes = works, No = shows the issue):

go \ etcd      | v3.0.15 | v3.0.17 |    v3.1.1     |
1.6.4          |   Yes   |   Yes   | compile fails |
1.7.5, 1.7.4   |   No    |   No    |      Yes      |
1.8.rc3        |   No    |   No    |      Yes      |

We must either move to etcd v3.1.x or build etcd 3.0.17 with go 1.6.4 (or 1.6.x, probably).
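
To see which row of the table a given binary falls into, the output of `etcd --version` can be classified. This is a sketch under assumptions: it expects the `etcd Version:` and `Go Version: go...` lines that etcd prints, `classify_etcd_build` is a hypothetical helper, and extending the table from the tested versions to all of 3.0.x and 3.1.x is my extrapolation, so treat the result as a hint only.

```shell
# Hypothetical helper: read `etcd --version` output on stdin and map the
# etcd/Go version pair onto the table above.
# Assumption: results for 3.0.15/3.0.17 apply to 3.0.x, and 3.1.1 to 3.1.x.
classify_etcd_build() {
    local etcd_ver="" go_ver="" line
    while IFS= read -r line; do
        case $line in
            "etcd Version: "*) etcd_ver=${line#"etcd Version: "} ;;
            "Go Version: go"*) go_ver=${line#"Go Version: go"} ;;
        esac
    done
    case "$etcd_ver/$go_ver" in
        /*|*/)       echo "unknown" ;;          # one of the versions is missing
        3.0.*/1.6.*) echo "ok-per-table" ;;     # 3.0.x built with go 1.6.x
        3.0.*/*)     echo "likely-affected" ;;  # 3.0.x built with go 1.7 or newer
        3.1.*/1.6.*) echo "unknown" ;;          # did not compile in the test above
        3.1.*/*)     echo "ok-per-table" ;;     # 3.1.x built with go 1.7 or newer
        *)           echo "unknown" ;;          # not covered by the table
    esac
}

# Example (not run here):
#   etcd --version | classify_etcd_build
```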

On the other hand, in a Fedora Atomic (FA) image from 20170205, which includes etcd v3.0.15 built with go 1.7.3, the problem doesn't occur, which makes absolutely no sense. I tried to reproduce this, but when I built etcd v3.0.15 with go 1.7.3 myself I had the same issue.

Final comment: given fix [1], if we continue to use go 1.7 we should move to etcd v3.1.x.

[1] https://github.com/coreos/etcd/commit/7a48ca4ceaa10451b48594104e14fe36781c1a01
Comment 6 Spyros Trigazis 2017-03-12 08:01:07 EDT
FYI, etcd v3.0.17 was released on Jan 20 2017 with go 1.6.4 not 1.7 [1]

[1] https://github.com/coreos/etcd/releases/tag/v3.0.17
Comment 7 Dusty Mabe 2017-03-12 12:09:27 EDT
Spyros, we have an open ticket for moving to a newer etcd. There are some blockers there that we need to clear out of the way beforehand.

https://bugzilla.redhat.com/show_bug.cgi?id=1415341
Comment 8 Spyros Trigazis 2017-03-12 13:10:57 EDT
Ok, in the meantime, could we fix etcd 3.0.17 by building with go 1.6.4? In the current state we can't use the latest FA25 or benefit from the recent release [1].

[1] http://www.projectatomic.io/blog/2017/03/fedora_atomic_2week/
Comment 9 Dusty Mabe 2017-03-12 15:53:34 EDT
Unfortunately, in Fedora 25 we can't officially build against 1.6.4 when the current version of golang in the F25 repos is 1.7. I could give you a custom-built container with 3.0.17 built against f24 (and thus golang 1.6.4). Once we get 3.1.3 building I can also give you a preview container with that content in it. Neither one of those would be official yet, but let me know if you are interested.

As for 3.1.3, I'm actively working to unblock that and possibly get the new rpm into updates-testing this week. However, I don't know if it would make this week's release, scheduled for Tuesday.
Comment 10 Dusty Mabe 2017-03-20 23:16:29 EDT
This update seems to fix the problem for your test reproducer:
https://bodhi.fedoraproject.org/updates/FEDORA-2017-d841d68f7c
Comment 11 Spyros Trigazis 2017-03-22 09:04:05 EDT
I tested with the above update. Looks good now.
Comment 12 Dusty Mabe 2017-03-22 15:49:38 EDT
Fixed in etcd-3.1.3-1.fc25
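
To check whether a host already has the fixed package, the installed version can be compared against 3.1.3-1. This is a sketch only: `etcd_rpm_has_fix` is a hypothetical helper, it assumes a package string of the form `etcd-<version>-<release>.fcNN.<arch>` as printed by `rpm -q etcd`, and it uses GNU `sort -V` as an approximation of RPM's version ordering.

```shell
# Hypothetical helper: succeed if the given etcd package string is at
# least the fixed build from this bug (etcd-3.1.3-1.fc25).
# Assumes GNU sort (-V) and a Fedora-style ".fcNN" dist tag.
etcd_rpm_has_fix() {
    local nvr=$1 fixed="3.1.3-1" vr
    vr=${nvr#etcd-}     # drop the package name
    vr=${vr%%.fc*}      # drop the dist tag and architecture
    # The fixed version must sort first (or be equal) in version order.
    [ "$(printf '%s\n%s\n' "$fixed" "$vr" | sort -V | head -n 1)" = "$fixed" ]
}

# Example (not run here):
#   etcd_rpm_has_fix "$(rpm -q etcd)" && echo "fix present"
```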
