Description of problem:
If etcd is configured with TLS, it creates a massive amount of load on the host. If the host is powerful enough and you don't put actual load on etcd, you might not notice.

Version-Release number of selected component (if applicable):
3.0.17

How reproducible:
Always

Steps to Reproduce:
1. On a Fedora Atomic host with etcd 3.0.17
2. Create TLS certificates: a CA, a cert and a key
3. Configure etcd as follows:

ETCD_NAME=<ip>
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
ETCD_LISTEN_CLIENT_URLS=https://<ip>:2379
ETCD_LISTEN_PEER_URLS=https://<ip>:2380
ETCD_ADVERTISE_CLIENT_URLS=https://<ip>:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://<ip>:2380
ETCD_DISCOVERY=https://discovery.etcd.io/8765108671b4ee96a9efaf0f6714164a
ETCD_TRUSTED_CA_FILE=/srv/kubernetes/ca.crt
ETCD_CERT_FILE=/srv/kubernetes/server.crt
ETCD_KEY_FILE=/srv/kubernetes/server.key
ETCD_PEER_TRUSTED_CA_FILE=/srv/kubernetes/ca.crt
ETCD_PEER_CERT_FILE=/srv/kubernetes/server.crt
ETCD_PEER_KEY_FILE=/srv/kubernetes/server.key

4. Start etcd

Actual results:
A lot of:

Mar 09 14:13:02 te-kvz2ov3ur-0-dovsiocfsl5c-swarm-master-bxb4bk5pthcy.novalocal etcd[2038]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.

See [1].

Expected results:
etcd running without the above error and with reasonable load.

Additional info:
Here is the fun part. etcd in the docker image registry.fedoraproject.org/f25/etcd (3.0.17) doesn't work either. But gcr.io/google_containers/etcd:3.0.17 and quay.io/coreos/etcd:v3.1.1 work fine. etcd 3.0.15 in Fedora doesn't suffer from this issue. In [2], etcd works fine.

[1] http://logs.openstack.org/02/443002/2/check/gate-functional-dsvm-magnum-swarm-ubuntu-xenial/50d25b2/logs/cluster-nodes/master-test_start_stop_container_from_api-172.24.5.8/etcd.txt.gz
[2] https://kojipkgs.fedoraproject.org/compose/twoweek/Fedora-Atomic-25-20170205.0/compose/CloudImages/x86_64/images/
Created attachment 1262080
notes from irc chat with clayton

I chatted with clayton about this and he's not sure what exactly it could be. I copied the notes from the IRC log here in case our conversation triggers an idea for someone else.
Detailed steps to reproduce. Can someone try to reproduce?

1. Boot a fresh atomic host with etcd 3.0.17. I used [1].

2. Configure etcd with TLS like so:

HOST_IP=<YOUR host ip here>
cert_dir=/srv/etcd

mkdir -p ${cert_dir}
cd ${cert_dir}

# Self-signed CA
openssl genrsa -out ca-key.pem 2048
openssl req -x509 -new -nodes -key ca-key.pem -days 10000 -out ca.pem -subj "/CN=kube-ca"

# Request config with the host IP and localhost as subject alternative names
cat > openssl.cnf <<EOF
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
subjectAltName = @alt_names
[alt_names]
IP.1 = ${HOST_IP}
IP.2 = 127.0.0.1
EOF

# Server key and certificate signed by the CA above
openssl genrsa -out apiserver-key.pem 2048
openssl req -new -key apiserver-key.pem -out apiserver.csr -subj "/CN=kube-apiserver" -config openssl.cnf
openssl x509 -req -in apiserver.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out apiserver.pem -days 365 -extensions v3_req -extfile openssl.cnf

# Let the etcd and kube users read the certificates
groupadd kube_etcd
usermod -a -G kube_etcd etcd
usermod -a -G kube_etcd kube
SERVER_KEY=$cert_dir/apiserver-key.pem
chmod 550 "${cert_dir}"
chown -R etcd:kube_etcd "${cert_dir}"
chmod 440 $SERVER_KEY

ETCD_DISCOVERY=$(curl -w "\n" 'https://discovery.etcd.io/new?size=1')

cat > /etc/etcd/etcd.conf <<EOF
ETCD_NAME=${HOST_IP}
ETCD_DATA_DIR=/var/lib/etcd/default.etcd
ETCD_LISTEN_CLIENT_URLS=https://${HOST_IP}:2379
ETCD_LISTEN_PEER_URLS=https://${HOST_IP}:2380
ETCD_ADVERTISE_CLIENT_URLS=https://${HOST_IP}:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${HOST_IP}:2380
ETCD_DISCOVERY=${ETCD_DISCOVERY}
ETCD_TRUSTED_CA_FILE=${cert_dir}/ca.pem
ETCD_CERT_FILE=${cert_dir}/apiserver.pem
ETCD_KEY_FILE=${cert_dir}/apiserver-key.pem
ETCD_PEER_TRUSTED_CA_FILE=${cert_dir}/ca.pem
ETCD_PEER_CERT_FILE=${cert_dir}/apiserver.pem
ETCD_PEER_KEY_FILE=${cert_dir}/apiserver-key.pem
EOF

3. Start etcd

systemctl enable etcd
systemctl start etcd

4. Check the logs

journalctl -u etcd

You will see a lot of:

etcd[2386]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
etcd[2386]: transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.

[1] https://download.fedoraproject.org/pub/alt/atomic/stable/Fedora-Atomic-25-20170228.0/CloudImages/x86_64/images/Fedora-Atomic-25-20170228.0.x86_64.qcow2

I don't see a link between the bug and whether we are using kolla or magnum.

I'm not sure what this means: "do they have a client using the elliptic p224 curve that we compile out?"

FYI, nothing was hammering the etcd server. When I saw the error, the obvious move was to try to identify what creates the load.
Check the load with top. This is on a VM with 2 cores and 4GB RAM.

 PID USER PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
2386 etcd 20  0 10.254g 122308 12708 R 183.0  3.3 26:51.00 etcd
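The two symptoms above (the repeated transport error and the high CPU usage) can also be checked non-interactively. The following is a minimal sketch, not something from the original report: the function names are made up for illustration, and the CPU helper assumes the default top column layout shown above, where %CPU is the ninth field.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative only.

# Count occurrences of the transport error in log text read from stdin.
# -F matches the message literally; "|| true" keeps a zero count from
# being treated as a failure (grep exits 1 when nothing matches).
count_transport_errors() {
    grep -cF 'http2Client.notifyError got notified that the client transport was broken' || true
}

# Print the %CPU column for etcd from top output read from stdin.
# Assumes the default top layout (COMMAND is the last field, %CPU the 9th).
etcd_cpu_from_top() {
    awk '$NF == "etcd" { print $9 }'
}

# Example usage on the affected host:
#   journalctl -u etcd --since "10 minutes ago" | count_transport_errors
#   top -b -n 1 | etcd_cpu_from_top
```

A single top sample is only a rough indicator, so it is worth repeating the check a few times before drawing conclusions.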
A faster way to reproduce:

docker run --rm --env HOST_IP=<YOUR HOST IP HERE> --net host -p 2379 -p 2380 --name etcd strigazi/test-etcd

You can build the image yourself from this repo: https://github.com/strigazi/test-etcd

The image is based on registry.fedoraproject.org/f25/etcd. At the time of writing, the image had etcd 3.0.17 built with golang 1.7.4.
I think I found the issue. It comes from golang 1.7 and it's fixed in etcd here [1]. The bad thing is that the fix is included only in etcd v3.1.x. The etcd v3.0.17 that works, as I mentioned, from gcr.io/google_containers/etcd:3.0.17 is built with go 1.6.4.

Here is what I built and tested (Yes means etcd runs without the issue):

go \ etcd    | v3.0.15 | v3.0.17 | v3.1.1        |
1.6.4        | Yes     | Yes     | compile fails |
1.7.5, 1.7.4 | No      | No      | Yes           |
1.8.rc3      | No      | No      | Yes           |

We must either move to etcd v3.1.x or build etcd 3.0.17 with go 1.6.4 (or 1.6.x, probably).

On the other hand, in a Fedora Atomic image from 20170205, which includes etcd v3.0.15 built with go 1.7.3, the problem doesn't occur, which makes absolutely no sense. I tried to reproduce that, but when I built etcd v3.0.15 with go 1.7.3 I had the same issue.

Final comment: given fix [1], if we continue to use go 1.7, we should move to etcd v3.1.x.

[1] https://github.com/coreos/etcd/commit/7a48ca4ceaa10451b48594104e14fe36781c1a01
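To see where an installed binary falls in the matrix above, the etcd and Go versions can be read from the output of etcd --version. This is a rough sketch under two assumptions: that the output contains "etcd Version: ..." and "Go Version: ..." lines, and that the pattern in the matrix generalizes (any 3.0.x build with a Go toolchain other than 1.6.x is treated as suspect). The function name is made up, and the result is a heuristic only; as noted above, the Fedora Atomic 20170205 image with v3.0.15 and go 1.7.3 did not show the problem.

```shell
#!/bin/sh
# Hypothetical helper; reads the output of `etcd --version` from stdin and
# classifies the build according to the tested matrix in this report.
classify_etcd_build() {
    awk -F': ' '
        /^etcd Version:/ { etcd = $2 }
        /^Go Version:/   { go = $2 }
        END {
            if (etcd == "" || go == "") { print "unknown"; exit }
            # Heuristic from the matrix: 3.0.x built with anything but go 1.6.x
            affected = (etcd ~ /^3\.0\./ && go !~ /^go1\.6(\.|$)/)
            printf "etcd %s, %s: %s\n", etcd, go, (affected ? "likely affected" : "not expected to be affected")
        }'
}

# Example usage:
#   etcd --version | classify_etcd_build
```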
FYI, etcd v3.0.17 was released on Jan 20 2017 with go 1.6.4, not 1.7 [1].

[1] https://github.com/coreos/etcd/releases/tag/v3.0.17
Spyros, we have an open ticket for moving to a newer etcd. There are some blockers there that we need to clear out of the way beforehand.

https://bugzilla.redhat.com/show_bug.cgi?id=1415341
Ok, in the meantime, could we fix etcd 3.0.17 by building it with go 1.6.4? With the current state, we can neither use the latest FA25 nor benefit from the recent release [1].

[1] http://www.projectatomic.io/blog/2017/03/fedora_atomic_2week/
Unfortunately, in Fedora 25 we can't officially build against 1.6.4, because the current version of golang in the F25 repos is 1.7. I could give you a custom-built container with 3.0.17 built against f24 (and thus golang 1.6.4). Once we get 3.1.3 building, I can also give you a preview container with that content in it. Neither of those would be official yet, but let me know if you are interested.

As for 3.1.3, I'm actively working to unblock that and possibly get the new rpm into updates-testing this week. However, I don't know if it would make this week's release, scheduled for Tuesday.
This update seems to fix the problem for your test reproducer: https://bodhi.fedoraproject.org/updates/FEDORA-2017-d841d68f7c
I tested with the above update. Looks good now.
Fixed in etcd-3.1.3-1.fc25