Bug 1389736 - Installer often exits due to intermittent etcd server request timeouts
Summary: Installer often exits due to intermittent etcd server request timeouts
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
urgent
low
Target Milestone: ---
Assignee: Stefan Schimanski
QA Contact: Xiaoli Tian
URL:
Whiteboard: aos-scalability-34
Depends On:
Blocks:
 
Reported: 2016-10-28 11:31 UTC by Johnny Liu
Modified: 2019-07-03 17:55 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-03 17:55:10 UTC
Target Upstream Version:
Embargoed:


Attachments
master log with embedded etcd (5.22 MB, text/x-vhdl)
2016-11-03 06:28 UTC, Johnny Liu
etcd log (24.13 KB, text/x-vhdl)
2016-11-04 05:15 UTC, Johnny Liu
master api log (270.50 KB, text/x-vhdl)
2016-11-04 05:16 UTC, Johnny Liu
etcd logs on all etcd hosts (210.00 KB, application/x-tar)
2016-11-08 05:42 UTC, Gaoyun Pei
atomic-openshift-master-api and atomic-openshift-master-controllers service logs (11.20 MB, application/x-tar)
2016-11-08 06:12 UTC, Gaoyun Pei

Description Johnny Liu 2016-10-28 11:31:45 UTC
Description of problem:
Set up an env with 2 masters + 3 etcd + 2 infra nodes + 2 primary nodes.

TASK [cockpit-ui : Deploy registry-console] ************************************
Friday 28 October 2016  08:02:38 +0000 (0:00:02.136)       0:32:19.636 ******** 
fatal: [ec2-54-147-197-159.compute-1.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["oc", "new-app", "--template=registry-console", "-p", "IMAGE_PREFIX=registry.ops.openshift.com/openshift3/", "-p", "OPENSHIFT_OAUTH_PROVIDER_URL=https://qe-master-api-elb-1283845662.us-east-1.elb.amazonaws.com", "-p", "REGISTRY_HOST=docker-registry-default.001-fixed.qe.rhcloud.com", "-p", "COCKPIT_KUBE_URL=https://registry-console-default.001-fixed.qe.rhcloud.com", "--config=/tmp/openshift-ansible-9vlPtH/admin.kubeconfig", "-n", "default"], "delta": "0:00:11.722168", "end": "2016-10-28 04:02:51.728199", "failed": true, "failed_when_result": true, "rc": 1, "start": "2016-10-28 04:02:40.006031", "stderr": "    error: client: etcd member https://ip-172-18-7-225.ec2.internal:2379 has no leader", "stdout": "--> Deploying template \"openshift/registry-console\" to project default\n\n     registry-console\n     ---------\n     Template for deploying registry web console. Requires cluster-admin.\n\n     * With parameters:\n        * IMAGE_PREFIX=registry.ops.openshift.com/openshift3/\n        * IMAGE_VERSION=3.3\n        * OPENSHIFT_OAUTH_PROVIDER_URL=https://qe-master-api-elb-1283845662.us-east-1.elb.amazonaws.com\n        * COCKPIT_KUBE_URL=https://registry-console-default.001-fixed.qe.rhcloud.com\n        * OPENSHIFT_OAUTH_CLIENT_SECRET=user4DRH2I4PwwGPDDKx3RMPoeFCj3bYwDtQgJBI7LgR0yoiuA0kmjMEQEX4TnwKS2nF # generated\n        * OPENSHIFT_OAUTH_CLIENT_ID=cockpit-oauth-client\n        * REGISTRY_HOST=docker-registry-default.001-fixed.qe.rhcloud.com\n\n--> Creating resources ...\n    service \"registry-console\" created\n    imagestream \"registry-console\" created\n    oauthclient \"cockpit-oauth-client\" created\n--> Failed", "stdout_lines": ["--> Deploying template \"openshift/registry-console\" to project default", "", "     registry-console", "     ---------", "     Template for deploying registry web console. Requires cluster-admin.", "", "     * With parameters:", "        * IMAGE_PREFIX=registry.ops.openshift.com/openshift3/", "        * IMAGE_VERSION=3.3", "        * OPENSHIFT_OAUTH_PROVIDER_URL=https://qe-master-api-elb-1283845662.us-east-1.elb.amazonaws.com", "        * COCKPIT_KUBE_URL=https://registry-console-default.001-fixed.qe.rhcloud.com", "        * OPENSHIFT_OAUTH_CLIENT_SECRET=user4DRH2I4PwwGPDDKx3RMPoeFCj3bYwDtQgJBI7LgR0yoiuA0kmjMEQEX4TnwKS2nF # generated", "        * OPENSHIFT_OAUTH_CLIENT_ID=cockpit-oauth-client", "        * REGISTRY_HOST=docker-registry-default.001-fixed.qe.rhcloud.com", "", "--> Creating resources ...", "    service \"registry-console\" created", "    imagestream \"registry-console\" created", "    oauthclient \"cockpit-oauth-client\" created", "--> Failed"], "warnings": []}

The error output says "error: client: etcd member https://ip-172-18-7-225.ec2.internal:2379 has no leader". Logging into the server:

Checking etcd cluster health shows all members are healthy:
# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} cluster-health
member 39a0caa2c9f81140 is healthy: got healthy result from https://172.18.2.32:2379
member 66e9e3c8efb1f076 is healthy: got healthy result from https://172.18.2.33:2379
member 8e3e66cb2733c776 is healthy: got healthy result from https://172.18.5.248:2379


Checking the pods shows registry-console was actually deployed successfully:
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-2-hpo29    1/1       Running   0          2h
docker-registry-2-mlwvl    1/1       Running   0          2h
registry-console-1-d8a6h   1/1       Running   0          2h

The etcd log contains some errors:
Oct 28 04:07:58 ip-172-18-2-33.ec2.internal etcd[16084]: 66e9e3c8efb1f076 [term: 3] ignored a MsgApp message with lower term from 39a0caa2c9
Oct 28 04:28:58 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:28:59 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 4 repeated li
Oct 28 04:29:20 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:29:28 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:29:29 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 3 repeated li
Oct 28 04:29:36 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:29:37 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 2 repeated li
Oct 28 04:29:44 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:29:45 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated li
Oct 28 04:29:47 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated li
Oct 28 04:30:10 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:30:13 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:30:20 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:30:21 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 3 repeated li
Oct 28 04:30:27 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:30:29 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 21 repeated l
Oct 28 04:30:36 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:30:37 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out) [merged 4 repeated li
Oct 28 04:30:42 ip-172-18-2-33.ec2.internal etcd[16084]: got unexpected response error (etcdserver: request timed out)
Oct 28 04:46:18 ip-172-18-2-33.ec2.internal etcd[16084]: start to snapshot (applied: 10001, lastsnap: 0)

It seems etcd was broken for a while and later came back.

The interesting thing is that my installation job always fails at the registry-console deploy step, while docker-registry and the router are deployed successfully. Not sure what happened there.

This is a big problem for QE's testing. We need to enhance the installer so that a registry-console deploy failure (if metrics deployment is enabled, it would also fail because etcd has no leader) does not break the whole installation.


Version-Release number of selected component (if applicable):
openshift-ansible-3.4.13-1.git.0.ff1d588.el7.noarch
etcd-2.3.7-4.el7.x86_64


How reproducible:
Often

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Johnny Liu 2016-11-02 05:41:14 UTC
Today I re-ran a new installation and this issue seems to have disappeared: registry-console deployed successfully and the whole installation succeeded, although the etcd log still shows some "got unexpected response error" messages.

atomic-openshift-3.4.0.18-1.git.0.ada983f.el7.x86_64
atomic-openshift-sdn-ovs-3.4.0.18-1.git.0.ada983f.el7.x86_64
atomic-openshift-master-3.4.0.18-1.git.0.ada983f.el7.x86_64
atomic-openshift-node-3.4.0.18-1.git.0.ada983f.el7.x86_64
atomic-openshift-clients-3.4.0.18-1.git.0.ada983f.el7.x86_64
tuned-profiles-atomic-openshift-node-3.4.0.18-1.git.0.ada983f.el7.x86_64
atomic-openshift-tests-3.4.0.18-1.git.0.ada983f.el7.x86_64

ansible-2.2.0.0-0.71.rc4.el7.noarch.rpm

openshift-ansible master branch

Comment 3 Johnny Liu 2016-11-03 06:26:37 UTC
Today QE encountered this issue again when installing a 3.3 env with embedded etcd.
openshift v3.3.1.4
kubernetes v1.3.0+52492b4
etcd 2.3.0+git



The master log shows the following info:
...
Nov  3 01:31:32 ip-172-18-3-88 atomic-openshift-node: I1103 01:31:32.723383   18887 reflector.go:403] pkg/kubelet/kubelet.go:265: Watch close - *api.Service total 4 items received
Nov  3 01:31:33 ip-172-18-3-88 atomic-openshift-master: I1103 01:31:33.781887   11554 controller.go:105] Found 0 scheduledjobs
Nov  3 01:31:33 ip-172-18-3-88 atomic-openshift-master: I1103 01:31:33.783637   11554 controller.go:113] Found 0 jobs
Nov  3 01:31:33 ip-172-18-3-88 atomic-openshift-master: I1103 01:31:33.783648   11554 controller.go:116] Found 0 groups
Nov  3 01:31:34 ip-172-18-3-88 openshift: got unexpected response error (etcdserver: request timed out)
Nov  3 01:31:34 ip-172-18-3-88 atomic-openshift-master: E1103 01:31:34.810723   11554 controller.go:121] unable to sync kubernetes service: client: etcd cluster is unavailable or misconfigured
Nov  3 01:31:35 ip-172-18-3-88 atomic-openshift-node: I1103 01:31:35.627844   18887 kube_docker_client.go:292] Pulling image "registry.ops.openshift.com/openshift3/ose-deployer:v3.3.1.4": "17c7cb45bad6: Extracting [==================================================>] 66.13 MB/66.13 MB"
Nov  3 01:31:36 ip-172-18-3-88 openshift: got unexpected response error (etcdserver: request timed out)
Nov  3 01:31:36 ip-172-18-3-88 atomic-openshift-master: E1103 01:31:36.635044   11554 errors.go:63] apiserver received an error that is not an unversioned.Status: client: etcd cluster is unavailable or misconfigured
Nov  3 01:31:36 ip-172-18-3-88 atomic-openshift-master: I1103 01:31:36.635136   11554 trace.go:61] Trace "Delete /oapi/v1/images/sha256:00066d0ede35a5adf02ea04642ab44de5dfbd19221ee2f49fa715a383827128a" (started 2016-11-03 01:31:29.630872161 -0400 EDT):
Nov  3 01:31:36 ip-172-18-3-88 atomic-openshift-master: [20.447µs] [20.447µs] About do delete object from database
Nov  3 01:31:36 ip-172-18-3-88 atomic-openshift-master: [7.004238234s] [7.004217787s] END
Nov  3 01:31:37 ip-172-18-3-88 atomic-openshift-node: I1103 01:31:37.063012   18887 kube_docker_client.go:292] Pulling image "registry.ops.openshift.com/openshift3/ose-deployer:v3.3.1.4": "17c7cb45bad6: Extracting [==================================================>] 66.13 MB/66.13 MB"
Nov  3 01:31:38 ip-172-18-3-88 kernel: lbr0: port 2(vethba4c757) entered forwarding state
...
Nov  3 01:32:35 ip-172-18-3-88 atomic-openshift-master: I1103 01:32:35.420225   11554 controller.go:308] No daemon sets found for pod docker-registry-1-deploy, daemon set controller will avoid syncing
Nov  3 01:32:35 ip-172-18-3-88 atomic-openshift-master: I1103 01:32:35.420280   11554 pet_set.go:236] No PetSets found for pod docker-registry-1-deploy, PetSet controller will avoid syncing
Nov  3 01:32:35 ip-172-18-3-88 atomic-openshift-node: I1103 01:32:35.453774   18887 status_manager.go:439] Pod "docker-registry-1-deploy_default(824beac4-a186-11e6-ad07-0e7b0091a328)" fully terminated and removed from etcd
Nov  3 01:32:35 ip-172-18-3-88 atomic-openshift-master: I1103 01:32:35.455314   11554 replication_controller.go:408] Pod default/docker-registry-1-deploy deleted through github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/replication.(*ReplicationManager).(github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/replication.deletePod)-fm, timestamp 2016-11-03T01:32:35-04:00, labels map[openshift.io/deployer-pod-for.name:docker-registry-1].
...
Nov  3 01:33:08 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:08.421861   11554 controller.go:361] synchronizing PersistentVolume[regpv-volume]: volume is bound to claim default/regpv-claim
Nov  3 01:33:08 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:08.421874   11554 controller.go:378] synchronizing PersistentVolume[regpv-volume]: claim default/regpv-claim found: phase: Bound, bound to: "regpv-volume", bindCompleted: true, boundByController: true
Nov  3 01:33:08 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:08.421881   11554 controller.go:439] synchronizing PersistentVolume[regpv-volume]: all is bound
Nov  3 01:33:08 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:08.421886   11554 controller.go:589] updating PersistentVolume[regpv-volume]: set phase Bound
Nov  3 01:33:08 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:08.421891   11554 controller.go:592] updating PersistentVolume[regpv-volume]: phase Bound already set
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:09.177476   11554 reflector.go:403] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:87: Watch close - *api.PersistentVolume total 4 items received
Nov  3 01:33:09 ip-172-18-3-88 openshift: got unexpected response error (etcdserver: request timed out)
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: E1103 01:33:09.526166   11554 errors.go:63] apiserver received an error that is not an unversioned.Status: client: etcd cluster is unavailable or misconfigured
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:09.526244   11554 trace.go:61] Trace "Update /api/v1/nodes/ip-172-18-3-88.ec2.internal/status" (started 2016-11-03 01:33:02.402391296 -0400 EDT):
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: [18.759µs] [18.759µs] About to convert to expected version
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: [85.845µs] [67.086µs] Conversion done
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: [91.534µs] [5.689µs] About to store object in database
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-master: [7.12382413s] [7.123732596s] END
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-node: E1103 01:33:09.536086   18887 kubelet.go:2993] Error updating node status, will retry: client: etcd cluster is unavailable or misconfigured
Nov  3 01:33:09 ip-172-18-3-88 atomic-openshift-node: E1103 01:33:09.912716   18887 kubelet.go:2993] Error updating node status, will retry: Operation cannot be fulfilled on nodes "ip-172-18-3-88.ec2.internal": the object has been modified; please apply your changes to the latest version and try again
...
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:42.847022   11554 controller.go:166] No jobs found for pod docker-registry-1-dx855, job controller will avoid syncing
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:42.847033   11554 controller.go:414] Pod docker-registry-1-dx855 deleted.
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:42.847056   11554 controller.go:308] No daemon sets found for pod docker-registry-1-dx855, daemon set controller will avoid syncing
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:42.847110   11554 pet_set.go:208] Pod default/docker-registry-1-dx855 deleted through github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/petset.(*PetSetController).(github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/petset.deletePod)-fm.
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-master: I1103 01:33:42.847126   11554 pet_set.go:236] No PetSets found for pod docker-registry-1-dx855, PetSet controller will avoid syncing
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.859053   18887 status_manager.go:439] Pod "docker-registry-1-dx855_default(db39ff03-a186-11e6-ad07-0e7b0091a328)" fully terminated and removed from etcd
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.859494   18887 docker_manager.go:1297] Calling network plugin redhat/openshift-ovs-subnet to tear down pod for registry-console-1-deploy_default
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.863580   18887 config.go:269] Setting pods for source api
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.864254   18887 kubelet.go:2653] SyncLoop (UPDATE, "api"): "docker-registry-1-dx855_default(db39ff03-a186-11e6-ad07-0e7b0091a328):DeletionTimestamp=2016-11-03T05:33:42Z"
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.864303   18887 kubelet.go:2656] SyncLoop (REMOVE, "api"): "docker-registry-1-dx855_default(db39ff03-a186-11e6-ad07-0e7b0091a328)"
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.864421   18887 kubelet.go:2424] Killing unwanted pod "docker-registry-1-dx855"
Nov  3 01:33:42 ip-172-18-3-88 atomic-openshift-node: I1103 01:33:42.864497   18887 docker_manager.go:1374] Killing container "43e1cea0045b3c7ffa654d6a93b11041a9c74f1206a82577ad4360e57afb433b registry default/docker-registry-1-dx855" with 0 second grace period
Nov  3 01:33:42 ip-172-18-3-88 docker-current: time="2016-11-03T01:33:42.876340184-04:00" level=info msg="{Action=stop, ID=43e1cea0045b3c7ffa654d6a93b11041a9c74f1206a82577ad4360e57afb433b, LoginUID=4294967295, PID=18887}"



The full master log is attached.

Comment 4 Johnny Liu 2016-11-03 06:28:11 UTC
Created attachment 1216853 [details]
master log with embedded etcd

Comment 5 Peter Ruan 2016-11-03 17:27:12 UTC
It's not intermittent for me...I've been hitting this issue for the past week w/o one instance of success. :(

Comment 6 Brenton Leanhardt 2016-11-03 20:13:25 UTC
Jianlin, are you able to reproduce this with the etcd3 packages?  That's what should be used at this point for testing 3.4.

Comment 7 Johnny Liu 2016-11-04 05:15:52 UTC
Created attachment 1217246 [details]
etcd log

Comment 8 Johnny Liu 2016-11-04 05:16:47 UTC
Created attachment 1217248 [details]
master api log

Comment 9 Johnny Liu 2016-11-04 05:20:59 UTC
(In reply to Brenton Leanhardt from comment #6)
> Jianlin, are you able to reproduce this with the etcd3 packages?  That's
> what should be used at this point for testing 3.4.

Re-ran the same installation as comment 0 using etcd3; this issue reproduced again.

atomic-openshift-3.4.0.21-1.git.0.ca4702d.el7.x86_64
etcd3-3.0.13-1.el7.x86_64


TASK [cockpit-ui : Create passthrough route for registry-console] **************
Friday 04 November 2016  04:23:26 +0000 (0:00:01.834)       0:27:54.616 ******* 
fatal: [ec2-54-159-34-186.compute-1.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["oc", "create", "route", "passthrough", "--service", "registry-console", "--port", "registry-console", "--config=/tmp/openshift-ansible-vF1vuO/admin.kubeconfig", "-n", "default"], "delta": "0:00:10.252417", "end": "2016-11-04 00:23:38.631388", "failed": true, "failed_when_result": true, "rc": 1, "start": "2016-11-04 00:23:28.378971", "stderr": "Error from server: client: etcd member https://ip-172-18-7-161.ec2.internal:2379 has no leader", "stdout": "", "stdout_lines": [], "warnings": []}

master log:
Nov 04 00:23:26 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:23:26.776249   24643 plugin.go:41] Allocating global shard *.fixed-001.qe.rhcloud.com to Route: docker-registry
Nov 04 00:23:26 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:23:26.776257   24643 controller.go:30] Route docker-registry allocated to shard global [suffix=fixed-001.qe.rhcloud.com]
Nov 04 00:23:26 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:23:26.776263   24643 controller.go:37] Generating host name for Route: docker-registry
Nov 04 00:23:26 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:23:26.776269   24643 controller.go:42] Route: docker-registry, generated host name/alias=docker-registry-default.fixed-001.qe.rhcloud.com
Nov 04 00:23:45 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:23:45.406620   24643 logs.go:41] http: TLS handshake error from 172.18.7.174:53497: EOF
Nov 04 00:24:15 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:24:15.407167   24643 logs.go:41] http: TLS handshake error from 172.18.7.174:53517: EOF
Nov 04 00:24:20 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:24:20.774418   24643 reflector.go:284] github.com/openshift/origin/pkg/user/cache/groups.go:51: forcing resync
Nov 04 00:24:27 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:24:27.167610   24643 reflector.go:284] github.com/openshift/origin/pkg/project/auth/cache.go:201: forcing resync
Nov 04 00:24:29 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:24:29.514218   24643 reflector.go:403] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:93: Watch close - *api.SecurityContextConstraints total 1 items received
Nov 04 00:24:45 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:24:45.408345   24643 logs.go:41] http: TLS handshake error from 172.18.7.174:53543: EOF
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:25:03.258886   24643 trace.go:61] Trace "etcdHelper::Create *api.Event" (started 2016-11-04 00:24:53.25770342 -0400 EDT):
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [59.136µs] [59.136µs] Object encoded
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [60.719µs] [1.583µs] Version checked
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [10.001117732s] [10.001057013s] Object created
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [10.001140981s] [23.249µs] END
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: E1104 00:25:03.258949   24643 errors.go:63] apiserver received an error that is not an unversioned.Status: client: etcd member https://ip-172-18-7-161.ec2.internal:2379 has no leader
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:25:03.259041   24643 trace.go:61] Trace "Create /api/v1/namespaces/default/events" (started 2016-11-04 00:24:53.257421139 -0400 EDT):
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [53.594µs] [53.594µs] About to convert to expected version
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [87.014µs] [33.42µs] Conversion done
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [133.337µs] [46.323µs] About to store object in database
Nov 04 00:25:03 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: [10.001594418s] [10.001461081s] END
Nov 04 00:25:06 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:25:06.514803   24643 reflector.go:403] github.com/openshift/origin/pkg/controller/shared/shared_informer.go:93: Watch close - *api.ServiceAccount total 23 items received
Nov 04 00:25:13 ip-172-18-10-182.ec2.internal atomic-openshift-master-api[24643]: I1104 00:25:13.264014   24643 trace.go:61] Trace "etcdHelper::Create *api.Event" (started 2016-11-04 00:25:03.262844697 -0400 EDT):


After that failure, I logged into the server to check etcd cluster health; all members report healthy:

# etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} cluster-health
member 703a1f8f5ef91537 is healthy: got healthy result from https://172.18.7.161:2379
member 7b999750a40ef1f8 is healthy: got healthy result from https://172.18.20.161:2379
member e3b3de48fa68a648 is healthy: got healthy result from https://172.18.10.182:2379
cluster is healthy

Comment 10 Johnny Liu 2016-11-04 05:28:30 UTC
In comment 0, external etcd-2.3 + 3.4, FAIL.
In comment 2, external etcd-2.3 + 3.4, PASS.
In comment 3, embedded etcd-2.3 + 3.3, FAIL.
In comment 9, external etcd-3 + 3.4, FAIL.

This feels like an etcd performance issue: when a batch of data is written to etcd, etcd may break for a while, and some minutes later it comes back.

Interestingly, we never encountered this issue while testing the 3.3 release; it has started happening often only recently.
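
As a rough way to check that suspicion, the write latency could be probed while the installer is running. This is only an illustrative probe using etcdctl v2 syntax and the same TLS flags as in comment 0, not something the installer itself does:

for i in $(seq 1 20); do
    # healthy writes return well under a second; multi-second times would match the "request timed out" errors above
    time etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} set /probe/key$i value >/dev/null
done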

Comment 11 Johnny Liu 2016-11-04 05:36:36 UTC
So far we have not encountered this issue on GCE/OpenStack; all the failures happened on AWS. Is it related to AWS performance?

Comment 12 Johnny Liu 2016-11-04 10:34:30 UTC
Currently we encounter this issue often (though not 100% reproducible for me), and it is causing real trouble for our testing, so I am setting this bug as a testblocker.

Comment 13 Scott Dodson 2016-11-08 03:09:40 UTC
Jianlin,

If this happens again, can you grab the etcd logs from all etcd hosts?

Comment 14 Gaoyun Pei 2016-11-08 05:42:11 UTC
Created attachment 1218393 [details]
etcd logs on all etcd hosts

Also encountered this error on OpenStack today. I was running a containerized 2 master + 4 etcd + 4 node OCP 3.4 installation; the openshift-ansible job itself completed, but it failed in my custom playbook when trying to update the imagestreams right after the openshift-ansible installation.
 
TASK [delete all ImageSream images] ********************************************
Tuesday 08 November 2016  04:42:31 +0000 (0:00:05.485)       0:00:28.753 ****** 
fatal: [openshift-120.lab.sjc.redhat.com]: FAILED! => {"changed": true, "cmd": ["oc", "delete", "images", "--all"], "delta": "0:00:23.209399", "end": "2016-11-08 04:42:56.359233", "failed": true, "rc": 1, "start": "2016-11-08 04:42:33.149834", "stderr": "Error from server: client: etcd member https://openshift-120.lab.sjc.redhat.com:2379 has no leader",

etcd version should be etcd-2.3.7, working with ose:v3.4.0.23

Comment 15 Gaoyun Pei 2016-11-08 06:12:43 UTC
Created attachment 1218416 [details]
atomic-openshift-master-api and atomic-openshift-master-controllers service logs

Added the logs of the atomic-openshift-master-api and atomic-openshift-master-controllers services on the 2 masters.

Comment 16 Johnny Liu 2016-11-08 10:43:22 UTC
I encountered this issue again on AWS with an all-in-one install.

TASK [openshift_hosted : Add registry environment variable] ********************
Tuesday 08 November 2016  10:02:56 +0000 (0:00:05.696)       0:18:04.671 ****** 

fatal: [ec2-54-152-248-175.compute-1.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["oc", "--config=/tmp/openshift-ansible-k9cctP/admin.kubeconfig", "--namespace=default", "set", "env", "dc/docker-registry", "REGISTRY_CONFIGURATION_PATH=/etc/registry/config.yml"], "delta": "0:00:07.230430", "end": "2016-11-08 05:03:08.365631", "failed": true, "rc": 1, "start": "2016-11-08 05:03:01.135201", "stderr": "error: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member https://ip-172-18-13-61.ec2.internal:4001 has no leader", "stdout": "", "stdout_lines": [], "warnings": []}


I did not run a test with an external etcd cluster today; according to the above comments, this does not seem to be specific to clustered etcd.

The next time I encounter this issue I will grab the etcd logs from all etcd hosts; for now, you can refer to the logs attached in comment 14.

Comment 17 Scott Dodson 2016-11-08 16:07:36 UTC
Dan,

Any idea who should take a look at general failures in etcd? None of the components seem to make sense.

Comment 18 Seth Jennings 2016-11-09 17:37:55 UTC
I'm trying to go through the etcd logs to see what is happening with the etcd cluster.

host                              raft id
openshift-120.lab.sjc.redhat.com  a8cc7aee37323052
openshift-122.lab.sjc.redhat.com  dd6e7cb0677a7c5b
openshift-135.lab.sjc.redhat.com  f327eda988a9b19b
openshift-153.lab.sjc.redhat.com  8585ce08ceba81df
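
For reference, this host-to-raft-ID mapping can also be listed directly from any healthy member with the v2 client (TLS flags as in comment 0); the command is illustrative, not output taken from the attached logs:

# prints each member's raft ID together with its name, peerURLs and clientURLs
etcdctl --ca-file "${ca_file}" --cert-file "${cert_file}" --key-file "${key_file}" -C ${url} member list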

It appears that f327eda988a9b19b (135) is the initial leader but goes offline for some reason (network drop?):

Nov 08 04:53:36 openshift-135.lab.sjc.redhat.com etcd_container[17284]: 2016-11-08 04:53:36.144407 W | etcdserver: failed to send out heartbeat on time (deadline exceeded for 494.761018ms)
Nov 08 04:53:36 openshift-135.lab.sjc.redhat.com etcd_container[17284]: 2016-11-08 04:53:36.144466 W | etcdserver: server is likely overloaded
(repeats forever)

The remaining 3 nodes don't seem to be successfully electing a new leader.

dd6e7cb0677a7c5b (122) has some heartbeat failures but recovers. With 4 members the quorum is 3, so this may have been enough to drop the number of reachable members to 2, preventing a leader election.

All logs show the re-election of f327eda988a9b19b even though it never seems to come back online according to its own logs.

Comment 19 Seth Jennings 2016-11-09 17:59:45 UTC
Avesh, found this:
https://github.com/coreos/etcd/issues/5154

We might need to look at changing the --heartbeat-interval and --election-timeout
for etcd in these cloud situations to make it more tolerant of scheduling latency or network connection/congestion issues.
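
For illustration only, such a change would normally go into /etc/etcd/etcd.conf on every etcd host; the values below are placeholders rather than a recommendation from this bug (etcd's tuning guidance is to keep the election timeout roughly 5-10x the heartbeat interval):

# /etc/etcd/etcd.conf -- restart etcd after editing, use the same values on all members
ETCD_HEARTBEAT_INTERVAL=500
ETCD_ELECTION_TIMEOUT=2500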

Comment 20 Andy Goldstein 2016-11-10 22:03:34 UTC
Gaoyun Pei, can we modify your OpenStack cluster to try to debug this? We'd like to stop & restart etcd containers at will, add in some debugging code, etc.

Comment 22 Andy Goldstein 2016-11-11 17:14:41 UTC
What we've identified is that high IO operations (docker pull, for example) can cause an etcd server to be delayed in performing its duties (sending/receiving messages to/from clients such as openshift, performing cluster heartbeats, etc). If you have etcd colocated with anything else (openshift master, openshift node), you're probably going to run into this until we can figure out the root cause of the issue.


We're not sure why disk IO operations would cause an etcd http request/response to take such a long time that it times out. We probably need some help from someone in the kernel or perf team who knows more about the kernel, IO, and fair scheduling.
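
A minimal sketch of how to reproduce the effect, assuming a non-containerized etcd (adjust the unit name for etcd_container), is to generate direct IO on the etcd host while watching its journal for the heartbeat warnings quoted in comment 18; this is an approximation, not the exact procedure we used:

# create sustained write load on the root partition
dd if=/dev/zero of=/var/tmp/ioload bs=1M count=4096 oflag=direct &
# watch for the symptoms seen in the attached etcd logs
journalctl -u etcd -f | grep -E 'failed to send out heartbeat|server is likely overloaded'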

Also, we've observed this so far in both EC2 and OpenStack on RHEL 7.3. Johnny, were you initially testing on 7.2 or 7.3?

Comment 23 Johnny Liu 2016-11-14 09:43:29 UTC
(In reply to Andy Goldstein from comment #22)
> Also, we've observed this so far in both EC2 and OpenStack on RHEL 7.3.
> Johnny, were you initially testing on 7.2 or 7.3?

Yes, it is 7.3, kernel-3.10.0-514.el7.x86_64. In the following testing, I also reproduced it on 7.2 (3.10.0-327.36.1.el7.x86_64).

Comment 24 Johnny Liu 2016-11-14 11:47:48 UTC
(In reply to Andy Goldstein from comment #22)
> What we've identified is that high IO operations (docker pull, for example)
> can cause an etcd server to be delayed in performing its duties
> (sending/receiving messages to/from clients such as openshift, performing
> cluster heartbeats, etc). If you have etcd colocated with anything else
> (openshift master, openshift node), you're probably going to run into this
> until we can figure out the root cause of the issue.
Seems like yes. When installing an all-in-one env, this bug reproduced; after separating etcd onto a standalone host and re-running the installation, it succeeded.

Comment 27 Andy Goldstein 2016-11-14 16:05:27 UTC
Johnny / Gaoyun Pei / Peter

In your testing, did you always have etcd colocated with a schedulable atomic-openshift-node?

Comment 28 Timothy St. Clair 2016-11-14 16:12:47 UTC
Do we know if the etcd vm's are on different hypervisor nodes or the same?

Comment 30 Johnny Liu 2016-11-15 02:16:47 UTC
(In reply to Andy Goldstein from comment #27)
> Johnny / Gaoyun Pei / Peter
> 
> In your testing, did you always have etcd colocated with a schedulable
> atomic-openshift-node?

Sometimes, but not always. 

We mainly have the following test scenarios:
1. embedded etcd running in master
2. external etcd collocated with a master
3. external etcd collocated with a schedulable node

Comment 32 Andy Goldstein 2016-11-15 21:34:19 UTC
Please ensure that you test where either:

1) etcd is not colocated with anything else
2) etcd is colocated with the master

If you test with etcd colocated with a schedulable node, you will likely run into this issue.

Comment 33 Johnny Liu 2016-11-16 02:23:54 UTC
(In reply to Timothy St. Clair from comment #31)
> 1. is not supported, but good to verify 
> 2. yup, good 
> 3. what do you mean collocated with a schedulable node?  external etcd
> should *never* be co-located on a node running a openshift-node. 
> 
> In which one of these scenarios if this issue manifesting?

This issue was reproduced in all the 3 test scenarios.

Comment 34 Johnny Liu 2016-11-16 11:31:06 UTC
(In reply to Johnny Liu from comment #33)
> (In reply to Timothy St. Clair from comment #31)
> > 1. is not supported, but good to verify 
> > 2. yup, good 
> > 3. what do you mean collocated with a schedulable node?  external etcd
> > should *never* be co-located on a node running a openshift-node. 
> > 
> > In which one of these scenarios if this issue manifesting?
> 
> This issue was reproduced in all the 3 test scenarios.

About scenario 2, I am now not sure whether I actually reproduced this issue in past testing.

Today I re-ran the following test scenarios:
1. embedded etcd running in all-in-one
This bug is reproduced. 

2. etcd is collocated with a schedulable node
This bug is reproduced.

3. etcd is not colocated with anything else
This bug is NOT reproduced.

4. etcd is colocated with the master
This bug is NOT reproduced.


So according to the test results and comment 22, it seems that separating etcd onto a standalone machine or a non-schedulable node resolves this issue.

If so, this issue becomes a doc bug telling customers not to co-locate etcd with a schedulable node, especially in the following scenario:
master + embedded etcd + N nodes, with the master marked as "schedulable".
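
As an illustration of the "standalone etcd" layout, an openshift-ansible inventory would keep the etcd hosts in their own [etcd] group instead of reusing node hosts; the hostnames below are placeholders:

[OSEv3:children]
masters
etcd
nodes

[masters]
master1.example.com

[etcd]
# dedicated hosts, not co-located with a schedulable openshift-node
etcd1.example.com
etcd2.example.com
etcd3.example.com

[nodes]
node1.example.com
node2.example.com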

Comment 35 Timothy St. Clair 2016-11-16 14:38:18 UTC
3 & 4 are the supported configurations, and masters should *never* be marked as "schedulable"

Comment 36 Johnny Liu 2016-11-17 02:21:57 UTC
I went through docs.openshift.com and did not find any place that calls out the unsupported configurations. If I missed something, please point me to the correct URL; if the docs are missing this, we really need to highlight it somewhere.

1. etcd should *never* be collocated with a schedulable node
2. a master should *never* be marked as "schedulable" (see the example command below)
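
For example (hostname is a placeholder), an existing co-located master can be taken out of scheduling with:

# mark the master non-schedulable so workload pods do not land next to etcd
oadm manage-node master1.example.com --schedulable=false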


And I found the following docs for deploying an all-in-one standalone registry.

Check "All-in-one Stand-alone Registry Inventory File" in https://docs.openshift.com/container-platform/3.3/install_config/install/stand_alone_registry.html#registry-quick-installation

Comment 37 Jeremy Eder 2016-11-17 19:57:47 UTC
This is marked as a blocker for 3.4...should it be fixed with documentation?

Comment 38 Andy Goldstein 2016-11-17 20:00:54 UTC
This is a docs-only fix. It's not a 3.4 blocker.

Comment 39 Johnny Liu 2016-11-21 02:51:29 UTC
(In reply to Jeremy Eder from comment #37)
> This is marked as a blocker for 3.4...should it be fixed with documentation?

Personally I hope we can dig into this issue and get a real fix for etcd's timeouts under high IO (during 3.2/3.3 we often collocated etcd with a schedulable node and did not encounter this issue).

Based on your comment and Andy's comments, etcd should never be collocated with a schedulable node. I also validated that suggestion, and it indeed improves the behavior, so I am removing the "testblocker" keyword; this is no longer a blocker.

If we really plan to fix this issue with documentation, I am okay with it too.

Comment 40 Stefan Schimanski 2016-11-21 07:42:00 UTC
As Andy writes in comment #22 we were able to get the etcd main loop to a halt for more than a second by creating IO load on the root partition.

In an empty OpenShift cluster we saw etcd trying to write something like 10-20 kb per second. So not really a lot.

We then moved the etcd storage to a tmpfs to decouple it from the IO load. Interestingly, we were still able to reproduce etcd's hiccups. That is why we suspect the problem might be kernel-related.
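
For reference, one way to run that experiment (a debugging-only sketch; the path and size are illustrative, and data on a tmpfs is lost on reboot):

systemctl stop etcd
# back the etcd data directory with RAM instead of the root disk
mount -t tmpfs -o size=512m tmpfs /var/lib/etcd
systemctl start etcd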

@Andy: did we actually try with an older etcd version? I remember we wanted to.

Comment 45 Greg Blomquist 2019-07-03 17:55:10 UTC
No updates in 2 years on this BZ.  The last version discussed in this BZ has reached EOL.  If this problem still persists, please open a new bug against updated OCP versions.

