Bug 1459430
| Summary: | ES Pod failed to start up if set openshift_logging_es_cluster_size as non-default value | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Logging | Assignee: | Jan Wozniak <jwozniak> |
| Status: | CLOSED ERRATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | Keywords: | Regression |
| Version: | 3.6.0 | CC: | aos-bugs, eminguez, jcantril, jwozniak, pportant, rmeggins |
| Target Milestone: | --- | Target Release: | 3.7.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Fixed In Version: | |
| Doc Type: | No Doc Update | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| Environment: | | Last Closed: | 2017-11-28 21:56:17 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Attachments: | es pod log, ansible inventory file | | |
Created attachment 1285673 [details]
es pod log
It may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1456139. The same phenomenon occurs with openshift_logging_es_ops_cluster_size.

# oc get po | grep logging-es
logging-es-data-master-ve2jlrng-1-0zr8z        1/1   Running   0   22m
logging-es-ops-data-master-ce0zcfbw-1-deploy   0/1   Error     0   21m
logging-es-ops-data-master-cttxmsd8-1-deploy   0/1   Error     0   21m
logging-es-ops-data-master-m0bas5mp-1-deploy   0/1   Error     0   22m
logging-es-ops-data-master-nb7q27qm-1-deploy   0/1   Error     0   21m

# oc logs logging-es-ops-data-master-ce0zcfbw-1-deploy
--> Scaling logging-es-ops-data-master-ce0zcfbw-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-ops-data-master-ce0zcfbw-1 to become ready
error: update acceptor rejected logging-es-ops-data-master-ce0zcfbw-1: pods for rc "logging-es-ops-data-master-ce0zcfbw-1" took longer than 600 seconds to become ready

Could be related to the readiness probe.

(In reply to Junqi Zhao from comment #0)
> Steps to Reproduce:
> 1. Deploy logging 3.6 via ansible and check pods' status

Junqi, could you please also include the ansible branch/commit/tag you used for the deployment? I am trying to reproduce this bug and so far I am unable to do so.

No longer needed, I can reproduce it with the current master branch. I think the sequence of events with a cluster size larger than 1 and the readiness probe turns into a deadlock:

- ES pods are started and wait for the readiness probe to allow network communication [1]
- ES containers try master discovery, which requires network communication [2]
- the readiness probe blocks master discovery because ES returns status 503, and ES returns 503 because master discovery hasn't happened yet [3]

Possible workaround:

- the readiness probe could accept both status 200 and 503

Could someone confirm or deny the described logic?

[1] https://raw.githubusercontent.com/wozniakjan/origin-aggregated-logging/6cb7ee9c61725b3b2b8ed9fdeb468a4e381770da/elasticsearch/probe/readiness.sh
[2] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-discovery-zen.html
[3] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes

I proposed a solution based on labelling the ES pods and raised an issue against the fabric8 plugin. We'll see what the recommended way to resolve this is.
https://github.com/fabric8io/elasticsearch-cloud-kubernetes/issues/90
https://github.com/wozniakjan/elasticsearch-cloud-kubernetes/pull/1

FYI: I did not intend to clear your needinfo request, Jan; I am not sure what I did in BZ land that caused it to be cleared. Please restore it if that makes sense.

This is a readiness check problem, and that check is preventing Elasticsearch from working at all, so having this as a "medium" severity might be a bit misleading. Can the severity be changed? The readiness check should be removed for now, so that a proper investigation into a working readiness check can be done for a following release.

The readiness probe will be temporarily removed when PR 4641 merges and reintroduced in 3.6.1:
https://github.com/openshift/openshift-ansible/pull/4641

Just in case, it happened to me with the latest 3.5 too...
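A minimal sketch of the "accept 200 and 503" workaround idea mentioned above. This is not the actual readiness.sh from [1]; the endpoint, port, and lack of TLS options are assumptions for illustration only:

```bash
#!/bin/bash
# Hypothetical readiness check sketch: treat both 200 (cluster formed) and
# 503 (cluster still waiting for master election) as "ready", so the pod is
# added to the service endpoints and zen master discovery can proceed.
# Endpoint and port are assumptions, not the real probe's configuration.

status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:9200/_cluster/health)

if [ "$status" = "200" ] || [ "$status" = "503" ]; then
    exit 0   # readiness probe passes
fi

exit 1       # anything else (connection refused, 4xx, ...) keeps the pod unready
```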
Maybe the dc timeout can be modified like:

for i in $(oc get dc -n logging -o jsonpath='{.items[*].metadata.name}'); do echo $i; oc patch dc ${i} -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1200}}}}'; done

The temporary fix has been merged to master; it shouldn't be happening from now on.

The issue was fixed and logging worked well when setting openshift_logging_es_cluster_size to a non-default value. Please change the status to ON_QA so we can close it.

Testing environments:
# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-playbooks-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-callback-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-roles-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-docs-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-lookup-plugins-3.6.138-1.git.0.2c647a9.el7.noarch

Images from brew registry:
logging-elasticsearch   v3.6.138-1   2b2b061286e6   47 hours ago   404.2 MB
logging-kibana          v3.6.138-1   43aea64f70a8   47 hours ago   342.4 MB
logging-curator         v3.6.138-1   bf5d8756dcfc   47 hours ago   221.5 MB
logging-fluentd         v3.6.138-1   e4ed3ac61d69   47 hours ago   231.5 MB
logging-auth-proxy      v3.6.138-1   9b98fea74082   47 hours ago   214.8 MB

Set it to VERIFIED based on Comment 14.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
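As an aside on the readiness-probe workaround discussed in the comments above: for a cluster deployed before the temporary fix landed, a manual equivalent might look like the sketch below. The label selector, namespace, and container index are assumptions and would need to be checked against the actual ES DeploymentConfigs.

```bash
# Hypothetical manual equivalent of the temporary fix: drop the readiness probe
# from each logging ES DeploymentConfig so the pods can join the cluster.
# The "component=es" selector and container index 0 are assumptions.
for dc in $(oc get dc -n logging -l component=es -o jsonpath='{.items[*].metadata.name}'); do
  oc patch dc "$dc" -n logging --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
done
```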
Created attachment 1285672 [details]
ansible inventory file

Description of problem:
In the inventory file, set openshift_logging_es_cluster_size to a non-default value, such as 2; the es pods then failed to start up. The es logs showed that the es rc took longer than 600 seconds to become ready.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.96
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-roles-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-docs-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-lookup-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-callback-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-playbooks-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-filter-plugins-3.6.96-1.git.0.8c6777b.el7.noarch

Image ids from brew registry:
logging-auth-proxy      v3.6   1f04cfd77ded   3 hours ago   230.2 MB
logging-kibana          v3.6   871018ff9145   3 hours ago   342.4 MB
logging-elasticsearch   v3.6   6b1f9d0935ab   3 hours ago   404.5 MB
logging-fluentd         v3.6   ff62b5088dcf   3 hours ago   232.5 MB
logging-curator         v3.6   028e689a3276   4 weeks ago   211.1 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.6 via ansible and check pods' status

Actual results:
es pods failed to start up

Expected results:
es pods can start up

Additional info:
Attached inventory file and log file
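For reference, the reproducer comes down to a single non-default inventory variable. A minimal, hypothetical excerpt of the relevant [OSEv3:vars] settings (the surrounding variables are assumptions; the attached inventory file remains the authoritative source):

```ini
# [OSEv3:vars] excerpt - hypothetical minimal reproducer
openshift_logging_install_logging=true
openshift_logging_es_cluster_size=2        # non-default value that triggered the failure
openshift_logging_es_ops_cluster_size=2    # same phenomenon reported for the ops cluster
```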