Bug 1459430 - ES pod fails to start up if openshift_logging_es_cluster_size is set to a non-default value
Summary: ES pod fails to start up if openshift_logging_es_cluster_size is set to a non-default value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Jan Wozniak
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-07 07:00 UTC by Junqi Zhao
Modified: 2017-11-28 21:56 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 21:56:17 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible inventory file (613 bytes, text/plain)
2017-06-07 07:00 UTC, Junqi Zhao
no flags
es pod log (14.43 KB, text/plain)
2017-06-07 07:01 UTC, Junqi Zhao
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Junqi Zhao 2017-06-07 07:00:10 UTC
Created attachment 1285672 [details]
ansible inventory file

Description of problem:
When openshift_logging_es_cluster_size is set to a non-default value in the inventory file, such as 2, the ES pods fail to start up. The deployment logs show that the ES replication controller took longer than 600 seconds to become ready.
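
For reference, a minimal inventory fragment that triggers the problem might look like the following (the variable name is from this report; the group header and the default value of 1 are assumptions based on standard openshift-ansible inventories):

```ini
[OSEv3:vars]
# Non-default Elasticsearch cluster size (the default is 1)
openshift_logging_es_cluster_size=2
```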


Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.96
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-roles-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-docs-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-lookup-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-callback-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-playbooks-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-filter-plugins-3.6.96-1.git.0.8c6777b.el7.noarch

Image id from brew registry
logging-auth-proxy      v3.6                1f04cfd77ded        3 hours ago         230.2 MB
logging-kibana          v3.6                871018ff9145        3 hours ago         342.4 MB
logging-elasticsearch   v3.6                6b1f9d0935ab        3 hours ago         404.5 MB
logging-fluentd         v3.6                ff62b5088dcf        3 hours ago         232.5 MB
logging-curator         v3.6                028e689a3276        4 weeks ago         211.1 MB


How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.6 via ansible and check pods' status

Actual results:
es pods failed to start up

Expected results:
es pods can start up

Additional info:
Attached inventory file and log file

Comment 1 Junqi Zhao 2017-06-07 07:01:58 UTC
Created attachment 1285673 [details]
es pod log

Comment 2 Junqi Zhao 2017-06-07 07:03:25 UTC
It may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1456139

Comment 3 Junqi Zhao 2017-06-07 08:30:01 UTC
Same phenomenon with openshift_logging_es_ops_cluster_size.

# oc get po | grep logging-es
logging-es-data-master-ve2jlrng-1-0zr8z        1/1       Running   0          22m
logging-es-ops-data-master-ce0zcfbw-1-deploy   0/1       Error     0          21m
logging-es-ops-data-master-cttxmsd8-1-deploy   0/1       Error     0          21m
logging-es-ops-data-master-m0bas5mp-1-deploy   0/1       Error     0          22m
logging-es-ops-data-master-nb7q27qm-1-deploy   0/1       Error     0          21m
# oc logs logging-es-ops-data-master-ce0zcfbw-1-deploy
--> Scaling logging-es-ops-data-master-ce0zcfbw-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-ops-data-master-ce0zcfbw-1 to become ready
error: update acceptor rejected logging-es-ops-data-master-ce0zcfbw-1: pods for rc "logging-es-ops-data-master-ce0zcfbw-1" took longer than 600 seconds to become ready

Comment 4 Rich Megginson 2017-06-08 15:19:24 UTC
Could be related to the readiness probe.

Comment 5 Jan Wozniak 2017-06-12 11:25:56 UTC
(In reply to Junqi Zhao from comment #0)
> Steps to Reproduce:
> 1. Deploy logging 3.6 via ansible and check pods' status


Junqi, could you please also include the ansible branch/commit/tag you were using for the deployment? I am trying to reproduce this bug and so far I am unable to do so.

Comment 6 Jan Wozniak 2017-06-12 15:21:11 UTC
No longer needed; I can reproduce it with the current master branch.

Comment 7 Jan Wozniak 2017-06-13 11:44:41 UTC
I think the sequence of events with a cluster size larger than 1 and the readiness probe turns into a deadlock:

- ES pods are started and wait for the readiness probe to pass before network communication is allowed [1]
- ES containers attempt master discovery, which requires network communication between the pods [2]
- the readiness probe fails because ES returns status 503, and ES returns 503 because master discovery hasn't happened yet [3]

possible workaround:
- the readiness probe could accept both status 200 and 503

Could someone confirm or deny the described logic?




[1] https://raw.githubusercontent.com/wozniakjan/origin-aggregated-logging/6cb7ee9c61725b3b2b8ed9fdeb468a4e381770da/elasticsearch/probe/readiness.sh

[2] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-discovery-zen.html

[3] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes
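
The workaround above could be sketched as follows. This is a hypothetical illustration, not the shipped readiness.sh from [1]; the curl command, port, and status handling are assumptions:

```shell
#!/bin/bash
# Hypothetical relaxed readiness check: treat HTTP 200 (cluster healthy)
# and 503 (node up but no master elected yet) as "ready", so pods become
# reachable and zen master discovery can complete.

status_ok() {
  case "$1" in
    200|503) return 0 ;;   # ready: the node answers, even without a master
    *)       return 1 ;;   # not ready: connection refused, auth error, etc.
  esac
}

# In a real probe the code would come from the ES HTTP endpoint, e.g.:
#   code=$(curl -sk -o /dev/null -w '%{http_code}' https://localhost:9200/)
code="${1:-503}"

if status_ok "$code"; then
  echo "status $code -> ready"
else
  echo "status $code -> not ready"
fi
```

With this relaxation, a pod waiting on master discovery (503) would pass the probe, breaking the circular dependency between the probe and discovery.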

Comment 8 Jan Wozniak 2017-06-23 14:31:19 UTC
I proposed a solution based on labelling the ES pods and raised an issue in the fabric8 plugin.

https://github.com/fabric8io/elasticsearch-cloud-kubernetes/issues/90
https://github.com/wozniakjan/elasticsearch-cloud-kubernetes/pull/1

We'll see what the recommended way to resolve this is.

Comment 9 Peter Portante 2017-06-23 17:16:13 UTC
FYI: I did not intend to clear your needinfo request, Jan; I am not sure what I did in BZ land that caused it to be cleared.  Please restore if it makes sense to do so.

Comment 10 Peter Portante 2017-06-29 00:14:36 UTC
This is a readiness check problem, and that check is preventing Elasticsearch from working at all.  So having this as a "medium" severity might be a bit misleading.

Can this severity be changed?

The readiness check should be removed for now, so that proper investigation into a working readiness check can be done for a following release.

Comment 11 Jan Wozniak 2017-06-30 14:07:41 UTC
The readiness probe will be temporarily removed when PR 4641 merges, and reintroduced in 3.6.1:

https://github.com/openshift/openshift-ansible/pull/4641

Comment 12 Eduardo Minguez 2017-07-03 11:18:37 UTC
Just in case: it happened to me with the latest 3.5 too. As a workaround, the dc timeout can be raised, for example:

for i in $(oc get dc -n logging -o jsonpath='{.items[*].metadata.name}'); do echo "$i"; oc patch dc "$i" -n logging -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1200}}}}'; done

(Note that timeoutSeconds must be a JSON integer, not a quoted string, and the patch needs -n logging to target the right namespace.)

Comment 13 Jan Wozniak 2017-07-03 14:15:36 UTC
The temporary fix has been merged to master; this should no longer happen.

Comment 14 Junqi Zhao 2017-07-10 07:38:48 UTC
The issue is fixed; logging works well when openshift_logging_es_cluster_size is set to a non-default value. Please change the status to ON_QA so we can close it.

Testing environments:
# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-playbooks-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-callback-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-roles-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-docs-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-lookup-plugins-3.6.138-1.git.0.2c647a9.el7.noarch


Images from brew registry
logging-elasticsearch   v3.6.138-1          2b2b061286e6        47 hours ago        404.2 MB
logging-kibana          v3.6.138-1          43aea64f70a8        47 hours ago        342.4 MB
logging-curator         v3.6.138-1          bf5d8756dcfc        47 hours ago        221.5 MB
logging-fluentd         v3.6.138-1          e4ed3ac61d69        47 hours ago        231.5 MB
logging-auth-proxy      v3.6.138-1          9b98fea74082        47 hours ago        214.8 MB

Comment 15 Junqi Zhao 2017-07-10 08:00:30 UTC
Set to VERIFIED based on Comment 14.

Comment 19 errata-xmlrpc 2017-11-28 21:56:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

