Bug 1459430 - ES Pod failed to start up if set openshift_logging_es_cluster_size as non-default value
Status: VERIFIED
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Platform: Unspecified
Priority: medium
Severity: medium
Target Release: 3.7.0
Assigned To: Jan Wozniak
QA Contact: Xia Zhao
Keywords: Regression
Reported: 2017-06-07 03:00 EDT by Junqi Zhao
Modified: 2017-10-05 13:47 EDT
CC List: 6 users

Doc Type: No Doc Update
Type: Bug


Attachments
ansible inventory file (613 bytes, text/plain), 2017-06-07 03:00 EDT, Junqi Zhao
es pod log (14.43 KB, text/plain), 2017-06-07 03:01 EDT, Junqi Zhao

Description Junqi Zhao 2017-06-07 03:00:10 EDT
Created attachment 1285672 [details]
ansible inventory file

Description of problem:
In the inventory file, openshift_logging_es_cluster_size was set to a non-default value, such as 2; with that setting the ES pods fail to start up. The deployer log shows that the ES rc took longer than 600 seconds to become ready.
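For reference, the relevant inventory variables (a minimal illustration only; the complete settings are in the attached inventory file):

[OSEv3:vars]
openshift_logging_install_logging=true
openshift_logging_es_cluster_size=2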


Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.96
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-roles-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-docs-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-lookup-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-callback-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-playbooks-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-filter-plugins-3.6.96-1.git.0.8c6777b.el7.noarch

Image IDs from the brew registry:
logging-auth-proxy      v3.6                1f04cfd77ded        3 hours ago         230.2 MB
logging-kibana          v3.6                871018ff9145        3 hours ago         342.4 MB
logging-elasticsearch   v3.6                6b1f9d0935ab        3 hours ago         404.5 MB
logging-fluentd         v3.6                ff62b5088dcf        3 hours ago         232.5 MB
logging-curator         v3.6                028e689a3276        4 weeks ago         211.1 MB


How reproducible:
Always

Steps to Reproduce:
1. In the inventory file, set openshift_logging_es_cluster_size to a non-default value (e.g. 2)
2. Deploy logging 3.6 via ansible
3. Check the pods' status

Actual results:
ES pods fail to start up.

Expected results:
ES pods start up successfully.

Additional info:
The inventory file and the ES pod log are attached.
Comment 1 Junqi Zhao 2017-06-07 03:01 EDT
Created attachment 1285673 [details]
es pod log
Comment 2 Junqi Zhao 2017-06-07 03:03:25 EDT
It may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1456139
Comment 3 Junqi Zhao 2017-06-07 04:30:01 EDT
Same phenomenon with openshift_logging_es_ops_cluster_size.

# oc get po | grep logging-es
logging-es-data-master-ve2jlrng-1-0zr8z        1/1       Running   0          22m
logging-es-ops-data-master-ce0zcfbw-1-deploy   0/1       Error     0          21m
logging-es-ops-data-master-cttxmsd8-1-deploy   0/1       Error     0          21m
logging-es-ops-data-master-m0bas5mp-1-deploy   0/1       Error     0          22m
logging-es-ops-data-master-nb7q27qm-1-deploy   0/1       Error     0          21m
# oc logs logging-es-ops-data-master-ce0zcfbw-1-deploy
--> Scaling logging-es-ops-data-master-ce0zcfbw-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-ops-data-master-ce0zcfbw-1 to become ready
error: update acceptor rejected logging-es-ops-data-master-ce0zcfbw-1: pods for rc "logging-es-ops-data-master-ce0zcfbw-1" took longer than 600 seconds to become ready
Comment 4 Rich Megginson 2017-06-08 11:19:24 EDT
Could be related to the readiness probe.
Comment 5 Jan Wozniak 2017-06-12 07:25:56 EDT
(In reply to Junqi Zhao from comment #0)
> Steps to Reproduce:
> 1. Deploy logging 3.6 via ansible and check pods' status


Junqi, could you please also include the ansible branch/commit/tag you were using for the deployment? I am trying to reproduce this bug and so far have been unable to do so.
Comment 6 Jan Wozniak 2017-06-12 11:21:11 EDT
No longer needed; I can reproduce it with the current master branch.
Comment 7 Jan Wozniak 2017-06-13 07:44:41 EDT
I think the sequence of events with a cluster size larger than 1 and the readiness probe turns into a deadlock:

- ES pods are started and wait for the readiness probe to pass before they are allowed network communication through the service [1]
- the ES containers attempt master discovery, which requires that network communication [2]
- the readiness probe keeps failing because ES returns status 503, and ES returns 503 because master discovery hasn't happened yet [3]

possible workaround:
- the readiness probe could accept both status 200 and 503

Could someone confirm or deny the described logic?
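For illustration, a rough sketch of what that workaround could look like in the probe script (the cert/key variables below are placeholders, not the actual contents of readiness.sh [1]):

#!/bin/bash
# Query local ES cluster health and keep only the HTTP status code.
status=$(curl -s -o /dev/null -w '%{http_code}' \
  --cacert "$ES_CA" --cert "$ES_CERT" --key "$ES_KEY" \
  "https://localhost:9200/_cluster/health")
case "$status" in
  200|503) exit 0 ;;   # reachable; 503 only means no master has been elected yet
  *)       exit 1 ;;   # anything else (connection refused, etc.) stays not ready
esac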




[1] https://raw.githubusercontent.com/wozniakjan/origin-aggregated-logging/6cb7ee9c61725b3b2b8ed9fdeb468a4e381770da/elasticsearch/probe/readiness.sh

[2] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-discovery-zen.html

[3] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes
Comment 8 Jan Wozniak 2017-06-23 10:31:19 EDT
I proposed a solution based on labelling the ES pods and raised an issue in the fabric8 plugin.

https://github.com/fabric8io/elasticsearch-cloud-kubernetes/issues/90
https://github.com/wozniakjan/elasticsearch-cloud-kubernetes/pull/1

We'll see what the recommended way to resolve this is.
Comment 9 Peter Portante 2017-06-23 13:16:13 EDT
FYI: I did not intend to clear your needinfo request, Jan; I am not sure what I did in BZ land that caused it to be cleared.  Please restore if it makes sense to do so.
Comment 10 Peter Portante 2017-06-28 20:14:36 EDT
This is a readiness check problem, and that check is preventing Elasticsearch from working at all.  So having this as a "medium" severity might be a bit misleading.

Can this severity be changed?

The readiness check should be removed for now, so that proper investigation into a working readiness check can be done for a following release.
Comment 11 Jan Wozniak 2017-06-30 10:07:41 EDT
The readiness probe will be temporarily removed when PR 4641 merges and reintroduced in 3.6.1.

https://github.com/openshift/openshift-ansible/pull/4641
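For anyone hit by this on an existing deployment before the fix lands, a hypothetical manual equivalent would be something like the loop below (it assumes the ES dcs carry the component=es label, and component=es-ops for the ops cluster; the PR itself drops the probe from the ansible templates):

for dc in $(oc get dc -n logging -l component=es -o name); do
  oc patch "$dc" -n logging --type=json \
    -p '[{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'
done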
Comment 12 Eduardo Minguez 2017-07-03 07:18:37 EDT
Just in case: it happened to me with the latest 3.5 too... maybe the dc timeout can be increased, like:

for i in $(oc get dc -n logging -o jsonpath='{.items[*].metadata.name}'); do echo $i; oc patch dc ${i} -n logging -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1200}}}}'; done
Comment 13 Jan Wozniak 2017-07-03 10:15:36 EDT
The temporary fix has been merged to master; this shouldn't be happening from now on.
Comment 14 Junqi Zhao 2017-07-10 03:38:48 EDT
The issue is fixed; logging works well when setting openshift_logging_es_cluster_size to a non-default value. Please change the status to ON_QA so we can close it.

Testing environments:
# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-playbooks-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-callback-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-roles-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-docs-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-lookup-plugins-3.6.138-1.git.0.2c647a9.el7.noarch


Images from brew registry
logging-elasticsearch   v3.6.138-1          2b2b061286e6        47 hours ago        404.2 MB
logging-kibana          v3.6.138-1          43aea64f70a8        47 hours ago        342.4 MB
logging-curator         v3.6.138-1          bf5d8756dcfc        47 hours ago        221.5 MB
logging-fluentd         v3.6.138-1          e4ed3ac61d69        47 hours ago        231.5 MB
logging-auth-proxy      v3.6.138-1          9b98fea74082        47 hours ago        214.8 MB
Comment 15 Junqi Zhao 2017-07-10 04:00:30 EDT
Setting it to VERIFIED based on Comment 14.
