Bug 1459430
| Summary: | ES Pod failed to start up if set openshift_logging_es_cluster_size as non-default value | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Junqi Zhao <juzhao> |
| Component: | Logging | Assignee: | Jan Wozniak <jwozniak> |
| Status: | CLOSED ERRATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | Keywords: | Regression |
| Version: | 3.6.0 | CC: | aos-bugs, eminguez, jcantril, jwozniak, pportant, rmeggins |
| Target Milestone: | --- | Target Release: | 3.7.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | Fixed In Version: | |
| Doc Type: | No Doc Update | Doc Text: | |
| Story Points: | --- | Clone Of: | |
| Environment: | | Last Closed: | 2017-11-28 21:56:17 UTC |
| Type: | Bug | Regression: | --- |
| Mount Type: | --- | Documentation: | --- |
| CRM: | | Verified Versions: | |
| Category: | --- | oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | | Cloudforms Team: | --- |
| Target Upstream Version: | | Embargoed: | |
| Attachments: | es pod log, ansible inventory file | | |
Created attachment 1285673 [details]
es pod log
It may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1456139. The same phenomenon occurs with openshift_logging_es_ops_cluster_size.

# oc get po | grep logging-es
logging-es-data-master-ve2jlrng-1-0zr8z        1/1   Running   0   22m
logging-es-ops-data-master-ce0zcfbw-1-deploy   0/1   Error     0   21m
logging-es-ops-data-master-cttxmsd8-1-deploy   0/1   Error     0   21m
logging-es-ops-data-master-m0bas5mp-1-deploy   0/1   Error     0   22m
logging-es-ops-data-master-nb7q27qm-1-deploy   0/1   Error     0   21m

# oc logs logging-es-ops-data-master-ce0zcfbw-1-deploy
--> Scaling logging-es-ops-data-master-ce0zcfbw-1 to 1
--> Waiting up to 10m0s for pods in rc logging-es-ops-data-master-ce0zcfbw-1 to become ready
error: update acceptor rejected logging-es-ops-data-master-ce0zcfbw-1: pods for rc "logging-es-ops-data-master-ce0zcfbw-1" took longer than 600 seconds to become ready

Could be related to the readiness probe.

(In reply to Junqi Zhao from comment #0)
> Steps to Reproduce:
> 1. Deploy logging 3.6 via ansible and check pods' status

Junqi, could you please also include the ansible branch/commit/tag you used for the deployment? I am trying to reproduce this bug and so far I am unable to do so.

No longer needed, I can reproduce it with the current master branch. I think the sequence of events with a cluster size larger than 1 and the readiness probe turns into a deadlock:

- ES pods are started and wait for the readiness probe to allow network communication [1]
- ES containers try master discovery, which requires network communication [2]
- the readiness probe blocks master discovery because ES returns status 503, and ES returns 503 because master discovery hasn't happened yet [3]

Possible workaround:

- the readiness probe could accept both status 200 and 503

Could someone confirm or deny the described logic?

[1] https://raw.githubusercontent.com/wozniakjan/origin-aggregated-logging/6cb7ee9c61725b3b2b8ed9fdeb468a4e381770da/elasticsearch/probe/readiness.sh
[2] https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-discovery-zen.html
[3] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes

I proposed a solution based on labelling the ES pods and raised an issue against the fabric8 plugin. We'll see what the recommended way to resolve this is.
https://github.com/fabric8io/elasticsearch-cloud-kubernetes/issues/90
https://github.com/wozniakjan/elasticsearch-cloud-kubernetes/pull/1

FYI: I did not intend to clear your needinfo request, Jan; I am not sure what I did in BZ land that caused it to be cleared. Please restore it if that makes sense.

This is a readiness check problem, and that check is preventing Elasticsearch from working at all, so having this as a "medium" severity might be a bit misleading. Can the severity be changed? The readiness check should be removed for now, so that a proper investigation into a working readiness check can be done for a following release.

The readiness probe will be temporarily removed when PR 4641 merges and reintroduced in 3.6.1:
https://github.com/openshift/openshift-ansible/pull/4641

Just in case, it happened to me with the latest 3.5 too...
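A minimal sketch of the "accept 200 and 503" workaround idea mentioned above. This is not the actual readiness.sh from [1]; the endpoint, port, and lack of TLS options are assumptions for illustration only:

```bash
#!/bin/bash
# Hypothetical readiness check sketch: treat both 200 (cluster formed) and
# 503 (cluster still waiting for master election) as "ready", so the pod is
# added to the service endpoints and zen master discovery can proceed.
# Endpoint and port are assumptions, not the real probe's configuration.

status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:9200/_cluster/health)

if [ "$status" = "200" ] || [ "$status" = "503" ]; then
    exit 0   # readiness probe passes
fi

exit 1       # anything else (connection refused, 4xx, ...) keeps the pod unready
```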
Maybe the dc timeout can be modified like:

for i in $(oc get dc -n logging -o jsonpath='{.items[*].metadata.name}'); do echo $i; oc patch dc ${i} -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":1200}}}}'; done

The temporary fix has been merged to master; it shouldn't be happening from now on.

The issue was fixed and logging worked well when setting openshift_logging_es_cluster_size to a non-default value. Please change the status to ON_QA so we can close it.

Testing environments:
# rpm -qa | grep openshift-ansible
openshift-ansible-filter-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-playbooks-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-callback-plugins-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-roles-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-docs-3.6.138-1.git.0.2c647a9.el7.noarch
openshift-ansible-lookup-plugins-3.6.138-1.git.0.2c647a9.el7.noarch

Images from brew registry:
logging-elasticsearch   v3.6.138-1   2b2b061286e6   47 hours ago   404.2 MB
logging-kibana          v3.6.138-1   43aea64f70a8   47 hours ago   342.4 MB
logging-curator         v3.6.138-1   bf5d8756dcfc   47 hours ago   221.5 MB
logging-fluentd         v3.6.138-1   e4ed3ac61d69   47 hours ago   231.5 MB
logging-auth-proxy      v3.6.138-1   9b98fea74082   47 hours ago   214.8 MB

Set it to VERIFIED based on Comment 14.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
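As an aside on the readiness-probe workaround discussed in the comments above: for a cluster deployed before the temporary fix landed, a manual equivalent might look like the sketch below. The label selector, namespace, and container index are assumptions and would need to be checked against the actual ES DeploymentConfigs.

```bash
# Hypothetical manual equivalent of the temporary fix: drop the readiness probe
# from each logging ES DeploymentConfig so the pods can join the cluster.
# The "component=es" selector and container index 0 are assumptions.
for dc in $(oc get dc -n logging -l component=es -o jsonpath='{.items[*].metadata.name}'); do
  oc patch dc "$dc" -n logging --type=json \
    -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
done
```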
Created attachment 1285672 [details]
ansible inventory file

Description of problem:
In the inventory file, set openshift_logging_es_cluster_size to a non-default value, such as 2; the es pods then failed to start up. The es logs showed that the es rc took longer than 600 seconds to become ready.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.6.96
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

# rpm -qa | grep openshift-ansible
openshift-ansible-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-roles-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-docs-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-lookup-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-callback-plugins-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-playbooks-3.6.96-1.git.0.8c6777b.el7.noarch
openshift-ansible-filter-plugins-3.6.96-1.git.0.8c6777b.el7.noarch

Image ids from brew registry:
logging-auth-proxy      v3.6   1f04cfd77ded   3 hours ago   230.2 MB
logging-kibana          v3.6   871018ff9145   3 hours ago   342.4 MB
logging-elasticsearch   v3.6   6b1f9d0935ab   3 hours ago   404.5 MB
logging-fluentd         v3.6   ff62b5088dcf   3 hours ago   232.5 MB
logging-curator         v3.6   028e689a3276   4 weeks ago   211.1 MB

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 3.6 via ansible and check pods' status

Actual results:
es pods failed to start up

Expected results:
es pods can start up

Additional info:
Attached inventory file and log file
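For reference, the reproducer comes down to a single non-default inventory variable. A minimal, hypothetical excerpt of the relevant [OSEv3:vars] settings (the surrounding variables are assumptions; the attached inventory file remains the authoritative source):

```ini
# [OSEv3:vars] excerpt - hypothetical minimal reproducer
openshift_logging_install_logging=true
openshift_logging_es_cluster_size=2        # non-default value that triggered the failure
openshift_logging_es_ops_cluster_size=2    # same phenomenon reported for the ops cluster
```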