Bug 1540099

Summary: Failed to deploy clustered ES pods
Product: OpenShift Container Platform Reporter: Anping Li <anli>
Component: Logging Assignee: ewolinet
Status: CLOSED CURRENTRELEASE QA Contact: Anping Li <anli>
Severity: high Docs Contact:
Priority: high    
Version: 3.9.0 CC: andcosta, anli, aos-bugs, eminguez, ewolinet, farandac, jcantril, pkanthal, pportant, rkant, rmeggins, sdodson, tlarsson, vlaad
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
When we create ES nodes, we skip doing health checks for each node, since the number of nodes necessary for recovery and quorum changes; this can prevent any node from starting up in a timely manner unless all the other cluster members are available.
Story Points: ---
Clone Of:
: 1581058 (view as bug list) Environment:
Last Closed: 2018-08-29 21:21:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1544243, 1581058    
Attachments:
Description Flags
openshift-ansible logs none

Description Anping Li 2018-01-30 09:43:16 UTC
Description of problem:
The clustered ES pods could not be started while logging was being deployed. The playbook should not block on the "Waiting for {{ _es_node }} to finish scaling up" task in restart_es_node.yml; instead, it should check ES node health after all ES DCs have been rolled out.
 

Version-Release number of selected component (if applicable):
openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. deploy clustered ES logging
openshift_logging_es_cluster_size=3
openshift_logging_es_number_of_replicas=1
openshift_logging_es_number_of_shards=1

2. check ansible logs
RUNNING HANDLER [openshift_logging_elasticsearch : include_tasks] **************
included: /usr/share/ansible/openshift-ansible/roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml for openshift-181.example.com

RUNNING HANDLER [openshift_logging_elasticsearch : command] ********************
changed: [openshift-181.example.com] => {"changed": true, "cmd": ["oc", "rollout", "latest", "logging-es-data-master-z60k9g53", "-n", "logging"], "delta": "0:00:00.353467", "end": "2018-01-30 03:50:05.967638", "rc": 0, "start": "2018-01-30 03:50:05.614171", "stderr": "", "stderr_lines": [], "stdout": "deploymentconfig \"logging-es-data-master-z60k9g53\" rolled out", "stdout_lines": ["deploymentconfig \"logging-es-data-master-z60k9g53\" rolled out"]}

RUNNING HANDLER [openshift_logging_elasticsearch : oc_obj] *********************
FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (60 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (59 retries left).
----------

FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (1 retries left).
fatal: [openshift-181.example.com]: FAILED! => {"attempts": 60, "changed": false, "results": {"cmd": "/usr/bin/oc get dc logging-es-data-master-z60k9g53 -o json -n logging", "results": [{"apiVersion": "v1", "kind": "DeploymentConfig", "metadata": {

3. Check the DCs and RCs. Only the first ES DC had been rolled out when the ES health check ran; the health check should not run until all ES DCs have been rolled out.

[root@host3-ha-master-1 ~]# oc get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   1          1         1         config
logging-es-data-master-2gg3vuno   0          1         0         
logging-es-data-master-e74f36rn   0          1         0         
logging-es-data-master-z60k9g53   1          1         0         
logging-kibana                    1          1         1         config


[root@host3-ha-master-1 ~]# oc get rc
NAME                                DESIRED   CURRENT   READY     AGE
logging-curator-1                   1         1         0         38m
logging-es-data-master-z60k9g53-1   0         0         0         37m
logging-kibana-1                    1         1         1         38m


Actual results:
The deployment failed while waiting for ES to become ready.

Expected results:
The logging deployment succeeds.

Additional info:
Could we check the cluster health at the end of the playbooks, or just print a warning message instead of failing the playbook? I don't think the error messages in the playbooks are helpful for debugging.

Comment 1 Anping Li 2018-01-30 09:50:44 UTC
@scott,
What do you think about the Additional info? Especially when logging is enabled during a fresh installation.

Comment 2 Anping Li 2018-01-30 10:03:11 UTC
Created attachment 1388245 [details]
openshift-ansible logs

Comment 3 Scott Dodson 2018-01-30 14:26:08 UTC
Sorry, I have to defer to the wisdom of the logging team here. If it's a problem that goes away over time and we don't need to block later tasks on its completion, then I think it's fine to wait a reasonable time and then move on without failing.

Comment 4 ewolinet 2018-01-30 16:29:28 UTC
This is being done correctly as discussed with Peter Portante. 
We want to check the health of the ES pod that we just rolled out and ensure it is in a good state (yellow or green) before we continue to roll out the other DCs.

We are checking the health of the pods after the playbook since this is all being done in a handler.

Is the loop timing out waiting for the pod to be ready?
What is the status of the pods when this fails?
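The per-node health check described above can be sketched as a small shell loop. This is a hedged illustration, not the actual restart_es_node.yml logic; the certificate paths, retry count, and sleep interval are assumptions:

```shell
# Extract the "status" field from an ES cluster-health JSON document.
check_health() {
  echo "$1" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Illustrative: poll the cluster health endpoint inside an ES pod until
# the status is yellow or green, as the handler does after each rollout.
wait_for_yellow_or_green() {
  local pod=$1 retries=${2:-60}
  for i in $(seq 1 "$retries"); do
    health=$(oc exec -c elasticsearch "$pod" -n logging -- \
      curl -s --cacert /etc/elasticsearch/secret/admin-ca \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        "https://localhost:9200/_cluster/health")
    case "$(check_health "$health")" in
      green|yellow) return 0 ;;
    esac
    sleep 10
  done
  return 1  # mirrors the "FAILED! => {"attempts": 60, ...}" outcome above
}
```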

Comment 5 Anping Li 2018-01-31 02:23:48 UTC
@ewolinet, the loop timeout is long enough to wait for the ES pod. I guess that when es_cluster_size=3, the first ES node can't become ready unless at least two ES nodes are running.
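The guess above matches the standard Elasticsearch zen-discovery quorum arithmetic (a majority of master-eligible nodes); this is a sketch of that formula, not code from the playbook:

```shell
# discovery.zen.minimum_master_nodes is conventionally (n / 2) + 1,
# i.e. a strict majority of the master-eligible nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2: with es_cluster_size=3, one node alone cannot elect a master
quorum 1   # -> 1: a single-node cluster can go ready on its own
```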

By the way, the ES deployment succeeds once I remove the tasks "Waiting for {{ _es_node }} to finish scaling up" and "Waiting for ES to be ready for {{ _es_node }}".

Comment 7 Jeff Cantrill 2018-01-31 18:14:40 UTC
Please retest, as I realized today that the latest 3.9 elasticsearch image from the dist-git repos did not include the simplifications to the readiness probe. I wonder if that could make a difference in what you are seeing.

Comment 8 Anping Li 2018-02-01 10:58:07 UTC
@jeff, what is the image version or PR? It still fails with logging-elasticsearch/images/v3.9.0-0.35.0.0.

  Warning  Unhealthy              8m (x5 over 9m)   kubelet, 192.168.1.214  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
  Warning  Unhealthy              4m (x51 over 8m)  kubelet, 192.168.1.214  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]

Comment 15 ewolinet 2018-02-07 15:16:28 UTC
With a fresh rollout the cluster does not require all intended nodes to be running prior to it being able to go green. As we add a new cluster member we should see the cluster eventually go green.


Is the loop timing out waiting for the pod to be ready?
What is the status of the pods when this fails?

What do the logs for the elasticsearch container show for the starting up pod?

What is the memory and cpu limits/requests for your ES pods while testing? It could be that the ES pod is taking a long time to start up.

Comment 21 Takeshi Larsson 2018-02-20 08:07:11 UTC
@ewolinet, is there a corresponding BZ for 3.7 for the same issue? Also, thanks for backporting it to 3.7.

Comment 22 ewolinet 2018-02-20 19:34:55 UTC
@Takeshi

This is the same issue for 3.7:
https://bugzilla.redhat.com/show_bug.cgi?id=1544243

Comment 23 Anping Li 2018-03-01 02:59:34 UTC
This blocks HA Elasticsearch deployment.

Comment 26 Anping Li 2018-03-06 02:56:28 UTC
@ewolinet, the bug is fixed. Could you move it to ON_QA?

Comment 27 Anping Li 2018-03-07 02:29:21 UTC
Works well with openshift-ansible:v3.9.2

Comment 28 Andre Costa 2018-05-31 14:41:44 UTC
The issue persists on v3.9.27. The customer upgraded the logging stack using the playbooks, and the issue continues even when manually rolling out Elasticsearch:

for x in $(oc get dc -l component=es -o=custom-columns=NAME:.metadata.name --no-headers); do oc rollout latest $x; done;

[2018-02-09 03:29:17,127][ERROR][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] Failure while checking .searchguard.logging-es-data-master-2ibt4vf5 index MasterNotDiscoveredException[null]
MasterNotDiscoveredException[null]
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:234)
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
	at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:816)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[2018-02-09 03:29:47,133][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)
[2018-02-09 03:30:20,134][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)
[2018-02-09 03:30:53,135][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)

Comment 35 Jeff Cantrill 2018-08-28 18:19:56 UTC
The issue identified in #c16 is not the same as #c30. One is caused by the ES nodes not being able to see one another and therefore being unable to elect a master. The other is caused by unhealthy SG indices, which can be resolved by:

* deleting each of the ".searchguard" indices
* reseeding each one on each pod: 'oc exec -c elasticsearch $pod -- es_seed_acl'
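A hedged sketch of those two recovery steps as a shell loop. The `.searchguard.<pod-name>` index naming and the `es_seed_acl` helper come from the steps above; the label selector, namespace, and curl certificate flags are assumptions:

```shell
# Build the per-pod searchguard index name (.searchguard.<pod-name>).
searchguard_index() { echo ".searchguard.$1"; }

# Illustrative only: for each ES pod, delete its .searchguard index,
# then reseed the ACL documents with es_seed_acl.
reseed_searchguard() {
  local ns=${1:-logging}
  for pod in $(oc get pods -n "$ns" -l component=es -o name | cut -d/ -f2); do
    oc exec -c elasticsearch "$pod" -n "$ns" -- \
      curl -s -XDELETE --cacert /etc/elasticsearch/secret/admin-ca \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        "https://localhost:9200/$(searchguard_index "$pod")"
    oc exec -c elasticsearch "$pod" -n "$ns" -- es_seed_acl
  done
}
```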

I defer to QE findings and verification. If they believe it is resolved, then we should close this issue and open a new one specific to any remaining issues.