Description of problem:
The clustered ES pods cannot be started when logging is deployed. The handler should not block on "Waiting for {{ _es_node }} to finish scaling up" in restart_es_node.yml; it should check ES node health only after all ES DCs have been rolled out.

Version-Release number of selected component (if applicable):
openshift-ansible-3.9.0-0.31.0.git.0.e0a0ad8.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging with a clustered ES:
openshift_logging_es_cluster_size=3
openshift_logging_es_number_of_replicas=1
openshift_logging_es_number_of_shards=1

2. Check the ansible logs:

RUNNING HANDLER [openshift_logging_elasticsearch : include_tasks] **************
included: /usr/share/ansible/openshift-ansible/roles/openshift_logging_elasticsearch/tasks/restart_es_node.yml for openshift-181.example.com

RUNNING HANDLER [openshift_logging_elasticsearch : command] ********************
changed: [openshift-181.example.com] => {"changed": true, "cmd": ["oc", "rollout", "latest", "logging-es-data-master-z60k9g53", "-n", "logging"], "delta": "0:00:00.353467", "end": "2018-01-30 03:50:05.967638", "rc": 0, "start": "2018-01-30 03:50:05.614171", "stderr": "", "stderr_lines": [], "stdout": "deploymentconfig \"logging-es-data-master-z60k9g53\" rolled out", "stdout_lines": ["deploymentconfig \"logging-es-data-master-z60k9g53\" rolled out"]}

RUNNING HANDLER [openshift_logging_elasticsearch : oc_obj] *********************
FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (60 retries left).
FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (59 retries left).
----------
FAILED - RETRYING: Waiting for logging-es-data-master-z60k9g53 to finish scaling up (1 retries left).
fatal: [openshift-181.example.com]: FAILED! => {"attempts": 60, "changed": false, "results": {"cmd": "/usr/bin/oc get dc logging-es-data-master-z60k9g53 -o json -n logging", "results": [{"apiVersion": "v1", "kind": "DeploymentConfig", "metadata": {

3. Check the dc and rc. Only the first ES DC has been rolled out when the health check runs; the health check should not run until all ES DCs have been rolled out.

[root@host3-ha-master-1 ~]# oc get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   1          1         1         config
logging-es-data-master-2gg3vuno   0          1         0
logging-es-data-master-e74f36rn   0          1         0
logging-es-data-master-z60k9g53   1          1         0
logging-kibana                    1          1         1         config

[root@host3-ha-master-1 ~]# oc get rc
NAME                                DESIRED   CURRENT   READY   AGE
logging-curator-1                   1         1         0       38m
logging-es-data-master-z60k9g53-1   0         0         0       37m
logging-kibana-1                    1         1         1       38m

Actual results:
The deployment failed while waiting for ES to become ready.

Expected results:
The logging deployment succeeds.

Additional info:
Could we check the cluster health at the end of the playbooks, or just print a warning message instead of failing the playbook? I don't think the error message in the playbook is helpful for debugging.
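For illustration, a rough shell sketch of the flow suggested in Additional info: roll out every ES DC first, then check cluster health once at the end. The component=es label selector, logging namespace, and admin certificate paths are assumptions based on a typical 3.9 logging deployment, not taken from the playbooks.

# Sketch only: roll out every ES DC before checking health (label "component=es" assumed)
for dc in $(oc get dc -n logging -l component=es -o name); do
  oc rollout latest "$dc" -n logging
done

# Then poll the overall cluster health once, from inside any ES pod (cert paths assumed)
pod=$(oc get pods -n logging -l component=es -o jsonpath='{.items[0].metadata.name}')
for i in $(seq 1 60); do
  status=$(oc exec -n logging -c elasticsearch "$pod" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         https://localhost:9200/_cluster/health | grep -o '"status":"[a-z]*"')
  case "$status" in *green*|*yellow*) break ;; esac
  sleep 10
done
echo "cluster health: ${status:-unknown}"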
@scott, what do you think about the Additional info? Especially when enabling logging during a fresh installation.
Created attachment 1388245 [details] openshift-ansible logs
Sorry, I have to defer to the wisdom of the logging team here. If it's a problem that goes away over time and we don't need to block later tasks on its completion, then I think it's fine to wait a reasonable amount of time and then move on without failing.
This is being done correctly as discussed with Peter Portante. We want to check the health of the ES pod that we just rolled out and ensure it is in a good state (yellow or green) before we continue to roll out the other DCs. We check the health of the pods after the playbook tasks since this is all being done in a handler. Is the loop timing out waiting for the pod to be ready? What is the status of the pods when this fails?
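For clarity, the per-node flow described here is roughly the following. This is a minimal sketch, not the actual restart_es_node.yml handler; the component=es label and logging namespace are assumptions.

for dc in $(oc get dc -n logging -l component=es -o name); do
  oc rollout latest "$dc" -n logging
  oc rollout status "$dc" -n logging   # wait for this single DC to finish scaling up
  # ...then poll /_cluster/health and only continue to the next DC once the
  # cluster reports yellow or green
done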
@ewolinet, the loop timeout is long enough to wait for the ES pod. I guess that with es_cluster_size=3, the first ES node cannot become ready unless at least two ES nodes are running. By the way, ES deploys successfully once I remove the tasks "Waiting for {{ _es_node }} to finish scaling up" and "Waiting for ES to be ready for {{ _es_node }}".
Please retest as I realized today the latest 3.9 image of elasticsearch from the dist-git repos did not include the simplifications to the readiness probe. I wonder if that could make a difference in what you are seeing.
@jeff, what is the image version or PR? It still fails with logging-elasticsearch/images/v3.9.0-0.35.0.0:

Warning  Unhealthy  8m (x5 over 9m)   kubelet, 192.168.1.214  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
Warning  Unhealthy  4m (x51 over 8m)  kubelet, 192.168.1.214  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
With a fresh rollout, the cluster does not require all intended nodes to be running before it can go green. As we add a new cluster member, we should see the cluster eventually go green. Is the loop timing out waiting for the pod to be ready? What is the status of the pods when this fails? What do the logs of the elasticsearch container show for the pod that is starting up? What are the memory and CPU limits/requests for your ES pods while testing? It could be that the ES pod is taking a long time to start up.
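Commands along these lines should answer those questions (namespace and label selector assumed as in the earlier comments):

# Pod status and recent events for the ES pods
oc get pods -n logging -l component=es -o wide
oc describe pods -n logging -l component=es

# Logs from the elasticsearch container of the pod that is starting up
oc logs -n logging -c elasticsearch <es-pod-name>

# CPU/memory requests and limits configured on each ES DC
oc get dc -n logging -l component=es \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources}{"\n"}{end}'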
https://github.com/openshift/openshift-ansible/pull/7097
@ewolinet, is there a corresponding BZ for 3.7 for the same issue? Also, thanks for backporting it to 3.7.
@Takeshi, the same issue for 3.7 is tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1544243
Blocks HA Elasticsearch deployment.
@ewolinet, the bug is fixed. Could you move it to ON_QA?
Works well with openshift-ansible:v3.9.2
The issue persists on v3.9.27. Customer upgraded the logging stack using playbooks and this issue continues even when manually rolling out elasticsearch:

for x in $(oc get dc -l component=es -o=custom-columns=NAME:.metadata.name --no-headers); do oc rollout latest $x; done;

[2018-02-09 03:29:17,127][ERROR][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] Failure while checking .searchguard.logging-es-data-master-2ibt4vf5 index
MasterNotDiscoveredException[null]
MasterNotDiscoveredException[null]
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:234)
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:816)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
[2018-02-09 03:29:47,133][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)
[2018-02-09 03:30:20,134][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)
[2018-02-09 03:30:53,135][WARN ][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [logging-es-data-master-2ibt4vf5] index '.searchguard.logging-es-data-master-2ibt4vf5' not healthy yet, we try again ... (Reason: no response)
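The MasterNotDiscoveredException above suggests the node cannot see enough other cluster members to elect a master. One way to check which nodes have actually joined (the logging-es-cluster service name and admin certificate paths are assumptions based on a standard deployment):

# Endpoints behind the cluster-discovery service (assumed name: logging-es-cluster)
oc get endpoints logging-es-cluster -n logging

# Ask any ES pod which nodes it can see (this will fail if no master is elected yet)
pod=$(oc get pods -n logging -l component=es -o jsonpath='{.items[0].metadata.name}')
oc exec -n logging -c elasticsearch "$pod" -- \
  curl -s --cacert /etc/elasticsearch/secret/admin-ca \
       --cert /etc/elasticsearch/secret/admin-cert \
       --key /etc/elasticsearch/secret/admin-key \
       "https://localhost:9200/_cat/nodes?v"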
The issue identified in #c16 is not the same as #c30. One is caused by the ES nodes not being able to see one another and therefore being unable to elect a master. The other is caused by unhealthy SG indices, which can be resolved by (see the scripted sketch below):

* deleting each of the ".searchguard" indices
* reseeding each one on each pod: oc exec -c elasticsearch $pod -- es_seed_acl

I defer to QE findings and verification. If they believe it is resolved then we should close this issue and open a new one specific to any other issues.
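A scripted form of that remediation might look like the following. This is a sketch only: the wildcard delete and certificate paths are assumptions; es_seed_acl is taken from the comment above.

for pod in $(oc get pods -n logging -l component=es -o jsonpath='{.items[*].metadata.name}'); do
  # Delete the .searchguard indices (wildcard delete assumed to be permitted)
  oc exec -n logging -c elasticsearch "$pod" -- \
    curl -s -XDELETE --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         "https://localhost:9200/.searchguard.*"
  # Reseed the ACL documents on this pod
  oc exec -n logging -c elasticsearch "$pod" -- es_seed_acl
done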