1660956 – Health check playbook failed at checking elasticsearch

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.11.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Jeff Cantrill
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-12-19 17:12 UTC by DzungDo
Modified:	2020-06-03 21:35 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The exec call that checks Elasticsearch health does not specify a container.
Consequence: The call fails because the output contains a warning message in addition to the JSON, while the script expects JSON only.
Fix: Include the target container in the exec command.
Result: The command succeeds.
Clone Of:
Environment:
Last Closed:	2020-06-03 21:35:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 11302	0	'None'	closed	bug 1660956. Add container name to elasticsearch exec call	2020-06-03 21:29:08 UTC

Description DzungDo 2018-12-19 17:12:28 UTC

Description of problem:
Running the playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/health.yml fails at the Elasticsearch checks.


Version-Release number of selected component (if applicable):
openshift v3.11.51

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 3.11 

2. Run "ansible-playbook -i inventory /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/health.yml"

Actual results:

               Could not retrieve cluster health status from logging ES pod "logging-es-data-master-wqmmitnd-1-nvbpt".
               Response was:
               Defaulting container name to elasticsearch.
               Use 'oc describe pod/logging-es-data-master-wqmmitnd-1-nvbpt -n openshift-logging' to see all of the containers in this pod.
               {
                 "cluster_name" : "logging-es",
                 "status" : "green",
                 "timed_out" : false,
                 "number_of_nodes" : 3,
                 "number_of_data_nodes" : 3,
                 "active_primary_shards" : 8,
                 "active_shards" : 8,
                 "relocating_shards" : 0,
                 "initializing_shards" : 0,
                 "unassigned_shards" : 0,
                 "delayed_unassigned_shards" : 0,
                 "number_of_pending_tasks" : 0,
                 "number_of_in_flight_fetch" : 0,
                 "task_max_waiting_in_queue_millis" : 0,
                 "active_shards_percent_as_number" : 100.0
               }


               check "logging_index_time":
               Invalid response from Elasticsearch query:
                 exec logging-es-data-master-0oinld9m-1-8v5vh -- curl --max-time 30 -s -f --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.openshift-logging*/_count?q=message:287a4f77-fd04-4b81-b941-0380f4ed9ca0
               Response was:
               Defaulting container name to elasticsearch.
               Use 'oc describe pod/logging-es-data-master-0oinld9m-1-8v5vh -n openshift-logging' to see all of the containers in this pod.
               {"count":0,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0}}


Expected results:
The playbook should not fail when checking the logging ES pod if all pods are running without error.

Additional info:

Comment 1 Mitchell Rollinson 2019-02-19 22:18:47 UTC

I have a customer who is experiencing this very issue.
They hit it when upgrading from OCP 3.10.35 to 3.11.51 and also when upgrading from 3.10.45 to 3.11.59.

Can we have an update on this, please?
If you require any specific information to move this forward, please shout.

Comment 3 Jeff Cantrill 2019-03-04 20:38:37 UTC

The problem is that the check needs to exec explicitly into the 'elasticsearch' container. Without the container name, the output contains a warning message ahead of the JSON:

> Defaulting container name to elasticsearch.
> Use 'oc describe pod/logging-es-data-master-0oinld9m-1-8v5vh -n openshift-logging' to see all of the containers in this pod.
> {"count":0,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0}}
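To illustrate why the extra lines break the check, here is a minimal Python sketch. It assumes the check passes the command output straight to a JSON parser, which is an assumption rather than a quote of the checker code. The sample text is the output quoted above; `parse_strict` and `parse_tolerant` are hypothetical helper names, and the tolerant variant is only a possible workaround, not the fix that was merged.

```python
import json

# Sample output copied from the report: two warning lines, then the JSON body.
RAW_OUTPUT = (
    "Defaulting container name to elasticsearch.\n"
    "Use 'oc describe pod/logging-es-data-master-0oinld9m-1-8v5vh "
    "-n openshift-logging' to see all of the containers in this pod.\n"
    '{"count":0,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0}}\n'
)


def parse_strict(output):
    """Parse the output as JSON; return None when it is not pure JSON."""
    try:
        return json.loads(output)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None


def parse_tolerant(output):
    """Hypothetical workaround: skip everything before the first '{'."""
    start = output.find("{")
    if start == -1:
        return None
    return parse_strict(output[start:])
```

With the sample above, `parse_strict(RAW_OUTPUT)` returns `None` because the text starts with the warning, while `parse_tolerant(RAW_OUTPUT)` recovers the JSON object. Naming the container in the exec call avoids the warning in the first place, so no tolerant parsing is needed.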

Comment 4 Mitchell Rollinson 2019-03-04 20:46:43 UTC

Thanks Jeff,

My customer has a pre-prod proactive case for tomorrow (Wednesday 04:00, NZ time zone). They have been having much difficulty upgrading the logging stack from 3.10.x to 3.11.x.
Currently they need to upgrade, uninstall, and reinstall in order to get a working stack. I will open another BZ for the upgrade issues; I am just adding this for context.
Do you have any recommendation on how best to verify logging stack health if the issue with the health check playbook cannot be resolved?
Please advise. It would be great to give them something definitive in time for tomorrow's pre-prod upgrade.

Comment 7 Qiaoling Tang 2019-03-22 06:56:13 UTC

Bug isn't fixed.

# rpm -qa |grep ansible
openshift-ansible-playbooks-3.11.98-1.git.0.3cfa7c3.el7.noarch
ansible-2.6.5-1.el7ae.noarch
openshift-ansible-docs-3.11.98-1.git.0.3cfa7c3.el7.noarch
openshift-ansible-roles-3.11.98-1.git.0.3cfa7c3.el7.noarch
openshift-ansible-3.11.98-1.git.0.3cfa7c3.el7.noarch
# cat /usr/share/ansible/openshift-ansible/roles/openshift_health_checker/openshift_checks/logging/elasticsearch.py |grep _build_es_curl_cmd -A 3
    def _build_es_curl_cmd(pod_name, url):
        base = "exec {name} -c elasticsearch " \
               "-- curl -s --cert {base}cert --key {base}key " \
               "--cacert {base}ca -XGET '{url}'"
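For comparison with the failing command in the error message, the following is a rough, self-contained sketch of how an exec argument string with an explicit container can be assembled. The function name `build_es_exec_cmd` and its parameters are hypothetical and do not reproduce the truncated function above. The secret path prefix is taken from the commands quoted in this report, and the `_cluster/health` URL in the usage note is only an illustrative value.

```python
def build_es_exec_cmd(pod_name, url, container="elasticsearch",
                      secret_prefix="/etc/elasticsearch/secret/admin-"):
    """Return an exec argument string that targets one named container.

    Hypothetical helper: the names and defaults are assumptions for
    illustration, not the openshift-ansible implementation.
    """
    template = (
        "exec {pod} -c {container} "
        "-- curl -s --cert {base}cert --key {base}key "
        "--cacert {base}ca -XGET '{url}'"
    )
    return template.format(
        pod=pod_name, container=container, base=secret_prefix, url=url
    )
```

Calling `build_es_exec_cmd("logging-es-data-master-3et94op7-1-hh2v2", "https://logging-es:9200/_cluster/health")` yields a string that starts with `exec logging-es-data-master-3et94op7-1-hh2v2 -c elasticsearch -- curl`. The command reported in the error message has no `-c elasticsearch` part, so it appears to come from a code path that does not include the container name.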


error message:
               check "logging_index_time":
               Invalid response from Elasticsearch query:
                 exec logging-es-data-master-3et94op7-1-hh2v2 -- curl --max-time 30 -s -f --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://logging-es:9200/project.openshift-logging*/_count?q=message:e5fc1c61-4d51-40c7-8844-c67fd78a2868
               Response was:
               Defaulting container name to elasticsearch.
               Use 'oc describe pod/logging-es-data-master-3et94op7-1-hh2v2 -n openshift-logging' to see all of the containers in this pod.
               {"count":0,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0}}

Comment 9 Jeff Cantrill 2020-06-03 21:35:51 UTC

Closing WORKSFORME since the customer cases are closed and there are no other reported incidents. Note that the command shown in the https://bugzilla.redhat.com/show_bug.cgi?id=1660956#c7 output does not exec with the container name, which makes me believe something is out of sync.
