Bug 1679931 - Elasticsearch pod is marked as Unhealthy by readiness probe
Summary: Elasticsearch pod is marked as Unhealthy by readiness probe
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine-metrics
Classification: oVirt
Component: Generic
Version: 1.2.0.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ovirt-4.3.3
Target Release: ---
Assignee: Shirly Radco
QA Contact: Ivana Saranova
URL:
Whiteboard:
Depends On:
Blocks: 1631193
 
Reported: 2019-02-22 09:22 UTC by Jan Zmeskal
Modified: 2020-02-25 09:19 UTC
CC: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-04-16 13:58:31 UTC
oVirt Team: Metrics
Embargoed:
sradco: ovirt-4.3?
lleistne: testing_ack+




Links
oVirt gerrit 97643 (MERGED): Update metrics store installation role (last updated 2020-09-15 11:44:02 UTC)

Description Jan Zmeskal 2019-02-22 09:22:37 UTC
Description of problem:
After a successful rollout of deploymentconfig.apps.openshift.io/logging-es-data-master-*, the pod is marked as Unhealthy when inspected with `oc describe`.

Version-Release number of selected component (if applicable):
ovirt-engine-metrics-1.2.1-0.0.master.20190220121053.el7.noarch (patchset 48)

How reproducible:
100 %

Steps to Reproduce:
1. Run install_okd.yml playbook
2. It fails on the task "Rolling out new pod(s) for {{ _es_node }}"
3. SSH to OpenShift master node and cancel the current rollout
4. Initiate a new rollout. If you removed all causes of the original failure (e.g. insufficient memory), this rollout should succeed.
5. Run `oc get pods -n openshift-logging -l component=es` and make sure that both containers of that pod are running
6. Make sure there's no problem in elasticsearch's container log: `oc logs $(oc get pods -n openshift-logging -l component=es -o name) -c elasticsearch`
7. Now describe the ES pod: `oc describe $(oc get pods -n openshift-logging -l component=es -o name) | tail -n 20`
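
The checks in steps 5-7 can also be run together as a short snippet (a sketch assuming a single ES pod matches the component=es label, as in the commands above):

   ES_POD=$(oc get pods -n openshift-logging -l component=es -o name)
   oc get pods -n openshift-logging -l component=es   # both containers should be Running/Ready
   oc logs $ES_POD -c elasticsearch | tail -n 50      # no errors expected in the ES log
   oc describe $ES_POD | tail -n 20                   # readiness probe events appear here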

Actual results:
http://pastebin.test.redhat.com/720256

Expected results:
Pod should pass readiness probe
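
One way to confirm the expected state from the command line (a sketch, not taken from the report; the jsonpath expression reads the standard pod containerStatuses field):

   oc get pods -n openshift-logging -l component=es \
      -o jsonpath='{.items[*].status.containerStatuses[*].ready}'   # expect "true true" for a healthy pod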

Comment 1 Jan Zmeskal 2019-02-25 13:30:50 UTC
Moving to ASSIGNED. There is no link to a patch that fixes it, no manual steps, and no workaround. Basically nothing changed between the NEW and ON_QA statuses.

Comment 2 Ivana Saranova 2019-02-27 10:00:04 UTC
Current workaround that worked for me:

1) On the successfully deployed bastion VM, get this patch: https://github.com/openshift/openshift-ansible/pull/11220/files

Note: Since I couldn't apply the patch with yum, I simply got the files and overwrote the existing ones.
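
A possible way to fetch and apply that pull request (a sketch; the /usr/share/ansible/openshift-ansible path is an assumption based on the usual RPM install location, adjust it if your openshift-ansible checkout lives elsewhere):

   curl -L https://github.com/openshift/openshift-ansible/pull/11220.diff -o /tmp/11220.diff
   patch -p1 -d /usr/share/ansible/openshift-ansible < /tmp/11220.diff   # target path is an assumption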

2) Run the metrics-store VM deployment Ansible playbook install_okd (it should finish with no errors)

Note: The metrics-store VM must not already exist, or the playbook will fail with an error.

3) Once the metrics-store VM is successfully created and OpenShift is installed, check the pods, svc and routes for errors. The ES pod (the one without the deploy tag) should be running but marked unhealthy when you describe it. If the problem is that '/opt/app-root/src/init_failures' does not exist, open a shell in the pod:

   oc rsh ES_POD_NAME

(once inside the pod shell, run this)
   touch /opt/app-root/src/init_failures
   chmod 777 /opt/app-root/src/init_failures 

4) Exit the pod shell with the exit command and check whether this made the pod healthy (the events disappear and Events: should show <none>).
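
To confirm, the describe command from the reproduction steps can be reused:

   oc describe $(oc get pods -n openshift-logging -l component=es -o name) | tail -n 20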

Comment 3 Ivana Saranova 2019-02-27 10:17:09 UTC
Update on the workaround:

Do step 3 without the pod shell commands (and without connecting to the pod at all); instead, run the following command, as described in https://docs.openshift.com/container-platform/3.11/install_config/aggregate_logging.html#troubleshooting-related-to-elasticsearch:

   for p in $(oc get pods -l component=es -o jsonpath={.items[*].metadata.name}); do \
     oc exec -c elasticsearch $p -- touch /opt/app-root/src/init_failures; \
   done

It looks like it takes a while for the change to take effect.

Do step 4 without the pod shell commands.
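
If the file permissions also need adjusting (as in the workaround from comment 2), the same loop can run both commands; this is a sketch combining the two steps, not a command from the linked documentation:

   for p in $(oc get pods -l component=es -o jsonpath={.items[*].metadata.name}); do \
     oc exec -c elasticsearch $p -- touch /opt/app-root/src/init_failures; \
     oc exec -c elasticsearch $p -- chmod 777 /opt/app-root/src/init_failures; \
   done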

Comment 4 Shirly Radco 2019-03-04 20:56:05 UTC
Fixed in the current patch.

Comment 5 Sandro Bonazzola 2019-03-12 12:54:22 UTC
4.3.1 has been released, please re-target this bug as soon as possible.

Comment 6 Ivana Saranova 2019-04-04 09:09:43 UTC
Steps to Reproduce:
1. Run install_okd.yml playbook
2. It fails on the task "Rolling out new pod(s) for {{ _es_node }}"
3. SSH to OpenShift master node and cancel the current rollout
4. Initiate a new rollout. If you removed all causes of the original failure (e.g. insufficient memory), this rollout should succeed.
5. Run `oc get pods -n openshift-logging -l component=es` and make sure that both containers of that pod are running
6. Make sure there's no problem in elasticsearch's container log: `oc logs $(oc get pods -n openshift-logging -l component=es -o name) -c elasticsearch`
7. Now describe the ES pod
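
As an extra check (a sketch, not part of the original verification), the pod's events can also be listed directly; no Unhealthy events should appear:

   oc get events -n openshift-logging --field-selector involvedObject.name=$(oc get pods -n openshift-logging -l component=es -o jsonpath={.items[0].metadata.name})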

Result:
ES pod is ready and healthy, no error in the events of the pod.

Verified in: 
ovirt-engine-4.2.8.5-0.1.el7ev.noarch
ovirt-engine-metrics-1.2.1.3-1.el7ev.noarch

Also verified in:
ovirt-engine-4.3.3.1-0.1.el7.noarch
ovirt-engine-metrics-1.2.1.3-1.el7ev.noarch

Comment 7 Sandro Bonazzola 2019-04-16 13:58:31 UTC
This bug is included in the oVirt 4.3.3 release, published on April 16th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

