Bug 1540147 - Elastic search pod did not get up in 10 minutes
Summary: Elastic search pod did not get up in 10 minutes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.6.z
Assignee: Rich Megginson
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1510988
 
Reported: 2018-01-30 11:16 UTC by Pavol Brilla
Modified: 2018-02-14 12:22 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-14 12:22:40 UTC
Target Upstream Version:
Embargoed:


Attachments
inventory (1.10 KB, text/plain), 2018-02-01 07:03 UTC, Lukas Svaty
vars.yaml (2.42 KB, text/plain), 2018-02-01 07:03 UTC, Lukas Svaty
Dump with 770 on directory (64.00 KB, application/x-gzip), 2018-02-07 12:44 UTC, Pavol Brilla

Description Pavol Brilla 2018-01-30 11:16:07 UTC
Description of problem:
Using the documentation for ovirt-metrics, the elasticsearch pod does not become ready within the 10 minute rollout timeout.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Follow the steps in the documentation and, at step 1.12, try to restart the elasticsearch pod (a possible form of the restart command is sketched below)
2. See error: replication controller "logging-es-data-master-sowm2972-2" has failed progressing
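
(The exact restart command from step 1.12 isn't quoted here because the document is private; on OCP 3.x a hypothetical equivalent would be along these lines:)
# oc rollout latest -n logging $(oc get -n logging dc -l component=es -o name)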



Actual results:
# oc logs  $( oc get -n logging dc -l component=es -o name )
--> Scaling logging-es-data-master-sowm2972-3 to 1
--> Waiting up to 10m0s for pods in rc logging-es-data-master-sowm2972-3 to become ready
error: update acceptor rejected logging-es-data-master-sowm2972-3: pods for rc "logging-es-data-master-sowm2972-3" took longer than 600 seconds to become ready

Expected results:
pod should be started

Additional info:
The document will be provided in a private comment as it is not yet publicly published.
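
For reference, a few standard oc commands that could help narrow down why the rollout timed out (a sketch, assuming the default "logging" namespace and the component=es label used above):
# oc get pods -n logging -o wide
# oc describe pods -n logging -l component=es
# oc get events -n logging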

Comment 2 Pavol Brilla 2018-01-30 11:20:19 UTC
Machine stats: 16G RAM, 12 cores
Disk size: 150G

Comment 3 Pavol Brilla 2018-01-30 11:21:44 UTC
20G defined for machine, 16G guaranteed by engine

Comment 4 Shirly Radco 2018-01-30 13:51:13 UTC
Please test on a clean machine. I believe this might be specific to the environment and not really 100% reproducible.

It still needs to be resolved, but I want to make sure it's not a blocker for the release.

Comment 5 Pavol Brilla 2018-01-30 14:01:08 UTC
The machine for this was cleanly installed this morning, solely for the purpose of testing the docs.

Clean RHEL 7.4 from PXE

Comment 6 Shirly Radco 2018-01-30 14:06:46 UTC
What is the resource consumption of the machine (CPU, memory)?

Comment 9 Lukas Svaty 2018-01-31 12:45:39 UTC
I was able to reproduce this in automation. If you would like to see the steps used, take a look here:
https://github.com/StLuke/ovirt-metrics-store/blob/master/playbooks/viaq-store.yml

Comment 10 Rich Megginson 2018-01-31 14:50:08 UTC
Not sure what's going on: logging-dump produced no ES log info, and the ES describe output isn't very useful.

Nathan/Noriko - can one of you try to reproduce?

Comment 11 Noriko Hosoi 2018-01-31 16:49:30 UTC
(In reply to Rich Megginson from comment #10)
> not sure what's going on - logging-dump produced no es log info - es
> describe output isn't very useful
> 
> Nathan/Noriko - can one of you try to reproduce?

I'm having difficulty setting up the 3.6 environment. :(

In the meantime, would it be possible to give us access to one of the failed systems?

I'm interested in the ansible log (/tmp/ansible.log) from the previous section, 1.11. Running Ansible, as well as the pods' status and events:
  oc get pods
  oc get events
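(If it is easier to attach files, the same data could be captured along with a describe of the ES pods; a sketch assuming the logging project:)
  oc get pods -n logging > pods.txt
  oc get events -n logging > events.txt
  oc describe pods -n logging -l component=es > es-describe.txt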

Comment 12 Noriko Hosoi 2018-02-01 00:18:17 UTC
(In reply to Lukas Svaty from comment #9)
> I was able to reproduce this in automation, If you would like to take a look
> at steps used take a look here:
> https://github.com/StLuke/ovirt-metrics-store/blob/master/playbooks/viaq-
> store.yml

The steps look good to me.  Can we also see the inventory file and vars.yml file?
https://github.com/StLuke/ovirt-metrics-store/blob/master/playbooks/viaq-store.yml#L125

Comment 13 Lukas Svaty 2018-02-01 07:03:19 UTC
Created attachment 1389363 [details]
inventory

Comment 14 Lukas Svaty 2018-02-01 07:03:47 UTC
Created attachment 1389364 [details]
vars.yaml

Comment 15 Lukas Svaty 2018-02-01 07:05:49 UTC
Also, my setup was done on a VM with lower specs (an 8GB machine), so this might have affected the deployment of the ES pod. I am able to run the playbooks successfully now on a bigger machine (however, on CentOS with Origin).

Comment 16 Pavol Brilla 2018-02-01 08:57:10 UTC
The original machine was discarded; lsvaty's test was on CentOS.

Comment 17 Rich Megginson 2018-02-01 15:50:28 UTC
(In reply to Lukas Svaty from comment #15)
> Also, my setup was done on VM with lower specs (8GB machine) so this might
> have affected the deployment of es pod. I am able to successfully run the
> playbooks now, on a bigger machine (however on CentOS vs origin).

So is this still a bug?  CLOSED NOTABUG?

Comment 18 Lukas Svaty 2018-02-02 08:04:54 UTC
I'll try to reproduce with my playbooks; if not, pbrilla can try to reproduce manually as a last resort. If we can't reproduce it, we'll close the bug as INSUFFICIENT_DATA.

Comment 19 Lukas Svaty 2018-02-05 15:38:12 UTC
I was able to reproduce this with the playbook mentioned and the vars taken from comment #1; still relevant.

Comment 20 Noriko Hosoi 2018-02-05 18:08:02 UTC
(In reply to Lukas Svaty from comment #19)
> was able to reproduce this with the playbook mentioned and vars taken from
> comment#1 still relevant

Thanks for retrying the test, Lukas.

Is the test env the same as in #c2 and #c3?
> Machine stats:  16G RAM, 12 cores, 
> Disk size: 150G

> 20G defined for machine, 16G guaranteed by engine

And there is no log from the ES pod again if you run logging-dump.sh as suggested in #c7? How about "oc get pods" and "oc get events"? Thanks.

Comment 22 Pavol Brilla 2018-02-07 12:44:10 UTC
Created attachment 1392653 [details]
Dump with 770 on directory

We changed, compared to the document:
chmod 0770 /var/lib/elasticsearch

Still the same result; attaching the dump.

Comment 23 Pavol Brilla 2018-02-08 12:25:19 UTC
ok, host reprovisioned, ES pod is up

Changes compared to the documentation:

Directory for persistent storage:
chmod -R g+wx /var/lib/elasticsearch

The user received one extra SCC:
oadm policy add-scc-to-user hostaccess system:serviceaccount:logging:aggregated-logging-elasticsearch

After those 2 changes the pod started flawlessly.
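
(For reference, a quick way to double-check both changes before retrying the rollout; a sketch assuming the path and service account names above:)
ls -ld /var/lib/elasticsearch
oc get scc hostaccess -o yaml | grep aggregated-logging-elasticsearch
oc get pods -n logging -l component=es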

Comment 24 Pavol Brilla 2018-02-14 12:22:40 UTC
OK, closing this bug after discussion.

The SCC is not needed,

and the directory privileges are correct for persistent storage.

