Bug 1457642
Summary: SearchGuard times out seeding ES pod's .searchguard.$HOSTNAME index
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.1
Reporter: Jaspreet Kaur <jkaur>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Xia Zhao <xiazhao>
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
CC: aos-bugs, clichybi, fcami, jcantril, jlee, juzhao, misalunk, pdwyer, pportant, smunilla, stwalter
Doc Type: Bug Fix
Doc Text:
Cause: ES was configured to write its ACL information to an ES index named .searchguard.$HOSTNAME.
Consequence: When an ES pod started, it needed to recover all of its indices as well as re-seed its ACL index. This could render ES unavailable until the recovery completed, and in some cases the ACLs were never seeded because the seeding process timed out.
Fix: ES is now configured to write its ACLs to an index named .searchguard.$DC_NAME. Additionally, the ACL seeding process keeps retrying until it succeeds or times out after the arbitrary period of a week.
Result: Since the ACL index name remains the same across restarts, ES should be available and accessible once the cluster is YELLOW, assuming the seeding process has succeeded at least once in the lifetime of the DeploymentConfig. Additionally, the seeding process will keep retrying even when ES recovery is slow due to latency issues.
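The index-name change described in the Doc Text amounts to a one-line change in the Elasticsearch configuration template. A hedged sketch of what that setting looks like, based on the elasticsearch.yml templates linked later in this bug; the exact key follows the Search Guard plugin's `config_index_name` setting, and the `${DC_NAME}` variable name here is illustrative:

```yaml
# Before the fix: one ACL index per pod hostname, re-seeded on every restart.
# searchguard.config_index_name: ".searchguard.${HOSTNAME}"

# After the fix: one ACL index per DeploymentConfig, shared across pod restarts,
# so a restarted pod finds an already-seeded ACL index.
searchguard.config_index_name: ".searchguard.${DC_NAME}"
```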
Clones: 1470368 (view as bug list)
Last Closed: 2017-08-10 05:26:47 UTC
Type: Bug
Bug Blocks: 1470368
Description (Jaspreet Kaur, 2017-06-01 06:01:01 UTC)
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/e125f746c81c5aeb6425e90246c62111b626c669
bug 1457642. Fix SG timeout
We repeatedly call the sgadmin script until it successfully returns, sleeping 10 seconds between retries. Partial fix for BZ #1457642

Commit pushed to master at https://github.com/openshift/openshift-ansible
https://github.com/openshift/openshift-ansible/commit/2d9eeac5a20523e3574044bfbede1f0c0686c159
bug 1457642. Use same SG index to avoid seeding timeout

*** Bug 1464854 has been marked as a duplicate of this bug. ***

Tested with the latest v3.6 images on OCP 3.6.0. The logging system worked fine and this exception no longer appeared in the ES log. Set to VERIFIED.

Test env:
# openshift version
openshift v3.6.131
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

ansible version: openshift-ansible-playbooks-3.6.131-1.git.0.d87dfaa.el7.noarch

Images tested with:
openshift3/logging-elasticsearch c601094a6111
openshift3/logging-kibana c91b7ad68dc7
openshift3/logging-fluentd 82367a1102e0
openshift3/logging-curator b609245a72f9
openshift3/logging-auth-proxy 39164e25543c

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2017:1716

Do we have this fixed in OCP 3.5?

Looks like we did not backport this fix to 3.5 [1], though I'm not certain why, since it was fixed in 3.4 [2].
[1] https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/templates/elasticsearch.yml.j2#L65
[2] https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/conf/elasticsearch.yml#L65
Please open an issue against 3.5 if we need to backport. I believe this would be a regression from 3.4.
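The retry behavior described in the first commit ("repeatedly call the sgadmin script until it successfully returns, sleeping 10 seconds between retries") combined with the week-long deadline from the Doc Text can be sketched roughly as follows. This is a minimal illustration, not the actual script from the image: `seed_acl` is a hypothetical stand-in for the real sgadmin.sh invocation, and the `TIMEOUT`/`RETRY_DELAY` variable names are assumptions.

```shell
#!/bin/bash
# Hypothetical stand-in for the real Search Guard admin call, e.g.:
#   sgadmin.sh -cd "$CONFIG_DIR" -i ".searchguard.${DC_NAME}" ...
seed_acl() {
  "$@"
}

# Keep attempting the ACL seeding until it succeeds, sleeping between
# attempts, and give up only after an overall deadline (the fix uses
# roughly one week: 604800 seconds).
retry_until_success() {
  local deadline=$(( $(date +%s) + ${TIMEOUT:-604800} ))
  until seed_acl "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ACL seeding timed out" >&2
      return 1
    fi
    sleep "${RETRY_DELAY:-10}"   # the commit sleeps 10 seconds between retries
  done
}
```

With this shape, a slow cluster recovery only delays seeding instead of permanently failing it, which matches the "continue to try even when the ES recovery process is slow" wording in the Doc Text.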