1457642 – SearchGuard times out seeding ES pod's .searchguard.$HOSTNAME index

Summary: SearchGuard times out seeding ES pod's .searchguard.$HOSTNAME index

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.6.z
Assignee:	Jeff Cantrill
QA Contact:	Xia Zhao
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1464854
Depends On:
Blocks:	1470368

Reported:	2017-06-01 06:01 UTC by Jaspreet Kaur
Modified:	2021-09-09 12:20 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: ES was configured to write its ACL information to an ES index named .searchguard.$HOSTNAME.
Consequence: When an ES pod started, it had to recover all of its indices as well as re-seed its ACL index. This could render ES unavailable until the recovery completed, and in some cases the ACLs were never seeded because the seeding process timed out.
Fix: ES is now configured to write its ACLs to an index named .searchguard.$DC_NAME. Additionally, the ACL seeding process keeps retrying until it succeeds or times out after an arbitrary limit of one week.
Result: Because the ACL index remains the same, ES should be available and accessible once the cluster is YELLOW, assuming the seeding process has succeeded at least once in the lifetime of the deploymentconfig. Additionally, the seeding process keeps retrying even when the ES recovery process is slow due to latency issues.
Clone Of:
Clones:	1470368
Environment:
Last Closed:	2017-08-10 05:26:47 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3093741	0	None	None	None	2017-06-29 15:22:40 UTC
Red Hat Product Errata	RHEA-2017:1716	0	normal	SHIPPED_LIVE	Red Hat OpenShift Container Platform 3.6 RPM Release Advisory	2017-08-10 09:02:50 UTC

Description Jaspreet Kaur 2017-06-01 06:01:01 UTC

Description of problem: After upgrading from 3.3.1.20 to 3.4.1.18, Elasticsearch fails to initialize. Investigation showed that there were nearly 5000 indices, which did not recover quickly enough for SearchGuard to initialize. Slow storage (storage latency) may also have contributed.

The following messages appear very often:

[2017-05-09 13:45:53,562][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2017-05-09 13:45:55,111][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2017-05-09 13:45:56,127][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[....]


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The upgrade went fine; however, the EFK stack fails to initialize.


Expected results: The upgrade should succeed and the EFK stack should start successfully.


Additional info:

Comment 2 openshift-github-bot 2017-06-14 21:27:07 UTC

Commit pushed to master at https://github.com/openshift/origin-aggregated-logging

https://github.com/openshift/origin-aggregated-logging/commit/e125f746c81c5aeb6425e90246c62111b626c669
bug 1457642. Fix SG timeout

We repeatedly call the sgadmin script until it successfully returns,
sleeping 10 seconds between retries.

Partial fix for BZ #1457642
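
The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual script from the commit: the function name `seed_until_success`, its parameters, and the injected `sleep`/`clock` hooks are assumptions made for the example. The 10-second delay comes from the commit message and the one-week limit from the Doc Text above.

```python
import time
from typing import Callable

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # "arbitrary time of a week" per the Doc Text


def seed_until_success(
    seed: Callable[[], bool],
    retry_delay: float = 10.0,          # 10 seconds between retries, per the commit message
    timeout: float = ONE_WEEK_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call seed() until it returns True and return the number of attempts.

    Raises TimeoutError if the next retry would start after the deadline.
    `seed` is a hypothetical stand-in for one run of the sgadmin script.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if seed():
            return attempts
        if clock() + retry_delay > deadline:
            raise TimeoutError(f"seeding did not succeed after {attempts} attempts")
        sleep(retry_delay)
```

In a real entrypoint, `seed` might wrap a subprocess call that runs the seeding script and returns True on exit code 0; the exact command and its arguments are not given in this report.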

Comment 3 openshift-github-bot 2017-06-22 02:24:39 UTC

Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/2d9eeac5a20523e3574044bfbede1f0c0686c159
bug 1457642. Use same SG index to avoid seeding timeout
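
To illustrate why the index name matters, here is a small sketch assuming (as the Doc Text implies) that $HOSTNAME differs for each new pod while $DC_NAME stays constant for the deploymentconfig. The helper `searchguard_index_name` and the example pod and DC names are hypothetical and do not come from the commit.

```python
import os
from typing import Mapping


def searchguard_index_name(
    env: Mapping[str, str] = os.environ,
    use_dc_name: bool = True,
) -> str:
    """Build the ACL index name from the environment.

    use_dc_name=False mimics the old behaviour (.searchguard.$HOSTNAME);
    use_dc_name=True mimics the fixed behaviour (.searchguard.$DC_NAME).
    """
    key = "DC_NAME" if use_dc_name else "HOSTNAME"
    return f".searchguard.{env[key]}"


# Hypothetical environments of two successive pods from the same deploymentconfig.
first_pod = {"HOSTNAME": "logging-es-example-1-aaaaa", "DC_NAME": "logging-es-example"}
second_pod = {"HOSTNAME": "logging-es-example-1-bbbbb", "DC_NAME": "logging-es-example"}

# Old scheme: each pod gets a new, empty ACL index that must be seeded from scratch.
print(searchguard_index_name(first_pod, use_dc_name=False))
print(searchguard_index_name(second_pod, use_dc_name=False))

# New scheme: both pods resolve to the same, already seeded ACL index.
print(searchguard_index_name(first_pod))
print(searchguard_index_name(second_pod))
```

With the DC-based name, a restarted pod can reuse the ACL index seeded earlier, which is why the Doc Text expects ES to be accessible once the cluster reaches YELLOW.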

Comment 5 Jeff Cantrill 2017-06-27 14:00:12 UTC

*** Bug 1464854 has been marked as a duplicate of this bug. ***

Comment 6 Xia Zhao 2017-07-03 10:01:52 UTC

Tested with the latest v3.6 images on OCP 3.6.0. The logging system worked fine and this exception did not appear in the ES log. Set to verified.

Test env:
# openshift version
openshift v3.6.131
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


ansible version:
openshift-ansible-playbooks-3.6.131-1.git.0.d87dfaa.el7.noarch

Images tested with:
openshift3/logging-elasticsearch    c601094a6111
openshift3/logging-kibana    c91b7ad68dc7
openshift3/logging-fluentd    82367a1102e0
openshift3/logging-curator    b609245a72f9
openshift3/logging-auth-proxy    39164e25543c

Comment 8 errata-xmlrpc 2017-08-10 05:26:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Comment 9 Miheer Salunke 2017-09-18 09:26:22 UTC

Is this fixed in OCP 3.5?

Comment 10 Jeff Cantrill 2017-09-19 18:20:02 UTC

Looks like we did not backport this fix to 3.5 [1], though I'm not certain why, since it was fixed in 3.4 [2].

[1] https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/templates/elasticsearch.yml.j2#L65
[2] https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/conf/elasticsearch.yml#L65

Please open an issue against 3.5 if we need to backport. I believe this would be a regression from 3.4.