Bug 1457642
Summary: SearchGuard times out seeding ES pod's .searchguard.$HOSTNAME index
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.1
Reporter: Jaspreet Kaur <jkaur>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Xia Zhao <xiazhao>
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
CC: aos-bugs, clichybi, fcami, jcantril, jlee, juzhao, misalunk, pdwyer, pportant, smunilla, stwalter
Doc Type: Bug Fix
Doc Text:
Cause: ES was configured to write its ACL information to an ES index named .searchguard.$HOSTNAME.
Consequence: When an ES pod started, it needed to recover all of its indices as well as re-seed its ACL index. This could render ES unavailable until the recovery completed, and in some cases the ACLs were never seeded because the seeding process timed out.
Fix: ES is now configured to write its ACLs to an index named .searchguard.$DC_NAME. Additionally, the ACL seeding process keeps retrying until it succeeds or times out after the arbitrary period of a week.
Result: Since the ACL index name remains the same across restarts, ES should be available and accessible once the cluster is YELLOW, assuming the seeding process has succeeded at least once in the lifetime of the DeploymentConfig. Additionally, the seeding process will keep retrying even when ES recovery is slow due to latency issues.
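The index-name change described in the Doc Text amounts to a one-line change in the Elasticsearch configuration template. A hedged sketch of what that setting looks like, based on the elasticsearch.yml templates linked later in this bug; the exact key follows the Search Guard plugin's `config_index_name` setting, and the `${DC_NAME}` variable name here is illustrative:

```yaml
# Before the fix: one ACL index per pod hostname, re-seeded on every restart.
# searchguard.config_index_name: ".searchguard.${HOSTNAME}"

# After the fix: one ACL index per DeploymentConfig, shared across pod restarts,
# so a restarted pod finds an already-seeded ACL index.
searchguard.config_index_name: ".searchguard.${DC_NAME}"
```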
Clones: 1470368 (view as bug list)
Last Closed: 2017-08-10 05:26:47 UTC
Type: Bug
Bug Blocks: 1470368
Description (Jaspreet Kaur, 2017-06-01 06:01:01 UTC)
Commit pushed to master at https://github.com/openshift/origin-aggregated-logging
https://github.com/openshift/origin-aggregated-logging/commit/e125f746c81c5aeb6425e90246c62111b626c669
bug 1457642. Fix SG timeout
We repeatedly call the sgadmin script until it successfully returns, sleeping 10 seconds between retries. Partial fix for BZ #1457642

Commit pushed to master at https://github.com/openshift/openshift-ansible
https://github.com/openshift/openshift-ansible/commit/2d9eeac5a20523e3574044bfbede1f0c0686c159
bug 1457642. Use same SG index to avoid seeding timeout

*** Bug 1464854 has been marked as a duplicate of this bug. ***

Tested with the latest v3.6 images on OCP 3.6.0. The logging system worked fine and this exception no longer appeared in the ES log. Set to VERIFIED.

Test env:
# openshift version
openshift v3.6.131
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

ansible version: openshift-ansible-playbooks-3.6.131-1.git.0.d87dfaa.el7.noarch

Images tested with:
openshift3/logging-elasticsearch c601094a6111
openshift3/logging-kibana c91b7ad68dc7
openshift3/logging-fluentd 82367a1102e0
openshift3/logging-curator b609245a72f9
openshift3/logging-auth-proxy 39164e25543c

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2017:1716

Do we have this fixed in OCP 3.5?

Looks like we did not backport this fix to 3.5 [1], though I'm not certain why, since it was fixed in 3.4 [2].
[1] https://github.com/openshift/openshift-ansible/blob/release-1.5/roles/openshift_logging/templates/elasticsearch.yml.j2#L65
[2] https://github.com/openshift/origin-aggregated-logging/blob/release-1.4/deployer/conf/elasticsearch.yml#L65
Please open an issue against 3.5 if we need to backport. I believe this would be a regression from 3.4.
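The retry behavior described in the first commit ("repeatedly call the sgadmin script until it successfully returns, sleeping 10 seconds between retries") combined with the week-long deadline from the Doc Text can be sketched roughly as follows. This is a minimal illustration, not the actual script from the image: `seed_acl` is a hypothetical stand-in for the real sgadmin.sh invocation, and the `TIMEOUT`/`RETRY_DELAY` variable names are assumptions.

```shell
#!/bin/bash
# Hypothetical stand-in for the real Search Guard admin call, e.g.:
#   sgadmin.sh -cd "$CONFIG_DIR" -i ".searchguard.${DC_NAME}" ...
seed_acl() {
  "$@"
}

# Keep attempting the ACL seeding until it succeeds, sleeping between
# attempts, and give up only after an overall deadline (the fix uses
# roughly one week: 604800 seconds).
retry_until_success() {
  local deadline=$(( $(date +%s) + ${TIMEOUT:-604800} ))
  until seed_acl "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ACL seeding timed out" >&2
      return 1
    fi
    sleep "${RETRY_DELAY:-10}"   # the commit sleeps 10 seconds between retries
  done
}
```

With this shape, a slow cluster recovery only delays seeding instead of permanently failing it, which matches the "continue to try even when the ES recovery process is slow" wording in the Doc Text.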