Bug 1459054

Summary: Timeout creating SearchGuard index
Product: OpenShift Container Platform
Component: Logging
Version: 3.4.1
Status: CLOSED DUPLICATE
Severity: urgent
Priority: high
Reporter: Ruben Romero Montes <rromerom>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Xia Zhao <xiazhao>
CC: aivaras.laimikis, aos-bugs, erich, jcantril, nnosenzo, pdwyer, pportant, tlarsson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-06-27 16:34:11 UTC
Type: Bug
Attachments:
  docker logs
  docker inspect
  all_logging
  nodes description

Description Ruben Romero Montes 2017-06-06 08:02:28 UTC
Created attachment 1285280 [details]
docker logs

Description of problem:
SearchGuard fails to initialize: creating its index times out.

[2017-06-05 08:15:36,606][INFO ][cluster.routing.allocation] [Crimson] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[project.aes-mbaas-infra.26cf63cd-2b47-11e7-a35d-0acaab79e3f7.2017.05.22][0], [.searchguard.logging-es-xgrcmvev-3-5nb30][0], [project.nagp-il-core-int-01.7c640146-08a0-11e7-8d5d-0610033e8e3f.2017.05.22][0], [.searchguard.logging-es-sm5vnjla-3-55uhm][0]] ...]).
Clustername: logging-es
Clusterstate: YELLOW
Number of nodes: 3
Number of data nodes: 3
.searchguard.logging-es-dyppkops-2-qapt3 index does not exists, attempt to create it ...
[2017-06-05 08:15:46,856][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2017-06-05 08:15:54,250][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
...
ERR: An unexpected ProcessClusterEventTimeoutException occured: failed to process cluster event (create-index [.searchguard.logging-es-dyppkops-2-qapt3], cause [api]) within 30s
Trace:
ProcessClusterEventTimeoutException[failed to process cluster event (create-index [.searchguard.logging-es-dyppkops-2-qapt3], cause [api]) within 30s]
	at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

...
[2017-06-05 08:25:06,818][INFO ][cluster.routing.allocation] [Crimson] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.searchguard.logging-es-xgrcmvev-3-5nb30][0], [.searchguard.logging-es-xgrcmvev-3-5nb30][0]] ...]).
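The 30s failure above is a create-index task expiring in the master's cluster-state queue. One way to check whether that queue is backed up is the `_cat/pending_tasks` API. A minimal sketch, run from inside an es pod; the admin cert paths under /etc/elasticsearch/secret/ are assumptions based on this deployment and may differ:

```shell
# Sketch: list queued cluster-state tasks to see whether the master is the
# bottleneck behind the 30s create-index timeout. The function only prints
# the curl command so it can be reviewed before running; cert paths are
# assumed from the mounted logging-elasticsearch secret.
pending_tasks_cmd() {
  printf 'curl -s --cacert %s --cert %s --key %s "https://localhost:9200/_cat/pending_tasks?v"\n' \
    /etc/elasticsearch/secret/admin-ca \
    /etc/elasticsearch/secret/admin-cert \
    /etc/elasticsearch/secret/admin-key
}

pending_tasks_cmd
```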

Version-Release number of selected component (if applicable):
openshift3-logging-elasticsearch-3.4.1-26

How reproducible:
Only reproducible in the reporter's environment.

Steps to Reproduce:
1. Scale all three Elasticsearch DeploymentConfigs down to 0
2. Scale all three DeploymentConfigs back up to 1
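The steps above can be sketched with `oc scale`. The dc names below are inferred from the logs in this report and are illustrative; substitute the output of `oc get dc -n logging`:

```shell
# Sketch of the reproduction steps. scale_cmd only prints the oc command
# so it can be reviewed before running against a real cluster; the dc
# names are placeholders inferred from the log excerpt above.
scale_cmd() {
  printf 'oc scale dc/%s --replicas=%s -n logging\n' "$1" "$2"
}

for dc in logging-es-xgrcmvev logging-es-sm5vnjla logging-es-dyppkops; do
  scale_cmd "$dc" 0   # step 1: scale all three dcs down to 0
done
# wait until all es pods have terminated, then:
for dc in logging-es-xgrcmvev logging-es-sm5vnjla logging-es-dyppkops; do
  scale_cmd "$dc" 1   # step 2: scale all three dcs back up to 1
done
```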

Actual results:
failed to process cluster event (create-index [.searchguard.logging-es-dyppkops-2-qapt3], cause [api]) within 30s

Expected results:
The SearchGuard index to be initialized

Additional info:
  Volume type: gp2 (1500/3000 IOPS), 500 GiB of storage
    148G of data free:
    /dev/mapper/vg01-data           500G  353G  148G  71% /data
  Deployment: AWS
  Memory: 16 GB
  EC2 instances: m4.xlarge for the masters, r4.xlarge for the nodes

  Confirmed they have auto_expand_replicas: 2 in the ConfigMap.

Comment 1 Ruben Romero Montes 2017-06-06 08:05:31 UTC
Created attachment 1285281 [details]
docker inspect

Comment 2 Ruben Romero Montes 2017-06-06 08:05:52 UTC
Created attachment 1285282 [details]
all_logging

Comment 3 Ruben Romero Montes 2017-06-06 08:06:24 UTC
Created attachment 1285283 [details]
nodes description

Comment 7 Ruben Romero Montes 2017-06-07 14:08:25 UTC
A manual workaround is to initialize SearchGuard from inside each of the three pods:

 $ oc rsh <logging-es-pod>
 # /usr/share/java/elasticsearch/plugins/search-guard-2/tools/sgadmin.sh \
        -cd ${HOME}/sgconfig \
        -i .searchguard.${HOSTNAME} \
        -ks /etc/elasticsearch/secret/searchguard.key \
        -kst JKS \
        -kspass kspass \
        -ts /etc/elasticsearch/secret/searchguard.truststore \
        -tst JKS \
        -tspass tspass \
        -nhnv \
        -icl

Alternatively, try closing some old indices manually to speed up the initialization.
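Closing an index uses the Elasticsearch `_close` API. A minimal sketch, run from inside an es pod; the index pattern and the admin cert paths under /etc/elasticsearch/secret/ are assumptions based on this deployment:

```shell
# Sketch: close old indices to shrink the cluster state the master has to
# process. close_index_cmd only prints the curl invocation so it can be
# reviewed before running; cert paths are assumed from the mounted
# logging-elasticsearch secret and may differ.
close_index_cmd() {
  printf 'curl -s -XPOST --cacert %s --cert %s --key %s "https://localhost:9200/%s/_close"\n' \
    /etc/elasticsearch/secret/admin-ca \
    /etc/elasticsearch/secret/admin-cert \
    /etc/elasticsearch/secret/admin-key \
    "$1"
}

# e.g. close all project indices from one old day (pattern is illustrative):
close_index_cmd 'project.*.2017.05.22'
```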

Comment 8 Jeff Cantrill 2017-06-27 16:34:11 UTC
Closing this as a duplicate since it is all related to initialization of the SearchGuard index, for which we have a fix that needs to be ported to 3.4.1. We will resolve it against bug #1449378. Ref upstream PR to be backported: https://github.com/openshift/origin-aggregated-logging/pull/469

*** This bug has been marked as a duplicate of bug 1449378 ***