Description of problem:
There are several customer issues related to logging that can be mitigated by changing the default number of shards and replicas for ES. The current defaults are a primary shard count of 5 plus replicas, so every project results in 10 shards. In cases where the number of Infra nodes for ES is small (say 1 or 2), the 10 shards per project cause ES to go to yellow status. The proposal is to start with a primary shard count of 1 and 0 replicas, and have users increase those counts based on the number of ES nodes, disks, and projects. This BZ tracks that change to the defaults for OCP.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always, as long as the conditions mentioned above are met.

Steps to Reproduce:
1. Set the number of Infra nodes available for ES to 1
2. Create one or more projects
3. Deploy the logging solution. ES will be observed to be in yellow status.

Actual results:
curl -XGET elasticsearch.example.com/_cat/health?v
epoch      timestamp cluster               status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1489092539 20:48:59  elasticsearch.example yellow 1          1         499    499 0    0    0        0             -                  100.0%

Expected results:
The above status should be green.

Additional info:
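As a possible mitigation on an existing single-node deployment (a sketch only; <es-pod> is a placeholder and the cert paths assume the standard logging secret mount), dropping the replica count on all existing indices to 0 removes the unassigned replica shards that keep the cluster yellow:

# oc exec <es-pod> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
    -XPUT 'https://logging-es:9200/_all/_settings' -d '{"index":{"number_of_replicas":0}}'

New indices created afterwards still pick up the configured defaults, so the permanent fix is the change to the defaults tracked by this BZ.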
FYI, the current defaults are the Elasticsearch defaults: "number_of_shards" of 5 and "number_of_replicas" of 1 (meaning each of the 5 primary shards has one replica, for 10 shards per index).
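For reference, the per-index values actually in effect can be checked with the index settings API (a sketch; <es-pod> is a placeholder for the running ES pod):

# oc exec <es-pod> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
    -XGET 'https://logging-es:9200/_all/_settings?pretty'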
To be fair, as of the 3.4 EFK stack (1.4 in Origin) we updated the default to 1 primary shard and 0 replicas (though it uses auto_expand_replicas 0-3, which we are removing in 3.5). The old defaults therefore only affect pre-3.4 EFK installations.
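For context, auto_expand_replicas is a per-index Elasticsearch setting that grows or shrinks the replica count with the number of data nodes; in the 3.4 configuration it is applied along these lines (illustrative fragment only, not the exact shipped file):

index:
  number_of_shards: 1
  number_of_replicas: 0
  auto_expand_replicas: 0-3

With a single data node this expands to 0 replicas, which is why removing it in 3.5 also calls for sane explicit defaults.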
Fixed defaults in 1.5/3.5 here: https://github.com/openshift/openshift-ansible/pull/3754
Fixed defaults in master here: https://github.com/openshift/openshift-ansible/pull/3580
Moving to ON_QA since the fix merged on 3/23.
Tested with openshift-ansible-playbooks-3.5.60-1.git.0.b6f77a6.el7.noarch

The openshift master version is:
# openshift version
openshift v3.5.5.8
kubernetes v1.5.2+43a9be4
etcd 3.1.0

The default values work when the inventory file does not specify the parameters openshift_logging_es_number_of_shards and openshift_logging_es_number_of_replicas:

# oc get configmap logging-elasticsearch -o yaml
...
index:
  number_of_shards: 1
  number_of_replicas: 0
...

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/health?v
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1493713208 08:20:08  logging-es green  1          1         8      8   0    0    0        0             -                  100.0%

And each index has 1 primary shard and 0 replicas:

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/indices?v
health status index                                                                 pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.install-test.5d4f5926-2f01-11e7-b497-fa163e5e29ae.2017.05.02  1   0   692        0            260.9kb    260.9kb
green  open   project.logging.0f4ab4c2-2f01-11e7-b497-fa163e5e29ae.2017.05.02       1   0   402        0            302.6kb    302.6kb
green  open   .operations.2017.05.02                                                1   0   125073     0            50.6mb     50.6mb
green  open   project.test.66d3cb31-2f0f-11e7-842e-fa163e5e29ae.2017.05.02          1   0   43         0            35.1kb     35.1kb
green  open   .kibana                                                               1   0   1          0            3.1kb      3.1kb
green  open   .kibana.f7724d98466ed7391e970202dc54a6460046aadb                      1   0   8          0            25kb       25kb
green  open   .searchguard.logging-es-x34sgp9r-1-k518k                              1   0   5          0            38.3kb     38.3kb
green  open   project.java.6ff2a834-2f0f-11e7-842e-fa163e5e29ae.2017.05.02          1   0   799        0            461.9kb    461.9kb

Set to verified.
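For larger clusters, the same inventory parameters can be set explicitly instead of relying on the defaults; the values below are purely illustrative and should be sized to the number of ES nodes, disks, and projects:

[OSEv3:vars]
openshift_logging_es_number_of_shards=3
openshift_logging_es_number_of_replicas=1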
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.