Bug 1430910 - Change shard and replica defaults for ES/Logging
Summary: Change shard and replica defaults for ES/Logging
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.5.0
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.z
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-09 20:51 UTC by Tushar Katarki
Modified: 2023-09-15 00:01 UTC
CC List: 12 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-25 13:00:48 UTC
Target Upstream Version:
Embargoed:




Links:
  System ID: Red Hat Product Errata RHBA-2017:3049
  Status: SHIPPED_LIVE
  Summary: OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update
  Last Updated: 2017-10-25 15:57:15 UTC

Description Tushar Katarki 2017-03-09 20:51:06 UTC
Description of problem:

There are several customer issues related to logging that can be mitigated by changing the default number of shards and replicas for ES.

The current defaults are a primary shard count of 5 and 5 replica shards (one replica per primary), so every project results in 10 shards. In cases where the number of infra nodes for ES is small (say 1 or 2), the 10 shards per project cause ES to go to yellow status.

The proposal is to start with a primary shard count of 1 and zero replicas, and have users increase those counts based on the number of ES nodes, disks, and projects.
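
As a minimal sketch only (not part of this BZ), a deployer could then size the counts to the cluster through the openshift-ansible inventory variables openshift_logging_es_number_of_shards and openshift_logging_es_number_of_replicas (the names used in the verification later in this BZ); the values below are hypothetical and would be chosen based on the number of ES data nodes:

    [OSEv3:vars]
    # values for illustration only; size to the number of ES data nodes
    openshift_logging_es_number_of_shards=3
    openshift_logging_es_number_of_replicas=1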

This BZ is to track that change to the defaults for OCP.


Version-Release number of selected component (if applicable):

OCP 3.5


How reproducible:

Always, as long as the conditions mentioned above are met.

Steps to Reproduce:
1. Set number of Infra nodes available for ES to 1 
2. Create one or more projects 
3. Deploy the logging solution. ES will be observed to be in yellow status

Actual results:

curl -XGET elasticsearch.example.com/_cat/health?v
epoch      timestamp cluster                 status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 
1489092539 20:48:59  elasticsearch.example yellow           1         1    499 499    0    0        0             0                  -                100.0%
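
(Illustrative only: the shards keeping the cluster yellow can be listed with the standard _cat/shards API, assuming an ES version that exposes the unassigned.reason column:

    curl -XGET 'elasticsearch.example.com/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

On a single data node every replica shard shows up here as UNASSIGNED, since ES will not co-locate a replica with its primary.)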


Expected results:

The above status should be green.

Additional info:

Comment 1 Peter Portante 2017-03-09 21:29:42 UTC
FYI, the current defaults are the Elasticsearch defaults, which are "number_of_shards" of 5, "number_of_replicas" of 1 (which means for each of the 5 primary shards, there is one replica).
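
(To spell out the arithmetic: with those defaults each index needs 5 primary + 5 replica = 10 shards, and ES will not allocate a replica on the same node as its primary, so on a single data node every replica shard stays unassigned and the cluster can never reach green, which matches the yellow status in the report above.)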

Comment 2 ewolinet 2017-03-09 21:45:17 UTC
To be fair, as of the 3.4 EFK stack (1.4 in Origin) we have updated the default to be 1 primary shard and 0 replicas (but it uses auto_expand_replicas 0-3, which we are removing in 3.5).

This applies to pre-3.4 EFK installations.
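
(For reference, a sketch of what those index-level settings look like, illustrative rather than the exact shipped config:

    index:
      number_of_shards: 1
      number_of_replicas: 0
      auto_expand_replicas: 0-3

With auto_expand_replicas set, ES raises the replica count on its own as data nodes join, which is the behavior being removed in 3.5.)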

Comment 5 Jeff Cantrill 2017-04-04 13:05:32 UTC
Fixed defaults in 1.5/3.5 here: https://github.com/openshift/openshift-ansible/pull/3754
Fixed defaults in master here: https://github.com/openshift/openshift-ansible/pull/3580
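
(Roughly, and not the literal diff, the role defaults those PRs land are equivalent to:

    openshift_logging_es_number_of_shards: 1
    openshift_logging_es_number_of_replicas: 0

which deployers can still override per cluster through the inventory, as sketched in the description.)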

Comment 6 Jeff Cantrill 2017-05-01 18:33:24 UTC
Moving to ON_QA since the fix merged on 3/23.

Comment 7 Xia Zhao 2017-05-02 08:28:27 UTC
Tested with openshift-ansible-playbooks-3.5.60-1.git.0.b6f77a6.el7.noarch

The openshift master version is:
# openshift version
openshift v3.5.5.8
kubernetes v1.5.2+43a9be4
etcd 3.1.0

The default values work when the inventory file does not specify the openshift_logging_es_number_of_shards and openshift_logging_es_number_of_replicas parameters:

# oc get configmap logging-elasticsearch -o yaml
...
    index:
      number_of_shards: 1
      number_of_replicas: 0
...

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/health?v
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent 
1493713208 08:20:08  logging-es green           1         1      8   8    0    0        0             0                  -                100.0% 

And each index has 1 primary shard and 0 replicas:

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert  /etc/elasticsearch/secret/admin-cert --key  /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/indices?v
health status index                                                                pri rep docs.count docs.deleted store.size pri.store.size 
green  open   project.install-test.5d4f5926-2f01-11e7-b497-fa163e5e29ae.2017.05.02   1   0        692            0    260.9kb        260.9kb 
green  open   project.logging.0f4ab4c2-2f01-11e7-b497-fa163e5e29ae.2017.05.02        1   0        402            0    302.6kb        302.6kb 
green  open   .operations.2017.05.02                                                 1   0     125073            0     50.6mb         50.6mb 
green  open   project.test.66d3cb31-2f0f-11e7-842e-fa163e5e29ae.2017.05.02           1   0         43            0     35.1kb         35.1kb 
green  open   .kibana                                                                1   0          1            0      3.1kb          3.1kb 
green  open   .kibana.f7724d98466ed7391e970202dc54a6460046aadb                       1   0          8            0       25kb           25kb 
green  open   .searchguard.logging-es-x34sgp9r-1-k518k                               1   0          5            0     38.3kb         38.3kb 
green  open   project.java.6ff2a834-2f0f-11e7-842e-fa163e5e29ae.2017.05.02           1   0        799            0    461.9kb        461.9kb 

Set to verified.

Comment 10 errata-xmlrpc 2017-10-25 13:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

Comment 11 Red Hat Bugzilla 2023-09-15 00:01:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

