Description of problem:
There are several customer issues related to logging that can be mitigated by changing the default number of shards and replicas for ES. The current defaults are a primary shard count of 5 plus replicas, so every project results in 10 shards. In cases where the number of Infra nodes for ES is small (say 1 or 2), the 10 shards per project cause ES to go to yellow status. The proposal is to start with a primary shard count of 1 and 0 replicas, and have users increase those counts based on the number of ES nodes, disks, and projects. This BZ tracks that change to the defaults for OCP.

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Always, as long as the conditions mentioned above are met.

Steps to Reproduce:
1. Set the number of Infra nodes available for ES to 1
2. Create one or more projects
3. Deploy the logging solution. ES will be observed to be in yellow status.

Actual results:
curl -XGET elasticsearch.example.com/_cat/health?v
epoch      timestamp cluster               status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1489092539 20:48:59  elasticsearch.example yellow 1          1         499    499 0    0    0        0             -                  100.0%

Expected results:
The above status should be green.

Additional info:
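As a possible mitigation on an existing single-node deployment (a sketch only; <es-pod> is a placeholder and the cert paths assume the standard logging secret mount), dropping the replica count on all existing indices to 0 removes the unassigned replica shards that keep the cluster yellow:

# oc exec <es-pod> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
    -XPUT 'https://logging-es:9200/_all/_settings' -d '{"index":{"number_of_replicas":0}}'

New indices created afterwards still pick up the configured defaults, so the permanent fix is the change to the defaults tracked by this BZ.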
FYI, the current defaults are the Elasticsearch defaults: "number_of_shards" of 5 and "number_of_replicas" of 1 (meaning each of the 5 primary shards has one replica, for 10 shards per index).
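For reference, the per-index values actually in effect can be checked with the index settings API (a sketch; <es-pod> is a placeholder for the running ES pod):

# oc exec <es-pod> -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key \
    -XGET 'https://logging-es:9200/_all/_settings?pretty'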
To be fair, as of the 3.4 EFK stack (1.4 in Origin) we updated the default to 1 primary shard and 0 replicas (though it uses auto_expand_replicas 0-3, which we are removing in 3.5). The old defaults therefore only affect pre-3.4 EFK installations.
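For context, auto_expand_replicas is a per-index Elasticsearch setting that grows or shrinks the replica count with the number of data nodes; in the 3.4 configuration it is applied along these lines (illustrative fragment only, not the exact shipped file):

index:
  number_of_shards: 1
  number_of_replicas: 0
  auto_expand_replicas: 0-3

With a single data node this expands to 0 replicas, which is why removing it in 3.5 also calls for sane explicit defaults.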
Fixed defaults in 1.5/3.5 here: https://github.com/openshift/openshift-ansible/pull/3754
Fixed defaults in master here: https://github.com/openshift/openshift-ansible/pull/3580
Moving to ON_QA since the fix merged on 3/23.
Tested with openshift-ansible-playbooks-3.5.60-1.git.0.b6f77a6.el7.noarch

The openshift master version is:
# openshift version
openshift v3.5.5.8
kubernetes v1.5.2+43a9be4
etcd 3.1.0

The default values work when the inventory file does not specify the parameters openshift_logging_es_number_of_shards and openshift_logging_es_number_of_replicas:

# oc get configmap logging-elasticsearch -o yaml
...
index:
  number_of_shards: 1
  number_of_replicas: 0
...

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/health?v
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1493713208 08:20:08  logging-es green  1          1         8      8   0    0    0        0             -                  100.0%

And each index has 1 primary shard and 0 replicas:

# oc exec logging-es-x34sgp9r-1-k518k -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key -XGET https://logging-es:9200/_cat/indices?v
health status index                                                                 pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.install-test.5d4f5926-2f01-11e7-b497-fa163e5e29ae.2017.05.02  1   0   692        0            260.9kb    260.9kb
green  open   project.logging.0f4ab4c2-2f01-11e7-b497-fa163e5e29ae.2017.05.02       1   0   402        0            302.6kb    302.6kb
green  open   .operations.2017.05.02                                                1   0   125073     0            50.6mb     50.6mb
green  open   project.test.66d3cb31-2f0f-11e7-842e-fa163e5e29ae.2017.05.02          1   0   43         0            35.1kb     35.1kb
green  open   .kibana                                                               1   0   1          0            3.1kb      3.1kb
green  open   .kibana.f7724d98466ed7391e970202dc54a6460046aadb                      1   0   8          0            25kb       25kb
green  open   .searchguard.logging-es-x34sgp9r-1-k518k                              1   0   5          0            38.3kb     38.3kb
green  open   project.java.6ff2a834-2f0f-11e7-842e-fa163e5e29ae.2017.05.02          1   0   799        0            461.9kb    461.9kb

Set to verified.
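For larger clusters, the same inventory parameters can be set explicitly instead of relying on the defaults; the values below are purely illustrative and should be sized to the number of ES nodes, disks, and projects:

[OSEv3:vars]
openshift_logging_es_number_of_shards=3
openshift_logging_es_number_of_replicas=1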
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.