Moving to 4.7 as this is not a 4.6 blocker
The initial resolution to this issue will be to cap the shard count so that indices are not created with N=nodes primary shards. In the interim, I would suggest the following, with the caveat that it is untested, but we can arrange a meeting to be available:

* List indices:

  oc exec -c elasticsearch $pod -- indices

* Remove indices you no longer need. We typically recommend retaining no more than 7 days of indices, with 14 days at the most. The indices have a date suffix that indicates their age.

* To ensure the changes are honoured until a fix can be released (likely in 4.5), edit the clusterlogging instance to be "Unmanaged".

* Set replicas temporarily to 0:

  oc exec -c elasticsearch $pod -- es_util --query=_settings -XPUT -d '{"index":{ "number_of_replicas":0}}'

* Ensure new indices do not create more than 5 primaries:

  oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"index_patterns":["project.*",".operations*",".orphaned*"]}'

The previous will likely stabilize the cluster and ensure new indices do not return you to the previous state. Assuming the curation process is functioning correctly, the highly sharded indices will eventually roll off.
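To help pick which indices to remove in the second step, the date suffix can be compared against a cutoff. A minimal sketch, untested in a live cluster; `older_than` is a hypothetical helper, not part of the es_util tooling:

```shell
# List index names whose trailing .YYYY.MM.DD suffix is older than a cutoff.
older_than() {
  cutoff="$1"   # numeric YYYYMMDD, e.g. 20201001
  while read -r idx; do
    # Normalize the trailing .YYYY.MM.DD suffix to YYYYMMDD for comparison
    suffix=$(echo "$idx" | awk -F. '{printf "%s%s%s", $(NF-2), $(NF-1), $NF}')
    if [ "$suffix" -lt "$cutoff" ] 2>/dev/null; then
      echo "$idx"
    fi
  done
}

# Example with stand-in index names; in the cluster the list would come from
#   oc exec -c elasticsearch $pod -- indices
printf '%s\n' \
  'project.myapp.abc123.2020.09.20' \
  '.operations.2020.10.05' | older_than 20201001
# prints project.myapp.abc123.2020.09.20
```

Anything this prints would then be a candidate for deletion with `es_util --query=$index -XDELETE`.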
Alternatively, you might consider shrinking your indices using the shrink API [1] with something like:

* Choose one of the remaining indices from the list, which is likely in the form of 'project.<projectname>.<projectuid>.<yyyy>.<mm>.<dd>' or '.operations.<yyyy>.<mm>.<dd>'.

* Choose a target name by adding a segment between the uid and the date: 'project.<projectname>.<projectuid>.ri.<yyyy>.<mm>.<dd>' or '.operations.ri.<yyyy>.<mm>.<dd>'.

* Shrink the index:

  oc exec -c elasticsearch $pod -- es_util --query=$sourcename/_shrink/$targetname -d '{"settings": {"index.number_of_replicas": 0,"index.number_of_shards": 5},"aliases": {".all": {}}}'

* Delete the previous index once the shrink is complete, if necessary:

  oc exec -c elasticsearch $pod -- es_util --query=$sourcename -XDELETE

Finally, enable replication if needed:

  oc exec -c elasticsearch $pod -- es_util --query=_settings -XPUT -d '{"index":{ "number_of_replicas":1}}'

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-shrink-index.html
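The target-name construction in the second bullet can be scripted. A sketch (`shrink_target` is a hypothetical helper, assuming the index name always ends in a .<yyyy>.<mm>.<dd> suffix):

```shell
# Derive the shrink target by inserting an "ri" segment between the
# uid (or ".operations") and the trailing date suffix.
shrink_target() {
  src="$1"
  # Everything before the last three dot-separated fields, dots included
  prefix=$(echo "$src" | awk -F. '{for (i = 1; i <= NF-3; i++) printf "%s.", $i}')
  # The trailing yyyy.mm.dd
  suffix=$(echo "$src" | awk -F. '{printf "%s.%s.%s", $(NF-2), $(NF-1), $NF}')
  echo "${prefix}ri.${suffix}"
}

shrink_target project.myapp.abc123.2020.10.09   # prints project.myapp.abc123.ri.2020.10.09
shrink_target .operations.2020.10.09            # prints .operations.ri.2020.10.09
```

The output would be used as $targetname in the _shrink call above.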
I'm taking over the support case during NA hours; let us know if/when engineering would be available to work on this with the customer. If there would be a long delay due to unavailability, I've also offered to work with the customer directly on it. I'm comfortable with the steps listed, but we may still need to rely on engineering assistance in the event of unexpected behavior.
Any chance engineering will be available today for the session?
Marking UpcomingSprint, as this will not be merged or addressed by EOD.
(In reply to rbolling from comment #6)
> Additionally, the question was asked would this 4.7 fix be backported to
> 4.4? This would obviously be a question if we van get a viable workaround.

The workaround was listed in https://bugzilla.redhat.com/show_bug.cgi?id=1883444#c2

Regarding "backporting", the intention is to bring it back to 4.4 as long as that release is still within a maintenance cycle.
Thanks Jeff. When Ford attempted the workaround, they got an error. I believe Steven Walter explained that we were hitting something else. The customer is still requesting engineering on a call tomorrow. This was the number one topic for Ford today in our IBM/Red Hat/Ford call.

History: the customer was following https://bugzilla.redhat.com/show_bug.cgi?id=1883444 and it failed on step 5. Can you please help us with this?

--------------------------------------

This particular command fails with status 400:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"index_patterns":["project.*",".operations*",".orphaned*"]}'

{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"},"status":400}

I'm unclear what the query line should be to correct this.

Red Hat response: I think this might be caused by a discrepancy between ES 5 and ES 6. Someone in the ES community hit something similar: https://github.com/pelias/terraform-elasticsearch/issues/9

It seems ES 5 (which is running in OCP 4.3) doesn't use "index_patterns" and instead uses "template". Let's try this command instead:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*",".operations*",".orphaned*"]}'
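If I read the version difference correctly, it comes down to the key name for the match pattern: ES 6 renamed "template" to "index_patterns", and the ES 5 form takes the pattern as a plain string rather than an array. A small dry-run sketch of a version-aware payload builder (`template_payload` is a hypothetical helper, not part of es_util):

```shell
# Build the template PUT body for the given major ES version.
template_payload() {
  version="$1"; shards="$2"; pattern="$3"
  if [ "$version" -ge 6 ]; then
    # ES 6+: "index_patterns" is an array
    key="\"index_patterns\":[\"$pattern\"]"
  else
    # ES 5: "template" is a single string
    key="\"template\":\"$pattern\""
  fi
  echo "{\"order\":100,\"settings\":{\"index.number_of_shards\":$shards},$key}"
}

template_payload 5 5 'project.*'
# prints {"order":100,"settings":{"index.number_of_shards":5},"template":"project.*"}
```

The output would then be passed as the -d body of the es_util PUT shown above.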
Customer hit further issues with my modified command, but were able to get it working by running it separately for each template:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".operations*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".orphaned*"]}'

It seems to have worked. They mentioned the below after the changes:

~~~
This also appears to show up in the log on each cluster member:

Oct 06, 2020 6:51:11 PM okhttp3.internal.platform.Platform log
WARNING: A connection to https://10.0.0.1/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);

We also see fluctuating Readiness probe issues:

Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]

This will fluctuate but does persist for the life of the CI.
~~~

I've requested an updated must-gather to check the current status; for now we'll see if we can address the issue within support and update here if further assistance is needed.
(In reply to Steven Walter from comment #9)
> Customer hit further issues with my modified command, but were able to get
> it working by running it separately for each template:
>
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":["project.*"]}'
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":[".operations*"]}'
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":[".orphaned*"]}'
>
> It seems to have worked. They mentioned after the changes the below:

Worked as in you have completed all of the steps I recommended, or just the templates to override replicas?

> ~~~
> This also appears to show up in the log on each cluster member:
>
> Oct 06, 2020 6:51:11 PM okhttp3.internal.platform.Platform log
> WARNING: A connection to https://10.0.0.1/ was leaked. Did you forget
> to close a response body? To see where this was allocated, set the
> OkHttpClient logger level to FINE:
> Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);

I have a PR which could resolve this, but given it is mostly a nuisance in the logs and the plugin goes away in 4.5+, the solution has been deprioritized.

> We also see fluctuating Readiness probe issues:
>
> Readiness probe failed: Elasticsearch node is not ready to accept HTTP
> requests yet [response code: 503]
> Readiness probe failed: Elasticsearch node is not ready to accept HTTP
> requests yet [response code: 000]

I'm guessing here, but wondering if this is related to trying to coordinate "readiness" across 27 nodes. Likely needs investigation, but assuming you are otherwise functional I would not expect this to be an issue.

> This will fluctuate but does persist for the life of the CI.
Life of "Continuous Integration"? Please advise how this is relevant and how it factors into the issue.
So far they have only completed the template changes -- I had misunderstood yesterday and thought they had completed all the steps. I am working with them on shrinking the indices. Since there are a few thousand, I wrote a short script to do it more automatically:

touch indexlist
for source in $(es_util --query=_cat/indices | awk '{print $3}' | grep project)
do
  echo $source
  newindex=$(echo "project.$(echo $source | cut -d"." -f2-3).ri.$(echo $source | cut -d"." -f4-6)")
  echo $newindex
  es_util --query=$source/_shrink/$newindex -d '{"settings": {"index.number_of_replicas": 0,"index.number_of_shards": 5},"aliases": {".all": {}}}'
  echo $source >> indexlist
done

Once satisfied that the shrink worked, we can iterate over indexlist to delete the indices:

$ for source in $(cat indexlist); do es_util --query=$source -XDELETE ; done

I'll update in the bz if any help is needed.
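Before the delete loop, it may be worth verifying each shrink actually carried the data over. A cautious sketch (`safe_to_delete` is a hypothetical check; in the cluster the counts would come from es_util --query=$index/_count for the source and target):

```shell
# Only allow deletion when source and target document counts match.
safe_to_delete() {
  src_count="$1"
  target_count="$2"
  [ "$src_count" -eq "$target_count" ]
}

# Example with stand-in counts:
if safe_to_delete 157910 157910; then
  echo "counts match; safe to delete source index"
fi
```

Any index where the counts differ would be left out of indexlist for a second look.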
The template changes [1] don't seem to "stick". active_shards and active_primary_shards have a discrepancy, which grows with time.

- My hypothesis was that replica shards are being re-enabled while we're in the Managed state.
- However, even after setting to Unmanaged, the issue continues.
- We checked new indices being created and saw they still have 10 primary shards, even after overriding the shard count.
- Pending tasks keeps going up forever after we do the template change.

I'll install a lab cluster to try to figure out what's not working here.

[1]
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".operations*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".orphaned*"]}'
Reproduced; I can't seem to use the ES API to prevent sharding to node count.

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"template":["project.*"]}'
{"acknowledged":true}

$ oc new-project jenkins
$ oc new-app jenkins-ephemeral
(waited till it's up)

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_cat/shards
. . .
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 p STARTED 162 162b 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 r STARTED 162 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 r STARTED 170 152.3kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 p STARTED 170 162b 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 p STARTED 136 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 r STARTED 136 110.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"index_patterns":["project.*",".operations*",".orphaned*"]}'
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"},"status":400}

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override
{"replica_override":{"order":100,"template":"[project.*]","settings":{"index":{"number_of_shards":"1"}},"mappings":{},"aliases":{}}}
This seems to occur whether it's managed or unmanaged. Is this the wrong template? I guessed at another template:

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/.project* -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"template":["project.*"]}'

But not successfully... :)

project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 p STARTED 163 150kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 r STARTED 163 126.9kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 r STARTED 177 148.8kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 p STARTED 177 124.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 p STARTED 155 111.2kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 r STARTED 155 111.2kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
Hi, any updates on the workaround for this? Customer's still stuck as we can't get the primary shard override to work.
(In reply to Steven Walter from comment #15)
> Reproduced; I can't seem to use ES API to prevent sharding to node number.

I'm not certain we have made it clear regarding the templates being pushed in and the actual primary/replica count. Templates only affect NEW indices that are created. To modify existing indices you need to reindex or shrink them.

> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_template/replica_override -XPUT -d '{"order": 100,
> "settings": { "index.number_of_shards":1},"template":["project.*"]}'
> {"acknowledged":true}

The key here is that you are adding a template called "replica_override" which gets applied after (and takes precedence over) any other templates that match the index, based on the order of "100". I believe the default settings have an order of something like 10 or less. This template applies to any indices that match the pattern "project.*".

> $ oc new-project jenkins
> $ oc new-app jenkins-ephemeral
> (waited till it's up)
> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_cat/shards
> . . .
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 p STARTED 162 162b 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 r STARTED 162 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 r STARTED 170 152.3kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 p STARTED 170 162b 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 p STARTED 136 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 r STARTED 136 110.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
>
> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_template/replica_override -XPUT -d '{"order": 100,
> "settings": {
> "index.number_of_shards":1},"index_patterns":["project.*",".operations*",".orphaned*"]}'
> {"error":{"root_cause":[{"type":"action_request_validation_exception",
> "reason":"Validation Failed: 1: template is
> missing;"}],"type":"action_request_validation_exception","reason":
> "Validation Failed: 1: template is missing;"},"status":400}

I'm not certain what you are trying to accomplish here or why it failed, but you are overwriting the template from earlier in this comment in lieu of adding a new one. This is my fault, as I didn't make it clear that the name (i.e. replica_override) provided in #c14 should be unique.

> But not successfully...
> :)
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 p STARTED 163 150kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 r STARTED 163 126.9kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 r STARTED 177 148.8kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 p STARTED 177 124.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 p STARTED 155 111.2kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 r STARTED 155 111.2kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1

The primary shard count of these indices cannot be modified unless you re-index/delete and/or shrink them, which I attempted to address with the script in #c12. Note you could also change the replica count, which would additionally reduce shards.
(In reply to Jeff Cantrill from comment #19)
> I'm not certain we have made it clear regarding the templates being pushed
> in and the actual primary replica count. Templates only effect NEW indices
> that are created. To modify existing indices you need to reindex or shrink
> them

We're clear on that -- we just haven't been able to get any of the changes to work, even for new indices.

> The primary shard count of these indices can not be modified unless you
> re-index/delete and/or shrink them which I attempt to provide a script from
> #c12. Note you could change the replica count which would additionally
> reduce shards

Those were created *after* I had run all the other commands. Again, I'm creating a new project (and new indices) after each attempt.

> I'm not certain what you are trying to accomplish here

Grasping at straws. :) I'm still not sure what the command should look like to force new indices to have a smaller primary shard count. Would my command work if I set a lower order? What am I missing?
I experimented with different orders and unique template names. Both templates have a unique name (from what I can tell), and I chose different orders for both. You can see:

sh-4.2$ es_util --query=_template/project_override -XPUT -d '{"order": 1, "settings": { "index.number_of_shards":1},"template":["project.*"]}'
{"acknowledged":true}
sh-4.2$ es_util --query=_template/project_override_extra -XPUT -d '{"order": 1000, "settings": { "index.number_of_shards":2},"template":["project.*"]}'
{"acknowledged":true}

After running the above, I would expect a new project index to have either 1 shard or 2 shards (depending on which order the templates take effect). However, neither is the case; it still uses the default 3 primary shards:

$ oc new-project jenkins
$ oc new-app django-psql-example
$ oc exec elasticsearch-cdm-dh7awzbo-1-647c844cdc-qgsbj -- es_util --query=_cat/indices?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-dh7awzbo-1-647c844cdc-qgsbj -n openshift-logging' to see all of the containers in this pod.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .operations.2020.11.10 jSYF0UtlRcKzh9ianLao8g 3 0 148437 0 127.3mb 127.3mb
green open .searchguard kqKaXrdmQIaNaoRx7WE7-Q 1 0 5 0 80.6kb 80.6kb
green open project.jenkins.769d3c0e-c4eb-4652-a023-c9230b1cbb33.2020.11.10 6XRKBiBEQLyXu59Wfg8ZRQ 3 0 12 0 82.4kb 82.4kb
To recap our slack discussion here -- the issue with the template in comment 21 is the wrapping hard brackets. Using the following worked:

es_util --query=_template/project_override -XPUT -d '{"order": 1, "settings": { "index.number_of_shards":1},"template":"project.*"}'

The customer will likely need a different template for each of the index patterns they wish to override, with each having a different "template" value ("project.*", ".operations.*", ".orphaned*").
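Following that fix, the three per-pattern PUTs can be generated in one loop. A dry-run sketch (`put_template_cmd` and the shard_override_* names are my own invention; it only echoes the es_util calls rather than running them):

```shell
# Build one PUT per index pattern, each with a unique template name and
# the ES 5 "template" key as a plain string (no wrapping brackets).
put_template_cmd() {
  pattern="$1"
  # Derive a unique, alphabetic-only template name from the pattern
  name="shard_override_$(echo "$pattern" | tr -cd 'a-z')"
  echo "es_util --query=_template/$name -XPUT -d '{\"order\": 100, \"settings\": {\"index.number_of_shards\":5},\"template\":\"$pattern\"}'"
}

for pattern in 'project.*' '.operations.*' '.orphaned*'; do
  put_template_cmd "$pattern"
done
```

Each echoed line would then be run inside the ES pod (or prefixed with `oc exec -c elasticsearch $pod --`).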
Created a 7-node ES cluster from release-4.7. Some indices have 7 primaries and some have 5 primaries. I assume all indices should have a max of 5 primaries, true?

# oc exec -c elasticsearch $pod -- indices
Thu Nov 12 21:16:37 UTC 2020
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open infra-000001 h2K65rITQc-Asq1YdLa4kw 7 1 1029220 0 1107 553
green open app-000001 R-ipaaVJQBe8vwZmSmJ7hw 7 1 70375 0 153 76
green open .kibana_1 TCtt4PrdRQKTfwVC8hx9Vg 1 1 0 0 0 0
green open audit-000001 chIurtt0T8CrWD0D058HnQ 7 1 0 0 0 0
green open infra-000002 SNGRjvDGSO2AwX3DWrpP9A 5 1 146824 0 226 113
green open .security EC66T2YXQ-2UF2pn0JXrnA 1 1 5 0 0 0
green open infra-000003 MiqRKfUEQHCvXobFR_EzOw 5 1 23891 0 42 19

I tried deleting the app index to see if it was recreated with 5 primaries, but it was recreated with 7.
After letting this cluster run:

# oc exec -c elasticsearch $pod -- indices
Fri Nov 13 00:27:47 UTC 2020
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open app-000003 bY1Cmx_6RjyrYsKFNKsnXQ 5 1 0 0 0 0
green open .kibana_1 TCtt4PrdRQKTfwVC8hx9Vg 1 1 0 0 0 0
green open infra-000009 uM_v258LR1S20ETJ1-9DKA 5 1 157910 0 241 120
green open infra-000010 ul79E8V3RwSRndT492fDMg 5 1 160409 0 243 122
green open infra-000008 3kvV0ak4QrS9Ug_dPqY_Dg 5 1 158610 0 242 121
green open app-000002 0o9dPiUrRRSNoO_RKZZRWw 5 1 0 0 0 0
green open app-000001 i4M1xGjeTzCe4pfvV6-1ag 7 1 5902725 0 9310 4666
green open audit-000002 aCylnMKNRKSkWUcV5EM9HQ 5 1 0 0 0 0
green open infra-000014 57cRK1YmRlGZHpJRqpvX7Q 5 1 160381 0 243 121
green open infra-000015 DiKM3tMiTfqfG_pGBifOVA 5 1 136384 0 209 106
green open infra-000012 LtrnFAM4RC-KgRrff5YFqQ 5 1 159045 0 243 120
green open infra-000007 gYWp8epFT8G2JUVygUK9WA 5 1 158236 0 240 120
green open infra-000011 4dMkTwMnR_GlihDra5W8XA 5 1 158333 0 238 119
green open audit-000001 chIurtt0T8CrWD0D058HnQ 7 1 0 0 0 0
green open infra-000013 Ow6EMJgoS0ezsYMaj2u5sg 5 1 159528 0 242 121
green open .security EC66T2YXQ-2UF2pn0JXrnA 1 1 5 0 0 0

It looks like the first index is getting created with n=number of nodes and subsequent indices get n=5. Expected?
@mifiedle it looks like we missed the initial index creation... I didn't realize it was being handled in a separate place. Will have a patch open shortly.
Verified on elasticsearch-operator.4.7.0-202012161545.p0 Initial and subsequent indices are limited to 5 shards on an ES cluster with 7 members.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652