Moving to 4.7 as this is not a 4.6 blocker
The initial resolution to this issue will be to cap the shard count so that indices are not created with N=nodes primary shards. In the interim, I would suggest the following, with the caveat that it is untested, but we can arrange a meeting to be available:

* List indices:

  oc exec -c elasticsearch $pod -- indices

* Remove indices you no longer need. We typically recommend retaining no more than 7 days of indices, with 14 days at the most. The indices have a date suffix that indicates their age.

* To ensure the changes are honoured until a fix can be released (likely in 4.5), edit the clusterlogging instance to be "Unmanaged".

* Set replicas temporarily to 0:

  oc exec -c elasticsearch $pod -- es_util --query=_settings -XPUT -d '{"index":{ "number_of_replicas":0}}'

* Ensure new indices do not create more than 5 primaries:

  oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"index_patterns":["project.*",".operations*",".orphaned*"]}'

The previous will likely stabilize the cluster and ensure new indices do not return you to the previous state. Assuming the curation process is functioning correctly, the highly sharded indices will eventually roll off.
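To help pick which indices to remove in the second step, the date suffix can be compared against a cutoff. A minimal sketch, untested in a live cluster; `older_than` is a hypothetical helper, not part of the es_util tooling:

```shell
# List index names whose trailing .YYYY.MM.DD suffix is older than a cutoff.
older_than() {
  cutoff="$1"   # numeric YYYYMMDD, e.g. 20201001
  while read -r idx; do
    # Normalize the trailing .YYYY.MM.DD suffix to YYYYMMDD for comparison
    suffix=$(echo "$idx" | awk -F. '{printf "%s%s%s", $(NF-2), $(NF-1), $NF}')
    if [ "$suffix" -lt "$cutoff" ] 2>/dev/null; then
      echo "$idx"
    fi
  done
}

# Example with stand-in index names; in the cluster the list would come from
#   oc exec -c elasticsearch $pod -- indices
printf '%s\n' \
  'project.myapp.abc123.2020.09.20' \
  '.operations.2020.10.05' | older_than 20201001
# prints project.myapp.abc123.2020.09.20
```

Anything this prints would then be a candidate for deletion with `es_util --query=$index -XDELETE`.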
Alternatively, you might consider shrinking your indices using the shrink API [1] with something like:

* Choose one of the remaining indices from the list, which is likely in the form of 'project.<projectname>.<projectuid>.<yyyy>.<mm>.<dd>' or '.operations.<yyyy>.<mm>.<dd>'.

* Choose a target name by adding a segment between the uid and the date: 'project.<projectname>.<projectuid>.ri.<yyyy>.<mm>.<dd>' or '.operations.ri.<yyyy>.<mm>.<dd>'.

* Shrink the index:

  oc exec -c elasticsearch $pod -- es_util --query=$sourcename/_shrink/$targetname -d '{"settings": {"index.number_of_replicas": 0,"index.number_of_shards": 5},"aliases": {".all": {}}}'

* Delete the previous index once the shrink is complete, if necessary:

  oc exec -c elasticsearch $pod -- es_util --query=$sourcename -XDELETE

Finally, enable replication if needed:

  oc exec -c elasticsearch $pod -- es_util --query=_settings -XPUT -d '{"index":{ "number_of_replicas":1}}'

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/indices-shrink-index.html
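The target-name construction in the second bullet can be scripted. A sketch (`shrink_target` is a hypothetical helper, assuming the index name always ends in a .<yyyy>.<mm>.<dd> suffix):

```shell
# Derive the shrink target by inserting an "ri" segment between the
# uid (or ".operations") and the trailing date suffix.
shrink_target() {
  src="$1"
  # Everything before the last three dot-separated fields, dots included
  prefix=$(echo "$src" | awk -F. '{for (i = 1; i <= NF-3; i++) printf "%s.", $i}')
  # The trailing yyyy.mm.dd
  suffix=$(echo "$src" | awk -F. '{printf "%s.%s.%s", $(NF-2), $(NF-1), $NF}')
  echo "${prefix}ri.${suffix}"
}

shrink_target project.myapp.abc123.2020.10.09   # prints project.myapp.abc123.ri.2020.10.09
shrink_target .operations.2020.10.09            # prints .operations.ri.2020.10.09
```

The output would be used as $targetname in the _shrink call above.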
I'm taking over the support case during NA hours; let us know if/when engineering would be available to work on this with the customer. If there would be a long delay due to unavailability, I've also offered to work with the customer directly on it. I'm comfortable with the steps listed, but we may still need to rely on engineering assistance in the event of unexpected behavior.
Any chance engineering will be available today for the session?
Marking UpcomingSprint, as this will not be merged or addressed by EOD.
(In reply to rbolling from comment #6)
> Additionally, the question was asked would this 4.7 fix be backported to
> 4.4? This would obviously be a question if we van get a viable workaround.

The workaround was listed in https://bugzilla.redhat.com/show_bug.cgi?id=1883444#c2

Regarding "backporting", the intention is to bring it back to 4.4 as long as that release is still within a maintenance cycle.
Thanks Jeff. When Ford attempted the workaround, they got an error. I believe Steven Walter explained that we were hitting something else. The customer is still requesting engineering on a call tomorrow. This was the number one topic for Ford today in our IBM/Red Hat/Ford call.

History: the customer was following https://bugzilla.redhat.com/show_bug.cgi?id=1883444 and it failed on step 5. Can you please help us with this?

--------------------------------------

This particular command fails with status 400:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"index_patterns":["project.*",".operations*",".orphaned*"]}'

{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"},"status":400}

I'm unclear what the query line should be to correct this.

Red Hat response: I think this might be caused by a discrepancy between ES 5 and ES 6. Someone in the ES community hit something similar: https://github.com/pelias/terraform-elasticsearch/issues/9

It seems ES 5 (which is running in OCP 4.3) doesn't use "index_patterns" and instead uses "template". Let's try this command instead:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*",".operations*",".orphaned*"]}'
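If I read the version difference correctly, it comes down to the key name for the match pattern: ES 6 renamed "template" to "index_patterns", and the ES 5 form takes the pattern as a plain string rather than an array. A small dry-run sketch of a version-aware payload builder (`template_payload` is a hypothetical helper, not part of es_util):

```shell
# Build the template PUT body for the given major ES version.
template_payload() {
  version="$1"; shards="$2"; pattern="$3"
  if [ "$version" -ge 6 ]; then
    # ES 6+: "index_patterns" is an array
    key="\"index_patterns\":[\"$pattern\"]"
  else
    # ES 5: "template" is a single string
    key="\"template\":\"$pattern\""
  fi
  echo "{\"order\":100,\"settings\":{\"index.number_of_shards\":$shards},$key}"
}

template_payload 5 5 'project.*'
# prints {"order":100,"settings":{"index.number_of_shards":5},"template":"project.*"}
```

The output would then be passed as the -d body of the es_util PUT shown above.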
Customer hit further issues with my modified command, but were able to get it working by running it separately for each template:

oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".operations*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".orphaned*"]}'

It seems to have worked. They mentioned the below after the changes:

~~~
This also appears to show up in the log on each cluster member:

Oct 06, 2020 6:51:11 PM okhttp3.internal.platform.Platform log
WARNING: A connection to https://10.0.0.1/ was leaked. Did you forget to close a response body? To see where this was allocated, set the OkHttpClient logger level to FINE: Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);

We also see fluctuating Readiness probe issues:

Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]

This will fluctuate but does persist for the life of the CI.
~~~

I've requested an updated must-gather to check the current status; for now we'll see if we can address the issue within support and update here if further assistance is needed.
(In reply to Steven Walter from comment #9)
> Customer hit further issues with my modified command, but were able to get
> it working by running it separately for each template:
>
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":["project.*"]}'
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":[".operations*"]}'
> oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override
> -XPUT -d '{"order": 100, "settings": {
> "index.number_of_shards":5},"template":[".orphaned*"]}'
>
> It seems to have worked. They mentioned after the changes the below:

Worked as in you have completed all of the steps I recommended, or just the templates to override replicas?

> ~~~
> This also appears to show up in the log on each cluster member:
>
> Oct 06, 2020 6:51:11 PM okhttp3.internal.platform.Platform log
> WARNING: A connection to https://10.0.0.1/ was leaked. Did you forget
> to close a response body? To see where this was allocated, set the
> OkHttpClient logger level to FINE:
> Logger.getLogger(OkHttpClient.class.getName()).setLevel(Level.FINE);

I have a PR which could resolve this, but given it is mostly a nuisance in the logs and the plugin goes away in 4.5+, the solution has been deprioritized.

> We also see fluctuating Readiness probe issues:
>
> Readiness probe failed: Elasticsearch node is not ready to accept HTTP
> requests yet [response code: 503]
> Readiness probe failed: Elasticsearch node is not ready to accept HTTP
> requests yet [response code: 000]

I'm guessing here, but wondering if this is related to trying to coordinate "readiness" across 27 nodes. Likely needs investigation, but assuming you are otherwise functional I would not expect this to be an issue.

> This will fluctuate but does persist for the life of the CI.
Life of "Continuous Integration"? Please advise how this is relevant and how it factors into the issue.
So far they have only completed the template changes -- I had misunderstood yesterday and thought they had completed all the steps. I am working with them on shrinking the indices. Since there are a few thousand, I wrote a short script to do it more automatically:

touch indexlist
for source in $(es_util --query=_cat/indices | awk '{print $3}' | grep project)
do
  echo $source
  newindex=$(echo "project.$(echo $source | cut -d"." -f2-3).ri.$(echo $source | cut -d"." -f4-6)")
  echo $newindex
  es_util --query=$source/_shrink/$newindex -d '{"settings": {"index.number_of_replicas": 0,"index.number_of_shards": 5},"aliases": {".all": {}}}'
  echo $source >> indexlist
done

Once satisfied that the shrink worked, we can iterate over indexlist to delete the indices:

$ for source in $(cat indexlist); do es_util --query=$source -XDELETE ; done

I'll update in the bz if any help is needed.
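Before the delete loop, it may be worth verifying each shrink actually carried the data over. A cautious sketch (`safe_to_delete` is a hypothetical check; in the cluster the counts would come from es_util --query=$index/_count for the source and target):

```shell
# Only allow deletion when source and target document counts match.
safe_to_delete() {
  src_count="$1"
  target_count="$2"
  [ "$src_count" -eq "$target_count" ]
}

# Example with stand-in counts:
if safe_to_delete 157910 157910; then
  echo "counts match; safe to delete source index"
fi
```

Any index where the counts differ would be left out of indexlist for a second look.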
The template changes [1] don't seem to "stick". active_shards and active_primary_shards have a discrepancy, which grows with time.

- My hypothesis was that replica shards are being re-enabled while we're in the Managed state.
- However, even after setting to Unmanaged, the issue continues.
- We checked new indices being created and saw they still have 10 primary shards, even after overriding the shard count.
- Pending tasks keeps going up forever after we do the template change.

I'll install a lab cluster to try to figure out what's not working here.

[1]
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":["project.*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".operations*"]}'
oc exec -c elasticsearch $pod -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":5},"template":[".orphaned*"]}'
Reproduced; I can't seem to use the ES API to prevent sharding to node count.

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"template":["project.*"]}'
{"acknowledged":true}

$ oc new-project jenkins
$ oc new-app jenkins-ephemeral
(waited till it's up)

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_cat/shards
. . .
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 p STARTED 162 162b 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 r STARTED 162 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 r STARTED 170 152.3kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 p STARTED 170 162b 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 p STARTED 136 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 r STARTED 136 110.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"index_patterns":["project.*",".operations*",".orphaned*"]}'
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: template is missing;"},"status":400}

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/replica_override
{"replica_override":{"order":100,"template":"[project.*]","settings":{"index":{"number_of_shards":"1"}},"mappings":{},"aliases":{}}}
This seems to occur whether it's managed or unmanaged. Is this the wrong template? I guessed at another template:

$ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm -- es_util --query=_template/.project* -XPUT -d '{"order": 100, "settings": { "index.number_of_shards":1},"template":["project.*"]}'

But not successfully... :)

project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 p STARTED 163 150kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 r STARTED 163 126.9kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 r STARTED 177 148.8kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 p STARTED 177 124.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 p STARTED 155 111.2kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 r STARTED 155 111.2kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
Hi, any updates on the workaround for this? Customer's still stuck as we can't get the primary shard override to work.
(In reply to Steven Walter from comment #15)
> Reproduced; I can't seem to use ES API to prevent sharding to node number.

I'm not certain we have made it clear regarding the templates being pushed in and the actual primary/replica count. Templates only affect NEW indices that are created. To modify existing indices you need to reindex or shrink them.

> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_template/replica_override -XPUT -d '{"order": 100,
> "settings": { "index.number_of_shards":1},"template":["project.*"]}'
> {"acknowledged":true}

The key here is that you are adding a template called "replica_override" which gets applied after (and takes precedence over) any other templates that match the index, based on the order of "100". I believe the default settings have an order of something like 10 or less. This template applies to any indices that match the pattern "project.*".

> $ oc new-project jenkins
> $ oc new-app jenkins-ephemeral
> (waited till it's up)
> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_cat/shards
> . . .
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 p STARTED 162 162b 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 2 r STARTED 162 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 r STARTED 170 152.3kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 1 p STARTED 170 162b 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 p STARTED 136 162b 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.jenkins.fcbc212a-d906-4c23-8d57-9835852dceb4.2020.10.09 0 r STARTED 136 110.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
>
> $ oc exec -c elasticsearch elasticsearch-cdm-0wr8jzq9-2-57c55d9fff-hp4lm --
> es_util --query=_template/replica_override -XPUT -d '{"order": 100,
> "settings": {
> "index.number_of_shards":1},"index_patterns":["project.*",".operations*",".orphaned*"]}'
> {"error":{"root_cause":[{"type":"action_request_validation_exception",
> "reason":"Validation Failed: 1: template is
> missing;"}],"type":"action_request_validation_exception","reason":
> "Validation Failed: 1: template is missing;"},"status":400}

I'm not certain what you are trying to accomplish here or why it failed, but you are overwriting the template from earlier in this comment in lieu of adding a new one. This is my fault, as I didn't make it clear that the name (i.e. replica_override) provided in #c14 should be unique.

> But not successfully...
> :)
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 p STARTED 163 150kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 2 r STARTED 163 126.9kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 r STARTED 177 148.8kb 10.131.0.80 elasticsearch-cdm-0wr8jzq9-3
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 1 p STARTED 177 124.3kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 p STARTED 155 111.2kb 10.130.0.18 elasticsearch-cdm-0wr8jzq9-2
> project.other-template.bca8a9be-b932-4245-9c65-46b8f734a153.2020.10.09 0 r STARTED 155 111.2kb 10.131.0.77 elasticsearch-cdm-0wr8jzq9-1

The primary shard count of these indices cannot be modified unless you re-index/delete and/or shrink them, which I attempted to address with the script in #c12. Note you could also change the replica count, which would additionally reduce shards.
(In reply to Jeff Cantrill from comment #19)
> I'm not certain we have made it clear regarding the templates being pushed
> in and the actual primary replica count. Templates only effect NEW indices
> that are created. To modify existing indices you need to reindex or shrink
> them

We're clear on that -- we just haven't been able to get any of the changes to work, even for new indices.

> The primary shard count of these indices can not be modified unless you
> re-index/delete and/or shrink them which I attempt to provide a script from
> #c12. Note you could change the replica count which would additionally
> reduce shards

Those were created *after* I had run all the other commands. Again, I'm creating a new project (and new indices) after each attempt.

> I'm not certain what you are trying to accomplish here

Grasping at straws. :) I'm still not sure what the command should look like to force new indices to have a smaller primary shard count. Would my command work if I set a lower order? What am I missing?
I experimented with different orders and unique template names. Both templates have a unique name (from what I can tell), and I chose different orders for both. You can see:

sh-4.2$ es_util --query=_template/project_override -XPUT -d '{"order": 1, "settings": { "index.number_of_shards":1},"template":["project.*"]}'
{"acknowledged":true}
sh-4.2$ es_util --query=_template/project_override_extra -XPUT -d '{"order": 1000, "settings": { "index.number_of_shards":2},"template":["project.*"]}'
{"acknowledged":true}

After running the above, I would expect a new project index to have either 1 shard or 2 shards (depending on which order the templates take effect). However, neither is the case; it still uses the default 3 primary shards:

$ oc new-project jenkins
$ oc new-app django-psql-example
$ oc exec elasticsearch-cdm-dh7awzbo-1-647c844cdc-qgsbj -- es_util --query=_cat/indices?v
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-dh7awzbo-1-647c844cdc-qgsbj -n openshift-logging' to see all of the containers in this pod.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .operations.2020.11.10 jSYF0UtlRcKzh9ianLao8g 3 0 148437 0 127.3mb 127.3mb
green open .searchguard kqKaXrdmQIaNaoRx7WE7-Q 1 0 5 0 80.6kb 80.6kb
green open project.jenkins.769d3c0e-c4eb-4652-a023-c9230b1cbb33.2020.11.10 6XRKBiBEQLyXu59Wfg8ZRQ 3 0 12 0 82.4kb 82.4kb
To recap our slack discussion here -- the issue with the template in comment 21 is the wrapping hard brackets. Using the following worked:

es_util --query=_template/project_override -XPUT -d '{"order": 1, "settings": { "index.number_of_shards":1},"template":"project.*"}'

The customer will likely need a different template for each of the index patterns they wish to override, with each having a different "template" value ("project.*", ".operations.*", ".orphaned*").
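Following that fix, the three per-pattern PUTs can be generated in one loop. A dry-run sketch (`put_template_cmd` and the shard_override_* names are my own invention; it only echoes the es_util calls rather than running them):

```shell
# Build one PUT per index pattern, each with a unique template name and
# the ES 5 "template" key as a plain string (no wrapping brackets).
put_template_cmd() {
  pattern="$1"
  # Derive a unique, alphabetic-only template name from the pattern
  name="shard_override_$(echo "$pattern" | tr -cd 'a-z')"
  echo "es_util --query=_template/$name -XPUT -d '{\"order\": 100, \"settings\": {\"index.number_of_shards\":5},\"template\":\"$pattern\"}'"
}

for pattern in 'project.*' '.operations.*' '.orphaned*'; do
  put_template_cmd "$pattern"
done
```

Each echoed line would then be run inside the ES pod (or prefixed with `oc exec -c elasticsearch $pod --`).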
Created a 7-node ES cluster from release-4.7. Some indices have 7 primaries and some have 5 primaries. I assume all indices should have a max of 5 primaries, true?

# oc exec -c elasticsearch $pod -- indices
Thu Nov 12 21:16:37 UTC 2020
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open infra-000001 h2K65rITQc-Asq1YdLa4kw 7 1 1029220 0 1107 553
green open app-000001 R-ipaaVJQBe8vwZmSmJ7hw 7 1 70375 0 153 76
green open .kibana_1 TCtt4PrdRQKTfwVC8hx9Vg 1 1 0 0 0 0
green open audit-000001 chIurtt0T8CrWD0D058HnQ 7 1 0 0 0 0
green open infra-000002 SNGRjvDGSO2AwX3DWrpP9A 5 1 146824 0 226 113
green open .security EC66T2YXQ-2UF2pn0JXrnA 1 1 5 0 0 0
green open infra-000003 MiqRKfUEQHCvXobFR_EzOw 5 1 23891 0 42 19

I tried deleting the app index to see if it was recreated with 5 primaries, but it was recreated with 7.
After letting this cluster run:

# oc exec -c elasticsearch $pod -- indices
Fri Nov 13 00:27:47 UTC 2020
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open app-000003 bY1Cmx_6RjyrYsKFNKsnXQ 5 1 0 0 0 0
green open .kibana_1 TCtt4PrdRQKTfwVC8hx9Vg 1 1 0 0 0 0
green open infra-000009 uM_v258LR1S20ETJ1-9DKA 5 1 157910 0 241 120
green open infra-000010 ul79E8V3RwSRndT492fDMg 5 1 160409 0 243 122
green open infra-000008 3kvV0ak4QrS9Ug_dPqY_Dg 5 1 158610 0 242 121
green open app-000002 0o9dPiUrRRSNoO_RKZZRWw 5 1 0 0 0 0
green open app-000001 i4M1xGjeTzCe4pfvV6-1ag 7 1 5902725 0 9310 4666
green open audit-000002 aCylnMKNRKSkWUcV5EM9HQ 5 1 0 0 0 0
green open infra-000014 57cRK1YmRlGZHpJRqpvX7Q 5 1 160381 0 243 121
green open infra-000015 DiKM3tMiTfqfG_pGBifOVA 5 1 136384 0 209 106
green open infra-000012 LtrnFAM4RC-KgRrff5YFqQ 5 1 159045 0 243 120
green open infra-000007 gYWp8epFT8G2JUVygUK9WA 5 1 158236 0 240 120
green open infra-000011 4dMkTwMnR_GlihDra5W8XA 5 1 158333 0 238 119
green open audit-000001 chIurtt0T8CrWD0D058HnQ 7 1 0 0 0 0
green open infra-000013 Ow6EMJgoS0ezsYMaj2u5sg 5 1 159528 0 242 121
green open .security EC66T2YXQ-2UF2pn0JXrnA 1 1 5 0 0 0

It looks like the first index is getting created with n=number of nodes and subsequent indices get n=5. Expected?
@mifiedle it looks like we missed the initial index creation... I didn't realize it was being handled in a separate place. Will have a patch open shortly.
Verified on elasticsearch-operator.4.7.0-202012161545.p0 Initial and subsequent indices are limited to 5 shards on an ES cluster with 7 members.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0652