Description of problem:
Running OpenShift 4.5.8 with Cluster Logging clusterlogging.4.5.0-202009041228.p0 does not correctly decrease the number of Elasticsearch Nodes.

The ClusterLogging resource "instance" with the spec below:

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         5
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:  7d
>     Type:  elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

shows the expected five Elasticsearch nodes:

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

When updating the ClusterLogging resource "instance" and decreasing the node count to 3, we still see 5 Elasticsearch nodes running:

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         3
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:  7d
>     Type:  elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

Even when deleting an Elasticsearch pod, it is re-created immediately. Also, adjusting "Redundancy Policy" from "FullRedundancy" to "SingleRedundancy" does not take effect.

Version-Release number of selected component (if applicable):
- clusterlogging.4.5.0-202009041228.p0

How reproducible:
- Always

Steps to Reproduce:
1. Install OpenShift Logging according to https://docs.openshift.com/container-platform/4.5/logging/cluster-logging-deploying.html
2. Increase the number of Elasticsearch Nodes from 3 to 5
3. Decrease the number of Elasticsearch Nodes from 5 to 3

Actual results:
All 5 Elasticsearch Nodes keep running and no attempt is made to reduce the number of Elasticsearch Nodes. Changes to "Redundancy Policy" are also not reflected (whether done at the same time or not).

Expected results:
The number of Elasticsearch Nodes is reflected correctly at all times, and the Operator takes action when spec.logStore.elasticsearch.nodeCount is modified.

Additional info:
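For anyone reproducing step 3, the scale down can be applied with a patch along these lines (a sketch, not taken from the report; it assumes the default ClusterLogging CR named "instance" in the openshift-logging namespace):

$ oc -n openshift-logging patch clusterlogging/instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"nodeCount":3,"redundancyPolicy":"SingleRedundancy"}}}}'
$ oc -n openshift-logging get pod -l component=elasticsearch   # node count should eventually drop to 3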
Moving to 4.7 as this is not a 4.6 blocker
@Simon Please collect a full must-gather for cluster-logging to get a full picture of the stack, using https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
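The repository above describes how to run it; roughly something like this (a sketch, assuming the cluster-logging-operator deployment lives in the openshift-logging namespace):

$ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')

That should produce a local must-gather.local.* directory which can be attached to this BZ.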
Marking UpcomingSprint as this will not be merged or addressed by EOD.
Verified with elasticsearch-operator.4.7.0-202011030448.p0
The ES cluster went into Red status in 50% of 10 scale-down attempts. Moving back to ASSIGNED to continue investigating.
When the replica shards haven't been created, the ES cluster may go into Red status after the scale down.

## Before scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security     0 p STARTED        5  29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security     0 r STARTED        5  29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001  1 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
audit-000001  2 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001  0 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    1 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001    2 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    0 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001  1 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001  2 p STARTED     7917   4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001  0 p STARTED     7191     4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1     0 r STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.kibana_1     0 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3

## After scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_cluster_health
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 4,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 69.23076923076923
}
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security     0 p STARTED        5  29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security     0 r STARTED        5  29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001  1 p UNASSIGNED
audit-000001  2 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001  0 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    1 p STARTED        0 127.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001    2 p STARTED        0   136kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    0 p UNASSIGNED
infra-000001  1 p UNASSIGNED
infra-000001  2 p STARTED     7917   4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001  0 p STARTED     7191     4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1     0 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
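In case it helps triage, the unassigned primaries are easy to spot with the same es_util wrapper used above (a sketch; the pod name is the one from this cluster and will differ elsewhere):

$ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- \
    es_util --query="_cat/shards?h=index,shard,prirep,state" | grep UNASSIGNED

Any "p UNASSIGNED" line means a primary shard was lost along with the removed node, which matches the red status shown above.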
To scale down the ES cluster, I think ES must meet some conditions, depending on the redundancy policy:

ZeroRedundancy: Don't allow scale down at all.
SingleRedundancy: Even if the replica shards have been created, the ES nodes can only be scaled down one by one.
MultipleRedundancy: Even if all replica shards have been created, we don't know where the replica shards are located, so the ES nodes should still be scaled down one by one.
FullRedundancy: If all replicas have been created, 1 to n-1 nodes can be scaled down.

The EO should check the replica shard status and block new index generation during the scale down.
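A rough manual pre-check for the conditions above could look like this (hedged example; it reuses the pod name from the previous comment, and the column names are standard _cat parameters):

$ POD=elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z
$ # every index should be green and carry at least one replica (rep >= 1) before a node is removed
$ oc exec -c elasticsearch $POD -- es_util --query="_cat/indices?h=health,rep,index"
$ # and the cluster itself should report green with zero unassigned shards
$ oc exec -c elasticsearch $POD -- es_util --query="_cluster/health?pretty"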
@anli I think that would be a separate feature. If a user is going to scale down their ES cluster, they should understand the risk of data loss if there is no replication.
Docs BZ for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1896916
Thanks Michael
Moving to VERIFIED.
Created https://github.com/openshift/openshift-docs/pull/27404 to document the warnings about scaling down and the node minimums as listed in https://issues.redhat.com/browse/LOG-981.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652