Description of problem:
Running OpenShift 4.5.8 with Cluster Logging clusterlogging.4.5.0-202009041228.p0 does not correctly decrease the number of Elasticsearch Nodes.

The ClusterLogging resource "instance" with the spec below:

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         5
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:  7d
>     Type:  elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

shows the expected five Elasticsearch nodes:

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

When updating the ClusterLogging resource "instance" and decreasing the node count to 3, we still see 5 Elasticsearch nodes running:

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         3
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:  7d
>     Type:  elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

Even when deleting an Elasticsearch pod, it is re-created immediately. Also, adjusting "Redundancy Policy" from "FullRedundancy" to "SingleRedundancy" does not take effect.

Version-Release number of selected component (if applicable):
- clusterlogging.4.5.0-202009041228.p0

How reproducible:
- Always

Steps to Reproduce:
1. Install OpenShift Logging according to https://docs.openshift.com/container-platform/4.5/logging/cluster-logging-deploying.html
2. Increase the number of Elasticsearch Nodes from 3 to 5
3. Decrease the number of Elasticsearch Nodes from 5 to 3

Actual results:
All 5 Elasticsearch Nodes keep running and no attempt is made to reduce the number of Elasticsearch Nodes. Changes to "Redundancy Policy" are also not reflected (whether done at the same time or not).

Expected results:
The number of Elasticsearch Nodes is reflected correctly at all times, and the Operator takes action when spec.logStore.elasticsearch.nodeCount is modified.

Additional info:
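For anyone reproducing step 3, the scale down can be applied with a patch along these lines (a sketch, not taken from the report; it assumes the default ClusterLogging CR named "instance" in the openshift-logging namespace):

$ oc -n openshift-logging patch clusterlogging/instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"nodeCount":3,"redundancyPolicy":"SingleRedundancy"}}}}'
$ oc -n openshift-logging get pod -l component=elasticsearch   # node count should eventually drop to 3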
Moving to 4.7 as this is not a 4.6 blocker
@Simon Please collect a full must-gather for cluster-logging to get a full picture of the stack, using https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
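The repository above describes how to run it; roughly something like this (a sketch, assuming the cluster-logging-operator deployment lives in the openshift-logging namespace):

$ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')

That should produce a local must-gather.local.* directory which can be attached to this BZ.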
Marking UpcomingSprint as this will not be merged or addressed by EOD.
Verified with elasticsearch-operator.4.7.0-202011030448.p0
The ES cluster went into Red status in 50% of 10 scale-down attempts. Moving back to ASSIGNED to continue investigating.
When the replica shards haven't been created, the ES cluster may go into Red status after the scale down.

## Before scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security     0 p STARTED        5  29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security     0 r STARTED        5  29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001  1 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
audit-000001  2 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001  0 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    1 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001    2 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    0 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001  1 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001  2 p STARTED     7917   4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001  0 p STARTED     7191     4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1     0 r STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.kibana_1     0 p STARTED                  10.131.0.27 elasticsearch-cdm-znu3x9e7-3

## After scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_cluster_health
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 4,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 69.23076923076923
}
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security     0 p STARTED        5  29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security     0 r STARTED        5  29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001  1 p UNASSIGNED
audit-000001  2 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001  0 p STARTED        0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    1 p STARTED        0 127.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001    2 p STARTED        0   136kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001    0 p UNASSIGNED
infra-000001  1 p UNASSIGNED
infra-000001  2 p STARTED     7917   4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001  0 p STARTED     7191     4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1     0 p STARTED        0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
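In case it helps triage, the unassigned primaries are easy to spot with the same es_util wrapper used above (a sketch; the pod name is the one from this cluster and will differ elsewhere):

$ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- \
    es_util --query="_cat/shards?h=index,shard,prirep,state" | grep UNASSIGNED

Any "p UNASSIGNED" line means a primary shard was lost along with the removed node, which matches the red status shown above.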
To scale down the ES cluster, I think ES must meet some conditions, depending on the redundancy policy:

ZeroRedundancy: Don't allow scale down at all.
SingleRedundancy: Even if the replica shards have been created, the ES nodes can only be scaled down one by one.
MultipleRedundancy: Even if all replica shards have been created, we don't know where the replica shards are located, so the ES nodes should still be scaled down one by one.
FullRedundancy: If all replicas have been created, 1 to n-1 nodes can be scaled down.

The EO should check the replica shard status and block new index generation during the scale down.
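A rough manual pre-check for the conditions above could look like this (hedged example; it reuses the pod name from the previous comment, and the column names are standard _cat parameters):

$ POD=elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z
$ # every index should be green and carry at least one replica (rep >= 1) before a node is removed
$ oc exec -c elasticsearch $POD -- es_util --query="_cat/indices?h=health,rep,index"
$ # and the cluster itself should report green with zero unassigned shards
$ oc exec -c elasticsearch $POD -- es_util --query="_cluster/health?pretty"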
@anli I think that would be a separate feature. If a user is going to scale down their ES cluster, they should understand the risk of data loss if there is no replication.
Docs BZ for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1896916
Thanks Michael
Moving to VERIFIED.
Created https://github.com/openshift/openshift-docs/pull/27404 to document the warnings about scaling down and the node minimums as listed in https://issues.redhat.com/browse/LOG-981.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652