Bug 1879150 - Changes on spec.logStore.elasticsearch.nodeCount not reflected when decreasing the number of nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: ewolinet
QA Contact: Qiaoling Tang
Docs Contact: Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: 1890801
 
Reported: 2020-09-15 14:28 UTC by Simon Reber
Modified: 2021-02-24 11:21 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, when the Cluster Logging Operator (CLO) scaled down the number of Elasticsearch nodes in the clusterlogging custom resource (CR) to three nodes, it omitted previously-created nodes that had unique IDs. The Elasticsearch Operator (EO) rejected the update because it has safeguards that prevent nodes with unique IDs from being removed. The current release fixes this issue. Now, when the CLO scales down the number of nodes and updates the Elasticsearch CR, the CLO does not omit nodes that have unique IDs. Instead, the CLO marks those nodes as `count 0`. As a result, users can scale down their cluster to three nodes using the clusterlogging CR. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1879150[*BZ#1879150*])
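The "count 0" behavior described above can be sketched as follows. This is an illustrative fragment only: the field names (`nodes`, `genUUID`, `nodeCount`) follow the Elasticsearch Operator CRD, but the values are hypothetical, reusing the deployment-group IDs from this bug.

```shell
# Sketch of the Elasticsearch CR node list after scaling the
# clusterlogging CR from 5 nodes back to 3 (illustrative values):
cat > /tmp/es-cr-sketch.yaml <<'EOF'
spec:
  nodes:
  - genUUID: gbgfqisu     # original deployment group, keeps its 3 nodes
    nodeCount: 3
  - genUUID: hh1vvavv     # group added for nodes 4 and 5: kept in the
    nodeCount: 0          # CR but marked count 0 instead of being omitted
EOF

# The scaled-down group is still listed, so the EO's unique-ID
# safeguard no longer rejects the update:
grep -c 'nodeCount: 0' /tmp/es-cr-sketch.yaml
```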
Clone Of:
: 1898310
Environment:
Last Closed: 2021-02-24 11:21:18 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 752 0 None closed Bug 1879150: Updating how we scale down ES nodes in CR to preserve the number of es node definitions 2021-02-19 14:36:11 UTC
Red Hat Product Errata RHBA-2021:0652 0 None None None 2021-02-24 11:21:52 UTC

Description Simon Reber 2020-09-15 14:28:27 UTC
Description of problem:

Running OpenShift 4.5.8 with Cluster Logging clusterlogging.4.5.0-202009041228.p0 does not correctly decrease the number of Elasticsearch Nodes.

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         5
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:     7d
>     Type:            elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

shows

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

When updating the ClusterLogging resource "instance" and decreasing the node count to 3, we still see 5 Elasticsearch nodes running.

> Spec:
>   Collection:
>     Logs:
>       Fluentd:
>       Type:  fluentd
>   Curation:
>     Curator:
>       Schedule:  30 3 * * *
>     Type:        curator
>   Log Store:
>     Elasticsearch:
>       Node Count:         3
>       Redundancy Policy:  FullRedundancy
>       Resources:
>         Limits:
>           Memory:  4Gi
>         Requests:
>           Cpu:     500m
>           Memory:  2Gi
>       Storage:
>         Size:                200G
>         Storage Class Name:  gp2
>     Retention Policy:
>       Application:
>         Max Age:  1d
>       Audit:
>         Max Age:  7d
>       Infra:
>         Max Age:     7d
>     Type:            elasticsearch
>   Management State:  Managed
>   Visualization:
>     Kibana:
>       Replicas:  1
>     Type:        kibana

> $ oc get pod -l component=elasticsearch
> NAME                                            READY   STATUS    RESTARTS   AGE
> elasticsearch-cd-hh1vvavv-1-db447f8c4-797hz     2/2     Running   0          50m
> elasticsearch-cd-hh1vvavv-2-8c6fb9f45-8zgsr     2/2     Running   0          50m
> elasticsearch-cdm-gbgfqisu-1-75b49786b6-m72qt   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-2-7f77c4947f-vmx7t   2/2     Running   0          72m
> elasticsearch-cdm-gbgfqisu-3-6d5955bd8d-vnz9h   2/2     Running   0          72m

Even when an Elasticsearch pod is deleted, it is re-created immediately. Changing "Redundancy Policy" from "FullRedundancy" to "SingleRedundancy" also has no effect.

Version-Release number of selected component (if applicable):

 - clusterlogging.4.5.0-202009041228.p0

How reproducible:

 - Always

Steps to Reproduce:
1. Install OpenShift Logging according to https://docs.openshift.com/container-platform/4.5/logging/cluster-logging-deploying.html
2. Increase the number of Elasticsearch Nodes from 3 to 5
3. Decrease the number of Elasticsearch Nodes from 5 to 3
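Steps 2 and 3 can be performed by patching the ClusterLogging CR. A minimal sketch of step 3 (the `oc patch` invocation assumes cluster access and the usual `openshift-logging` namespace, so it is shown as a comment):

```shell
# Merge patch that decreases the node count back to 3; with cluster
# access it would be applied as:
#   oc -n openshift-logging patch clusterlogging instance --type merge -p "$patch"
patch='{"spec":{"logStore":{"elasticsearch":{"nodeCount":3}}}}'
echo "$patch"
```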

Actual results:

All 5 Elasticsearch nodes keep running, and no attempt is made to reduce the number of Elasticsearch nodes. Changes to "Redundancy Policy" are also not reflected (whether made at the same time or separately).

Expected results:

The number of Elasticsearch nodes should be reflected correctly at all times, and the Operator should take action when spec.logStore.elasticsearch.nodeCount is modified.

Additional info:

Comment 2 Jeff Cantrill 2020-09-16 13:42:21 UTC
Moving to 4.7 as this is not a 4.6 blocker

Comment 3 Periklis Tsirakidis 2020-09-24 11:19:32 UTC
@Simon

Please collect a full must-gather for cluster-logging to get a full picture of the stack:

https://github.com/openshift/cluster-logging-operator/tree/master/must-gather

Comment 11 Jeff Cantrill 2020-10-02 15:24:12 UTC
Marking UpcomingSprint, as this will not be merged or addressed by EOD

Comment 20 Qiaoling Tang 2020-11-04 06:18:54 UTC
Verified with elasticsearch-operator.4.7.0-202011030448.p0

Comment 21 Anping Li 2020-11-09 12:45:57 UTC
The ES cluster went into Red status in 50% of 10 scale-down attempts. Moving back to ASSIGNED to continue investigating.

Comment 22 Anping Li 2020-11-09 13:06:25 UTC
When the replica shards weren't created, ES may go into Red status.

##Before scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security    0 p STARTED    5 29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security    0 r STARTED    5 29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001 1 p STARTED             10.131.0.27 elasticsearch-cdm-znu3x9e7-3
audit-000001 2 p STARTED    0   230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001 0 p STARTED    0   230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001   1 p STARTED    0   230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001   2 p STARTED    0   230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001   0 p STARTED             10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001 1 p STARTED             10.131.0.27 elasticsearch-cdm-znu3x9e7-3
infra-000001 2 p STARTED 7917  4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001 0 p STARTED 7191    4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1    0 r STARTED    0   230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.kibana_1    0 p STARTED             10.131.0.27 elasticsearch-cdm-znu3x9e7-3
##After scale down:
+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_cluster_health
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 4,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 69.23076923076923
}

+ oc exec -c elasticsearch elasticsearch-cdm-znu3x9e7-1-78b488bcf6-zq22z -- es_util --query=_cat/shards
.security    0 p STARTED       5  29.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
.security    0 r STARTED       5  29.6kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
audit-000001 1 p UNASSIGNED                          
audit-000001 2 p STARTED       0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
audit-000001 0 p STARTED       0    230b 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001   1 p STARTED       0 127.6kb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
app-000001   2 p STARTED       0   136kb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
app-000001   0 p UNASSIGNED                          
infra-000001 1 p UNASSIGNED                          
infra-000001 2 p STARTED    7917   4.3mb 10.128.2.22 elasticsearch-cdm-znu3x9e7-1
infra-000001 0 p STARTED    7191     4mb 10.129.2.25 elasticsearch-cdm-znu3x9e7-2
.kibana_1    0 p STARTED       0    230b 10.128.2.22 elasticsearch-cdm-znu3x9e7-1

Comment 23 Anping Li 2020-11-09 13:06:57 UTC
(Duplicate of comment 22.)

Comment 24 Anping Li 2020-11-09 13:11:12 UTC
To scale down the ES cluster, I think ES must meet some conditions.

ZeroRedundancy:       Don't allow scale down.
SingleRedundancy:     Even if the replica shards have been created, the ES nodes can only be scaled down one by one.
MultipleRedundancy:   Even if all replica shards have been created, since we don't know where the replica shards are located, the ES nodes should be scaled down one by one.
FullRedundancy:       If all replicas have been created, 1 to n-1 nodes can be scaled down at once.

The EO should check the replica shard status and block new index generation.
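The per-policy limits proposed above can be sketched as a small helper. This is illustrative logic only (the function name and interface are hypothetical, not EO behavior): given a redundancy policy and the current node count n, it returns how many nodes could safely be removed in one step.

```shell
# Sketch of the proposed per-policy scale-down limits (illustrative):
#   $1 = redundancy policy, $2 = current node count n
safe_removal_step() {
  case "$1" in
    ZeroRedundancy)     echo 0 ;;            # no replicas: any removal risks data loss
    SingleRedundancy)   echo 1 ;;            # one replica: remove nodes one at a time
    MultipleRedundancy) echo 1 ;;            # replica placement unknown: still one by one
    FullRedundancy)     echo $(( $2 - 1 )) ;;  # every node holds all data: up to n-1
    *) echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

safe_removal_step FullRedundancy 5   # -> 4
```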

Comment 25 ewolinet 2020-11-09 17:14:00 UTC
@anli@redhat.com I think that would be a future feature.

If a user is going to be scaling down their ES cluster, they should understand the risk for data loss if there is no replication.

Comment 30 Michael Burke 2020-11-11 20:13:36 UTC
Docs BZ for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1896916

Comment 31 ewolinet 2020-11-11 20:15:37 UTC
Thanks Michael

Comment 33 Anping Li 2020-11-13 08:38:17 UTC
Move to verified

Comment 34 Michael Burke 2020-11-16 21:10:28 UTC
Created https://github.com/openshift/openshift-docs/pull/27404 to document the warnings about scaling down and the node minimums as listed in https://issues.redhat.com/browse/LOG-981.

Comment 41 errata-xmlrpc 2021-02-24 11:21:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

