Bug 1712721 - Can't scale up ES nodes from 3 to N (N>3) in clusterlogging CRD instance.
Summary: Can't scale up ES nodes from 3 to N (N>3) in clusterlogging CRD instance.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1712955
 
Reported: 2019-05-22 06:53 UTC by Qiaoling Tang
Modified: 2019-10-16 06:29 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1712955 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:29:13 UTC




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:29:33 UTC)
Github openshift cluster-logging-operator pull 185 (last updated 2019-05-22 15:57:54 UTC)
Github openshift cluster-logging-operator pull 205 (last updated 2019-06-26 19:39:44 UTC)

Description Qiaoling Tang 2019-05-22 06:53:35 UTC
Description of problem:
Deploy logging with 3 ES nodes and wait until all pods are running, then change the ES node count to 4 in the clusterlogging CRD instance. After waiting about 10 minutes, the ES node count is still 3 in the elasticsearch CRD instance, and there are no logs in the cluster-logging-operator pod.

By contrast, when scaling ES nodes from 2 to 4 in the clusterlogging CRD instance, the ES node count is changed to 4 in the elasticsearch CRD instance and the ES pods are scaled up. The log `level=info msg="Elasticsearch node configuration change found, updating elasticsearch"` appears in the cluster-logging-operator pod.

Version-Release number of selected component (if applicable):
image-registry.openshift-image-registry.svc:5000/openshift/ose-cluster-logging-operator:v4.1.0-201905191700


How reproducible:
Always

Steps to Reproduce:
1. Deploy logging via OLM and set the ES node count to 3 in the clusterlogging CRD instance.
2. Wait until all logging pods are running, then change the ES node count to 4 in the clusterlogging CRD instance.
3. Check the pods in the `openshift-logging` namespace, and compare the ES node count in the elasticsearch CRD instance and the clusterlogging CRD instance (a command sketch follows below).
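
A minimal command sketch of steps 2 and 3, assuming the default ClusterLogging instance named `instance` in the `openshift-logging` namespace (the name matches the owner reference shown in comment 4) and the spec.logStore.elasticsearch.nodeCount field path:

# Step 2: raise the ES node count from 3 to 4 on the clusterlogging CRD instance.
$ oc -n openshift-logging patch clusterlogging/instance --type merge -p '{"spec":{"logStore":{"elasticsearch":{"nodeCount":4}}}}'

# Step 3: compare the node counts in both CRD instances and list the ES pods.
$ oc -n openshift-logging get clusterlogging/instance -o jsonpath='{.spec.logStore.elasticsearch.nodeCount}{"\n"}'
$ oc -n openshift-logging get elasticsearch/elasticsearch -o jsonpath='{.spec.nodes[*].nodeCount}{"\n"}'
$ oc -n openshift-logging get pods | grep elasticsearch-cdm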

Actual results:


Expected results:


Additional info:

Comment 1 Qiaoling Tang 2019-05-22 08:23:29 UTC
Actual results:
The es nodeCount in the elasticsearch CRD instance isn't changed after changing the es nodeCount from 3 to n (n>3) in the clusterlogging CRD instance.

Expected results:
The es node count should be the same as in the clusterlogging CRD instance.

Additional info:
Scaling up es nodes from 1 or 2 to n (n>=3): no issue.
Scaling up es nodes from 4 or 5 to 6: no issue.

This issue only happens when scaling up from 3 nodes to n (n>3) nodes.

The workaround is (a sketch of both steps follows below):
1. Change the es nodeCount in the clusterlogging CRD instance.
2. Run `oc delete elasticsearch elasticsearch -n openshift-logging` to delete the elasticsearch CRD instance; the elasticsearch CR is then recreated with the nodeCount set in the clusterlogging CRD instance.
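
A sketch of the workaround, assuming the default CR names `instance` and `elasticsearch` in the `openshift-logging` namespace (step 1 is the same nodeCount edit shown under Steps to Reproduce):

# Step 1: change spec.logStore.elasticsearch.nodeCount in the clusterlogging CRD instance.
$ oc -n openshift-logging edit clusterlogging instance

# Step 2: delete the elasticsearch CRD instance; the cluster-logging-operator
# recreates it from the clusterlogging spec, picking up the new nodeCount.
$ oc delete elasticsearch elasticsearch -n openshift-logging

# Confirm the recreated CR and the ES pods.
$ oc -n openshift-logging get elasticsearch/elasticsearch -o jsonpath='{.spec.nodes[*].nodeCount}{"\n"}'
$ oc -n openshift-logging get pods | grep elasticsearch-cdm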

Comment 2 Ben Parees 2019-05-22 13:54:33 UTC
this should likely be cloned+backported to 4.1.z

Comment 4 Qiaoling Tang 2019-06-25 06:11:14 UTC
The issue isn't fixed.

Got an error message in the EO pod after changing the es node count from 3 to 4 in the clusterlogging CR instance.

{"level":"error","ts":1561442637.229709,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Unsupported change to UUIDs made: Previously used GenUUID \"jw91ctq6\" is no longer found in Spec.Nodes","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/_output/src/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

$ oc get elasticsearch -oyaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    creationTimestamp: "2019-06-25T03:39:34Z"
    generation: 37
    name: elasticsearch
    namespace: openshift-logging
    ownerReferences:
    - apiVersion: logging.openshift.io/v1
      controller: true
      kind: ClusterLogging
      name: instance
      uid: d67cda8e-96fa-11e9-a275-06e6146aca30
    resourceVersion: "431049"
    selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
    uid: d9521b52-96fa-11e9-a275-06e6146aca30
  spec:
    managementState: Managed
    nodeSpec:
      image: image-registry.openshift-image-registry.svc:5000/openshift/ose-logging-elasticsearch5:latest
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 1Gi
    nodes:
    - nodeCount: 3
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 10Gi
        storageClassName: gp2
    - nodeCount: 1
      resources: {}
      roles:
      - client
      - data
      storage:
        size: 10Gi
        storageClassName: gp2
    redundancyPolicy: FullRedundancy
  status:
    cluster:
      activePrimaryShards: 17
      activeShards: 23
      initializingShards: 0
      numDataNodes: 3
      numNodes: 3
      pendingTasks: 0
      relocatingShards: 0
      status: green
      unassignedShards: 0
    clusterHealth: ""
    conditions:
    - lastTransitionTime: "2019-06-25T06:03:57Z"
      message: Previously used GenUUID "jw91ctq6" is no longer found in Spec.Nodes
      reason: Invalid Spec
      status: "True"
      type: InvalidUUID
    nodes:
    - deploymentName: elasticsearch-cdm-jw91ctq6-1
      upgradeStatus: {}
    - deploymentName: elasticsearch-cdm-jw91ctq6-2
      upgradeStatus: {}
    - deploymentName: elasticsearch-cdm-jw91ctq6-3
      upgradeStatus: {}
    pods:
      client:
        failed: []
        notReady: []
        ready:
        - elasticsearch-cdm-jw91ctq6-1-fbbd7bfc-nglll
        - elasticsearch-cdm-jw91ctq6-2-564f89f647-bhtvm
        - elasticsearch-cdm-jw91ctq6-3-86dbf67c7-bhwvg
      data:
        failed: []
        notReady: []
        ready:
        - elasticsearch-cdm-jw91ctq6-1-fbbd7bfc-nglll
        - elasticsearch-cdm-jw91ctq6-2-564f89f647-bhtvm
        - elasticsearch-cdm-jw91ctq6-3-86dbf67c7-bhwvg
      master:
        failed: []
        notReady: []
        ready:
        - elasticsearch-cdm-jw91ctq6-1-fbbd7bfc-nglll
        - elasticsearch-cdm-jw91ctq6-2-564f89f647-bhtvm
        - elasticsearch-cdm-jw91ctq6-3-86dbf67c7-bhwvg
    shardAllocationEnabled: all
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 5 Qiaoling Tang 2019-06-25 06:12:06 UTC
ose-elasticsearch-operator-v4.2.0-201906241432

Comment 6 Qiaoling Tang 2019-06-25 06:13:21 UTC
ose-cluster-logging-operator-v4.2.0-201906241832

Comment 7 ewolinet 2019-06-25 14:07:05 UTC
Out of curiosity, how much time did you wait for after the initial creation of the clusterlogging object before updating the elasticsearch node count?

Comment 8 Qiaoling Tang 2019-06-26 02:13:46 UTC
In c4, it was about several hours. The ES cluster was in green status before I updated the elasticsearch node count.

It can also be reproduced by: creating the clusterlogging instance, waiting for the ES cluster to be in green status, then updating the es node count.
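
A quick way to confirm the green status before updating the node count (the field path matches the status block in comment 4):

$ oc -n openshift-logging get elasticsearch/elasticsearch -o jsonpath='{.status.cluster.status}{"\n"}'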

Comment 9 ewolinet 2019-06-26 16:56:23 UTC
I can recreate this; looking into why CLO is dropping the UUIDs.
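
For reference, a hedged way to check whether the per-node UUIDs survive the update (the genUUID field name is an assumption based on the GenUUID reference in the error message; the jsonpath is illustrative):

$ oc -n openshift-logging get elasticsearch/elasticsearch -o jsonpath='{range .spec.nodes[*]}{.genUUID}{" "}{.nodeCount}{"\n"}{end}'

If the node entry that previously carried genUUID "jw91ctq6" comes back with an empty value after the scale-up, CLO has dropped the UUID as described.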

Comment 12 Qiaoling Tang 2019-07-24 02:24:58 UTC
Verified in ose-cluster-logging-operator-v4.2.0-201907222219

Comment 14 errata-xmlrpc 2019-10-16 06:29:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

