Bug 1832656 - The EO removes all the node selector configurations in the ES cluster.
Summary: The EO removes all the node selector configurations in the ES cluster.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Vimal Kumar
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-07 02:52 UTC by Qiaoling Tang
Modified: 2020-07-13 17:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:35:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:35:57 UTC

Description Qiaoling Tang 2020-05-07 02:52:18 UTC
Description of problem:
Deploy clusterlogging, set node selector for the EFK pods:

$ oc get clusterlogging -oyaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: ClusterLogging
  metadata:
    creationTimestamp: "2020-05-07T02:20:20Z"
    generation: 1
    name: instance
    namespace: openshift-logging
    resourceVersion: "77871"
    selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/clusterloggings/instance
    uid: 2b90879f-6bbf-46f2-9d0c-d3135405af54
  spec:
    collection:
      logs:
        fluentd:
          nodeSelector:
            logging: test
        type: fluentd
    logStore:
      elasticsearch:
        nodeCount: 3
        nodeSelector:
          logging: test
        redundancyPolicy: SingleRedundancy
        resources:
          requests:
            memory: 2Gi
        storage:
          size: 20Gi
          storageClassName: standard
      retentionPolicy:
        application:
          maxAge: 1d
        audit:
          maxAge: 1w
        infra:
          maxAge: 7d
      type: elasticsearch
    managementState: Managed
    visualization:
      kibana:
        nodeSelector:
          logging: test
        replicas: 1
      type: kibana

No nodes in the cluster have the label `logging=test`, so all the ES pods are pending due to the node selector mismatch.
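
For reference, one way to confirm that no node carries the label (the output shown is illustrative):

$ oc get nodes -l logging=test
No resources found.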

$ oc get elasticsearch -oyaml
apiVersion: v1
items:
- apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    annotations:
      elasticsearch.openshift.io/loglevel: trace
    creationTimestamp: "2020-05-07T02:20:27Z"
    generation: 3
   ......
    managementState: Managed
    nodeSpec:
      nodeSelector:
        logging: test
      resources:
        requests:
          memory: 2Gi
    nodes:
    - genUUID: wpykay58
      nodeCount: 3
      resources: {}
      roles:
      - client
      - data
      - master
      storage:
        size: 20Gi
        storageClassName: standard
    redundancyPolicy: SingleRedundancy

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-75774d56b6-47x6c       1/1     Running   0          3m19s
elasticsearch-cdm-wpykay58-1-dfc8977c5-mhzwh    0/2     Pending   0          3m1s
elasticsearch-cdm-wpykay58-2-5f4c9fdb5d-n8hsk   0/2     Pending   0          2m
elasticsearch-cdm-wpykay58-3-56985bc445-m4dxg   0/2     Pending   0          59s
kibana-797d5b7f99-mwmtg                         2/2     Running   0          3m1s
$ oc get deploy
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator       1/1     1            1           3m24s
elasticsearch-cdm-wpykay58-1   0/1     1            0           3m6s
elasticsearch-cdm-wpykay58-2   0/1     1            0           2m5s
elasticsearch-cdm-wpykay58-3   0/1     1            0           64s
kibana                         1/1     1            1           3m6s

$ oc get deploy -l cluster-name=elasticsearch -oyaml |grep -A 5 nodeSelector
              f:nodeSelector:
                .: {}
                f:kubernetes.io/os: {}
                f:logging: {}
              f:restartPolicy: {}
              f:schedulerName: {}
--
        nodeSelector:
          kubernetes.io/os: linux
          logging: test
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
--
              f:nodeSelector:
                .: {}
                f:kubernetes.io/os: {}
                f:logging: {}
              f:restartPolicy: {}
              f:schedulerName: {}
--
        nodeSelector:
          kubernetes.io/os: linux
          logging: test
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
--
              f:nodeSelector:
                .: {}
                f:kubernetes.io/os: {}
                f:logging: {}
              f:restartPolicy: {}
              f:schedulerName: {}
--
        nodeSelector:
          kubernetes.io/os: linux
          logging: test
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}

However, a few minutes later, the EO removes all the nodeSelectors, and the ES pods are redeployed without any node selector.

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-75774d56b6-47x6c       1/1     Running   0          15m
elasticsearch-cdm-wpykay58-1-779dd794ff-d5qmg   2/2     Running   0          8m35s
elasticsearch-cdm-wpykay58-2-5f4f5884fd-xchxg   2/2     Running   0          8m34s
elasticsearch-cdm-wpykay58-3-5f586cd8d9-tpxvf   2/2     Running   0          8m33s
elasticsearch-delete-app-1588818600-vg7cq       0/1     Pending   0          5m35s
elasticsearch-delete-audit-1588818600-s4bqj     0/1     Pending   0          5m35s
elasticsearch-delete-infra-1588818600-7782l     0/1     Pending   0          5m35s
elasticsearch-rollover-app-1588818600-5dnxh     0/1     Pending   0          5m35s
elasticsearch-rollover-audit-1588818600-dkws4   0/1     Pending   0          5m35s
elasticsearch-rollover-infra-1588818600-5dx99   0/1     Pending   0          5m35s
kibana-797d5b7f99-mwmtg                         2/2     Running   0          15m

The EO removes the node selectors in the ES deployments, including the default nodeSelector `kubernetes.io/os: linux`, but the node selectors in the elasticsearch instance are not removed.
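
A quick way to confirm that the rendered deployments no longer carry any node selector (a minimal check reusing the label selector from above; the jsonpath output is expected to be empty for each deployment):

$ oc get deploy -l cluster-name=elasticsearch -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.nodeSelector}{"\n"}{end}'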

Logs in the EO:
$ oc logs -n openshift-operators-redhat elasticsearch-operator-f997486f5-z6wkp 
{"level":"info","ts":1588818013.5684557,"logger":"cmd","msg":"Go Version: go1.13.8"}
{"level":"info","ts":1588818013.568499,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1588818013.568506,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1588818013.5690854,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1588818013.8786685,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1588818013.8938189,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1588818014.020002,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1588818014.0205996,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"kibana-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1588818014.020811,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"elasticsearch-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1588818014.021117,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"proxyconfig-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1588818014.0212927,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"kibanasecret-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1588818014.0215733,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trustedcabundle-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1588818014.1484535,"logger":"metrics","msg":"Metrics Service object created","Service.Name":"elasticsearch-operator","Service.Namespace":"openshift-operators-redhat"}
{"level":"info","ts":1588818014.1484966,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1588818015.2488885,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"kibana-controller"}
{"level":"info","ts":1588818015.2489219,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"proxyconfig-controller"}
{"level":"info","ts":1588818015.2488775,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1588818015.248872,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trustedcabundle-controller"}
{"level":"info","ts":1588818015.2489371,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"kibanasecret-controller"}
{"level":"info","ts":1588818015.3491592,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
{"level":"info","ts":1588818015.349234,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
{"level":"info","ts":1588818015.3492274,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibanasecret-controller","worker count":1}
{"level":"info","ts":1588818015.3492675,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trustedcabundle-controller","worker count":1}
{"level":"info","ts":1588818015.3491406,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-05-07T02:20:27Z" level=error msg="Operator unable to read local file to get contents: open /tmp/ocp-eo/ca.crt: no such file or directory"
time="2020-05-07T02:20:27Z" level=error msg="Operator unable to read local file to get contents: open /tmp/ocp-eo/ca.crt: no such file or directory"
{"level":"error","ts":1588818028.1739502,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"kibana-controller","request":"openshift-logging/instance","error":"Did not receive hashvalue for trusted CA value","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-05-07T02:20:28Z" level=info msg="Updating status of Kibana"
time="2020-05-07T02:20:29Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:20:29Z" level=info msg="Updating status of Kibana"
time="2020-05-07T02:20:29Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:20:29Z" level=info msg="Updating status of Kibana"
time="2020-05-07T02:20:29Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:20:59Z" level=info msg="Updating status of Kibana"
time="2020-05-07T02:20:59Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:20:59Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:21:29Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.136.171:9200: i/o timeout\r\n"
time="2020-05-07T02:21:29Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:21:59Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:22:29Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:22:30Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.136.171:9200: i/o timeout\r\n"
time="2020-05-07T02:22:59Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:23:30Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:23:31Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.136.171:9200: i/o timeout\r\n"
time="2020-05-07T02:24:00Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:24:30Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:25:00Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:25:30Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:26:00Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:26:30Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:26:31Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: Get https://elasticsearch.openshift-logging.svc:9200/_template: dial tcp 172.30.136.171:9200: i/o timeout"
time="2020-05-07T02:27:00Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:27:01Z" level=error msg="Error creating index template for mapping app: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.136.171:9200: i/o timeout"
{"level":"error","ts":1588818421.5804315,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.136.171:9200: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-05-07T02:27:03Z" level=info msg="Requested to update node 'elasticsearch-cdm-wpykay58-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-05-07T02:27:04Z" level=info msg="Requested to update node 'elasticsearch-cdm-wpykay58-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-05-07T02:27:05Z" level=info msg="Requested to update node 'elasticsearch-cdm-wpykay58-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-05-07T02:27:31Z" level=info msg="Kibana status successfully updated"
time="2020-05-07T02:27:36Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-wpykay58-1:  / green"
time="2020-05-07T02:27:36Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-wpykay58-1: Cluster not in green state before beginning upgrade: "
time="2020-05-07T02:27:43Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-wpykay58-2:  / green"
time="2020-05-07T02:27:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-wpykay58-2: Cluster not in green state before beginning upgrade: "
time="2020-05-07T02:27:43Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-wpykay58-3:  / green"
time="2020-05-07T02:27:43Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-wpykay58-3: Cluster not in green state before beginning upgrade: "
time="2020-05-07T02:27:44Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: There was an error retrieving list of templates. Error code: true, map[results:Open Distro not initialized]"
time="2020-05-07T02:27:44Z" level=error msg="Error creating index template for mapping app: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]"
{"level":"error","ts":1588818464.3757575,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-05-07T02:27:47Z" level=warning msg="Unable to evaluate the number of replicas for index \"results\": Open Distro not initialized. cluster: elasticsearch, namespace: openshift-logging "
time="2020-05-07T02:27:47Z" level=error msg="Unable to evaluate number of replicas for index"
time="2020-05-07T02:27:47Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: There was an error retrieving list of templates. Error code: true, map[results:Open Distro not initialized]"
time="2020-05-07T02:27:47Z" level=error msg="Error creating index template for mapping app: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]"
{"level":"error","ts":1588818467.5358996,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-05-07T02:27:50Z" level=warning msg="Unable to evaluate the number of replicas for index \"results\": Open Distro not initialized. cluster: elasticsearch, namespace: openshift-logging "
time="2020-05-07T02:27:50Z" level=error msg="Unable to evaluate number of replicas for index"
time="2020-05-07T02:27:50Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: There was an error retrieving list of templates. Error code: true, map[results:Open Distro not initialized]"
time="2020-05-07T02:27:50Z" level=error msg="Error creating index template for mapping app: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]"
{"level":"error","ts":1588818470.975918,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: There was an error creating index template ocp-gen-app. Error code: true, map[results:Open Distro not initialized]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}


Version-Release number of selected component (if applicable):
Logging images are from 4.5.0-0.ci-2020-05-06-225918	
Manifests are copied from the master branch
Cluster version: 4.5.0-0.nightly-2020-05-06-003431

How reproducible:
Always

Steps to Reproduce:
1. deploy logging with:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    retentionPolicy: 
      application:
        maxAge: 1d
      infra:
        maxAge: 7d
      audit:
        maxAge: 1w
    elasticsearch:
      nodeCount: 3
      nodeSelector:
        logging: test
      redundancyPolicy: "SingleRedundancy"
      resources:
        requests:
          memory: "2Gi"
      storage:
        storageClassName: "standard"
        size: "20Gi"
  visualization:
    type: "kibana"
    kibana:
      nodeSelector:
        logging: test
      replicas: 1
  collection:
    logs:
      type: "fluentd"
      fluentd:
        nodeSelector:
          logging: test
Note: no nodes in the cluster have the label `logging=test`.
2. check ES status
3. wait for a few minutes
4. check the ES pods (see the check commands sketched below)
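
A minimal sketch of the checks for steps 2 and 4, reusing the commands from the description (pod names and ages will differ per cluster):

$ oc -n openshift-logging get pods
$ oc -n openshift-logging get deploy -l cluster-name=elasticsearch -oyaml | grep -A 5 nodeSelector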

Actual results:
The EO removes the node selector configurations from the ES deployments.

Expected results:
The EO should not remove the node selectors when a node selector is configured in the clusterlogging instance.

Additional info:

Comment 1 Qiaoling Tang 2020-05-19 09:06:31 UTC
It seems the EO always wants to update the ES cluster when it cannot connect to it, even when the ES cluster health is green.

{"level":"info","ts":1589875076.724568,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibana-controller","worker count":1}
{"level":"info","ts":1589875076.7255478,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"proxyconfig-controller","worker count":1}
{"level":"info","ts":1589875076.7256155,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"kibanasecret-controller","worker count":1}
time="2020-05-19T07:57:57Z" level=info msg="Updating status of Kibana"
time="2020-05-19T07:57:57Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:57:57Z" level=info msg="Updating status of Kibana"
time="2020-05-19T07:57:57Z" level=info msg="Updating status of Kibana"
time="2020-05-19T07:57:57Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:57:57Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:57:57Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:58:01Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-vyhvbuyr-1: yellow / green"
time="2020-05-19T07:58:01Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-1: Cluster not in green state before beginning upgrade: yellow"
time="2020-05-19T07:58:01Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-vyhvbuyr-2: yellow / green"
time="2020-05-19T07:58:01Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-2: Cluster not in green state before beginning upgrade: yellow"
time="2020-05-19T07:58:01Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-vyhvbuyr-3: yellow / green"
time="2020-05-19T07:58:01Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-3: Cluster not in green state before beginning upgrade: yellow"
time="2020-05-19T07:58:05Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 6 shards in preparation for cluster restart"
time="2020-05-19T07:58:27Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:58:35Z" level=info msg="Timed out waiting for node elasticsearch-cdm-vyhvbuyr-1 to rollout"
time="2020-05-19T07:58:35Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-1: timed out waiting for the condition"
time="2020-05-19T07:58:35Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 6 shards in preparation for cluster restart"
time="2020-05-19T07:58:57Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:59:05Z" level=info msg="Timed out waiting for node elasticsearch-cdm-vyhvbuyr-2 to rollout"
time="2020-05-19T07:59:05Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-2: timed out waiting for the condition"
time="2020-05-19T07:59:06Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 6 shards in preparation for cluster restart"
time="2020-05-19T07:59:27Z" level=info msg="Kibana status successfully updated"
time="2020-05-19T07:59:36Z" level=info msg="Timed out waiting for node elasticsearch-cdm-vyhvbuyr-3 to rollout"
time="2020-05-19T07:59:36Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-vyhvbuyr-3: timed out waiting for the condition"
time="2020-05-19T07:59:46Z" level=warning msg="Unable to perform synchronized flush: Failed to flush 6 shards in preparation for cluster restart"
time="2020-05-19T07:59:57Z" level=info msg="Kibana status successfully updated"

Comment 2 Vimal Kumar 2020-05-26 11:14:16 UTC
I have tried to reproduce this bug, but so far have not been able to.

1. create logging instance with CR [1]
2. wait 20 mins
3. check status of logging pods.
vimalkum bug-1832656 $ oc -n openshift-logging get pods
NAME                                                 READY   STATUS    RESTARTS   AGE
cluster-logging-operator-6f7f888684-292sq            1/1     Running   0          32m
cluster-logging-operator-registry-6b94c44598-pmsh8   1/1     Running   0          33m
elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw         0/2     Pending   0          20m
kibana-6f74f6c49b-6hdsx                              0/2     Pending   0          20m

Elasticsearch is not deployed if the node selector does not match any node.



[1] Elasticsearch resource created from the deployed logging CR:
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-05-26T10:49:03Z"
  generation: 4
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 6e13c27f-70ef-4d20-8030-20e5d764171a
  resourceVersion: "326088"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 764c5c34-96eb-4abf-911b-5c8c8e0eb5b4
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs-app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs-infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs-audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 1d
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 1w
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    nodeSelector:
      logging: test
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: y4u7ur3g
    nodeCount: 1
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: ZeroRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  clusterHealth: ""
  conditions: []
  nodes:
  - conditions:
    - lastTransitionTime: "2020-05-26T10:49:04Z"
      message: '0/1 nodes are available: 1 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-y4u7ur3g-1
    upgradeStatus: {}
  pods:
    client:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
    data:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
    master:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
  shardAllocationEnabled: shard allocation unknown


$ oc -n openshift-logging describe Elasticsearches/elasticsearch 

Name:         elasticsearch
Namespace:    openshift-logging
Labels:       <none>
Annotations:  <none>
API Version:  logging.openshift.io/v1
Kind:         Elasticsearch
Metadata:
  Creation Timestamp:  2020-05-26T10:49:03Z
  Generation:          4
  Owner References:
    API Version:     logging.openshift.io/v1
    Controller:      true
    Kind:            ClusterLogging
    Name:            instance
    UID:             6e13c27f-70ef-4d20-8030-20e5d764171a
  Resource Version:  326088
  Self Link:         /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  UID:               764c5c34-96eb-4abf-911b-5c8c8e0eb5b4
Spec:
  Index Management:
    Mappings:
      Aliases:
        app
        logs-app
      Name:        app
      Policy Ref:  app-policy
      Aliases:
        infra
        logs-infra
      Name:        infra
      Policy Ref:  infra-policy
      Aliases:
        audit
        logs-audit
      Name:        audit
      Policy Ref:  audit-policy
    Policies:
      Name:  app-policy
      Phases:
        Delete:
          Min Age:  1d
        Hot:
          Actions:
            Rollover:
              Max Age:  1h
      Poll Interval:    15m
      Name:             infra-policy
      Phases:
        Delete:
          Min Age:  7d
        Hot:
          Actions:
            Rollover:
              Max Age:  8h
      Poll Interval:    15m
      Name:             audit-policy
      Phases:
        Delete:
          Min Age:  1w
        Hot:
          Actions:
            Rollover:
              Max Age:  1h
      Poll Interval:    15m
  Management State:     Managed
  Node Spec:
    Node Selector:
      Logging:  test
    Resources:
      Requests:
        Memory:  2Gi
  Nodes:
    Gen UUID:    y4u7ur3g
    Node Count:  1
    Resources:
    Roles:
      client
      data
      master
    Storage:
      Size:                20Gi
      Storage Class Name:  standard
  Redundancy Policy:       ZeroRedundancy
Status:
  Cluster:
    Active Primary Shards:  0
    Active Shards:          0
    Initializing Shards:    0
    Num Data Nodes:         0
    Num Nodes:              0
    Pending Tasks:          0
    Relocating Shards:      0
    Status:                 cluster health unknown
    Unassigned Shards:      0
  Cluster Health:           
  Conditions:
  Nodes:
    Conditions:
      Last Transition Time:  2020-05-26T10:49:04Z
      Message:               0/1 nodes are available: 1 node(s) didn't match node selector.
      Reason:                Unschedulable
      Status:                True
      Type:                  Unschedulable
    Deployment Name:         elasticsearch-cdm-y4u7ur3g-1
    Upgrade Status:
  Pods:
    Client:
      Failed:
      Not Ready:
        elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      Ready:
    Data:
      Failed:
      Not Ready:
        elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      Ready:
    Master:
      Failed:
      Not Ready:
        elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      Ready:
  Shard Allocation Enabled:  shard allocation unknown
Events:                      <none>

Comment 3 Vimal Kumar 2020-05-26 11:16:42 UTC
[1] logging CR deployed
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    retentionPolicy: 
      application:
        maxAge: 1d
      infra:
        maxAge: 7d
      audit:
        maxAge: 1w
    elasticsearch:
      nodeCount: 1
      nodeSelector:
        logging: test
      redundancyPolicy: "ZeroRedundancy"
      resources:
        requests:
          memory: "2Gi"
      storage:
        storageClassName: "standard"
        size: "20Gi"
  visualization:
    type: "kibana"
    kibana:
      nodeSelector:
        logging: test
      replicas: 1
  collection:
    logs:
      type: "fluentd"
      fluentd:
        nodeSelector:
          logging: test



$ oc -n openshift-logging get Elasticsearches/elasticsearch -o yaml
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: "2020-05-26T10:49:03Z"
  generation: 4
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: 6e13c27f-70ef-4d20-8030-20e5d764171a
  resourceVersion: "326088"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: 764c5c34-96eb-4abf-911b-5c8c8e0eb5b4
spec:
  indexManagement:
    mappings:
    - aliases:
      - app
      - logs-app
      name: app
      policyRef: app-policy
    - aliases:
      - infra
      - logs-infra
      name: infra
      policyRef: infra-policy
    - aliases:
      - audit
      - logs-audit
      name: audit
      policyRef: audit-policy
    policies:
    - name: app-policy
      phases:
        delete:
          minAge: 1d
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
    - name: infra-policy
      phases:
        delete:
          minAge: 7d
        hot:
          actions:
            rollover:
              maxAge: 8h
      pollInterval: 15m
    - name: audit-policy
      phases:
        delete:
          minAge: 1w
        hot:
          actions:
            rollover:
              maxAge: 1h
      pollInterval: 15m
  managementState: Managed
  nodeSpec:
    nodeSelector:
      logging: test
    resources:
      requests:
        memory: 2Gi
  nodes:
  - genUUID: y4u7ur3g
    nodeCount: 1
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20Gi
      storageClassName: standard
  redundancyPolicy: ZeroRedundancy
status:
  cluster:
    activePrimaryShards: 0
    activeShards: 0
    initializingShards: 0
    numDataNodes: 0
    numNodes: 0
    pendingTasks: 0
    relocatingShards: 0
    status: cluster health unknown
    unassignedShards: 0
  clusterHealth: ""
  conditions: []
  nodes:
  - conditions:
    - lastTransitionTime: "2020-05-26T10:49:04Z"
      message: '0/1 nodes are available: 1 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-y4u7ur3g-1
    upgradeStatus: {}
  pods:
    client:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
    data:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
    master:
      failed: []
      notReady:
      - elasticsearch-cdm-y4u7ur3g-1-8767dcb78-z5rsw
      ready: []
  shardAllocationEnabled: shard allocation unknown

Comment 4 Vimal Kumar 2020-05-26 11:35:28 UTC
As soon as the label logging=test is added to a node, the logging components are deployed.
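
For reference, the label can be applied to a node with (the node name is a placeholder):

$ oc label node <node-name> logging=test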

Comment 5 Qiaoling Tang 2020-05-27 01:48:17 UTC
I'm not able to reproduce this issue either. I'll close it.

Please feel free to reopen it if you can reproduce it.

Comment 7 errata-xmlrpc 2020-07-13 17:35:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

