Bug 1698377 - Couldn't update elasticsearch CR successfully after deleting nodeSelector in clusterlogging CR.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: ewolinet
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-10 09:24 UTC by Qiaoling Tang
Modified: 2019-10-16 06:28 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:05 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:28:22 UTC)

Description Qiaoling Tang 2019-04-10 09:24:12 UTC
Description of problem:
The CLO (cluster-logging-operator) can't update the elasticsearch CR successfully after the nodeSelector settings are deleted from the clusterlogging CR.

Version-Release number of selected component (if applicable):
quay.io/openshift/origin-cluster-logging-operator@sha256:949ee74661a3bac7d08084d01ce1375ff51a04f97d28ff59d7e35f49e5065a15
4.0.0-0.nightly-2019-04-05-165550

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging using https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/logging/clusterlogging/nodeSelector.yaml to create the clusterlogging instance, setting a nodeSelector for each of the CEFK components (curator, Elasticsearch, Fluentd, Kibana). The EFK pods can't start because no node matches the nodeSelectors:

$ oc get cj -o yaml |grep nodeSelector -A 1
            nodeSelector:
              logging: curator
$ oc get pod elasticsearch-cdm-pu2hu2uo-1-69c777898b-dbcf9 -o yaml |grep nodeSelector -A 1
  nodeSelector:
    logging: es
$ oc get pod kibana-786db86ff8-nqcgt -o yaml |grep nodeSelector -A 1
  nodeSelector:
    logging: kibana
$ oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR     AGE
fluentd   0         0         0       0            0           logging=fluentd   101s
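
For completeness, the same selectors can also be checked on the ClusterLogging CR itself. This check is a suggestion rather than part of the original report; the CR name `instance` matches the ownerReferences shown in comment 2:

$ oc get clusterlogging instance -n openshift-logging -o yaml | grep nodeSelector -A 1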

2. Delete the nodeSelector settings in the clusterlogging CR (one way to do this is sketched below).
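
The report only says the settings were deleted; as a sketch, the fields can be removed by editing the CR interactively or with a JSON patch. The field path below assumes the usual ClusterLogging spec layout (e.g. spec.logStore.elasticsearch.nodeSelector) and is not taken from the report:

$ oc edit clusterlogging instance -n openshift-logging

# or, for the Elasticsearch selector (repeat with the matching path for kibana, curator and fluentd):
$ oc patch clusterlogging instance -n openshift-logging --type=json \
    -p '[{"op":"remove","path":"/spec/logStore/elasticsearch/nodeSelector"}]'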

3. Check the pods: the Kibana and Fluentd pods are redeployed without the nodeSelector, but the ES pod is not. The CLO pod logs show the CLO repeatedly attempting to update the elasticsearch CR without success:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-595948cb68-6klkt       1/1     Running   0          8m18s
curator-1554883200-8rk7r                        1/1     Running   0          104s
elasticsearch-cdm-pu2hu2uo-1-69c777898b-dbcf9   0/2     Pending   0          7m38s
fluentd-54rwb                                   1/1     Running   0          4m58s
fluentd-gjgvn                                   1/1     Running   0          4m58s
fluentd-kp576                                   1/1     Running   0          4m58s
fluentd-l7bks                                   1/1     Running   0          4m58s
fluentd-qn5pc                                   1/1     Running   0          4m58s
fluentd-wk87z                                   1/1     Running   0          4m58s
kibana-f5db689bd-vb5rb                          2/2     Running   0          5m4s

$ oc logs cluster-logging-operator-595948cb68-6klkt
time="2019-04-10T07:53:40Z" level=info msg="Go Version: go1.10.3"
time="2019-04-10T07:53:40Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-04-10T07:53:40Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-04-10T07:53:40Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-04-10T07:54:09Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:54:10Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-04-10T07:54:12Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:54:14Z" level=info msg="Updating status of Curator"
time="2019-04-10T07:54:17Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:54:26Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:54:46Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:56:41Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
time="2019-04-10T07:56:45Z" level=info msg="Invalid Kibana nodeSelector change found, updating 'kibana'"
time="2019-04-10T07:56:46Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:56:48Z" level=info msg="Invalid Curator nodeSelector change found, updating 'curator'"
time="2019-04-10T07:56:52Z" level=info msg="Collector nodeSelector change found, updating 'fluentd'"
time="2019-04-10T07:56:53Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:57:01Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
time="2019-04-10T07:57:07Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:57:12Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:57:21Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
------
time="2019-04-10T08:02:31Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
------
time="2019-04-10T08:08:59Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"

$ oc get elasticsearch -o yaml |grep nodeSelector -A 1
      nodeSelector:
        logging: es

Actual results:
The nodeSelector is never removed from the elasticsearch CR, and the ES pod stays Pending.

Expected results:
The elasticsearch CR is updated to drop the nodeSelector, and the ES pod is redeployed and scheduled onto a node.


Additional info:

Comment 1 ewolinet 2019-04-11 14:18:30 UTC
It is strange that the elasticsearch CR would still have the nodeSelector.
Can you provide the full output from `oc get elasticsearch elasticsearch -o yaml`?

Comment 2 Anping Li 2019-04-11 14:45:00 UTC
The elasticsearch CR is being updated; we can see the generation has changed from 7 to 11, yet the nodeSelector is still present:

apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: 2019-04-11T14:33:35Z
  generation: 11
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: c9bdbd30-5c66-11e9-95a3-066c9806bef0
  resourceVersion: "479698"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: c9fc682f-5c66-11e9-8d1c-0e9e59f648a4
spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 200m
        memory: 2Gi
  nodes:
  - genUUID: xc3ifgr1
    nodeCount: 1
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20G
      storageClassName: gp2
  redundancyPolicy: ZeroRedundancy
status:
  clusterHealth: green
  conditions: []
  nodes:
  - deploymentName: elasticsearch-cdm-xc3ifgr1-1
    upgradeStatus: {}
  pods:
    client:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    data:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    master:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
  shardAllocationEnabled: all
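
One way to watch these generation bumps as they happen (a suggested diagnostic, not taken from the original report):

# prints the generation on every change to the CR; Ctrl-C to stop
$ oc get elasticsearch elasticsearch -n openshift-logging --watch \
    -o custom-columns=GENERATION:.metadata.generation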

Comment 3 ewolinet 2019-04-11 22:07:07 UTC
The node selector is still on the elasticsearch CR...
This is consistent with the CLO logs: the CLO keeps detecting that the CR's nodeSelector isn't what it expects and keeps trying to update it.

I'll try to recreate this as well.

spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es

Comment 4 ewolinet 2019-04-15 16:23:40 UTC
I noticed that if I scale down the EO (elasticsearch-operator) before making a change to the ES CR, the stale nodeSelector field no longer persists. Also, bumping the operator-sdk (https://github.com/openshift/elasticsearch-operator/pull/122) resolves this issue. However, we then run into https://bugzilla.redhat.com/show_bug.cgi?id=1699015, which Joseph is currently working on a fix for.

Comment 5 Jeff Cantrill 2019-04-16 16:23:00 UTC
Moving to 4.2, as this can be worked around by the following (see the command sketch after this list):

* Scaling down the EO
* Waiting for the CLO to update the nodeSelector
* Scaling the EO back up
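
Roughly, that translates to something like the commands below. The deployment name and namespace are assumptions (the EO deployment is typically named elasticsearch-operator; the namespace may differ per install), not taken from the report:

# stop the EO so the CLO's update to the elasticsearch CR can go through
$ oc scale deployment/elasticsearch-operator -n openshift-operators --replicas=0

# confirm the CLO has dropped the nodeSelector from the elasticsearch CR
$ oc get elasticsearch elasticsearch -n openshift-logging -o yaml | grep nodeSelector -A 1

# bring the EO back
$ oc scale deployment/elasticsearch-operator -n openshift-operators --replicas=1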

Comment 8 Qiaoling Tang 2019-06-25 06:39:48 UTC
Verified in ose-elasticsearch-operator-v4.2.0-201906241432 and ose-cluster-logging-operator-v4.2.0-201906241832

Comment 9 errata-xmlrpc 2019-10-16 06:28:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

