Bug 1698377
| Summary: | Couldn't update elasticsearch CR successfully after deleting nodeSelector in clusterlogging CR. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qiaoling Tang <qitang> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | anli, aos-bugs, ewolinet, jcantril, rmeggins |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:28:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Qiaoling Tang
2019-04-10 09:24:12 UTC
It is strange that the elasticsearch CR would still have the nodeSelector. Can you provide the full output from `oc get elasticsearch elasticsearch -o yaml`?

The elasticsearch CR was updated; we can see that the generation has changed from 7 to 11:
```yaml
apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: 2019-04-11T14:33:35Z
  generation: 11
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: c9bdbd30-5c66-11e9-95a3-066c9806bef0
  resourceVersion: "479698"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: c9fc682f-5c66-11e9-8d1c-0e9e59f648a4
spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 200m
        memory: 2Gi
  nodes:
  - genUUID: xc3ifgr1
    nodeCount: 1
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20G
      storageClassName: gp2
  redundancyPolicy: ZeroRedundancy
status:
  clusterHealth: green
  conditions: []
  nodes:
  - deploymentName: elasticsearch-cdm-xc3ifgr1-1
    upgradeStatus: {}
  pods:
    client:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    data:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    master:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
  shardAllocationEnabled: all
```
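For quick re-checks without dumping the whole CR, a jsonpath query along these lines works (a sketch; the field paths follow the CR shown above):

```sh
# Print the CR generation and the (possibly stale) nodeSelector in one line.
oc -n openshift-logging get elasticsearch elasticsearch \
  -o jsonpath='generation={.metadata.generation} nodeSelector={.spec.nodeSpec.nodeSelector}{"\n"}'
```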
The nodeSelector is still on the elasticsearch CR...

This makes sense, since we see in the CLO logs that it keeps evaluating that the CR's nodeSelector isn't what it thinks it should be.

I'll try to recreate this as well.
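A minimal reproduction sketch follows, assuming the selector sits at `.spec.logStore.elasticsearch.nodeSelector` in the ClusterLogging CR (that path is an assumption here, not confirmed by this report):

```sh
# Remove the nodeSelector from the clusterlogging CR.
# NOTE: the JSON-pointer path below is an assumption; adjust it to wherever
# the nodeSelector actually lives in your ClusterLogging spec.
oc -n openshift-logging patch clusterlogging instance --type=json \
  -p '[{"op": "remove", "path": "/spec/logStore/elasticsearch/nodeSelector"}]'

# Then watch whether the Elasticsearch CR ever drops the selector.
oc -n openshift-logging get elasticsearch elasticsearch \
  -o jsonpath='{.spec.nodeSpec.nodeSelector}{"\n"}'
```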
```yaml
spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es
```
I noticed that if I scale down the EO before making a change to the ES CR, we don't have the issue of the nodeSelector field persisting. Also, bumping the sdk (https://github.com/openshift/elasticsearch-operator/pull/122) resolves this issue. However, then we run into https://bugzilla.redhat.com/show_bug.cgi?id=1699015, which Joseph is currently working on a fix for.

Moving to 4.2, as this can be worked around by the following steps (see the command sketch at the end of this report):

* Scaling down the EO
* Waiting for the CLO to update the nodeSelector
* Scaling the EO back up

Verified in ose-elasticsearch-operator-v4.2.0-201906241432 and ose-cluster-logging-operator-v4.2.0-201906241832.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
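For reference, a sketch of the workaround steps above in `oc` commands. The namespace and deployment name are assumptions (the elasticsearch-operator commonly runs in `openshift-operators-redhat` on 4.x); adjust to your install:

```sh
# 1. Scale down the EO so it stops overwriting the Elasticsearch CR.
#    Namespace and deployment name are assumptions; adjust to your install.
oc -n openshift-operators-redhat scale deployment/elasticsearch-operator --replicas=0

# 2. Wait for the CLO to update the Elasticsearch CR; an empty result here
#    means the nodeSelector has been removed.
oc -n openshift-logging get elasticsearch elasticsearch \
  -o jsonpath='{.spec.nodeSpec.nodeSelector}{"\n"}'

# 3. Scale the EO back up.
oc -n openshift-operators-redhat scale deployment/elasticsearch-operator --replicas=1
```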