Description of problem:
The CLO can't update the elasticsearch CR successfully after deleting the nodeSelector settings in the clusterlogging CR.

Version-Release number of selected component (if applicable):
quay.io/openshift/origin-cluster-logging-operator@sha256:949ee74661a3bac7d08084d01ce1375ff51a04f97d28ff59d7e35f49e5065a15
4.0.0-0.nightly-2019-04-05-165550

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging using https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/logging/clusterlogging/nodeSelector.yaml to create the clusterlogging instance, setting a nodeSelector for curator, elasticsearch, fluentd and kibana (CEFK). The EFK pods can't start because no node matches those nodeSelectors:

$ oc get cj -o yaml |grep nodeSelector -A 1
          nodeSelector:
            logging: curator
$ oc get pod elasticsearch-cdm-pu2hu2uo-1-69c777898b-dbcf9 -o yaml |grep nodeSelector -A 1
  nodeSelector:
    logging: es
$ oc get pod kibana-786db86ff8-nqcgt -o yaml |grep nodeSelector -A 1
  nodeSelector:
    logging: kibana
$ oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR     AGE
fluentd   0         0         0       0            0           logging=fluentd   101s

2. Delete the nodeSelector settings in the clusterlogging CR (a hedged patch example is included under Additional info below).

3. Check the pods. The kibana and fluentd pods are redeployed without a nodeSelector, but the ES pod isn't; the CLO pod logs show the CLO attempting to update the elasticsearch CR, but without success:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-595948cb68-6klkt       1/1     Running   0          8m18s
curator-1554883200-8rk7r                        1/1     Running   0          104s
elasticsearch-cdm-pu2hu2uo-1-69c777898b-dbcf9   0/2     Pending   0          7m38s
fluentd-54rwb                                   1/1     Running   0          4m58s
fluentd-gjgvn                                   1/1     Running   0          4m58s
fluentd-kp576                                   1/1     Running   0          4m58s
fluentd-l7bks                                   1/1     Running   0          4m58s
fluentd-qn5pc                                   1/1     Running   0          4m58s
fluentd-wk87z                                   1/1     Running   0          4m58s
kibana-f5db689bd-vb5rb                          2/2     Running   0          5m4s

$ oc logs cluster-logging-operator-595948cb68-6klkt
time="2019-04-10T07:53:40Z" level=info msg="Go Version: go1.10.3"
time="2019-04-10T07:53:40Z" level=info msg="Go OS/Arch: linux/amd64"
time="2019-04-10T07:53:40Z" level=info msg="operator-sdk Version: 0.0.7"
time="2019-04-10T07:53:40Z" level=info msg="Watching logging.openshift.io/v1, ClusterLogging, openshift-logging, 5000000000"
time="2019-04-10T07:54:09Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:54:10Z" level=error msg="Unable to read file to get contents: open /tmp/_working_dir/kibana-proxy-oauth.secret: no such file or directory"
time="2019-04-10T07:54:12Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:54:14Z" level=info msg="Updating status of Curator"
time="2019-04-10T07:54:17Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:54:26Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:54:46Z" level=info msg="Updating status of Elasticsearch"
time="2019-04-10T07:56:41Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
time="2019-04-10T07:56:45Z" level=info msg="Invalid Kibana nodeSelector change found, updating 'kibana'"
time="2019-04-10T07:56:46Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:56:48Z" level=info msg="Invalid Curator nodeSelector change found, updating 'curator'"
time="2019-04-10T07:56:52Z" level=info msg="Collector nodeSelector change found, updating 'fluentd'"
time="2019-04-10T07:56:53Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:57:01Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
time="2019-04-10T07:57:07Z" level=info msg="Updating status of Kibana for \"instance\""
time="2019-04-10T07:57:12Z" level=info msg="Updating status of Fluentd"
time="2019-04-10T07:57:21Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
------
time="2019-04-10T08:02:31Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"
------
time="2019-04-10T08:08:59Z" level=info msg="Elasticsearch nodeSelector change found, updating 'elasticsearch'"

$ oc get elasticsearch -o yaml |grep nodeSelector -A 1
      nodeSelector:
        logging: es

Actual results:
The elasticsearch CR keeps its nodeSelector and the ES pod stays Pending; the CLO keeps logging "Elasticsearch nodeSelector change found, updating 'elasticsearch'" without the update taking effect.

Expected results:
The elasticsearch CR is updated to drop the nodeSelector and the ES pod is redeployed, as happens for kibana and fluentd.

Additional info:
It is strange that the elasticsearch CR would still have the nodeSelector. Can you provide the full output from `oc get elasticsearch elasticsearch -o yaml`?
The elasticsearch CR is being updated; we can see the generation has changed from 7 to 11:

apiVersion: logging.openshift.io/v1
kind: Elasticsearch
metadata:
  creationTimestamp: 2019-04-11T14:33:35Z
  generation: 11
  name: elasticsearch
  namespace: openshift-logging
  ownerReferences:
  - apiVersion: logging.openshift.io/v1
    controller: true
    kind: ClusterLogging
    name: instance
    uid: c9bdbd30-5c66-11e9-95a3-066c9806bef0
  resourceVersion: "479698"
  selfLink: /apis/logging.openshift.io/v1/namespaces/openshift-logging/elasticsearches/elasticsearch
  uid: c9fc682f-5c66-11e9-8d1c-0e9e59f648a4
spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 200m
        memory: 2Gi
  nodes:
  - genUUID: xc3ifgr1
    nodeCount: 1
    resources: {}
    roles:
    - client
    - data
    - master
    storage:
      size: 20G
      storageClassName: gp2
  redundancyPolicy: ZeroRedundancy
status:
  clusterHealth: green
  conditions: []
  nodes:
  - deploymentName: elasticsearch-cdm-xc3ifgr1-1
    upgradeStatus: {}
  pods:
    client:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    data:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
    master:
      failed: []
      notReady: []
      ready:
      - elasticsearch-cdm-xc3ifgr1-1-7d967d6855-qzjct
  shardAllocationEnabled: all
The node selector is still on the elasticsearch CR... This makes sense since we see in the CLO logs that it keeps evaluating that the CR nodeSelector isn't what it thinks it should be. I'll try to recreate this as well.

spec:
  managementState: Managed
  nodeSpec:
    image: quay.io/openshift/origin-logging-elasticsearch5:latest
    nodeSelector:
      logging: es
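One hedged way to watch this from the CLI (the jsonpath fields are taken from the CR layout shown in the previous comment, otherwise an assumption): the generation keeps climbing while spec.nodeSpec.nodeSelector never goes away.

# poll the generation and the nodeSelector together every 10s
$ while true; do oc -n openshift-logging get elasticsearch elasticsearch -o jsonpath='{.metadata.generation} {.spec.nodeSpec.nodeSelector}{"\n"}'; sleep 10; done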
I noticed that if I scale down the EO before making a change to the ES CR, the nodeSelector field no longer persists. Also, bumping the sdk (https://github.com/openshift/elasticsearch-operator/pull/122) resolves this issue. However, we then run into https://bugzilla.redhat.com/show_bug.cgi?id=1699015, which Joseph is currently working on a fix for.
Moving to 4.2 as this can be worked around by (a hedged command sketch follows below):
* Scaling down the EO
* Waiting for the CLO to update the nodeSelector
* Scaling the EO back up
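A minimal sketch of that workaround, assuming the elasticsearch-operator runs as the elasticsearch-operator deployment in the openshift-operators-redhat namespace (both names are assumptions; adjust them for your install):

# 1. Scale the EO down so it stops re-applying its view of the elasticsearch CR
$ oc -n openshift-operators-redhat scale deployment/elasticsearch-operator --replicas=0
# 2. Wait for the CLO to drop the nodeSelector from the elasticsearch CR
$ oc -n openshift-logging get elasticsearch elasticsearch -o yaml |grep nodeSelector -A 1
# 3. Scale the EO back up
$ oc -n openshift-operators-redhat scale deployment/elasticsearch-operator --replicas=1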
Verified in ose-elasticsearch-operator-v4.2.0-201906241432 and ose-cluster-logging-operator-v4.2.0-201906241832
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922