Bug 1942609

Summary:	Setting up Kibana and Elasticsearch replica to 0, Kibana pods are created and indexmanagement jobs
Product:	OpenShift Container Platform	Reporter:	Oscar Casal Sanchez <ocasalsa>
Component:	Logging	Assignee:	ewolinet
Status:	CLOSED ERRATA	QA Contact:	Anping Li <anli>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.6	CC:	aos-bugs, ewolinet, gkarager, qitang
Target Milestone:	---
Target Release:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	logging-exploration
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: CLO would update the number of replicas based on some assumptions on the CL/instance object. Consequence: We would incorrectly set the number of kibana replicas as part of creating the kibana CR object. Fix: We correctly evaluate if the number of replicas aren't specified in the CL/instance object for Kibana. And will default to 0 if not provided. Result: We can set replicas as 0 for Kibana in the CL/instance object and that value will be passed on to the kibana CR object instead of overriding it to be 1.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-06-29 06:30:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Oscar Casal Sanchez 2021-03-24 15:55:15 UTC

[Description of problem]

When the CLO instance is defined with Kibana replica to 0 like this:

~~~
$ oc -n openshift-logging get clusterlogging instance -o yaml
...
spec:
  collection:
    logs:
      fluentd:
        resources: {}
      type: fluentd
  logStore:
    elasticsearch:
      nodeCount: 0
      redundancyPolicy: ZeroRedundancy
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi
      storage: {}
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 0
      resources:
        limits:
          memory: 512Mi
        requests:
          cpu: 500m
          memory: 512Mi
    type: kibana
~~~

After that, checking the pods created, it's visible that a Kibana pod is running and the indexmanagement jobs are running:

~~~
$ oc get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-logging-operator-565d4fd5bc-czg8h   1/1     Running   0          11m
elasticsearch-im-app-1616600700-zhncg       0/1     Error     0          3m39s
elasticsearch-im-audit-1616600700-44wzc     0/1     Error     0          3m37s
elasticsearch-im-infra-1616600700-jbk65     0/1     Error     0          3m30s
fluentd-4bj6g                               1/1     Running   0          10m
fluentd-82zd5                               1/1     Running   0          10m
fluentd-h2dbk                               1/1     Running   0          10m
fluentd-lwcbw                               1/1     Running   0          10m
fluentd-mtwx4                               1/1     Running   0          10m
fluentd-sz7gh                               1/1     Running   0          10m
kibana-84cd8747c5-s9mct                     2/2     Running   0          9m32s
~~~

[Version-Release number of selected component (if applicable):]
OCP 4.6

~~~
$ oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES   PHASE
clusterlogging.4.6.0-202103060451.p0           Cluster Logging                    4.6.0-202103060451.p0              Succeeded
elasticsearch-operator.4.6.0-202103060018.p0   OpenShift Elasticsearch Operator   4.6.0-202103060018.p0              Succeeded
~~~


[How reproducible:]
Always

[Steps to Reproduce]
1. Define the CLO as:

~~~
$ oc -n openshift-logging get clusterlogging instance -o yaml
...
spec:
  collection:
    logs:
      fluentd:
        resources: {}
      type: fluentd
  logStore:
    elasticsearch:
      nodeCount: 0
      redundancyPolicy: ZeroRedundancy
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi
      storage: {}
    type: elasticsearch
  managementState: Managed
  visualization:
    kibana:
      replicas: 0
      resources:
        limits:
          memory: 512Mi
        requests:
          cpu: 500m
          memory: 512Mi
    type: kibana
~~~

2. Check that kibana pod is created and indexmanagement jobs

~~~
$ oc get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-logging-operator-565d4fd5bc-czg8h   1/1     Running   0          11m
elasticsearch-im-app-1616600700-zhncg       0/1     Error     0          3m39s
elasticsearch-im-audit-1616600700-44wzc     0/1     Error     0          3m37s
elasticsearch-im-infra-1616600700-jbk65     0/1     Error     0          3m30s
fluentd-4bj6g                               1/1     Running   0          10m
fluentd-82zd5                               1/1     Running   0          10m
fluentd-h2dbk                               1/1     Running   0          10m
fluentd-lwcbw                               1/1     Running   0          10m
fluentd-mtwx4                               1/1     Running   0          10m
fluentd-sz7gh                               1/1     Running   0          10m
kibana-84cd8747c5-s9mct                     2/2     Running   0          9m32s
~~~

[Actual results]
- Kibana pod is created even when replica is 0
- Indexmanagement pods are running even when the logStore replica is defined to 0

[Expected results:]
- Not kibana pod
- Not Indexmanagement jobs running

Comment 1 ewolinet 2021-03-26 16:15:19 UTC

The index management cronjobs are going to schedule pods, setting the number of es nodes to 0 should have no impact on that.
The fact that the number of kibana replicas don't match what is specified there is a bug though.

Comment 2 ewolinet 2021-03-26 16:42:05 UTC

It's possible CLO is overwriting this value when creating the Kibana CR.

@Oscar,

can you provide the yaml output of the kibana CR for the cluster?

Comment 3 Oscar Casal Sanchez 2021-03-29 15:36:37 UTC

@ewolinet, 

- About the issue with Kibana

  I was having some issues today with the labs for getting one clean and running. Let me until tomorrow for being able to reproduce it and providing the Kibana CR.

- About the issue with the indexmanagement jobs

  I agree that it shouldn't have an impact, but if ES is scale down to 0, I can understand that we should have the enough logic in the operator to don't run the indexmanagement jobs. It doesn't make sense to have a job running each 15 minutes when you know that it won't work, the same with the curator jobs. Then, the operator should have the logic implemented for having in consideration that ES doesn't exist more since it's defined like that and doesn't create the jobs for curator and indexmanagement.

  If you prefer, we could manage this in a different bug and I could split it and move this issue to one different, but from my perspective, it's clear that it's a bug: something that it's trying to execute against another thing that it's defined that it shouldn't exist.

Comment 4 ewolinet 2021-03-29 21:49:44 UTC

(In reply to Oscar Casal Sanchez from comment #3)
> @ewolinet, 
> 
> - About the issue with Kibana
> 
>   I was having some issues today with the labs for getting one clean and
> running. Let me until tomorrow for being able to reproduce it and providing
> the Kibana CR.

I was trying to recreate this locally and am unable to see this happen as well.
If I define both an ES and Kibana section for my cl/instance object I see the kibana and elasticsearch cr's get created but with 0 replicas.

> - About the issue with the indexmanagement jobs
> 
>   I agree that it shouldn't have an impact, but if ES is scale down to 0, I
> can understand that we should have the enough logic in the operator to don't
> run the indexmanagement jobs. It doesn't make sense to have a job running
> each 15 minutes when you know that it won't work, the same with the curator
> jobs. Then, the operator should have the logic implemented for having in
> consideration that ES doesn't exist more since it's defined like that and
> doesn't create the jobs for curator and indexmanagement.
> 
>   If you prefer, we could manage this in a different bug and I could split
> it and move this issue to one different, but from my perspective, it's clear
> that it's a bug: something that it's trying to execute against another thing
> that it's defined that it shouldn't exist.

While I agree it could be handled better. I would argue that this is not a normal use case. If you were going to specify 0 ES nodes, you shouldn't define a logstore section in your cl/instance object (this prevents the indexmanagement cronjobs from being created).

Please do not open a bug, this is working as expected. If you feel it should be different I would request that it instead is a RFE and it can be prioritized like any other feature.

Comment 5 Oscar Casal Sanchez 2021-03-30 11:41:05 UTC

Hello,

I was able to reproduce following the steps below (take in consideration that something similar is happening with ES as you can see below)


Environment:

~~~
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.22    True        False         15h     Cluster version is 4.6.22

$ oc get csv
NAME                                           DISPLAY                            VERSION                 REPLACES   PHASE
clusterlogging.4.6.0-202103130248.p0           Cluster Logging                    4.6.0-202103130248.p0              Succeeded
elasticsearch-operator.4.6.0-202103130248.p0   OpenShift Elasticsearch Operator   4.6.0-202103130248.p0              Succeeded
~~~


### 1. Create CLO instance like below without kibana definition in the spec section. The Kibana pod is not created as it's expected

~~~
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 1
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 200m
          memory: 2Gi
      storage: {}
      redundancyPolicy: "ZeroRedundancy"
  curation:
    type: "curator"
    curator:
      resources:
        limits:
          memory: 200Mi
        requests:
          cpu: 200m
          memory: 200Mi
      schedule: "*/5 * * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd:
        resources: {}
~~~


###2. Added to the CLO instance the Kibana definition with "replicas: 1" and the Kibana pod was created as expected

~~~
  $ oc edit clusterlogging
...
  visualization:
    kibana:
      replicas: 0
      resources:
        limits:
          memory: 512Mi
        requests:
          cpu: 500m
          memory: 512Mi
    type: kibana
...

$ oc get clusterlogging instance -o jsonpath='{.spec.visualization}'
{"kibana":{"replicas":1,"resources":{"limits":{"memory":"512Mi"},"requests":{"cpu":"500m","memory":"512Mi"}}},"type":"kibana"}

$ oc get pods -l component=kibana
NAME                      READY   STATUS    RESTARTS   AGE
kibana-6cb7cbbb97-sgfz2   2/2     Running   0          87s

$ oc get kibana
NAME     MANAGEMENT STATE   REPLICAS
kibana   Managed            1


###3. Move kibana replicas to 0 and Kibana pod is not deleted as it should be

~~~
$ oc edit clusterlogging instance 
...
spec:
...
  visualization:
    kibana:
      replicas: 0
...

$ oc get pods -l component=kibana
NAME                     READY   STATUS    RESTARTS   AGE
kibana-b66b87b58-5s5m6   2/2     Running   0          32m

### Verify that in the CLO instance, the Kibana replicas is 0
$ oc get clusterlogging instance -o jsonpath='{.spec.visualization}'
{"kibana":{"replicas":0,"resources":{"limits":{"memory":"512Mi"},"requests":{"cpu":"500m","memory":"512Mi"}}},"type":"kibana"}

### The Kibana CR maintains the Replicas to 1 instead of 0
$ oc get kibana
NAME     MANAGEMENT STATE   REPLICAS
kibana   Managed            1
~~~

4. Tried to delete from CLO instance the kibana configuration and the Kibana pod remains running

~~~
$ oc get clusterlogging instance -o jsonpath='{.spec.visualization}'

$ oc get pods -l component=kibana
NAME                     READY   STATUS    RESTARTS   AGE
kibana-b66b87b58-5s5m6   2/2     Running   0          41m

$ oc get kibana kibana 
NAME     MANAGEMENT STATE   REPLICAS
kibana   Managed            1
~~~


The same happens for Elasticsearch if it's modified the CLO for setting in the LogStore stanza `nodeCount: 0`. The fluentd pods are restarted (then, they are detecting a change), but the ES pod continues running.

~~~
$  oc get clusterlogging instance -o jsonpath='{.spec.logStore}'
{"elasticsearch":{"nodeCount":0,"redundancyPolicy":"ZeroRedundancy","resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":"200m","memory":"2Gi"}},"storage":{}},"type":"elasticsearch"}

$ oc get pods -l component=elasticsearch
NAME                                            READY   STATUS    RESTARTS   AGE
elasticsearch-cdm-pyyw8xl2-1-7687d9484c-r5pnn   2/2     Running   0          46m

$ oc get elasticsearch
NAME            MANAGEMENT STATE   HEALTH   NODES   DATA NODES   SHARD ALLOCATION   INDEX MANAGEMENT
elasticsearch   Managed            green    1       1            all               

$ oc get pods -l component=fluentd
NAME                                            READY   STATUS              RESTARTS   AGE
fluentd-qmhmd                                   1/1     Running             0          44m
fluentd-r72mj                                   1/1     Running             0          29s
fluentd-rgq87                                   1/1     Running             0          44m
fluentd-tspm2                                   1/1     Running             0          11s
fluentd-tvq6g                                   1/1     Running             0          41s
fluentd-w465b                                   0/1     ContainerCreating   0          2s
~~~


Now, we delete from the CLO the stanza for Elasticsearch and the pod is deleted (this is one difference with what's happening to Kibana that it's not deleted)

~~~
$ oc edit clusterlogging
$ oc get clusterlogging instance -o jsonpath='{.spec.logStore}'
$ oc get pods -l component=elasticsearch
No resources found in openshift-logging namespace.

Comment 6 ewolinet 2021-03-30 21:56:08 UTC

(In reply to Oscar Casal Sanchez from comment #5)

> ###3. Move kibana replicas to 0 and Kibana pod is not deleted as it should be

> ### The Kibana CR maintains the Replicas to 1 instead of 0
> $ oc get kibana
> NAME     MANAGEMENT STATE   REPLICAS
> kibana   Managed            1
> ~~~
> 
> 4. Tried to delete from CLO instance the kibana configuration and the Kibana
> pod remains running
> 
> ~~~
> $ oc get clusterlogging instance -o jsonpath='{.spec.visualization}'
> 
> $ oc get pods -l component=kibana
> NAME                     READY   STATUS    RESTARTS   AGE
> kibana-b66b87b58-5s5m6   2/2     Running   0          41m
> 
> $ oc get kibana kibana 
> NAME     MANAGEMENT STATE   REPLICAS
> kibana   Managed            1
> ~~~
> 

Let me try to recreate this using your above steps, it seems the issue is stemming from how the Cluster Logging Operator is gating/controlling the kibana CR.


> The same happens for Elasticsearch if it's modified the CLO for setting in
> the LogStore stanza `nodeCount: 0`. The fluentd pods are restarted (then,
> they are detecting a change), but the ES pod continues running.

This is because the Elasticsearch Operator is preventing a total scale down (it does this to protect the cluster, a minimum of one master and one data node is required for a cluster). This works as expected and there should be a status message in the elasticsearch CR that explains this.

Comment 7 ewolinet 2021-03-30 22:15:15 UTC

Thank you Oscar. I'm able to recreate this using your above steps and I can see where in the code it prevents this.

It looks like this is a continuation from https://bugzilla.redhat.com/show_bug.cgi?id=1901424

Comment 8 Oscar Casal Sanchez 2021-03-31 20:03:36 UTC

Hello ewolinet,

Thank you so much for your update. 

I'm glad to see that you are able to reproduce it now and that you are able to see where it's failing. My apologies for perhaps missing some steps in the case description that could lead you to not being able to reproduce it.

I'll create a solution related to this issue, although, at this moment, I don't feel that it has a high priority since it's a case corner use case, it would be good to fix it in the future.

Best regards,
Oscar

Comment 9 ewolinet 2021-06-14 21:33:21 UTC

The fix for this was merged back on May 6th. Not sure why our bot didn't link the fix here nor move this card along

https://github.com/openshift/cluster-logging-operator/pull/998

Comment 10 Giriyamma 2021-06-15 05:57:01 UTC

Verified this on issue on clusterlogging.4.6.0-202106021513, elasticsearch-operator.4.6.0-202106100456. 
Issue is fixed: When kibana replica is set to 0, kibana pods are not created.

Comment 13 errata-xmlrpc 2021-06-29 06:30:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Enterprise security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2500