Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1918333

Summary:

Elasticsearch and Cluster Logging operator show 50% targets down message during upgrade and are not clearing out after upgrade completion

Product:

OpenShift Container Platform

Reporter:

Sam Yangsao <syangsao>

Component:

Monitoring

Assignee:

Jan Fajerski <jfajersk>

Status:

CLOSED DUPLICATE

QA Contact:

Junqi Zhao <juzhao>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

4.7

CC:

aharchin, amuller, andreas.letsche, anisal, anpicker, aos-bugs, berrange, erooth, ghernandeza, hkang, hongyli, jfajersk, kai-uwe.rommel, lchiaret, periklis, rsandu, s.heijmans, shizu, spasquie, trees

Target Milestone:

---

Keywords:

Reopened

Target Release:

4.7.0

Flags:

syangsao: needinfo-

Hardware:

x86_64

OS:

Linux

Whiteboard:

47hack

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2021-11-10 11:22:28 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1943860

Bug Blocks:

Attachments:

Description	Flags
screenshot of overview page	none

Description Sam Yangsao 2021-01-20 13:28:31 UTC

Description of problem:

Elasticsearch and Cluster Logging operator show 50% targets down message during upgrade and are not clearing out after upgrade completion

Version-Release number of selected component (if applicable):

vSphere 6.7U3
OCP 4.7.0-fc3 (upgraded from 4.6.9)

How reproducible:

Unsure

Steps to Reproduce:

1.  Install OCP 4.6.9 vSphere 6.7U3 UPI with OpenShift logging stack
2.  Upgrade to OCP 4.7.0-fc3

Actual results:

50% of the elasticsearch-operator-metrics/elasticsearch-operator-metrics targets in openshift-operators-redhat namespace are down.

50% of the cluster-logging-operator-metrics/cluster-logging-operator-metrics targets in openshift-logging namespace are down.

Expected results:

Both the elasticsearch and cluster-logging operators should be able to validate the upgrade completed and remove the errors from the Overview page.

Additional info:

The OpenShift logging project looks OK after the upgrade; a must-gather is attached as well.

# oc get all
NAME                                                READY   STATUS      RESTARTS   AGE
pod/cluster-logging-operator-544ddc54bc-4jdl6       1/1     Running     0          21h
pod/curator-1611113400-9w7m2                        0/1     Completed   0          9h
pod/elasticsearch-cdm-xwn3zxjb-1-68696589bf-gd58l   2/2     Running     11         32h
pod/elasticsearch-cdm-xwn3zxjb-2-b64b8849b-qx96h    2/2     Running     0          32h
pod/elasticsearch-cdm-xwn3zxjb-3-67b5dfbbb7-z8x5s   2/2     Running     0          32h
pod/elasticsearch-delete-app-1611148500-8rvmk       0/1     Completed   0          12m
pod/elasticsearch-delete-audit-1611148500-hh2qk     0/1     Completed   0          12m
pod/elasticsearch-delete-infra-1611148500-fjv6r     0/1     Completed   0          12m
pod/elasticsearch-rollover-app-1611148500-j4lvx     0/1     Completed   0          12m
pod/elasticsearch-rollover-audit-1611148500-l4s2k   0/1     Completed   0          12m
pod/elasticsearch-rollover-infra-1611148500-pwfgz   0/1     Completed   0          12m
pod/fluentd-5wsc9                                   1/1     Running     0          38h
pod/fluentd-bwmds                                   1/1     Running     1          38h
pod/fluentd-cpvnv                                   1/1     Running     6          39h
pod/fluentd-dr47j                                   1/1     Running     0          38h
pod/fluentd-jwfsv                                   1/1     Running     0          38h
pod/fluentd-l749q                                   1/1     Running     0          38h
pod/fluentd-s25ld                                   1/1     Running     0          38h
pod/fluentd-zhl6k                                   1/1     Running     0          39h
pod/kibana-6895f9c7b4-r2l42                         2/2     Running     0          32h

NAME                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/cluster-logging-operator-metrics   ClusterIP   172.30.35.128    <none>        8383/TCP,8686/TCP   39h
service/elasticsearch                      ClusterIP   172.30.93.223    <none>        9200/TCP            39h
service/elasticsearch-cluster              ClusterIP   172.30.225.172   <none>        9300/TCP            39h
service/elasticsearch-metrics              ClusterIP   172.30.141.188   <none>        60001/TCP           39h
service/fluentd                            ClusterIP   172.30.2.150     <none>        24231/TCP           39h
service/kibana                             ClusterIP   172.30.196.125   <none>        443/TCP             39h

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/fluentd   8         8         8       8            8           kubernetes.io/os=linux   39h

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-logging-operator       1/1     1            1           39h
deployment.apps/elasticsearch-cdm-xwn3zxjb-1   1/1     1            1           39h
deployment.apps/elasticsearch-cdm-xwn3zxjb-2   1/1     1            1           39h
deployment.apps/elasticsearch-cdm-xwn3zxjb-3   1/1     1            1           39h
deployment.apps/kibana                         1/1     1            1           39h

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-logging-operator-544ddc54bc       1         1         1       36h
replicaset.apps/cluster-logging-operator-78989b6754       0         0         0       39h
replicaset.apps/elasticsearch-cdm-xwn3zxjb-1-68696589bf   1         1         1       39h
replicaset.apps/elasticsearch-cdm-xwn3zxjb-2-b64b8849b    1         1         1       39h
replicaset.apps/elasticsearch-cdm-xwn3zxjb-3-67b5dfbbb7   1         1         1       39h
replicaset.apps/kibana-6895f9c7b4                         1         1         1       39h

NAME                                                COMPLETIONS   DURATION   AGE
job.batch/curator-1611113400                        1/1           7s         9h
job.batch/elasticsearch-delete-app-1611148500       1/1           6s         12m
job.batch/elasticsearch-delete-audit-1611148500     1/1           5s         12m
job.batch/elasticsearch-delete-infra-1611148500     1/1           6s         12m
job.batch/elasticsearch-rollover-app-1611148500     1/1           6s         12m
job.batch/elasticsearch-rollover-audit-1611148500   1/1           6s         12m
job.batch/elasticsearch-rollover-infra-1611148500   1/1           7s         12m

NAME                                         SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/curator                        30 3 * * *     False     0        9h              39h
cronjob.batch/elasticsearch-delete-app       */15 * * * *   False     0        12m             39h
cronjob.batch/elasticsearch-delete-audit     */15 * * * *   False     0        12m             39h
cronjob.batch/elasticsearch-delete-infra     */15 * * * *   False     0        12m             39h
cronjob.batch/elasticsearch-rollover-app     */15 * * * *   False     0        12m             39h
cronjob.batch/elasticsearch-rollover-audit   */15 * * * *   False     0        12m             39h
cronjob.batch/elasticsearch-rollover-infra   */15 * * * *   False     0        12m             39h

NAME                              HOST/PORT                                                  PATH   SERVICES   PORT    TERMINATION          WILDCARD
route.route.openshift.io/kibana   kibana-openshift-logging.apps.disocp4.lab.msp.redhat.com          kibana     <all>   reencrypt/Redirect   None

Comment 2 Sam Yangsao 2021-01-20 13:33:40 UTC

Created attachment 1749071
screenshot of overview page

Comment 5 Kai-Uwe Rommel 2021-08-15 13:02:56 UTC

Is anyone working on this? I also frequently see "50% of the cluster-logging-operator-metrics/cluster-logging-operator-metrics targets in openshift-logging namespace are down." alerts. No idea why. This buglet has been following me through at least all 4.7 versions.

Comment 6 Luiz Gustavo Chiaretto 2021-08-18 16:17:49 UTC

I am facing the same issue here in my cluster. I have recently upgraded the cluster from 4.7.19 to 4.7.22 and this alert is now being shown.

Comment 7 Guillermo 2021-08-23 06:43:29 UTC

The same here after installing 4.7.22. The only way I can get the alert to disappear is by disabling "enableUserWorkload: true" in openshift-monitoring, but I don't understand what user-defined monitoring has to do with openshift-logging.

$ oc get all
NAME                                                READY   STATUS      RESTARTS   AGE
pod/cluster-logging-operator-c5c746648-lkngz        1/1     Running     0          3d9h
pod/curator-1629689400-pvhv6                        0/1     Completed   0          3h9m
pod/elasticsearch-cdm-l3qjqqbv-1-84f8cd59d6-6fmp7   2/2     Running     0          3d9h
pod/elasticsearch-cdm-l3qjqqbv-2-59bd9cb655-fb428   2/2     Running     0          3d9h
pod/elasticsearch-cdm-l3qjqqbv-3-86bb6f94d8-d9hqb   2/2     Running     0          3d9h
pod/elasticsearch-im-app-1629700200-w9685           0/1     Completed   0          9m54s
pod/elasticsearch-im-audit-1629700200-n4qp4         0/1     Completed   0          9m54s
pod/elasticsearch-im-infra-1629700200-swbl9         0/1     Completed   0          9m54s
pod/fluentd-2lwb7                                   1/1     Running     0          3d11h
pod/fluentd-4lrn2                                   1/1     Running     0          3d11h
pod/fluentd-5pmkg                                   1/1     Running     0          3d11h
pod/fluentd-7kfpg                                   1/1     Running     0          3d11h
pod/fluentd-g57mx                                   1/1     Running     0          3d11h
pod/fluentd-mcl7d                                   1/1     Running     0          3d11h
pod/fluentd-mft9q                                   1/1     Running     0          3d11h
pod/fluentd-mqnss                                   1/1     Running     0          3d11h
pod/fluentd-q8z6w                                   1/1     Running     0          3d11h
pod/fluentd-s2bxw                                   1/1     Running     0          3d11h
pod/fluentd-wgc8g                                   1/1     Running     0          3d11h
pod/fluentd-wlhg4                                   1/1     Running     0          3d11h
pod/fluentd-wvgzd                                   1/1     Running     0          3d11h
pod/fluentd-xrsbf                                   1/1     Running     0          3d11h
pod/fluentd-zfjlt                                   1/1     Running     0          3d11h
pod/fluentd-zz5rc                                   1/1     Running     0          3d11h
pod/kibana-77b48f9dfc-xdxll                         2/2     Running     0          14h

NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/cluster-logging-operator-metrics   ClusterIP   172.30.118.80   <none>        8383/TCP,8686/TCP   83d
service/elasticsearch                      ClusterIP   172.30.211.94   <none>        9200/TCP            39d
service/elasticsearch-cluster              ClusterIP   172.30.64.251   <none>        9300/TCP            39d
service/elasticsearch-metrics              ClusterIP   172.30.102.81   <none>        60001/TCP           39d
service/fluentd                            ClusterIP   172.30.28.118   <none>        24231/TCP           39d
service/kibana                             ClusterIP   172.30.22.247   <none>        443/TCP             39d

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/fluentd   16        16        16      16           16          kubernetes.io/os=linux   39d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-logging-operator       1/1     1            1           83d
deployment.apps/elasticsearch-cdm-l3qjqqbv-1   1/1     1            1           39d
deployment.apps/elasticsearch-cdm-l3qjqqbv-2   1/1     1            1           39d
deployment.apps/elasticsearch-cdm-l3qjqqbv-3   1/1     1            1           39d
deployment.apps/kibana                         1/1     1            1           39d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-logging-operator-64dcfc9865       0         0         0       3d8h
replicaset.apps/cluster-logging-operator-c5c746648        1         1         1       3d11h
replicaset.apps/elasticsearch-cdm-l3qjqqbv-1-84f8cd59d6   1         1         1       39d
replicaset.apps/elasticsearch-cdm-l3qjqqbv-2-59bd9cb655   1         1         1       39d
replicaset.apps/elasticsearch-cdm-l3qjqqbv-3-86bb6f94d8   1         1         1       39d
replicaset.apps/kibana-77b48f9dfc                         1         1         1       39d
replicaset.apps/kibana-86767546ff                         0         0         0       3d8h

NAME                                          COMPLETIONS   DURATION   AGE
job.batch/curator-1629689400                  1/1           3s         3h9m
job.batch/elasticsearch-im-app-1629700200     1/1           3s         9m54s
job.batch/elasticsearch-im-audit-1629700200   1/1           4s         9m54s
job.batch/elasticsearch-im-infra-1629700200   1/1           4s         9m54s

NAME                                   SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/curator                  30 3 * * *     False     0        3h9m            39d
cronjob.batch/elasticsearch-im-app     */15 * * * *   False     0        9m56s           39d
cronjob.batch/elasticsearch-im-audit   */15 * * * *   False     0        9m56s           39d
cronjob.batch/elasticsearch-im-infra   */15 * * * *   False     0        9m56s           39d
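
For anyone who wants to check whether user-defined monitoring is enabled on their cluster, here is a minimal sketch. It assumes the setting lives in the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace (the documented default location; that name does not appear in this report). The function name and the OC override variable are hypothetical helpers for illustration only.

```shell
# Sketch only: print the monitoring ConfigMap so "enableUserWorkload" can be inspected.
# Assumption: the ConfigMap is named cluster-monitoring-config (not stated in this bug).
OC="${OC:-oc}"   # set OC=echo for a dry run that only prints the arguments

show_monitoring_config() {
    "$OC" get configmap cluster-monitoring-config -n openshift-monitoring -o yaml
}
```

Calling show_monitoring_config prints the ConfigMap; look for "enableUserWorkload" under the config.yaml key. This only reads the setting and does not change it.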

Comment 8 Becky Mack 2021-08-23 09:20:58 UTC

Elasticsearch and Cluster Logging operator show 50% targets down message during upgrade and are not clearing out after upgrade completion.  This https://www.bestessaytips.com/masterpapers-com-review/ will walk you through how to debug this issue by checking the logs, re-running an update script, or restarting the node service. We hope this information helps get your cluster back up and running!

Comment 9 Andreas Letsche 2021-09-06 11:11:38 UTC

Recently, I updated from OpenShift 4.7.24 to 4.7.28.
Afterwards, I saw this error message as well.
However, I had a look at the logs of the "cluster-logging-operator" in the "openshift-logging" namespace and saw errors.
A simple restart of the operator fixed the issue.
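
The check and restart described in this comment could look roughly like the sketch below. The deployment name and namespace are taken from the "oc get all" output earlier in this report. The function names, the OC override variable and the --tail value are hypothetical choices for illustration, and restarting through "oc rollout restart" is one possible way to do it, not necessarily what the commenter ran.

```shell
# Sketch only: inspect and restart the cluster-logging-operator.
OC="${OC:-oc}"   # set OC=echo for a dry run that only prints the arguments

# Show the most recent operator log lines to look for errors.
show_operator_logs() {
    "$OC" logs deployment/cluster-logging-operator -n openshift-logging --tail=100
}

# Trigger a rolling restart of the operator deployment.
restart_operator() {
    "$OC" rollout restart deployment/cluster-logging-operator -n openshift-logging
}
```

Run show_operator_logs first, then restart_operator if the logs show errors. As noted in the following comment, the alert may come back after a restart.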

Comment 10 Kai-Uwe Rommel 2021-09-06 11:19:09 UTC

I have also fixed that with a simple restart so far. But the problem keeps coming back.

Comment 11 Jan Fajerski 2021-09-15 11:37:42 UTC

*** Bug 2004457 has been marked as a duplicate of this bug. ***

Comment 12 Jan Fajerski 2021-09-15 11:39:07 UTC

Reopening as there seem to be several reports of clusters showing this.

Comment 13 Jan Fajerski 2021-09-15 11:45:36 UTC

(In reply to Becky Mack from comment #8)
> Elasticsearch and Cluster Logging operator show 50% targets down message
> during upgrade and are not clearing out after upgrade completion.  This
> https://www.bestessaytips.com/masterpapers-com-review/ will walk you through
> how to debug this issue by checking the logs, re-running an update script,
> or restarting the node service. We hope this information helps get your
> cluster back up and running!

This link seems irrelevant at best. Is this a mistake?

Comment 15 Jan Fajerski 2021-09-20 10:15:03 UTC

I think I identified the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1943860.

Setting a depends-on relation accordingly.

Comment 18 Jan Fajerski 2021-10-15 08:16:48 UTC

Waiting on feedback from the apiserver team in https://bugzilla.redhat.com/show_bug.cgi?id=1943860

Comment 19 Jan Fajerski 2021-11-10 11:22:28 UTC

Closing this as a duplicate, since no new information is forthcoming. Restarting the prometheus pods should work around this situation. Please feel free to re-open this if needed.

*** This bug has been marked as a duplicate of bug 1943860 ***
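
A minimal sketch of the workaround from comment 19, restarting the Prometheus pods. It assumes the default platform pod names prometheus-k8s-0 and prometheus-k8s-1 in the openshift-monitoring namespace; neither name appears in this report, so verify them on your cluster first (for example with "oc get pods -n openshift-monitoring"). The function name and the OC override variable are hypothetical.

```shell
# Sketch only: delete the Prometheus pods so that they are recreated.
# Assumption: default pod names and namespace, which are not stated in this bug.
OC="${OC:-oc}"   # set OC=echo for a dry run that only prints the arguments

restart_prometheus_pods() {
    "$OC" delete pod prometheus-k8s-0 prometheus-k8s-1 -n openshift-monitoring
}
```

Deleted pods are expected to be recreated by their controller. Deleting one pod at a time instead would keep one Prometheus instance scraping while the other restarts.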