Bug 1918333
| Summary: | Elasticsearch and Cluster Logging operator show 50% targets down message during upgrade and are not clearing out after upgrade completion | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sam Yangsao <syangsao> | ||||
| Component: | Monitoring | Assignee: | Jan Fajerski <jfajersk> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4.7 | CC: | aharchin, amuller, andreas.letsche, anisal, anpicker, aos-bugs, berrange, erooth, ghernandeza, hkang, hongyli, jfajersk, kai-uwe.rommel, lchiaret, periklis, rsandu, s.heijmans, shizu, spasquie, trees | ||||
| Target Milestone: | --- | Keywords: | Reopened | ||||
| Target Release: | 4.7.0 | Flags: | syangsao: needinfo- | ||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | 47hack | ||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-11-10 11:22:28 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1943860 | ||||||
| Bug Blocks: | |||||||
| Attachments: | |||||||
|
Description
Sam Yangsao
2021-01-20 13:28:31 UTC
Created attachment 1749071 [details]
screenshot of overview page
Anyone working on this? I also frequently see "50% of the cluster-logging-operator-metrics/cluster-logging-operator-metrics targets in openshift-logging namespace are down." alerts. No idea why. This buglet has been following me through at least all 4.7 versions.

I am facing the same issue here in my cluster. I have recently upgraded the cluster from 4.7.19 to 4.7.22 and this alert is now being shown.

The same here after installing 4.7.22. The only way I can get the alert to disappear is by disabling "enableUserWorkload: true" in openshift-monitoring, but I don't understand what user-defined monitoring has to do with openshift-logging.

$ oc get all
NAME                                                READY   STATUS      RESTARTS   AGE
pod/cluster-logging-operator-c5c746648-lkngz        1/1     Running     0          3d9h
pod/curator-1629689400-pvhv6                        0/1     Completed   0          3h9m
pod/elasticsearch-cdm-l3qjqqbv-1-84f8cd59d6-6fmp7   2/2     Running     0          3d9h
pod/elasticsearch-cdm-l3qjqqbv-2-59bd9cb655-fb428   2/2     Running     0          3d9h
pod/elasticsearch-cdm-l3qjqqbv-3-86bb6f94d8-d9hqb   2/2     Running     0          3d9h
pod/elasticsearch-im-app-1629700200-w9685           0/1     Completed   0          9m54s
pod/elasticsearch-im-audit-1629700200-n4qp4         0/1     Completed   0          9m54s
pod/elasticsearch-im-infra-1629700200-swbl9         0/1     Completed   0          9m54s
pod/fluentd-2lwb7                                   1/1     Running     0          3d11h
pod/fluentd-4lrn2                                   1/1     Running     0          3d11h
pod/fluentd-5pmkg                                   1/1     Running     0          3d11h
pod/fluentd-7kfpg                                   1/1     Running     0          3d11h
pod/fluentd-g57mx                                   1/1     Running     0          3d11h
pod/fluentd-mcl7d                                   1/1     Running     0          3d11h
pod/fluentd-mft9q                                   1/1     Running     0          3d11h
pod/fluentd-mqnss                                   1/1     Running     0          3d11h
pod/fluentd-q8z6w                                   1/1     Running     0          3d11h
pod/fluentd-s2bxw                                   1/1     Running     0          3d11h
pod/fluentd-wgc8g                                   1/1     Running     0          3d11h
pod/fluentd-wlhg4                                   1/1     Running     0          3d11h
pod/fluentd-wvgzd                                   1/1     Running     0          3d11h
pod/fluentd-xrsbf                                   1/1     Running     0          3d11h
pod/fluentd-zfjlt                                   1/1     Running     0          3d11h
pod/fluentd-zz5rc                                   1/1     Running     0          3d11h
pod/kibana-77b48f9dfc-xdxll                         2/2     Running     0          14h

NAME                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/cluster-logging-operator-metrics   ClusterIP   172.30.118.80   <none>        8383/TCP,8686/TCP   83d
service/elasticsearch                      ClusterIP   172.30.211.94   <none>        9200/TCP            39d
service/elasticsearch-cluster              ClusterIP   172.30.64.251   <none>        9300/TCP            39d
service/elasticsearch-metrics              ClusterIP   172.30.102.81   <none>        60001/TCP           39d
service/fluentd                            ClusterIP   172.30.28.118   <none>        24231/TCP           39d
service/kibana                             ClusterIP   172.30.22.247   <none>        443/TCP             39d

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/fluentd   16        16        16      16           16          kubernetes.io/os=linux   39d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cluster-logging-operator       1/1     1            1           83d
deployment.apps/elasticsearch-cdm-l3qjqqbv-1   1/1     1            1           39d
deployment.apps/elasticsearch-cdm-l3qjqqbv-2   1/1     1            1           39d
deployment.apps/elasticsearch-cdm-l3qjqqbv-3   1/1     1            1           39d
deployment.apps/kibana                         1/1     1            1           39d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/cluster-logging-operator-64dcfc9865       0         0         0       3d8h
replicaset.apps/cluster-logging-operator-c5c746648        1         1         1       3d11h
replicaset.apps/elasticsearch-cdm-l3qjqqbv-1-84f8cd59d6   1         1         1       39d
replicaset.apps/elasticsearch-cdm-l3qjqqbv-2-59bd9cb655   1         1         1       39d
replicaset.apps/elasticsearch-cdm-l3qjqqbv-3-86bb6f94d8   1         1         1       39d
replicaset.apps/kibana-77b48f9dfc                         1         1         1       39d
replicaset.apps/kibana-86767546ff                         0         0         0       3d8h

NAME                                          COMPLETIONS   DURATION   AGE
job.batch/curator-1629689400                  1/1           3s         3h9m
job.batch/elasticsearch-im-app-1629700200     1/1           3s         9m54s
job.batch/elasticsearch-im-audit-1629700200   1/1           4s         9m54s
job.batch/elasticsearch-im-infra-1629700200   1/1           4s         9m54s

NAME                                   SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/curator                  30 3 * * *     False     0        3h9m            39d
cronjob.batch/elasticsearch-im-app     */15 * * * *   False     0        9m56s           39d
cronjob.batch/elasticsearch-im-audit   */15 * * * *   False     0        9m56s           39d
cronjob.batch/elasticsearch-im-infra   */15 * * * *   False     0        9m56s           39d

Elasticsearch and Cluster Logging operator show 50% targets down message during upgrade and are not clearing out after upgrade completion. This https://www.bestessaytips.com/masterpapers-com-review/ will walk you through how to debug this issue by checking the logs, re-running an update script, or restarting the node service. We hope this information helps get your cluster back up and running!

Recently, I updated from OpenShift 4.7.24 to 4.7.28. Afterwards, I saw this error message as well. However, I had a look at the logs of the "cluster-logging-operator" in the "openshift-logging" namespace and saw errors. A simple restart of the operator fixed the issue.

I have also fixed that with a simple restart so far. But the problem keeps coming back.

*** Bug 2004457 has been marked as a duplicate of this bug. ***

Reopening as there seem to be several reports of clusters showing this.

(In reply to Becky Mack from comment #8)
> Elasticsearch and Cluster Logging operator show 50% targets down message
> during upgrade and are not clearing out after upgrade completion. This
> https://www.bestessaytips.com/masterpapers-com-review/ will walk you through
> how to debug this issue by checking the logs, re-running an update script,
> or restarting the node service. We hope this information helps get your
> cluster back up and running!

This link seems irrelevant at best. Is this a mistake?

I think I identified the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1943860. Setting a depends-on relation accordingly.

Waiting on feedback from the apiserver team in https://bugzilla.redhat.com/show_bug.cgi?id=1943860

Closing this as a duplicate, since no new information is forthcoming. Restarting the prometheus pods should work around this situation. Please feel free to re-open this if needed.

*** This bug has been marked as a duplicate of bug 1943860 ***
|
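For anyone wondering where the "50%" figure comes from, here is a minimal sketch of the arithmetic. It assumes the alert reports, per job/service/namespace group, the share of scrape targets whose `up` value is 0. The `oc get all` output in this thread lists two ports (8383 and 8686) on service/cluster-logging-operator-metrics, so if each port is scraped as its own target (an assumption, not confirmed in this bug), a single failing scrape would read as 50%. The function name `percent_down` and the sample values are hypothetical, for illustration only; this is not the actual alert rule.

```python
from collections import defaultdict


def percent_down(samples):
    """Return the percentage of down targets per (job, namespace, service).

    `samples` is an iterable of (job, namespace, service, up) tuples,
    where `up` is 1 for a healthy scrape target and 0 for a failed one.
    """
    totals = defaultdict(int)
    down = defaultdict(int)
    for job, namespace, service, up in samples:
        key = (job, namespace, service)
        totals[key] += 1
        if up == 0:
            down[key] += 1
    return {key: 100.0 * down[key] / totals[key] for key in totals}


# Hypothetical sample data: two targets behind the operator metrics
# service (one per port), one of which fails to scrape.
samples = [
    ("cluster-logging-operator-metrics", "openshift-logging",
     "cluster-logging-operator-metrics", 1),
    ("cluster-logging-operator-metrics", "openshift-logging",
     "cluster-logging-operator-metrics", 0),
    ("fluentd", "openshift-logging", "fluentd", 1),
]

result = percent_down(samples)
for (job, namespace, service), pct in sorted(result.items()):
    if pct > 0:
        print(f"{pct:.0f}% of the {job}/{service} targets "
              f"in {namespace} namespace are down.")
```

If this reading is right, the message says nothing about half of the logging stack being broken; it only means one of two scrape endpoints for that service is failing, which would be consistent with the pods all showing Running above.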