Bug 2028928

Summary: TargetDown alerts (false positives) on deleted prometheus-adapter pods
Product: OpenShift Container Platform
Reporter: ncarmich
Component: Monitoring
Assignee: Jan Fajerski <jfajersk>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: low
Version: 4.6
CC: amuller, anpicker, aos-bugs, erooth, spasquie, sthaha
Target Milestone: ---
Flags: sthaha: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-22 13:13:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description ncarmich 2021-12-03 18:26:55 UTC
Description of problem:

[SF case #: 03094729]

We see inconsistent information between Prometheus and the object state reported by the oc command. Prometheus still holds information about pods that were removed and, as a result, monitors two additional targets. This triggers false-positive "TargetDown" alerts. I attached screenshots from the Prometheus UI.


```sh
sd-df2a-ef7d: ~  namicg39021p/openshift-monitoring $ oc get po -l name=prometheus-adapter -n openshift-monitoring 
NAME                                  READY   STATUS    RESTARTS   AGE
prometheus-adapter-6f74bb68fd-c9rvr   1/1     Running   0          5d14h
prometheus-adapter-6f74bb68fd-kmg8l   1/1     Running   0          5d14h


sd-df2a-ef7d: ~  namicg39021p/openshift-monitoring $ oc get po prometheus-adapter-6b765fc44b-kbhxm -n openshift-monitoring 
Error from server (NotFound): pods "prometheus-adapter-6b765fc44b-kbhxm" not found
sd-df2a-ef7d: ~  namicg39021p/openshift-monitoring $ oc get po prometheus-adapter-6b765fc44b-h7nx5 -n openshift-monitoring 
Error from server (NotFound): pods "prometheus-adapter-6b765fc44b-h7nx5" not found
```
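
A minimal sketch of how the mismatch above could be checked mechanically, assuming two plain-text lists with one pod name per line: the pods Prometheus still scrapes and the pods that currently exist. The function name `stale_targets` and both input files are hypothetical and not part of the original report; how the lists are produced (for example from the Prometheus UI targets page and from `oc get po`) is left to the reader.

```sh
#!/bin/sh
# stale_targets TARGETS_FILE PODS_FILE  (hypothetical helper, for illustration)
# Prints pod names listed in TARGETS_FILE (pods Prometheus still monitors)
# that are absent from PODS_FILE (pods that currently exist).
stale_targets() {
    targets_sorted=$(mktemp)
    pods_sorted=$(mktemp)
    # comm requires sorted input; -u also drops duplicate names.
    sort -u "$1" > "$targets_sorted"
    sort -u "$2" > "$pods_sorted"
    # -23 suppresses lines unique to the second file and lines common to
    # both, leaving only names that appear in the targets list alone.
    comm -23 "$targets_sorted" "$pods_sorted"
    rm -f "$targets_sorted" "$pods_sorted"
}
```

Fed the four pod names from this report as targets and the two Running pods as the existing set, the function would list the two `6b765fc44b` pods that `oc` reports as NotFound.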

Where are you experiencing the behavior? What environment?
prod

When does the behavior occur? Frequency? Repeatedly? At certain times?
Random

What is the business impact? Please also provide timeframe information.
False-positive alerts are generated, which creates noise.


Additional info:

must-gather is located here: [SF case #: 03094729 - comment#-3]

Comment 6 ncarmich 2021-12-10 16:51:44 UTC
Hi Sunil,

I have verified it again and the alert disappeared from Prometheus, but *not* from Alertmanager. I attached two screenshots: the first is the alert in Alertmanager, and the second is the query in Prometheus used to trigger this alert. Please let me know if you need more info.

---

attached screenshots: Alertmanager1.PNG & Prometheus1.PNG

Comment 7 ncarmich 2021-12-10 16:53:04 UTC
Hi Sunil,

I have verified it again and the alert disappeared from Prometheus, but *not* from Alertmanager. I attached two screenshots: the first is the alert in Alertmanager, and the second is the query in Prometheus used to trigger this alert. Please let me know if you need more info.

---

attached screenshots: Alertmanager1.PNG & Prometheus1.PNG

Comment 12 Jan Fajerski 2021-12-22 13:13:37 UTC
Closing this as a duplicate, as mentioned above. Please feel free to reopen if either the workaround in the original bug doesn't work or anyone disagrees with the duplicate status.

*** This bug has been marked as a duplicate of bug 1943860 ***

Comment 15 ncarmich 2022-01-04 20:23:25 UTC
Hi again Team - just to close the loop, I wanted to post the latest update from the customer regarding the workaround (https://access.redhat.com/solutions/6604421):

"I confirm the workaround is working, I have tested it on another cluster where we had similar problem." I will go ahead and mark the solution above as verified.

Thanks again for all your help!