Bug 2128677

Summary: Prometheus pods in the openshift-storage namespace in a production cluster breaking
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Yashvardhan Kukreja <ykukreja>
Component: odf-managed-service
Assignee: Leela Venkaiah Gangavarapu <lgangava>
Status: CLOSED WONTFIX
QA Contact: Neha Berry <nberry>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.10
CC: aeyal, lgangava, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-10-03 10:36:09 UTC
Type: Bug

Description Yashvardhan Kukreja 2022-09-21 11:45:19 UTC
Summary of the issue

The `prometheus-managed-ocs-prometheus-0` pod in the openshift-storage namespace is unhealthy because the `prometheus` container inside it is unresponsive.


After looking into the logs and the pod's behaviour, several problems with the `prometheus` container were found:

- The container is unreachable: after port-forwarding to it, attempts to open its UI timed out.
- For the same reason, the container's local readiness probes are failing. The probes hit http://localhost:9090/-/ready and expect a 200 response, but they time out, which confirms that the Prometheus container is unavailable. (A manual reproduction of these checks is sketched after the log excerpt below.)

The Prometheus container's own logs additionally show several distinct errors:

It cannot List/Watch the Kubernetes resources (Pods, Services, etc.) that its service discovery requires.

```
ts=2022-09-21T10:50:39.142Z caller=log.go:168 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:449: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?resourceVersion=184343914\": dial tcp 172.30.0.1:443: i/o timeout"
```
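For reference, these failures can be reproduced by hand from inside the `prometheus` container. The commands below are only a sketch of how one might do that (they assume `oc rsh` access and that `curl` or an equivalent HTTP client is available in the image); the API server address and resource path are taken from the log line above.

```
# Open a shell in the prometheus container of the affected pod, then run the
# checks below from inside that session.
oc rsh -n openshift-storage -c prometheus prometheus-managed-ocs-prometheus-0

# (inside the container)
# 1. The readiness check the kubelet performs: expect HTTP 200; here it timed out.
curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:9090/-/ready

# 2. The list call that Kubernetes service discovery is timing out on, issued with
#    the pod's own service account credentials (standard in-cluster mount paths).
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -s --max-time 10 \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?limit=1"
```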



It cannot reach the Alertmanager: every attempt to post alerts fails with `context deadline exceeded`.

```
ts=2022-09-21T11:13:49.953Z caller=notifier.go:526 level=error component=notifier alertmanager=http://10.128.2.157:9093/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10.128.2.157:9093/api/v2/alerts\": context deadline exceeded"
```
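A hedged sketch of how the Alertmanager endpoint from this error could be probed from inside the `prometheus` container (or any pod on the cluster network): the IP and path come from the log line above, while the throwaway alert payload is an assumption, since any valid v2 alert body would do.

```
# POST a throwaway alert to the Alertmanager v2 API referenced in the error above;
# a 200 here would indicate the endpoint is reachable, unlike what Prometheus sees.
curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 \
  -X POST -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"ConnectivityTest","namespace":"openshift-storage"}}]' \
  http://10.128.2.157:9093/api/v2/alerts
```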


It cannot evaluate the DeadMansSnitch rule because the query times out ("query timed out in query execution").

```
ts=2022-09-21T09:56:45.956Z caller=manager.go:609 level=warn component="rule manager" group=snitch-alert msg="Evaluating rule failed" rule="alert: DeadMansSnitch\nexpr: vector(1)\nlabels:\n  alertname: DeadMansSnitch\n  namespace: openshift-storage\n" err="query timed out in query execution"
```
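The failing rule only evaluates the constant expression `vector(1)`, so querying it directly against the local Prometheus HTTP API from inside the `prometheus` container is a quick way to see whether the query engine responds at all. A minimal sketch, assuming `curl` is available in the image and the default 9090 listen port used elsewhere in this report:

```
# Evaluate the DeadMansSnitch expression directly via the Prometheus query API.
# A healthy instance answers almost instantly with a single sample of value 1.
curl -s --max-time 10 'http://localhost:9090/api/v1/query?query=vector(1)'
```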


Cluster Details


```
ID:            1rq91l2op46odsj6o2u0gutmtsiob83b
External ID:        4054490b-bcb2-4eb9-84e5-8a3922469ce3
Name:            r-eu-prd-01
State:            ready
API URL:        https://api.r-eu-prd-01.l4x7.p1.openshiftapps.com:6443
API Listening:        internal
Console URL:        https://console-openshift-console.apps.r-eu-prd-01.l4x7.p1.openshiftapps.com
Masters:        3
Infra:            3
Computes:        3-12 (Autoscaled)
Product:        rosa
Provider:        aws
Version:        4.10.14
Region:            eu-west-1
Multi-az:        true
CCS:            true
Subnet IDs:        [subnet-071770adceb574f6d subnet-07785985313972980 subnet-0282ac95107e1412f]
PrivateLink:        true
STS:            true
Existing VPC:        true
Channel Group:        stable
Cluster Admin:        true
Organization:        BP Corporation North America Inc
Creator:        rosa-eu-prd
Email:            rosa-eu-prd
AccountNumber:          1569407
Created:        2022-04-25T17:48:13Z
Expiration:        0001-01-01T00:00:00Z
Shard:            https://api.hivep01ue1.b6s7.p1.openshiftapps.com:6443
```

Comment 1 Yashvardhan Kukreja 2022-09-21 14:10:55 UTC
This issue was investigated and the root cause was not identified.
To the best of our investigation, it appears to have been a one-off flake, for the following reasons:
- We got shell access inside the `prometheus` container of the `prometheus-managed-ocs-prometheus-0` pod and simulated the API calls that Prometheus was making against the API server. Those calls reached the API server and returned responses, contradicting the timeouts Prometheus itself was reporting.
- We could not even reach Prometheus at `localhost:9090` from inside the container itself, suggesting that the Prometheus process itself was not serving requests.

Ultimately, we restarted the pod with `oc rollout restart statefulset/prometheus-managed-ocs-prometheus -n openshift-storage`, after which everything worked fine.
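The restart command above is the one that was run; the follow-up checks below are an assumed sketch of how one might confirm the pod comes back healthy afterwards.

```
# Recovery step used during the incident.
oc rollout restart statefulset/prometheus-managed-ocs-prometheus -n openshift-storage

# Assumed verification: wait for the rollout to finish and confirm the pod
# reports all of its containers as Ready.
oc rollout status statefulset/prometheus-managed-ocs-prometheus -n openshift-storage
oc get pod prometheus-managed-ocs-prometheus-0 -n openshift-storage
```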

Comment 2 Yashvardhan Kukreja 2022-09-21 14:34:58 UTC
https://redhat.pagerduty.com/incidents/Q1YCXVSJYBKX4W

Comment 3 Leela Venkaiah Gangavarapu 2022-09-26 10:48:28 UTC
@ykukreja as discussed on the bridge: since we weren't able to find the root cause, there are no reproduction steps, and a restart of the pod fixed it, can I close this?

Comment 4 Yashvardhan Kukreja 2022-09-28 13:01:50 UTC
Sure, we can reopen this ticket if the issue starts occurring regularly again.