Bug 2128677 - Prometheus pods in the openshift-storage namespace in a production cluster breaking
Summary: Prometheus pods in the openshift-storage namespace in a production cluster breaking
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Leela Venkaiah Gangavarapu
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-21 11:45 UTC by Yashvardhan Kukreja
Modified: 2023-08-09 17:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-03 10:36:09 UTC
Embargoed:



Description Yashvardhan Kukreja 2022-09-21 11:45:19 UTC
Summary of the issue

The `prometheus-managed-ocs-prometheus-0` pods in the openshift-storage namespace are failing because the `prometheus` container inside them is unhealthy.


After looking into the logs, we found multiple issues with that prometheus container.

The container is unreachable: after port-forwarding to it, attempts to access its UI timed out.
For the same reason, the container's local readinessProbes are failing. FYI: the probes hit http://localhost:9090/-/ready and expect a 200 response, but they time out instead, confirming that the Prometheus container is unavailable.
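
For reference, a minimal sketch of that reachability check (the `--max-time` values below are arbitrary choices for illustration, not taken from the probe spec):

```
# Forward the Prometheus web port of the affected pod to the local machine.
oc -n openshift-storage port-forward pod/prometheus-managed-ocs-prometheus-0 9090:9090 &

# The readiness endpoint should answer with HTTP 200; during the incident it timed out.
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:9090/-/ready

# The UI landing page timed out the same way.
curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:9090/graph
```
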
The Prometheus container's logs show multiple problems:
It can't List/Watch the Kubernetes resources it needs, such as Pods and Services.

```
ts=2022-09-21T10:50:39.142Z caller=log.go:168 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:449: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?resourceVersion=184343914\": dial tcp 172.30.0.1:443: i/o timeout"
```



It can't talk to the Alertmanager: every attempt to send alerts fails with context deadline exceeded.

```
ts=2022-09-21T11:13:49.953Z caller=notifier.go:526 level=error component=notifier alertmanager=http://10.128.2.157:9093/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10.128.2.157:9093/api/v2/alerts\": context deadline exceeded"
```
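
For future triage, a quick way to check whether that Alertmanager endpoint is reachable at all is to post a dummy alert to it directly. This is a sketch only; the IP and path are taken from the log line above, and the payload is purely illustrative:

```
# A healthy Alertmanager answers this quickly; a hang here would mirror the
# "context deadline exceeded" error Prometheus is reporting.
curl -sS --max-time 10 -X POST http://10.128.2.157:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ConnectivityTest", "namespace": "openshift-storage"}}]'
```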


It can't evaluate the DeadMansSnitch rule because queries time out during execution ("query timed out in query execution"):

```
ts=2022-09-21T09:56:45.956Z caller=manager.go:609 level=warn component="rule manager" group=snitch-alert msg="Evaluating rule failed" rule="alert: DeadMansSnitch\nexpr: vector(1)\nlabels:\n  alertname: DeadMansSnitch\n  namespace: openshift-storage\n" err="query timed out in query execution"
```
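
Since the failing rule only evaluates the constant expression `vector(1)`, the query engine itself appears wedged. A hedged way to confirm that outside the rule manager (reusing the port-forward from the earlier sketch; the timeout value is again an arbitrary choice) is to hit the query API directly:

```
# vector(1) touches no TSDB data; if even this times out, the query engine is stuck,
# matching the "query timed out in query execution" error above.
curl -sS --max-time 5 'http://localhost:9090/api/v1/query?query=vector(1)'
```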


Cluster Details


ID:            1rq91l2op46odsj6o2u0gutmtsiob83b
External ID:        4054490b-bcb2-4eb9-84e5-8a3922469ce3
Name:            r-eu-prd-01
State:            ready
API URL:        https://api.r-eu-prd-01.l4x7.p1.openshiftapps.com:6443
API Listening:        internal
Console URL:        https://console-openshift-console.apps.r-eu-prd-01.l4x7.p1.openshiftapps.com
Masters:        3
Infra:            3
Computes:        3-12 (Autoscaled)
Product:        rosa
Provider:        aws
Version:        4.10.14
Region:            eu-west-1
Multi-az:        true
CCS:            true
Subnet IDs:        [subnet-071770adceb574f6d subnet-07785985313972980 subnet-0282ac95107e1412f]
PrivateLink:        true
STS:            true
Existing VPC:        true
Channel Group:        stable
Cluster Admin:        true
Organization:        BP Corporation North America Inc
Creator:        rosa-eu-prd
Email:            rosa-eu-prd
AccountNumber:          1569407
Created:        2022-04-25T17:48:13Z
Expiration:        0001-01-01T00:00:00Z
Shard:            https://api.hivep01ue1.b6s7.p1.openshiftapps.com:6443

Comment 1 Yashvardhan Kukreja 2022-09-21 14:10:55 UTC
This issue was investigated and the root cause wasn't identified.
To the best of our investigation, it seems to have been a flake, for the following reasons:
- we got shell access inside the `prometheus` container of the `prometheus-managed-ocs-prometheus-0` pod and simulated the API calls that the `prometheus` container makes against the API server. Those calls reached the API server and returned responses, contrary to what the container's logs indicated (see the sketch after this list).
- we couldn't even reach the prometheus container at `localhost:9090` from inside the container itself, indicating that the prometheus process in the pod wasn't responding at all.
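
A rough reconstruction of those two in-container checks. The exact commands below are a sketch, assuming `curl` is available in the prometheus image (use `wget -qO-` or similar otherwise); the `limit=1` parameter is only there to keep the response small:

```
# Shell into the prometheus container of the affected pod.
oc -n openshift-storage exec -it prometheus-managed-ocs-prometheus-0 -c prometheus -- sh

# From inside the container: replay the List call that Prometheus' Kubernetes service
# discovery was failing on, using the pod's own service-account token.
# In our case this reached the API server and returned a response.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sS --max-time 10 \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?limit=1"

# Also from inside the container: the local readiness endpoint.
# In our case even this timed out, pointing at the prometheus process itself.
curl -sS --max-time 5 http://localhost:9090/-/ready
```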

Ultimately, we restarted the pod with `oc rollout restart statefulset/prometheus-managed-ocs-prometheus -n openshift-storage`, after which everything worked fine.
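
For completeness, the restart plus the usual follow-up checks (the status and readiness commands are standard `oc` usage, not output captured in this BZ):

```
# Restart the statefulset and wait for the rollout to complete.
oc -n openshift-storage rollout restart statefulset/prometheus-managed-ocs-prometheus
oc -n openshift-storage rollout status statefulset/prometheus-managed-ocs-prometheus

# Confirm the pod reports Ready again.
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0
```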

Comment 2 Yashvardhan Kukreja 2022-09-21 14:34:58 UTC
https://redhat.pagerduty.com/incidents/Q1YCXVSJYBKX4W

Comment 3 Leela Venkaiah Gangavarapu 2022-09-26 10:48:28 UTC
@ykukreja as discussed in the bridge: since we weren't able to find the root cause, there are no reproduction steps, and a restart of the pod fixed it, can I close this?

Comment 4 Yashvardhan Kukreja 2022-09-28 13:01:50 UTC
Sure. We can reopen this ticket if the issue starts occurring regularly again.

