Summary of the issue

`prometheus-managed-ocs-prometheus-0` pods in the openshift-storage namespace are breaking because the `prometheus` container inside them is breaking. After looking into the logs, it was found that there are multiple issues with that prometheus container:

- It is non-interactable: after port-forwarding to it, we tried to access its UI and the request timed out. The same reason explains why the local readinessProbes of that container are failing. FYI: the probes hit http://localhost:9090/-/ready and expect a 200 response, but they get a timeout, confirming the unavailability of the prometheus container.
- The logs of the prometheus container show multiple issues:

It can't list/watch the Kubernetes resources (Pods, Services, etc.) that it needs for service discovery:

```
ts=2022-09-21T10:50:39.142Z caller=log.go:168 level=error component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:449: Failed to watch *v1.Pod: failed to list *v1.Pod: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?resourceVersion=184343914\": dial tcp 172.30.0.1:443: i/o timeout"
```

It can't talk to the Alertmanager; every attempt fails with "context deadline exceeded":

```
ts=2022-09-21T11:13:49.953Z caller=notifier.go:526 level=error component=notifier alertmanager=http://10.128.2.157:9093/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://10.128.2.157:9093/api/v2/alerts\": context deadline exceeded"
```

It can't evaluate the DeadMansSnitch rule because the query times out during query execution:

```
ts=2022-09-21T09:56:45.956Z caller=manager.go:609 level=warn component="rule manager" group=snitch-alert msg="Evaluating rule failed" rule="alert: DeadMansSnitch\nexpr: vector(1)\nlabels:\n alertname: DeadMansSnitch\n namespace: openshift-storage\n" err="query timed out in query execution"
```

Cluster Details

ID: 1rq91l2op46odsj6o2u0gutmtsiob83b
External ID: 4054490b-bcb2-4eb9-84e5-8a3922469ce3
Name: r-eu-prd-01
State: ready
API URL: https://api.r-eu-prd-01.l4x7.p1.openshiftapps.com:6443
API Listening: internal
Console URL: https://console-openshift-console.apps.r-eu-prd-01.l4x7.p1.openshiftapps.com
Masters: 3
Infra: 3
Computes: 3-12 (Autoscaled)
Product: rosa
Provider: aws
Version: 4.10.14
Region: eu-west-1
Multi-az: true
CCS: true
Subnet IDs: [subnet-071770adceb574f6d subnet-07785985313972980 subnet-0282ac95107e1412f]
PrivateLink: true
STS: true
Existing VPC: true
Channel Group: stable
Cluster Admin: true
Organization: BP Corporation North America Inc
Creator: rosa-eu-prd
Email: rosa-eu-prd
AccountNumber: 1569407
Created: 2022-04-25T17:48:13Z
Expiration: 0001-01-01T00:00:00Z
Shard: https://api.hivep01ue1.b6s7.p1.openshiftapps.com:6443
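For reference, a minimal sketch of the manual checks described above, assuming `oc` access to the cluster, the pod name `prometheus-managed-ocs-prometheus-0`, and that `curl` is available inside the container image (minimal images may not ship it):

```
# Forward the Prometheus web port locally and probe the same readiness
# endpoint the kubelet hits; a healthy instance returns HTTP 200.
oc -n openshift-storage port-forward pod/prometheus-managed-ocs-prometheus-0 9090:9090 &
curl -sS -m 5 -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready

# Simulate the service-discovery call that is timing out in the logs,
# from inside the prometheus container, using the pod's service account token.
oc -n openshift-storage exec prometheus-managed-ocs-prometheus-0 -c prometheus -- \
  sh -c 'curl -sS -m 5 -k \
    -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    "https://172.30.0.1:443/api/v1/namespaces/openshift-monitoring/pods?limit=1"'
```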
This issue was investigated and the root cause wasn't identified. To the best of our investigation, it appears to have been a flake, for the following reasons:

- We got shell access inside the `prometheus` container of the `prometheus-managed-ocs-prometheus-0` pod and simulated the API calls that the `prometheus` container was making against the API server. Those calls reached the API server successfully and we got responses, contrary to what the prometheus logs were reporting.
- We couldn't even reach the prometheus container at `localhost:9090` from inside the container itself, indicating that the prometheus process in the pod was not responding at all.

Ultimately, we restarted the pod with `oc rollout restart statefulset/prometheus-managed-ocs-prometheus -n openshift-storage` and everything worked fine afterwards.
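For completeness, a sketch of the mitigation as applied plus a follow-up verification step, assuming the same namespace and StatefulSet name as above:

```
# Restart the Prometheus StatefulSet (the mitigation that resolved the incident).
oc -n openshift-storage rollout restart statefulset/prometheus-managed-ocs-prometheus

# Wait for the rollout to finish, then confirm the pod reports Ready again.
oc -n openshift-storage rollout status statefulset/prometheus-managed-ocs-prometheus
oc -n openshift-storage get pod prometheus-managed-ocs-prometheus-0
```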
https://redhat.pagerduty.com/incidents/Q1YCXVSJYBKX4W
@ykukreja since we weren't able to find the root cause in the bridge, there were no repro steps, and a restart of the pod fixed it, can I close this?
Sure, we can reopen this ticket if the issue starts occurring regularly again.