Bug 1943860 - Prometheus never sees endpoint propagation of a deleted pod
Summary: Prometheus never sees endpoint propagation of a deleted pod
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.z
Assignee: Jan Fajerski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Duplicates: 1918333 2014286 2028928 2066533
Depends On:
Blocks: 1918333 2014286
 
Reported: 2021-03-27 22:34 UTC by Clayton Coleman
Modified: 2023-03-09 01:01 UTC
CC: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:01:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift prometheus pull 96 0 None open Bug 1943860: Bump 2.30.0 2021-09-21 06:11:29 UTC
Red Hat Issue Tracker OHSS-14889 0 None None None 2022-09-01 16:49:58 UTC
Red Hat Issue Tracker OSD-13070 0 None None None 2022-09-12 11:24:21 UTC
Red Hat Knowledge Base (Solution) 6963954 0 None None None 2022-06-20 14:46:53 UTC

Description Clayton Coleman 2021-03-27 22:34:00 UTC
In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152

during node upgrades the oauth-openshift-...-vrmzm pod is marked as deleted at 20:59:42 and is still not fully deleted 50m later.  This causes TargetDown to fire because the pod is unready and the deletion never completes.

Mar 27 20:59:42.790 W ns/openshift-authentication pod/oauth-openshift-b88f6b558-vrmzm node/**************-17f95-gdtt6-master-1 reason/GracefulDelete in 40s

Marking this high because it is a bad state to get into and it effectively wedged the whole upgrade process.  It may be similar to other failures we are seeing in CI (such as the pod submit test still failing occasionally).

This is a release-blocking bug and potentially has a high impact on CI (although I have not quantified the impact yet).

Comment 2 Clayton Coleman 2021-03-27 22:45:08 UTC
Interestingly, kube-state-metrics doesn't see the pod after 21:00, which may mean that it is being deleted but the endpoint change is not propagating to Prometheus.  If we can show the deletion event in the apiserver output for that job, we can potentially move this bug to Prometheus (which may not be seeing the endpoint update propagate).

Comment 3 Elana Hashman 2021-04-05 17:20:31 UTC
Marking blocker per Clayton's comment.

Comment 4 Elana Hashman 2021-04-05 20:00:58 UTC
Logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-pm7zl9i9-17f95-gdtt6-master-1/journal

First delete:

Mar 27 20:59:44.176181 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 20:59:44.161867    1535 kubelet.go:1936] SyncLoop (DELETE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"

Second delete + remove:

Mar 27 21:00:21.786533 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786484    1535 kubelet.go:1936] SyncLoop (DELETE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"
Mar 27 21:00:21.792626 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786659    1535 kubelet.go:1930] SyncLoop (REMOVE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"
Mar 27 21:00:21.792626 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786745    1535 kubelet.go:2120] Failed to delete pod "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)", err: pod not found

No further messages for the pod after that. It looks like it was successfully deleted within 30s of the deletion initiation and should have been removed from the API.

I'll look in the API Server logs next.

Comment 5 Elana Hashman 2021-04-05 21:03:41 UTC
Looking through API server audit logs, the last entry I see for this pod is

ci-op-pm7zl9i9-17f95-gdtt6-master-1-audit-2021-03-27T21-41-09.028.log.gz:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"fac31812-2e13-4d04-a69b-35f14a6ce994","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-authentication/pods/oauth-openshift-b88f6b558-vrmzm","verb":"get","user":{"username":"system:serviceaccount:openshift-machine-config-operator:machine-config-daemon","uid":"0490cbbb-3d91-48d7-b23b-1b6b3db9cefa","groups":["system:serviceaccounts","system:serviceaccounts:openshift-machine-config-operator","system:authenticated"]},"sourceIPs":["10.0.0.3"],"userAgent":"machine-config-daemon/v0.0.0 (linux/amd64) kubernetes/$Format/machine-config-daemon","objectRef":{"resource":"pods","namespace":"openshift-authentication","name":"oauth-openshift-b88f6b558-vrmzm","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2021-03-27T21:00:22.479711Z","stageTimestamp":"2021-03-27T21:00:22.483947Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-machine-config-operator:machine-config-daemon","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"machine-config-daemon\" of ClusterRole \"machine-config-daemon\" to ServiceAccount \"machine-config-daemon/openshift-machine-config-operator\""}}


which is at 2021-03-27T21:00:22.479711Z, corresponding to the removal timestamp above (within 1s).

Event log also has a Killing event at 20:59:44, and I don't see the pod anywhere in the must-gather.


Hence, I don't think this is a kubelet/API server problem.

Moving this bug to Monitoring Team/Prometheus.

Comment 7 Junqi Zhao 2021-04-06 01:55:19 UTC
FYI, expr for TargetDown
        expr: |
          100 * (count(up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service) /
            count(up unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)) > 10
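
For reference, the expression fires when, per job/namespace/service, more than 10% of scrape targets are down, ignoring targets on unschedulable (e.g. draining) nodes. Below is a small sketch of how one could evaluate the alert's inputs through the Prometheus HTTP API with the client_golang API client; the localhost:9090 address is an assumption for the example (reached, say, via a port-forward), and this is only a debugging aid for checking which targets the alert is counting.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes the Prometheus API is reachable at localhost:9090, for example
	// through a port-forward to one of the prometheus-k8s pods.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Evaluate the numerator of TargetDown: targets that are down and not
	// excused by running on an unschedulable node.
	query := `up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("query failed: %v", err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}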

Comment 9 Simon Pasquier 2021-04-06 12:33:57 UTC
Looking at the prometheus-k8s-0 logs [1], I can see that Prometheus got throttled by the API server:

level=error ts=2021-03-27T20:59:39.619Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T20:59:39.643Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.655Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T20:59:39.657Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.673Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.699Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.703Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.794Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.873Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.880Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"


The same is true for prometheus-k8s-1 [2]:

level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

I suspect that the deletion event for oauth-openshift-b88f6b558-vrmzm got lost and never reached the Kubernetes service discovery.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log

Comment 10 Clayton Coleman 2021-04-15 19:47:38 UTC
Ugh, this is the new apiserver throttling.  How long did you get throttled for?

Note that deletion events should be resynced - please tell me Prometheus service discovery is using an informer and resyncs periodically?

Comment 11 Simon Pasquier 2021-04-16 09:09:42 UTC
> How long did you get throttled for?

prometheus-k8s-0 logs show throttling from 20:59:34.712 to 20:59:39.88. prometheus-k8s-1 logs show throttling from 20:53:31.281 to 20:53:31.292 and again from 21:03:35.537 to 21:03:37.081.

The pod deletion happens around 21:00:22 so it doesn't coincide with the throttling periods.

>  please tell me prometheus service discovery is using an informer and resynces periodically?

Yes, the Kubernetes service discovery uses informers and has a default resync period of 10 minutes.
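
For illustration, here is a minimal client-go sketch of that informer pattern (not Prometheus's actual discovery code): a shared informer factory with a 10-minute resync watching Endpoints, with the openshift-authentication namespace and in-cluster credentials assumed purely for the example.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes in-cluster credentials with permission to list/watch endpoints.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// 10 minutes mirrors the default resync period mentioned above.
	factory := informers.NewSharedInformerFactoryWithOptions(
		clientset, 10*time.Minute, informers.WithNamespace("openshift-authentication"))

	epInformer := factory.Core().V1().Endpoints().Informer()
	epInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// A delete may arrive as a tombstone if the final state was missed.
			if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tombstone.Obj
			}
			if ep, ok := obj.(*corev1.Endpoints); ok {
				fmt.Printf("endpoints deleted: %s/%s\n", ep.Namespace, ep.Name)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; handlers keep firing on watch events and resyncs
}

Note that a resync only replays objects that are already in the informer's local cache; a deletion that never reached the cache is only reconciled once the underlying reflector re-lists.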

Comment 13 Jan Fajerski 2021-05-10 14:05:08 UTC
Reproducing this is proving rather difficult. I have not seen it during many attempts at recreating it.

A similar issue was fixed a while ago; https://github.com/prometheus/prometheus/issues/4124 has a discussion on it and links to the fix. That fix, however, is already included here.

Also https://github.com/prometheus-operator/prometheus-operator/issues/2328#issuecomment-468743681 outlines other potential sources for this issue.

In any case, assuming this is a similar issue, reloading Prometheus should work around it, either by sending SIGHUP to the process or by sending an HTTP POST to its /-/reload endpoint.
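
A minimal sketch of that reload workaround, assuming Prometheus runs with --web.enable-lifecycle and its web port is reachable at localhost:9090 (for example via a port-forward); otherwise sending SIGHUP to the prometheus process triggers the same reload:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// The reload endpoint expects a POST (or PUT); it is only served when
	// Prometheus runs with --web.enable-lifecycle.
	resp, err := http.Post("http://localhost:9090/-/reload", "", nil)
	if err != nil {
		log.Fatalf("reload request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("reload response:", resp.Status)
}

In a cluster, exec'ing into the prometheus container and signalling the process is usually the simpler path.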

Will look further into this, but this might be hard to catch. It seems that a heavy load has triggered similar issues in the past.

Comment 14 Jan Fajerski 2021-07-28 10:59:36 UTC
Closing this as a duplicate. The other bug has much more debug data attached to it and is generally more active.

If that doesn't work for somebody, please re-open.

*** This bug has been marked as a duplicate of bug 1982238 ***

Comment 15 W. Trevor King 2021-08-04 04:54:12 UTC
Looks like bug 1982238 was closed INSUFFICIENT_DATA.  I'm going to re-open this one, because two (now closed) cases and two separate bugs seem like a lot of noise for something that is supposedly a non-issue.

Comment 19 Jan Fajerski 2021-09-20 10:11:42 UTC
after some digging and discussion it looks like we need https://github.com/kubernetes/kubernetes/pull/100878, i.e. update k8s.io/client-go to at least 0.21.0 in impacted projects. See https://coreos.slack.com/archives/CB48XQ4KZ/p1631808575106700 for the discussion.

So far I believe we want to update client-go to 0.21.0 in prometheus, prometheus-operator and cluster-monitoring-operator for 4.8 and 4.7. I'll propose the PRs and look for other components that need this update (anything that creates informers iiuc).

Comment 20 Jan Fajerski 2021-09-20 11:31:35 UTC
(In reply to Jan Fajerski from comment #19)
> after some digging and discussion it looks like we need
> https://github.com/kubernetes/kubernetes/pull/100878, i.e. update
> k8s.io/client-go to at least 0.21.0 in impacted projects. See
> https://coreos.slack.com/archives/CB48XQ4KZ/p1631808575106700 for the
> discussion.
> 
> So far I believe we want to update client-go to 0.21.0 in prometheus,
> prometheus-operator and cluster-monitoring-operator for 4.8 and 4.7. I'll
> propose the PRs and look for other components that need this update
> (anything that creates informers iiuc).

Correction: client-go 0.22.0 is needed instead of 0.21.0.

Comment 25 Junqi Zhao 2021-09-24 02:54:13 UTC
Checked with 4.10.0-0.nightly-2021-09-23-031209: the Prometheus version is now 2.30.0, but the version label on the related resources is still 2.29.2, so we need one more PR to fix it.
# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
level=info ts=2021-09-23T10:31:00.721Z caller=main.go:455 msg="Starting Prometheus" version="(version=2.30.0, branch=rhaos-4.10-rhel-8, revision=174f97d0ba0031bbf7732d3269374a97b27f7e5b)"

Take the prometheus k8s resource as an example:
# oc -n openshift-monitoring get prometheus k8s -oyaml | grep labels -A6
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.29.2
    prometheus: k8s
  name: k8s
--
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: openshift-monitoring
      app.kubernetes.io/version: 2.29.2
  podMonitorNamespaceSelector:
    matchLabels:

Comment 26 Jan Fajerski 2021-09-29 14:52:13 UTC
The solution is still unclear; this needs further investigation.

Comment 27 Jan Fajerski 2021-09-30 09:49:28 UTC
According to https://search.ci.openshift.org/?search=Failed+to+watch+%5C*v1.Endpoints%3A+the+server+has+received+too+many+requests+and+has+asked+us+to+try+again+later&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on
this no longer seems to happen in CI, but I might not be searching correctly.

From what I can tell, this issue showed up when a change in the cluster caused three things in the same time frame: 1) heavy load on the apiserver, so that it starts to throttle requests, 2) a reconciliation of the monitoring stack and thus a Prometheus restart, and 3) endpoints being deleted.

The Prometheus restart caused its informers on various resources to be recreated, among them several informers on *v1.Endpoints for the various service monitors. So there is a short window during Prometheus startup where the endpoint deletion hasn't made it into the informers' cache yet. Once Prometheus startup completed and the informers were recreated, they would hit 1), causing the "Failed to watch..." messages. The new informers, however, would still serve the cached content, including the now-deleted Endpoints object. This did not resolve within the resync period.

Why this never resolved is unclear to me. Either the apiserver never lifted the throttling, in which case this is expected behavior and we need to take apiserver throttling into account in our alert, or something was buggy in the retry logic of the informers when the initial list failed. I suspected the latter and updated Prometheus to 2.30.0, which uses client-go 0.22.0 (instead of 0.21.0). This includes https://github.com/kubernetes/kubernetes/pull/100878, which looked like a potential fix to me. I was, however, informed that this is merely a performance optimization, despite the kind/bug and priority/important-soon labels, so backporting this fix is not an option.

This no longer seems to happen since the Prometheus update. However, it was reported against 4.8 and also against 4.7 in https://bugzilla.redhat.com/show_bug.cgi?id=1918333.

Requesting further input from the apiserver team, as it seems to me Prometheus is using SharedIndexInformers correctly (https://github.com/prometheus/prometheus/blob/main/discovery/kubernetes/kubernetes.go#L362-L397).

Comment 29 Lili Cosic 2021-10-15 13:35:17 UTC
Randomly came across this bug. Is this really not related to https://github.com/kubernetes/kube-state-metrics/issues/694 or https://github.com/kubernetes/kube-state-metrics/issues/1569? Or is this maybe the root cause of those two issues?

Comment 30 Jan Fajerski 2021-10-28 11:08:50 UTC
Simon found another instance of this behavior, this time not involving upgrades or apiserver throttling. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1451281811295440896.

My first inclination would be to see the KSM issues as a symptom of this bug. Prometheus sometimes keeps a Kubernetes SD target after the pod was deleted (and the apiserver reflects that).
Given the new CI run linked here, it seems less likely that this is an issue with the informers themselves. I will take the bug back and investigate how Prometheus processes informer updates.
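
One way to spot such a stale target from the outside is to list Prometheus's active scrape targets over its HTTP API and cross-check the down ones against the pods that actually still exist. A small sketch using the client_golang API client, again with localhost:9090 assumed for the example:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	targets, err := promAPI.Targets(ctx)
	if err != nil {
		log.Fatalf("listing targets: %v", err)
	}
	// Print every target Prometheus still scrapes but considers down; a pod
	// that no longer exists in the API showing up here is the stale-target case.
	for _, t := range targets.Active {
		if t.Health == promv1.HealthBad {
			fmt.Printf("down: %s pod=%s lastError=%q\n", t.ScrapeURL, t.Labels["pod"], t.LastError)
		}
	}
}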

Comment 31 Jan Fajerski 2021-11-10 11:22:28 UTC
*** Bug 1918333 has been marked as a duplicate of this bug. ***

Comment 33 Jan Fajerski 2021-12-09 12:05:51 UTC
*** Bug 2014286 has been marked as a duplicate of this bug. ***

Comment 34 Jan Fajerski 2021-12-22 13:13:37 UTC
*** Bug 2028928 has been marked as a duplicate of this bug. ***

Comment 36 Filip Petkovski 2022-02-07 09:43:07 UTC
A similar issue was just reported in Prometheus: https://github.com/prometheus/prometheus/issues/10257

Comment 37 Jan Fajerski 2022-02-25 14:43:45 UTC
https://github.com/prometheus/prometheus/issues/10257#issuecomment-1046784527 points out that unready pods can cause this. I'm not sure how OpenShift handles this case, but given that the symptom often coincides with heavy load on the apiserver, this might be our root cause as well.

Comment 39 Jan Fajerski 2022-03-22 12:19:13 UTC
*** Bug 2066533 has been marked as a duplicate of this bug. ***

Comment 55 Jan Fajerski 2022-11-04 15:35:15 UTC
Ack, thx for the update.
Unfortunately we still don't have a reproducer for this.

Comment 56 jamo luhrsen 2022-12-03 01:11:43 UTC
FYI, I've been working on this bug which may (or may not) be related to this:
  https://issues.redhat.com/browse/OCPBUGS-3887

Comment 58 jamo luhrsen 2022-12-15 16:15:49 UTC
(In reply to jamo luhrsen from comment #56)
> FYI, I've been working on this bug which may (or may not) be related to this:
>   https://issues.redhat.com/browse/OCPBUGS-3887

Recently, I dug a lot deeper into OCPBUGS-3887 and it really does feel like a duplicate of
this bug. As for a reproducer, we see this problem quite a bit in CI. The original
bz (https://bugzilla.redhat.com/show_bug.cgi?id=2022113) was closed so that it could
be tracked more easily in Jira. I'll mark OCPBUGS-3887 as a duplicate of this
bug's Jira mirror, which is OCPBUGS-27051.

Comment 61 Shiftzilla 2023-03-09 01:01:52 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8857

