In https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152, during node upgrades the oauth-openshift-...-vrmzm pod is marked as deleted at 20:59:42 and is still not fully deleted 50 minutes later. This causes TargetDown to fire because the pod is unready and the deletion never completes.

Mar 27 20:59:42.790 W ns/openshift-authentication pod/oauth-openshift-b88f6b558-vrmzm node/**************-17f95-gdtt6-master-1 reason/GracefulDelete in 40s

Marking this high severity because it is a bad state to get into and it effectively wedged the whole upgrade process. It may be similar to other failures we are seeing in CI (such as the pod submit test still failing occasionally). This is a release-blocking bug and is potentially high impact to CI (although I have not quantified the impact yet).
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1034/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1375424947886755840 has a similar symptom.

https://search.ci.openshift.org/?search=alert+TargetDown+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Look for TargetDown > 180s.
Interestingly, kube-state-metrics doesn't see the pod after 21:00, which may mean it was deleted but the endpoint change is not propagating to Prometheus. If we can show the deletion event in the apiserver output for that job, we can potentially move this bug to Prometheus (which may not be seeing the endpoint change propagate).
Marking blocker per Clayton's comment.
Logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-pm7zl9i9-17f95-gdtt6-master-1/journal

First delete:

Mar 27 20:59:44.176181 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 20:59:44.161867 1535 kubelet.go:1936] SyncLoop (DELETE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"

Second delete + remove:

Mar 27 21:00:21.786533 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786484 1535 kubelet.go:1936] SyncLoop (DELETE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"
Mar 27 21:00:21.792626 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786659 1535 kubelet.go:1930] SyncLoop (REMOVE, "api"): "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)"
Mar 27 21:00:21.792626 ci-op-pm7zl9i9-17f95-gdtt6-master-1 hyperkube[1535]: I0327 21:00:21.786745 1535 kubelet.go:2120] Failed to delete pod "oauth-openshift-b88f6b558-vrmzm_openshift-authentication(85f1ee3a-acc9-4af0-8ec5-ac0f83cbf8e3)", err: pod not found

There are no further messages for the pod after that. It looks like the pod was deleted successfully within about 40s of the deletion being initiated and should have been removed from the API. I'll look at the API server logs next.
Looking through the API server audit logs, the last entry I see for this pod is in ci-op-pm7zl9i9-17f95-gdtt6-master-1-audit-2021-03-27T21-41-09.028.log.gz:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"fac31812-2e13-4d04-a69b-35f14a6ce994","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/openshift-authentication/pods/oauth-openshift-b88f6b558-vrmzm","verb":"get","user":{"username":"system:serviceaccount:openshift-machine-config-operator:machine-config-daemon","uid":"0490cbbb-3d91-48d7-b23b-1b6b3db9cefa","groups":["system:serviceaccounts","system:serviceaccounts:openshift-machine-config-operator","system:authenticated"]},"sourceIPs":["10.0.0.3"],"userAgent":"machine-config-daemon/v0.0.0 (linux/amd64) kubernetes/$Format/machine-config-daemon","objectRef":{"resource":"pods","namespace":"openshift-authentication","name":"oauth-openshift-b88f6b558-vrmzm","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","reason":"NotFound","code":404},"requestReceivedTimestamp":"2021-03-27T21:00:22.479711Z","stageTimestamp":"2021-03-27T21:00:22.483947Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:openshift-machine-config-operator:machine-config-daemon","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"machine-config-daemon\" of ClusterRole \"machine-config-daemon\" to ServiceAccount \"machine-config-daemon/openshift-machine-config-operator\""}}

That entry is at 2021-03-27T21:00:22.479711Z, which corresponds to the removal timestamp above (within 1s). The event log also has a Killing event at 20:59:44, and I don't see the pod anywhere in the must-gather. Hence, I don't think this is a kubelet/API server problem. Moving this bug to the Monitoring Team/Prometheus.
FYI, the expr for TargetDown:

100 * (
  count(up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)
  /
  count(up unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)
) > 10
Looking at the prometheus-k8s-0 logs [1], I can see that Prometheus got throttled by the API server:

level=error ts=2021-03-27T20:59:39.619Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T20:59:39.643Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.655Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T20:59:39.657Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.673Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.699Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.703Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.794Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T20:59:39.873Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T20:59:39.880Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

The same is true for prometheus-k8s-1 [2]:

level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"
level=error ts=2021-03-27T21:03:35.537Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: the server has received too many requests and has asked us to try again later (get services)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:429: Failed to watch *v1.Pod: the server has received too many requests and has asked us to try again later (get pods)"
level=error ts=2021-03-27T21:03:35.538Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: the server has received too many requests and has asked us to try again later (get endpoints)"

I suspect that the deletion event for oauth-openshift-b88f6b558-vrmzm got lost and never reached the kubernetes service discovery.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1375890420118065152/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log
Ugh, this is the new apiserver throttling. How long did you get throttled for? Note that deletion events should be resynced. Please tell me Prometheus service discovery is using an informer and resyncs periodically?
> How long did you get throttled for?

The prometheus-k8s-0 logs show throttling from 20:59:34.712 to 20:59:39.88. The prometheus-k8s-1 logs show throttling from 20:53:31.281 to 20:53:31.292 and again from 21:03:35.537 to 21:03:37.081. The pod deletion happens around 21:00:22, so it doesn't coincide with the throttling periods.

> please tell me prometheus service discovery is using an informer and resyncs periodically?

Yes, the kubernetes service discovery uses informers and has a default resync period of 10 minutes.
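For context, a minimal Go sketch of how an Endpoints informer with the 10-minute resync period is wired up via client-go. This mirrors the general shape of Prometheus' kubernetes SD, not its exact code; the package, function and namespace names are illustrative only.

package sdsketch

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// newEndpointsInformer builds a shared informer over Endpoints in one
// namespace with a 10-minute resync period. Note that a resync only
// replays the informer's own cache to its handlers; it does not re-list
// from the apiserver, so a deletion the cache never saw is not recovered
// by the resync alone.
func newEndpointsInformer(client kubernetes.Interface, namespace string) cache.SharedIndexInformer {
    lw := &cache.ListWatch{
        ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
            return client.CoreV1().Endpoints(namespace).List(context.TODO(), options)
        },
        WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
            return client.CoreV1().Endpoints(namespace).Watch(context.TODO(), options)
        },
    }
    return cache.NewSharedIndexInformer(lw, &corev1.Endpoints{}, 10*time.Minute, cache.Indexers{})
}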
Reproducing this proves to be rather difficult; I have not seen it during many attempts at recreating it. A similar issue was fixed a while ago: https://github.com/prometheus/prometheus/issues/4124 has a discussion of it and links to the fix. That fix, however, is already included here. https://github.com/prometheus-operator/prometheus-operator/issues/2328#issuecomment-468743681 also outlines other potential sources for this issue. In any case, assuming this is a similar issue, reloading prometheus should work around it, either by sending SIGHUP or by sending a POST to the -/reload endpoint of prometheus. Will look further into this, but this might be hard to catch. It seems that heavy load has triggered similar issues in the past.
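For completeness, a minimal sketch of triggering that reload over HTTP. It assumes the lifecycle endpoint is reachable at localhost:9090 and that --web.enable-lifecycle is set (both assumptions about the deployment); Prometheus expects a POST here, a plain GET is rejected.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Trigger a configuration/service-discovery reload; equivalent to
    // sending SIGHUP to the prometheus process.
    resp, err := http.Post("http://localhost:9090/-/reload", "", nil)
    if err != nil {
        fmt.Println("reload failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("reload returned:", resp.Status)
}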
Closing this as a duplicate. The other bug has much more debug data attached to it and is generally more active. If that doesn't work for somebody, please re-open.

*** This bug has been marked as a duplicate of bug 1982238 ***
Looks like bug 1982238 was closed INSUFFICIENT_DATA. I'm going to re-open this one, because two (now closed) cases and two separate bugs seem like a lot of noise for something that is actually a non-issue.
After some digging and discussion, it looks like we need https://github.com/kubernetes/kubernetes/pull/100878, i.e. we need to update k8s.io/client-go to at least 0.21.0 in the impacted projects. See https://coreos.slack.com/archives/CB48XQ4KZ/p1631808575106700 for the discussion.

So far I believe we want to update client-go to 0.21.0 in prometheus, prometheus-operator and cluster-monitoring-operator for 4.8 and 4.7. I'll propose the PRs and look for other components that need this update (anything that creates informers, iiuc).
(In reply to Jan Fajerski from comment #19)
> after some digging and discussion it looks like we need
> https://github.com/kubernetes/kubernetes/pull/100878, i.e. update
> k8s.io/client-go to at least 0.21.0 in impacted projects. See
> https://coreos.slack.com/archives/CB48XQ4KZ/p1631808575106700 for the
> discussion.
>
> So far I believe we want to update client-go to 0.21.0 in prometheus,
> prometheus-operator and cluster-monitoring-operator for 4.8 and 4.7. I'll
> propose the PRs and look for other components that need this update
> (anything that creates informers iiuc).

Correction: client-go 0.22.0 is needed instead of 0.21.0.
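For illustration only, the dependency bump in an affected component's go.mod would look roughly like this; the companion k8s.io module versions are assumptions, and each repository pins its own set.

require (
    k8s.io/api v0.22.0
    k8s.io/apimachinery v0.22.0
    k8s.io/client-go v0.22.0
)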
Checked with 4.10.0-0.nightly-2021-09-23-031209: the prometheus version is now 2.30.0, but the related resources' version label is still 2.29.2, so one more PR is needed to fix that.

# oc -n openshift-monitoring logs -c prometheus prometheus-k8s-0
level=info ts=2021-09-23T10:31:00.721Z caller=main.go:455 msg="Starting Prometheus" version="(version=2.30.0, branch=rhaos-4.10-rhel-8, revision=174f97d0ba0031bbf7732d3269374a97b27f7e5b)"

Take the prometheus "k8s" resource as an example:

# oc -n openshift-monitoring get prometheus k8s -oyaml | grep labels -A6
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.29.2
    prometheus: k8s
  name: k8s
--
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: openshift-monitoring
      app.kubernetes.io/version: 2.29.2
  podMonitorNamespaceSelector:
    matchLabels:
Solution is still unclear, needs further investigation.
According to https://search.ci.openshift.org/?search=Failed+to+watch+%5C*v1.Endpoints%3A+the+server+has+received+too+many+requests+and+has+asked+us+to+try+again+later&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on this seems to no longer happen in CI, but I might not be searching correctly.

From what I can tell, this issue showed up when a change in the cluster caused three things in the same time frame: 1) heavy load on the apiserver, so that it started to throttle requests, 2) a reconciliation of the monitoring stack, and thus a prometheus restart, and 3) endpoints being deleted.

The prometheus restart caused its informers on various resources to be recreated, among them several informers on *v1.Endpoints for the various service monitors. So there is a short window during prometheus startup where the endpoint deletion hasn't made it into the informers' caches yet. Once prometheus startup completed and the informers were recreated, they would hit 1), causing the "Failed to watch..." messages. The new informers, however, would still serve the cached content, including the now-deleted Endpoints object. This did not resolve within the resync period.

Why this never resolved is unclear to me. Either the apiserver never lifted the throttling, in which case this is expected behavior and we need to take apiserver throttling into account in our alert, or something was buggy in the retry logic of informers when the initial list failed. I suspected the latter and updated prometheus to 2.30.0, which uses client-go 0.22.0 (instead of 0.21.0). This includes https://github.com/kubernetes/kubernetes/pull/100878, which looked like a potential fix to me. I was, however, informed that this is merely a performance optimization, despite the kind/bug and priority/important-soon labels, so backporting this fix is not an option.

This seems to no longer happen since the prometheus update. However, this was reported against 4.8, and against 4.7 in https://bugzilla.redhat.com/show_bug.cgi?id=1918333. Requesting further input from the apiserver team, as it seems to me prometheus is using SharedIndexInformers correctly (https://github.com/prometheus/prometheus/blob/main/discovery/kubernetes/kubernetes.go#L362-L397).
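To make the informer-side question concrete, here is a hedged sketch of the delete handling a SharedIndexInformer consumer needs: after a broken watch, the deletion may only surface during the relist, in which case client-go delivers a cache.DeletedFinalStateUnknown tombstone instead of the object itself. This is generic client-go behavior and not a claim about where (or whether) the Prometheus SD code gets it wrong.

package sdsketch

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// onDelete shows the shape of a DeleteFunc that does not lose deletions
// observed only via a relist: the object may arrive wrapped in a
// DeletedFinalStateUnknown tombstone rather than as *corev1.Endpoints.
func onDelete(obj interface{}) {
    eps, ok := obj.(*corev1.Endpoints)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            return // neither an Endpoints object nor a tombstone
        }
        eps, ok = tombstone.Obj.(*corev1.Endpoints)
        if !ok {
            return
        }
    }
    // At this point the SD layer would drop the target group derived
    // from eps so the stale targets disappear from service discovery.
    _ = eps
}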
Randomly came across this bug. Is this really not related to https://github.com/kubernetes/kube-state-metrics/issues/694 or https://github.com/kubernetes/kube-state-metrics/issues/1569? Or is this maybe the root cause of those two issues?
Simon found another instance of this behavior, this time not involving upgrades or apiserver throttling: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade/1451281811295440896

My first inclination would be to see the KSM issues as a symptom of this bug: Prometheus sometimes keeps a k8s SD target after the pod was deleted (and the apiserver reflects that). Given the new CI run linked here, it seems less likely that this is an issue with informers. I will take the bug back and investigate how Prometheus processes informer updates.
*** Bug 1918333 has been marked as a duplicate of this bug. ***
*** Bug 2014286 has been marked as a duplicate of this bug. ***
*** Bug 2028928 has been marked as a duplicate of this bug. ***
A similar issue was just reported in Prometheus: https://github.com/prometheus/prometheus/issues/10257
https://github.com/prometheus/prometheus/issues/10257#issuecomment-1046784527 points out that unready pods can cause this. I'm not sure how OpenShift handles this case, but given that the symptom often coincides with heavy load on the apiserver, this might be our root cause as well.
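To illustrate the unready-pod theory: in the Endpoints object, an unready pod's address sits under notReadyAddresses, and Prometheus' endpoints discovery still emits a target for it (labelled __meta_kubernetes_endpoint_ready="false"), so such a pod can keep an up == 0 series alive even though it serves no traffic. A small hedged sketch of where that data lives:

package sdsketch

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// printNotReady lists the addresses an Endpoints object reports as not
// ready; these are the addresses that still become scrape targets with
// __meta_kubernetes_endpoint_ready="false" in Prometheus' endpoints SD.
func printNotReady(eps *corev1.Endpoints) {
    for _, subset := range eps.Subsets {
        for _, addr := range subset.NotReadyAddresses {
            fmt.Printf("not ready: %s (pod %v)\n", addr.IP, addr.TargetRef)
        }
    }
}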
*** Bug 2066533 has been marked as a duplicate of this bug. ***
Ack, thx for the update. Unfortunately we still don't have a reproducer for this.
FYI, I've been working on this bug which may (or may not) be related to this: https://issues.redhat.com/browse/OCPBUGS-3887
(In reply to jamo luhrsen from comment #56)
> FYI, I've been working on this bug which may (or may not) be related to this:
> https://issues.redhat.com/browse/OCPBUGS-3887

Recently, I dug a lot deeper into OCPBUGS-3887 and it really does feel like a duplicate of this bug. As for a reproducer, we see this problem quite a bit in CI.

The original bz (https://bugzilla.redhat.com/show_bug.cgi?id=2022113) was closed so that it could be tracked more easily in Jira. I'll mark OCPBUGS-3887 as a duplicate of this bug's Jira mirror, which is OCPBUGS-27051.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8857