Bug 2079679
Summary: | [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Monitoring | Assignee: | Simon Pasquier <spasquie> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | medium | Docs Contact: | Brian Burt <bburt> |
Priority: | medium | ||
Version: | 4.11 | CC: | anpicker, aos-bugs, bburt, sippy |
Target Milestone: | --- | ||
Target Release: | 4.12.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2023-01-17 19:48:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2108978 |
Description
W. Trevor King
2022-04-28 05:15:17 UTC
(In reply to W. Trevor King from comment #0)
> both out of openshift-kube-scheduler-ip-10-0-198-214.us-east-2.compute.internal. And in that node's journal:
>
> $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-210-174.us-east-2.compute.internal/journal | zgrep '16:39.*SyncLoop [A-Z].*prometheus-k8s-1'

Oops, 10.0.198.214 and 10.0.210.174 are two different nodes. Details on the right node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220427-153753.json | jq -r '.items[] | select(tostring | contains("node/ip-10-0-210-174.us-east-2.compute.internal")) | .from + " " + .to + " " + .locator + " " + .message' | grep -v 'pending longer than a minute'
2022-04-27T16:34:30Z 2022-04-27T16:36:42Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Update config/rendered-worker-4b8ae95eb3c069abd41c7a5ef1fdc3da roles/worker reached desired config roles/worker
2022-04-27T16:34:41Z 2022-04-27T16:35:32Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Drain roles/worker drained node
2022-04-27T16:35:32Z 2022-04-27T16:36:34Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Reboot roles/worker rebooted and kubelet started
2022-04-27T16:36:17Z 2022-04-27T16:36:35Z node/ip-10-0-210-174.us-east-2.compute.internal roles/worker node is not ready

And looking at machine-config events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.reason // "" | contains("Cordon"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message'
2022-04-27T16:30:08Z ip-10-0-170-33.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:30:10Z ip-10-0-145-93.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:32:26Z ip-10-0-185-114.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:34:35Z ip-10-0-157-60.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:34:41Z ip-10-0-210-174.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:39:04Z ip-10-0-198-214.us-east-2.compute.internal Cordon: Cordoned node to apply update

And backing up in the monitoring timeline a bit more:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and (.reason | . == "Scheduled" or . == "SuccessfulDelete"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep T16:3
2022-04-27T16:30:10Z thanos-querier-78c4659fb4-dhmvr Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-dhmvr to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:30:13Z prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:30:16Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:28Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:32:28Z kube-state-metrics-677b9d7b67-7csqf Scheduled: Successfully assigned openshift-monitoring/kube-state-metrics-677b9d7b67-7csqf to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z prometheus-adapter-cf7d875b9-ps7gx Scheduled: Successfully assigned openshift-monitoring/prometheus-adapter-cf7d875b9-ps7gx to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z telemeter-client-575d874cbd-n25qf Scheduled: Successfully assigned openshift-monitoring/telemeter-client-575d874cbd-n25qf to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z thanos-querier-78c4659fb4-wmr8z Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-wmr8z to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:29Z openshift-state-metrics-75967c9448-k9z2b Scheduled: Successfully assigned openshift-monitoring/openshift-state-metrics-75967c9448-k9z2b to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:30Z prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:31Z alertmanager-main-0 Scheduled: Successfully assigned openshift-monitoring/alertmanager-main-0 to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:34:43Z cluster-monitoring-operator-688dd856fd-c64x7 Scheduled: Successfully assigned openshift-monitoring/cluster-monitoring-operator-688dd856fd-c64x7 to ip-10-0-145-93.us-east-2.compute.internal
2022-04-27T16:34:43Z prometheus-adapter-cf7d875b9-cbwhh Scheduled: Successfully assigned openshift-monitoring/prometheus-adapter-cf7d875b9-cbwhh to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:34:44Z thanos-querier-78c4659fb4-mjf6c Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-mjf6c to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:34:46Z alertmanager-main-1 Scheduled: Successfully assigned openshift-monitoring/alertmanager-main-1 to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:36:42Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:14Z prometheus-operator-78f7dcb657-hwd5g Scheduled: Successfully assigned openshift-monitoring/prometheus-operator-78f7dcb657-hwd5g to ip-10-0-157-60.us-east-2.compute.internal
2022-04-27T16:39:35Z prometheus-k8s SuccessfulDelete: delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful
2022-04-27T16:39:37Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:44Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal

Hmm, 16:30:13 and 16:30:16 seem awfully close for prometheus-k8s-0 and prometheus-k8s-1 schedules, if they're actually PDB-protected. And there's a stuttered prometheus-k8s-1 Scheduled at 16:30:16 and 16:30:28 too. Why does this pod keep getting rescheduled? Let's drop into audit logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h /prometheus-k8s-1 kube-apiserver/*.log.gz | jq -r 'select(.verb == "delete") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort
2022-04-27T16:08:30.903259Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:08:32.866573Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:08:33.819926Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:08:41.652943Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:09:29.969926Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:10:16.990860Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:10:17.781785Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:10:24.996856Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:14.828179Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:30:16.096519Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:17.551916Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:30:28.271706Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:34:44.923949Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:35.422800Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:39:37.267367Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:38.073965Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:39:44.201756Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal

Hmm, so both 10.0.210.174 node creds and the StatefulSet controller are going after this pod. Looking at the StatefulSet controller with Loki:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496"} | unpack | namespace="openshift-kube-controller-manager" |~ "16:39.*terminating.*prometheus-k8s-1"

which gives:

I0427 16:39:35.408913 1 stateful_set_control.go:587] StatefulSet openshift-monitoring/prometheus-k8s terminating Pod prometheus-k8s-1 for update
I0427 16:39:38.065529 1 stateful_set_control.go:587] StatefulSet openshift-monitoring/prometheus-k8s terminating Pod prometheus-k8s-1 for update

Unclear to me what would be triggering StatefulSet pod updates mid-compute-pool rollout.
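One way to narrow down those "terminating Pod ... for update" deletions would be to check whether the StatefulSet's update revision actually moved, or whether the controller was chasing a spec the operator kept rewriting. A rough sketch against the same gather-extra artifacts (the jq selectors here are mine, not from the run; currentRevision/updateRevision and the controller-revision-hash pod label are the standard apps/v1 StatefulSet fields, and the gathered JSON only shows the end-of-run snapshot):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s") | "currentRevision=" + .status.currentRevision + " updateRevision=" + .status.updateRevision'
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name | startswith("prometheus-k8s-")) | .metadata.name + " " + .metadata.labels["controller-revision-hash"]'

If a pod's controller-revision-hash lags updateRevision, a delete-for-update is expected; if everything matches by the end of the run, the churn would point back at the operator rewriting the StatefulSet spec repeatedly during the window.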
$ zgrep -h statefulsets/prometheus-k8s kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " "
> + .user.username + " " + .requestURI' | grep 'T16:[34]' | sort
2022-04-27T16:30:14.765499Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:15.485620Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:16.722612Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:17.494930Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:18.156508Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:18.898501Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:19.443528Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:19.968805Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:30.451950Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:33.277503Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:47.129804Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:32:31.690305Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:33:01.452300Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:34:46.423534Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:34:46.949126Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:37:03.465312Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:34.779837Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:35.378564Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:35.999583Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:36.783810Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:37.240148Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:37.819953Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.093792Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.363319Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.722498Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:39.416947Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:40:00.759665Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s

Whoah. Not sure why the operator is so touchy...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-78f7dcb657-hwd5g_prometheus-operator.log | grep -1 'T16:39.*sync prometheus' | head -n5
level=info ts=2022-04-27T16:39:33.039027377Z caller=operator.go:1479 component=prometheusoperator key=openshift-monitoring/k8s msg="update prometheus status"
level=info ts=2022-04-27T16:39:33.039576448Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-27T16:39:33.745387047Z caller=operator.go:1479 component=prometheusoperator key=openshift-monitoring/k8s msg="update prometheus status"
level=info ts=2022-04-27T16:39:33.745872215Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-27T16:39:34.786589376Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"

Operator doesn't want to explain why it's so touchy. Back to the journal logs to see where the node deletion came from:

Hmm, this is surprising:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s").spec.podManagementPolicy'
Parallel

That's leaning fairly hard on the PDB? Because we don't really want parallel deletion, do we? Or maybe "Updates are not affected" [1] means it's ok?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s").spec.minReadySeconds'
null

Might be worth setting minReadySeconds [2].
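If someone did want to poke at minReadySeconds by hand, a throwaway patch could look something like the sketch below. This is illustrative only: the 15-second value is arbitrary, the cluster-monitoring-operator and prometheus-operator own this StatefulSet and would stomp a manual edit, and StatefulSet minReadySeconds requires a Kubernetes version where that feature is enabled.

$ oc -n openshift-monitoring get statefulset prometheus-k8s -o jsonpath='{.spec.podManagementPolicy}{" "}{.spec.minReadySeconds}{"\n"}'
$ oc -n openshift-monitoring patch statefulset prometheus-k8s --type merge -p '{"spec": {"minReadySeconds": 15}}'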
Conditions on the pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status' | sort
2022-04-27T16:39:44Z PodScheduled=True
2022-04-27T16:39:53Z Initialized=True
2022-04-27T16:40:00Z ContainersReady=True
2022-04-27T16:40:00Z Ready=True

Filling in the timeline a bit more:

* 16:34:41, 10.0.210.174 cordoned, starts draining
* 16:34:44, 10.0.210.174 node creds request prometheus-k8s-1 deletion
* 16:35:32, 10.0.210.174 finishes draining, begins rebooting
* 16:36:35, 10.0.210.174 becomes ready
* 16:36:42, prometheus-k8s-1 bound to 10.0.210.174
* 16:39:15, Watchdog goes out.
* 16:39:35, statefulset controller requests prometheus-k8s-1 deletion for update, Prometheus operator banging away on the StatefulSet, unclear what changed
* 16:39:37.267367, 10.0.210.174 node creds request prometheus-k8s-1 deletion, unclear why
* 16:39:37.322638, prometheus-k8s-1 bound to 10.0.210.174
* 16:39:38, statefulset controller requests prometheus-k8s-1 deletion for update, Prometheus operator banging away on the StatefulSet, unclear what changed
* 16:39:44.201756, 10.0.210.174 node creds request prometheus-k8s-1 deletion, unclear why
* 16:39:44.205653, prometheus-k8s-1 removed
* 16:39:44.256165, prometheus-k8s-1 bound to 10.0.210.174
* 16:40:00, prometheus-k8s-1 goes Ready=True
* 16:40:46, Watchdog comes back.

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#parallel-pod-management
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates

Hmm, I kind of suspect [1].
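As a cross-check, the Watchdog gap in that timeline (out at 16:39:15, back at 16:40:46) should be recoverable straight from the intervals file queried earlier for the node events. A sketch, assuming the alert intervals are recorded in the same file under an alert/Watchdog locator:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220427-153753.json | jq -r '.items[] | select(tostring | contains("alert/Watchdog")) | .from + " " + .to + " " + .locator + " " + .message'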
Picking an older run [2,3], the Prometheus operator logs then looked like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519018020754690048/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-6fc8d765fd-p26hl_prometheus-operator.log | grep 'sync prometheus'
level=info ts=2022-04-26T19:23:31.275393173Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:23:32.144137378Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:38:29.897130609Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:38:47.198335294Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:53:29.914563739Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:53:47.493613417Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T20:08:29.802862158Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T20:08:47.165966489Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"

Seems like it used to like to touch in pairs, but at least it wasn't doing 16 touches in a single minute:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-78f7dcb657-hwd5g_prometheus-operator.log | grep 'T16:39:.*sync prometheus' | wc -l
16

[1]: https://github.com/openshift/prometheus-operator/pull/165
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519018020754690048

searched with https://search.ci.openshift.org/?search=which+may+be+a+sign+of+a+Prometheus+outage+in+violation+of+the+prometheus+query+SLO+of+100%25+uptime&maxAge=48h&context=1&type=bug%2Bjunit&name=4.11.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

still see the issue

1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1521933787129188352 :

[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes
{ Watchdog alert had 24631 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
May 04 20:02:19.246 - 3170s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
May 04 20:55:40.246 - 487s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}}

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1521933759518085120 :

[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes
{ Watchdog alert had 26575 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
May 04 19:57:49.299 - 4529s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
May 04 21:13:49.299 - 514s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}}

Moving back to ASSIGNED because we need to backport a patch [1] from upstream prometheus operator that avoids simultaneous restarts of Prometheus. The PR is https://github.com/openshift/prometheus-operator/pull/189.

[1] https://github.com/prometheus-operator/prometheus-operator/commit/ceac7c9c4b436f37fca2e718153b724ee92be9c9

https://github.com/openshift/origin/pull/27201 caused a regression on single node jobs and got reverted.

searched with https://search.ci.openshift.org/?search=which+may+be+a+sign+of+a+Prometheus+outage+in+violation+of+the+prometheus+query+SLO+of+100%25+uptime&maxAge=48h&context=1&type=bug%2Bjunit&name=4.12.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

did not see the issue, close it

Removing the requires_doc_text because the bug fix has been backported to 4.11.z already (see bug 2108978).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399