[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes is failing frequently in CI, see [1,2]. Seems like it was doing great for days, but started to struggle around [3]:

[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes
0s
{ Watchdog alert had 24841 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
Apr 27 15:37:53.407 - 3682s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
Apr 27 16:40:46.407 - 313s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}}

Popping open the first intervals chart, I don't see "24841 changes", but I do see an outage as 10.0.198.214 rolled.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220427-153753.json | jq -r '.items[] | select(tostring | contains("Watchdog") or contains("node/ip-10-0-198-214.us-east-2.compute.internal")) | .from + " " + .to + " " + .locator + " " + .message' | grep -v 'pending longer than a minute'
2022-04-27T15:37:53Z 2022-04-27T16:39:15Z alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
2022-04-27T15:37:53Z 2022-04-27T15:37:56Z e2e-test/"[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" e2e test finished As "Passed"
2022-04-27T16:38:53Z 2022-04-27T16:43:28Z node/ip-10-0-198-214.us-east-2.compute.internal reason/NodeUpdate phase/Update config/rendered-master-8a9c5795e938eb0543c300eac049a5e6 roles/master reached desired config roles/master
2022-04-27T16:39:04Z 2022-04-27T16:40:34Z node/ip-10-0-198-214.us-east-2.compute.internal reason/NodeUpdate phase/Drain roles/master drained node
2022-04-27T16:40:34Z 2022-04-27T16:43:16Z node/ip-10-0-198-214.us-east-2.compute.internal reason/NodeUpdate phase/Reboot roles/master rebooted and kubelet started
2022-04-27T16:40:46Z 2022-04-27T16:45:59Z alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
2022-04-27T16:41:33Z 2022-04-27T16:43:16Z node/ip-10-0-198-214.us-east-2.compute.internal roles/master node is not ready

And looking for pod events around that time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and (.reason | . == "Scheduled" or . == "SuccessfulDelete"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep '2022-04-27T16:\(3[89]\|4[0-3]\)'
2022-04-27T16:39:14Z prometheus-operator-78f7dcb657-hwd5g Scheduled: Successfully assigned openshift-monitoring/prometheus-operator-78f7dcb657-hwd5g to ip-10-0-157-60.us-east-2.compute.internal
2022-04-27T16:39:35Z prometheus-k8s SuccessfulDelete: delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful
2022-04-27T16:39:37Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:44Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal

Hrm, two separate schedules? In Loki:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496"} | unpack | namespace="openshift-kube-scheduler" |~ "I0427 16:39.*Successfully bound pod to node.*prometheus-k8s-1"

gives:

I0427 16:39:37.322638 1 scheduler.go:623] "Successfully bound pod to node" pod="openshift-monitoring/prometheus-k8s-1" node="ip-10-0-210-174.us-east-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0427 16:39:44.256165 1 scheduler.go:623] "Successfully bound pod to node" pod="openshift-monitoring/prometheus-k8s-1" node="ip-10-0-210-174.us-east-2.compute.internal" evaluatedNodes=6 feasibleNodes=1

both out of openshift-kube-scheduler-ip-10-0-198-214.us-east-2.compute.internal. And in that node's journal:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-210-174.us-east-2.compute.internal/journal | zgrep '16:39.*SyncLoop [A-Z].*prometheus-k8s-1'
Apr 27 16:39:35.445222 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:35.442202 1517 kubelet.go:2122] "SyncLoop DELETE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:37.196745 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:37.196725 1517 kubelet.go:2122] "SyncLoop DELETE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:37.264076 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:37.264050 1517 kubelet.go:2116] "SyncLoop REMOVE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:37.327441 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:37.327413 1517 kubelet.go:2106] "SyncLoop ADD" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:38.077034 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:38.076946 1517 kubelet.go:2122] "SyncLoop DELETE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:42.430683 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:42.430656 1517 kubelet.go:2122] "SyncLoop DELETE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:44.182249 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:44.182224 1517 kubelet.go:2122] "SyncLoop DELETE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:44.205653 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:44.205253 1517 kubelet.go:2116] "SyncLoop REMOVE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:44.258465 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:44.258438 1517 kubelet.go:2106] "SyncLoop ADD" source="api" pods=[openshift-monitoring/prometheus-k8s-1]
Apr 27 16:39:52.210916 ip-10-0-210-174 hyperkube[1517]: I0427 16:39:52.210887 1517 kubelet.go:2113] "SyncLoop UPDATE" source="api" pods=[openshift-monitoring/prometheus-k8s-1]

So timeline:

* 16:39:04, node starts draining
* 16:39:15, Watchdog goes out.
* 16:39:37, prometheus-k8s-1 bound to node
* 16:39:44, prometheus-k8s-1 removed
* 16:39:44, prometheus-k8s-1 bound to node
* 16:40:34, node finishes draining

I dunno. Maybe there's an issue with cordons vs. taints? (A quick way to compare the two from the gathered node data is sketched after the footnotes below.) Pretty common:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=Watchdog+must+have+no+gaps' | grep 'failures match' | grep -v rehearse | sort
aggregated-aws-ovn-upgrade-4.11-minor-release-openshift-release-analysis-aggregator (all) - 12 runs, 33% failed, 25% of failures match = 8% impact
aggregated-aws-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator (all) - 6 runs, 33% failed, 100% of failures match = 33% impact
aggregated-azure-sdn-upgrade-4.11-minor-release-openshift-release-analysis-aggregator (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
aggregated-gcp-ovn-upgrade-4.11-minor-release-openshift-release-analysis-aggregator (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
aggregated-gcp-sdn-upgrade-4.11-micro-release-openshift-release-analysis-aggregator (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
openshift-etcd-118-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 20 runs, 95% failed, 11% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-rt-upgrade (all) - 8 runs, 50% failed, 25% of failures match = 13% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 9 runs, 56% failed, 60% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 9 runs, 78% failed, 29% of failures match = 22% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 120 runs, 75% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 58 runs, 55% failed, 9% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 120 runs, 42% failed, 24% of failures match = 10% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 122 runs, 58% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade-rollback (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade (all) - 12 runs, 50% failed, 17% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 120 runs, 68% failed, 7% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 8 runs, 100% failed, 38% of failures match = 38% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-upgrade (all) - 60 runs, 68% failed, 22% of failures match = 15% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 12% of failures match = 12% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 57 runs, 23% failed, 46% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-serial-ovn-ipv6 (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 14 runs, 100% failed, 29% of failures match = 29% impact
periodic-ci-openshift-release-master-okd-4.11-e2e-vsphere (all) - 11 runs, 91% failed, 10% of failures match = 9% impact
periodic-ci-openshift-release-master-okd-4.11-upgrade-from-okd-4.10-e2e-upgrade-aws (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-azure-upgrade (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-aws-serial (all) - 8 runs, 50% failed, 25% of failures match = 13% impact
pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic-upgrade (all) - 10 runs, 50% failed, 20% of failures match = 10% impact
pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact
pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade (all) - 19 runs, 74% failed, 7% of failures match = 5% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade (all) - 12 runs, 50% failed, 17% of failures match = 8% impact
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade (all) - 32 runs, 47% failed, 13% of failures match = 6% impact
pull-ci-openshift-insights-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 21 runs, 29% failed, 17% of failures match = 5% impact
pull-ci-openshift-metal3-dev-scripts-master-e2e-metal-ipi-serial-ovn-ipv6 (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
pull-ci-openshift-oc-master-e2e-aws-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
pull-ci-openshift-oc-master-e2e-metal-ipi-ovn-ipv6 (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade (all) - 53 runs, 75% failed, 13% of failures match = 9% impact
pull-ci-openshift-origin-master-e2e-gcp-upgrade (all) - 74 runs, 46% failed, 26% of failures match = 12% impact
pull-ci-openshift-ovn-kubernetes-master-4.11-upgrade-from-stable-4.10-local-gateway-e2e-aws-ovn-upgrade (all) - 7 runs, 57% failed, 25% of failures match = 14% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn (all) - 8 runs, 63% failed, 20% of failures match = 13% impact
pull-ci-openshift-prometheus-operator-master-e2e-aws-upgrade (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
release-openshift-origin-installer-launch-azure-modern (all) - 75 runs, 40% failed, 3% of failures match = 1% impact
release-openshift-origin-installer-launch-gcp-modern (all) - 321 runs, 19% failed, 2% of failures match = 0% impact

[1]: https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bbz-monitoring%5D%5Binvariant%5D%20alert%2FWatchdog%20must%20have%20no%20gaps%20or%20changes
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/15193318918979624
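
On the cordon-vs-taint idea, one way to line the two up is to dump each node's unschedulable flag next to its taints from the gathered data. This is only a sketch, and it assumes gather-extra's nodes.json holds the usual v1 Node objects:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[] | .metadata.name + " unschedulable=" + ((.spec.unschedulable // false) | tostring) + " taints=" + ([.spec.taints[]?.key] | join(","))'

Cordoning sets spec.unschedulable and should also add the node.kubernetes.io/unschedulable taint, so a mismatch between those two columns at gather time would be interesting.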
(In reply to W. Trevor King from comment #0)
> both out of openshift-kube-scheduler-ip-10-0-198-214.us-east-2.compute.internal. And in that node's journal:
>
> $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-210-174.us-east-2.compute.internal/journal | zgrep '16:39.*SyncLoop [A-Z].*prometheus-k8s-1'

Oops, 10.0.198.214 and 10.0.210.174 are two different nodes. Details on the right node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/junit/e2e-intervals_everything_20220427-153753.json | jq -r '.items[] | select(tostring | contains("node/ip-10-0-210-174.us-east-2.compute.internal")) | .from + " " + .to + " " + .locator + " " + .message' | grep -v 'pending longer than a minute'
2022-04-27T16:34:30Z 2022-04-27T16:36:42Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Update config/rendered-worker-4b8ae95eb3c069abd41c7a5ef1fdc3da roles/worker reached desired config roles/worker
2022-04-27T16:34:41Z 2022-04-27T16:35:32Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Drain roles/worker drained node
2022-04-27T16:35:32Z 2022-04-27T16:36:34Z node/ip-10-0-210-174.us-east-2.compute.internal reason/NodeUpdate phase/Reboot roles/worker rebooted and kubelet started
2022-04-27T16:36:17Z 2022-04-27T16:36:35Z node/ip-10-0-210-174.us-east-2.compute.internal roles/worker node is not ready

And looking at machine-config events:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.reason // "" | contains("Cordon"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message'
2022-04-27T16:30:08Z ip-10-0-170-33.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:30:10Z ip-10-0-145-93.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:32:26Z ip-10-0-185-114.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:34:35Z ip-10-0-157-60.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:34:41Z ip-10-0-210-174.us-east-2.compute.internal Cordon: Cordoned node to apply update
2022-04-27T16:39:04Z ip-10-0-198-214.us-east-2.compute.internal Cordon: Cordoned node to apply update

And backing up in the monitoring timeline a bit more:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and (.reason | . == "Scheduled" or . == "SuccessfulDelete"))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .involvedObject.name + " " + .reason + ": " + .message' | grep T16:3
2022-04-27T16:30:10Z thanos-querier-78c4659fb4-dhmvr Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-dhmvr to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:30:13Z prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:30:16Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:28Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:32:28Z kube-state-metrics-677b9d7b67-7csqf Scheduled: Successfully assigned openshift-monitoring/kube-state-metrics-677b9d7b67-7csqf to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z prometheus-adapter-cf7d875b9-ps7gx Scheduled: Successfully assigned openshift-monitoring/prometheus-adapter-cf7d875b9-ps7gx to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z telemeter-client-575d874cbd-n25qf Scheduled: Successfully assigned openshift-monitoring/telemeter-client-575d874cbd-n25qf to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:28Z thanos-querier-78c4659fb4-wmr8z Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-wmr8z to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:29Z openshift-state-metrics-75967c9448-k9z2b Scheduled: Successfully assigned openshift-monitoring/openshift-state-metrics-75967c9448-k9z2b to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:30Z prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:32:31Z alertmanager-main-0 Scheduled: Successfully assigned openshift-monitoring/alertmanager-main-0 to ip-10-0-170-33.us-east-2.compute.internal
2022-04-27T16:34:43Z cluster-monitoring-operator-688dd856fd-c64x7 Scheduled: Successfully assigned openshift-monitoring/cluster-monitoring-operator-688dd856fd-c64x7 to ip-10-0-145-93.us-east-2.compute.internal
2022-04-27T16:34:43Z prometheus-adapter-cf7d875b9-cbwhh Scheduled: Successfully assigned openshift-monitoring/prometheus-adapter-cf7d875b9-cbwhh to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:34:44Z thanos-querier-78c4659fb4-mjf6c Scheduled: Successfully assigned openshift-monitoring/thanos-querier-78c4659fb4-mjf6c to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:34:46Z alertmanager-main-1 Scheduled: Successfully assigned openshift-monitoring/alertmanager-main-1 to ip-10-0-185-114.us-east-2.compute.internal
2022-04-27T16:36:42Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:14Z prometheus-operator-78f7dcb657-hwd5g Scheduled: Successfully assigned openshift-monitoring/prometheus-operator-78f7dcb657-hwd5g to ip-10-0-157-60.us-east-2.compute.internal
2022-04-27T16:39:35Z prometheus-k8s SuccessfulDelete: delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful
2022-04-27T16:39:37Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:44Z prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-210-174.us-east-2.compute.internal

Hmm, 16:30:13 and 16:30:16 seem awfully close for the prometheus-k8s-0 and prometheus-k8s-1 schedules, if they're actually PDB-protected. And there's a stuttered prometheus-k8s-1 Scheduled at 16:30:16 and 16:30:28 too. Why does this pod keep getting rescheduled? Let's drop into audit logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h /prometheus-k8s-1 kube-apiserver/*.log.gz | jq -r 'select(.verb == "delete") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " " + .user.username' | sort
2022-04-27T16:08:30.903259Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:08:32.866573Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:08:33.819926Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:08:41.652943Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:09:29.969926Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:10:16.990860Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:10:17.781785Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:10:24.996856Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:14.828179Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:30:16.096519Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:30:17.551916Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:30:28.271706Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:34:44.923949Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:35.422800Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:39:37.267367Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal
2022-04-27T16:39:38.073965Z 200 system:serviceaccount:kube-system:statefulset-controller
2022-04-27T16:39:44.201756Z 200 system:node:ip-10-0-210-174.us-east-2.compute.internal

Hmm, so both 10.0.210.174 node creds and the StatefulSet controller are going after this pod. Looking at the StatefulSet controller with Loki:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496"} | unpack | namespace="openshift-kube-controller-manager" |~ "16:39.*terminating.*prometheus-k8s-1"

which gives:

I0427 16:39:35.408913 1 stateful_set_control.go:587] StatefulSet openshift-monitoring/prometheus-k8s terminating Pod prometheus-k8s-1 for update
I0427 16:39:38.065529 1 stateful_set_control.go:587] StatefulSet openshift-monitoring/prometheus-k8s terminating Pod prometheus-k8s-1 for update

Unclear to me what would be triggering StatefulSet pod updates mid-compute-pool rollout.
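
One way to sanity-check whether the pod template itself was churning would be to compare the StatefulSet's revision hashes in the gathered statefulsets.json. Just a sketch; matching revisions at gather time wouldn't rule out a transient flip earlier in the run:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s") | "currentRevision=" + .status.currentRevision + " updateRevision=" + .status.updateRevision + " generation=" + (.metadata.generation | tostring)'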
$ zgrep -h statefulsets/prometheus-k8s kube-apiserver/*.log.gz | jq -r 'select(.verb == "update" and .objectRef.subresource != "status") | .stageTimestamp + " " + (.responseStatus.code | tostring) + " "
> + .user.username + " " + .requestURI' | grep 'T16:[34]' | sort
2022-04-27T16:30:14.765499Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:15.485620Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:16.722612Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:17.494930Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:18.156508Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:18.898501Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:19.443528Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:19.968805Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:30.451950Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:33.277503Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:30:47.129804Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:32:31.690305Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:33:01.452300Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:34:46.423534Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:34:46.949126Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:37:03.465312Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:34.779837Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:35.378564Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:35.999583Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:36.783810Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:37.240148Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:37.819953Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.093792Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.363319Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:38.722498Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:39:39.416947Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s
2022-04-27T16:40:00.759665Z 200 system:serviceaccount:openshift-monitoring:prometheus-operator /apis/apps/v1/namespaces/openshift-monitoring/statefulsets/prometheus-k8s

Whoah. Not sure why the operator is so touchy...

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-78f7dcb657-hwd5g_prometheus-operator.log | grep -1 'T16:39.*sync prometheus' | head -n5
level=info ts=2022-04-27T16:39:33.039027377Z caller=operator.go:1479 component=prometheusoperator key=openshift-monitoring/k8s msg="update prometheus status"
level=info ts=2022-04-27T16:39:33.039576448Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-27T16:39:33.745387047Z caller=operator.go:1479 component=prometheusoperator key=openshift-monitoring/k8s msg="update prometheus status"
level=info ts=2022-04-27T16:39:33.745872215Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-27T16:39:34.786589376Z caller=operator.go:1322 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"

Operator doesn't want to explain why it's so touchy. Back to the gathered manifests to see where the node-credential deletions might fit in.

Hmm, this is surprising:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s").spec.podManagementPolicy'
Parallel

That's leaning fairly hard on the PDB? Because we don't really want parallel deletion, do we? Or maybe "Updates are not affected" [1] means it's ok?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/statefulsets.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s").spec.minReadySeconds'
null

Might be worth setting minReadySeconds [2].
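
Just to visualize that knob, something like the following would set it (the value is made up, and this is not a change to make by hand: the prometheus-k8s StatefulSet is generated by the Prometheus operator, which would presumably stomp a direct edit, so the real change would have to come from the operator / cluster-monitoring-operator):

$ oc -n openshift-monitoring patch statefulset prometheus-k8s --type merge -p '{"spec":{"minReadySeconds":30}}'

In principle that makes the StatefulSet controller wait until a replacement pod has been Ready for 30 seconds before it counts as available, which should slow down back-to-back restarts of the two prometheus-k8s replicas during updates.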
Conditions on the pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name == "prometheus-k8s-1").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status' | sort
2022-04-27T16:39:44Z PodScheduled=True
2022-04-27T16:39:53Z Initialized=True
2022-04-27T16:40:00Z ContainersReady=True
2022-04-27T16:40:00Z Ready=True

Filling in the timeline a bit more:

* 16:34:41, 10.0.210.174 cordoned, starts draining
* 16:34:44, 10.0.210.174 node creds request prometheus-k8s-1 deletion
* 16:35:32, 10.0.210.174 finishes draining, begins rebooting
* 16:36:35, 10.0.210.174 becomes ready
* 16:36:42, prometheus-k8s-1 bound to 10.0.210.174
* 16:39:15, Watchdog goes out.
* 16:39:35, StatefulSet controller requests prometheus-k8s-1 deletion for update; Prometheus operator banging away on the StatefulSet, unclear what changed
* 16:39:37.267367, 10.0.210.174 node creds request prometheus-k8s-1 deletion, unclear why
* 16:39:37.322638, prometheus-k8s-1 bound to 10.0.210.174
* 16:39:38, StatefulSet controller requests prometheus-k8s-1 deletion for update; Prometheus operator banging away on the StatefulSet, unclear what changed
* 16:39:44.201756, 10.0.210.174 node creds request prometheus-k8s-1 deletion, unclear why
* 16:39:44.205653, prometheus-k8s-1 removed
* 16:39:44.256165, prometheus-k8s-1 bound to 10.0.210.174
* 16:40:00, prometheus-k8s-1 goes Ready=True
* 16:40:46, Watchdog comes back.

[1]: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#parallel-pod-management
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#rolling-updates
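
On the earlier PDB question, it might also be worth checking what the disruption budget actually allowed. A sketch, assuming gather-extra captured a poddisruptionbudgets.json alongside the other resource dumps (the artifact name is a guess):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/poddisruptionbudgets.json | jq -r '.items[] | select(.metadata.namespace == "openshift-monitoring") | .metadata.name + " minAvailable=" + ((.spec.minAvailable // "n/a") | tostring) + " maxUnavailable=" + ((.spec.maxUnavailable // "n/a") | tostring) + " disruptionsAllowed=" + (.status.disruptionsAllowed | tostring)'

Keep in mind a PDB only gates the eviction API, so it wouldn't block the StatefulSet controller's update deletions seen above.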
Hmm, I kind of suspect [1]. Picking an older run [2,3], the Prometheus operator logs then looked like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519018020754690048/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-6fc8d765fd-p26hl_prometheus-operator.log | grep 'sync prometheus'
level=info ts=2022-04-26T19:23:31.275393173Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:23:32.144137378Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:38:29.897130609Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:38:47.198335294Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:53:29.914563739Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T19:53:47.493613417Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T20:08:29.802862158Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2022-04-26T20:08:47.165966489Z caller=operator.go:1228 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"

Seems like it used to like to touch in pairs, but at least it wasn't doing 16 touches in a single minute:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-78f7dcb657-hwd5g_prometheus-operator.log | grep 'T16:39:.*sync prometheus' | wc -l
16

[1]: https://github.com/openshift/prometheus-operator/pull/165
[2]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.11-informing#periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519018020754690048
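
To quantify the churn across the whole run rather than just that one minute, one rough approach is to bucket the sync lines by minute. A sketch reusing the same operator log (the cut offsets assume the ts= field layout shown above):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1519331891897962496/artifacts/e2e-aws-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-operator-78f7dcb657-hwd5g_prometheus-operator.log | grep 'msg="sync prometheus"' | grep -o 'ts=[^ ]*' | cut -c4-19 | sort | uniq -c | sort -n | tail -n5

The busiest minutes should line up with the prometheus-k8s-1 delete/recreate churn if [1] is to blame.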
Searched with https://search.ci.openshift.org/?search=which+may+be+a+sign+of+a+Prometheus+outage+in+violation+of+the+prometheus+query+SLO+of+100%25+uptime&maxAge=48h&context=1&type=bug%2Bjunit&name=4.11.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and still see the issue:

1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1521933787129188352

[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes
0s
{ Watchdog alert had 24631 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
May 04 20:02:19.246 - 3170s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
May 04 20:55:40.246 - 487s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}}

2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1521933759518085120

[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes
0s
{ Watchdog alert had 26575 changes during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime
May 04 19:57:49.299 - 4529s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}
May 04 21:13:49.299 - 514s E alert/Watchdog ns/openshift-monitoring ALERTS{alertname="Watchdog", alertstate="firing", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="none"}}
Moving back to ASSIGNED because we need to backport a patch [1] from the upstream Prometheus Operator that avoids simultaneous restarts of Prometheus. The backport PR is https://github.com/openshift/prometheus-operator/pull/189.

[1]: https://github.com/prometheus-operator/prometheus-operator/commit/ceac7c9c4b436f37fca2e718153b724ee92be9c9
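
Once the backport lands, a quick way to spot-check the behavior change in a run would be to look for prometheus-k8s-0 and prometheus-k8s-1 SuccessfulDelete events landing within a few seconds of each other, along the lines of the earlier events.json queries. A sketch, with <RUN-URL> standing in for a future run's gather-extra artifacts prefix:

$ curl -s <RUN-URL>/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and .reason == "SuccessfulDelete" and (.message | contains("prometheus-k8s")))] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .message'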
https://github.com/openshift/origin/pull/27201 caused a regression on single node jobs and got reverted.
Searched with https://search.ci.openshift.org/?search=which+may+be+a+sign+of+a+Prometheus+outage+in+violation+of+the+prometheus+query+SLO+of+100%25+uptime&maxAge=48h&context=1&type=bug%2Bjunit&name=4.12.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job and did not see the issue, so closing it.
Removing the requires_doc_text flag because the bug fix has already been backported to 4.11.z (see bug 2108978).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399