Since bug 1768756 fixed the Prometheus tests to ensure they run at the end of the suite a few days ago, we've seen a handful of MachineAPIOperatorDown firing [1]. Failures have happened on 4.4 release-promotion informers for Azure [2] and OVN AWS [3], as well as a few PR jobs on both AWS [4] and GCP [5,6].

Digging into the Azure release-promotion informer [2], the machine-api ClusterOperator looks happy [7]:

conditions:
- lastTransitionTime: "2020-02-18T05:00:20Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2020-02-18T04:59:51Z"
  message: 'Cluster Machine API Operator is available at operator: 4.4.0-0.nightly-2020-02-18-042756'
  status: "True"
  type: Available
- lastTransitionTime: "2020-02-18T04:59:51Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2020-02-18T04:59:51Z"
  status: "True"
  type: Upgradeable

The operator pod itself has zero restarts and has been Ready=True since 2020-02-18T04:59:51Z [8].

Dropping [2] into [9], I can see:

ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

at 1 from 5:17:02 through 5:17:31 UTC. I'm not sure how that squares with the pod being Ready=True since well before the alert.

Checking the operator logs [10], this is what was going on around when the alert fired:

2020-02-18T05:15:28.1100175Z I0218 05:15:28.109964 1 status.go:100] Syncing status: available
2020-02-18T05:17:30.9629175Z I0218 05:17:30.962876 1 status.go:67] Syncing status: re-syncing
2020-02-18T05:17:31.9973146Z I0218 05:17:31.997240 1 sync.go:39] Synced up all machine-api-controller components
2020-02-18T05:17:32.012677Z I0218 05:17:32.012642 1 status.go:100] Syncing status: available
2020-02-18T05:34:09.6840352Z I0218 05:34:09.683990 1 status.go:67] Syncing status: re-syncing

I'm not sure what triggers those resyncs: ~2m from 5:15:28 to 5:17:30, but then ~17m from 5:17:32 to 5:34:09.
[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*had+reported+incorrect+results.*MachineAPIOperatorDown.*firing&maxAge=336h&context=-1&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/792
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/158/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/548
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/71/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin/69
[7]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml
[8]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator-d8dc95d6-62mtz.yaml
[9]: https://promecieus.svc.ci.openshift.org/
[10]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator/machine-api-operator/logs/current.log
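For anyone else poking at this in [9], these are roughly the queries that narrow it down (the name label on up is an assumption here; check the shipped alert rule for the exact matcher):

# The firing alert series quoted above:
ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

# Whether the operator's target was being scraped at all around 05:17 UTC:
up{name="machine-api-operator"}

# The usual shape of an "operator down" alert expression; a non-empty
# result means Prometheus saw no healthy target:
absent(up{name="machine-api-operator"} == 1)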
This seems to be the machine-api-operator (MAO) racing with Prometheus to be able to accept a valid bearer token on its metrics endpoint. Based on the kube-rbac-proxy logs, it can take up to 5 minutes:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-operator-649f77c85d-k4nxj_kube-rbac-proxy.log

The 17:51 / 17:55 time frame also matches on the Prometheus side:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722/artifacts/e2e-aws/pods/openshift-monitoring_prometheus-k8s-0_prometheus-proxy.log
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722/artifacts/e2e-aws/pods/openshift-monitoring_prometheus-k8s-0_prometheus.log
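For context, the scrape path involved here: Prometheus presents its ServiceAccount bearer token on every scrape, and the kube-rbac-proxy sidecar in the MAO pod has to validate it against the apiserver before serving metrics, so scrapes fail while that plumbing settles after install. A minimal sketch of the ServiceMonitor side of that, assuming the standard shape (the names, labels, and paths below are assumptions, not copied from the MAO manifests):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: machine-api-operator            # assumed name
  namespace: openshift-machine-api
spec:
  endpoints:
  - port: https                         # kube-rbac-proxy's listener
    scheme: https
    # Prometheus sends this token with each scrape; kube-rbac-proxy
    # validates it against the apiserver before proxying to the
    # operator's metrics port, which is the step that can fail early on.
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: machine-api-operator.openshift-machine-api.svc
  selector:
    matchLabels:
      k8s-app: machine-api-operator     # assumed label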
Moving to VERIFIED, as the alert's "for" duration has been increased to 5m to relax it.
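For reference, this is roughly what the relaxed rule looks like (the expr, annotations, group name, and severity here are assumptions based on the usual absent() pattern; the authoritative copy lives in the machine-api-operator manifests):

groups:
- name: machine-api-operator.rules
  rules:
  - alert: MachineAPIOperatorDown
    annotations:
      message: machine-api-operator has disappeared from Prometheus target discovery.
    expr: absent(up{name="machine-api-operator"} == 1)
    # The fix verified here: the expression now has to hold for 5
    # continuous minutes before the alert moves from pending to firing,
    # riding out the early bearer-token race described above.
    for: 5m
    labels:
      severity: critical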
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409