Bug 1804464 - Rare CI failures with MachineAPIOperatorDown firing
Summary: Rare CI failures with MachineAPIOperatorDown firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-18 21:43 UTC by W. Trevor King
Modified: 2020-08-04 18:01 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:01:02 UTC
Target Upstream Version:
Embargoed:


Links
- GitHub: openshift/machine-api-operator pull 574 (closed) - Bug 1804464: Relax timeout for metrics availability alerts (last updated 2020-11-11 11:19:01 UTC)
- Red Hat Product Errata: RHBA-2020:2409 (last updated 2020-08-04 18:01:03 UTC)

Description W. Trevor King 2020-02-18 21:43:52 UTC
Since bug 1768756 fixed the Prometheus tests a few days ago to ensure they run at the end of the suite, we've seen a handful of runs with MachineAPIOperatorDown firing [1].  Failures have happened on 4.4 release-promotion informers for Azure [2] and OVN AWS [3], as well as on a few PR jobs on both AWS [4] and GCP [5,6].  Digging into the Azure release-promotion informer [2], the machine-api ClusterOperator looks happy [7]:

  conditions:
  - lastTransitionTime: "2020-02-18T05:00:20Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    message: 'Cluster Machine API Operator is available at operator: 4.4.0-0.nightly-2020-02-18-042756'
    status: "True"
    type: Available
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "True"
    type: Upgradeable

The operator pod itself has zero restarts and has been Ready=True since 2020-02-18T04:59:51Z [8].  Dropping the job [2] into PromeCIeus [9], I can see:

  ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

at 1 from 5:17:02 through 5:17:31 UTC.  I'm not sure how that squares with the pod being Ready=True since well before the alert.  Checking the operator logs [10], this is what was going on around the time the alert fired:

2020-02-18T05:15:28.1100175Z I0218 05:15:28.109964       1 status.go:100] Syncing status: available
2020-02-18T05:17:30.9629175Z I0218 05:17:30.962876       1 status.go:67] Syncing status: re-syncing
2020-02-18T05:17:31.9973146Z I0218 05:17:31.997240       1 sync.go:39] Synced up all machine-api-controller components
2020-02-18T05:17:32.012677Z I0218 05:17:32.012642       1 status.go:100] Syncing status: available
2020-02-18T05:34:09.6840352Z I0218 05:34:09.683990       1 status.go:67] Syncing status: re-syncing

I'm not sure what triggers those resyncs: ~2m from 5:15:28 to 5:17:30, but then ~17m from 5:17:32 to 5:34:09.
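For context on why the alert could fire while the pod stayed Ready=True: the alert keys off Prometheus scrapes of the operator's metrics endpoint, not pod readiness, so a missed scrape or two around a resync would drop `up` for the operator, and a short `for` duration would let the alert fire before the next successful scrape.  A rough sketch of the rule's shape (the expression, labels, and the short duration here are assumptions for illustration, not copied from the machine-api-operator manifests):

  # Hypothetical sketch; field values are assumptions, not the shipped rule.
  - alert: MachineAPIOperatorDown
    annotations:
      message: machine-api-operator metrics are not being collected
    # Fires when no scrape of the operator's metrics endpoint reports up == 1.
    expr: absent(up{name="machine-api-operator"} == 1)
    for: 1m  # assumed short duration; a brief resync blip could trip this
    labels:
      severity: critical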

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*had+reported+incorrect+results.*MachineAPIOperatorDown.*firing&maxAge=336h&context=-1&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/792
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/158/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/548
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/71/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin/69
[7]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml
[8]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator-d8dc95d6-62mtz.yaml
[9]: https://promecieus.svc.ci.openshift.org/
[10]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator/machine-api-operator/logs/current.log

Comment 5 Jianwei Hou 2020-05-07 03:14:18 UTC
Moving to VERIFIED, as the alert's timeout has been increased to 5m to relax it.
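For anyone re-checking a run, the suite flags this via a count_over_time check on the ALERTS series (per the search in [1] of comment 0); a query of roughly this shape against the run's Prometheus data shows whether the alert fired at any point, with the 2h range being an assumed window rather than the suite's exact parameter:

  count_over_time(ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}[2h]) > 0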

Comment 7 errata-xmlrpc 2020-08-04 18:01:02 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

