Bug 1804464

Summary: Rare CI failures with MachineAPIOperatorDown firing
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: Jianwei Hou <jhou>
Status: CLOSED ERRATA
Docs Contact:
Severity: low    
Priority: unspecified
CC: agarcial
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-04 18:01:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-02-18 21:43:52 UTC
Since bug 1768756 fixed the Prometheus tests a few days ago so that they run at the end of the suite, we've seen a handful of runs with MachineAPIOperatorDown firing [1].  Failures have happened on 4.4 release-promotion informers for Azure [2] and OVN AWS [3], as well as a few PR jobs on AWS [4] and GCP [5,6].  Digging into the Azure release-promotion informer [2], the machine-api ClusterOperator looks happy [7]:

  conditions:
  - lastTransitionTime: "2020-02-18T05:00:20Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    message: 'Cluster Machine API Operator is available at operator: 4.4.0-0.nightly-2020-02-18-042756'
    status: "True"
    type: Available
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "True"
    type: Upgradeable
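
(On a live cluster, the same conditions could be checked with something along these lines; in this bug they come straight from the must-gather artifact [7], so the commands below are just an illustration:)

  $ oc get clusteroperator machine-api -o yaml
  $ oc get clusteroperator machine-api -o jsonpath='{.status.conditions}'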

The operator pod itself has zero restarts and has been Ready=True since 2020-02-18T04:59:51Z [8].  Dropping [2] into [9], I can see:

  ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

at 1 from 5:17:02 through 5:17:31 UTC.  I'm not sure how that squares with the pod being Ready=True since well before the alert.  Checking the operator logs [10], this is what was going on around when the alert fired:

2020-02-18T05:15:28.1100175Z I0218 05:15:28.109964       1 status.go:100] Syncing status: available
2020-02-18T05:17:30.9629175Z I0218 05:17:30.962876       1 status.go:67] Syncing status: re-syncing
2020-02-18T05:17:31.9973146Z I0218 05:17:31.997240       1 sync.go:39] Synced up all machine-api-controller components
2020-02-18T05:17:32.012677Z I0218 05:17:32.012642       1 status.go:100] Syncing status: available
2020-02-18T05:34:09.6840352Z I0218 05:34:09.683990       1 status.go:67] Syncing status: re-syncing

I'm not sure what triggers those resyncs: ~2m from 5:15:28 to 5:17:30, but then ~17m from 5:17:32 to 5:34:09.
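
For anyone re-checking that window, the ALERTS series above can be pulled back out of the Prometheus data restored via [9] with a range query along these lines (the host is a placeholder for whatever route [9] hands back; the window just brackets the firing interval):

  # <restored-prometheus-route> is hypothetical: substitute the route [9] provides
  $ curl -sG "https://<restored-prometheus-route>/api/v1/query_range" \
      --data-urlencode 'query=ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}' \
      --data-urlencode 'start=2020-02-18T05:15:00Z' \
      --data-urlencode 'end=2020-02-18T05:20:00Z' \
      --data-urlencode 'step=15s'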

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*had+reported+incorrect+results.*MachineAPIOperatorDown.*firing&maxAge=336h&context=-1&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/792
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/158/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/548
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/71/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin/69
[7]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml
[8]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator-d8dc95d6-62mtz.yaml
[9]: https://promecieus.svc.ci.openshift.org/
[10]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator/machine-api-operator/logs/current.log

Comment 5 Jianwei Hou 2020-05-07 03:14:18 UTC
Moving to verified, as the alert's pending time has been increased to 5m to relax it.
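
For context, relaxing the alert this way means its expression has to stay true for 5 minutes before it fires, so short-lived blips like the ~30s firing window in the description shouldn't trip it.  A MachineAPIOperatorDown rule with that pending window would look roughly like the sketch below (the expr, labels, and annotation wording are illustrative assumptions, not a copy of the operator's actual rule):

  - alert: MachineAPIOperatorDown
    annotations:
      message: machine-api-operator has not been scraped successfully for 5 minutes.  # illustrative wording
    expr: absent(up{name="machine-api-operator"} == 1)  # assumed selector; the real rule may differ
    for: 5m  # the relaxed pending window described above
    labels:
      severity: critical  # assumed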

Comment 7 errata-xmlrpc 2020-08-04 18:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409