Bug 1804464

Summary: Rare CI failures with MachineAPIOperatorDown firing
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: Jianwei Hou <jhou>
Status: CLOSED ERRATA
Docs Contact:
Severity: low    
Priority: unspecified
CC: agarcial
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-04 18:01:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-02-18 21:43:52 UTC
Since bug 1768756 fixed the Prometheus tests a few days ago so that they run at the end of the suite, we've seen a handful of runs with MachineAPIOperatorDown firing [1].  Failures have happened on 4.4 release-promotion informers for Azure [2] and OVN AWS [3], as well as a few PR jobs on AWS [4] and GCP [5,6].  Digging into the Azure release-promotion informer [2], the machine-api ClusterOperator looks happy [7]:

  conditions:
  - lastTransitionTime: "2020-02-18T05:00:20Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    message: 'Cluster Machine API Operator is available at operator: 4.4.0-0.nightly-2020-02-18-042756'
    status: "True"
    type: Available
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "True"
    type: Upgradeable
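
(On a live cluster, the same conditions could be checked with something along these lines; in this bug they come straight from the must-gather artifact [7], so the commands below are just an illustration:)

  $ oc get clusteroperator machine-api -o yaml
  $ oc get clusteroperator machine-api -o jsonpath='{.status.conditions}'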

The operator pod itself has zero restarts and has been Ready=True since 2020-02-18T04:59:51Z [8].  Dropping [2] into [9], I can see:

  ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

at 1 from 5:17:02 through 5:17:31 UTC.  I'm not sure how that squares with the pod being Ready=True since well before the alert.  Checking the operator logs [10], this is what was going on around when the alert fired:

2020-02-18T05:15:28.1100175Z I0218 05:15:28.109964       1 status.go:100] Syncing status: available
2020-02-18T05:17:30.9629175Z I0218 05:17:30.962876       1 status.go:67] Syncing status: re-syncing
2020-02-18T05:17:31.9973146Z I0218 05:17:31.997240       1 sync.go:39] Synced up all machine-api-controller components
2020-02-18T05:17:32.012677Z I0218 05:17:32.012642       1 status.go:100] Syncing status: available
2020-02-18T05:34:09.6840352Z I0218 05:34:09.683990       1 status.go:67] Syncing status: re-syncing

I'm not sure what triggers those resyncs: ~2m from 5:15:28 to 5:17:30, but then ~17m from 5:17:32 to 5:34:09.
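
For anyone re-checking that window, the ALERTS series above can be pulled back out of the Prometheus data restored via [9] with a range query along these lines (the host is a placeholder for whatever route [9] hands back; the window just brackets the firing interval):

  # <restored-prometheus-route> is hypothetical: substitute the route [9] provides
  $ curl -sG "https://<restored-prometheus-route>/api/v1/query_range" \
      --data-urlencode 'query=ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}' \
      --data-urlencode 'start=2020-02-18T05:15:00Z' \
      --data-urlencode 'end=2020-02-18T05:20:00Z' \
      --data-urlencode 'step=15s'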

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*had+reported+incorrect+results.*MachineAPIOperatorDown.*firing&maxAge=336h&context=-1&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/792
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/158/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/548
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/71/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin/69
[7]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml
[8]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator-d8dc95d6-62mtz.yaml
[9]: https://promecieus.svc.ci.openshift.org/
[10]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator/machine-api-operator/logs/current.log

Comment 5 Jianwei Hou 2020-05-07 03:14:18 UTC
Moving to verified, as the alert's pending time has been increased to 5m to relax it.
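
For context, relaxing the alert this way means its expression has to stay true for 5 minutes before it fires, so short-lived blips like the ~30s firing window in the description shouldn't trip it.  A MachineAPIOperatorDown rule with that pending window would look roughly like the sketch below (the expr, labels, and annotation wording are illustrative assumptions, not a copy of the operator's actual rule):

  - alert: MachineAPIOperatorDown
    annotations:
      message: machine-api-operator has not been scraped successfully for 5 minutes.  # illustrative wording
    expr: absent(up{name="machine-api-operator"} == 1)  # assumed selector; the real rule may differ
    for: 5m  # the relaxed pending window described above
    labels:
      severity: critical  # assumed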

Comment 7 errata-xmlrpc 2020-08-04 18:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409