Bug 1804464 - Rare CI failures with MachineAPIOperatorDown firing
Summary: Rare CI failures with MachineAPIOperatorDown firing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-18 21:43 UTC by W. Trevor King
Modified: 2020-08-04 18:01 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:01:02 UTC
Target Upstream Version:
Embargoed:


Links
- GitHub: openshift/machine-api-operator pull 574 (closed) - Bug 1804464: Relax timeout for metrics availability alerts (last updated 2020-11-11 11:19:01 UTC)
- Red Hat Product Errata: RHBA-2020:2409 (last updated 2020-08-04 18:01:03 UTC)

Description W. Trevor King 2020-02-18 21:43:52 UTC
Since bug 1768756 fixed the Prometheus tests a few days ago to ensure they run at the end of the suite, we've seen a handful of runs with MachineAPIOperatorDown firing [1].  Failures have happened on 4.4 release-promotion informers for Azure [2] and OVN AWS [3], as well as on a few PR jobs on both AWS [4] and GCP [5,6].  Digging into the Azure release-promotion informer [2], the machine-api ClusterOperator looks happy [7]:

  conditions:
  - lastTransitionTime: "2020-02-18T05:00:20Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    message: 'Cluster Machine API Operator is available at operator: 4.4.0-0.nightly-2020-02-18-042756'
    status: "True"
    type: Available
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2020-02-18T04:59:51Z"
    status: "True"
    type: Upgradeable

The operator pod itself has zero restarts and has been Ready=True since 2020-02-18T04:59:51Z [8].  Dropping the job [2] into PromeCIeus [9], I can see:

  ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}

at 1 from 5:17:02 through 5:17:31 UTC.  I'm not sure how that squares with the pod being Ready=True since well before the alert.  Checking the operator logs [10], this is what was going on around the time the alert fired:

2020-02-18T05:15:28.1100175Z I0218 05:15:28.109964       1 status.go:100] Syncing status: available
2020-02-18T05:17:30.9629175Z I0218 05:17:30.962876       1 status.go:67] Syncing status: re-syncing
2020-02-18T05:17:31.9973146Z I0218 05:17:31.997240       1 sync.go:39] Synced up all machine-api-controller components
2020-02-18T05:17:32.012677Z I0218 05:17:32.012642       1 status.go:100] Syncing status: available
2020-02-18T05:34:09.6840352Z I0218 05:34:09.683990       1 status.go:67] Syncing status: re-syncing

I'm not sure what triggers those resyncs: ~2m from 5:15:28 to 5:17:30, but then ~17m from 5:17:32 to 5:34:09.
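For context on why the alert could fire while the pod stayed Ready=True: the alert keys off Prometheus scrapes of the operator's metrics endpoint, not pod readiness, so a missed scrape or two around a resync would drop `up` for the operator, and a short `for` duration would let the alert fire before the next successful scrape.  A rough sketch of the rule's shape (the expression, labels, and the short duration here are assumptions for illustration, not copied from the machine-api-operator manifests):

  # Hypothetical sketch; field values are assumptions, not the shipped rule.
  - alert: MachineAPIOperatorDown
    annotations:
      message: machine-api-operator metrics are not being collected
    # Fires when no scrape of the operator's metrics endpoint reports up == 1.
    expr: absent(up{name="machine-api-operator"} == 1)
    for: 1m  # assumed short duration; a brief resync blip could trip this
    labels:
      severity: critical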

[1]: https://search.svc.ci.openshift.org/?search=promQL+query%3A+count_over_time.*had+reported+incorrect+results.*MachineAPIOperatorDown.*firing&maxAge=336h&context=-1&type=all
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.4/792
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/162/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/722
[5]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/158/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/548
[6]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift-kni_cnf-features-deploy/71/pull-ci-openshift-kni-cnf-features-deploy-master-e2e-gcp-origin/69
[7]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/cluster-scoped-resources/config.openshift.io/clusteroperators/machine-api.yaml
[8]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator-d8dc95d6-62mtz.yaml
[9]: https://promecieus.svc.ci.openshift.org/
[10]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/890/artifacts/e2e-azure/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-4fcb5a5a8dd9e4633bdefcb26bc50cddc3f49de26d2de87b4dd3f6b12fa7ef00/namespaces/openshift-machine-api/pods/machine-api-operator-d8dc95d6-62mtz/machine-api-operator/machine-api-operator/logs/current.log

Comment 5 Jianwei Hou 2020-05-07 03:14:18 UTC
Moving to VERIFIED, as the alert's timeout has been increased to 5m to relax it.
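For anyone re-checking a run, the suite flags this via a count_over_time check on the ALERTS series (per the search in [1] of comment 0); a query of roughly this shape against the run's Prometheus data shows whether the alert fired at any point, with the 2h range being an assumed window rather than the suite's exact parameter:

  count_over_time(ALERTS{alertname="MachineAPIOperatorDown",alertstate="firing"}[2h]) > 0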

Comment 7 errata-xmlrpc 2020-08-04 18:01:02 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

