test: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D%5C%5BLate%5C%5D+Alerts+shouldn%27t+report+any+alerts+in+firing+or+pending+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured+and+have+no+gaps+in+Watchdog+firing+%5C%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5C%5D

This alert is firing in aws-serial jobs:

alert MachineWithNoRunningPhase fired for 780 seconds with labels:
{api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.129.0.11:8443", job="machine-api-operator", name="ci-op-r1cxmwsh-170bf-jf4wp-worker-us-west-2c-ghfbq", namespace="openshift-machine-api", node="ip-10-0-206-27.us-west-2.compute.internal", phase="Running", pod="machine-api-operator-65fcc768ff-52cbz", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2c/i-0106a2f034973b0e0"}
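For anyone re-running this triage on a live cluster, the rule behind the alert can be inspected directly. A minimal sketch, assuming the machine-api-operator ships its alerting rules as a PrometheusRule object in the openshift-machine-api namespace (the object name and exact rule expression are not verified here):

$ oc -n openshift-machine-api get prometheusrules -o yaml | grep -B 2 -A 6 'MachineWithNoRunningPhase'

That should surface the alert's expression and "for:" duration, which helps when judging whether a slow-but-eventually-successful provision could legitimately trip it.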
i have a feeling this is related to https://github.com/openshift/machine-api-operator/pull/878
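One quick way to test that feeling is to check whether the payload under test already carries the PR, by comparing the machine-api-operator commit recorded in the release image against the PR's merge commit. A rough sketch; the pullspec below is a placeholder, not one taken from these jobs:

$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.0-x86_64 | grep machine-api-operator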
Moving to a more specific subject, to match the fixing PR's scope.

Searching for errors from the alert we fixed:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=MachineWithNoRunningPhase&maxAge=96h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-openstack-serial (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial (all) - 55 runs, 15% failed, 13% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-serial (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
release-openshift-origin-installer-launch-aws (all) - 69 runs, 48% failed, 3% of failures match = 1% impact

Hmm, digging into that 4.9-e2e-aws-serial failure:

$ curl -s 'https://search.ci.openshift.org/search?search=MachineWithNoRunningPhase&maxAge=96h&type=junit&name=^periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584

That job has:

: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 22s
fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 25 18:10:00.128: Unexpected alerts fired or pending after the test run:

alert MachineWithNoRunningPhase fired for 1594 seconds with labels:
{api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}

alert MachineWithoutValidNode fired for 1635 seconds with labels:
{api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}

Checking the machine:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt").status | {phase, instanceId: .providerStatus.instanceId}'
{
  "phase": "Provisioned",
  "instanceId": "i-0cd09248ca4f35f42"
}

And indeed, no associated node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | .name + " " + .annotations["machine.openshift.io/machine"]'
ip-10-0-128-211.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-fcmqp
ip-10-0-137-3.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccbsn
ip-10-0-162-148.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-2
ip-10-0-184-187.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-0
ip-10-0-192-110.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-1
ip-10-0-213-124.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-sr69w
ip-10-0-236-101.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-5q78h

And... we tried to gather console logs for that instance, but failed [1]:

'ascii' codec can't encode character '\u2026' in position 11466: ordinal not in range(128)

Anyhow, frequency seems to be down, and this true positive seems like a different issue, so I see no problem moving the rescoped bug to VERIFIED.

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-aws-console/build-log.txt
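As an aside for future triage: the machines.json check above can be repeated across every matching run instead of one hand-picked job, by mapping the search API's job URLs onto gcsweb artifact paths. A rough sketch based on the prow-to-gcsweb URL mapping visible in the links above (the artifact path is job-specific, so this only fits e2e-aws-serial runs):

$ curl -s 'https://search.ci.openshift.org/search?search=MachineWithNoRunningPhase&maxAge=96h&type=junit&name=^periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial' | jq -r 'keys[]' |
    sed 's|prow.ci.openshift.org/view/gs|gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs|' |
    while read -r job; do
      echo "== $job"
      # List machines stuck outside the Running phase in this run's gathered artifacts.
      curl -s "$job/artifacts/e2e-aws-serial/gather-extra/artifacts/machines.json" |
        jq -r '.items[] | select(.status.phase != "Running") | .metadata.name + " " + .status.phase'
    done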
Thanks for the comments, Milind and Trevor. I think it is definitely possible that we would see this alert during a flaky test, and given the output that Trevor linked, I would assume those were flakes. Glad to hear that the frequency is down; that drop might be the strongest indication we currently have for distinguishing flakes from successes. I'm OK to move this to VERIFIED as well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.