Bug 1974758 - aws-serial jobs are failing with false-positive MachineWithNoRunningPhase firing or pending
Summary: aws-serial jobs are failing with false-positive MachineWithNoRunningPhase firing or pending
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1999585
 
Reported: 2021-06-22 13:31 UTC by Stephen Benjamin
Modified: 2023-09-15 01:10 UTC (History)
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1999585 (view as bug list)
Environment:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: 2021-10-18 17:35:57 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 878 0 None closed Bug 1974758: install/0000_90_machine-api-operator_04_alertrules: Use '!~' for MachineWithNoRunningPhase 2021-06-22 14:25:50 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:36:22 UTC

Description Stephen Benjamin 2021-06-22 13:31:47 UTC
test:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D%5C%5BLate%5C%5D+Alerts+shouldn%27t+report+any+alerts+in+firing+or+pending+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured+and+have+no+gaps+in+Watchdog+firing+%5C%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5C%5D


This alert is firing in aws-serial jobs:

alert MachineWithNoRunningPhase fired for 780 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.129.0.11:8443", job="machine-api-operator", name="ci-op-r1cxmwsh-170bf-jf4wp-worker-us-west-2c-ghfbq", namespace="openshift-machine-api", node="ip-10-0-206-27.us-west-2.compute.internal", phase="Running", pod="machine-api-operator-65fcc768ff-52cbz", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2c/i-0106a2f034973b0e0"}

Comment 1 Michael McCune 2021-06-22 14:14:24 UTC
i have a feeling this is related to https://github.com/openshift/machine-api-operator/pull/878
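
for context, the PR title points at the fix: the MachineWithNoRunningPhase rule in install/0000_90_machine-api-operator_04_alertrules switches to the regex matcher '!~'. A plausible reading of the false positive above, sketched with an assumed metric name and phase list (the real expression is in the PR):

  # with '!=' the right-hand side is compared as a literal string, so a machine
  # reporting phase="Running" still matches (its phase is not the literal string
  # "Running|Deleting") and the alert fires -- the false positive seen above
  mapi_machine_created_timestamp_seconds{phase!="Running|Deleting"}

  # with '!~' the right-hand side is a regular expression, so machines in the
  # Running or Deleting phases are excluded as intended
  mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}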

Comment 4 W. Trevor King 2021-06-29 17:27:05 UTC
Moving to a more specific subject, to match the fixing PR's scope.  Searching for errors from the alert we fixed:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=MachineWithNoRunningPhase&maxAge=96h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-openstack-serial (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial (all) - 55 runs, 15% failed, 13% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-serial (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
release-openshift-origin-installer-launch-aws (all) - 69 runs, 48% failed, 3% of failures match = 1% impact
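
(For reading these search.ci summary lines: the "impact" figure appears to be the failure rate multiplied by the share of failures that match. For example, on the 4.8 openstack-serial line, 25% of 12 runs is 3 failures, 33% of those is 1 matching run, and 1 out of 12 runs is roughly 8% impact.)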

Hmm, digging into that 4.9-e2e-aws-serial failure:

$ curl -s 'https://search.ci.openshift.org/search?search=MachineWithNoRunningPhase&maxAge=96h&type=junit&name=^periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584

That job has:

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]	22s
    fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 25 18:10:00.128: Unexpected alerts fired or pending after the test run:

    alert MachineWithNoRunningPhase fired for 1594 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}
    alert MachineWithoutValidNode fired for 1635 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}

Checking the machine:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt").status | {phase, instanceId: .providerStatus.instanceId}'
  {
    "phase": "Provisioned",
    "instanceId": "i-0cd09248ca4f35f42"
  }

And indeed, no associated node:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | .name + " " + .annotations["machine.openshift.io/machine"]'
  ip-10-0-128-211.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-fcmqp
  ip-10-0-137-3.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccbsn
  ip-10-0-162-148.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-2
  ip-10-0-184-187.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-0
  ip-10-0-192-110.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-1
  ip-10-0-213-124.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-sr69w
  ip-10-0-236-101.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-5q78h

And... we tried to gather console logs for that instance, but failed [1]:

  'ascii' codec can't encode character '\u2026' in position 11466: ordinal not in range(128)
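
That looks like the gather script (presumably Python writing through an ASCII-configured stream) choking on a non-ASCII character in the console output, rather than anything instance-specific. A minimal, hypothetical reproduction of that failure mode:

  # encoding text that contains U+2026 (the ellipsis character) with the ASCII
  # codec raises exactly this kind of error, so a single non-ASCII character in
  # the console output is enough to abort the gather step
  try:
      "serial console log \u2026 (truncated)".encode("ascii")
  except UnicodeEncodeError as err:
      print(err)  # 'ascii' codec can't encode character '\u2026' in position 19: ordinal not in range(128)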

Anyhow, frequency seems to be down, and this true-positive seems like a different issue, so I see no problem moving the rescoped bug to VERIFIED.

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-aws-console/build-log.txt

Comment 5 Michael McCune 2021-06-29 20:08:39 UTC
thanks for the comments Milind and Trevor. i think it is definitely possible that we would see this alert during a flaky test. Given the output that Trevor linked, i would assume those were flakes.

glad to hear that the frequency is down, and i think that is currently the strongest signal we have for telling flakes apart from real regressions. i'm ok to move this to VERIFIED as well.

Comment 8 errata-xmlrpc 2021-10-18 17:35:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 9 Red Hat Bugzilla 2023-09-15 01:10:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

