Bug 1974758

Summary: aws-serial jobs are failing with false-positive MachineWithNoRunningPhase firing or pending
Product: OpenShift Container Platform
Component: Cloud Compute
Sub Component: Other Providers
Reporter: Stephen Benjamin <stbenjam>
Assignee: Michael McCune <mimccune>
QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: mimccune, sippy, wking
Version: 4.9
Target Release: 4.9.0
Environment:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]
Last Closed: 2021-10-18 17:35:57 UTC
Type: Bug
Bug Blocks: 1999585    

Description Stephen Benjamin 2021-06-22 13:31:47 UTC
test:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel] 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D%5C%5BLate%5C%5D+Alerts+shouldn%27t+report+any+alerts+in+firing+or+pending+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured+and+have+no+gaps+in+Watchdog+firing+%5C%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5C%5D

This alert is firing in aws-serial jobs:

alert MachineWithNoRunningPhase fired for 780 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.129.0.11:8443", job="machine-api-operator", name="ci-op-r1cxmwsh-170bf-jf4wp-worker-us-west-2c-ghfbq", namespace="openshift-machine-api", node="ip-10-0-206-27.us-west-2.compute.internal", phase="Running", pod="machine-api-operator-65fcc768ff-52cbz", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2c/i-0106a2f034973b0e0"}
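
For anyone reproducing this on a live cluster, the firing/pending state can be checked directly against Prometheus' synthetic ALERTS series (a hedged sketch; $PROM_URL and $TOKEN are placeholders for the cluster's Prometheus route and a bearer token, not values from this job):

  $ curl -sk -H "Authorization: Bearer $TOKEN" "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=ALERTS{alertname="MachineWithNoRunningPhase"}' \
    | jq -r '.data.result[].metric | {alertname, alertstate, name, phase}'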

Comment 1 Michael McCune 2021-06-22 14:14:24 UTC
i have a feeling this is related to https://github.com/openshift/machine-api-operator/pull/878
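
To see how the rule is currently defined on a live cluster (a hedged sketch; this assumes the machine-api-operator publishes its alerts as a PrometheusRule object in the openshift-machine-api namespace, and the object name and layout may differ by release):

  $ oc -n openshift-machine-api get prometheusrules -o yaml \
      | grep -B 2 -A 8 'alert: MachineWithNoRunningPhase'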

Comment 4 W. Trevor King 2021-06-29 17:27:05 UTC
Moving to a more specific subject, to match the fixing PR's scope.  Searching for errors from the alert we fixed:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=MachineWithNoRunningPhase&maxAge=96h&type=junit' | grep 'failures match' | grep -v 'rehearse-\|pull-ci-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-openstack-serial (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial (all) - 55 runs, 15% failed, 13% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-rollback (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-cilium (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-serial (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-compact-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-serial (all) - 4 runs, 50% failed, 50% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-compact-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 16 runs, 100% failed, 6% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
release-openshift-origin-installer-launch-aws (all) - 69 runs, 48% failed, 3% of failures match = 1% impact

Hmm, digging into that 4.9-e2e-aws-serial failure:

$ curl -s 'https://search.ci.openshift.org/search?search=MachineWithNoRunningPhase&maxAge=96h&type=junit&name=^periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584

That job has:

  : [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing [Suite:openshift/conformance/parallel]	22s
    fail [github.com/onsi/ginkgo.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Jun 25 18:10:00.128: Unexpected alerts fired or pending after the test run:

    alert MachineWithNoRunningPhase fired for 1594 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}
    alert MachineWithoutValidNode fired for 1635 seconds with labels: {api_version="machine.openshift.io/v1beta1", container="kube-rbac-proxy", endpoint="https", exported_namespace="openshift-machine-api", instance="10.128.0.29:8443", job="machine-api-operator", name="ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt", namespace="openshift-machine-api", phase="Provisioned", pod="machine-api-operator-7957ccb848-dgl78", service="machine-api-operator", severity="warning", spec_provider_id="aws:///us-west-2b/i-0cd09248ca4f35f42"}

Checking the machine:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt").status | {phase, instanceId: .providerStatus.instanceId}'
  {
    "phase": "Provisioned",
    "instanceId": "i-0cd09248ca4f35f42"
  }

And indeed, no associated node:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata | .name + " " + .annotations["machine.openshift.io/machine"]'
  ip-10-0-128-211.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-fcmqp
  ip-10-0-137-3.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccbsn
  ip-10-0-162-148.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-2
  ip-10-0-184-187.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-0
  ip-10-0-192-110.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-master-1
  ip-10-0-213-124.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-sr69w
  ip-10-0-236-101.us-west-2.compute.internal openshift-machine-api/ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2a-5q78h
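
For completeness, the machine's status.nodeRef in the same machines.json artifact should confirm this (a sketch reusing the artifact URL from above; nodeRef is expected to be unset for a Provisioned machine that never got a node):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-extra/artifacts/machines.json | jq -r '.items[] | select(.metadata.name == "ci-op-ct77sg7c-170bf-wzhl4-worker-us-west-2b-ccfdt").status.nodeRef'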

And... we tried to gather console logs for that instance, but failed [1]:

  'ascii' codec can't encode character '\u2026' in position 11466: ordinal not in range(128)
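
That is a Python UnicodeEncodeError from writing U+2026 (horizontal ellipsis) through an ASCII-only output stream; a minimal reproduction of the failure mode outside the gather script (illustrative only, forcing the stream encoding via PYTHONIOENCODING; traceback trimmed):

  $ PYTHONIOENCODING=ascii python3 -c 'print("\u2026")'
  UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 0: ordinal not in range(128)
  $ PYTHONIOENCODING=utf-8 python3 -c 'print("\u2026")'
  …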

Anyhow, frequency seems to be down, and this true-positive seems like a different issue, so I see no problem moving the rescoped bug to VERIFIED.

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-serial/1408457683907907584/artifacts/e2e-aws-serial/gather-aws-console/build-log.txt

Comment 5 Michael McCune 2021-06-29 20:08:39 UTC
Thanks for the comments, Milind and Trevor. I think it is definitely possible that we would see this alert during a flaky test; given the output that Trevor linked, I would assume those were flakes.

Glad to hear that the frequency is down; I think that is currently the strongest signal we have for distinguishing flakes from real failures. I'm OK to move this to VERIFIED as well.

Comment 8 errata-xmlrpc 2021-10-18 17:35:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 9 Red Hat Bugzilla 2023-09-15 01:10:21 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days