Bug 1771903

Summary: "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Status: CLOSED DUPLICATE
QA Contact: Jianwei Hou <jhou>
Severity: unspecified
Priority: unspecified
Version: 4.3.0
CC: agarcial
Last Closed: 2019-11-14 09:34:45 UTC
Type: Bug

Comment 1 Alberto 2019-11-13 08:57:59 UTC
The alert is firing legitimately when a machine is missing an associated node for too long. We've seen this sporadically: for some unknown reason, some AWS instances hang in a pending state and are possibly terminated eventually, so the machine resource enters a failed phase.
See
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-lklc3wmp-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc.yaml
And the lifecycle for ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc here https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-584944fdd5-jjrmx_machine-controller.log
In a real cluster this can be remediated by covering your pool of machines with a machine health check.
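The MachineHealthCheck remediation suggested above could look roughly like the sketch below. This is illustrative only: the resource name, selector labels, timeouts, and maxUnhealthy value are assumed placeholders, not values taken from this bug, and should be adapted to the labels on your own MachineSet's machines.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-worker-healthcheck   # illustrative name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # assumed label; must match the labels on the machines you want covered
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  # A machine whose node stays NotReady/Unknown past the timeout is remediated
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  # Stop remediating if too large a fraction of matched machines is unhealthy
  maxUnhealthy: 40%
```

With such a check in place, a machine stuck without a valid node would eventually be deleted and replaced by its MachineSet rather than lingering and keeping the alert firing.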
Also, in the near future we'll likely make MachineSets ignore "failed" machines when reconciling replicas, so in a case like this a replacement machine would be created automatically.

Comment 4 Red Hat Bugzilla 2023-09-14 05:46:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days