Bug 1771903

Summary: "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Status: CLOSED DUPLICATE
QA Contact: Jianwei Hou <jhou>
Severity: unspecified
Priority: unspecified
Version: 4.3.0
CC: agarcial
Last Closed: 2019-11-14 09:34:45 UTC
Type: Bug

Comment 1 Alberto 2019-11-13 08:57:59 UTC
The alert is firing legitimately when a machine is missing an associated node for too long. We've seen this sporadically: for some unknown reason, some AWS instances hang in a pending state and are possibly terminated eventually, so the machine resource enters a failed phase.
See
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-lklc3wmp-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc.yaml
And the lifecycle for ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc here https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-584944fdd5-jjrmx_machine-controller.log
In a real cluster this can be remediated by covering your pool of machines with a machine health check.
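The MachineHealthCheck remediation suggested above could look roughly like the sketch below. This is illustrative only: the resource name, selector labels, timeouts, and maxUnhealthy value are assumed placeholders, not values taken from this bug, and should be adapted to the labels on your own MachineSet's machines.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-worker-healthcheck   # illustrative name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # assumed label; must match the labels on the machines you want covered
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  # A machine whose node stays NotReady/Unknown past the timeout is remediated
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  # Stop remediating if too large a fraction of matched machines is unhealthy
  maxUnhealthy: 40%
```

With such a check in place, a machine stuck without a valid node would eventually be deleted and replaced by its MachineSet rather than lingering and keeping the alert firing.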
Also, in the near future we'll likely make MachineSets ignore "failed" machines when reconciling replicas, so in a case like this a replacement machine would be created automatically.

Comment 4 Red Hat Bugzilla 2023-09-14 05:46:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days