1771903 – "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing

Summary: "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing

Keywords:
Status:	CLOSED DUPLICATE of bug 1772163
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-11-13 08:31 UTC by Simon Pasquier
Modified:	2023-09-14 05:46 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-11-14 09:34:45 UTC
Target Upstream Version:
Embargoed:

Comment 1 Alberto 2019-11-13 08:57:59 UTC

The alert fires legitimately when a machine has had no associated node for too long. We've seen this sporadically: for some unknown reason, some AWS instances seem to hang in a pending state and are possibly terminated eventually, so the machine resource enters a failed phase.
See
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-lklc3wmp-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc.yaml
The lifecycle for ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc is logged here: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-584944fdd5-jjrmx_machine-controller.log
In a real cluster, this can be remediated by covering your pool of machines with a machine health check.
Also, in the near future we'll likely make the machineSet ignore "failed" machines when reconciling replicas, so that in a case like this it automatically creates a replacement machine.
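
To illustrate the idea of ignoring "failed" machines when reconciling replicas, here is a minimal sketch. It is not the machine-api code: the Machine class, the phase strings, and the replicas_to_create function are all hypothetical names chosen for this example, and a real controller tracks far more state.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical, simplified stand-in for a Machine resource.
@dataclass
class Machine:
    name: str
    phase: str  # assumed phase names, e.g. "Provisioning", "Running", "Failed"


def replicas_to_create(desired: int, machines: List[Machine]) -> int:
    """Return how many machines are missing once Failed ones are ignored."""
    # Machines in the Failed phase do not count toward the replica total.
    active = [m for m in machines if m.phase != "Failed"]
    # Never return a negative number; scale-down is out of scope here.
    return max(desired - len(active), 0)


machines = [
    Machine("worker-a", "Running"),
    Machine("worker-b", "Running"),
    Machine("worker-c", "Failed"),
]
missing = replicas_to_create(3, machines)
print(missing)
```

With three desired replicas and one machine in the failed phase, the sketch reports one machine to create, which is the automatic recreation described above. If failed machines were counted instead, the result would be zero and the pool would stay one node short.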

Comment 4 Red Hat Bugzilla 2023-09-14 05:46:04 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
