Bug 1771903 - "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing
Summary: "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing
Keywords:
Status: CLOSED DUPLICATE of bug 1772163
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-13 08:31 UTC by Simon Pasquier
Modified: 2023-09-14 05:46 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-14 09:34:45 UTC
Target Upstream Version:
Embargoed:



Comment 1 Alberto 2019-11-13 08:57:59 UTC
The alert is triggering legitimately: a machine has been missing an associated node for too long. We've seen this sporadically. For some unknown reason, some AWS instances hang in a pending state and are possibly terminated eventually, hence the machine resource enters a Failed phase.
See
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-lklc3wmp-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc.yaml
The machine controller log covering the lifecycle of ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc is here: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-584944fdd5-jjrmx_machine-controller.log
In a real cluster this can be remediated by covering your pool of machines with a MachineHealthCheck, which deletes unhealthy machines so the owning MachineSet replaces them.
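A minimal sketch of such a MachineHealthCheck, assuming a worker pool labeled with the standard machine-role label (the resource name, label selector, timeouts, and maxUnhealthy value below are illustrative and should be adjusted to your MachineSet):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check          # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # adjust to match the labels on your MachineSet's machines
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  # machine's node Ready condition False or Unknown for 5 minutes -> remediate
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  # stop remediating if too many machines in the pool are unhealthy at once
  maxUnhealthy: 40%
```

With this in place, a machine stuck without a node past the timeout is deleted and the MachineSet reconciles a replacement, clearing the alert.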
Also, in the near future we'll likely make MachineSet ignore "Failed" machines when reconciling replicas, so that in a case like this it automatically creates a new machine.

Comment 4 Red Hat Bugzilla 2023-09-14 05:46:04 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

