Bug 1771903 - "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing [NEEDINFO]
Summary: "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts are firing
Keywords:
Status: CLOSED DUPLICATE of bug 1772163
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-13 08:31 UTC by Simon Pasquier
Modified: 2019-11-14 09:34 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-14 09:34:45 UTC
Target Upstream Version:
spasquie: needinfo? (agarcial)


Attachments (Terms of Use)

Comment 1 Alberto 2019-11-13 08:57:59 UTC
The alert is triggering legitimately when a machine is missing an associated node for too long. We've seen this sporadically, for some unknown reason seems some aws instances are hanging in a pending state and possibly being terminated eventually hence the machine resource enters a failed phase.
See
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ci-op-lklc3wmp-stable-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc.yaml
And the lifecycle for ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc here https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569/artifacts/e2e-aws/pods/openshift-machine-api_machine-api-controllers-584944fdd5-jjrmx_machine-controller.log
In a real cluster this can be remediated by covering your pool of machines with a machine health check.
Also in the near future we'll likely make machineSet to ignore "failed" machines to reconcile replicas so for a case like this it automatically recreate a new machine.


Note You need to log in before you can comment on or make changes to this bug.