Bug 1810443 - MachineWithNoRunningPhase message does not consider the empty PHASE
Summary: MachineWithNoRunningPhase message does not consider the empty PHASE
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Michael Gugino
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-05 09:22 UTC by Junqi Zhao
Modified: 2020-10-27 15:56 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:56:23 UTC
Target Upstream Version:



Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 549 0 None closed Bug 1810443: Rephrase MachineWithNoRunningPhase message 2021-02-15 22:23:54 UTC
Github openshift machine-api-operator pull 676 0 None closed Bug 1810443: Don't collect machine metrics if machine-controller not running 2021-02-15 22:23:54 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:56:25 UTC

Description Junqi Zhao 2020-03-05 09:22:09 UTC
Description of problem:
If the PHASE is empty for a machine, the MachineWithNoRunningPhase alert message reads as follows:
"machine ocp-edge-cluster-master-0 is in phase". This makes people think the message is incomplete.

$ oc -n openshift-machine-api get machine
NAME                              PHASE   TYPE   REGION   ZONE   AGE
ocp-edge-cluster-master-0                                        10h
ocp-edge-cluster-master-1                                        10h
ocp-edge-cluster-master-2                                        10h
ocp-edge-cluster-worker-0-b2k2j                                  70m
ocp-edge-cluster-worker-0-btsql                                  10m
ocp-edge-cluster-worker-0-n6wfc                                  10h

alert: MachineWithNoRunningPhase
expr: (mapi_machine_created_timestamp_seconds{phase!="Running"})
  > 0
for: 10m
labels:
  severity: critical
annotations:
  message: machine {{ $labels.name }} is in {{ $labels.phase }} phase
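
To illustrate why the message reads as incomplete, here is a minimal Go sketch that expands the annotation template from the rule above with an empty phase label. It assumes the annotation is expanded as a Go text/template with the alert's labels bound to $labels; the alertData type and the renderMessage helper are hypothetical names used only for illustration and are not taken from Prometheus or the machine-api-operator.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alertData is a hypothetical stand-in for the data an annotation
// template is expanded against; it is not a Prometheus type.
type alertData struct {
	Labels map[string]string
}

// renderMessage expands an annotation template against the given labels.
// "$labels" is declared up front (an assumption about how the rule engine
// binds it) so the template text from the rule can be used unchanged.
func renderMessage(tmpl string, labels map[string]string) (string, error) {
	t, err := template.New("message").Parse("{{ $labels := .Labels }}" + tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, alertData{Labels: labels}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Template text copied from the alert rule in this report.
	const message = "machine {{ $labels.name }} is in {{ $labels.phase }} phase"

	out, err := renderMessage(message, map[string]string{
		"name":  "ocp-edge-cluster-master-0",
		"phase": "", // the machine has no phase set
	})
	if err != nil {
		panic(err)
	}
	// The empty label leaves a gap between "in" and "phase".
	fmt.Printf("%q\n", out) // prints "machine ocp-edge-cluster-master-0 is in  phase"
}
```

With a non-empty label such as "Provisioned", the same template yields a complete sentence, which is why the problem only shows up when the PHASE column is empty.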



Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-03-04-165955

How reproducible:
only when the PHASE is empty

Steps to Reproduce:
1. See the description

Actual results:


Expected results:


Additional info:

Comment 1 Alberto 2020-04-03 10:12:44 UTC
This seems to be quite an edge case in which the machine controller never ran. If anything, we can try to rephrase the message to make that more obvious.
https://github.com/openshift/machine-api-operator/pull/549

Comment 4 Milind Yadav 2020-04-06 04:57:49 UTC
Validated on:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-05-214011   True        False         16m     Cluster version is 4.5.0-0.nightly-2020-04-05-214011

One machine was set to have no phase:
[miyadav@miyadav ManualRun]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-0604-hbfwf-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   89m
miyadav-0604-hbfwf-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   89m
miyadav-0604-hbfwf-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   89m
miyadav-0604-hbfwf-worker-us-east-2a-56p54   Running   m4.large    us-east-2   us-east-2a   76m
miyadav-0604-hbfwf-worker-us-east-2b-new               m4.large    us-east-2   us-east-2b   40m
miyadav-0604-hbfwf-worker-us-east-2c-5m7cz   Running   m4.large    us-east-2   us-east-2c   76m


The alert message seen is:
machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:

This matches the change in pull request https://github.com/openshift/machine-api-operator/pull/549

Will consult with the reporter.

Comment 5 Junqi Zhao 2020-04-07 02:44:52 UTC
(In reply to Milind Yadav from comment #4)
> Seeing the alert message :
> machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:
> 
> Which seems as per the change in pull request :
> https://github.com/openshift/machine-api-operator/pull/549
> 
> Will consult with reporter .

machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:
I am afraid users will be confused by this message when the PHASE is empty; it is not user-friendly.
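
One possible way to address this concern, sketched in Go under stated assumptions, is a message template that branches on whether the phase label is empty. The wording of the template is hypothetical and is not the text adopted in pull request 549; the alertData type and renderMessage helper are likewise illustrative names, and the sketch assumes annotations are expanded as Go text/template with the labels bound to $labels.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alertData is a hypothetical stand-in for the data an annotation
// template is expanded against; it is not a Prometheus type.
type alertData struct {
	Labels map[string]string
}

// renderMessage expands an annotation template against the given labels,
// declaring "$labels" first so rule-style template text can be used as is.
func renderMessage(tmpl string, labels map[string]string) (string, error) {
	t, err := template.New("message").Parse("{{ $labels := .Labels }}" + tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, alertData{Labels: labels}); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// phaseAwareMessage is a hypothetical wording: an empty phase label is
// falsy in a template "if", so the else branch spells the situation out.
const phaseAwareMessage = "machine {{ $labels.name }} " +
	"{{ if $labels.phase }}is in phase {{ $labels.phase }}" +
	"{{ else }}has no phase set{{ end }}"

func main() {
	for _, phase := range []string{"", "Provisioned"} {
		out, err := renderMessage(phaseAwareMessage, map[string]string{
			"name":  "miyadav-0604-hbfwf-worker-us-east-2b-new",
			"phase": phase,
		})
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
	// Prints:
	// machine miyadav-0604-hbfwf-worker-us-east-2b-new has no phase set
	// machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase Provisioned
}
```

The branch keeps the sentence complete in both cases, at the cost of a slightly longer template in the rule.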

Comment 6 Saul Alanis 2020-05-15 19:05:14 UTC
I have the same issue on a fresh cluster running on VMware 6.7. Curious how to safely remove the checks since there's no machineset controller for VMware?

Client Version: 4.4.3
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get machinesets
NAME                DESIRED   CURRENT   READY   AVAILABLE   AGE
ocp4-ctmtp-worker   0         0                             28h


oc get machines -n openshift-machine-api
NAME                  PHASE   TYPE   REGION   ZONE   AGE
ocp4-ctmtp-master-0                                  28h
ocp4-ctmtp-master-1                                  28h
ocp4-ctmtp-master-2                                  28h

Comment 7 Michael Orlov 2020-05-19 14:10:14 UTC
The same issue is confirmed on a fresh installation on VMware 6.5 as well.
(In reply to Saul Alanis from comment #6)
> I have the same issue on a fresh cluster running on VMware 6.7. Curious how
> to safely remove the checks since there's no machineset controller for
> VMware?
> 
> Client Version: 4.4.3
> Server Version: 4.4.3
> Kubernetes Version: v1.17.1
>

Comment 10 Alberto 2020-05-29 10:43:58 UTC
The vSphere scenario is covered here https://bugzilla.redhat.com/show_bug.cgi?id=1834966

This ticket is to track a more user-friendly way to communicate when the phase happens to be empty. Tagging with upcomingSprint.

Comment 12 Joel Speed 2020-08-20 12:05:34 UTC
All PRs have merged; this should be in MODIFIED now.

Comment 15 Milind Yadav 2020-09-09 11:32:10 UTC
VERIFIED ON:

4.6.0-0.nightly-2020-09-08-123737

Steps :

Created a machine with an empty phase (scaled the machineset while the machine controller was kept down via the CVO and the machine-controller deployment).

[miyadav@miyadav vsphere]$ oc get machines -o wide --config vsp
NAME                                 PHASE         TYPE   REGION   ZONE   AGE     NODE                                 PROVIDERID                                       STATE
vs-miyadav-0909-rpkms-master-0       Running                              3h48m   vs-miyadav-0909-rpkms-master-0       vsphere://422b4df8-b303-505b-99e2-592c3ae20465   poweredOn
vs-miyadav-0909-rpkms-master-1       Running                              3h48m   vs-miyadav-0909-rpkms-master-1       vsphere://422b787d-c6a3-ff14-0f2d-c5ebb7f113db   poweredOn
vs-miyadav-0909-rpkms-master-2       Running                              3h48m   vs-miyadav-0909-rpkms-master-2       vsphere://422b4f4d-7d45-94be-3bc0-0e86d431fd01   poweredOn
vs-miyadav-0909-rpkms-worker-hmjgn                                        21s                                                                                           
vs-miyadav-0909-rpkms-worker-ptjkn   Provisioned                          3h36m                                        vsphere://422bde74-788b-7af8-9383-44408377bd62   poweredOn
vs-miyadav-0909-rpkms-worker-rrq7s   Running                              3h36m   vs-miyadav-0909-rpkms-worker-rrq7s   vsphere://422be000-30eb-31f9-7f41-65b1d3545ede   poweredOn

Expected and actual: no MachineWithNoRunningPhase alert fired after 10m.

Additional Info:
Moved to VERIFIED

Comment 17 errata-xmlrpc 2020-10-27 15:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

