Bug 1810443

Summary: MachineWithNoRunningPhase message does not consider the empty PHASE
Product: OpenShift Container Platform Reporter: Junqi Zhao <juzhao>
Component: Cloud Compute Assignee: Michael Gugino <mgugino>
Cloud Compute sub component: Other Providers QA Contact: Milind Yadav <miyadav>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: agarcial, michael.orlov, miyadav, salanis, skuznets, zhsun
Version: 4.3.z   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 15:56:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Junqi Zhao 2020-03-05 09:22:09 UTC
Description of problem:
If the PHASE is empty for a machine, the MachineWithNoRunningPhase alert reports a message like the one below:
"machine ocp-edge-cluster-master-0 is in phase". This makes people think the message is incomplete.

$ oc -n openshift-machine-api get machine
NAME                              PHASE   TYPE   REGION   ZONE   AGE
ocp-edge-cluster-master-0                                        10h
ocp-edge-cluster-master-1                                        10h
ocp-edge-cluster-master-2                                        10h
ocp-edge-cluster-worker-0-b2k2j                                  70m
ocp-edge-cluster-worker-0-btsql                                  10m
ocp-edge-cluster-worker-0-n6wfc                                  10h
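
The empty PHASE column above corresponds to machines whose status carries no phase. A minimal Python sketch for picking those out of `oc -n openshift-machine-api get machine -o json` output, assuming the phase is exposed at `.status.phase` on each item. The helper name and the sample data are illustrative only and do not come from this report:

```python
import json


def machines_without_phase(machine_list):
    """Return the names of machines whose status.phase is missing or empty."""
    names = []
    for item in machine_list.get("items", []):
        # status may be absent entirely when the machine controller never ran
        phase = (item.get("status") or {}).get("phase") or ""
        if not phase:
            names.append(item["metadata"]["name"])
    return names


# Illustrative sample shaped like `oc get machine -o json` output (assumed layout).
sample = json.loads("""
{"items": [
  {"metadata": {"name": "ocp-edge-cluster-master-0"}},
  {"metadata": {"name": "ocp-edge-cluster-worker-0-b2k2j"}, "status": {}},
  {"metadata": {"name": "example-running"}, "status": {"phase": "Running"}}
]}
""")

print(machines_without_phase(sample))
```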

alert: MachineWithNoRunningPhase
expr: (mapi_machine_created_timestamp_seconds{phase!="Running"})
  > 0
for: 10m
labels:
  severity: critical
annotations:
  message: machine {{ $labels.name }} is in {{ $labels.phase }} phase
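
To illustrate why the message reads as incomplete: the matcher phase!="Running" also selects a series whose phase label is empty, and the template then substitutes an empty string. A rough Python sketch of both effects follows. It emulates the label matcher and the message template with plain string handling rather than calling Prometheus, and the function names are made up for illustration:

```python
def matches_not_running(labels):
    # Emulates the matcher phase!="Running"; a missing label is treated as "".
    return labels.get("phase", "") != "Running"


def render_message(labels):
    # Emulates: machine {{ $labels.name }} is in {{ $labels.phase }} phase
    return "machine {} is in {} phase".format(
        labels.get("name", ""), labels.get("phase", "")
    )


empty = {"name": "ocp-edge-cluster-master-0", "phase": ""}
running = {"name": "example-running", "phase": "Running"}

print(matches_not_running(empty))    # the empty phase is selected by the rule
print(matches_not_running(running))  # a Running machine is not selected
print(repr(render_message(empty)))   # the phase slot renders as an empty string
```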



Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-03-04-165955

How reproducible:
only when the PHASE is empty

Steps to Reproduce:
1. See the description

Actual results:


Expected results:


Additional info:

Comment 1 Alberto 2020-04-03 10:12:44 UTC
This seems to be a real edge case where the machine controller was never run. If anything, we can try to rephrase the message to make that more obvious.
https://github.com/openshift/machine-api-operator/pull/549

Comment 4 Milind Yadav 2020-04-06 04:57:49 UTC
Validated on :
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-05-214011   True        False         16m     Cluster version is 4.5.0-0.nightly-2020-04-05-214011

One machine was set to have no phase : 
[miyadav@miyadav ManualRun]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-0604-hbfwf-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   89m
miyadav-0604-hbfwf-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   89m
miyadav-0604-hbfwf-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   89m
miyadav-0604-hbfwf-worker-us-east-2a-56p54   Running   m4.large    us-east-2   us-east-2a   76m
miyadav-0604-hbfwf-worker-us-east-2b-new               m4.large    us-east-2   us-east-2b   40m
miyadav-0604-hbfwf-worker-us-east-2c-5m7cz   Running   m4.large    us-east-2   us-east-2c   76m


Seeing the alert message :
machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:

This seems to match the change in pull request: https://github.com/openshift/machine-api-operator/pull/549

Will consult with the reporter.

Comment 5 Junqi Zhao 2020-04-07 02:44:52 UTC
(In reply to Milind Yadav from comment #4)

machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:
I am afraid users will be confused by this message if the PHASE is empty; it is not user friendly.
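
One possible direction, sketched in Python purely as an illustration and not as the change made in the linked pull request: substitute explicit wording when the phase is empty, so the sentence stays complete. The helper name and the placeholder wording are assumptions.

```python
def describe_phase(name, phase):
    # Hypothetical helper: use explicit wording when the phase is empty,
    # otherwise follow the "is in phase: <phase>" shape seen in the alert above.
    if not phase:
        return "machine {} has no phase set".format(name)
    return "machine {} is in phase: {}".format(name, phase)


print(describe_phase("miyadav-0604-hbfwf-worker-us-east-2b-new", ""))
print(describe_phase("miyadav-0604-hbfwf-master-0", "Running"))
```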

Comment 6 Saul Alanis 2020-05-15 19:05:14 UTC
I have the same issue on a fresh cluster running on VMware 6.7. Curious how to safely remove the checks since there's no machineset controller for VMware?

Client Version: 4.4.3
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get machinesets
NAME                DESIRED   CURRENT   READY   AVAILABLE   AGE
ocp4-ctmtp-worker   0         0                             28h


oc get machines -n openshift-machine-api
NAME                  PHASE   TYPE   REGION   ZONE   AGE
ocp4-ctmtp-master-0                                  28h
ocp4-ctmtp-master-1                                  28h
ocp4-ctmtp-master-2                                  28h

Comment 7 Michael Orlov 2020-05-19 14:10:14 UTC
The same issue is confirmed on a fresh installation on VMware 6.5 as well.
(In reply to Saul Alanis from comment #6)

Comment 10 Alberto 2020-05-29 10:43:58 UTC
The vSphere scenario is covered here https://bugzilla.redhat.com/show_bug.cgi?id=1834966

This ticket is to track a more user-friendly way to communicate when the phase happens to be empty. Tagging with upcomingSprint.

Comment 12 Joel Speed 2020-08-20 12:05:34 UTC
All PRs have merged; this should be in MODIFIED now.

Comment 15 Milind Yadav 2020-09-09 11:32:10 UTC
VERIFIED ON:

4.6.0-0.nightly-2020-09-08-123737

Steps :

Created a machine with an empty phase (scaled the machineset while the machine controller was kept down using the CVO and the machine-controller deployment).

[miyadav@miyadav vsphere]$ oc get machines -o wide --config vsp
NAME                                 PHASE         TYPE   REGION   ZONE   AGE     NODE                                 PROVIDERID                                       STATE
vs-miyadav-0909-rpkms-master-0       Running                              3h48m   vs-miyadav-0909-rpkms-master-0       vsphere://422b4df8-b303-505b-99e2-592c3ae20465   poweredOn
vs-miyadav-0909-rpkms-master-1       Running                              3h48m   vs-miyadav-0909-rpkms-master-1       vsphere://422b787d-c6a3-ff14-0f2d-c5ebb7f113db   poweredOn
vs-miyadav-0909-rpkms-master-2       Running                              3h48m   vs-miyadav-0909-rpkms-master-2       vsphere://422b4f4d-7d45-94be-3bc0-0e86d431fd01   poweredOn
vs-miyadav-0909-rpkms-worker-hmjgn                                        21s                                                                                           
vs-miyadav-0909-rpkms-worker-ptjkn   Provisioned                          3h36m                                        vsphere://422bde74-788b-7af8-9383-44408377bd62   poweredOn
vs-miyadav-0909-rpkms-worker-rrq7s   Running                              3h36m   vs-miyadav-0909-rpkms-worker-rrq7s   vsphere://422be000-30eb-31f9-7f41-65b1d3545ede   poweredOn

Expected & Actual: No MachineWithNoRunningPhase alert fired after 10m.

Additional Info:
Moved to VERIFIED

Comment 17 errata-xmlrpc 2020-10-27 15:56:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196