1810443 – MachineWithNoRunningPhase message does not consider the empty PHASE

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Michael Gugino
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-03-05 09:22 UTC by Junqi Zhao
Modified:	2024-03-25 15:43 UTC
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-27 15:56:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 549	None	closed	Bug 1810443: Rephrase MachineWithNoRunningPhase message	2021-02-15 22:23:54 UTC
Github	openshift machine-api-operator pull 676	None	closed	Bug 1810443: Don't collect machine metrics if machine-controller not running	2021-02-15 22:23:54 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 15:56:25 UTC

Description Junqi Zhao 2020-03-05 09:22:09 UTC

Description of problem:
If the PHASE is empty for a machine, the MachineWithNoRunningPhase alert message reads as follows:
"machine ocp-edge-cluster-master-0 is in phase". This makes people think the message is incomplete.

$ oc -n openshift-machine-api get machine
NAME                              PHASE   TYPE   REGION   ZONE   AGE
ocp-edge-cluster-master-0                                        10h
ocp-edge-cluster-master-1                                        10h
ocp-edge-cluster-master-2                                        10h
ocp-edge-cluster-worker-0-b2k2j                                  70m
ocp-edge-cluster-worker-0-btsql                                  10m
ocp-edge-cluster-worker-0-n6wfc                                  10h

alert: MachineWithNoRunningPhase
expr: (mapi_machine_created_timestamp_seconds{phase!="Running"})
  > 0
for: 10m
labels:
  severity: critical
annotations:
  message: machine {{ $labels.name }} is in {{ $labels.phase }} phase
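
To illustrate why the message reads as incomplete, here is a minimal Python sketch of the label substitution. It is not the Prometheus template engine: the helper name `render_message` and the regex-based replacement are assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for alert annotation templating: replaces
# "{{ $labels.<key> }}" with the label value, or "" if the label is missing.
_PLACEHOLDER = re.compile(r"\{\{\s*\$labels\.(\w+)\s*\}\}")


def render_message(template: str, labels: dict) -> str:
    return _PLACEHOLDER.sub(lambda m: labels.get(m.group(1), ""), template)


# Message template copied from the alert rule above.
TEMPLATE = "machine {{ $labels.name }} is in {{ $labels.phase }} phase"

running = render_message(
    TEMPLATE, {"name": "ocp-edge-cluster-master-0", "phase": "Provisioning"}
)
empty = render_message(
    TEMPLATE, {"name": "ocp-edge-cluster-master-0", "phase": ""}
)

print(running)
print(empty)  # nothing appears between "is in" and "phase"
```

With an empty phase label nothing is substituted between "is in" and "phase", which is the wording reported here.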



Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2020-03-04-165955

How reproducible:
only when the PHASE is empty

Steps to Reproduce:
1. See the description
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Alberto 2020-04-03 10:12:44 UTC

This seems to be quite an edge case in which the machine controller never ran. If anything, we can try to rephrase the message to make that more obvious.
https://github.com/openshift/machine-api-operator/pull/549

Comment 4 Milind Yadav 2020-04-06 04:57:49 UTC

Validated on:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-05-214011   True        False         16m     Cluster version is 4.5.0-0.nightly-2020-04-05-214011

One machine was set to have no phase:
[miyadav@miyadav ManualRun]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-0604-hbfwf-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   89m
miyadav-0604-hbfwf-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   89m
miyadav-0604-hbfwf-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   89m
miyadav-0604-hbfwf-worker-us-east-2a-56p54   Running   m4.large    us-east-2   us-east-2a   76m
miyadav-0604-hbfwf-worker-us-east-2b-new               m4.large    us-east-2   us-east-2b   40m
miyadav-0604-hbfwf-worker-us-east-2c-5m7cz   Running   m4.large    us-east-2   us-east-2c   76m


The alert message seen is:
machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:

This matches the change in pull request https://github.com/openshift/machine-api-operator/pull/549

Will consult with the reporter.

Comment 5 Junqi Zhao 2020-04-07 02:44:52 UTC

(In reply to Milind Yadav from comment #4)

"machine miyadav-0604-hbfwf-worker-us-east-2b-new is in phase:"
I am afraid users will be confused by this message when the PHASE is empty; it is not user friendly.

Comment 6 Saul Alanis 2020-05-15 19:05:14 UTC

I have the same issue on a fresh cluster running on VMware 6.7. Curious how to safely remove the checks since there's no machineset controller for VMware?

Client Version: 4.4.3
Server Version: 4.4.3
Kubernetes Version: v1.17.1

oc get machinesets
NAME                DESIRED   CURRENT   READY   AVAILABLE   AGE
ocp4-ctmtp-worker   0         0                             28h


oc get machines -n openshift-machine-api
NAME                  PHASE   TYPE   REGION   ZONE   AGE
ocp4-ctmtp-master-0                                  28h
ocp4-ctmtp-master-1                                  28h
ocp4-ctmtp-master-2                                  28h

Comment 7 Michael Orlov 2020-05-19 14:10:14 UTC

(In reply to Saul Alanis from comment #6)

The same issue is confirmed on a fresh installation on VMware 6.5 as well.

Comment 10 Alberto 2020-05-29 10:43:58 UTC

The vSphere scenario is covered here https://bugzilla.redhat.com/show_bug.cgi?id=1834966

This ticket tracks a more user-friendly way to communicate when the phase happens to be empty. Tagging with upcomingSprint.

Comment 12 Joel Speed 2020-08-20 12:05:34 UTC

All PRs have merged; this should be in MODIFIED now.

Comment 15 Milind Yadav 2020-09-09 11:32:10 UTC

VERIFIED ON:

4.6.0-0.nightly-2020-09-08-123737

Steps :

Created a machine with an empty phase (scaled the machineset while the machine controller was kept down using the CVO and the machine-controller deployment).

[miyadav@miyadav vsphere]$ oc get machines -o wide --config vsp
NAME                                 PHASE         TYPE   REGION   ZONE   AGE     NODE                                 PROVIDERID                                       STATE
vs-miyadav-0909-rpkms-master-0       Running                              3h48m   vs-miyadav-0909-rpkms-master-0       vsphere://422b4df8-b303-505b-99e2-592c3ae20465   poweredOn
vs-miyadav-0909-rpkms-master-1       Running                              3h48m   vs-miyadav-0909-rpkms-master-1       vsphere://422b787d-c6a3-ff14-0f2d-c5ebb7f113db   poweredOn
vs-miyadav-0909-rpkms-master-2       Running                              3h48m   vs-miyadav-0909-rpkms-master-2       vsphere://422b4f4d-7d45-94be-3bc0-0e86d431fd01   poweredOn
vs-miyadav-0909-rpkms-worker-hmjgn                                        21s                                                                                           
vs-miyadav-0909-rpkms-worker-ptjkn   Provisioned                          3h36m                                        vsphere://422bde74-788b-7af8-9383-44408377bd62   poweredOn
vs-miyadav-0909-rpkms-worker-rrq7s   Running                              3h36m   vs-miyadav-0909-rpkms-worker-rrq7s   vsphere://422be000-30eb-31f9-7f41-65b1d3545ede   poweredOn

Expected and actual: no alert fired after 10m for MachineWithNoRunningPhase.
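
This result is consistent with the title of pull request 676 ("Don't collect machine metrics if machine-controller not running"). The following is a minimal Python sketch of that idea, not the operator's actual Go code. `Machine`, `collect_machine_metrics`, `firing_alerts`, and the `controller_running` flag are hypothetical names, and the `for: 10m` hold period of the alert is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    phase: str  # "" when the machine controller has never set a phase


def collect_machine_metrics(machines, controller_running):
    """Return one sample per machine, or nothing if the controller is down.

    Assumption: when the machine controller is not running, phases are
    unreliable, so no per-machine samples are exported at all.
    """
    if not controller_running:
        return []
    return [{"name": m.name, "phase": m.phase} for m in machines]


def firing_alerts(samples):
    """Mimic the rule's phase!="Running" selector over collected samples."""
    return [s["name"] for s in samples if s["phase"] != "Running"]


machines = [
    Machine("master-0", "Running"),
    Machine("worker-hmjgn", ""),          # empty phase, as in the listing above
    Machine("worker-ptjkn", "Provisioned"),
]

down = firing_alerts(collect_machine_metrics(machines, controller_running=False))
up = firing_alerts(collect_machine_metrics(machines, controller_running=True))

print(down)
print(up)
```

In this sketch, a stopped controller yields no samples, so the selector has nothing to match and an empty-phase machine cannot trigger the alert.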

Additional Info:
Moved to VERIFIED

Comment 17 errata-xmlrpc 2020-10-27 15:56:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
