Bug 1986237 - "MachineNotYetDeleted" in Pending state , alert not fired
Summary: "MachineNotYetDeleted" in Pending state , alert not fired
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 1976744
Blocks:
 
Reported: 2021-07-27 03:16 UTC by Milind Yadav
Modified: 2022-03-10 16:05 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Machine API Operator deployment moves, or its Pods are redeployed, so that the pod names change. Consequence: The MachineNotYetDeleted metric is reset for each Machine that was being monitored. Fix: The metric query was changed to ignore the name of the pod that originated the metric. Result: The MachineNotYetDeleted alert now fires properly in scenarios where the Machine API Operator pod has been renamed.
Clone Of: 1976744
Environment:
Last Closed: 2022-03-10 16:04:47 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 913 0 None None None 2021-09-07 17:23:39 UTC
Github openshift machine-api-operator pull 941 0 None open Bug 1986237: add avg_over_time to the MachineNotYetDeleted alert 2021-10-25 20:44:33 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:05:06 UTC

Comment 1 Michael McCune 2021-07-29 20:56:19 UTC
hi @miyadav, i just want to confirm that this is not related to the issue we saw before, with the PDB alert (or any other alert) firing at the same time?

Comment 2 Milind Yadav 2021-07-30 06:33:17 UTC
Hi @Michael, no, nothing like that this time

Comment 3 Michael McCune 2021-07-30 20:11:43 UTC
thanks @Milind, i need to read up a little more on what the "pending" state means for an alert. i am wondering if it is possible that the condition for the alert might have changed during the 6 hour window. i'm not sure if that is possible, but it's the only thing that occurs to me immediately.

Comment 4 Michael McCune 2021-08-17 16:48:35 UTC
i've been reviewing the code around this metric and i am not seeing anything that would indicate an error in logic. i am going to reach out to the monitoring team to see if they might be able to help.

Comment 5 Michael McCune 2021-08-19 16:39:41 UTC
quick update, i got some advice about how these time series are processed by prometheus and i'm looking into the possibility that one of the labels is changing over time. this would cause the entire series to be invalid against the 6h window.

Comment 6 Michael McCune 2021-08-24 19:12:47 UTC
i reviewed the other labels on the metric that sources the alert, sadly it doesn't look like any of these labels would be fluctuating during the observed period. i'm not sure what the next step should be, but i will probably reach out to the monitoring team again to see about the idea of capturing the data or writing a test to find what is causing this alert not to fire.

Comment 7 Milind Yadav 2021-08-25 07:34:39 UTC
Hi @Michael, probably the alert changes from Firing back to Pending after about 6 hrs, because I tried it again and it fired and then changed back to Pending (it was about 6 hrs or so). Were you able to find anything that could cause this?

Comment 9 sunzhaohua 2021-08-25 08:01:32 UTC
It seems that for a period of time the value of the query (mapi_machine_created_timestamp_seconds{phase="Deleting"}) is None; see the broken part of the picture (value is None). So it was Firing at the beginning, then the result became None and it went back to Pending, stayed there for 6 hours, then changed to Firing again. Not sure why the value becomes None.
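One way to check whether a gap like this comes from a label change rather than from the machine actually leaving the "Deleting" phase is to compare the raw series with a per-pod breakdown. These queries are illustrative only and are not taken from the cluster above:

# raw series, one per label set, including the pod label
mapi_machine_created_timestamp_seconds{phase="Deleting"}

# per-pod breakdown; if the count shifts from one pod value to another
# across the gap, the break is a new series from a renamed pod
count by (pod) (mapi_machine_created_timestamp_seconds{phase="Deleting"})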

Comment 10 Joel Speed 2021-08-25 09:07:47 UTC
Unfortunately the screenshot just cuts off some of the information in the table that we need here, but what I can see from that image is that the line changes colour. The line changing colour indicates that this is a new series, meaning that one of the labels changed. I suspect what has happened here is that the pod got rescheduled, and the pod name label changed. If we were to ignore the pod name label on the series, it should appear again as unbroken.

If you could grab the full table that was cut off at the bottom of the screenshot, that may help confirm my suspicion here.
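As a minimal sketch of that suggestion (assuming the metric shape quoted in comment 9, and not the exact expression later merged into the operator), aggregating the pod label away keeps the series unbroken across a pod rename:

# max avoids double counting if the old and new machine-api-operator pods
# briefly report the same machine at the same time
max without (pod) (mapi_machine_created_timestamp_seconds{phase="Deleting"})

The result is keyed only by the remaining labels, so a machine-api-operator reschedule no longer starts a fresh series.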

Comment 11 Michael McCune 2021-08-25 19:27:54 UTC
i think that seems like a strong possibility, Joel. although the value of the metric does go to "None" on the line shown, which makes me wonder if there is something changing the creation timestamp.

that said, why would the pod for the machine-api-operator get rescheduled? (i didn't think the test was affecting the control plane nodes)

Comment 13 Michael McCune 2021-09-07 17:21:42 UTC
i talked with the monitoring team and they gave me a nice suggestion on how to ignore the `pod` label with this metric. i have created a PR which will allow this metric to persist in cases where the machine-api-operator has moved or had its name changed. this should alleviate the issues we are seeing.

Comment 15 Michael McCune 2021-09-09 19:04:53 UTC
hey Milind, i updated the PR yesterday to limit the metric labels that the alert is watching. i've lost track of the timing here though, did your last test include the latest change i made?

Comment 18 Milind Yadav 2021-10-18 10:09:31 UTC
Validated at - 4.10.0-0.nightly-2021-10-16-173656


Steps:

NAME                                         PHASE      TYPE        REGION      ZONE         AGE
miyadav-1810-mksmp-master-0                  Running    m5.xlarge   us-east-2   us-east-2a   6h1m
miyadav-1810-mksmp-master-1                  Running    m5.xlarge   us-east-2   us-east-2b   6h1m
miyadav-1810-mksmp-master-2                  Running    m5.xlarge   us-east-2   us-east-2c   6h1m
miyadav-1810-mksmp-worker-us-east-2a-4gf2j   Running    m5.large    us-east-2   us-east-2a   5h58m
miyadav-1810-mksmp-worker-us-east-2b-992sl   Running    m5.large    us-east-2   us-east-2b   5h58m
miyadav-1810-mksmp-worker-us-east-2c-5db72   Deleting   m5.large    us-east-2   us-east-2c   5h58m
miyadav-1810-mksmp-worker-us-east-2c-n6q6q   Running    m5.large    us-east-2   us-east-2c   5h34m


[miyadav@miyadav Downloads]$ oc get machines
NAME                                         PHASE      TYPE        REGION      ZONE         AGE
miyadav-1810-mksmp-master-0                  Running    m5.xlarge   us-east-2   us-east-2a   6h4m
miyadav-1810-mksmp-master-1                  Running    m5.xlarge   us-east-2   us-east-2b   6h4m
miyadav-1810-mksmp-master-2                  Running    m5.xlarge   us-east-2   us-east-2c   6h4m
miyadav-1810-mksmp-worker-us-east-2a-4gf2j   Running    m5.large    us-east-2   us-east-2a   6h1m
miyadav-1810-mksmp-worker-us-east-2b-992sl   Running    m5.large    us-east-2   us-east-2b   6h1m
miyadav-1810-mksmp-worker-us-east-2c-5db72   Deleting   m5.large    us-east-2   us-east-2c   6h1m
miyadav-1810-mksmp-worker-us-east-2c-n6q6q   Running    m5.large    us-east-2   us-east-2c   5h37m


[miyadav@miyadav Downloads]$ oc get pods
NAME                                           READY   STATUS    RESTARTS        AGE
cluster-autoscaler-operator-8656dd4ff7-s8svk   2/2     Running   0               6h28m
cluster-baremetal-operator-7569985c57-r9gm2    2/2     Running   1 (6h25m ago)   6h28m
dep1-64495756b4-6nkg7                          1/1     Running   0               6h2m
dep1-64495756b4-bmhws                          1/1     Running   0               6h2m
dep1-64495756b4-fbcgc                          1/1     Running   0               6h2m
dep1-64495756b4-fqkqj                          1/1     Running   0               6h2m
dep1-64495756b4-jdr2l                          1/1     Running   0               6h2m
dep1-64495756b4-m6fmv                          1/1     Running   0               6h2m
dep1-64495756b4-vq7t9                          1/1     Running   0               6h2m
machine-api-controllers-685c9dcd46-s5bdm       7/7     Running   0               6h26m
machine-api-operator-78884564ff-kv4n7          2/2     Running   0               6h28m


[miyadav@miyadav Downloads]$ oc get deployment
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator   1/1     1            1           6h28m
cluster-baremetal-operator    1/1     1            1           6h28m
dep1                          7/7     7            7           6h2m
machine-api-controllers       1/1     1            1           6h26m
machine-api-operator          1/1     1            1           6h28m


[miyadav@miyadav Downloads]$ oc scale deployment machine-api-operator --replicas 0
deployment.apps/machine-api-operator scaled
[miyadav@miyadav Downloads]$ oc scale deployment machine-api-operator --replicas 1
deployment.apps/machine-api-operator scaled

Expected and Actual - Alert fired and remained in a consistent state (attached snap)
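One way to watch the alert state during the monitoring window is Prometheus's built-in ALERTS metric, for example from the console's metrics page. This query is illustrative and not part of the recorded steps:

# alertstate will be "pending" or "firing"; a flip back to pending
# would show up as a change in the alertstate label over time
ALERTS{alertname="MachineNotYetDeleted"}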

Additional Info:
Will monitor for another 4-5 hrs to see if it flips, else will move to VERIFIED.

Comment 22 Milind Yadav 2021-10-21 12:03:54 UTC
Moved to assigned, please review

Comment 23 Michael McCune 2021-10-21 14:27:29 UTC
hey @miyadav, i'm a little confused about what happened to create the failure.

> Expected and Actual - Alert fired and remained in consistent state ( Attached snap )

does this mean that the alert properly fired?

> Will monitor for another 4-5 hrs to see if it flips , else will move to VERIFIED

did it flip during that 4-5 hr monitoring?

just trying to understand what i might have missed.

Comment 24 Michael McCune 2021-10-21 14:29:12 UTC
also, did the machine get deleted or was it still in "Deleting" phase?

Comment 25 Michael McCune 2021-10-21 14:32:30 UTC
i reviewed the attachments a little closer, and now i'm even more confused. it looks like the alert fires, but then stops and restarts later. i'm really curious to know what was happening to the machine during all this time: was it in "Deleting" for the entire period? and also, did something happen to disrupt the MAO or the machine-controller? very odd

Comment 26 Michael McCune 2021-10-21 17:23:48 UTC
hey Milind, one more comment for today.

i am clearing the NEEDINFO as i have talked with the team and we believe that we should be using a different method to gather this metric in order to account for possible dropouts in the data. i am going to investigate creating a patch that will aggregate the metric and then use something like the `min_over_time` prometheus query to smooth out the data.

sorry for all the noise here XD

Comment 27 Michael McCune 2021-10-25 20:46:25 UTC
i've created a new pull request which adds a 15 minute average over time for the source metric. this should help to smooth over any dropouts that occur because of the MAO being moved, or similar disruption.
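A minimal sketch of that kind of smoothing, assuming the metric quoted earlier and not necessarily the exact rule in the pull request:

# averaging over a 15 minute window bridges short dropouts (for example a
# machine-api-operator pod rename) so the alert's long "for" duration,
# on the order of 6h, is not reset by a brief gap in the series
avg_over_time(mapi_machine_created_timestamp_seconds{phase="Deleting"}[15m]) > 0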

Comment 30 Milind Yadav 2021-10-28 16:01:25 UTC
Validated on:
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-27-230233   True        False         6h16m   Cluster version is 4.10.0-0.nightly-2021-10-27-230233
[miyadav@miyadav ~]$ 



Same steps as earlier, working fine, moving to VERIFIED


Additional info:

Even after playing with the machine-api deployment (scaling 0->1 and other changes), the alert remained in Firing status.

Comment 33 errata-xmlrpc 2022-03-10 16:04:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

