hi @miyadav , i just want to confirm that this is not related to the issue we saw before, where the PDB alert (or any other alert) was firing at the same time?
Hi @Michael , no, nothing like that this time.
thanks @Milind, i need to read up a little more on what the "pending" state means for an alert. i am wondering whether the condition for the alert might have changed during the 6-hour window. i'm not sure if that can happen, but it's the only thing that occurs to me immediately.
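for reference, one way to watch how the alert itself has been moving between states is the built-in `ALERTS` series that prometheus exposes for every rule. a minimal sketch (the alert name below is just a placeholder i'm using for illustration, not taken from this bug):

```
# graphing this over the affected window shows when the alert flipped
# between "pending" and "firing"; the alertname value is a placeholder.
ALERTS{alertname="MachineNotYetDeleted", alertstate=~"pending|firing"}
```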
i've been reviewing the code around this metric and i am not seeing anything that would indicate an error in logic. i am going to reach out to the monitoring team to see if they might be able to help.
quick update, i got some advice about how prometheus processes these time series, and i'm looking into the possibility that one of the labels is changing over time. that would make prometheus treat it as a new series, which would break the continuity needed for the 6h window.
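as a rough way to test that idea, something like the query below might help (just a sketch; `pod` and `instance` are my first guesses at unstable labels, and the metric name is the one backing this alert). it should return more than 1 for a machine if those labels changed part-way through the window and split the series:

```
# count the distinct raw series per machine over the 6h window, ignoring
# the labels suspected of churning; a result > 1 means the series was split.
count without (pod, instance) (
  count_over_time(mapi_machine_created_timestamp_seconds{phase="Deleting"}[6h])
)
```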
i reviewed the other labels on the metric that sources the alert; sadly it doesn't look like any of these labels would be fluctuating during the observed period. i'm not sure what the next step should be, but i will probably reach out to the monitoring team again about capturing the data or writing a test to find what is causing this alert not to fire.
Hi @Michael , probably the alert changes from Firing to Pending after about 6 hours, because I tried it again and it fired and then changed back to Pending (it was about 6 hrs or so). Were you able to find anything that could cause this?
It seems that during a period of time the value of the query `mapi_machine_created_timestamp_seconds{phase="Deleting"}` is None; see the broken part in the picture (value is None). So it was Firing at the beginning, and later the result was None, so it became Pending, stayed there for 6 hours, and then changed to Firing again. Not sure why the value becomes None.
Unfortunately the screenshot cuts off some of the information in the table that we need here, but what I can see from the image is that the line changes colour. The line changing colour indicates that this is a new series, meaning that one of the labels changed. I suspect what has happened here is that the pod got rescheduled and the pod name label changed. If we were to ignore the pod name label on the series, it should appear again as unbroken. If you could grab the full table that was cut off at the bottom of the screenshot, that may confirm my suspicion here.
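For reference, a sketch of the "ignore the pod name label" check I mean (I've added `instance` as well since it usually changes together with a reschedule; treat the exact label list as an assumption):

```
# Aggregate away the operator-identity labels; if a reschedule is all that
# happened, the two coloured segments should join into one unbroken series.
max without (pod, instance) (
  mapi_machine_created_timestamp_seconds{phase="Deleting"}
)
```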
i think that seems like a strong possibility Joel. although, the value of the metric also goes to "None" on the line shown, which makes me wonder if something is changing the creation timestamp. that said, why would the pod for the machine-api-operator get rescheduled? (i didn't think the test was affecting the control plane nodes)
i talked with the monitoring team and they gave me a nice suggestion on how to ignore the `pod` label with this metric. i have created a PR which will allow this metric to persist in cases where the machine-api-operator has moved or had its name changed. this should alleviate the issues we are seeing.
hey Milind, i updated the PR yesterday to limit the metric labels that the alert is watching. i've lost track of the timing here though, did your last test include the latest change i made?
Validated at - 4.10.0-0.nightly-2021-10-16-173656

Steps:

NAME                                         PHASE      TYPE        REGION      ZONE         AGE
miyadav-1810-mksmp-master-0                  Running    m5.xlarge   us-east-2   us-east-2a   6h1m
miyadav-1810-mksmp-master-1                  Running    m5.xlarge   us-east-2   us-east-2b   6h1m
miyadav-1810-mksmp-master-2                  Running    m5.xlarge   us-east-2   us-east-2c   6h1m
miyadav-1810-mksmp-worker-us-east-2a-4gf2j   Running    m5.large    us-east-2   us-east-2a   5h58m
miyadav-1810-mksmp-worker-us-east-2b-992sl   Running    m5.large    us-east-2   us-east-2b   5h58m
miyadav-1810-mksmp-worker-us-east-2c-5db72   Deleting   m5.large    us-east-2   us-east-2c   5h58m
miyadav-1810-mksmp-worker-us-east-2c-n6q6q   Running    m5.large    us-east-2   us-east-2c   5h34m

[miyadav@miyadav Downloads]$ oc get machines
NAME                                         PHASE      TYPE        REGION      ZONE         AGE
miyadav-1810-mksmp-master-0                  Running    m5.xlarge   us-east-2   us-east-2a   6h4m
miyadav-1810-mksmp-master-1                  Running    m5.xlarge   us-east-2   us-east-2b   6h4m
miyadav-1810-mksmp-master-2                  Running    m5.xlarge   us-east-2   us-east-2c   6h4m
miyadav-1810-mksmp-worker-us-east-2a-4gf2j   Running    m5.large    us-east-2   us-east-2a   6h1m
miyadav-1810-mksmp-worker-us-east-2b-992sl   Running    m5.large    us-east-2   us-east-2b   6h1m
miyadav-1810-mksmp-worker-us-east-2c-5db72   Deleting   m5.large    us-east-2   us-east-2c   6h1m
miyadav-1810-mksmp-worker-us-east-2c-n6q6q   Running    m5.large    us-east-2   us-east-2c   5h37m

[miyadav@miyadav Downloads]$ oc get pods
NAME                                           READY   STATUS    RESTARTS        AGE
cluster-autoscaler-operator-8656dd4ff7-s8svk   2/2     Running   0               6h28m
cluster-baremetal-operator-7569985c57-r9gm2    2/2     Running   1 (6h25m ago)   6h28m
dep1-64495756b4-6nkg7                          1/1     Running   0               6h2m
dep1-64495756b4-bmhws                          1/1     Running   0               6h2m
dep1-64495756b4-fbcgc                          1/1     Running   0               6h2m
dep1-64495756b4-fqkqj                          1/1     Running   0               6h2m
dep1-64495756b4-jdr2l                          1/1     Running   0               6h2m
dep1-64495756b4-m6fmv                          1/1     Running   0               6h2m
dep1-64495756b4-vq7t9                          1/1     Running   0               6h2m
machine-api-controllers-685c9dcd46-s5bdm       7/7     Running   0               6h26m
machine-api-operator-78884564ff-kv4n7          2/2     Running   0               6h28m

[miyadav@miyadav Downloads]$ oc get deployment
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
cluster-autoscaler-operator   1/1     1            1           6h28m
cluster-baremetal-operator    1/1     1            1           6h28m
dep1                          7/7     7            7           6h2m
machine-api-controllers      1/1     1            1           6h26m
machine-api-operator          1/1     1            1           6h28m

[miyadav@miyadav Downloads]$ oc scale deployment machine-api-operator --replicas 0
deployment.apps/machine-api-operator scaled
[miyadav@miyadav Downloads]$ oc scale deployment machine-api-operator --replicas 1
deployment.apps/machine-api-operator scaled

Expected and Actual - Alert fired and remained in consistent state (Attached snap)

Additional Info: Will monitor for another 4-5 hrs to see if it flips, else will move to VERIFIED.
Moved to ASSIGNED, please review.
hey @miyadav , i'm a little confused about what happened to create the failure.

> Expected and Actual - Alert fired and remained in consistent state (Attached snap)

does this mean that the alert properly fired?

> Will monitor for another 4-5 hrs to see if it flips, else will move to VERIFIED

did it flip during that 4-5 hr monitoring? just trying to understand what i might have missed.
also, did the machine get deleted or was it still in "Deleting" phase?
i reviewed the attachments a little closer, and now i'm even more confused. it looks like the alert fires, but then stops and restarts later. i'm really curious to know what was happening to the machine during all this time: was it in "Deleting" for the entire period? and also, did something happen to disrupt the MAO or the machine-controller? very odd
hey Milind, one more comment for today. i am clearing the NEEDINFO as i have talked with the team and we believe that we should be using a different method to gather this metric in order to account for possible dropouts in the data. i am going to investigate creating a patch that will aggregate the metric and then use something like the `min_over_time` prometheus query to smooth out the data. sorry for all the noise here XD
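roughly the shape i have in mind (just a sketch, not the patch itself; the window length and the exact labels aggregated away are only for illustration):

```
# aggregate away the pod/instance labels, then take the minimum over a
# trailing window via a subquery so a short gap in the raw series does not
# reset the alert's pending timer.
min_over_time(
  (max without (pod, instance) (mapi_machine_created_timestamp_seconds{phase="Deleting"}))[15m:]
)
```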
i've created a new pull request which adds a 15-minute average over time for the source metric. this should help to smooth over any dropouts that occur because of the MAO being moved, or similar disruption.
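for anyone following along, the general shape of that expression would be something like the sketch below (not the literal PR diff). averaging a creation timestamp over 15 minutes leaves its value unchanged, so the smoothing doesn't distort the metric:

```
# 15m average over time of each raw series, with the operator-identity
# labels dropped, so a MAO reschedule or brief scrape gap leaves the
# resulting series unbroken.
max without (pod, instance) (
  avg_over_time(mapi_machine_created_timestamp_seconds{phase="Deleting"}[15m])
)
```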
Validated on -

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-27-230233   True        False         6h16m   Cluster version is 4.10.0-0.nightly-2021-10-27-230233
[miyadav@miyadav ~]$

Same steps as earlier, working fine, moving to VERIFIED.

Additional info: Even after playing with the machine-api deployment (scaling 0 -> 1) and other steps, the alert remained at Firing status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056