Bug 1844986 - Overlapping Machine health checks events results in spam when a machine needs remediation
Summary: Overlapping Machine health checks events results in spam when a machine needs...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ---
: 4.6.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-06-08 08:24 UTC by Alberto
Modified: 2020-10-27 16:05 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The MachineHealthCheck controller would not check if a Machine had been deleted before attempting to delete it Consequence: Multiple delete requests would be sent for the same Machine, each delete request resulted in a log and an event Fix: Check if a Machine has been deleted before attempting to delete it Result: Duplicate logs and events have been reduced
Clone Of:
Environment:
Last Closed: 2020-10-27 16:05:31 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 623 0 None closed Bug 1844986: Do not send Machine deletion event if the Machine has already been deleted 2021-01-12 13:45:00 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:05:53 UTC

Description Alberto 2020-06-08 08:24:03 UTC
Description of problem:

When a machine watched by the MHC is considered to need remediation, deletion requests and events are recorded unconditionally in loop until the machine is gone resulting in event spam. This should happen only once.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Events spam.

E.g.
Jun 04 22:49:48.053 I ns/openshift-machine-api machine/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd reason/MachineDeleted Machine openshift-machine-api/e2e-health-check-worker/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd/ip-10-0-171-88.ec2.internal has been remediated by requesting to delete Machine object
Jun 04 22:49:48.087 I ns/openshift-machine-api machine/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd reason/MachineDeleted Machine openshift-machine-api/e2e-health-check-worker/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd/ip-10-0-171-88.ec2.internal has been remediated by requesting to delete Machine object (2 times)
Jun 04 22:49:48.123 I ns/openshift-machine-api machine/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd reason/MachineDeleted Machine openshift-machine-api/e2e-health-check-worker/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-7rvmd/ip-10-0-171-88.ec2.internal has been remediated by requesting to delete Machine object (3 times)
Jun 04 22:49:48.124 I ns/openshift-machine-api machine/ci-op-dgz95p2d-067ff-pxnzv-worker-us-east-1b-qrmgl reason/DetectedUnhealthy Machine openshift-machine-api/e2e-health-


Expected results:
Record event for deletion only once.

Additional info:

Comment 2 Alberto 2020-06-19 08:04:40 UTC
Planning to address this during upcomingSprint.

Comment 3 Joel Speed 2020-06-22 15:46:07 UTC
I've raised a PR that will check if the Machine has been deleted before attempting to delete it and skips sending the delete call and the delete event if the Machine has already been deleted

Comment 6 Milind Yadav 2020-07-02 05:35:46 UTC
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-30-020342   True        False         106m    Cluster version is 4.6.0-0.nightly-2020-06-30-020342


Steps:
1.Create mhc using below :

apiVersion: "machine.openshift.io/v1beta1"
kind: "MachineHealthCheck"
metadata:
  name: mhc1
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-0702-jwtwh
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: miyadav-0702-jwtwh-worker-us-east-2a
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 3

Actual & Expected , mhc created successfully 

2.Delete machine of the machineset getting monitored

Actual and expected : machine deleted successfully

3. Monitor logs :
oc logs  machine-api-controllers-6f467bfc-ndzqj -c machine-controller  | grep delete

.
.
I0702 03:53:55.422842       1 controller.go:217] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: reconciling machine triggers delete
I0702 03:53:55.999127       1 controller.go:217] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: reconciling machine triggers delete
I0702 03:54:25.987396       1 controller.go:217] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: reconciling machine triggers delete
I0702 03:54:30.823811       1 controller.go:217] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: reconciling machine triggers delete
I0702 03:54:56.371037       1 controller.go:217] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: reconciling machine triggers delete
I0702 03:54:56.377749       1 controller.go:346] Could not find node from noderef, it may have already been deleted: ip-10-0-148-159.us-east-2.compute.internal
W0702 03:54:56.543267       1 reconciler.go:108] miyadav-0702-jwtwh-worker-us-east-2a-7zq5s: no instances found to delete for machine                            

As Expected

4. oc get events :
[miyadav@miyadav ~]$ oc get events | grep requesting
111m        Normal    MachineDeleted      machine/miyadav-0702-jwtwh-worker-us-east-2a-nxbd9   Machine openshift-machine-api/mhc1/miyadav-0702-jwtwh-worker-us-east-2a-nxbd9/ip-10-0-131-176.us-east-2.compute.internal has been remediated by requesting to delete Machine object

As expected

Additional info

logs of earlier version 4.4

.
.

I0702 04:47:22.324720       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:47:42.583364       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:04.348089       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:16.090063       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:22.684951       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:51.147782       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
W0702 04:48:51.477298       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:52.596945       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
W0702 04:48:52.786056       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:53.882622       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
W0702 04:48:54.078475       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:55.173135       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
W0702 04:48:55.348079       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:56.672863       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:56.678646       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:48:56.776967       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:57.877306       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:57.882742       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:48:57.998559       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:48:59.094517       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:48:59.102316       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:48:59.189758       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:49:00.290766       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:49:00.299654       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:49:00.389612       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:49:01.495477       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:49:01.501752       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:49:01.610120       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:49:02.984863       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:49:02.990204       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:49:03.104123       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:49:05.820811       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:49:05.827009       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:49:05.931050       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine
I0702 04:49:11.164889       1 controller.go:203] miyadav-07-ldn98-worker-us-east-2a-gffw5: reconciling machine triggers delete
I0702 04:49:11.173109       1 controller.go:332] Could not find node from noderef, it may have already been deleted: ip-10-0-151-143.us-east-2.compute.internal
W0702 04:49:11.278721       1 actuator.go:363] miyadav-07-ldn98-worker-us-east-2a-gffw5: no instances found to delete for machine.
.
.

Moved to VERIFIED

Comment 8 errata-xmlrpc 2020-10-27 16:05:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196


Note You need to log in before you can comment on or make changes to this bug.