Bug 2008601 - WMCO ignores delete events for machines with invalid IP addresses
Summary: WMCO ignores delete events for machines with invalid IP addresses
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: jvaldes
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks: 2008992
Reported: 2021-09-28 16:16 UTC by jvaldes
Modified: 2022-03-28 09:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The windows-exporter metrics endpoint object contains a reference to a deleted machine.
Consequence: WMCO ignores delete events for machines with invalid IP addresses.
Fix: Remove the validation of the machine object from the event filtering.
Result: The windows-exporter metrics endpoint object is correctly updated even when the machine is still in the Deleting phase.
Clone Of:
: 2008992 (view as bug list)
Environment:
Last Closed: 2022-03-28 09:36:28 UTC
Target Upstream Version:
Embargoed:


Attachments
WMCO log (152.96 KB, text/plain)
2021-10-06 11:51 UTC, Ronnie Rasouli
no flags


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift windows-machine-config-operator pull 706 | 0 | None | Draft | Bug 2008601: [wm] Fix delete event subscription | 2021-09-28 16:52:25 UTC
Red Hat Product Errata | RHSA-2022:0577 | 0 | None | None | None | 2022-03-28 09:36:45 UTC

Description jvaldes 2021-09-28 16:16:53 UTC
This bug was initially created as a light copy of Bug #1991739

I am copying this bug because: 
The resulting symptom still shows up, though less frequently because the root cause is slightly different. It can still be reproduced on vSphere with a 4.7 cluster running WMCO 2.0.3.

Description of problem:
WMCO ignores the `Deleting` phase notification event for Windows machines with a missing or invalid IPv4 address.
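
The behavior described here, together with the fix noted in the Doc Text field (removing machine validation from the event filtering), can be modeled roughly as follows. This is a minimal Python sketch of the filtering logic only, not WMCO's actual code (the operator is written in Go). `Machine`, `has_valid_internal_ip`, `accept_event_before_fix`, and `accept_event_after_fix` are hypothetical names introduced for illustration, and the pre-fix and post-fix behavior shown is an assumption based on this report.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    """Hypothetical stand-in for the Machine object seen by the event filter."""
    name: str
    phase: str
    internal_ip: Optional[str] = None


def has_valid_internal_ip(machine: Machine) -> bool:
    """Return True if the machine has a parseable internal IPv4 address."""
    if not machine.internal_ip:
        return False
    try:
        ipaddress.IPv4Address(machine.internal_ip)
    except ipaddress.AddressValueError:
        return False
    return True


def accept_event_before_fix(machine: Machine) -> bool:
    # Assumed pre-fix behavior: the filter validates the machine, so an event
    # for a machine that has already lost its IP is dropped.
    return has_valid_internal_ip(machine)


def accept_event_after_fix(machine: Machine) -> bool:
    # Assumed post-fix behavior: the filter no longer validates the machine,
    # so the event reaches the reconciler and the endpoints can be updated.
    return True


if __name__ == "__main__":
    deleting = Machine(name="winworker-rh5cr", phase="Deleting", internal_ip=None)
    print(accept_event_before_fix(deleting))  # prints False
    print(accept_event_after_fix(deleting))   # prints True
```

In this model, a machine in the `Deleting` phase with no internal IP is silently filtered out before the fix, which matches the stale `windows-exporter` entry reported below.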

Version-Release number of selected component (if applicable):
WMCO 2.0.3 running on cluster with version 4.7.24 

How reproducible:
Sometimes; it depends on platform performance while removing a virtual machine.

Steps to Reproduce:
1. WMCO configured and running
2. Create a valid machineSet with 1 replica
3. Observe the node information in the `windows-exporter` metrics endpoint object.
    Note the IP Addresses, for example: 172.31.251.250
4. Delete the machineSet
5. Wait for the Windows machine to disappear
6. Check the `windows-exporter` metrics endpoint object again. If `Subsets` still contains an entry mapped to the IP address of a deleted machine, you have reproduced the bug: the entry is stale, because metrics are no longer available for a deleted machine.
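
The check in the last step can be sketched as a small script. This is only an illustration, assuming the text layout of `oc describe endpoints` shown in this report; `parse_addresses` and `stale_addresses` are hypothetical helper names, and the list of live machine IPs is supplied by the caller.

```python
from typing import Iterable, List


def parse_addresses(describe_output: str) -> List[str]:
    """Collect the IPs from every 'Addresses:' line of the describe output."""
    addresses: List[str] = []
    for line in describe_output.splitlines():
        stripped = line.strip()
        # 'NotReadyAddresses:' does not start with 'Addresses:', so it is skipped.
        if stripped.startswith("Addresses:"):
            value = stripped[len("Addresses:"):].strip()
            if value and value != "<none>":
                addresses.extend(a.strip() for a in value.split(","))
    return addresses


def stale_addresses(endpoint_addresses: Iterable[str],
                    live_machine_ips: Iterable[str]) -> List[str]:
    """Return endpoint addresses that no longer belong to a live machine."""
    live = set(live_machine_ips)
    return [a for a in endpoint_addresses if a not in live]


if __name__ == "__main__":
    sample = """\
Subsets:
  Addresses:          172.31.251.250
  NotReadyAddresses:  <none>
"""
    # After the machineSet is deleted there are no live Windows machines.
    print(stale_addresses(parse_addresses(sample), []))  # prints ['172.31.251.250']
```

A non-empty result after the machine has disappeared indicates that the bug has been reproduced.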

Actual results:
WMCO with DEBUG logging enabled shows:
```
DEBUG   controller.windowsmachine   invalid Machine {
	"name": "winworker-rh5cr",
	"error": "no internal IP address associated",
	"errorVerbose": "no internal IP address associated, ..."
	...
}

```

The `windows-exporter` metrics endpoint object contains Subsets with an IP address of a deleted machine
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
      Addresses:          172.31.251.250
      NotReadyAddresses:  <none>
      Ports:
        Name     Port  Protocol
        ----     ----  --------
        metrics  9182  TCP

    Events:  <none>
```


Expected results:

WMCO with DEBUG logging enabled shows:
```
DEBUG controller.windowsmachine   machine not provisioned {
 	"windowsmachine": "openshift-machine-api/winworker-vdmnd",
	"phase": "Deleting"
}

INFO	metrics	Prometheus configured	{
 	 "endpoints": "windows-exporter",
	 "port": 9182,
	 "name": "metrics"
}

```

The IP address of the deleted machine does not appear in the `windows-exporter` metrics endpoint object.
With replicas set to 1, Subsets must be empty.
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
    Events:  <none>
```

Comment 2 jvaldes 2021-09-29 14:09:34 UTC
Marking as VERIFIED to allow the release-4.8/4.9 PRs to merge. Will update it to ON_QA once that PR merges.

Comment 3 Ronnie Rasouli 2021-10-06 11:50:02 UTC
The IP of the previous machine has not been deleted, since the ConfigMap still retains the machine endpoint.
oc describe configmaps windows-instances
Name:         windows-instances
Namespace:    openshift-windows-machine-config-operator
Labels:       <none>
Annotations:  <none>

Data
====
10.0.136.148:
----
username=Administrator
Events:
  Type     Reason                Age                  From       Message
  ----     ------                ----                 ----       -------
  Warning  InstanceSetupFailure  11m (x10 over 160m)  configmap  error configuring host with address 10.0.136.148: failed to create new nodeconfig: error instantiating Windows instance from VM: unable to setup VM 10.0.136.148 sshConnectivity: error instantiating SSH client: unable to connect to Windows VM 10.0.136.148: timed out waiting for the condition

oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          10.0.131.216,10.0.136.148,10.0.158.108
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    metrics  9182  TCP

Comment 4 Ronnie Rasouli 2021-10-06 11:51:20 UTC
Created attachment 1829808 [details]
WMCO log

Comment 5 Ronnie Rasouli 2021-10-07 06:07:51 UTC
Testing deletion of a non-BYOH node is successful.
Verified on {"version": "4.0.0+7cdce8b"}

Comment 8 errata-xmlrpc 2022-03-28 09:36:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0577

