Bug 2008601

Summary: WMCO ignores delete events for machines with invalid IP addresses
Product: OpenShift Container Platform Reporter: jvaldes
Component: Windows Containers Assignee: jvaldes
Status: CLOSED ERRATA QA Contact: Ronnie Rasouli <rrasouli>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.10 CC: aos-bugs, rrasouli, team-winc
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The windows-exporter metrics endpoint object contains a reference to a deleted machine.
Consequence: WMCO ignores delete events for machines with invalid IP addresses.
Fix: Remove the validation of the machine object from the event filtering.
Result: The windows-exporter metrics endpoint object is correctly updated even when the machine is still in the Deleting phase.
Story Points: ---
Clone Of:
: 2008992 (view as bug list) Environment:
Last Closed: 2022-03-28 09:36:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2008992    
Attachments:
Description Flags
WMCO log none

Description jvaldes 2021-09-28 16:16:53 UTC
This bug was initially created as a light copy of Bug #1991739

I am copying this bug because: 
The resulting symptom still shows up, less frequently because the root cause is slightly different, and it can still be reproduced on vSphere with a 4.7 cluster running WMCO 2.0.3.

Description of problem:
WMCO ignores the `Deleting` phase notification event for Windows machines that have no IPv4 address or an invalid one.
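
To illustrate the problem, here is a minimal Python sketch of an event filter with and without the machine validation. It is not WMCO's code: the `Machine` class, its fields, and both filter functions are hypothetical names introduced only for this example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine object; not WMCO's real type."""
    name: str
    phase: str
    is_windows: bool = True
    internal_ip: Optional[str] = None


def has_valid_ipv4(machine: Machine) -> bool:
    """Return True when the machine reports a parseable IPv4 address."""
    if not machine.internal_ip:
        return False
    try:
        ipaddress.IPv4Address(machine.internal_ip)
    except ipaddress.AddressValueError:
        return False
    return True


def filter_with_validation(machine: Machine) -> bool:
    """Behaviour reported in this bug: events for machines without a
    valid IP are dropped, including the Deleting phase event."""
    return machine.is_windows and has_valid_ipv4(machine)


def filter_without_validation(machine: Machine) -> bool:
    """Behaviour after the fix (as assumed here): the IP check is gone,
    so the Deleting event reaches the reconciler."""
    return machine.is_windows


# A machine being deleted has often already lost its IP address.
deleting = Machine(name="winworker-rh5cr", phase="Deleting")
print(filter_with_validation(deleting))
print(filter_without_validation(deleting))
```

With validation in the filter, the event for the machine in the `Deleting` phase is discarded, so nothing triggers an update of the metrics endpoint object.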

Version-Release number of selected component (if applicable):
WMCO 2.0.3 running on cluster with version 4.7.24 

How reproducible:
Sometimes, depends on platform performance while removing a virtual machine

Steps to Reproduce:
1. WMCO configured and running
2. Create a valid machineSet with 1 replica
3. Observe the node information in the `windows-exporter` metrics endpoint object.
    Note the IP Addresses, for example: 172.31.251.250
4. Delete the machineSet
5. Wait for the Windows machine to disappear
6. Check the `windows-exporter` metrics endpoint object one more time. If there is still an entry in `Subsets` mapped to the IP address of a deleted machine, you have reproduced the bug: the endpoint still lists a machine for which metrics are no longer available.
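
The check in step 6 can be done mechanically. A minimal Python sketch, assuming the endpoint addresses and the IPs of the machines that still exist have already been collected into lists (the function name `stale_endpoint_addresses` is hypothetical):

```python
from typing import Iterable, List


def stale_endpoint_addresses(
    endpoint_addresses: Iterable[str],
    machine_addresses: Iterable[str],
) -> List[str]:
    """Return endpoint addresses that belong to no existing machine,
    in the order they appear in the endpoint object."""
    live = set(machine_addresses)
    return [addr for addr in endpoint_addresses if addr not in live]


# After the machineSet is deleted there are no machines left, so any
# address still present in the endpoint object is stale.
print(stale_endpoint_addresses(["172.31.251.250"], []))
# → ['172.31.251.250']
```

A non-empty result means the bug has been reproduced.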

Actual results:
WMCO with DEBUG logging enabled shows:
```
DEBUG   controller.windowsmachine   invalid Machine {
	"name": "winworker-rh5cr",
	"error": "no internal IP address associated",
	"errorVerbose": "no internal IP address associated, ..."
	...
}

```

The `windows-exporter` metrics endpoint object contains Subsets with an IP address of a deleted machine
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
      Addresses:          172.31.251.250
      NotReadyAddresses:  <none>
      Ports:
        Name     Port  Protocol
        ----     ----  --------
        metrics  9182  TCP

    Events:  <none>
```


Expected results:

WMCO with DEBUG logging enabled shows:
```
DEBUG controller.windowsmachine   machine not provisioned {
 	"windowsmachine": "openshift-machine-api/winworker-vdmnd",
	"phase": "Deleting"
}

INFO	metrics	Prometheus configured	{
 	 "endpoints": "windows-exporter",
	 "port": 9182,
	 "name": "metrics"
}

```

The IP address of the deleted machine does not appear in the `windows-exporter` metrics endpoint object.
With replicas set to 1, `Subsets` must be empty.
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
    Events:  <none>
```

Comment 2 jvaldes 2021-09-29 14:09:34 UTC
Marking as VERIFIED to allow the release-4.8/4.9 PRs to merge. Will update it to ON_QA once that PR merges.

Comment 3 Ronnie Rasouli 2021-10-06 11:50:02 UTC
The IP of the previous machine has not been deleted because the ConfigMap still retains the machine endpoint.
oc describe configmaps windows-instances
Name:         windows-instances
Namespace:    openshift-windows-machine-config-operator
Labels:       <none>
Annotations:  <none>

Data
====
10.0.136.148:
----
username=Administrator
Events:
  Type     Reason                Age                  From       Message
  ----     ------                ----                 ----       -------
  Warning  InstanceSetupFailure  11m (x10 over 160m)  configmap  error configuring host with address 10.0.136.148: failed to create new nodeconfig: error instantiating Windows instance from VM: unable to setup VM 10.0.136.148 sshConnectivity: error instantiating SSH client: unable to connect to Windows VM 10.0.136.148: timed out waiting for the condition

oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          10.0.131.216,10.0.136.148,10.0.158.108
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    metrics  9182  TCP
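
For scripted verification, the addresses can be pulled out of the `oc describe endpoints` text. A rough Python sketch, assuming the output layout shown above (the `parse_addresses` helper is hypothetical, and parsing `describe` output is fragile compared with structured output):

```python
from typing import List


def parse_addresses(describe_output: str) -> List[str]:
    """Collect the IPs listed on 'Addresses:' lines of describe output.

    'NotReadyAddresses:' lines are ignored because the key must match
    'Addresses' exactly.
    """
    addresses: List[str] = []
    for line in describe_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep or key != "Addresses":
            continue
        value = value.strip()
        if value and value != "<none>":
            addresses.extend(a.strip() for a in value.split(","))
    return addresses


sample = """\
Subsets:
  Addresses:          10.0.131.216,10.0.136.148,10.0.158.108
  NotReadyAddresses:  <none>
"""
print(parse_addresses(sample))
# → ['10.0.131.216', '10.0.136.148', '10.0.158.108']
```

Here 10.0.136.148 is the address that is also still present in the `windows-instances` ConfigMap.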

Comment 4 Ronnie Rasouli 2021-10-06 11:51:20 UTC
Created attachment 1829808 [details]
WMCO log

Comment 5 Ronnie Rasouli 2021-10-07 06:07:51 UTC
Testing deletion of a non-BYOH node is successful.
Verified on {"version": "4.0.0+7cdce8b"}

Comment 8 errata-xmlrpc 2022-03-28 09:36:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 5.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0577