Bug 1991739 - WMCO ignores the `Deleting` phase notification event
Summary: WMCO ignores the `Deleting` phase notification event
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: jvaldes
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks: 1995341
 
Reported: 2021-08-09 20:48 UTC by jvaldes
Modified: 2021-10-28 17:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-28 17:41:17 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift windows-machine-config-operator pull 571 | 0 | None | Merged | Bug 1991739: [wmco] Fix metrics endpoint on machine not found | 2021-09-30 15:03:26 UTC
Red Hat Product Errata | RHBA-2021:3702 | 0 | None | None | None | 2021-10-28 17:41:40 UTC

Description jvaldes 2021-08-09 20:48:42 UTC
Description of problem:

In vSphere, WMCO misses the `Deleting` phase notification event, leaving incorrect
node information in the `windows-exporter` metrics endpoint: `Subsets` still contains
the IP address of a machine that is no longer available, resulting in an invalid mapping
for the Prometheus metrics endpoint.

Note that this bug applies only to the vSphere platform, given how quickly
virtual machines are destroyed in vCenter. In AWS, for example, EC2 instances spend more
time in the `Deleting` phase, allowing WMCO to fully process all the events.
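
The race described above can be illustrated with a small sketch. This is not WMCO's code and not the fix from the linked pull request; the function names and the event dictionaries are hypothetical, chosen only to show why an update driven purely by observed phase events leaves a stale address when the `Deleting` event is never seen, while pruning against the machines that still exist does not.

```python
from typing import Iterable, Set


def apply_events(endpoint_ips: Set[str], events: Iterable[dict]) -> Set[str]:
    """Event-driven update: reacts only to the phase events that were observed.

    Each event is a hypothetical dict such as {"phase": "Deleting", "ip": "..."}.
    """
    ips = set(endpoint_ips)
    for event in events:
        if event["phase"] == "Running":
            ips.add(event["ip"])
        elif event["phase"] == "Deleting":
            ips.discard(event["ip"])
    return ips


def prune_missing(endpoint_ips: Set[str], existing_machine_ips: Set[str]) -> Set[str]:
    """State-driven update: keep only addresses backed by a machine that still exists."""
    return set(endpoint_ips) & set(existing_machine_ips)


# The machine came up and was recorded; the VM was then destroyed so quickly
# that no `Deleting` event was ever processed.
observed = [{"phase": "Running", "ip": "172.31.251.250"}]
after_events = apply_events(set(), observed)

# The event-driven view still holds the deleted machine's address,
# while comparing against the (now empty) set of machines clears it.
after_prune = prune_missing(after_events, existing_machine_ips=set())
```

Under these assumptions, `after_events` still contains `172.31.251.250` and `after_prune` is empty, which matches the actual and expected results reported in this bug.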


Version-Release number of selected component (if applicable):

WMCO 2.0.3 running on cluster with version 4.7.23 
WMCO 3.0.0 running on cluster with version 4.8.4


How reproducible:
Sometimes; it depends on how quickly the platform removes the virtual machine


Steps to Reproduce:
1. Configure WMCO and make sure it is running
2. Create a valid machineSet with 1 replica
3. Observe the node information in the `windows-exporter` metrics endpoint object.
    Note the IP addresses, for example: 172.31.251.250
4. Delete the machineSet
5. Wait for the Windows machine to disappear
6. Check the `windows-exporter` metrics endpoint object again. If there is
    still an entry in `Subsets` mapping the IP address of a deleted machine, you
    have reproduced the bug: metrics are no longer available for a deleted machine

Bug frequency is about 0.75 (6 of 8 runs), with replicas set to 1.

In step #2, you can set replicas to 5 to increase the bug frequency rate.
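
The check in step 6 can be automated with a short script. The following is a minimal sketch, assuming the endpoint and node objects have been saved as JSON (for example with `oc get endpoints windows-exporter -n openshift-windows-machine-config-operator -o json` and `oc get nodes -l kubernetes.io/os=windows -o json`). The function names are hypothetical; the field names follow the standard Kubernetes Endpoints and NodeList layout.

```python
import json
from typing import Set


def endpoint_ips(endpoints_json: str) -> Set[str]:
    """Collect every IP under subsets[].addresses and subsets[].notReadyAddresses."""
    obj = json.loads(endpoints_json)
    ips = set()
    # An endpoint object with no backing nodes has no `subsets` key at all.
    for subset in obj.get("subsets") or []:
        for key in ("addresses", "notReadyAddresses"):
            for addr in subset.get(key) or []:
                ips.add(addr["ip"])
    return ips


def node_internal_ips(nodes_json: str) -> Set[str]:
    """Collect the InternalIP of every node in a NodeList."""
    obj = json.loads(nodes_json)
    ips = set()
    for node in obj.get("items") or []:
        for addr in node.get("status", {}).get("addresses") or []:
            if addr.get("type") == "InternalIP":
                ips.add(addr["address"])
    return ips


def stale_ips(endpoints_json: str, nodes_json: str) -> Set[str]:
    """Return endpoint IPs that no longer belong to any listed node."""
    return endpoint_ips(endpoints_json) - node_internal_ips(nodes_json)
```

A non-empty result from `stale_ips` after the machineSet has been deleted and the machines are gone indicates that the bug was reproduced; an empty set corresponds to the expected behavior.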


Actual results:
The `windows-exporter` metrics endpoint object contains Subsets with an IP Address of a deleted machine
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
      Addresses:          172.31.251.250
      NotReadyAddresses:  <none>
      Ports:
        Name     Port  Protocol
        ----     ----  --------
        metrics  9182  TCP

    Events:  <none>
```

Expected results:
The IP Address of the deleted machine does not appear in the `windows-exporter` metrics endpoint object.
With replicas set to 1, the Subsets must be empty.
```
$ oc describe endpoints -n openshift-windows-machine-config-operator
    Name:         windows-exporter
    Namespace:    openshift-windows-machine-config-operator
    Labels:       name=windows-exporter
    Annotations:  <none>
    Subsets:
    Events:  <none>
```

Comment 1 Aravindh Puthiyaparambil 2021-08-10 19:39:22 UTC
@jvaldes, the steps to reproduce seem to indicate that this happens consistently. However, you say this behavior is only seen occasionally? How can this be reproduced consistently?

Comment 2 jvaldes 2021-08-10 22:11:06 UTC
Indeed, this behavior happens occasionally. I ran several tests using 1 to 3 replicas in the machineSet. Sometimes WMCO does not update the `windows-exporter` endpoint to remove the information of the deleted machines.

> The steps to reproduce seem to indicate that this happens consistently
Open to suggestions to make that clear.

Comment 3 jvaldes 2021-08-19 19:29:35 UTC
Marking as VERIFIED to allow the release-4.7/4.8 PRs to merge. Will move this back to ON_QA once that PR merges.

Comment 4 jvaldes 2021-08-20 16:10:46 UTC
Setting status back to ON_QA.

Comment 5 Ronnie Rasouli 2021-08-25 06:10:06 UTC
oc get node -l kubernetes.io/os=windows -owide
NAME              STATUS   ROLES    AGE    VERSION                            INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                  KERNEL-VERSION   CONTAINER-RUNTIME
winworker-g8fgw   Ready    worker   113m   v1.22.0-rc.0.1611+9b1230e88478e6   172.31.249.177   172.31.249.177   Windows Server Standard   10.0.19041.508   docker://20.10.5
winworker-pm4kf   Ready    worker   107m   v1.22.0-rc.0.1611+9b1230e88478e6   172.31.249.147   172.31.249.147   Windows Server Standard   10.0.19041.508   docker://20.10.5
[cloud-user@PSI-VM ~/windows-machine-config-operator]> oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          172.31.249.177,172.31.249.147
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    metrics  9182  TCP

Events:  <none>

"version": "3.1.0+0a3a937"

Comment 7 Ronnie Rasouli 2021-10-05 13:18:40 UTC
The bug was mistakenly tested on OCP 4.9 with WMCO 3.1.0.
Testing on OCP 4.9 with WMCO 4.0.0+2f0b49a2:
The endpoint 10.0.128.191 has been deleted, yet it still exists
oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          10.0.128.191,10.0.140.91,10.0.143.88
  NotReadyAddresses:  <none>

Comment 9 jvaldes 2021-10-05 18:22:38 UTC
@rrasouli  Can you share the WMCO logs associated with the failed test?

Comment 15 errata-xmlrpc 2021-10-28 17:41:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.0 product release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3702

