+++ This bug was initially created as a clone of Bug #1991739 +++

Description of problem:

On vSphere, WMCO misses the `Deleting` phase notification event, leaving stale node information in the `windows-exporter` metrics endpoint: `Subsets` still contains the IP address of a machine that is no longer available, resulting in an invalid mapping for the Prometheus metrics endpoint.

It's important to note that this bug applies only to the vSphere platform, given how quickly virtual machines are destroyed in vCenter. On AWS, for example, EC2 instances spend more time in the `Deleting` phase, allowing WMCO to fully process all the events.

Version-Release number of selected component (if applicable):

WMCO 2.0.3 running on cluster with version 4.7.23
WMCO 3.0.0 running on cluster with version 4.8.4

How reproducible:

Sometimes; it depends on platform performance while removing a virtual machine.

Steps to Reproduce:
1. WMCO configured and running
2. Create a valid machineSet with 1 replica
3. Observe the node information in the `windows-exporter` metrics endpoint object. Note the IP addresses, for example: 172.31.251.250
4. Delete the machineSet
5. Wait for the Windows machine to disappear
6. Check the `windows-exporter` metrics endpoint object once more. If there is still an entry in `Subsets` mapping the IP address of a deleted machine, you have reproduced the bug. Metrics are no longer available for a deleted machine.

Bug frequency is about 0.75 (6 of 8 runs) with replicas set to 1. In step 2, you can set replicas to 5 to increase the bug frequency.
Actual results:

The `windows-exporter` metrics endpoint object contains Subsets with the IP address of a deleted machine.

```
$ oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          172.31.251.250
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    metrics  9182  TCP

Events:  <none>
```

Expected results:

The IP address of the deleted machine does not appear in the `windows-exporter` metrics endpoint object. With replicas set to 1, the Subsets must be empty.

```
$ oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
Events:  <none>
```

--- Additional comment from aravindh on 2021-08-10 19:39:22 UTC ---

@jvaldes, the steps to reproduce seem to indicate that this happens consistently. However, you say this behavior is only seen occasionally? How can this be reproduced consistently?

--- Additional comment from jvaldes on 2021-08-10 22:11:06 UTC ---

Indeed, this behavior happens occasionally. I ran several tests using 1 to 3 replicas in the machineSet. Sometimes WMCO does not update the `windows-exporter` endpoint to remove the information of the deleted machines.

> The steps to reproduce seem to indicate that this happens consistently

Open to suggestions to make that clear.
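Because the bug is intermittent, the check in step 6 may be easier to repeat across runs if it is scripted rather than read off `oc describe` output. The following is a minimal sketch, not part of WMCO: it assumes the addresses in the `windows-exporter` endpoint are the nodes' `InternalIP` values (this report does not confirm that), and the function names are hypothetical. It works on the parsed JSON of an Endpoints object and a node list, and reports endpoint IPs that no longer belong to any node.

```python
def endpoint_ips(endpoints):
    """Collect every IP listed in the subsets of an Endpoints object (parsed JSON)."""
    ips = set()
    # `subsets` is absent when the endpoint has no entries, so default to empty.
    for subset in endpoints.get("subsets") or []:
        for key in ("addresses", "notReadyAddresses"):
            for addr in subset.get(key) or []:
                ips.add(addr["ip"])
    return ips


def node_internal_ips(node_list):
    """Collect the InternalIP of every node in a NodeList (parsed JSON)."""
    ips = set()
    for node in node_list.get("items") or []:
        for addr in node.get("status", {}).get("addresses") or []:
            if addr.get("type") == "InternalIP":
                ips.add(addr["address"])
    return ips


def stale_endpoint_ips(endpoints, node_list):
    """Return endpoint IPs with no matching node (assumed: endpoint IP == node InternalIP)."""
    return endpoint_ips(endpoints) - node_internal_ips(node_list)
```

The inputs could come from `oc get endpoints windows-exporter -n openshift-windows-machine-config-operator -o json` and `oc get nodes -o json`, loaded with `json.load`. A non-empty result after step 5 would indicate that the bug was reproduced in that run.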
Updating status to ON_QA
oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
  Addresses:          172.31.249.162
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    metrics  9182  TCP

Events:  <none>
[cloud-user@PSI-VM ~/windows-machine-config-operator]>oc delete machineset -n openshift-machine-api vsworker
machineset.machine.openshift.io "vsworker" deleted
[cloud-user@PSI-VM ~/windows-machine-config-operator]>
[cloud-user@PSI-VM ~/windows-machine-config-operator]>
[cloud-user@PSI-VM ~/windows-machine-config-operator]>oc describe endpoints -n openshift-windows-machine-config-operator
Name:         windows-exporter
Namespace:    openshift-windows-machine-config-operator
Labels:       name=windows-exporter
Annotations:  <none>
Subsets:
Events:  <none>

Verified WMCO "3.1.0+05d607c"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 3.1.0 product release), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3215