Bug 2107261 - [WMCO] WMCO endpoints missing after WMCO restart in vSphere
Description Jose Luis Franco 2022-07-14 15:30:41 UTC
Must gather logs:

1. Issue: the command oc get endpoints -n openshift-windows-machine-config-operator prints ENDPOINTS <none> with WMCO 6.0.0 on the vSphere cloud provider:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      22m

When checking the windows-exporter service we can confirm that the endpoints are really missing:

[cloud-user@preserve-jfrancoa 119919]$ oc describe service/windows-exporter -n openshift-windows-machine-config-operator 
Name:              windows-exporter
Namespace:         openshift-windows-machine-config-operator
Labels:            name=windows-exporter
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
Port:              metrics  9182/TCP
TargetPort:        9182/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

2. WMCO & OpenShift Version 
[cloud-user@preserve-jfrancoa 119919]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-11-080250   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-07-11-080250

[cloud-user@preserve-jfrancoa 119919]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

[cloud-user@preserve-jfrancoa 119919]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      7h54m
openshift-service-ca.crt               1      7h54m
windows-machine-config-operator-lock   0      44m
windows-services-6.0.0-9a1eca1         2      7h53m

3. Platform - vSphere
4. If the platform is vSphere, what is the VMware tools version? 
5. Is it a new test case or an old test case? 
    Old test case
   if it is the old test case, is it regression or first-time tested?
    It is a regression
   Is it platform-specific or consistent across all platforms?
    So far it has occurred only on vSphere; on Azure both endpoints are present
6. Steps to Reproduce
   1. Deploy a 4.11 OCP cluster in vSphere
   2. Install WMCO 6.0.0 and create some machinesets
   3. Restart the WMCO container by deleting the wmco pod: oc delete pod <wmco-pod-id> -n openshift-windows-machine-config-operator
   4. Run: oc get endpoints -n openshift-windows-machine-config-operator
   5. Check the endpoints field
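
The restart and check steps above can be sketched as a small shell helper. This is only an illustration, not part of WMCO or oc: the function names endpoints_missing and restart_and_check are hypothetical, and the sketch assumes a logged-in oc session and the deployment name windows-machine-config-operator used elsewhere in this report.

```shell
NS=openshift-windows-machine-config-operator

# Reads `oc get endpoints` table output on stdin and succeeds (exit 0)
# when the windows-exporter row shows <none> in the ENDPOINTS column.
endpoints_missing() {
    awk '$1 == "windows-exporter" && $2 == "<none>" { found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Hypothetical helper: deletes the given WMCO pod, waits for the
# deployment to report its replacement pod as available, then prints
# whether the windows-exporter endpoints are populated.
restart_and_check() {
    oc delete pod "$1" -n "$NS" || return 1
    oc rollout status deployment/windows-machine-config-operator -n "$NS" || return 1
    if oc get endpoints -n "$NS" | endpoints_missing; then
        echo "windows-exporter endpoints are missing"
    else
        echo "windows-exporter endpoints are present"
    fi
}
```

Calling restart_and_check <wmco-pod-id> runs the whole sequence. The operator may need some time after the rollout to reconcile the endpoints, so a single immediate check may not be conclusive.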
7. Actual Result and Expected Result
  Actual: windows-exporter endpoints show <none>
  Expected: windows-exporter endpoints display the two IPs corresponding to the endpoints
8. A possible workaround has been tried? Is there a way to recover from the issue being tried out?
   Scaling the machineset down and back up made the windows-exporter endpoints appear. Before that, even though the Windows workers were up and running and workloads could run successfully, WMCO was not able to update the windows-exporter endpoints. Once the scale down and scale up happened, the following entry was observed in the WMCO logs:

1.6576336929887342e+09  INFO    metrics Prometheus configured   {"endpoints": "windows-exporter", "port": 9182, "name": "metrics"}
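
A rough sketch of the workaround described above, for reference. The function names prometheus_configured and rescale_machineset are hypothetical. Scaling to zero replicas is an assumption, since the report only says the machineset was scaled down and up. The sketch does not wait for the machines to be removed or for the new nodes to become Ready.

```shell
MACHINE_NS=openshift-machine-api
WMCO_NS=openshift-windows-machine-config-operator

# Succeeds when the WMCO log text on stdin contains the
# "Prometheus configured" entry for the windows-exporter endpoints.
prometheus_configured() {
    grep -q 'Prometheus configured.*"endpoints": "windows-exporter"'
}

# Hypothetical helper: scales the given Windows MachineSet to zero
# (assumed lower bound) and back to its previous replica count.
# Requires a logged-in oc session.
rescale_machineset() {
    name=$1
    replicas=$(oc get machineset "$name" -n "$MACHINE_NS" -o jsonpath='{.spec.replicas}') || return 1
    oc scale machineset "$name" --replicas=0 -n "$MACHINE_NS" || return 1
    oc scale machineset "$name" --replicas="$replicas" -n "$MACHINE_NS" || return 1
}
```

After rescale_machineset <windows_machineSet_name> and once the new Windows nodes are configured, the log entry can be checked with: oc logs deployment/windows-machine-config-operator -n "$WMCO_NS" | prometheus_configured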

9. Logs
           oc get network.operator cluster -o yaml
           oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
       Windows MachineSet yaml or windows-instances ConfigMap
           oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
           oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml

 Optional logs:
    Anything that can be useful to debug the issue.

Comment 4 Jose Luis Franco 2022-07-20 14:32:28 UTC
[jfrancoa@localhost wmco]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      21m
openshift-service-ca.crt               1      21m
windows-machine-config-operator-lock   0      20m
windows-services-6.0.0-07ebdd7         2      20m

[jfrancoa@localhost wmco]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter,   21m

[jfrancoa@localhost wmco]$ oc get pods -n openshift-windows-machine-config-operator 
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-554d8d85f4-4pqtj   1/1     Running   0          21m

[jfrancoa@localhost wmco]$ oc delete pods windows-machine-config-operator-554d8d85f4-4pqtj -n openshift-windows-machine-config-operator 
pod "windows-machine-config-operator-554d8d85f4-4pqtj" deleted

[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter,   4s

Comment 8 errata-xmlrpc 2023-01-30 05:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift support for Windows Containers 7.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


