Bug 2107261

Summary: [WMCO] WMCO endpoints missing after WMCO restart in vSphere
Product: OpenShift Container Platform Reporter: Jose Luis Franco <jfrancoa>
Component: Windows Containers Assignee: Sebastian Soto <ssoto>
Status: CLOSED ERRATA QA Contact: Ronnie Rasouli <rrasouli>
Severity: medium Docs Contact:
Priority: high    
Version: 4.11 CC: jvaldes, mburke, ssoto, stevsmit
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
* Previously, restarting the Windows Machine Config Operator (WMCO) in a cluster with running Windows nodes caused the Windows exporter endpoint to be removed. Because of this, each Windows node could not report any metrics data. With this update, the endpoint is retained when the WMCO is restarted. As a result, metrics data is reported properly after restarting the WMCO. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2107261[*BZ#2107261*])
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-30 05:48:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2108805    

Description Jose Luis Franco 2022-07-14 15:30:41 UTC
Must gather logs:

1. Issue: The command oc get endpoints -n openshift-windows-machine-config-operator prints <none> in the ENDPOINTS column with WMCO 6.0.0 on the vSphere cloud provider:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      22m

When checking the windows-exporter service we can confirm that the endpoints are really missing:

[cloud-user@preserve-jfrancoa 119919]$ oc describe service/windows-exporter -n openshift-windows-machine-config-operator 
Name:              windows-exporter
Namespace:         openshift-windows-machine-config-operator
Labels:            name=windows-exporter
                   operators.coreos.com/windows-machine-config-operator.openshift-windows-machine-confi=
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.72.109
IPs:               172.30.72.109
Port:              metrics  9182/TCP
TargetPort:        9182/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

2. WMCO & OpenShift Version 
[cloud-user@preserve-jfrancoa 119919]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-11-080250   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-07-11-080250

[cloud-user@preserve-jfrancoa 119919]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

[cloud-user@preserve-jfrancoa 119919]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      7h54m
openshift-service-ca.crt               1      7h54m
windows-machine-config-operator-lock   0      44m
windows-services-6.0.0-9a1eca1         2      7h53m

3. Platform - vSphere
4. If the platform is vSphere, what is the VMware tools version? 
5. Is it a new test case or an old test case? 
    Old test case
   If it is an old test case, is it a regression or first-time tested?
    It is a regression
   Is it platform-specific or consistent across all platforms?
    So far it has occurred only on vSphere; on Azure, both endpoints are present
6. Steps to Reproduce
   1. Deploy a 4.11 OCP cluster in vSphere
   2. Install WMCO 6.0.0 and create some machinesets
   3. Restart the WMCO container by deleting the WMCO pod: oc delete pod <wmco-pod-id> -n openshift-windows-machine-config-operator
   4. Run: oc get endpoints -n openshift-windows-machine-config-operator
   5. Check the ENDPOINTS field
7. Actual Result and Expected Result
  Actual: windows-exporter endpoints show <none>
  Expected: windows-exporter endpoints displays the two IPs corresponding to the endpoints
8. Has a possible workaround been tried? Is there a way to recover from the issue?
   Scaling the machineset down and back up made the windows-exporter endpoints appear. Even though the Windows workers were up and running and the workloads could run successfully, WMCO was not able to update the windows-exporter endpoints. Once the scale-down and scale-up happened, the following entry was observed in the WMCO logs:

1.6576336929887342e+09  INFO    metrics Prometheus configured   {"endpoints": "windows-exporter", "port": 9182, "name": "metrics"}

9. Logs
       Must-gather-windows-node-logs (https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
           oc get network.operator cluster -o yaml
           oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
       Windows MachineSet yaml or windows-instances ConfigMap
           oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
           oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml


 Optional logs:
    Anything that can be useful to debug the issue.
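
The check in steps 4 and 5 of the reproducer could be scripted roughly as follows. This is only a sketch: endpoints_for and endpoints_missing are hypothetical helper names, not part of oc or WMCO, and the parsing assumes the default table output of oc get endpoints (name in the first column, endpoints in the second).

```shell
# Hypothetical helper (not part of oc or WMCO): reads the table printed by
# `oc get endpoints` on stdin and prints the ENDPOINTS column for one name.
# Assumes the default table layout: NAME, ENDPOINTS, AGE.
endpoints_for() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Succeeds (exit 0) when the named Endpoints object lists no addresses,
# i.e. its ENDPOINTS column is "<none>". An object that is absent from the
# table is not reported as missing endpoints.
endpoints_missing() {
    [ "$(endpoints_for "$1")" = "<none>" ]
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get endpoints -n openshift-windows-machine-config-operator \
#     | endpoints_missing windows-exporter && echo "endpoints missing"
```

Run against the output shown in section 1 of this report, endpoints_missing windows-exporter would succeed, which matches the Actual result above.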

Comment 4 Jose Luis Franco 2022-07-20 14:32:28 UTC
WMCO VERSION
==============
[jfrancoa@localhost wmco]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      21m
openshift-service-ca.crt               1      21m
windows-machine-config-operator-lock   0      20m
windows-services-6.0.0-07ebdd7         2      20m

[jfrancoa@localhost wmco]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

VALIDATION
============
[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   21m

[jfrancoa@localhost wmco]$ oc get pods -n openshift-windows-machine-config-operator 
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-554d8d85f4-4pqtj   1/1     Running   0          21m

[jfrancoa@localhost wmco]$ oc delete pods windows-machine-config-operator-554d8d85f4-4pqtj -n openshift-windows-machine-config-operator 
pod "windows-machine-config-operator-554d8d85f4-4pqtj" deleted

[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   4s
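
The before/after comparison done by hand above could be automated along these lines. This is a sketch under assumptions: normalize_endpoints and endpoints_retained are hypothetical names introduced for illustration, and the inputs are the comma-separated ENDPOINTS values as printed by oc get endpoints.

```shell
# Hypothetical helper: turns a comma-separated endpoint list into one sorted
# address per line so that ordering differences are ignored.
normalize_endpoints() {
    printf '%s\n' "$1" | tr ',' '\n' | sort
}

# Succeeds (exit 0) when the list captured before the restart is non-empty,
# is not "<none>", and contains the same addresses as the list captured
# after the restart.
endpoints_retained() {
    [ -n "$1" ] && [ "$1" != "<none>" ] &&
        [ "$(normalize_endpoints "$1")" = "$(normalize_endpoints "$2")" ]
}

# Usage against a live cluster (requires a logged-in oc session):
#   ns=openshift-windows-machine-config-operator
#   before=$(oc get endpoints windows-exporter -n "$ns" --no-headers | awk '{print $2}')
#   oc delete pod <wmco-pod-id> -n "$ns"
#   after=$(oc get endpoints windows-exporter -n "$ns" --no-headers | awk '{print $2}')
#   endpoints_retained "$before" "$after" && echo "endpoints retained"
```

The comparison is order-insensitive on purpose, since nothing in this report indicates that the order of addresses in the ENDPOINTS column is stable across restarts.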

Comment 8 errata-xmlrpc 2023-01-30 05:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift support for Windows Containers 7.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:9096