2107261 – [WMCO] WMCO endpoints missing after WMCO restart in vSphere

Summary: [WMCO] WMCO endpoints missing after WMCO restart in vSphere

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Windows Containers
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Sebastian Soto
QA Contact:	Ronnie Rasouli
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2108805

Reported:	2022-07-14 15:30 UTC by Jose Luis Franco
Modified:	2023-01-30 05:49 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	* Previously, restarting the Windows Machine Config Operator (WMCO) in a cluster with running Windows nodes caused the Windows exporter endpoint to be removed. Because of this, each Windows node could not report any metrics data. With this update, the endpoint is retained when the WMCO is started. As a result, metrics data is reported properly after restarting WMCO. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2107261[BZ#2107261])
Clone Of:
Environment:
Last Closed:	2023-01-30 05:48:31 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift windows-machine-config-operator pull 1138	0	None	Merged	Bug 2107261: Maintain endpoint subsets on restart	2022-11-08 21:20:33 UTC
Red Hat Product Errata	RHSA-2022:9096	0	None	None	None	2023-01-30 05:49:04 UTC

Description Jose Luis Franco 2022-07-14 15:30:41 UTC

Must gather logs:

1. Issue: The command oc get endpoints -n openshift-windows-machine-config-operator prints ENDPOINTS <none> with WMCO 6.0.0 on the vSphere cloud provider:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      22m

When checking the windows-exporter service we can confirm that the endpoints are really missing:

[cloud-user@preserve-jfrancoa 119919]$ oc describe service/windows-exporter -n openshift-windows-machine-config-operator 
Name:              windows-exporter
Namespace:         openshift-windows-machine-config-operator
Labels:            name=windows-exporter
                   operators.coreos.com/windows-machine-config-operator.openshift-windows-machine-confi=
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.72.109
IPs:               172.30.72.109
Port:              metrics  9182/TCP
TargetPort:        9182/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

2. WMCO & OpenShift Version 
[cloud-user@preserve-jfrancoa 119919]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-11-080250   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-07-11-080250

[cloud-user@preserve-jfrancoa 119919]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

[cloud-user@preserve-jfrancoa 119919]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      7h54m
openshift-service-ca.crt               1      7h54m
windows-machine-config-operator-lock   0      44m
windows-services-6.0.0-9a1eca1         2      7h53m

3. Platform - vSphere
4. If the platform is vSphere, what is the VMware tools version? 
5. Is it a new test case or an old test case? 
    Old test case
   if it is the old test case, is it regression or first-time tested?
    It is a regression
   Is it platform-specific or consistent across all platforms?
    So far it has occurred only on vSphere; on Azure both endpoints are present
6. Steps to Reproduce
   1. Deploy a 4.11 OCP cluster in vSphere
   2. Install WMCO 6.0.0 and create some machinesets
   3. Restart the WMCO container by deleting the WMCO pod: oc delete pod <wmco-pod-id> -n openshift-windows-machine-config-operator
   4. Run: oc get endpoints -n openshift-windows-machine-config-operator
   5. Check the ENDPOINTS field
7. Actual Result and Expected Result
  Actual: windows-exporter endpoints shows <none>
  Expected: windows-exporter endpoints displays the two IPs corresponding to the endpoints
8. A possible workaround has been tried? Is there a way to recover from the issue being tried out?
   Scaling the machineset down and back up made the windows-exporter endpoints appear. Before that, even though the Windows workers were up and running and workloads could run successfully, WMCO was not able to update the windows-exporter endpoints. Once the scale down and scale up happened, the following entry was observed in the WMCO logs:

1.6576336929887342e+09  INFO    metrics Prometheus configured   {"endpoints": "windows-exporter", "port": 9182, "name": "metrics"}

9. Logs
       Must-gather-windows-node-logs(https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
           oc get network.operator cluster -o yaml
           oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
       Windows MachineSet yaml or windows-instances ConfigMap
           oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
           oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml


 Optional logs:
    Anything that can be useful to debug the issue.
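
The endpoint check at the end of the reproduction steps can be scripted. The following is a minimal sketch, not part of the original report: has_no_endpoints is a hypothetical helper name, and it only inspects the table layout shown in the oc get endpoints output above (NAME in the first column, ENDPOINTS in the second).

```shell
#!/usr/bin/env bash
# Sketch only: has_no_endpoints is a hypothetical helper, not an oc feature.

# Reads `oc get endpoints` table output on stdin. Exits 0 when the
# windows-exporter row shows <none> in the ENDPOINTS column, 1 otherwise.
has_no_endpoints() {
  awk '$1 == "windows-exporter" && $2 == "<none>" { found = 1 }
       END { exit !found }'
}

# Usage against a live cluster (requires oc and cluster access):
#   oc get endpoints -n openshift-windows-machine-config-operator \
#     | has_no_endpoints && echo "windows-exporter has no endpoints"
```

Reading from stdin keeps the helper independent of the cluster, so it can also be run against saved command output.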

Comment 4 Jose Luis Franco 2022-07-20 14:32:28 UTC

WMCO VERSION
==============
[jfrancoa@localhost wmco]$ oc get cm -n openshift-windows-machine-config-operator 
NAME                                   DATA   AGE
kube-root-ca.crt                       1      21m
openshift-service-ca.crt               1      21m
windows-machine-config-operator-lock   0      20m
windows-services-6.0.0-07ebdd7         2      20m

[jfrancoa@localhost wmco]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

VALIDATION
============
[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   21m

[jfrancoa@localhost wmco]$ oc get pods -n openshift-windows-machine-config-operator 
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-554d8d85f4-4pqtj   1/1     Running   0          21m

[jfrancoa@localhost wmco]$ oc delete pods windows-machine-config-operator-554d8d85f4-4pqtj -n openshift-windows-machine-config-operator 
pod "windows-machine-config-operator-554d8d85f4-4pqtj" deleted

[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   4s
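
The manual validation above (list endpoints, delete the WMCO pod, list endpoints again) could be automated roughly as follows. This is a sketch under assumptions rather than the procedure QA actually ran: the function names normalize_ips, exporter_ips and check_restart are hypothetical, and the pod name has to be passed in by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical helpers.

NS=openshift-windows-machine-config-operator

# Turn a whitespace-separated list of IPs (stdin) into one sorted,
# comma-joined line so two snapshots can be compared as plain strings.
normalize_ips() {
  tr -s ' \t' '\n' | sed '/^$/d' | sort | paste -sd, -
}

# Print the current windows-exporter endpoint IPs in normalised form.
exporter_ips() {
  oc get endpoints windows-exporter -n "$NS" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' | normalize_ips
}

# Delete the given WMCO pod, wait for the deployment to become ready again,
# then compare the endpoint IPs recorded before and after the restart.
# Returns 0 only if endpoints existed beforehand and are unchanged afterwards.
check_restart() {
  local pod=$1 before after
  before=$(exporter_ips)
  oc delete pod "$pod" -n "$NS"
  oc rollout status deployment/windows-machine-config-operator -n "$NS"
  after=$(exporter_ips)
  [ -n "$before" ] && [ "$before" = "$after" ]
}

# Usage (requires oc and cluster access):
#   check_restart windows-machine-config-operator-554d8d85f4-4pqtj && echo PASS
```

The comparison runs as soon as the deployment reports ready; if the operator needs more time to reconcile the endpoints, a short wait or retry before the second snapshot may be needed.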

Comment 8 errata-xmlrpc 2023-01-30 05:48:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift support for Windows Containers 7.0.0 [security update]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:9096
