Bug 2107261
| Summary: | [WMCO] WMCO endpoints missing after WMCO restart in vSphere | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jose Luis Franco <jfrancoa> |
| Component: | Windows Containers | Assignee: | Sebastian Soto <ssoto> |
| Status: | CLOSED ERRATA | QA Contact: | Ronnie Rasouli <rrasouli> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.11 | CC: | jvaldes, mburke, ssoto, stevsmit |
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | * Previously, restarting the Windows Machine Config Operator (WMCO) in a cluster with running Windows nodes caused the Windows exporter endpoint to be removed. Because of this, each Windows node could not report any metrics data. With this update, the endpoint is retained when the WMCO is started. As a result, metrics data is reported properly after restarting WMCO. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2107261[*BZ#2107261*]) | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-30 05:48:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2108805 | ||
WMCO VERSION
==============
[jfrancoa@localhost wmco]$ oc get cm -n openshift-windows-machine-config-operator
NAME                                   DATA   AGE
kube-root-ca.crt                       1      21m
openshift-service-ca.crt               1      21m
windows-machine-config-operator-lock   0      20m
windows-services-6.0.0-07ebdd7         2      20m

[jfrancoa@localhost wmco]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

VALIDATION
============
[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   21m

[jfrancoa@localhost wmco]$ oc get pods -n openshift-windows-machine-config-operator
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-554d8d85f4-4pqtj   1/1     Running   0          21m

[jfrancoa@localhost wmco]$ oc delete pods windows-machine-config-operator-554d8d85f4-4pqtj -n openshift-windows-machine-config-operator
pod "windows-machine-config-operator-554d8d85f4-4pqtj" deleted

[jfrancoa@localhost wmco]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.249.42:9182,172.31.249.201:9182   4s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift support for Windows Containers 7.0.0 [security update]), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:9096
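The before-and-after comparison in the validation above can be scripted. The following is a minimal sketch, not part of the original verification: count_endpoint_addresses is a hypothetical helper (it ships with neither oc nor WMCO) that reads the tabular output of oc get endpoints on standard input and prints how many addresses the named object lists, treating <none> as zero. It assumes the ENDPOINTS column is a plain comma-separated list, which may not hold if the CLI abbreviates long lists.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or WMCO): count the addresses shown in
# the ENDPOINTS column of `oc get endpoints` tabular output read from stdin.
# Assumption: the column is a plain comma-separated ip:port list or "<none>".
# Usage: count_endpoint_addresses <endpoints-object-name>
count_endpoint_addresses() {
  awk -v name="$1" '
    $1 == name {
      found = 1
      if ($2 == "<none>") {
        print 0                        # object exists but has no addresses
      } else {
        print split($2, addrs, ",")    # number of comma-separated entries
      }
    }
    END { if (!found) print 0 }        # object not listed at all
  '
}

# Example (requires a logged-in cluster session):
#   oc get endpoints -n openshift-windows-machine-config-operator \
#     | count_endpoint_addresses windows-exporter
```

Running the helper once before deleting the WMCO pod and once after the new pod is ready gives two numbers to compare; with the fix in place both should match the number of Windows nodes, as in the two-address output above.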
Must gather logs:

1. Issue:
The oc get endpoints -n openshift-windows-machine-config-operator command prints ENDPOINTS <none> with WMCO 6.0.0 on the vSphere cloud provider:

[cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get endpoints -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      22m

Describing the windows-exporter service confirms that the endpoints are missing:

[cloud-user@preserve-jfrancoa 119919]$ oc describe service/windows-exporter -n openshift-windows-machine-config-operator
Name:              windows-exporter
Namespace:         openshift-windows-machine-config-operator
Labels:            name=windows-exporter
                   operators.coreos.com/windows-machine-config-operator.openshift-windows-machine-confi=
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.72.109
IPs:               172.30.72.109
Port:              metrics  9182/TCP
TargetPort:        9182/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

2. WMCO & OpenShift Version

[cloud-user@preserve-jfrancoa 119919]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-11-080250   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-07-11-080250

[cloud-user@preserve-jfrancoa 119919]$ oc get csv -n openshift-windows-machine-config-operator
NAME                                     DISPLAY                            VERSION   REPLACES   PHASE
elasticsearch-operator.v5.5.0            OpenShift Elasticsearch Operator   5.5.0                Succeeded
windows-machine-config-operator.v6.0.0   Windows Machine Config Operator    6.0.0                Succeeded

[cloud-user@preserve-jfrancoa 119919]$ oc get cm -n openshift-windows-machine-config-operator
NAME                                   DATA   AGE
kube-root-ca.crt                       1      7h54m
openshift-service-ca.crt               1      7h54m
windows-machine-config-operator-lock   0      44m
windows-services-6.0.0-9a1eca1         2      7h53m

3. Platform - vSphere

4. If the platform is vSphere, what is the VMware tools version?

5. Is it a new test case or an old test case?
Old test case
If it is the old test case, is it regression or first-time tested?
It is a regression
Is it platform-specific or consistent across all platforms?
So far it has occurred only on vSphere; on Azure both endpoints are present.

6. Steps to Reproduce
   1. Deploy a 4.11 OCP cluster in vSphere
   2. Install WMCO 6.0.0 and create some machinesets
   3. Restart the WMCO container by deleting the WMCO pod: oc delete pod <wmco-pod-id> -n openshift-windows-machine-config-operator
   4. Run: oc get endpoints -n openshift-windows-machine-config-operator
   5. Check the ENDPOINTS field

7. Actual Result and Expected Result
Actual: windows-exporter endpoints shows <none>
Expected: windows-exporter endpoints displays the two IPs corresponding to the endpoints

8. A possible workaround has been tried? Is there a way to recover from the issue being tried out?
Scaling the machineset down and back up made the windows-exporter endpoints appear. Even though the Windows workers were up and running and workloads could run successfully, WMCO was not able to update the windows-exporter endpoints. Once the scale down and scale up had happened, the following entry was observed in the WMCO logs:
1.6576336929887342e+09 INFO metrics Prometheus configured {"endpoints": "windows-exporter", "port": 9182, "name": "metrics"}

9. Logs
Must-gather-windows-node-logs (https://github.com/openshift/must-gather/blob/master/collection-scripts/gather_windows_node_logs#L24)
oc get network.operator cluster -o yaml
oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
Windows MachineSet yaml or windows-instances ConfigMap
oc get machineset <windows_machineSet_name> -n openshift-machine-api -o yaml
oc get configmaps <windows_configmap_name> -n <namespace_name> -o yaml
Optional logs: Anything that can be useful to debug the issue.
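The workaround in item 8 was recognized by a "Prometheus configured" entry in the WMCO logs. As a rough sketch of how that check could be automated, the hypothetical helper below (prometheus_configured_for, not part of oc or WMCO) reads log text on standard input and succeeds only if such an entry names the given endpoints object. It assumes the log line keeps the exact key formatting of the single sample quoted in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of oc or WMCO): exit 0 if the log text on
# stdin contains a "Prometheus configured" entry for the given endpoints
# object, exit 1 otherwise.
# Assumption: entries look like the sample in this report, i.e. they contain
#   Prometheus configured {"endpoints": "<name>", ...}
# Usage: prometheus_configured_for <endpoints-object-name>
prometheus_configured_for() {
  awk -v ep="\"endpoints\": \"$1\"" '
    index($0, "Prometheus configured") && index($0, ep) { found = 1 }
    END { exit !found }
  '
}

# Example (requires a logged-in cluster session):
#   oc logs deployment/windows-machine-config-operator \
#     -n openshift-windows-machine-config-operator \
#     | prometheus_configured_for windows-exporter && echo "endpoints configured"
```

A plain substring match is used instead of JSON parsing so the helper has no dependencies beyond awk; if the log format differs from the sample, the match string would need adjusting.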