Bug 2015415 - WMCO pod recreation cause windows-exporter endpoint cleaned
Summary: WMCO pod recreation cause windows-exporter endpoint cleaned
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mohammad Saif Shaikh
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-19 07:17 UTC by gaoshang
Modified: 2021-12-13 12:46 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-13 12:46:10 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2021:4757  0        None      None    None     2021-12-13 12:46:23 UTC

Description gaoshang 2021-10-19 07:17:44 UTC
Description of problem: If the WMCO pod is recreated, the windows-exporter endpoints are cleared, so Prometheus can no longer scrape them.

$ oc get ep
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.249.232:9182,172.31.249.216:9182   34m

$ oc delete pod/windows-machine-config-operator-6d9f4bf96f-kgm2c
pod "windows-machine-config-operator-6d9f4bf96f-kgm2c" deleted

$ oc get ep
NAME               ENDPOINTS   AGE
windows-exporter   <none>      16s

Version-Release number of selected component (if applicable):
OCP version: 4.9.0-0.nightly-2021-10-16-173626
WMCO version: 4.0.0+7991f6f0

How reproducible:
Always

Steps to Reproduce:
1. Scale up a Windows node created by a MachineSet and check that the windows-exporter ep contains the Windows node IPs
2. Delete the WMCO pod and wait for it to be recreated
3. Check the windows-exporter ep again

Actual results:
The windows-exporter endpoints are cleared
Expected results:
The windows-exporter endpoints should contain all Windows node IPs
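
The expected result above can be checked mechanically. The following is a minimal sketch, not part of WMCO or oc: missing_endpoint_ips is a hypothetical helper that takes the ENDPOINTS column as printed by `oc get ep` plus the Windows node IPs you expect, and prints every IP that is not listed. It assumes a cluster with only a few Windows nodes, as in this report, so that the column shows every entry.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an oc or WMCO command).
#   $1    ENDPOINTS value, e.g. "10.0.128.7:9182,10.0.128.8:9182" or "<none>"
#   $2..  expected Windows node IPs
# Prints one line per expected IP that has no ip:port entry in $1.
missing_endpoint_ips() {
    endpoints=$1
    shift
    for ip in "$@"; do
        # Wrap the list in commas so each entry starts with ",<ip>:" and
        # 10.0.128.7 cannot match 10.0.128.70 or 110.0.128.7.
        case ",$endpoints," in
            *",$ip:"*) ;;               # entry present, nothing to report
            *) printf '%s\n' "$ip" ;;   # entry missing
        esac
    done
}

# Possible usage against a live cluster (namespace taken from the comments below):
#   ep=$(oc get ep windows-exporter -n openshift-windows-machine-config-operator \
#        --no-headers | awk '{print $2}')
#   missing_endpoint_ips "$ep" 172.31.249.232 172.31.249.216
```

Empty output means every expected IP is present; with the `<none>` value from the second `oc get ep` in the description, both IPs would be printed.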

Additional info:

Comment 1 Mohammad Saif Shaikh 2021-10-20 18:34:53 UTC
I was not able to reproduce this issue. After deleting the operator pod, the windows-exporter endpoints were quickly repopulated with the IP:port of every Windows node. Environment specs:
- OCP version: latest-4.9
- WMCO version: 4.0.0+7991f6f0
- Platform: Azure


Results with a Windows MachineSet having 2 replicas, both configured as nodes:

$ oc get ep -n openshift-windows-machine-config-operator 
NAME                                              ENDPOINTS                         AGE
windows-exporter                                  10.0.128.7:9182,10.0.128.8:9182   76m

$ oc delete pod windows-machine-config-operator-5486449875-6lzzs -n openshift-windows-machine-config-operator 
pod "windows-machine-config-operator-5486449875-6lzzs" deleted

$ oc get ep -n openshift-windows-machine-config-operator 
NAME                                              ENDPOINTS           AGE
windows-exporter                                  <none>              0s

$ oc get ep -n openshift-windows-machine-config-operator 
NAME                                              ENDPOINTS                         AGE
windows-exporter                                  10.0.128.7:9182,10.0.128.8:9182   4s

Comment 2 Mohammad Saif Shaikh 2021-10-20 20:42:59 UTC
@sgao I also did not see this on vSphere. Cluster installed using 4.9 nightly build and WMCO version was 4.0.0+ba09417.

$ oc get ep -n openshift-windows-machine-config-operator
NAME                                              ENDPOINTS                                 AGE
windows-exporter                                  172.31.251.205:9182,172.31.251.132:9182   13m
windows-machine-config-operator-registry-server   10.129.2.14:50051                         178m

$ oc get pods -A |grep windows
openshift-windows-machine-config-operator          windows-machine-config-operator-74db66f78f-vmfn4                  1/1     Running     0               16m
openshift-windows-machine-config-operator          windows-machine-config-operator-registry-server-d75f9658d-885rl   1/1     Running     0               179m

$ oc delete pod windows-machine-config-operator-74db66f78f-vmfn4 -n openshift-windows-machine-config-operator
pod "windows-machine-config-operator-74db66f78f-vmfn4" deleted

$ oc get ep -n openshift-windows-machine-config-operator
NAME                                              ENDPOINTS                                 AGE
windows-exporter                                  172.31.251.205:9182,172.31.251.132:9182   1s

$ oc get ep -n openshift-windows-machine-config-operator
NAME                                              ENDPOINTS                                 AGE
windows-exporter                                  172.31.251.205:9182,172.31.251.132:9182   56s

Comment 3 Mohammad Saif Shaikh 2021-10-20 20:50:38 UTC
The team has triaged this bug and decided it is not a blocker to the v4.0.0 release of WMCO. The endpoints object seems to be properly repopulated with Windows node IPs at most a few minutes after the operator pod is restarted.
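
To tell a brief gap right after the restart from endpoints that stay empty, the check can be polled with a timeout. This is only a sketch under stated assumptions: wait_for_endpoints is a hypothetical helper, not an oc or WMCO feature, and the timeout and interval values in the usage comment are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration).
#   $1    timeout in seconds
#   $2    polling interval in seconds (use a positive value in practice)
#   $3..  command that prints the ENDPOINTS value
# Prints the value and returns 0 once it is neither empty nor "<none>";
# returns 1 if it is still empty when the timeout has elapsed.
wait_for_endpoints() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while :; do
        value=$("$@" 2>/dev/null)
        if [ -n "$value" ] && [ "$value" != "<none>" ]; then
            printf '%s\n' "$value"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Possible usage against a live cluster:
#   get_ep() {
#       oc get ep windows-exporter -n openshift-windows-machine-config-operator \
#           --no-headers | awk '{print $2}'
#   }
#   wait_for_endpoints 300 5 get_ep || echo "endpoints still empty after 5 minutes"
```

Passing the command as arguments keeps the helper independent of how the ENDPOINTS value is obtained, so it can be exercised without a cluster.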

Comment 4 gaoshang 2021-10-21 01:14:59 UTC
@mohashai That's strange. On vSphere here, the ep stays empty (still empty after 30 minutes) unless a WMCO reconcile is triggered by scaling a node up or down. I'll keep my environment up tonight, thanks.

$ oc get node  -l kubernetes.io/os=windows
NAME              STATUS   ROLES    AGE   VERSION
winworker-2f7np   Ready    worker   52m   v1.22.1-1676+af080cb8d127b3
winworker-flw7c   Ready    worker   56m   v1.22.1-1676+af080cb8d127b3

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      30m

Comment 5 gaoshang 2021-10-25 12:41:12 UTC
@mohashai I found that with the template windows-server-2004-template-nics-vmtoolsv11333, this bug no longer exists on OCP 4.9.0-0.nightly-2021-10-22-102153 + vSphere.

Comment 6 Mohammad Saif Shaikh 2021-10-25 20:38:25 UTC
Thanks for identifying a solution @sgao, I've moved this to ON_QA. You can mark it as verified when it is good on your end. We'll try to get this into WMCO v4.0.0 (OCP 4.9), though it is still not a blocker for the release.

Comment 7 Ronnie Rasouli 2021-11-02 12:29:09 UTC
oc delete pod/windows-machine-config-operator-67d8b7d6d6-bcfhd
pod "windows-machine-config-operator-67d8b7d6d6-bcfhd" deleted
rrasouli@rrasouli-mac openshift-tests-private % oc get pod
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-67d8b7d6d6-fcnld   1/1     Running   0          5s
rrasouli@rrasouli-mac openshift-tests-private % oc get ep
NAME               ENDPOINTS                             AGE
windows-exporter   10.0.154.207:9182,10.0.159.181:9182   5s

verified on 3.1.0+8ffe65a

Comment 10 errata-xmlrpc 2021-12-13 12:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.1 product release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4757

