Description of problem:
If the WMCO pod is recreated, the windows-exporter endpoints are cleared, so Prometheus can no longer scrape them.

$ oc get ep
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.249.232:9182,172.31.249.216:9182   34m

$ oc delete pod/windows-machine-config-operator-6d9f4bf96f-kgm2c
pod "windows-machine-config-operator-6d9f4bf96f-kgm2c" deleted

$ oc get ep
NAME               ENDPOINTS   AGE
windows-exporter   <none>      16s

Version-Release number of selected component (if applicable):
OCP version: 4.9.0-0.nightly-2021-10-16-173626
WMCO version: 4.0.0+7991f6f0

How reproducible:
Always

Steps to Reproduce:
1. Scale up a Windows node created by a MachineSet and check that the windows-exporter endpoints contain the Windows IPs.
2. Delete the WMCO pod and wait for it to be recreated.
3. Check the windows-exporter endpoints again.

Actual results:
The windows-exporter endpoints are cleared.

Expected results:
The windows-exporter endpoints should contain all Windows IPs.

Additional info:

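The check in steps 1 and 3 above can be sketched as a small shell helper. This is only an illustration: the function name exporter_has_endpoints is hypothetical, and it assumes the default three-column output of `oc get ep` shown in the report.

```shell
#!/bin/sh
# Hypothetical helper (not part of WMCO or oc): reads `oc get ep` output on
# stdin and succeeds only if the windows-exporter row lists at least one
# address in the ENDPOINTS column.
exporter_has_endpoints() {
    awk '$1 == "windows-exporter" { found = ($2 != "" && $2 != "<none>") }
         END { exit !found }'
}

# Example use against a live cluster (assumes the current project is the
# WMCO namespace, as in the report):
#   oc get ep | exporter_has_endpoints && echo "populated" || echo "empty"
```

Running the helper before and after deleting the WMCO pod should print "populated" both times if the endpoints are repopulated as expected.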
I was not able to reproduce this issue. After deleting the operator pod, the windows-exporter endpoints were quickly repopulated with all Windows IP:port pairs.

Environment specs:
- OCP version: latest-4.9
- WMCO version: 4.0.0+7991f6f0
- Platform: Azure

Results with a Windows MachineSet having 2 replicas, both configured as nodes:

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                         AGE
windows-exporter   10.0.128.7:9182,10.0.128.8:9182   76m

$ oc delete pod windows-machine-config-operator-5486449875-6lzzs -n openshift-windows-machine-config-operator
pod "windows-machine-config-operator-5486449875-6lzzs" deleted

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      0s

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                         AGE
windows-exporter   10.0.128.7:9182,10.0.128.8:9182   4s

@sgao I also did not see this on vSphere. The cluster was installed using a 4.9 nightly build and the WMCO version was 4.0.0+ba09417.

$ oc get ep -n openshift-windows-machine-config-operator
NAME                                              ENDPOINTS                                 AGE
windows-exporter                                  172.31.251.205:9182,172.31.251.132:9182   13m
windows-machine-config-operator-registry-server   10.129.2.14:50051                         178m

$ oc get pods -A |grep windows
openshift-windows-machine-config-operator   windows-machine-config-operator-74db66f78f-vmfn4                  1/1   Running   0   16m
openshift-windows-machine-config-operator   windows-machine-config-operator-registry-server-d75f9658d-885rl   1/1   Running   0   179m

$ oc delete pod windows-machine-config-operator-74db66f78f-vmfn4 -n openshift-windows-machine-config-operator
pod "windows-machine-config-operator-74db66f78f-vmfn4" deleted

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.251.205:9182,172.31.251.132:9182   1s

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS                                 AGE
windows-exporter   172.31.251.205:9182,172.31.251.132:9182   56s

The team has triaged this bug and decided it is not a blocker to the v4.0.0 release of WMCO. The endpoints object seems to be properly repopulated with Windows node IPs at most a few minutes after the operator pod is restarted.
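To check the "at most a few minutes" window, a polling loop with a timeout could be used. The following is a minimal sketch, not something taken from WMCO: wait_until and endpoints_populated are hypothetical names, and the 300-second timeout in the usage comment is an assumed value.

```shell
#!/bin/sh
# Hypothetical helper: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds (returns 0) or
# until roughly TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Hypothetical check: succeeds when the windows-exporter Endpoints object
# has at least one ready address.
endpoints_populated() {
    [ -n "$(oc get ep windows-exporter \
        -n openshift-windows-machine-config-operator \
        -o jsonpath='{.subsets[*].addresses[*].ip}')" ]
}

# Example use after deleting the operator pod (timeout value is an assumption):
#   wait_until 300 5 endpoints_populated || echo "endpoints still empty"
```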
@mohashai That's strange. On vSphere here, the endpoints stay empty (still empty after 30 minutes) unless a WMCO reconcile is triggered, for example by scaling nodes up or down. I'll keep my environment up tonight, thanks.

$ oc get node -l kubernetes.io/os=windows
NAME              STATUS   ROLES    AGE   VERSION
winworker-2f7np   Ready    worker   52m   v1.22.1-1676+af080cb8d127b3
winworker-flw7c   Ready    worker   56m   v1.22.1-1676+af080cb8d127b3

$ oc get ep -n openshift-windows-machine-config-operator
NAME               ENDPOINTS   AGE
windows-exporter   <none>      30m

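The mismatch shown above (Windows nodes Ready, endpoints empty) could be checked mechanically by comparing node IPs with endpoint IPs. This is a rough sketch only: missing_ips is a hypothetical helper, and the usage comment assumes that the endpoint addresses are the nodes' InternalIP addresses.

```shell
#!/bin/sh
# Hypothetical helper: missing_ips "EXPECTED_IPS" "ACTUAL_IPS"
# Both arguments are space-separated lists. Prints each expected IP that
# does not appear in the actual list, one per line.
missing_ips() {
    for ip in $1; do
        case " $2 " in
            *" $ip "*) ;;            # present in the actual list
            *) echo "$ip" ;;         # expected but missing
        esac
    done
}

# Example use against a live cluster (InternalIP is an assumption about
# which node address the exporter endpoints use):
#   nodes=$(oc get node -l kubernetes.io/os=windows \
#       -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')
#   eps=$(oc get ep windows-exporter -n openshift-windows-machine-config-operator \
#       -o jsonpath='{.subsets[*].addresses[*].ip}')
#   missing_ips "$nodes" "$eps"
```

An empty result would mean every Windows node IP is present in the endpoints object.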
@mohashai I found that with the template windows-server-2004-template-nics-vmtoolsv11333, this bug no longer occurs on OCP 4.9.0-0.nightly-2021-10-22-102153 + vSphere.
Thanks for identifying a solution @sgao, I've moved this to ON_QA. You can mark it as verified when it is good on your end. We'll try to get this into WMCO v4.0.0 (OCP 4.9), though it remains a non-blocker for the release.
oc delete pod/windows-machine-config-operator-67d8b7d6d6-bcfhd
pod "windows-machine-config-operator-67d8b7d6d6-bcfhd" deleted

rrasouli@rrasouli-mac openshift-tests-private % oc get pod
NAME                                               READY   STATUS    RESTARTS   AGE
windows-machine-config-operator-67d8b7d6d6-fcnld   1/1     Running   0          5s

rrasouli@rrasouli-mac openshift-tests-private % oc get ep
NAME               ENDPOINTS                             AGE
windows-exporter   10.0.154.207:9182,10.0.159.181:9182   5s

Verified on 3.1.0+8ffe65a

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.1 product release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4757