Bug 2015773

Summary: Deleting version annotation failed to trigger Windows node reconcile on vSphere
Product: OpenShift Container Platform
Reporter: gaoshang <sgao>
Component: Windows Containers
Assignee: elango siva <esiva>
Status: CLOSED ERRATA
QA Contact: gaoshang <sgao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.9
CC: aos-bugs, esiva, team-winc
Target Milestone: ---
Target Release: 4.9.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-13 12:46:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description gaoshang 2021-10-20 05:21:03 UTC
Description of problem: In previous WMCO versions, deleting the version annotation caused WMCO to reconfigure the Windows node, and the version annotation was added back after the reconfiguration. On vSphere, WMCO now keeps reporting the `no internal IP address associated` error, does not trigger a Windows node reconcile, and never adds the version annotation back.

{"level":"error","ts":1634565502.8201108,"logger":"controller-runtime.manager.controller.machine","msg":"Reconciler error","reconciler group":"machine.openshift.io","reconciler kind":"Machine","name":"winworker-hqh94","namespace":"openshift-machine-api","error":"invalid machine winworker-hqh94: no internal IP address associated","errorVerbose":"no internal IP address associated\ngithub.com/openshift/windows-machine-config-operator/controllers.getInternalIPAddress\n\t/remote-source/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:497\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/remote-source/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371\ninvalid machine 
winworker-hqh94\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/remote-source/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:299\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
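
When triaging many entries like the one above, it can help to pull the key fields out of the JSON log line. The following is a minimal Python sketch and is not part of WMCO; the function name `summarize_reconciler_error` is hypothetical, and the sample entry is a shortened copy of the log above with the `errorVerbose` and `stacktrace` fields omitted.

```python
import json


def summarize_reconciler_error(line):
    """Return (namespace, name, error) for an error-level log entry, else None.

    Hypothetical helper for illustration; the field names are taken from
    the log entry quoted in this report.
    """
    entry = json.loads(line)
    if entry.get("level") != "error":
        return None
    return entry.get("namespace"), entry.get("name"), entry.get("error")


# Shortened copy of the entry above (errorVerbose and stacktrace omitted).
sample = (
    '{"level":"error","ts":1634565502.8201108,'
    '"logger":"controller-runtime.manager.controller.machine",'
    '"msg":"Reconciler error","reconciler group":"machine.openshift.io",'
    '"reconciler kind":"Machine","name":"winworker-hqh94",'
    '"namespace":"openshift-machine-api",'
    '"error":"invalid machine winworker-hqh94: no internal IP address associated"}'
)

print(summarize_reconciler_error(sample))
```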

Version-Release number of selected component (if applicable):
OCP version: 4.9.0-0.nightly-2021-10-16-173626
WMCO version: 4.0.0+7991f6f0

How reproducible:
Always

Steps to Reproduce:
1. Scale up a Windows node created by a machineset
2. Delete the version annotation of the Windows node
3. Check the WMCO log

Actual results:
WMCO keeps reporting the `no internal IP address associated` error
Expected results:
WMCO should trigger a Windows node reconcile and add the version annotation back

Additional info:

Comment 1 elango siva 2021-10-20 22:03:14 UTC
I was able to reproduce this issue locally.

There was a VMware Tools configuration issue that @Jose identified and fixed with a temporary vSphere Windows golden image.
I tried the jvaldes/windows-server-2004-template-nics-vmtoolsv11333 image, which is present in the vCenter (used by the dev team). With that image I don't see the problem, and the root cause of the issue is not observed. Basically, the IP address information in the Windows machine object is getting wiped out, and that is causing this issue. It doesn't happen with @Jose's image.
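
To illustrate why wiped addresses lead to this error: the stack trace in the description points at `getInternalIPAddress`, which returns `no internal IP address associated`. The Python sketch below mirrors that kind of lookup. It is not WMCO's actual code (WMCO is written in Go), and it assumes the Machine status carries a list of addresses with `type` and `address` keys, as Kubernetes node addresses do. The addresses used are made up.

```python
def get_internal_ip(addresses):
    """Return the first InternalIP from a list of address dicts.

    Illustrative only: assumes entries shaped like Kubernetes NodeAddress
    objects, e.g. {"type": "InternalIP", "address": "10.0.0.5"}.
    """
    for addr in addresses or []:
        if addr.get("type") == "InternalIP" and addr.get("address"):
            return addr["address"]
    raise ValueError("no internal IP address associated")


# Made-up addresses for illustration.
populated = [
    {"type": "InternalDNS", "address": "winworker-example"},
    {"type": "InternalIP", "address": "10.0.0.5"},
]
print(get_internal_ip(populated))

# A Machine whose addresses were wiped hits the error path.
try:
    get_internal_ip([])
except ValueError as err:
    print(err)
```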

1) Run `oc annotate node winworker-dt6ck windowsmachineconfig.openshift.io/version-`
2) Check the node info; the version annotation is missing
    esiva:/home/esiva/go/src/windows-machine-config-operator
    $ oc describe node winworker-dt6ck
    Name:               winworker-dt6ck
    Roles:              worker
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=windows
                        kubernetes.io/arch=amd64
                        kubernetes.io/hostname=winworker-dt6ck
                        kubernetes.io/os=windows
                        node-role.kubernetes.io/worker=
                        node.kubernetes.io/windows-build=10.0.19041
                        node.openshift.io/os_id=Windows
    Annotations:        k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 00-15-5D-74-48-03
                        k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.1.0/24
                        machine.openshift.io/machine: openshift-machine-api/winworker-dt6ck
                        volumes.kubernetes.io/controller-managed-attach-detach: true
                        windowsmachineconfig.openshift.io/pub-key-hash: 7c00ba8122aa764a192fe7d2d9ac4d3627b9c443c09480b18c055c2e178a6019

3) Wait a while for the reconciler to kick in

4) The node comes back to the Ready state

5) Verify that the version annotation is back

$ oc describe node winworker-dt6ck 
Name:               winworker-dt6ck
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=windows
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=winworker-dt6ck
                    kubernetes.io/os=windows
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/windows-build=10.0.19041
                    node.openshift.io/os_id=Windows
Annotations:        k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 00-15-5D-74-48-03
                    k8s.ovn.org/hybrid-overlay-node-subnet: 10.132.1.0/24
                    machine.openshift.io/machine: openshift-machine-api/winworker-dt6ck
                    volumes.kubernetes.io/controller-managed-attach-detach: true
                    windowsmachineconfig.openshift.io/pub-key-hash: 7c00ba8122aa764a192fe7d2d9ac4d3627b9c443c09480b18c055c2e178a6019
                    windowsmachineconfig.openshift.io/version: 4.0.0+ba09417
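
The check in steps 2 and 5 can also be scripted. This is a minimal Python sketch, assuming the node object has been exported as JSON (for example with `oc get node winworker-dt6ck -o json`). The helper name `version_annotation` is hypothetical, and the sample objects are abbreviated from the `oc describe` output above.

```python
import json

VERSION_ANNOTATION = "windowsmachineconfig.openshift.io/version"


def version_annotation(node):
    """Return the WMCO version annotation of a node dict, or None if absent."""
    annotations = node.get("metadata", {}).get("annotations") or {}
    return annotations.get(VERSION_ANNOTATION)


# Abbreviated node object built from the `oc describe` output above.
node_after = json.loads("""
{
  "metadata": {
    "name": "winworker-dt6ck",
    "annotations": {
      "machine.openshift.io/machine": "openshift-machine-api/winworker-dt6ck",
      "windowsmachineconfig.openshift.io/version": "4.0.0+ba09417"
    }
  }
}
""")

# The same node right after the annotation was deleted (step 2), abbreviated.
node_before = {"metadata": {"name": "winworker-dt6ck", "annotations": {}}}

print(version_annotation(node_before))
print(version_annotation(node_after))
```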


If the QE team uses the same vCenter, one can use `windows-golden-images/windows-server-2004-template-nics-vmtoolsv11333` instead of `windows-golden-images/windows-server-2004-template` in the template.

Comment 3 elango siva 2021-10-20 22:22:38 UTC
I tried `jvaldes/windows-server-2004-template-nics-vmtoolsv11333`, which is the same as `windows-golden-images/windows-server-2004-template-nics-vmtoolsv11333`. Jose placed it in the proper folder.

Comment 4 gaoshang 2021-10-25 12:35:11 UTC
With the template windows-server-2004-template-nics-vmtoolsv11333, this bug no longer exists on OCP 4.9.0-0.nightly-2021-10-22-102153, thanks.

Comment 7 errata-xmlrpc 2021-12-13 12:46:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Windows Container Support for Red Hat OpenShift 4.0.1 product release), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4757