Bug 1953692

Summary: WMCO incorrectly shows node as ready after a failed configuration
Product: OpenShift Container Platform
Reporter: Aravindh Puthiyaparambil <aravindh>
Component: Windows Containers
Assignee: Aravindh Puthiyaparambil <aravindh>
Status: CLOSED ERRATA
QA Contact: gaoshang <sgao>
Severity: high
Priority: high
Version: 4.7
CC: aos-bugs, gfontana, mankulka
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: x86_64   
OS: Windows   
Doc Type: Bug Fix
Doc Text:
Cause: WMCO was not cordoning the nodes after initial kubelet setup, which prematurely made the nodes available for scheduling.
Consequence: Pods would get scheduled on the nodes but would not reach the Running state.
Fix: WMCO now cordons the nodes after initial kubelet setup and uncordons them after full configuration.
Result: Nodes no longer accept workloads prematurely.
Last Closed: 2021-08-03 20:29:16 UTC
Type: Bug
Bug Blocks: 1956412    

Description Aravindh Puthiyaparambil 2021-04-26 16:43:14 UTC
Description of problem:
In scenarios where hybrid-overlay or kube-proxy fails to come up while WMCO is configuring a Windows instance, the node is still incorrectly shown as ready.



How reproducible: Always


Steps to Reproduce:
1. Create a Windows node and force either hybrid-overlay or kube-proxy to fail during configuration

Actual results: Node is shown as ready


Expected results: Node should be marked as not ready

Comment 2 gaoshang 2021-05-06 12:56:35 UTC
This bug has been verified on OCP 4.8 + vSphere + Windows Server 2019 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-04-30-201824

Steps:

1. Install OCP 4.8 on vSphere, then build and install WMCO as described in https://github.com/openshift/windows-machine-config-operator/blob/master/docs/HACKING.md

2. Create Windows machineset with Windows Server 2019

3. Check the WMCO log and watch the Windows node status

1) Once the kubelet service has started, the Windows node is Ready but cordoned.

$ oc logs -f deployment.apps/windows-machine-config-operator -n openshift-windows-machine-config-operator
...
2021-05-06T11:59:48.281Z	INFO	VM 172.31.249.149	configured kubelet	{"cmd": "C:\\k\\\\wmcb.exe initialize-kubelet --ignition-file C:\\Windows\\Temp\\worker.ign --kubelet-path C:\\k\\kubelet.exe", "output": "Bootstrapping completed successfully"}


$ oc get nodes -l kubernetes.io/os=windows -owide
NAME              STATUS                     ROLES    AGE   VERSION                            INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
winworker-zk6s4   Ready,SchedulingDisabled   worker   14s   v1.21.0-rc.0.1190+e22a836a8b2659   172.31.249.149   172.31.249.149   Windows Server 2019 Standard   10.0.17763.1697   docker://19.3.14

2) Wait until running the hybrid-overlay-node service fails; the Windows node is then NotReady and cordoned.

2021-05-06T12:13:04.920Z	ERROR	controller-runtime.manager.controller.machine	Reconciler error	{"reconciler group": "machine.openshift.io", "reconciler kind": "Machine", "name": "winworker-zk6s4", "namespace": "openshift-machine-api", "error": "failed to configure Windows VM 422c050e-a0bc-b215-2a89-3986cbc84aab: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for winworker-zk6s4: timeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation: timed out waiting for the condition", "errorVerbose": "timed out waiting for the condition\ntimeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).waitForNodeAnnotation\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:306\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).configureNetwork\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:225\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:170\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/intern
al/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nerror waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for winworker-zk6s4\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).configureNetwork\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:226\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:170\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/contro
ller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nconfiguring node network failed\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:171\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nfailed to configure Windows VM 
422c050e-a0bc-b215-2a89-3986cbc84aab\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:442\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214

$ oc get nodes -l kubernetes.io/os=windows -owide
NAME              STATUS                        ROLES    AGE   VERSION                            INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
winworker-zk6s4   NotReady,SchedulingDisabled   worker   53m   v1.21.0-rc.0.1190+e22a836a8b2659   172.31.249.149   172.31.249.149   Windows Server 2019 Standard   10.0.17763.1697   docker://19.3.14
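
The STATUS values observed in the two checks above can also be interpreted programmatically. The following is a rough sketch, assuming only the comma-separated STATUS format shown in the `oc get nodes` output; `nodeState`, `parseStatus`, and `acceptsWorkloads` are hypothetical names, and the check deliberately ignores taints and other scheduling constraints.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeState is the parsed STATUS column of one `oc get nodes` row.
type nodeState struct {
	ready    bool
	cordoned bool
}

// parseStatus splits a STATUS value such as "NotReady,SchedulingDisabled".
func parseStatus(status string) nodeState {
	var st nodeState
	for _, part := range strings.Split(status, ",") {
		switch strings.TrimSpace(part) {
		case "Ready":
			st.ready = true
		case "SchedulingDisabled":
			st.cordoned = true
		}
	}
	return st
}

// acceptsWorkloads is a simplified check: the node is Ready and not cordoned.
func (s nodeState) acceptsWorkloads() bool { return s.ready && !s.cordoned }

func main() {
	statuses := []string{"Ready,SchedulingDisabled", "NotReady,SchedulingDisabled", "Ready"}
	for _, status := range statuses {
		st := parseStatus(status)
		fmt.Printf("%-28s ready=%v cordoned=%v acceptsWorkloads=%v\n",
			status, st.ready, st.cordoned, st.acceptsWorkloads())
	}
}
```

With the fix in place, neither of the two states seen during verification counts as accepting workloads under this check.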

Comment 5 errata-xmlrpc 2021-08-03 20:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001