Bug 1953692 - WMCO incorrectly shows node as ready after a failed configuration
Summary: WMCO incorrectly shows node as ready after a failed configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.7
Hardware: x86_64
OS: Windows
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aravindh Puthiyaparambil
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks: 1956412
Reported: 2021-04-26 16:43 UTC by Aravindh Puthiyaparambil
Modified: 2021-08-03 20:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: WMCO was not cordoning the nodes after initial kubelet setup, prematurely making the node available for scheduling.
Consequence: Pods would get scheduled on the nodes but would not go to Running.
Fix: Cordon the nodes after initial kubelet setup and uncordon after full configuration.
Result: The node no longer prematurely accepts workloads.
Clone Of:
: 1956412
Environment:
Last Closed: 2021-08-03 20:29:16 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift windows-machine-config-operator pull 424 0 None closed Bug 1953692: Fix node being shown as Ready after failed config 2021-05-01 16:07:22 UTC
Red Hat Product Errata RHSA-2021:3001 0 None None None 2021-08-03 20:29:49 UTC

Description Aravindh Puthiyaparambil 2021-04-26 16:43:14 UTC
Description of problem:
In scenarios where hybrid-overlay or kube-proxy fails to come up when WMCO is configuring the Windows instances, the node still incorrectly shows as ready. 



How reproducible: Always


Steps to Reproduce:
1. Create a Windows node and force either hybrid-overlay or kube-proxy to fail during configuration

Actual results: Node is shown as ready


Expected results: Node should be marked as not ready
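
The fix recorded in the Doc Text (cordon after initial kubelet setup, uncordon only after full configuration) can be modeled with a small sketch. This is not WMCO's code, which is written in Go; the `Node` class, `configure_node`, and the two step functions are hypothetical names used only to illustrate the ordering.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Toy stand-in for a Kubernetes node; not a real API object."""
    name: str
    ready: bool = False          # kubelet reports Ready
    unschedulable: bool = False  # True means the node is cordoned
    log: List[str] = field(default_factory=list)


def configure_node(node: Node, steps: List[Callable[[Node], None]]) -> bool:
    """Cordon right after kubelet setup; uncordon only if every step succeeds."""
    node.ready = True            # kubelet is up, so the node registers as Ready
    node.unschedulable = True    # cordon immediately so no pods are scheduled yet
    for step in steps:
        try:
            step(node)
        except RuntimeError as err:
            node.log.append(f"configuration failed: {err}")
            return False         # leave the node cordoned
    node.unschedulable = False   # full configuration done, accept workloads
    return True


def start_hybrid_overlay(node: Node) -> None:
    """Hypothetical step that succeeds."""
    node.log.append("hybrid-overlay started")


def failing_kube_proxy(node: Node) -> None:
    """Hypothetical step that simulates a kube-proxy startup failure."""
    raise RuntimeError("kube-proxy failed to start")


if __name__ == "__main__":
    good = Node("win-ok")
    print(configure_node(good, [start_hybrid_overlay]), good.unschedulable)
    bad = Node("win-bad")
    print(configure_node(bad, [start_hybrid_overlay, failing_kube_proxy]), bad.unschedulable)
```

In this model a failed step leaves the node cordoned, so the scheduler skips it even while the kubelet still reports Ready.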

Comment 2 gaoshang 2021-05-06 12:56:35 UTC
This bug has been verified on OCP 4.8 + vSphere + Windows Server 2019 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-04-30-201824

Steps:

1. Install OCP 4.8 on vSphere, then build and install WMCO; see https://github.com/openshift/windows-machine-config-operator/blob/master/docs/HACKING.md

2. Create Windows machineset with Windows Server 2019

3. Check WMCO log and watch Windows node status

1) Once the kubelet service has started, the Windows node is Ready but cordoned.

$ oc logs -f deployment.apps/windows-machine-config-operator -n openshift-windows-machine-config-operator
...
2021-05-06T11:59:48.281Z	INFO	VM 172.31.249.149	configured kubelet	{"cmd": "C:\\k\\\\wmcb.exe initialize-kubelet --ignition-file C:\\Windows\\Temp\\worker.ign --kubelet-path C:\\k\\kubelet.exe", "output": "Bootstrapping completed successfully"}


$ oc get nodes -l kubernetes.io/os=windows -owide
NAME              STATUS                     ROLES    AGE   VERSION                            INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
winworker-zk6s4   Ready,SchedulingDisabled   worker   14s   v1.21.0-rc.0.1190+e22a836a8b2659   172.31.249.149   172.31.249.149   Windows Server 2019 Standard   10.0.17763.1697   docker://19.3.14

2) Wait until the hybrid-overlay-node service fails to run; the Windows node is then NotReady and cordoned.

2021-05-06T12:13:04.920Z	ERROR	controller-runtime.manager.controller.machine	Reconciler error	{"reconciler group": "machine.openshift.io", "reconciler kind": "Machine", "name": "winworker-zk6s4", "namespace": "openshift-machine-api", "error": "failed to configure Windows VM 422c050e-a0bc-b215-2a89-3986cbc84aab: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for winworker-zk6s4: timeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation: timed out waiting for the condition", "errorVerbose": "timed out waiting for the condition\ntimeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).waitForNodeAnnotation\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:306\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).configureNetwork\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:225\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:170\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/intern
al/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nerror waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for winworker-zk6s4\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).configureNetwork\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:226\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:170\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/contro
ller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nconfiguring node network failed\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure.func1\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:171\ngithub.com/openshift/windows-machine-config-operator/pkg/nodeconfig.(*nodeConfig).Configure\n\t/build/windows-machine-config-operator/pkg/nodeconfig/nodeconfig.go:193\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:440\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nfailed to configure Windows VM 
422c050e-a0bc-b215-2a89-3986cbc84aab\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).addWorkerNode\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:442\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).Reconcile\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:374\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214
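
The error above is a timed wait on the k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation that never succeeded. A minimal sketch of that kind of poll-with-timeout follows, assuming a caller-supplied `get_annotations` function; `wait_for_annotation` and its parameters are hypothetical and do not reproduce the `waitForNodeAnnotation` implementation named in the trace.

```python
import time
from typing import Callable, Dict

KEY = "k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac"


def wait_for_annotation(
    get_annotations: Callable[[], Dict[str, str]],
    key: str,
    timeout: float,
    interval: float = 0.01,
) -> str:
    """Poll until the annotation has a non-empty value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = get_annotations().get(key)
        if value:
            return value
        if time.monotonic() >= deadline:
            # Mirrors the wording of the log message, not the real code path.
            raise TimeoutError(f"timeout waiting for {key} node annotation")
        time.sleep(interval)
```

A caller would treat the `TimeoutError` as a configuration failure and leave the node cordoned.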

$ oc get nodes -l kubernetes.io/os=windows -owide
NAME              STATUS                        ROLES    AGE   VERSION                            INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
winworker-zk6s4   NotReady,SchedulingDisabled   worker   53m   v1.21.0-rc.0.1190+e22a836a8b2659   172.31.249.149   172.31.249.149   Windows Server 2019 Standard   10.0.17763.1697   docker://19.3.14
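
The STATUS column in the two `oc get nodes` outputs carries both readiness and the cordon flag. The sketch below shows one way to check them programmatically; `parse_status` and `parse_nodes` are hypothetical helpers that assume the whitespace-separated table layout shown above, with NAME and STATUS as the first two columns.

```python
from typing import Dict, NamedTuple


class NodeState(NamedTuple):
    ready: bool
    cordoned: bool

    @property
    def schedulable(self) -> bool:
        """A node accepts new pods only when Ready and not cordoned."""
        return self.ready and not self.cordoned


def parse_status(status: str) -> NodeState:
    """Split a STATUS value such as 'Ready,SchedulingDisabled' into flags."""
    parts = status.split(",")
    # Exact match, so 'NotReady' does not count as 'Ready'.
    return NodeState(ready="Ready" in parts, cordoned="SchedulingDisabled" in parts)


def parse_nodes(table: str) -> Dict[str, NodeState]:
    """Map node name to state from `oc get nodes` text output (header row first)."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    states: Dict[str, NodeState] = {}
    for line in lines[1:]:  # skip the header row
        name, status = line.split()[:2]
        states[name] = parse_status(status)
    return states
```

With the outputs from this verification, winworker-zk6s4 is never schedulable: it is cordoned in both states and not Ready in the second.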

Comment 5 errata-xmlrpc 2021-08-03 20:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001

