+++ This bug was initially created as a clone of Bug #1953692 +++

Description of problem:
In scenarios where hybrid-overlay or kube-proxy fails to come up when WMCO is configuring the Windows instances, the node still incorrectly shows as ready.

How reproducible:
Always

Steps to Reproduce:
1. Create a Windows node and force either hybrid-overlay or kube-proxy to fail during configuration

Actual results:
Node is shown as ready

Expected results:
Node should be marked as not ready

--- Additional comment from Giovanni Fontana on 2021-04-26 21:07:02 UTC ---

Hi Team,

I found this issue during my tests in a VMware lab. It seems that WMCO is not "catching" the error related to hybrid-overlay-node and, because of that, it incorrectly considers the node *Ready* when it is not.

It is easy to reproduce if you have a VMware lab. Just prepare the Windows image using the wrong OS version: Windows 2019, instead of Windows 1909. You will be able to add it to the OpenShift cluster and it will be shown as Ready; however, when you try to run any pod you will get an error like the one below:

[gfontana@bastion ~]$ oc get pods
win-webserver-864f558d99-h7h8l   0/1   CrashLoopBackOff   9   5m16s

[gfontana@bastion ~]$ oc describe pod win-webserver-864f558d99-h7h8l
Name:         win-webserver-864f558d99-h7h8l
(...)
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  5m56s                   default-scheduler  Successfully assigned netcandystore/win-webserver-864f558d99-h7h8l to gfontana-djglw-worker-windows-m97zk
  Normal   Pulled     5m33s (x4 over 5m54s)   kubelet            Container image "mcr.microsoft.com/windows/servercore:ltsc2019" already present on machine
  Normal   Created    5m33s (x4 over 5m54s)   kubelet            Created container windowswebserver
  Normal   Started    5m32s (x4 over 5m52s)   kubelet            Started container windowswebserver
  Normal   Killing    5m22s (x6 over 5m50s)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  BackOff    5m20s (x3 over 5m27s)   kubelet            Back-off restarting failed container

Looking at the WMCO logs, you will notice an error related to *hybrid-overlay-node*:

2021-04-20T19:38:59.469Z INFO VM 4201c72e-216a-13d1-69f6-8c6d0573bee7 configured {"service": "hybrid-overlay-node", "args": "--node gfontana-djglw-worker-windows-bvkc2 --hybrid-overlay-vxlan-port=9898 --k8s-kubeconfig c:\\k\\kubeconfig --windows-service --logfile C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log\" depend= kubelet"}
2021-04-20T19:48:59.486Z ERROR controller-runtime.controller Reconciler error {"controller": "windowsmachine-controller", "request": "openshift-machine-api/gfontana-djglw-worker-windows-bvkc2", "error": "failed to configure Windows VM 4201c72e-216a-13d1-69f6-8c6d0573bee7: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for gfontana-djglw-worker-windows-bvkc2: timeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation: timed out waiting for the condition", "errorVerbose": "timed out waiting for the condition\ntimeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation\ngithub.com/openshift/windows-machine-config-operator/pkg/controller/windowsmachine/nodeconfig.(*nodeConfig).waitForNodeAnnotation\n\t/remote-source/build/windows-machine-config-operator/pkg/controller/windowsmachine/nodeconfig/nodeconfig.go:264 ...

However, there is no error in the C:\var\log\hybrid-overlay\hybrid-overlay.log file, as you can see below:

PS C:\Users\Administrator> cat C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log
I0420 15:30:48.285616 2152 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 15:30:48.415593 2152 cert_rotation.go:137] Starting client certificate rotation controller
I0420 15:49:45.282316 352 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 15:49:45.294338 352 cert_rotation.go:137] Starting client certificate rotation controller
I0420 15:50:45.527623 3808 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:08:46.544366 4068 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:09:31.197866 2668 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 16:09:31.220846 2668 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:09:46.786043 2364 cert_rotation.go:137] Starting client certificate rotation controller

You will see that the hybrid-overlay-node is not running:

PS C:\Users\Administrator> Get-Service | ?{ $_.Name -match "kube|overlay|docker" }

Status   Name               DisplayName
------   ----               -----------
Running  docker             Docker Engine
Stopped  hybrid-overlay-... hybrid-overlay-node
Running  kubelet            kubelet
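
For illustration only, the readiness condition implied by the expected results can be sketched in Python. This is not how WMCO implements the fix. The function names, the `REQUIRED_SERVICES` tuple, and the rule that a service missing from the listing counts as not running are all assumptions made for this sketch. It parses the default Get-Service table shown above and keys on the DisplayName column, because the Name column is truncated in that output.

```python
from typing import Dict, List, Sequence

# Services named in the bug description. Treating exactly these two as
# "required" is an assumption of this sketch, not WMCO's actual list.
REQUIRED_SERVICES = ("hybrid-overlay-node", "kube-proxy")

# Get-Service output copied from the comment above.
SAMPLE = """\
Status   Name               DisplayName
------   ----               -----------
Running  docker             Docker Engine
Stopped  hybrid-overlay-... hybrid-overlay-node
Running  kubelet            kubelet
"""


def parse_get_service(output: str) -> Dict[str, str]:
    """Map DisplayName -> Status from a default Get-Service table."""
    statuses: Dict[str, str] = {}
    for line in output.splitlines():
        # Columns: Status, Name, DisplayName (DisplayName may contain spaces).
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        status, _name, display_name = parts
        # Skip the header row and the dashed separator row.
        if status == "Status" or status.startswith("-"):
            continue
        statuses[display_name.strip()] = status
    return statuses


def failed_services(
    statuses: Dict[str, str],
    required: Sequence[str] = REQUIRED_SERVICES,
) -> List[str]:
    """Return required services that are not Running (or not listed at all)."""
    return [name for name in required if statuses.get(name) != "Running"]


if __name__ == "__main__":
    failed = failed_services(parse_get_service(SAMPLE))
    # Under this sketch's rule, the node should be treated as not ready
    # whenever the list of failed services is non-empty.
    print("not ready" if failed else "ready", failed)
```

With the output from this report, both required services are flagged: hybrid-overlay-node is Stopped, and kube-proxy does not appear in the listing at all.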

winworker-w8hsg   NotReady                   worker   5s    v1.20.0-1046+5fbfd197c16d3c
winworker-w8hsg   Ready,SchedulingDisabled   worker   79s   v1.20.0-1046+5fbfd197c16d3c

Verified with Windows-2019 on vSphere environment

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 2.0.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2130