Bug 1956412 - WMCO incorrectly shows node as ready after a failed configuration
Summary: WMCO incorrectly shows node as ready after a failed configuration
Keywords:
Status: ON_QA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.7
Hardware: x86_64
OS: Windows
high
high
Target Milestone: ---
: 4.7.z
Assignee: Aravindh Puthiyaparambil
QA Contact: gaoshang
URL:
Whiteboard:
Depends On: 1953692
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-03 16:09 UTC by Aravindh Puthiyaparambil
Modified: 2021-05-14 06:32 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: WMCO was not cordoning the nodes after initial kubelet setup, prematurely making the node available for scheduling Consequence: Pods would get scheduled on the nodes but would not go to Running Fix: Cordon the nodes after initial kubelet setup and uncordon after full configuration Result: Node is no longer prematurely accepts workloads
Clone Of: 1953692
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift windows-machine-config-operator pull 434 0 None closed Bug 1956412: Fix node being shown as Ready after failed config 2021-05-07 14:09:30 UTC

Description Aravindh Puthiyaparambil 2021-05-03 16:09:01 UTC
+++ This bug was initially created as a clone of Bug #1953692 +++

Description of problem:
In scenarios where hybrid-overlay or kube-proxy fails to come up when WMCO is configuring the Windows instances, the node still incorrectly shows as ready. 



How reproducible: Always


Steps to Reproduce:
1. Create a Windows node and force either hybrid-overlay or kube-proxy to fail during configuration

Actual results: Node is shown as ready


Expected results: Node should be marked as not ready

--- Additional comment from Giovanni Fontana on 2021-04-26 21:07:02 UTC ---

Hi Team,

I found this issue during my tests in a VMware lab. Seems that WMCO is not "catching" the error related to hybrid-overlay-node and, due to that, it is incorrectly considering it as *Ready* - when it is not.

It is easy to reproduce if you have a VMware lab. Just prepare the windows image using the wrong OS version: Windows 2019, instead of Windows 1909. You will be able to add it to the OpenShift cluster and it will be shown as Ready, however, when you try to run any pod you will get an error like the one below:

[gfontana@bastion ~]$ oc get pods
win-webserver-864f558d99-h7h8l      0/1     CrashLoopBackOff   9          5m16s

[gfontana@bastion ~]$ oc describe pod win-webserver-864f558d99-h7h8l
Name:           win-webserver-864f558d99-h7h8l
(...)
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       5m56s                  default-scheduler  Successfully assigned netcandystore/win-webserver-864f558d99-h7h8l to gfontana-djglw-worker-windows-m97zk
  Normal   Pulled          5m33s (x4 over 5m54s)  kubelet            Container image "mcr.microsoft.com/windows/servercore:ltsc2019" already present on machine
  Normal   Created         5m33s (x4 over 5m54s)  kubelet            Created container windowswebserver
  Normal   Started         5m32s (x4 over 5m52s)  kubelet            Started container windowswebserver
  Normal   Killing         5m22s (x6 over 5m50s)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  BackOff         5m20s (x3 over 5m27s)  kubelet            Back-off restarting failed container



Looking at the WMCO logs you will notice an error related to *hybrid-overlay-node*:

2021-04-20T19:38:59.469Z	INFO	VM 4201c72e-216a-13d1-69f6-8c6d0573bee7	configured	{"service": "hybrid-overlay-node", "args": "--node gfontana-djglw-worker-windows-bvkc2 --hybrid-overlay-vxlan-port=9898 --k8s-kubeconfig c:\\k\\kubeconfig --windows-service --logfile C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log\" depend= kubelet"}
2021-04-20T19:48:59.486Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "windowsmachine-controller", "request": "openshift-machine-api/gfontana-djglw-worker-windows-bvkc2", "error": "failed to configure Windows VM 4201c72e-216a-13d1-69f6-8c6d0573bee7: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation for gfontana-djglw-worker-windows-bvkc2: timeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation: timed out waiting for the condition", "errorVerbose": "timed out waiting for the condition\ntimeout waiting for k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac node annotation\ngithub.com/openshift/windows-machine-config-operator/pkg/controller/windowsmachine/nodeconfig.(*nodeConfig).waitForNodeAnnotation\n\t/remote-source/build/windows-machine-config-operator/pkg/controller/windowsmachine/nodeconfig/nodeconfig.go:264
...




However, there is no error in the C:\var\log\hybrid-overlay\hybrid-overlay.log file, as you can see below:

PS C:\Users\Administrator> cat C:\\var\\log\\hybrid-overlay\\hybrid-overlay.log
I0420 15:30:48.285616    2152 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 15:30:48.415593    2152 cert_rotation.go:137] Starting client certificate rotation controller
I0420 15:49:45.282316     352 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 15:49:45.294338     352 cert_rotation.go:137] Starting client certificate rotation controller
I0420 15:50:45.527623    3808 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:08:46.544366    4068 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:09:31.197866    2668 service.go:71] Running hybrid-overlay-node as a Windows service
I0420 16:09:31.220846    2668 cert_rotation.go:137] Starting client certificate rotation controller
I0420 16:09:46.786043    2364 cert_rotation.go:137] Starting client certificate rotation controller



You will see that the hybrid-overlay-node is not running:
PS C:\Users\Administrator> Get-Service | ?{ $_.Name -match "kube|overlay|docker" }

Status   Name               DisplayName
------   ----               -----------
Running  docker             Docker Engine
Stopped  hybrid-overlay-... hybrid-overlay-node
Running  kubelet            kubelet


Note You need to log in before you can comment on or make changes to this bug.