Created attachment 1871171 [details] must-gather Description of problem: OVN kube-master controller does not clear the NetworkUnavailableCondition on GCP BYOH Windows node Version-Release number of selected component (if applicable): OCP 4.10.6 on GCP WMCO 5.0.0 How reproducible: Always Steps to Reproduce: 1. Install the Windows Machine Config Operator [1] 2. Configure a secret for WMCO [2] 3. Bring up a Windows VM in GCP using the following command: INFRA_ID=$(oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster) gcloud compute instances create "${INFRA_ID}-windows-worker-1" \ --project=${GCP_PROJECT_ID} \ --zone=${GCP_ZONE} \ --machine-type=n1-standard-4 \ --network-interface=subnet="${INFRA_ID}-worker-subnet,no-address" \ --metadata=^,@^sysprep-specialize-script-ps1=Add-WindowsCapability\ -Online\ -Name\ OpenSSH.Server\~\~\~\~0.0.1.0$'\n'\$firewallRuleName\ =\ \"ContainerLogsPort\"$'\n'\$containerLogsPort\ =\ \"10250\"$'\n'New-NetFirewallRule\ -DisplayName\ \$firewallRuleName\ -Direction\ Inbound\ -Action\ Allow\ -Protocol\ TCP\ -LocalPort\ \$containerLogsPort\ -EdgeTraversalPolicy\ Allow$'\n'Set-Service\ -Name\ sshd\ -StartupType\ \'Automatic\'$'\n'Start-Service\ sshd$'\n'\$pubKeyConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PubkeyAuthentication\ yes\',\'PubkeyAuthentication\ yes\'$'\n'\$pubKeyConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$passwordConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PasswordAuthentication\ yes\',\'PasswordAuthentication\ yes\'$'\n'\$passwordConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$authorizedKeyFilePath\ =\ \"\$env:ProgramData\\ssh\\administrators_authorized_keys\"$'\n'New-Item\ -Force\ \$authorizedKeyFilePath$'\n'echo\ \"ssh-rsa\ $SSH_KEY\"\ \|\ Out-File\ \$authorizedKeyFilePath\ -Encoding\ ascii$'\n'\$acl\ =\ Get-Acl\ 
C:\\ProgramData\\ssh\\administrators_authorized_keys$'\n'\$acl.SetAccessRuleProtection\(\$true,\ \$false\)$'\n'\$administratorsRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"Administrators\",\"FullControl\",\"Allow\"\)$'\n'\$systemRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"SYSTEM\",\"FullControl\",\"Allow\"\)$'\n'\$acl.SetAccessRule\(\$administratorsRule\)$'\n'\$acl.SetAccessRule\(\$systemRule\)$'\n'\$acl\ \|\ Set-Acl$'\n'Restart-Service\ sshd \ --maintenance-policy=MIGRATE \ --provisioning-model=STANDARD \ --service-account="${INFRA_ID}-w@${GCP_PROJECT_ID}.iam.gserviceaccount.com" \ --scopes=https://www.googleapis.com/auth/cloud-platform \ --tags="${INFRA_ID}-worker" \ --create-disk=auto-delete=yes,boot=yes,device-name=aravindh-winc-jmd59-worker-a-,image=projects/windows-cloud/global/images/windows-server-2019-dc-core-for-containers-v20220314,mode=rw,size=128,type=projects/openshift-gce-devel/zones/us-west1-a/diskTypes/pd-balanced \ --no-shielded-secure-boot \ --shielded-vtpm \ --shielded-integrity-monitoring \ --labels="kubernetes-io-cluster-${INFRA_ID}=owned" \ --reservation-affinity=any In step 3 ensure that you use the public counterpart of the SSH key used in step 2. Look for $SSH_KEY in the gcloud command. 4. Create a user, "wmco", using the GCP console [3] 5. Create the windows-instances ConfigMap [4]. Use the internal IP address of the Windows node and set "username" to "wmco" in the ConfigMap's data section. 6. 
Wait for the Windows node object to get created with Status = "Ready" Actual results: oc describe node $windows-node will show that the NetworkUnavailable condition is set to True Expected results: The NetworkUnavailable condition on the Windows node should be set to False Additional info: Here is a snippet of the logs from the ovnkube-master container pertaining to the Windows node: ``` I0406 22:58:06.671036 1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285886 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] I0406 22:58:13.632271 1 node_tracker.go:167] Processing possible switch / router updates for node aravindh-winc-jmd59-windows-worker-a-9q8m4 I0406 22:58:13.634428 1 node_tracker.go:171] Node aravindh-winc-jmd59-windows-worker-a-9q8m4 has invalid / no HostSubnet annotations (probably waiting on initialization): node "aravindh-winc-jmd59-windows-worker-a-9q8m4" has no "k8s.ovn.org/node-subnets" annotation I0406 22:58:13.635666 1 master.go:199] Allocated hybrid overlay HostSubnet 10.132.0.0/24 for node aravindh-winc-jmd59-windows-worker-a-9q8m4 I0406 22:58:13.635850 1 kube.go:99] Setting annotations map[k8s.ovn.org/hybrid-overlay-node-subnet:10.132.0.0/24] on node aravindh-winc-jmd59-windows-worker-a-9q8m4 I0406 22:58:13.638615 1 transact.go:41] Configuring OVN: [{Op:insert Table:Logical_Router_Policy Row:map[action:allow match:ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.128.6/32 priority:101] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996801} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] 
Mutations:[{Column:policies Mutator:insert Value:{GoSet:[{GoUUID:u2596996801}]}}] Timeout:<nil> Where:[where column _uuid == {1929b592-080e-469a-bbd3-c7347a2119d1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] I0406 22:58:13.655758 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:13.659746 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:13.687493 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:14.013440 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:17.608377 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:20.589962 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:20.611656 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:20.627381 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:23.988431 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:24.010521 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:27.532597 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:36.671289 1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285916 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] I0406 22:58:44.725215 1 informer.go:294] Successfully synced 
'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:45.566542 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:58:54.748518 1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4' I0406 22:59:01.730552 1 master.go:368] Setting up logical route policy for hybrid subnet on node: aravindh-winc-jmd59-worker-a-4b56q
```
Note that the "Cleared node NetworkUnavailable/NoRouteCreated condition for" message is missing for the Windows node, and so is the error message "Status update failed for local node" [5].
Full logs are present in the must-gather.
[1] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#installing-the-wmco
[2] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#configuring-secret-for-wmco_enabling-windows-container-workloads
[3] https://cloud.google.com/compute/docs/instances/windows/generating-credentials
[4] https://docs.openshift.com/container-platform/4.10/windows_containers/byoh-windows-instance.html#configuring-byoh-windows-instance
[5] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1292
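The log check described in this report can be sketched as a small scan over saved ovnkube-master log lines. This is only a minimal sketch: the function name `find_condition_messages` is hypothetical, the two message strings are the ones quoted in this report, and the sample lines are copied from the excerpt above.

```python
# Message fragments quoted in this bug report.
CLEARED_MSG = "Cleared node NetworkUnavailable/NoRouteCreated condition for"
FAILED_MSG = "Status update failed for local node"


def find_condition_messages(log_lines, node_name):
    """Return the log lines that mention node_name and contain either the
    'cleared' message or the 'status update failed' message."""
    return [
        line
        for line in log_lines
        if node_name in line and (CLEARED_MSG in line or FAILED_MSG in line)
    ]


if __name__ == "__main__":
    node = "aravindh-winc-jmd59-windows-worker-a-9q8m4"
    # Two lines taken from the ovnkube-master excerpt in this report.
    sample = [
        "I0406 22:58:13.635666 1 master.go:199] Allocated hybrid overlay "
        "HostSubnet 10.132.0.0/24 for node " + node,
        "I0406 22:58:13.655758 1 informer.go:294] Successfully synced '"
        + node + "'",
    ]
    # Neither message appears for the Windows node, so the result is empty.
    print(find_condition_messages(sample, node))
```

An empty result for the Windows node matches what was observed: the controller neither cleared the condition nor logged a failure while trying to.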
I suspect the Windows node is being treated as a NoHostSubnet node [1]. As a result, oc.addNode [2] is never called on it, so oc.clearInitialNodeNetworkUnavailableCondition() [3] is never called for the Windows node either. WDYT?
[1] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1113
[2] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1122
[3] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1119
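To make the suspected control flow concrete, here is a simplified Python model of it. This is not the ovn-kubernetes code (which is Go); every name below is a hypothetical stand-in, and `handle_node_add_possible_fix` shows only one possible shape of a fix, not necessarily what any upstream PR does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Assumption: True for nodes treated as NoHostSubnet (e.g. Windows).
    no_host_subnet: bool
    # Starts as True, as in the report where the condition is initially set.
    network_unavailable: bool = True


def clear_network_unavailable(node):
    """Stand-in for oc.clearInitialNodeNetworkUnavailableCondition()."""
    node.network_unavailable = False


def add_node(node):
    """Stand-in for oc.addNode; clearing the condition happens in here."""
    clear_network_unavailable(node)


def handle_node_add_suspected(node):
    """Suspected behavior: NoHostSubnet nodes return before add_node runs."""
    if node.no_host_subnet:
        return
    add_node(node)


def handle_node_add_possible_fix(node):
    """Hypothetical fix: clear the condition even on the early-return path."""
    if node.no_host_subnet:
        clear_network_unavailable(node)
        return
    add_node(node)


if __name__ == "__main__":
    windows = Node("windows-worker", no_host_subnet=True)
    handle_node_add_suspected(windows)
    print(windows.network_unavailable)  # still True: condition never cleared

    windows_fixed = Node("windows-worker", no_host_subnet=True)
    handle_node_add_possible_fix(windows_fixed)
    print(windows_fixed.network_unavailable)  # False
```

Under this model, a Linux node (`no_host_subnet=False`) has its condition cleared on either path, which is consistent with only the Windows node being affected.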
I created an upstream issue [1] and a potential fix [2].
[1] https://github.com/ovn-org/ovn-kubernetes/issues/2901
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2902
This has been fixed upstream [1].
[1] https://github.com/ovn-org/ovn-kubernetes/pull/2870
*** Bug 2058912 has been marked as a duplicate of this bug. ***
We tried to verify the fix for this issue, but the NetworkUnavailable condition still appears as True: "conditions": [ { "lastHeartbeatTime": null, "lastTransitionTime": "2022-05-05T12:54:32Z", "message": "Node created without a route", "reason": "NoRouteCreated", "status": "True", "type": "NetworkUnavailable" }, I made sure that the patch from @jtanenba was present in the ovn-kubernetes submodule before building the wmco-index image, but I still see the same behavior. I'm attaching some of the logs from ovnkube-master and WMCO. If you need any other logs, please let me know. I also have a reproducer available in case you would like to jump in (ping me on Slack, @jfrancoa).
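The verification check can be sketched as a short script that reads the condition from a node object. This is a minimal sketch, assuming the input is the JSON of a Node object (for example saved from `oc get node <node-name> -o json`); the function name `network_unavailable_status` is hypothetical, and the sample condition is the one quoted in this comment.

```python
import json


def network_unavailable_status(node_json):
    """Return the status string of the NetworkUnavailable condition from a
    Node object's JSON, or None if the condition is not present."""
    node = json.loads(node_json)
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "NetworkUnavailable":
            return cond.get("status")
    return None


if __name__ == "__main__":
    # Minimal Node object wrapping the condition quoted in this comment.
    sample = json.dumps({
        "status": {
            "conditions": [
                {
                    "lastHeartbeatTime": None,
                    "lastTransitionTime": "2022-05-05T12:54:32Z",
                    "message": "Node created without a route",
                    "reason": "NoRouteCreated",
                    "status": "True",
                    "type": "NetworkUnavailable",
                }
            ]
        }
    })
    print(network_unavailable_status(sample))  # prints: True
```

A result of "False" is the expected outcome once the condition has been cleared; "True" means the bug is still reproducing.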
"version": "5.0.0-f409122" "conditions": [ { "lastHeartbeatTime": null, "lastTransitionTime": "2022-05-08T18:12:07Z", "message": "ovn-kube cleared kubelet-set NoRouteCreated", "reason": "RouteCreated", "status": "False", "type": "NetworkUnavailable" },
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days