Bug 2072780

Summary: OVN kube-master does not clear NetworkUnavailableCondition on GCP BYOH Windows node
Product: OpenShift Container Platform Reporter: Aravindh Puthiyaparambil <aravindh>
Component: Networking Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: ovn-kubernetes QA Contact: Ronnie Rasouli <rrasouli>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: anbhat, arnaik, jfrancoa, jtanenba, mapandey, rteague
Version: 4.10   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: x86_64   
OS: Windows   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2079546 (view as bug list) Environment:
Last Closed: 2022-08-10 11:04:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2079546    

Description Aravindh Puthiyaparambil 2022-04-06 23:41:00 UTC
Created attachment 1871171 [details]
must-gather

Description of problem:
OVN kube-master controller does not clear the NetworkUnavailableCondition on GCP BYOH Windows node

Version-Release number of selected component (if applicable):
OCP 4.10.6 on GCP
WMCO 5.0.0

How reproducible: Always

Steps to Reproduce:
1. Install the Windows Machine Config Operator [1]
2. Configure a secret for WMCO [2]
3. Bring up a Windows VM in GCP using the following command:
INFRA_ID=$(oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster)
gcloud compute instances create "${INFRA_ID}-windows-worker-1" \
  --project=${GCP_PROJECT_ID} \
  --zone=${GCP_ZONE} \
  --machine-type=n1-standard-4 \
  --network-interface=subnet="${INFRA_ID}-worker-subnet,no-address" \
  --metadata=^,@^sysprep-specialize-script-ps1=Add-WindowsCapability\ -Online\ -Name\ OpenSSH.Server\~\~\~\~0.0.1.0$'\n'\$firewallRuleName\ =\ \"ContainerLogsPort\"$'\n'\$containerLogsPort\ =\ \"10250\"$'\n'New-NetFirewallRule\ -DisplayName\ \$firewallRuleName\ -Direction\ Inbound\ -Action\ Allow\ -Protocol\ TCP\ -LocalPort\ \$containerLogsPort\ -EdgeTraversalPolicy\ Allow$'\n'Set-Service\ -Name\ sshd\ -StartupType\ \'Automatic\'$'\n'Start-Service\ sshd$'\n'\$pubKeyConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PubkeyAuthentication\ yes\',\'PubkeyAuthentication\ yes\'$'\n'\$pubKeyConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$passwordConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PasswordAuthentication\ yes\',\'PasswordAuthentication\ yes\'$'\n'\$passwordConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$authorizedKeyFilePath\ =\ \"\$env:ProgramData\\ssh\\administrators_authorized_keys\"$'\n'New-Item\ -Force\ \$authorizedKeyFilePath$'\n'echo\ \"ssh-rsa\ $SSH_KEY\"\ \|\ Out-File\ \$authorizedKeyFilePath\ -Encoding\ ascii$'\n'\$acl\ =\ Get-Acl\ C:\\ProgramData\\ssh\\administrators_authorized_keys$'\n'\$acl.SetAccessRuleProtection\(\$true,\ \$false\)$'\n'\$administratorsRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"Administrators\",\"FullControl\",\"Allow\"\)$'\n'\$systemRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"SYSTEM\",\"FullControl\",\"Allow\"\)$'\n'\$acl.SetAccessRule\(\$administratorsRule\)$'\n'\$acl.SetAccessRule\(\$systemRule\)$'\n'\$acl\ \|\ Set-Acl$'\n'Restart-Service\ sshd \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account="${INFRA_ID}-w@${GCP_PROJECT_ID}.iam.gserviceaccount.com" \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags="${INFRA_ID}-worker" \
  --create-disk=auto-delete=yes,boot=yes,device-name=aravindh-winc-jmd59-worker-a-,image=projects/windows-cloud/global/images/windows-server-2019-dc-core-for-containers-v20220314,mode=rw,size=128,type=projects/openshift-gce-devel/zones/us-west1-a/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels="kubernetes-io-cluster-${INFRA_ID}=owned" \
  --reservation-affinity=any

In step 3 ensure that you use the public counterpart of the SSH key used in step 2. Look for $SSH_KEY in the gcloud command.

4. Create a user, "wmco", using the GCP console [4]
5. Create the windows-instances ConfigMap [3]. Use the internal IP address of the Windows node and set "username" to "wmco" in the ConfigMap's data section.
6. Wait for the Windows node object to get created with Status = "Ready"

Actual results:
oc describe node <windows-node> shows that the NetworkUnavailable condition is set to True

Expected results:
The NetworkUnavailable condition on the Windows node should be set to False
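To check the condition programmatically rather than reading the describe output, the node object's JSON can be parsed. The following Go code is a minimal sketch that assumes the output of `oc get node <windows-node> -o json` is available as a byte slice. The helper name networkUnavailableStatus and the trimmed-down node struct are hypothetical and not part of any OpenShift tooling.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node mirrors only the fields of a Kubernetes Node object needed here.
// It is a hypothetical, trimmed-down type for illustration.
type node struct {
	Status struct {
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
			Reason string `json:"reason"`
		} `json:"conditions"`
	} `json:"status"`
}

// networkUnavailableStatus (hypothetical helper) returns the status and reason
// of the NetworkUnavailable condition, with found=false if it is absent.
func networkUnavailableStatus(nodeJSON []byte) (status, reason string, found bool, err error) {
	var n node
	if err = json.Unmarshal(nodeJSON, &n); err != nil {
		return "", "", false, err
	}
	for _, c := range n.Status.Conditions {
		if c.Type == "NetworkUnavailable" {
			return c.Status, c.Reason, true, nil
		}
	}
	return "", "", false, nil
}

func main() {
	// Sample input shaped like the condition seen on the affected node.
	sample := []byte(`{"status":{"conditions":[{"type":"NetworkUnavailable","status":"True","reason":"NoRouteCreated"}]}}`)
	status, reason, found, err := networkUnavailableStatus(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(found, status, reason)
}
```

A status of "True" corresponds to the actual result reported here; "False" is the expected result.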

Additional info:
Here is a snippet of the logs from the ovnkube-master container pertaining to the Windows node:

```
I0406 22:58:06.671036       1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285886 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:13.632271       1 node_tracker.go:167] Processing possible switch / router updates for node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.634428       1 node_tracker.go:171] Node aravindh-winc-jmd59-windows-worker-a-9q8m4 has invalid / no HostSubnet annotations (probably waiting on initialization): node "aravindh-winc-jmd59-windows-worker-a-9q8m4" has no "k8s.ovn.org/node-subnets" annotation
I0406 22:58:13.635666       1 master.go:199] Allocated hybrid overlay HostSubnet 10.132.0.0/24 for node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.635850       1 kube.go:99] Setting annotations map[k8s.ovn.org/hybrid-overlay-node-subnet:10.132.0.0/24] on node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.638615       1 transact.go:41] Configuring OVN: [{Op:insert Table:Logical_Router_Policy Row:map[action:allow match:ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.128.6/32 priority:101] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996801} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:insert Value:{GoSet:[{GoUUID:u2596996801}]}}] Timeout:<nil> Where:[where column _uuid == {1929b592-080e-469a-bbd3-c7347a2119d1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:13.655758       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:13.659746       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:13.687493       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:14.013440       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:17.608377       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.589962       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.611656       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.627381       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:23.988431       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:24.010521       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:27.532597       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:36.671289       1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285916 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:44.725215       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:45.566542       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:54.748518       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:59:01.730552       1 master.go:368] Setting up logical route policy for hybrid subnet on node: aravindh-winc-jmd59-worker-a-4b56q
```

Note that the "Cleared node NetworkUnavailable/NoRouteCreated condition for" message is missing for the Windows node, and so is the error message "Status update failed for local node" [5].

Full logs are present in the must-gather.
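When going through the full ovnkube-master log, the check for the missing message can be automated. The Go code below is a rough sketch. It assumes that the node name appears on the same line as the message text, which is an assumption about the log format, and the function name hasClearedMessage is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// clearedMsg is the message prefix quoted in this report.
const clearedMsg = "Cleared node NetworkUnavailable/NoRouteCreated condition for"

// hasClearedMessage (hypothetical helper) reports whether any log line
// contains both the "cleared" message and the given node name.
// Assumption: the node name is logged on the same line as the message.
func hasClearedMessage(logs, nodeName string) bool {
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, clearedMsg) && strings.Contains(line, nodeName) {
			return true
		}
	}
	return false
}

func main() {
	// One line taken from the log snippet in this report.
	logs := "I0406 22:58:13.655758       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'"
	fmt.Println(hasClearedMessage(logs, "aravindh-winc-jmd59-windows-worker-a-9q8m4"))
}
```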

[1] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#installing-the-wmco
[2] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#configuring-secret-for-wmco_enabling-windows-container-workloads
[3] https://docs.openshift.com/container-platform/4.10/windows_containers/byoh-windows-instance.html#configuring-byoh-windows-instance
[4] https://cloud.google.com/compute/docs/instances/windows/generating-credentials
[5] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1292

Comment 1 Aravindh Puthiyaparambil 2022-04-07 00:04:13 UTC
I suspect that the Windows node is being treated as a NoHostSubnet node [1]. As a result, oc.addNode [2] never gets called on it, and so oc.clearInitialNodeNetworkUnavailableCondition() [3] never gets called for the Windows node either. WDYT?

[1] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1113
[2] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1122
[3] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1119
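The suspected control flow can be illustrated with a toy model. The Go code below is only a sketch of the hypothesis in this comment. All type and function names (fakeNode, handleNodeAdd, addNode) are hypothetical simplifications and do not reproduce the ovn-kubernetes source.

```go
package main

import "fmt"

// fakeNode is a hypothetical stand-in for a Kubernetes Node; it is not the
// type used by ovn-kubernetes.
type fakeNode struct {
	Name               string
	NoHostSubnet       bool // true for hybrid-overlay nodes such as Windows nodes
	NetworkUnavailable bool // true when the node is created without a route
}

// addNode stands in for the regular node setup path. In this model it is the
// only place where the NetworkUnavailable condition gets cleared.
func addNode(n *fakeNode) {
	n.NetworkUnavailable = false
}

// handleNodeAdd models the suspected behaviour: NoHostSubnet nodes return
// early, so addNode never runs for them and the condition stays set.
func handleNodeAdd(n *fakeNode) {
	if n.NoHostSubnet {
		return
	}
	addNode(n)
}

func main() {
	linux := &fakeNode{Name: "linux-worker", NetworkUnavailable: true}
	windows := &fakeNode{Name: "windows-worker", NoHostSubnet: true, NetworkUnavailable: true}
	for _, n := range []*fakeNode{linux, windows} {
		handleNodeAdd(n)
		fmt.Printf("%s NetworkUnavailable=%v\n", n.Name, n.NetworkUnavailable)
	}
}
```

In this model, clearing the condition before the early return (or independently of addNode) would cover the Windows node as well; whether that matches the actual fix is not shown here.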

Comment 2 Aravindh Puthiyaparambil 2022-04-07 22:37:28 UTC
I created an upstream issue [1] and a potential fix [2]

[1] https://github.com/ovn-org/ovn-kubernetes/issues/2901
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2902

Comment 4 Aravindh Puthiyaparambil 2022-04-08 18:16:57 UTC
This has been fixed upstream [1]

[1] https://github.com/ovn-org/ovn-kubernetes/pull/2870

Comment 8 Jacob Tanenbaum 2022-05-02 18:15:37 UTC
*** Bug 2058912 has been marked as a duplicate of this bug. ***

Comment 10 Jose Luis Franco 2022-05-05 13:23:52 UTC
We tried to validate this issue, but the NetworkUnavailable condition still appears as True:

        "conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2022-05-05T12:54:32Z",
                "message": "Node created without a route",
                "reason": "NoRouteCreated",
                "status": "True",
                "type": "NetworkUnavailable"
            },

I made sure that the patch from @jtanenba was present in the ovn-kubernetes submodule before building the wmco-index image, but I still see the same behavior.

I'm attaching some of the logs from ovnkube-master and WMCO. If you need any other logs, please let me know. I also have a reproducer available in case you would like to jump in (ping me on Slack, @jfrancoa).

Comment 15 Ronnie Rasouli 2022-05-08 18:23:52 UTC
"version": "5.0.0-f409122"

 "conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2022-05-08T18:12:07Z",
                "message": "ovn-kube cleared kubelet-set NoRouteCreated",
                "reason": "RouteCreated",
                "status": "False",
                "type": "NetworkUnavailable"
            },

Comment 16 Ronnie Rasouli 2022-05-08 18:24:45 UTC
 "conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2022-05-08T18:12:07Z",
                "message": "ovn-kube cleared kubelet-set NoRouteCreated",
                "reason": "RouteCreated",
                "status": "False",
                "type": "NetworkUnavailable"
            },

Comment 19 errata-xmlrpc 2022-08-10 11:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 20 Red Hat Bugzilla 2023-09-15 01:53:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days