Bug 2072780 - OVN kube-master does not clear NetworkUnavailableCondition on GCP BYOH Windows node [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Windows
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jacob Tanenbaum
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Duplicates: 2058912
Depends On:
Blocks: 2079546
Reported: 2022-04-06 23:41 UTC by Aravindh Puthiyaparambil
Modified: 2022-08-10 11:04 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2079546
Environment:
Last Closed: 2022-08-10 11:04:00 UTC
Target Upstream Version:
jfrancoa: needinfo? (jtanenba)



Links
System ID: Github openshift ovn-kubernetes pull 1064
Private: 0
Priority: None
Status: open
Summary: Bug 2079439: [DownstreamMerge] 4-29-22
Last Updated: 2022-05-02 13:36:12 UTC

Description Aravindh Puthiyaparambil 2022-04-06 23:41:00 UTC
Created attachment 1871171: must-gather

Description of problem:
OVN kube-master controller does not clear the NetworkUnavailableCondition on GCP BYOH Windows node

Version-Release number of selected component (if applicable):
OCP 4.10.6 on GCP
WMCO 5.0.0

How reproducible: Always

Steps to Reproduce:
1. Install the Windows Machine Config Operator [1]
2. Configure a secret for WMCO [2]
3. Bring up a Windows VM in GCP using the following command:
INFRA_ID=$(oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster)
gcloud compute instances create "${INFRA_ID}-windows-worker-1" \
  --project=${GCP_PROJECT_ID} \
  --zone=${GCP_ZONE} \
  --machine-type=n1-standard-4 \
  --network-interface=subnet="${INFRA_ID}-worker-subnet,no-address" \
  --metadata=^,@^sysprep-specialize-script-ps1=Add-WindowsCapability\ -Online\ -Name\ OpenSSH.Server\~\~\~\~0.0.1.0$'\n'\$firewallRuleName\ =\ \"ContainerLogsPort\"$'\n'\$containerLogsPort\ =\ \"10250\"$'\n'New-NetFirewallRule\ -DisplayName\ \$firewallRuleName\ -Direction\ Inbound\ -Action\ Allow\ -Protocol\ TCP\ -LocalPort\ \$containerLogsPort\ -EdgeTraversalPolicy\ Allow$'\n'Set-Service\ -Name\ sshd\ -StartupType\ \'Automatic\'$'\n'Start-Service\ sshd$'\n'\$pubKeyConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PubkeyAuthentication\ yes\',\'PubkeyAuthentication\ yes\'$'\n'\$pubKeyConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$passwordConf\ =\ \(Get-Content\ -path\ C:\\ProgramData\\ssh\\sshd_config\)\ -replace\ \'\#PasswordAuthentication\ yes\',\'PasswordAuthentication\ yes\'$'\n'\$passwordConf\ \|\ Set-Content\ -Path\ C:\\ProgramData\\ssh\\sshd_config$'\n'\$authorizedKeyFilePath\ =\ \"\$env:ProgramData\\ssh\\administrators_authorized_keys\"$'\n'New-Item\ -Force\ \$authorizedKeyFilePath$'\n'echo\ \"ssh-rsa\ $SSH_KEY\"\ \|\ Out-File\ \$authorizedKeyFilePath\ -Encoding\ ascii$'\n'\$acl\ =\ Get-Acl\ C:\\ProgramData\\ssh\\administrators_authorized_keys$'\n'\$acl.SetAccessRuleProtection\(\$true,\ \$false\)$'\n'\$administratorsRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"Administrators\",\"FullControl\",\"Allow\"\)$'\n'\$systemRule\ =\ New-Object\ system.security.accesscontrol.filesystemaccessrule\(\"SYSTEM\",\"FullControl\",\"Allow\"\)$'\n'\$acl.SetAccessRule\(\$administratorsRule\)$'\n'\$acl.SetAccessRule\(\$systemRule\)$'\n'\$acl\ \|\ Set-Acl$'\n'Restart-Service\ sshd \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --service-account="${INFRA_ID}-w@${GCP_PROJECT_ID}.iam.gserviceaccount.com" \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags="${INFRA_ID}-worker" \
  --create-disk=auto-delete=yes,boot=yes,device-name=aravindh-winc-jmd59-worker-a-,image=projects/windows-cloud/global/images/windows-server-2019-dc-core-for-containers-v20220314,mode=rw,size=128,type=projects/openshift-gce-devel/zones/us-west1-a/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels="kubernetes-io-cluster-${INFRA_ID}=owned" \
  --reservation-affinity=any

In step 3 ensure that you use the public counterpart of the SSH key used in step 2. Look for $SSH_KEY in the gcloud command.

4. Create a user, "wmco", using the GCP console [4]
5. Create the windows-instances ConfigMap [3]. Use the internal IP address of the Windows node and set "username" to "wmco" in the ConfigMap's data section.
6. Wait for the Windows node object to get created with Status = "Ready"

Actual results:
Running "oc describe node" against the Windows node shows that the NetworkUnavailable condition is set to True

Expected results:
The NetworkUnavailable condition on the Windows node should be set to False
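
The actual and expected results above can also be checked from the node object's JSON. The following is a minimal sketch, assuming the node has been dumped with something like "oc get node <name> -o json"; the helper name network_unavailable is hypothetical, and the sample values are the ones reported later in this bug.

```python
import json


def network_unavailable(node: dict):
    """Return the NetworkUnavailable condition of a node object, or None if absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "NetworkUnavailable":
            return cond
    return None


# Trimmed sample shaped like a node object; only the fields used here are kept.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "NetworkUnavailable", "status": "True",
   "reason": "NoRouteCreated", "message": "Node created without a route"}
]}}
""")

cond = network_unavailable(sample)
if cond is None:
    print("NetworkUnavailable condition not present")
else:
    print(f"NetworkUnavailable={cond['status']} reason={cond['reason']}")
```

A status of "True" here corresponds to the actual result; "False" is the expected one.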

Additional info:
Here is a snippet of the logs from the ovnkube-master container pertaining to the Windows node:

```
I0406 22:58:06.671036       1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285886 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:13.632271       1 node_tracker.go:167] Processing possible switch / router updates for node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.634428       1 node_tracker.go:171] Node aravindh-winc-jmd59-windows-worker-a-9q8m4 has invalid / no HostSubnet annotations (probably waiting on initialization): node "aravindh-winc-jmd59-windows-worker-a-9q8m4" has no "k8s.ovn.org/node-subnets" annotation
I0406 22:58:13.635666       1 master.go:199] Allocated hybrid overlay HostSubnet 10.132.0.0/24 for node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.635850       1 kube.go:99] Setting annotations map[k8s.ovn.org/hybrid-overlay-node-subnet:10.132.0.0/24] on node aravindh-winc-jmd59-windows-worker-a-9q8m4
I0406 22:58:13.638615       1 transact.go:41] Configuring OVN: [{Op:insert Table:Logical_Router_Policy Row:map[action:allow match:ip4.src == 10.128.0.0/14 && ip4.dst == 10.0.128.6/32 priority:101] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:u2596996801} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:policies Mutator:insert Value:{GoSet:[{GoUUID:u2596996801}]}}] Timeout:<nil> Where:[where column _uuid == {1929b592-080e-469a-bbd3-c7347a2119d1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:13.655758       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:13.659746       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:13.687493       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:14.013440       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:17.608377       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.589962       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.611656       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:20.627381       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:23.988431       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:24.010521       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:27.532597       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:36.671289       1 transact.go:41] Configuring OVN: [{Op:update Table:NB_Global Row:map[options:{GoMap:map[controller_event:true e2e_timestamp:1649285916 mac_prefix:2e:6e:53 max_tunid:16711680 northd_internal_version:21.12.1-20.21.0-58.3 northd_probe_interval:5000 svc_monitor_mac:ba:0c:80:73:f2:a3 use_logical_dp_groups:true]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {8a07bfde-1d2a-4b25-8a66-b926248720e1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0406 22:58:44.725215       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:45.566542       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:58:54.748518       1 informer.go:294] Successfully synced 'aravindh-winc-jmd59-windows-worker-a-9q8m4'
I0406 22:59:01.730552       1 master.go:368] Setting up logical route policy for hybrid subnet on node: aravindh-winc-jmd59-worker-a-4b56q
```

Note that the log message "Cleared node NetworkUnavailable/NoRouteCreated condition for" is missing for the Windows node, and so is the error message "Status update failed for local node" [5].

Full logs are present in the must-gather.
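
The absence of both messages can be confirmed mechanically across the full log. This is a rough sketch, assuming the node name appears on the same log line as the message (the exact line format is not shown in this report); the function name condition_log_lines is hypothetical, and the two sample lines are copied from the snippet above.

```python
CLEARED_MSG = "Cleared node NetworkUnavailable/NoRouteCreated condition for"
FAILED_MSG = "Status update failed for local node"


def condition_log_lines(log_lines, node_name):
    """Return lines that mention node_name together with either message."""
    return [
        line
        for line in log_lines
        if node_name in line and (CLEARED_MSG in line or FAILED_MSG in line)
    ]


WINDOWS_NODE = "aravindh-winc-jmd59-windows-worker-a-9q8m4"

# Two lines taken from the ovnkube-master snippet in this report.
sample_log = [
    "I0406 22:58:13.635666       1 master.go:199] Allocated hybrid overlay "
    "HostSubnet 10.132.0.0/24 for node " + WINDOWS_NODE,
    "I0406 22:58:13.655758       1 informer.go:294] Successfully synced "
    "'" + WINDOWS_NODE + "'",
]

matches = condition_log_lines(sample_log, WINDOWS_NODE)
print(f"{len(matches)} matching line(s) for {WINDOWS_NODE}")
```

To run it against the real logs, replace sample_log with the lines of the ovnkube-master container log extracted from the must-gather.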

[1] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#installing-the-wmco
[2] https://docs.openshift.com/container-platform/4.10/windows_containers/enabling-windows-container-workloads.html#configuring-secret-for-wmco_enabling-windows-container-workloads
[3] https://docs.openshift.com/container-platform/4.10/windows_containers/byoh-windows-instance.html#configuring-byoh-windows-instance
[4] https://cloud.google.com/compute/docs/instances/windows/generating-credentials
[5] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1292

Comment 1 Aravindh Puthiyaparambil 2022-04-07 00:04:13 UTC
I suspect that the Windows node is being treated as a NoHostSubnet node [1]. Because of that, oc.addNode [2] never gets called on it, and as a result oc.clearInitialNodeNetworkUnavailableCondition() [3] never gets called for the Windows node either. WDYT?

[1] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1113
[2] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/ovn.go#L1122
[3] https://github.com/openshift/ovn-kubernetes/blob/50f9cb61c82697186a9844c4fffab093c00fbe5a/go-controller/pkg/ovn/master.go#L1119
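
The suspected control flow can be modelled in a few lines. This is a simplified illustration only, not the ovn-kubernetes code: every name below (Node, add_node, handle_node_add_suspected, handle_node_add_alternative) is a hypothetical stand-in, and the second handler shows one conceivable remedy, not necessarily what the upstream fix does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    no_host_subnet: bool              # True for a hybrid-overlay node, e.g. Windows
    network_unavailable: bool = True  # condition present when the node is created


def clear_network_unavailable(node: Node) -> None:
    """Stand-in for oc.clearInitialNodeNetworkUnavailableCondition()."""
    node.network_unavailable = False


def add_node(node: Node) -> None:
    """Stand-in for oc.addNode: the condition is cleared as part of node setup."""
    clear_network_unavailable(node)


def handle_node_add_suspected(node: Node) -> None:
    """Suspected behaviour: NoHostSubnet nodes return before add_node runs."""
    if node.no_host_subnet:
        return
    add_node(node)


def handle_node_add_alternative(node: Node) -> None:
    """Hypothetical remedy: clear the condition for NoHostSubnet nodes as well."""
    if node.no_host_subnet:
        clear_network_unavailable(node)
        return
    add_node(node)


windows = Node("windows-worker", no_host_subnet=True)
linux = Node("linux-worker", no_host_subnet=False)
handle_node_add_suspected(windows)
handle_node_add_suspected(linux)
print(windows.name, "NetworkUnavailable:", windows.network_unavailable)
print(linux.name, "NetworkUnavailable:", linux.network_unavailable)
```

In this model the early return is the only reason the Windows node keeps the condition, which matches the missing log messages noted in the description.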

Comment 2 Aravindh Puthiyaparambil 2022-04-07 22:37:28 UTC
I created an upstream issue [1] and a potential fix [2]

[1] https://github.com/ovn-org/ovn-kubernetes/issues/2901
[2] https://github.com/ovn-org/ovn-kubernetes/pull/2902

Comment 4 Aravindh Puthiyaparambil 2022-04-08 18:16:57 UTC
This has been fixed upstream [1]

[1] https://github.com/ovn-org/ovn-kubernetes/pull/2870

Comment 8 Jacob Tanenbaum 2022-05-02 18:15:37 UTC
*** Bug 2058912 has been marked as a duplicate of this bug. ***

Comment 10 Jose Luis Franco 2022-05-05 13:23:52 UTC
We tried to validate the fix for this issue, but the NetworkUnavailable condition still appears as True:

        "conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2022-05-05T12:54:32Z",
                "message": "Node created without a route",
                "reason": "NoRouteCreated",
                "status": "True",
                "type": "NetworkUnavailable"
            },

I made sure that the patch from jtanenba@redhat.com was present in the ovn-kubernetes submodule before building the wmco-index image; however, I still see the same behavior.

I'm attaching some of the logs from ovnkube-master and wmco. If you need any other logs, please let me know. I also have a reproducer available in case you would like to jump in (ping me on Slack, @jfrancoa).

Comment 15 Ronnie Rasouli 2022-05-08 18:23:52 UTC
"version": "5.0.0-f409122"

 "conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2022-05-08T18:12:07Z",
                "message": "ovn-kube cleared kubelet-set NoRouteCreated",
                "reason": "RouteCreated",
                "status": "False",
                "type": "NetworkUnavailable"
            },

Comment 19 errata-xmlrpc 2022-08-10 11:04:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

