Bug 1888959

Summary: Upgrade of a baremetal IPI environment from 4.5.15 to 4.6.0-rc.4 succeeds but worker MCP is left in updating state with worker nodes not upgraded
Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Keywords: TestBlocker
Reporter: Marius Cornea <mcornea>
Assignee: Tim Rozet <trozet>
QA Contact: Anurag saxena <anusaxen>
CC: aconstan, bbennett, sasha, yprokule
Type: Bug
Regression: ---
Last Closed: 2020-10-21 19:55:42 UTC
Bug Blocks: 1889393

Description Marius Cornea 2020-10-16 15:16:54 UTC
Description of problem:

Upgrade of a baremetal IPI environment from 4.5.15 to 4.6.0-rc.4 succeeds, but the worker MCP is left in the updating state and the worker nodes are not upgraded.


Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
100%

Steps to Reproduce:
1. Deploy a 4.5.15 cluster with the OVNKubernetes hybrid overlay enabled via the baremetal IPI flow (an example hybrid overlay manifest is included after these steps)

2. Upgrade to 4.6.0-rc.4

3. Wait for cluster version to report 4.6.0-rc.4:
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         80m     Cluster version is 4.6.0-rc.4

4. Check the worker MachineConfigPool (oc get mcp worker)

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-df9ea06b02c85c66de6b74b08934c33c   False     True       False      2              0                   0                     0                      3h44m
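
For reference, the hybrid overlay from step 1 is typically enabled at install time with an extra manifest for the cluster Network operator. The sketch below is only illustrative; the CIDR/hostPrefix values are not necessarily the ones used in this environment:

cat <<EOF > manifests/cluster-network-03-config.yml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      hybridOverlayConfig:
        # additional overlay network for hybrid (e.g. Windows) nodes; values below are illustrative
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
EOF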


Actual results:
The worker MCP is still updating and READYMACHINECOUNT stays at 0.

Expected results:
The worker MCP gets updated and the worker nodes pick up all of the changes introduced in 4.6, such as the br-ex OVS bridge.
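
One quick way to check whether a worker actually picked up the br-ex bridge is to list the OVS bridges on the node (a sketch; worker-0-0 is the node name from this environment):

oc debug node/worker-0-0 -- chroot /host ovs-vsctl list-br

On a node that has been converted to the 4.6 networking layout the output should include br-ex in addition to br-int.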

Additional info:

Investigating, we can see in the machine-config-daemon logs on the first worker node that it cannot reach the kubernetes service cluster IP:

oc -n openshift-machine-config-operator get pods -o wide | grep worker-0-0
machine-config-daemon-pqtgm                 2/2     Running   0          114m   192.168.123.117   worker-0-0   <none>           <none>

oc -n openshift-machine-config-operator logs machine-config-daemon-pqtgm -c machine-config-daemon | tail -5

Trace[92342574]: [30.000806126s] [30.000806126s] END
E1016 14:55:42.400670  168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:56:49.026702  168893 trace.go:205] Trace[109723740]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Oct-2020 14:56:19.025) (total time: 30000ms):
Trace[109723740]: [30.000770474s] [30.000770474s] END
E1016 14:56:49.026745  168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
[kni@provisionhost-0-0 ~]$ oc -n openshift-machine-config-operator logs machine-config-daemon-pqtgm -c machine-config-daemon | tail -10
E1016 14:55:20.146954  168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:55:42.400630  168893 trace.go:205] Trace[92342574]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (16-Oct-2020 14:55:12.399) (total time: 30000ms):
Trace[92342574]: [30.000806126s] [30.000806126s] END
E1016 14:55:42.400670  168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:56:49.026702  168893 trace.go:205] Trace[109723740]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Oct-2020 14:56:19.025) (total time: 30000ms):
Trace[109723740]: [30.000770474s] [30.000770474s] END
E1016 14:56:49.026745  168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:57:03.781730  168893 trace.go:205] Trace[1021552332]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (16-Oct-2020 14:56:33.780) (total time: 30000ms):
Trace[1021552332]: [30.000879015s] [30.000879015s] END
E1016 14:57:03.781773  168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
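
(172.30.0.1:443 in these errors is the ClusterIP of the default kubernetes service, which can be confirmed with: oc -n default get svc kubernetes)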

When trying to curl the cluster IP from all nodes, we can see that worker-0-0 is the only one failing:

for node in $(oc get nodes --no-headers=true -o custom-columns=NAME:.metadata.name); do
  ssh core@$node 'hostname; echo; curl -skI https://172.30.0.1:443 --connect-timeout 5; echo ======================'
done
master-0-0

HTTP/2 403 
audit-id: c238a1a7-bcf3-471f-8d6d-58482a23a237
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:24 GMT

======================
master-0-1

HTTP/2 403 
audit-id: 4f471624-77dc-4891-b753-24ac30fecffd
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:24 GMT

======================
master-0-2

HTTP/2 403 
audit-id: 5bb0e82f-210a-4350-bb3b-0f6b7faa7153
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:25 GMT

======================
worker-0-0

======================
worker-0-1

HTTP/2 403 
audit-id: 40bf37f5-17ae-4144-a0d6-a8b1e41066ca
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:31 GMT

======================

Looking at the OVN logs, we can see the following recurring errors in the ovs-node-qhdtd pod log:

2020-10-16T15:09:36.214Z|00027|ofproto_dpif_xlate(handler14)|WARN|Dropped 30 log messages in last 304 seconds (most recently, 304 seconds ago) due to excessive rate
2020-10-16T15:09:36.214Z|00028|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00029|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00030|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00031|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.215Z|00032|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
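
The looping ARP reported above can be inspected with an OpenFlow trace inside the OVS pod (a sketch; the flow fields are copied from the log line, and the pod is assumed to run in the openshift-ovn-kubernetes namespace with access to the vswitchd control socket):

oc -n openshift-ovn-kubernetes exec ovs-node-qhdtd -- ovs-appctl ofproto/trace br-int 'in_port=1,arp,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00'

The trace output lists the flow table resubmissions and should show which rules keep re-circulating the packet until the 64-level translation depth limit is hit.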


Attaching must-gather.

Comment 2 Marius Cornea 2020-10-16 15:23:17 UTC
Note also that all cluster operators report as available at version 4.6:

oc get clusteroperators
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.4   True        False         False      112m
cloud-credential                           4.6.0-rc.4   True        False         False      4h31m
cluster-autoscaler                         4.6.0-rc.4   True        False         False      4h9m
config-operator                            4.6.0-rc.4   True        False         False      4h9m
console                                    4.6.0-rc.4   True        False         False      123m
csi-snapshot-controller                    4.6.0-rc.4   True        False         False      128m
dns                                        4.6.0-rc.4   True        False         False      142m
etcd                                       4.6.0-rc.4   True        False         False      4h12m
image-registry                             4.6.0-rc.4   True        False         False      147m
ingress                                    4.6.0-rc.4   True        False         False      158m
insights                                   4.6.0-rc.4   True        False         False      4h9m
kube-apiserver                             4.6.0-rc.4   True        False         False      4h12m
kube-controller-manager                    4.6.0-rc.4   True        False         False      4h12m
kube-scheduler                             4.6.0-rc.4   True        False         False      4h10m
kube-storage-version-migrator              4.6.0-rc.4   True        False         False      3h18m
machine-api                                4.6.0-rc.4   True        False         False      4h1m
machine-approver                           4.6.0-rc.4   True        False         False      4h11m
machine-config                             4.6.0-rc.4   True        False         False      112m
marketplace                                4.6.0-rc.4   True        False         False      117m
monitoring                                 4.6.0-rc.4   True        False         False      116m
network                                    4.6.0-rc.4   True        False         False      4h14m
node-tuning                                4.6.0-rc.4   True        False         False      158m
openshift-apiserver                        4.6.0-rc.4   True        False         False      112m
openshift-controller-manager               4.6.0-rc.4   True        False         False      4h9m
openshift-samples                          4.6.0-rc.4   True        False         False      108m
operator-lifecycle-manager                 4.6.0-rc.4   True        False         False      4h13m
operator-lifecycle-manager-catalog         4.6.0-rc.4   True        False         False      4h13m
operator-lifecycle-manager-packageserver   4.6.0-rc.4   True        False         False      117m
service-ca                                 4.6.0-rc.4   True        False         False      4h14m
storage                                    4.6.0-rc.4   True        False         False      158m

Comment 4 Tim Rozet 2020-10-21 19:55:42 UTC

*** This bug has been marked as a duplicate of bug 1880591 ***