Description of problem:

Upgrade of a baremetal IPI environment from 4.5.15 to 4.6.0-rc.4 succeeds, but the worker MCP is left in an updating state and the worker nodes are never upgraded.

Version-Release number of selected component (if applicable): 4.6.0-rc.4

How reproducible: 100%

Steps to Reproduce:
1. Deploy a 4.5.15 cluster with the OVNKubernetes hybrid overlay enabled via the baremetal IPI flow
2. Upgrade to 4.6.0-rc.4
3. Wait for the cluster version to report 4.6.0-rc.4:

NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         80m     Cluster version is 4.6.0-rc.4

4. Check the worker MCP:

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-df9ea06b02c85c66de6b74b08934c33c   False     True       False      2              0                   0                     0                      3h44m

Actual results:
The worker MCP is still updating and READYMACHINECOUNT=0.

Expected results:
The worker MCP is updated and the worker nodes receive all the changes introduced in 4.6, such as the br-ex OVS bridge.

Additional info:

Investigating, we can see in the machine-config-daemon logs on the first worker node that it cannot reach the kubernetes cluster IP:

oc -n openshift-machine-config-operator get pods -o wide | grep worker-0-0
machine-config-daemon-pqtgm   2/2   Running   0   114m   192.168.123.117   worker-0-0   <none>   <none>

oc -n openshift-machine-config-operator logs machine-config-daemon-pqtgm -c machine-config-daemon | tail -5
Trace[92342574]: [30.000806126s] [30.000806126s] END
E1016 14:55:42.400670 168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:56:49.026702 168893 trace.go:205] Trace[109723740]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Oct-2020 14:56:19.025) (total time: 30000ms):
Trace[109723740]: [30.000770474s] [30.000770474s] END
E1016 14:56:49.026745 168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

[kni@provisionhost-0-0 ~]$ oc -n openshift-machine-config-operator logs machine-config-daemon-pqtgm -c machine-config-daemon | tail -10
E1016 14:55:20.146954 168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:55:42.400630 168893 trace.go:205] Trace[92342574]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (16-Oct-2020 14:55:12.399) (total time: 30000ms):
Trace[92342574]: [30.000806126s] [30.000806126s] END
E1016 14:55:42.400670 168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:56:49.026702 168893 trace.go:205] Trace[109723740]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (16-Oct-2020 14:56:19.025) (total time: 30000ms):
Trace[109723740]: [30.000770474s] [30.000770474s] END
E1016 14:56:49.026745 168893 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1016 14:57:03.781730 168893 trace.go:205] Trace[1021552332]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (16-Oct-2020 14:56:33.780) (total time: 30000ms):
Trace[1021552332]: [30.000879015s] [30.000879015s] END
E1016 14:57:03.781773 168893 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
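In parallel, the standard machineconfiguration.openshift.io node annotations can be used to confirm that the workers never picked up the new rendered config. A minimal sketch (the label selector and jsonpath expressions below are only illustrative):

# Rendered config the worker pool is targeting
oc get mcp worker -o jsonpath='{.spec.configuration.name}{"\n"}'

# Config and MCD state reported by each worker node
for node in $(oc get nodes -l node-role.kubernetes.io/worker --no-headers -o custom-columns=NAME:.metadata.name); do
  echo "== $node"
  oc get node "$node" -o jsonpath='current={.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}desired={.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}state={.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'
done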
When trying to curl the cluster address from all nodes, we can see that worker-0-0 is the only one failing:

for node in $(oc get nodes --no-headers=true -o custom-columns=NAME:.metadata.name); do ssh core@$node 'hostname;echo; curl -skI https://172.30.0.1:443 --connect-timeout 5; echo ======================';done
master-0-0
HTTP/2 403
audit-id: c238a1a7-bcf3-471f-8d6d-58482a23a237
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:24 GMT
======================
master-0-1
HTTP/2 403
audit-id: 4f471624-77dc-4891-b753-24ac30fecffd
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:24 GMT
======================
master-0-2
HTTP/2 403
audit-id: 5bb0e82f-210a-4350-bb3b-0f6b7faa7153
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:25 GMT
======================
worker-0-0
======================
worker-0-1
HTTP/2 403
audit-id: 40bf37f5-17ae-4144-a0d6-a8b1e41066ca
cache-control: no-cache, private
content-type: application/json
x-content-type-options: nosniff
x-kubernetes-pf-flowschema-uid: 1adfdee3-739c-4874-b808-a3c597472add
x-kubernetes-pf-prioritylevel-uid: f28ae5bc-96e4-45b9-b392-fd6886601179
content-length: 234
date: Fri, 16 Oct 2020 15:03:31 GMT
======================
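Since the expected 4.6 change on the workers is the creation of the br-ex OVS bridge, it is also worth listing the bridges that actually exist on the failing node versus a healthy one. A quick sketch, assuming SSH access as the core user as in the loop above:

# Bridges and ports on the node that cannot reach 172.30.0.1
ssh core@worker-0-0 'sudo ovs-vsctl list-br; sudo ovs-vsctl show'

# Same check on a worker that can still reach the cluster IP, for comparison
ssh core@worker-0-1 'sudo ovs-vsctl list-br'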
Looking at the OVN logs, we can see the following recurring errors in the ovs-node-qhdtd pod log:

2020-10-16T15:09:36.214Z|00027|ofproto_dpif_xlate(handler14)|WARN|Dropped 30 log messages in last 304 seconds (most recently, 304 seconds ago) due to excessive rate
2020-10-16T15:09:36.214Z|00028|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00029|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00030|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.214Z|00031|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00
2020-10-16T15:09:36.215Z|00032|ofproto_dpif_xlate(handler14)|WARN|over max translation depth 64 on bridge br-int while processing arp,in_port=1,vlan_tci=0x0000,dl_src=0a:58:a9:fe:00:01,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=169.254.0.1,arp_tpa=169.254.11.119,arp_op=1,arp_sha=0a:58:a9:fe:00:01,arp_tha=00:00:00:00:00:00

Attaching must-gather.
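To cross-check the must-gather, the same OVS warning can be grepped from every ovs-node pod rather than just ovs-node-qhdtd. A rough sketch, assuming these pods run in the openshift-ovn-kubernetes namespace on this cluster:

# Grep the recent logs of every ovs-node pod for the translation-depth warning
for pod in $(oc -n openshift-ovn-kubernetes get pods -o name | grep ovs-node); do
  echo "== $pod"
  oc -n openshift-ovn-kubernetes logs "$pod" --all-containers --tail=200 | grep 'over max translation depth' || true
done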
Note also that all cluster operators report as available at the 4.6 version:

oc get clusteroperators
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-rc.4   True        False         False      112m
cloud-credential                           4.6.0-rc.4   True        False         False      4h31m
cluster-autoscaler                         4.6.0-rc.4   True        False         False      4h9m
config-operator                            4.6.0-rc.4   True        False         False      4h9m
console                                    4.6.0-rc.4   True        False         False      123m
csi-snapshot-controller                    4.6.0-rc.4   True        False         False      128m
dns                                        4.6.0-rc.4   True        False         False      142m
etcd                                       4.6.0-rc.4   True        False         False      4h12m
image-registry                             4.6.0-rc.4   True        False         False      147m
ingress                                    4.6.0-rc.4   True        False         False      158m
insights                                   4.6.0-rc.4   True        False         False      4h9m
kube-apiserver                             4.6.0-rc.4   True        False         False      4h12m
kube-controller-manager                    4.6.0-rc.4   True        False         False      4h12m
kube-scheduler                             4.6.0-rc.4   True        False         False      4h10m
kube-storage-version-migrator              4.6.0-rc.4   True        False         False      3h18m
machine-api                                4.6.0-rc.4   True        False         False      4h1m
machine-approver                           4.6.0-rc.4   True        False         False      4h11m
machine-config                             4.6.0-rc.4   True        False         False      112m
marketplace                                4.6.0-rc.4   True        False         False      117m
monitoring                                 4.6.0-rc.4   True        False         False      116m
network                                    4.6.0-rc.4   True        False         False      4h14m
node-tuning                                4.6.0-rc.4   True        False         False      158m
openshift-apiserver                        4.6.0-rc.4   True        False         False      112m
openshift-controller-manager               4.6.0-rc.4   True        False         False      4h9m
openshift-samples                          4.6.0-rc.4   True        False         False      108m
operator-lifecycle-manager                 4.6.0-rc.4   True        False         False      4h13m
operator-lifecycle-manager-catalog         4.6.0-rc.4   True        False         False      4h13m
operator-lifecycle-manager-packageserver   4.6.0-rc.4   True        False         False      117m
service-ca                                 4.6.0-rc.4   True        False         False      4h14m
storage                                    4.6.0-rc.4   True        False         False      158m
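Even though machine-config reports Available=True here, the detailed conditions on the operator and on the worker pool make the stuck rollout visible. A small sketch to surface that mismatch:

# Conditions on the worker MachineConfigPool (Updated/Updating/Degraded with messages)
oc get mcp worker -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

# Full status of the machine-config cluster operator
oc describe clusteroperator machine-config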
*** This bug has been marked as a duplicate of bug 1880591 ***