Description of problem:

A failure in the 4.7->4.8 upgrade job in "Check if alerts are firing during or after upgrade success" is showing lots of etcdHighNumberOfLeaderChanges alerts. Looking in some of the etcd pod logs I noticed this:

"dropped internal Raft message since sending buffer is full (overloaded network)"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ci-op-b8s8c7wc-8929c-wgbsn-master-2_etcd.log

Other etcd pod logs have other network-related messages, such as:

"Err :connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.3:9978: connect: connection refused\". Reconnecting..."

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ci-op-b8s8c7wc-8929c-wgbsn-master-2_etcd-metrics.log

The above is from this job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104

A few other new ovn-upgrade failures have new bugs and all seem network related. It's possible that test case failures with different top-level symptoms (like the above) are all coming from the same root cause. Two examples of these bugs, just filed:

https://bugzilla.redhat.com/show_bug.cgi?id=1968009
https://bugzilla.redhat.com/show_bug.cgi?id=1968021
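For anyone following along, a quick way to see how these messages cluster in time is to scan one of the downloaded pod logs for them and bucket the hits by minute. A minimal Go sketch, where "etcd.log" is a placeholder for a local copy of one of the artifact logs above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	// "etcd.log" is a hypothetical local copy of an etcd pod log pulled from
	// the gather-extra artifacts linked above.
	f, err := os.Open("etcd.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	perMinute := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // some etcd log lines are long
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "dropped internal Raft message") &&
			!strings.Contains(line, "connection refused") {
			continue
		}
		// etcd JSON logs carry "ts":"2021-06-03T14:30:45.207Z"; bucket by the minute.
		if i := strings.Index(line, `"ts":"`); i >= 0 && len(line) >= i+22 {
			perMinute[line[i+6:i+22]]++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}

	minutes := make([]string, 0, len(perMinute))
	for m := range perMinute {
		minutes = append(minutes, m)
	}
	sort.Strings(minutes)
	for _, m := range minutes {
		fmt.Printf("%s  %d matches\n", m, perMinute[m])
	}
}

In this run the matches pile up around the master reboots, which lines up with the next comment.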
Created attachment 1789401 [details]
CPU dip at 14:27
I don't think etcd is the problem - that seems like a red herring. As far as I can tell, the etcd message

{"level":"warn","ts":"2021-06-03T14:30:45.207Z","caller":"rafthttp/peer.go:267","msg":"dropped internal Raft message since sending buffer is full (overloaded network)","message-type":"MsgHeartbeat","local-member-id":"2bc264d787b8e152","from":"2bc264d787b8e152","remote-peer-id":"6d1ad43e626b4535","remote-peer-active":false}

is not that interesting - these messages show up and go away as the masters are rebooted during the upgrade process.
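For context, that warning comes from etcd's peer transport when its bounded send buffer to a peer fills up. A minimal sketch of the pattern (not etcd's actual code) showing why an unreachable peer, e.g. a master mid-reboot, produces a burst of these messages:

package main

import "fmt"

type message struct{ kind string }

type peer struct {
	sendc chan message // bounded send buffer; drained only while the peer is reachable
}

func (p *peer) send(m message) {
	select {
	case p.sendc <- m:
		// buffered for the sender goroutine to deliver
	default:
		// buffer full -> drop and warn rather than blocking raft,
		// mirroring the log message quoted above
		fmt.Printf("dropped internal Raft message since sending buffer is full (overloaded network), message-type=%s\n", m.kind)
	}
}

func main() {
	p := &peer{sendc: make(chan message, 4)}
	// nobody is draining sendc, as if the remote member were down/rebooting
	for i := 0; i < 8; i++ {
		p.send(message{kind: "MsgHeartbeat"})
	}
}

Once the peer comes back and the buffer drains, the warnings stop, which matches the "show up and go away" behavior.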
Update: Pod creation issues clearly happen during ovn-kube failover. Trying to determine exactly what the timeline is, so we can see how to best fix this up.
Let's look at the timing for a particular pod, thanks to Loki:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728720000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20or%20systemd_unit%3D%5C%22kubelet.service%5C%22%20or%20systemd_unit%3D%5C%22crio.service%5C%22%20%7C%3D%20%5C%22openshift-kube-apiserver%2Finstaller-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%22,%22maxLines%22:4000%7D%5D

2021-06-03 15:58:36 I0603 13:58:36.768948 131496 cni.go:223] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD finished CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa], result "", err failed to get pod annotation: timed out waiting for annotations
2021-06-03 15:58:31 I0603 13:58:31.435072 1 pods.go:302] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2] addLogicalPort took 194.863518ms
2021-06-03 15:58:31 I0603 13:58:31.240455 1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.0.3/23"],"mac_address":"0a:58:0a:81:00:03","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.3/23","gateway_ip":"10.129.0.1"}}] on pod openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2
2021-06-03 15:58:06 I0603 13:58:06.713263 131496 cni.go:213] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD starting CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa]

So it does take a long time for the annotations to be added, but it seems like we sneak in under the deadline. I wonder why ovnkube-node doesn't see the annotations?
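As I understand the flow (a hedged sketch, not the actual ovn-kubernetes code): ovnkube-master writes the k8s.ovn.org/pod-networks annotation once addLogicalPort finishes, and the CNI ADD side polls for that annotation under a deadline. getPodAnnotations below is a hypothetical stand-in for the informer/API lookup the real code does, which is also why a restarting ovnkube-node could still miss an annotation the apiserver already has:

package main

import (
	"errors"
	"fmt"
	"time"
)

// hypothetical lookup; the real code reads from a shared informer cache,
// so a freshly restarted ovnkube-node can lag behind the apiserver.
func getPodAnnotations(namespace, name string) (map[string]string, bool) {
	return nil, false
}

func waitForPodNetworksAnnotation(namespace, name string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if annotations, ok := getPodAnnotations(namespace, name); ok {
			if v, ok := annotations["k8s.ovn.org/pod-networks"]; ok {
				return v, nil // master has published the pod's network config
			}
		}
		time.Sleep(200 * time.Millisecond)
	}
	return "", errors.New("timed out waiting for annotations")
}

func main() {
	// short timeout just for the demo; the real deadline is much longer
	_, err := waitForPodNetworksAnnotation("openshift-kube-apiserver",
		"installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2", 3*time.Second)
	fmt.Println(err)
}

If the annotation lands at 13:58:31 but the node-side lookup never sees it before the deadline at 13:58:36, you get exactly the "timed out waiting for annotations" error above.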
This is very curious - it seems the ovnkube-node process was starting up during the failed CNI add:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20host%3D%5C%22ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%20and%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20and%20app%3D%5C%22ovnkube-node%5C%22%22,%22maxLines%22:4000%7D%5D

Even the ovn-controller didn't see the port until 13:58:57, well after 13:58:31, when the logical port was finished:

2021-06-03T13:58:57Z|00024|binding|INFO|openshift-kube-apiserver_installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2: Claiming 0a:58:0a:81:00:03 10.129.0.3
Another update, this time better: I managed to get a good CI run (by working around https://github.com/openshift/cluster-network-operator/pull/1118 with an ugly hack). I didn't see any "timed out waiting for annotations" failures, so that might not be a problem (though who knows how frequent it might have been). Instead, I see a decent amount of "failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding". So, looking into that.
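For reference, my rough understanding of that second wait (a sketch, not the actual plugin code): once ovn-controller claims the logical port it marks the pod's OVS interface with external-ids:ovn-installed=true, and the CNI side polls for that flag until a deadline. The ovs-vsctl invocation and the ifaceName used below are illustrative assumptions:

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func waitForOVNInstalled(ifaceName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// ask OVS whether ovn-controller has flagged the interface as installed
		out, err := exec.Command("ovs-vsctl", "--if-exists", "get",
			"Interface", ifaceName, "external-ids:ovn-installed").CombinedOutput()
		if err == nil && strings.Contains(string(out), "true") {
			return nil // ovn-controller has bound the port
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out while waiting for OVS port binding on %s", ifaceName)
}

func main() {
	// hypothetical interface name, short timeout just for the demo
	fmt.Println(waitForOVNInstalled("example-iface", 5*time.Second))
}

That would explain why a late ovn-controller claim (like the 13:58:57 one in the previous run) shows up as a port-binding timeout on the CNI side.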
I don't see how the etcd raft leadership has anything to do with OVN. Also, other non-OVN jobs are showing the same failure. There is another bug about the same alert firing that the etcd team is looking into; duping to that.

*** This bug has been marked as a duplicate of bug 1972948 ***