Bug 1968030
| Field | Value |
|---|---|
| Summary | etcdHighNumberOfLeaderChanges alerts during ovn-kube upgrade |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Version | 4.8 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | jamo luhrsen <jluhrsen> |
| Assignee | Tim Rozet <trozet> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | anbhat, bbennett, wking |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2021-09-30 20:43:18 UTC |
Description
jamo luhrsen, 2021-06-04 18:14:41 UTC

Created attachment 1789401 [details]: CPU dip at 14:27
I don't think etcd is the problem; that seems like a red herring. As far as I can tell, the etcd message

    {"level":"warn","ts":"2021-06-03T14:30:45.207Z","caller":"rafthttp/peer.go:267","msg":"dropped internal Raft message since sending buffer is full (overloaded network)","message-type":"MsgHeartbeat","local-member-id":"2bc264d787b8e152","from":"2bc264d787b8e152","remote-peer-id":"6d1ad43e626b4535","remote-peer-active":false}

is not that interesting: these messages show up and go away as the masters are rebooted during the upgrade process.

Update: Pod creation issues clearly happen during ovn-kube failover. Trying to determine exactly what the timeline is, so we can see how best to fix this up. Let's look at the timing for a particular pod, thanks to Loki:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728720000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20or%20systemd_unit%3D%5C%22kubelet.service%5C%22%20or%20systemd_unit%3D%5C%22crio.service%5C%22%20%7C%3D%20%5C%22openshift-kube-apiserver%2Finstaller-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%22,%22maxLines%22:4000%7D%5D

    2021-06-03 15:58:36 I0603 13:58:36.768948 131496 cni.go:223] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD finished CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa], result "", err failed to get pod annotation: timed out waiting for annotations
    2021-06-03 15:58:31 I0603 13:58:31.435072 1 pods.go:302] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2] addLogicalPort took 194.863518ms
    2021-06-03 15:58:31 I0603 13:58:31.240455 1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.0.3/23"],"mac_address":"0a:58:0a:81:00:03","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.3/23","gateway_ip":"10.129.0.1"}}] on pod openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2
    2021-06-03 15:58:06 I0603 13:58:06.713263 131496 cni.go:213] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD starting CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa]

So it does take a long time for the annotations to be added, but it seems like we sneak in under the deadline. I wonder why ovnkube-node doesn't see the annotations?
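To make the failing step concrete: the node-side CNI ADD handler has to wait for ovnkube-master to publish the k8s.ovn.org/pod-networks annotation before it can wire up the pod, and "timed out waiting for annotations" means that wait expired. Below is a minimal sketch of such a wait, assuming client-go and an in-cluster config; it is not the actual ovn-kubernetes code, and the helper name and the 30-second timeout (chosen to match the 13:58:06 to 13:58:36 window above) are illustrative.

```go
// Sketch only: poll the pod until the k8s.ovn.org/pod-networks annotation
// appears, or give up after a deadline. Not the ovn-kubernetes implementation.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForPodNetworksAnnotation returns the annotation value once ovnkube-master
// has written it, or an error if the timeout elapses first.
func waitForPodNetworksAnnotation(client kubernetes.Interface, namespace, name string, timeout time.Duration) (string, error) {
	var annotation string
	err := wait.PollImmediate(200*time.Millisecond, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat API hiccups as "not ready yet" and keep polling
		}
		if v, ok := pod.Annotations["k8s.ovn.org/pod-networks"]; ok {
			annotation = v
			return true, nil
		}
		return false, nil
	})
	if err != nil {
		return "", fmt.Errorf("timed out waiting for annotations: %w", err)
	}
	return annotation, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pod names taken from the log excerpt above; 30s is an assumed deadline.
	ann, err := waitForPodNetworksAnnotation(client, "openshift-kube-apiserver",
		"installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2", 30*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod-networks annotation:", ann)
}
```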
This is very curious: it seems the ovnkube-node process was starting up during the failed CNI add:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20host%3D%5C%22ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%20and%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20and%20app%3D%5C%22ovnkube-node%5C%22%22,%22maxLines%22:4000%7D%5D

Even the ovn-controller didn't see the ports until 13:58:57, well after 13:58:31, when the logical port was finished:

    2021-06-03T13:58:57Z|00024|binding|INFO|openshift-kube-apiserver_installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2: Claiming 0a:58:0a:81:00:03 10.129.0.3

Another update, and this time the news is better. I managed to get a good CI run (by working around https://github.com/openshift/cluster-network-operator/pull/1118 with an ugly hack). I didn't see any timeouts waiting for annotations, so that might not be a problem (though who knows how frequent it might have been). Instead, I see a decent amount of "failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding", so I'm looking into that (sketched below).

I don't see how the etcd raft leadership has anything to do with OVN. Other non-OVN jobs are showing the same failure, and there is another bug about the same alert firing that the etcd team is looking into. Duping to that.

*** This bug has been marked as a duplicate of bug 1972948 ***
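For reference, the "timed out while waiting for OVS port binding" error mentioned above comes down to watching the pod's OVS interface for the external_ids:ovn-installed flag that ovn-controller sets once it has claimed the logical port (the "Claiming ..." line earlier). A rough sketch of that kind of poll is below, assuming ovs-vsctl is available on the node; it is not the ovn-kubernetes implementation, and the interface name, helper name, and 20-second timeout are made up for illustration.

```go
// Sketch only: poll ovs-vsctl until external_ids:ovn-installed reports "true"
// on the pod's OVS interface, mirroring the "waiting for OVS port binding" step.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForOVNInstalled returns nil once the ovn-installed flag shows up,
// or an error after the timeout.
func waitForOVNInstalled(iface string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// An error here usually means the key (or the interface) is not
		// present yet, so treat it as "not ready" and keep polling.
		out, err := exec.Command("ovs-vsctl", "--timeout=5",
			"get", "Interface", iface, "external_ids:ovn-installed").CombinedOutput()
		if err == nil && strings.Contains(string(out), "true") {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("timed out while waiting for OVS port binding on %s", iface)
}

func main() {
	// "example-pod-iface" is a placeholder, not a real interface name.
	if err := waitForOVNInstalled("example-pod-iface", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}
```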