Bug 1968030
| Field | Value |
|---|---|
| Summary | etcdHighNumberOfLeaderChanges alerts during ovn-kube upgrade |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Version | 4.8 |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | high |
| Reporter | jamo luhrsen <jluhrsen> |
| Assignee | Tim Rozet <trozet> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | anbhat, bbennett, wking |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2021-09-30 20:43:18 UTC |
Description
jamo luhrsen, 2021-06-04 18:14:41 UTC

Created attachment 1789401 [details]: CPU dip at 14:27
I don't think etcd is the problem; that seems like a red herring. As far as I can tell, the etcd message

    {"level":"warn","ts":"2021-06-03T14:30:45.207Z","caller":"rafthttp/peer.go:267","msg":"dropped internal Raft message since sending buffer is full (overloaded network)","message-type":"MsgHeartbeat","local-member-id":"2bc264d787b8e152","from":"2bc264d787b8e152","remote-peer-id":"6d1ad43e626b4535","remote-peer-active":false}

is not that interesting: these messages show up and go away as the masters are rebooted during the upgrade process.

Update: Pod creation issues clearly happen during ovn-kube failover. Trying to determine exactly what the timeline is, so we can see how best to fix this up. Let's look at the timing for a particular pod, thanks to Loki:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728720000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20or%20systemd_unit%3D%5C%22kubelet.service%5C%22%20or%20systemd_unit%3D%5C%22crio.service%5C%22%20%7C%3D%20%5C%22openshift-kube-apiserver%2Finstaller-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%22,%22maxLines%22:4000%7D%5D

    2021-06-03 15:58:36 I0603 13:58:36.768948 131496 cni.go:223] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD finished CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa], result "", err failed to get pod annotation: timed out waiting for annotations
    2021-06-03 15:58:31 I0603 13:58:31.435072 1 pods.go:302] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2] addLogicalPort took 194.863518ms
    2021-06-03 15:58:31 I0603 13:58:31.240455 1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.0.3/23"],"mac_address":"0a:58:0a:81:00:03","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.3/23","gateway_ip":"10.129.0.1"}}] on pod openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2
    2021-06-03 15:58:06 I0603 13:58:06.713263 131496 cni.go:213] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD starting CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa]

So it does take a long time for the annotations to be added, but it seems like we sneak in under the deadline. I wonder why ovnkube-node doesn't see the annotations?
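To make the failing step concrete: the node-side CNI ADD handler has to wait for ovnkube-master to publish the k8s.ovn.org/pod-networks annotation before it can wire up the pod, and "timed out waiting for annotations" means that wait expired. Below is a minimal sketch of such a wait, assuming client-go and an in-cluster config; it is not the actual ovn-kubernetes code, and the helper name and the 30-second timeout (chosen to match the 13:58:06 to 13:58:36 window above) are illustrative.

```go
// Sketch only: poll the pod until the k8s.ovn.org/pod-networks annotation
// appears, or give up after a deadline. Not the ovn-kubernetes implementation.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForPodNetworksAnnotation returns the annotation value once ovnkube-master
// has written it, or an error if the timeout elapses first.
func waitForPodNetworksAnnotation(client kubernetes.Interface, namespace, name string, timeout time.Duration) (string, error) {
	var annotation string
	err := wait.PollImmediate(200*time.Millisecond, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat API hiccups as "not ready yet" and keep polling
		}
		if v, ok := pod.Annotations["k8s.ovn.org/pod-networks"]; ok {
			annotation = v
			return true, nil
		}
		return false, nil
	})
	if err != nil {
		return "", fmt.Errorf("timed out waiting for annotations: %w", err)
	}
	return annotation, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Pod names taken from the log excerpt above; 30s is an assumed deadline.
	ann, err := waitForPodNetworksAnnotation(client, "openshift-kube-apiserver",
		"installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2", 30*time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod-networks annotation:", ann)
}
```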
This is very curious: it seems the ovnkube-node process was starting up during the failed CNI add:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20host%3D%5C%22ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%20and%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20and%20app%3D%5C%22ovnkube-node%5C%22%22,%22maxLines%22:4000%7D%5D

Even the ovn-controller didn't see the ports until 13:58:57, well after 13:58:31, when the logical port was finished:

    2021-06-03T13:58:57Z|00024|binding|INFO|openshift-kube-apiserver_installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2: Claiming 0a:58:0a:81:00:03 10.129.0.3

Another update, and this time the news is better. I managed to get a good CI run (by working around https://github.com/openshift/cluster-network-operator/pull/1118 with an ugly hack). I didn't see any timeouts waiting for annotations, so that might not be a problem (though who knows how frequent it might have been). Instead, I see a decent amount of "failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding", so I'm looking into that (sketched below).

I don't see how the etcd raft leadership has anything to do with OVN. Other non-OVN jobs are showing the same failure, and there is another bug about the same alert firing that the etcd team is looking into. Duping to that.

*** This bug has been marked as a duplicate of bug 1972948 ***
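For reference, the "timed out while waiting for OVS port binding" error mentioned above comes down to watching the pod's OVS interface for the external_ids:ovn-installed flag that ovn-controller sets once it has claimed the logical port (the "Claiming ..." line earlier). A rough sketch of that kind of poll is below, assuming ovs-vsctl is available on the node; it is not the ovn-kubernetes implementation, and the interface name, helper name, and 20-second timeout are made up for illustration.

```go
// Sketch only: poll ovs-vsctl until external_ids:ovn-installed reports "true"
// on the pod's OVS interface, mirroring the "waiting for OVS port binding" step.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForOVNInstalled returns nil once the ovn-installed flag shows up,
// or an error after the timeout.
func waitForOVNInstalled(iface string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// An error here usually means the key (or the interface) is not
		// present yet, so treat it as "not ready" and keep polling.
		out, err := exec.Command("ovs-vsctl", "--timeout=5",
			"get", "Interface", iface, "external_ids:ovn-installed").CombinedOutput()
		if err == nil && strings.Contains(string(out), "true") {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("timed out while waiting for OVS port binding on %s", iface)
}

func main() {
	// "example-pod-iface" is a placeholder, not a real interface name.
	if err := waitForOVNInstalled("example-pod-iface", 20*time.Second); err != nil {
		fmt.Println(err)
	}
}
```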