Description of problem:

A failure in the 4.7->4.8 upgrade job in "Check if alerts are firing during or after upgrade success" is showing lots of etcdHighNumberOfLeaderChanges alerts. Looking in some of the etcd pod logs I noticed this:

"dropped internal Raft message since sending buffer is full (overloaded network)"

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ci-op-b8s8c7wc-8929c-wgbsn-master-2_etcd.log

Other etcd pod logs have other network-related messages, such as:

"Err :connection error: desc = \"transport: Error while dialing dial tcp 10.0.0.3:9978: connect: connection refused\". Reconnecting..."

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/pods/openshift-etcd_etcd-ci-op-b8s8c7wc-8929c-wgbsn-master-2_etcd-metrics.log

The above is from this job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade/1400432036841263104

A few other new ovn-upgrade failures have new bugs and all seem network related. It's possible that test case failures with different top-level symptoms (like the above) are all coming from the same root cause. Two examples of these bugs, just filed:

https://bugzilla.redhat.com/show_bug.cgi?id=1968009
https://bugzilla.redhat.com/show_bug.cgi?id=1968021
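For anyone following along, a quick way to see how these messages cluster in time is to scan one of the downloaded pod logs for them and bucket the hits by minute. A minimal Go sketch, where "etcd.log" is a placeholder for a local copy of one of the artifact logs above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	// "etcd.log" is a hypothetical local copy of an etcd pod log pulled from
	// the gather-extra artifacts linked above.
	f, err := os.Open("etcd.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	perMinute := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // some etcd log lines are long
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "dropped internal Raft message") &&
			!strings.Contains(line, "connection refused") {
			continue
		}
		// etcd JSON logs carry "ts":"2021-06-03T14:30:45.207Z"; bucket by the minute.
		if i := strings.Index(line, `"ts":"`); i >= 0 && len(line) >= i+22 {
			perMinute[line[i+6:i+22]]++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}

	minutes := make([]string, 0, len(perMinute))
	for m := range perMinute {
		minutes = append(minutes, m)
	}
	sort.Strings(minutes)
	for _, m := range minutes {
		fmt.Printf("%s  %d matches\n", m, perMinute[m])
	}
}

In this run the matches pile up around the master reboots, which lines up with the next comment.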
Created attachment 1789401 [details]
CPU dip at 14:27
I don't think etcd is the problem - that seems like a red herring. As far as I can tell, the etcd message

{"level":"warn","ts":"2021-06-03T14:30:45.207Z","caller":"rafthttp/peer.go:267","msg":"dropped internal Raft message since sending buffer is full (overloaded network)","message-type":"MsgHeartbeat","local-member-id":"2bc264d787b8e152","from":"2bc264d787b8e152","remote-peer-id":"6d1ad43e626b4535","remote-peer-active":false}

is not that interesting - these messages show up and go away as the masters are rebooted during the upgrade process.
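For context, that warning comes from etcd's peer transport when its bounded send buffer to a peer fills up. A minimal sketch of the pattern (not etcd's actual code) showing why an unreachable peer, e.g. a master mid-reboot, produces a burst of these messages:

package main

import "fmt"

type message struct{ kind string }

type peer struct {
	sendc chan message // bounded send buffer; drained only while the peer is reachable
}

func (p *peer) send(m message) {
	select {
	case p.sendc <- m:
		// buffered for the sender goroutine to deliver
	default:
		// buffer full -> drop and warn rather than blocking raft,
		// mirroring the log message quoted above
		fmt.Printf("dropped internal Raft message since sending buffer is full (overloaded network), message-type=%s\n", m.kind)
	}
}

func main() {
	p := &peer{sendc: make(chan message, 4)}
	// nobody is draining sendc, as if the remote member were down/rebooting
	for i := 0; i < 8; i++ {
		p.send(message{kind: "MsgHeartbeat"})
	}
}

Once the peer comes back and the buffer drains, the warnings stop, which matches the "show up and go away" behavior.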
Update: Pod creation issues clearly happen during ovn-kube failover. Trying to determine exactly what the timeline is, so we can see how to best fix this up.
Let's look at the timing for a particular pod, thanks to Loki:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728720000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20or%20systemd_unit%3D%5C%22kubelet.service%5C%22%20or%20systemd_unit%3D%5C%22crio.service%5C%22%20%7C%3D%20%5C%22openshift-kube-apiserver%2Finstaller-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%22,%22maxLines%22:4000%7D%5D

2021-06-03 15:58:36 I0603 13:58:36.768948 131496 cni.go:223] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD finished CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa], result "", err failed to get pod annotation: timed out waiting for annotations
2021-06-03 15:58:31 I0603 13:58:31.435072 1 pods.go:302] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2] addLogicalPort took 194.863518ms
2021-06-03 15:58:31 I0603 13:58:31.240455 1 kube.go:61] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.129.0.3/23"],"mac_address":"0a:58:0a:81:00:03","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.3/23","gateway_ip":"10.129.0.1"}}] on pod openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2
2021-06-03 15:58:06 I0603 13:58:06.713263 131496 cni.go:213] [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa] ADD starting CNI request [openshift-kube-apiserver/installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2 25b4d88ee334e0112ceff99f13e994cfc1170080aa3db06ab38775d3046f95fa]

So it does take a long time for the annotations to be added, but it seems like we sneak in under the deadline. I wonder why ovnkube-node doesn't see the annotations?
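As I understand the flow (a hedged sketch, not the actual ovn-kubernetes code): ovnkube-master writes the k8s.ovn.org/pod-networks annotation once addLogicalPort finishes, and the CNI ADD side polls for that annotation under a deadline. getPodAnnotations below is a hypothetical stand-in for the informer/API lookup the real code does, which is also why a restarting ovnkube-node could still miss an annotation the apiserver already has:

package main

import (
	"errors"
	"fmt"
	"time"
)

// hypothetical lookup; the real code reads from a shared informer cache,
// so a freshly restarted ovnkube-node can lag behind the apiserver.
func getPodAnnotations(namespace, name string) (map[string]string, bool) {
	return nil, false
}

func waitForPodNetworksAnnotation(namespace, name string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if annotations, ok := getPodAnnotations(namespace, name); ok {
			if v, ok := annotations["k8s.ovn.org/pod-networks"]; ok {
				return v, nil // master has published the pod's network config
			}
		}
		time.Sleep(200 * time.Millisecond)
	}
	return "", errors.New("timed out waiting for annotations")
}

func main() {
	// short timeout just for the demo; the real deadline is much longer
	_, err := waitForPodNetworksAnnotation("openshift-kube-apiserver",
		"installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2", 3*time.Second)
	fmt.Println(err)
}

If the annotation lands at 13:58:31 but the node-side lookup never sees it before the deadline at 13:58:36, you get exactly the "timed out waiting for annotations" error above.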
This is very curious - it seems the ovnkube-node process was starting up during the failed CNI add:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221622728500000%22,%221622728780000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade%2F1400432036841263104%5C%22%7D%20%20%7C%20unpack%20%7C%20host%3D%5C%22ci-op-b8s8c7wc-8929c-wgbsn-master-2%5C%22%20and%20namespace%3D%5C%22openshift-ovn-kubernetes%5C%22%20and%20app%3D%5C%22ovnkube-node%5C%22%22,%22maxLines%22:4000%7D%5D

Even the ovn-controller didn't see the port until 13:58:57, well after 13:58:31, when the logical port was finished:

2021-06-03T13:58:57Z|00024|binding|INFO|openshift-kube-apiserver_installer-8-ci-op-b8s8c7wc-8929c-wgbsn-master-2: Claiming 0a:58:0a:81:00:03 10.129.0.3
Another update, this time better: I managed to get a good CI run (by working around https://github.com/openshift/cluster-network-operator/pull/1118 with an ugly hack). I didn't see any "timed out waiting for annotations" failures, so that might not be a problem (though who knows how frequent it might have been). Instead, I see a decent amount of "failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding". So, looking into that.
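For reference, my rough understanding of that second wait (a sketch, not the actual plugin code): once ovn-controller claims the logical port it marks the pod's OVS interface with external-ids:ovn-installed=true, and the CNI side polls for that flag until a deadline. The ovs-vsctl invocation and the ifaceName used below are illustrative assumptions:

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func waitForOVNInstalled(ifaceName string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// ask OVS whether ovn-controller has flagged the interface as installed
		out, err := exec.Command("ovs-vsctl", "--if-exists", "get",
			"Interface", ifaceName, "external-ids:ovn-installed").CombinedOutput()
		if err == nil && strings.Contains(string(out), "true") {
			return nil // ovn-controller has bound the port
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("timed out while waiting for OVS port binding on %s", ifaceName)
}

func main() {
	// hypothetical interface name, short timeout just for the demo
	fmt.Println(waitForOVNInstalled("example-iface", 5*time.Second))
}

That would explain why a late ovn-controller claim (like the 13:58:57 one in the previous run) shows up as a port-binding timeout on the CNI side.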
I don't see how the etcd raft leadership has anything to do with OVN. Also, other non-OVN jobs are showing the same failure. There is another bug about the same alert firing that the etcd team is looking into; duping to that.

*** This bug has been marked as a duplicate of bug 1972948 ***