Description of problem:

An upgrade run on a 250 node cluster built with 4.6.9 bits to 4.7.0-fc.2 brought down the API server during the machine-config operator upgrade. Looking at Etcd, disk fsync seems to be normal, but peer network latency was > 1 sec, meaning there was something wrong with the networking. RTT seems to be normal between the masters: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_rtt_master.txt but there are huge packet drops on the ens5 interface: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_packets_master.txt.

Snapshots of the dashboards containing the cluster/API and Etcd metrics:
https://snapshot.raintank.io/dashboard/snapshot/J44uuaO7FJIL0F1UsJs6ah7egPqHoMlV
https://snapshot.raintank.io/dashboard/snapshot/vDXxrDgkyF3tfs2zxetIPwrOMW1aRpYE

Logs: Journal and pod logs from the master node - http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/. We could not grab the must-gather as the API was not responding.

Cluster configuration: The 250 node cluster was hosted on AWS with masters backed by io1 storage, and it was loaded with 4000 projects with the following objects per project:
- 12 imagestreams
- 3 buildconfigs
- 6 builds
- 1 deployment with 2 pod replicas (sleep) mounting two secrets each (deployment-2pod)
- 2 deployments with 1 pod replica (sleep) mounting two secrets (deployment-1pod)
- 3 services, one pointing to deployment-2pod and the other two pointing to deployment-1pod
- 3 routes, one pointing to the service deployment-2pod and the other two pointing to deployment-1pod
- 10 secrets, 2 of them mounted by the previous deployments
- 10 configMaps, 2 of them mounted by the previous deployments

Version-Release number of selected component (if applicable):
4.6.9, being upgraded to 4.7.0-fc.2.

How reproducible:
We have seen similar behavior with 4.5 -> 4.6 upgrades as well.

Steps to Reproduce:
1. Install a cluster using the 4.6.9 payload.
2. Scale the cluster to 250 nodes and load the cluster using e2e-benchmarking/kube-burner: https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables
3. Observe the upgrade process, especially during the machine-config operator upgrade, which is almost the last part of the process.

Actual results:
Cluster is highly unstable, with an unresponsive API and Etcd going through a large number of leader elections.

Expected results:
Cluster is stable.
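For reference, a minimal sketch of how the two symptoms above (etcd peer latency and interface drops) can be spot-checked on a live cluster. This is not necessarily how the linked data was collected; the node name below is a placeholder:

  # p99 etcd peer round-trip time (PromQL, run in the console/Prometheus UI);
  # compare against the > 1 sec values observed during the upgrade:
  histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

  # per-interface packet and drop counters on a master node:
  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal -- chroot /host ip -s link show ens5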
We did a second round of upgrade, but this time the builds were excluded during the cluster load phase, which seems to reduce the load on the cluster, especially on Etcd, since the DB size is about half of what it was on the cluster loaded with builds. The upgrade passed without any issues in this case, so the suspicion is that the high load causes network congestion, which destabilizes the control plane and eventually makes it unresponsive while the machine-config cluster operator upgrade is in progress and the masters are being rebooted.
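For the DB size comparison, this is roughly how we look at it (a sketch; the pod name is a placeholder and the container name is from memory, so adjust to what the release actually ships):

  # PromQL, reports per-member DB size:
  etcd_mvcc_db_total_size_in_bytes

  # or directly from an etcd pod on a master:
  oc exec -n openshift-etcd -c etcdctl etcd-ip-10-0-xxx-xxx.us-west-2.compute.internal -- etcdctl endpoint status -w table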
During situations where resource contention/packet drops are suspected, can you jump into an OVS container on the node and run "ovs-dpctl show" to get OVS data plane stats on packet drops/upcalls?
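For anyone else collecting this, a sketch of how to get at it (pod/node names are placeholders; depending on the release, OVS may run as a systemd service on the host rather than in a pod, in which case the oc debug path is the one that works):

  # from an openshift-sdn OVS pod, if one exists on the node:
  oc -n openshift-sdn exec ovs-xxxxx -- ovs-dpctl show

  # or via a debug shell on the node itself:
  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal
  chroot /host
  ovs-dpctl show   # the "lookups: hit/missed/lost" line shows upcall misses and dropped packets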
Can I have access to the environment for the stress test, or could you provide us with the OVS and SDN container logs from the nodes when you saw the stress on ens5? Also, out of curiosity, where can I find more info on the builds? I am not an expert on OCP stress tests/upgrades, so I would like to know what kind of build was included per project, since you mentioned in the second comment that removing the builds seemed to get the cluster back to normal.
Never mind about the container logs; I missed that you had already attached the master node's logs.
Okay, so I skimmed through the logs, and basically at this point every component is screaming that the API is unavailable, as stated in comment #1:

2021-01-13T22:40:14.824620481+00:00 stderr F F0113 22:40:14.824578 34446 cmd.go:106] Failed to initialize sdn: failed to initialize SDN: could not get ClusterNetwork resource: Get "https://api-int.upgradecluster.perfscale.devcluster.openshift.com:6443/apis/network.openshift.io/v1/clusternetworks/default": dial tcp 10.0.xxx.xxx:6443: i/o timeout

So we'd need the OVS flows from the nodes as Dan asked above, and I'll try to get help from the OVS team to find the bottleneck for the load.
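In case it helps when the test is re-run, a sketch of the flow dump we'd want, assuming openshift-sdn (node name is a placeholder; br0 is the openshift-sdn bridge and it speaks OpenFlow 1.3):

  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal
  chroot /host
  ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.txt   # capture the full flow table, ideally while the drops are happening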
There is an issue with network flows in 4.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1914284. We are re-running the test on a cluster built with the 4.6.12 payload, which has the fix, and will try to upgrade to 4.7.0-fc.3. We will grab the OVS flows information this time if/when we see the packet drops. Thanks.