Bug 1916029 - Upgrade from 4.6.9 to 4.7.0-fc.2 is failing, network begins to drop packets during upgrade, stable prior to upgrade
Summary: Upgrade from 4.6.9 to 4.7.0-fc.2 is failing, network begins to drop packets during upgrade, stable prior to upgrade
Keywords:
Status: CLOSED EOL
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Naga Ravi Chaitanya Elluri
QA Contact: zhaozhanqi
URL:
Whiteboard: aos-scalability-46
Depends On:
Blocks:
 
Reported: 2021-01-14 00:44 UTC by Naga Ravi Chaitanya Elluri
Modified: 2022-10-28 10:18 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-28 10:18:28 UTC
Target Upstream Version:
Embargoed:
surya: needinfo-



Description Naga Ravi Chaitanya Elluri 2021-01-14 00:44:50 UTC
Description of problem:
An upgrade run on a 250 node cluster built with 4.6.9 bits to 4.7.0-fc.2 brought down the API server during the machine-config operator upgrade. Looking at the Etcd metrics, disk fsync seems to be normal, but peer network latency was > 1 sec, meaning there was something wrong with the networking. RTT seems to be normal between the masters: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_rtt_master.txt but there are huge packet drops on the ens5 interface: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_packets_master.txt.

Snapshots of the dashboards containing the cluster/API and Etcd metrics: https://snapshot.raintank.io/dashboard/snapshot/J44uuaO7FJIL0F1UsJs6ah7egPqHoMlV, https://snapshot.raintank.io/dashboard/snapshot/vDXxrDgkyF3tfs2zxetIPwrOMW1aRpYE.

Logs: Journal and pod logs from the master node - http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/. We could not grab the must-gather as the API was not responding.
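
For reference, the checks behind those observations look roughly like this (a sketch; node names are placeholders, and the Etcd numbers come from the cluster Prometheus):

    # per-interface packet/drop counters on a master node
    oc debug node/<master-node> -- chroot /host ip -s link show ens5
    # Etcd peer latency and disk fsync, e.g. p99 over these metrics in Prometheus:
    #   etcd_network_peer_round_trip_time_seconds_bucket (peer round-trip time)
    #   etcd_disk_wal_fsync_duration_seconds_bucket (WAL fsync duration)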

Cluster configuration:
The 250 node cluster was hosted on AWS with masters backed by io1 storage and it was loaded with 4000 projects with the following objects per project:

    12 imagestreams
    3 buildconfigs
    6 builds
    1 deployment with 2 pod replicas (sleep), each mounting two secrets (deployment-2pod)
    2 deployments with 1 pod replica (sleep) mounting two secrets (deployment-1pod)
    3 services, one pointing to deployment-2pod and the other two pointing to deployment-1pod
    3 routes, one pointing to the deployment-2pod service and the other two pointing to deployment-1pod
    10 secrets, 2 of them mounted by the previous deployments
    10 configMaps, 2 of them mounted by the previous deployments

Version-Release number of selected component (if applicable):
4.6.9, upgrading to 4.7.0-fc.2.


How reproducible:
We have seen similar behavior with 4.5 -> 4.6 upgrades as well.

Steps to Reproduce:
1. Install a cluster using 4.6.9 payload
2. Scale the cluster to 250 nodes and load the cluster using e2e-benchmarking/kube-burner: https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables
3. Observe the upgrade process, especially during the machine-config operator upgrade, which is nearly the last part of the process (see the sketch below).
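
A rough sketch of steps 2 and 3 (machineset name, replica count, and release image are placeholders; the kube-burner workload variables are documented in the link above):

    # scale worker machinesets until the cluster reaches ~250 nodes
    oc -n openshift-machine-api scale machineset <worker-machineset> --replicas=<N>
    # trigger the upgrade to the target release image
    oc adm upgrade --to-image=<4.7.0-fc.2-release-image> --allow-explicit-upgrade
    # watch the machine-config operator, which rolls out near the end of the upgrade
    oc get clusterversion
    oc get co machine-config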

Actual results:
Cluster is highly unstable, with an unresponsive API and Etcd going through a large number of leader elections.

Expected results:
Cluster is stable.

Comment 1 Naga Ravi Chaitanya Elluri 2021-01-18 17:09:41 UTC
We did a second round of upgrade, but this time the builds were excluded during the cluster load phase, which seems to reduce the load on the cluster, especially on Etcd, since the DB size is about half of what it was on the cluster with builds. The upgrade passed without any issues in this case, so the suspect is high load causing network congestion, which makes the control plane unstable and eventually unresponsive while the machine-config cluster operator upgrade is in progress and the masters are rebooted.
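
(For reference, a rough way to compare the Etcd DB size between runs; the pod name is a placeholder and this assumes the etcdctl sidecar container is present:)

    oc -n openshift-etcd exec etcd-<master-node> -c etcdctl -- etcdctl endpoint status -w table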

Comment 2 Dan Williams 2021-01-19 15:28:23 UTC
During situations where resource contention/packet drops are suspected, can you jump into an OVS container on the node and run "ovs-dpctl show" to get OVS data plane stats on packet drops/upcalls?
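
(A rough sketch of collecting that, assuming a node debug shell works; if OVS runs as a pod in this release, exec into the ovs pod in openshift-sdn instead:)

    oc debug node/<master-node> -- chroot /host ovs-dpctl show
    # or, if OVS runs as a pod:
    oc -n openshift-sdn exec <ovs-pod> -- ovs-dpctl show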

Comment 3 Surya Seetharaman 2021-01-20 10:15:35 UTC
Can I have access to the environment for the stress test, or could you provide us with the OVS and SDN container logs from the nodes when you saw the stress on ens5?
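
(A rough sketch of pulling those logs; pod and node names are placeholders, and the OVS systemd unit name may differ depending on how OVS runs in this release:)

    # SDN container logs from the pods scheduled on the affected node
    oc -n openshift-sdn get pods -o wide | grep <node-name>
    oc -n openshift-sdn logs <sdn-pod> -c sdn
    # OVS logs from the host, if OVS runs under systemd in this release
    oc adm node-logs <node-name> -u ovs-vswitchd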

Also, out of curiosity, where can I find more info on the builds? I am not an expert on OCP stress tests/upgrades, so I would like to know what kind of build was included per project, since you mentioned in the second comment that removing the builds seemed to get the cluster back to normal.

Comment 4 Surya Seetharaman 2021-01-20 10:30:37 UTC
Oh, never mind about the container logs; I didn't notice that you'd attached the master node's logs.

Comment 5 Surya Seetharaman 2021-01-20 11:19:09 UTC
Okay, so I skimmed through the logs, and basically at this point every component is screaming that the API is unavailable, as stated in comment #1:

2021-01-13T22:40:14.824620481+00:00 stderr F F0113 22:40:14.824578   34446 cmd.go:106] Failed to initialize sdn: failed to initialize SDN: could not get ClusterNetwork resource: Get "https://api-int.upgradecluster.perfscale.devcluster.openshift.com:6443/apis/network.openshift.io/v1/clusternetworks/default": dial tcp 10.0.xxx.xxx:6443: i/o timeout

So we'd need the OVS flows from the nodes, as Dan asked above, and I'll try to get help from the OVS team to find the bottleneck for the load.
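
(A rough sketch of dumping the flows, assuming the default openshift-sdn bridge br0; the node name is a placeholder:)

    oc debug node/<node-name> -- chroot /host ovs-ofctl -O OpenFlow13 dump-flows br0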

Comment 6 Naga Ravi Chaitanya Elluri 2021-01-20 20:35:44 UTC
There is an issue with network flows in 4.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1914284. We are re-running the test on a cluster built with the 4.6.12 payload, which has the fix, and will try to upgrade to 4.7.0-fc.3. We will grab the OVS flow information this time if/when we see the packet drops. Thanks.

