Description of problem:

An upgrade run on a 250 node cluster built with 4.6.9 bits to 4.7.0-fc.2 brought down the API server during the machine-config operator upgrade. Looking at Etcd, disk fsync seems to be normal, but peer network latency was > 1 sec, meaning there was something wrong with the networking. RTT seems to be normal between the masters: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_rtt_master.txt but there are huge packet drops on the ens5 interface: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/network_packets_master.txt.

Snapshots of the dashboards containing the cluster/API and Etcd metrics:
https://snapshot.raintank.io/dashboard/snapshot/J44uuaO7FJIL0F1UsJs6ah7egPqHoMlV
https://snapshot.raintank.io/dashboard/snapshot/vDXxrDgkyF3tfs2zxetIPwrOMW1aRpYE

Logs: Journal and pod logs from the master node - http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.7-sdn-kube-1.20/bugs/upgrades/. We could not grab the must-gather as the API was not responding.

Cluster configuration: The 250 node cluster was hosted on AWS with masters backed by io1 storage, and it was loaded with 4000 projects with the following objects per project:
- 12 imagestreams
- 3 buildconfigs
- 6 builds
- 1 deployment with 2 pod replicas (sleep) mounting two secrets each (deployment-2pod)
- 2 deployments with 1 pod replica (sleep) mounting two secrets (deployment-1pod)
- 3 services, one pointing to deployment-2pod and the other two pointing to deployment-1pod
- 3 routes, one pointing to the service deployment-2pod and the other two pointing to deployment-1pod
- 10 secrets, 2 of them mounted by the previous deployments
- 10 configMaps, 2 of them mounted by the previous deployments

Version-Release number of selected component (if applicable):
4.6.9, being upgraded to 4.7.0-fc.2.

How reproducible:
We have seen similar behavior with 4.5 -> 4.6 upgrades as well.

Steps to Reproduce:
1. Install a cluster using the 4.6.9 payload.
2. Scale the cluster to 250 nodes and load the cluster using e2e-benchmarking/kube-burner: https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables
3. Observe the upgrade process, especially during the machine-config operator upgrade, which is almost the last part of the process.

Actual results:
Cluster is highly unstable, with an unresponsive API and Etcd going through a large number of leader elections.

Expected results:
Cluster is stable.
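For reference, a minimal sketch of how the two symptoms above (etcd peer latency and interface drops) can be spot-checked on a live cluster. This is not necessarily how the linked data was collected; the node name below is a placeholder:

  # p99 etcd peer round-trip time (PromQL, run in the console/Prometheus UI);
  # compare against the > 1 sec values observed during the upgrade:
  histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))

  # per-interface packet and drop counters on a master node:
  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal -- chroot /host ip -s link show ens5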
We did a second round of upgrade, but this time the builds were excluded during the cluster load phase, which seems to reduce the load on the cluster, especially on Etcd, since the DB size is about half of what it was on the cluster loaded with builds. The upgrade passed without any issues in this case, so the suspicion is that the high load causes network congestion, which destabilizes the control plane and eventually makes it unresponsive while the machine-config cluster operator upgrade is in progress and the masters are being rebooted.
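For the DB size comparison, this is roughly how we look at it (a sketch; the pod name is a placeholder and the container name is from memory, so adjust to what the release actually ships):

  # PromQL, reports per-member DB size:
  etcd_mvcc_db_total_size_in_bytes

  # or directly from an etcd pod on a master:
  oc exec -n openshift-etcd -c etcdctl etcd-ip-10-0-xxx-xxx.us-west-2.compute.internal -- etcdctl endpoint status -w table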
During situations where resource contention/packet drops are suspected, can you jump into an OVS container on the node and run "ovs-dpctl show" to get OVS data plane stats on packet drops/upcalls?
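For anyone else collecting this, a sketch of how to get at it (pod/node names are placeholders; depending on the release, OVS may run as a systemd service on the host rather than in a pod, in which case the oc debug path is the one that works):

  # from an openshift-sdn OVS pod, if one exists on the node:
  oc -n openshift-sdn exec ovs-xxxxx -- ovs-dpctl show

  # or via a debug shell on the node itself:
  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal
  chroot /host
  ovs-dpctl show   # the "lookups: hit/missed/lost" line shows upcall misses and dropped packets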
Can I have access to the environment for the stress test, or could you provide us with the OVS and SDN container logs from the nodes when you saw the stress on ens5? Also, out of curiosity, where can I find more info on the builds? I am not an expert on OCP stress tests/upgrades, so I would like to know what kind of build was included per project, since you mentioned in the second comment that removing the builds seemed to get the cluster back to normal.
Never mind about the container logs; I missed that you had already attached the master node's logs.
Okay, so I skimmed through the logs, and basically at this point every component is screaming that the API is unavailable, as stated in comment #1:

2021-01-13T22:40:14.824620481+00:00 stderr F F0113 22:40:14.824578 34446 cmd.go:106] Failed to initialize sdn: failed to initialize SDN: could not get ClusterNetwork resource: Get "https://api-int.upgradecluster.perfscale.devcluster.openshift.com:6443/apis/network.openshift.io/v1/clusternetworks/default": dial tcp 10.0.xxx.xxx:6443: i/o timeout

So we'd need the OVS flows from the nodes as Dan asked above, and I'll try to get help from the OVS team to find the bottleneck for the load.
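In case it helps when the test is re-run, a sketch of the flow dump we'd want, assuming openshift-sdn (node name is a placeholder; br0 is the openshift-sdn bridge and it speaks OpenFlow 1.3):

  oc debug node/ip-10-0-xxx-xxx.us-west-2.compute.internal
  chroot /host
  ovs-ofctl -O OpenFlow13 dump-flows br0 > /tmp/flows.txt   # capture the full flow table, ideally while the drops are happening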
There is an issue with network flows in 4.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=1914284. We are re-running the test on a cluster built with the 4.6.12 payload, which has the fix, and will try to upgrade to 4.7.0-fc.3. We will grab the OVS flows information this time if/when we see the packet drops. Thanks.