Description of problem:

We have a cluster of 3 masters + 21 worker nodes running OCP 4.5 on bare metal. We are running a simple pod-churn test. Here's what the test does:

1. Spin up 200 pods, in increments of 50, across the cluster in a new namespace
2. Delete the namespace and its pods
3. Re-run the same workload in a loop

This scenario creates pod churn in the environment without creating a large number of pods overall. However, on the 3rd-4th iteration of the workload we see the issues mentioned below. Using OpenShiftSDN we did not hit these issues, despite running the pod-churn workload for hundreds of iterations.

While running this workload we are hitting 3 different issues, all of which "seem" to point to some instability in networking. This is a blanket BZ to track these issues:

- High number of TX errors on geneve interfaces: https://bugzilla.redhat.com/show_bug.cgi?id=1834918
- Frequent restarts of kubescheduler pods on baremetal deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1834908
- Frequent TLS handshake errors causing cluster instability and failure of workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1834914

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OCP cluster with OVN on bare metal
2. Run the workload as described above

Actual results:
After a few iterations, the workload fails (can't launch any more pods) with the three symptoms in the BZes linked above.

Expected results:
OpenShiftSDN was able to run the same workload (200 pods launched and deleted) for several hundred iterations in the same environment, which makes networking a suspect.

Additional info:
We initially deploy with masters only and then add worker nodes. This leads to the master nodes carrying the worker label (we set them to unschedulable later using oc edit schedulers.config.openshift.io cluster).
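For reference, one way to drive the churn loop described in steps 1-3. This is only a sketch: the namespace, pod names, pause image, and timeouts are assumptions, not what our harness actually uses. The function prints the oc commands for one iteration (pipe its output to sh to execute against a real cluster):

```shell
#!/bin/sh
# Sketch of ONE iteration of the pod-churn workload: create a namespace,
# launch 200 pods in batches of 50, then delete the namespace (which
# tears down all pods). Prints commands instead of running them.
churn_iteration() {
  ns="$1"      # namespace for this iteration (assumed name)
  total="$2"   # total pods, e.g. 200
  batch="$3"   # batch size, e.g. 50

  echo "oc create namespace $ns"
  n=0
  while [ "$n" -lt "$total" ]; do
    # Pause image is a placeholder; any small long-running image works.
    echo "oc run pause-$n -n $ns --image=k8s.gcr.io/pause:3.2 --restart=Never"
    n=$((n + 1))
    # After each batch, wait for the pods launched so far to become Ready.
    if [ $((n % batch)) -eq 0 ]; then
      echo "oc wait pods --all -n $ns --for=condition=Ready --timeout=300s"
    fi
  done
  # Deleting the namespace deletes all pods created in it.
  echo "oc delete namespace $ns --wait=true"
}

churn_iteration churn-test 200 50
```

Looping over churn_iteration then reproduces the churn without ever holding more than 200 pods at once.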
Updated some of the bugs referenced in this BZ with must-gather and additional details. Overall it seems like there are consistent network transmit errors in the Prometheus alerts: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp

When running any workload, such as creating projects/imagestreams or pods, we see two types of errors on the client side. So this is not restricted to pod-churn workloads:

1. Unexpected error:
    <*url.Error | 0xc001bcb560>: {
        Op: "Post",
        URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
        Err: {s: "EOF"},
    }
    Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

In the OCP API server logs we keep seeing:

ocp-o.txt:324:W0513 18:29:33.496286 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:325:W0513 18:29:37.662837 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:326:W0513 18:29:37.994312 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:327:W0513 18:29:38.022618 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:328:W0513 18:29:38.210862 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:329:W0513 18:29:38.227472 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...

I made sure we don't have any unexpected IPs/hosts in the baremetal environment by running an nmap scan. Happy to give access to the environment and help debug further.
Back with another data point: I tried the same installer build that I am using on bare metal to deploy on AWS with OVNKubernetes at the same scale (3 masters + 21 workers). I am able to run all my tests there successfully and don't see issues similar to those on bare metal.

A few things I checked to confirm this is not environmental:
1. I ran an nmap scan across the 192.168.222.0/24 network to verify there are no unwanted hosts/IPs.
2. I made sure any hosts that are not used in the deployment are powered off (we have some extra hosts that are not being used). So the only hosts powered on are the ones in the cluster (3 masters + 21 workers).
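The environment check in point 1 can be sketched as below. This is a hypothetical reconstruction (the exact nmap flags used are not stated in this BZ, and the file names are placeholders): ping-sweep the subnet, then diff the responders against the known cluster inventory.

```shell
#!/bin/sh
# 1. Sweep the subnet for live hosts (requires nmap and access to the
#    192.168.222.0/24 network); grepable output lines end in "Up":
#    nmap -sn -oG - 192.168.222.0/24 | awk '$NF == "Up" {print $2}' | sort > live.txt

# 2. Anything live but not in the known inventory is unexpected.
unexpected_hosts() {
  # $1 = known inventory file, $2 = live hosts file
  # (both lexically sorted, one IP per line, as comm requires)
  comm -13 "$1" "$2"
}
```

An empty result from unexpected_hosts would confirm that only the 24 cluster nodes are live on the network.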
*** This bug has been marked as a duplicate of bug 1834918 ***