Bug 1835376
| Summary: | Pod Churn Tests lead to TX errors on Geneve interface, kube-scheduler restarts and TLS errors | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | aconstan, akamra, dblack, dwilson, jtaleric, mkarg |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-18 14:10:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sai Sindhur Malleni
2020-05-13 17:31:20 UTC
Updated some of the bugs referenced in this BZ with must-gather and additional details.

Overall it seems like there are consistent network transmit errors in the prometheus alerts: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp (a node-side check of the Geneve TX counters is sketched after this report).

When running any workload like creating projects/imagestreams or pods, we see two types of errors on the client side, so this is not restricted to pod churn workloads (a small reproduction sketch also follows this report):

1. Unexpected error:
   <*url.Error | 0xc001bcb560>: {
       Op: "Post",
       URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
       Err: {s: "EOF"},
   }
   Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

In the OCP API server logs we keep seeing:

ocp-o.txt:324:W0513 18:29:33.496286 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:325:W0513 18:29:37.662837 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:326:W0513 18:29:37.994312 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:327:W0513 18:29:38.022618 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:328:W0513 18:29:38.210862 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:329:W0513 18:29:38.227472 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...

I made sure we don't have any unexpected IPs/hosts in the baremetal environment by running an nmap. Happy to give access to the environment and help debug further.

Back with another datapoint: I tried the same installer build that I am using on baremetal to deploy on AWS with OVNKubernetes at the same scale (3 masters + 21 workers). I'm able to run all my tests there successfully and don't see issues similar to those on baremetal.

A few things I checked to confirm this is not environmental:

1. I ran an nmap across the 192.168.222.0/24 network to verify there are no unwanted hosts/IPs (see the sweep sketch after this report).
2. I made sure any hosts that we are not using in the deployment are powered off (we have some extra hosts that are not being used), so the only hosts powered on are the ones in the cluster (3 masters + 21 workers).

*** This bug has been marked as a duplicate of bug 1834918 ***
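For the network transmit errors flagged in the prometheus alerts, a node-side counter check can confirm whether the Geneve tunnel itself is reporting TX errors or drops. This is a minimal sketch, not part of the original report: it assumes the standard Linux sysfs statistics files are present and that the OVN-Kubernetes Geneve device is named `genev_sys_6081` (adjust to the actual interface on the node).

```python
#!/usr/bin/env python3
"""Minimal sketch: read TX error counters for the Geneve interface from sysfs.

Assumptions (not taken from the bug report): the node exposes
/sys/class/net/<iface>/statistics and the Geneve device is "genev_sys_6081".
"""
from pathlib import Path

IFACE = "genev_sys_6081"  # assumed interface name; adjust for the node
COUNTERS = ("tx_errors", "tx_dropped", "tx_carrier_errors", "tx_fifo_errors")

def read_counter(iface: str, counter: str) -> int:
    # Each statistics file holds a single integer counter.
    path = Path("/sys/class/net") / iface / "statistics" / counter
    return int(path.read_text().strip())

if __name__ == "__main__":
    for name in COUNTERS:
        try:
            print(f"{IFACE} {name}: {read_counter(IFACE, name)}")
        except FileNotFoundError:
            print(f"{IFACE} {name}: interface or counter not present on this node")
```

Running this periodically during a pod churn test would show whether the counters climb while the prometheus alert is firing.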
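The client-side EOF and i/o timeout errors come from POSTs to /api/v1/namespaces/nodevertical0/pods. Below is a rough reproduction sketch using the official Kubernetes Python client; the namespace comes from the reported error output, while the pod spec, image, and pod names are placeholders rather than details from the report.

```python
#!/usr/bin/env python3
"""Rough reproduction sketch for the client-side pod-creation errors.

Assumptions: the "nodevertical0" namespace exists (it appears in the reported
error URL); the pause image and pod names below are placeholders.
"""
from kubernetes import client, config

def create_test_pod(namespace: str, name: str):
    # A tiny pod body; any failure in the POST surfaces as an exception.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "containers": [
                {"name": "pause", "image": "k8s.gcr.io/pause:3.2"}  # placeholder image
            ]
        },
    }
    return client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

if __name__ == "__main__":
    config.load_kube_config()  # uses the current kubeconfig context
    for i in range(50):  # a small burst; the reported failures show up under churn
        try:
            create_test_pod("nodevertical0", f"churn-test-{i}")
        except Exception as exc:  # EOF / timeout from the API server lands here
            print(f"pod churn-test-{i} failed: {exc}")
```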
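The environmental checks describe an nmap sweep of 192.168.222.0/24 to rule out unexpected hosts. A plain TCP connect sweep (not nmap) can serve the same purpose; the sketch below assumes ports 22, 2379 and 6443 are the ones of interest, and comparing the results against the known node list is left to the operator.

```python
#!/usr/bin/env python3
"""Sketch of the subnet check, using a plain TCP connect sweep instead of nmap.

The subnet comes from the report; the port list and timeout are assumptions.
The sweep is sequential, so a full /24 can take several minutes.
"""
import ipaddress
import socket

SUBNET = "192.168.222.0/24"   # provisioning network from the report
PORTS = (22, 2379, 6443)      # assumed ports of interest: SSH, etcd, API server
TIMEOUT = 0.5                 # seconds per connection attempt

def responding_hosts(subnet: str, ports=PORTS, timeout=TIMEOUT):
    """Return {ip: [open ports]} for every host that accepts a TCP connection."""
    found = {}
    for ip in ipaddress.ip_network(subnet).hosts():
        open_ports = []
        for port in ports:
            try:
                with socket.create_connection((str(ip), port), timeout=timeout):
                    open_ports.append(port)
            except OSError:
                pass  # closed, filtered, or unreachable
        if open_ports:
            found[str(ip)] = open_ports
    return found

if __name__ == "__main__":
    for ip, ports in responding_hosts(SUBNET).items():
        print(ip, ports)
```

Any address that shows up here and is not one of the 3 masters or 21 workers would point back at an environmental cause.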