Bug 1835376 - Pod Churn Tests lead to TX errors on Geneve interface, kube-scheduler restarts and TLS errors
Summary: Pod Churn Tests lead to TX errors on Geneve interface, kube-scheduler restarts and TLS errors
Keywords:
Status: CLOSED DUPLICATE of bug 1834918
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-13 17:31 UTC by Sai Sindhur Malleni
Modified: 2020-05-18 14:10 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-18 14:10:24 UTC
Target Upstream Version:
Embargoed:



Description Sai Sindhur Malleni 2020-05-13 17:31:20 UTC
Description of problem:

We have a cluster of 3 masters + 21 worker nodes running OCP 4.5 on baremetal. We are running a simple pod churn test. Here's what the test does (a rough sketch of the loop follows the list):

1. Spin up 200 pods in increments of 50 across the cluster in a new namespace
2. Delete namespace and pods
3. Re-run the same workload in a loop
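
Roughly, the loop looks like the following. This is a simplified, hypothetical sketch with placeholder names (churn-test namespace, pause image), not the exact tooling we use:

# Simplified sketch of the pod churn loop (placeholder names and image;
# the real test drives this through our tooling in increments of 50).
for iteration in $(seq 1 10); do
  oc create namespace churn-test
  for batch in 1 2 3 4; do            # 4 x 50 = 200 pods per iteration
    for p in $(seq 1 50); do
      oc run churn-${batch}-${p} -n churn-test \
          --image=k8s.gcr.io/pause:3.2 --restart=Never
    done
    # wait for the current batch to be Running before starting the next one
    oc wait --for=condition=Ready pod --all -n churn-test --timeout=300s
  done
  oc delete namespace churn-test --wait=true
done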

This scenario creates pod churn in the environment while not creating a large number of pods overall. However, by the 3rd-4th iteration of the workload, we see the issues mentioned below. Using OpenShiftSDN, we did not hit these issues despite running the pod churn workload for hundreds of iterations.

While running this workload, we are hitting 3 different issues, all of which "seem" to point to some instability in networking. This is a blanket BZ to track these issues:

High number of TX errors on geneve interfaces: https://bugzilla.redhat.com/show_bug.cgi?id=1834918
Frequent restarts of kubescheduler pods on baremetal deployments: https://bugzilla.redhat.com/show_bug.cgi?id=1834908
Frequent TLS handshake errors causing cluster instability and failure of workloads: https://bugzilla.redhat.com/show_bug.cgi?id=1834914


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741


How reproducible:
100%

Steps to Reproduce:
1. Deploy an OCP cluster with OVN on BM
2. Run the workload as mentioned earlier
3.

Actual results:
After a few iterations, the workload fails (can't launch any more pods) with the three symptoms in the BZs linked above.

Expected results:

Given that OpenShiftSDN was able to run the same workload (200-pod launch and delete) for several hundred iterations in the same environment, this makes networking a suspect.

Additional info:

As additional info, we initially deploy with masters only and then add worker nodes. This leads to the master nodes carrying the worker label (we set them to unschedulable later using oc edit schedulers.config.openshift.io cluster).
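
For reference, the non-interactive equivalent of that edit would be something like the following (assuming the standard Scheduler config field spec.mastersSchedulable):

# Mark masters unschedulable by setting mastersSchedulable to false
# in the cluster-wide Scheduler config (same effect as the oc edit above).
oc patch schedulers.config.openshift.io cluster \
    --type merge -p '{"spec":{"mastersSchedulable":false}}'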

Comment 1 Sai Sindhur Malleni 2020-05-14 18:05:24 UTC
Updated some of the bugs referenced in this BZ with must-gather and additional details. Overall it seems like there are consistent network transmit errors in the Prometheus alerts: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp
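
To spot-check the TX error counters directly on a node (outside of Prometheus), something like the command below can be used; the OVN-Kubernetes tunnel device name is assumed here to be genev_sys_6081, adjust if it differs on the nodes:

# Show interface statistics, including TX errors, for the geneve tunnel
# device on a node (device name assumed to be genev_sys_6081).
oc debug node/<node-name> -- chroot /host ip -s link show genev_sys_6081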

When running any workload, such as creating projects/imagestreams or pods, we see two types of errors on the client side, so this is not restricted to pod churn workloads:


1. Unexpected error:
    <*url.Error | 0xc001bcb560>: {                               
        Op: "Post",                                               
        URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
        Err: {s: "EOF"},                                       
    }                                                          
    Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF
2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout 

In OCP API server logs we keep seeing

ocp-o.txt:324:W0513 18:29:33.496286       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:325:W0513 18:29:37.662837       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:326:W0513 18:29:37.994312       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:327:W0513 18:29:38.022618       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:328:W0513 18:29:38.210862       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:329:W0513 18:29:38.227472       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
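
Those are connection-refused errors against etcd on 192.168.222.10:2379. A quick way to sanity-check etcd from the cluster (assuming cluster-admin access; the pod name below is a placeholder) is something like:

# Check the etcd cluster operator status and the etcd pods on the masters.
oc get clusteroperator etcd
oc -n openshift-etcd get pods -o wide
# Look at recent logs of the etcd member on the affected master
# (<etcd-pod-name> is a placeholder).
oc -n openshift-etcd logs <etcd-pod-name> -c etcd --tail=100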

I made sure we don't have any unexpected IPs/hosts in the baremetal environment by running an nmap scan.

Happy to give access to the environment and help debug further.

Comment 2 Sai Sindhur Malleni 2020-05-14 18:53:28 UTC
Back with another data point:

I tried the same installer build that I am using on baremetal to deploy on AWS with OVNKubernetes at the same scale (3 masters + 21 workers). I'm able to run all my tests there successfully and don't see issues similar to those on baremetal.

A few things I checked to confirm this is not environmental:

1. I ran an nmap across the 192.168.222.0/24 network to confirm there are no unwanted hosts/IPs (see the example command after this list)
2. I made sure any hosts that we are not using in the deployment are powered off (we have some extra hosts that are not in use), so the only hosts powered on are the ones in the cluster (3 masters + 21 workers).
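
(For reference, a host-discovery sweep of that subnet looks something like the command below; the exact nmap invocation may have differed.)

# Ping-sweep the baremetal network to list all hosts that respond,
# without doing a port scan.
nmap -sn 192.168.222.0/24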

Comment 4 Alexander Constantinescu 2020-05-18 14:10:24 UTC

*** This bug has been marked as a duplicate of bug 1834918 ***

