Bug 1835376
| Summary: | Pod Churn Tests lead to TX errors on Geneve interface, kube-scheduler restarts and TLS errors | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | Networking | Assignee: | Ben Bennett <bbennett> |
| Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | aconstan, akamra, dblack, dwilson, jtaleric, mkarg |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-18 14:10:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Sai Sindhur Malleni
2020-05-13 17:31:20 UTC
Updated some of the bugs referenced in this BZ with must-gather and additional details.

Overall it seems like there are consistent network transmit errors in the prometheus alerts: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp (a node-side check of the Geneve TX counters is sketched after this report).

When running any workload like creating projects/imagestreams or pods, we see two types of errors on the client side, so this is not restricted to pod churn workloads (a small reproduction sketch also follows this report):

1. Unexpected error:
   <*url.Error | 0xc001bcb560>: {
       Op: "Post",
       URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
       Err: {s: "EOF"},
   }
   Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

In the OCP API server logs we keep seeing:

ocp-o.txt:324:W0513 18:29:33.496286 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:325:W0513 18:29:37.662837 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:326:W0513 18:29:37.994312 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:327:W0513 18:29:38.022618 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:328:W0513 18:29:38.210862 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:329:W0513 18:29:38.227472 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...

I made sure we don't have any unexpected IPs/hosts in the baremetal environment by running an nmap. Happy to give access to the environment and help debug further.

Back with another datapoint: I tried the same installer build that I am using on baremetal to deploy on AWS with OVNKubernetes at the same scale (3 masters + 21 workers). I'm able to run all my tests there successfully and don't see issues similar to those on baremetal.

A few things I checked to confirm this is not environmental:

1. I ran an nmap across the 192.168.222.0/24 network to verify there are no unwanted hosts/IPs (see the sweep sketch after this report).
2. I made sure any hosts that we are not using in the deployment are powered off (we have some extra hosts that are not being used), so the only hosts powered on are the ones in the cluster (3 masters + 21 workers).

*** This bug has been marked as a duplicate of bug 1834918 ***
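For the network transmit errors flagged in the prometheus alerts, a node-side counter check can confirm whether the Geneve tunnel itself is reporting TX errors or drops. This is a minimal sketch, not part of the original report: it assumes the standard Linux sysfs statistics files are present and that the OVN-Kubernetes Geneve device is named `genev_sys_6081` (adjust to the actual interface on the node).

```python
#!/usr/bin/env python3
"""Minimal sketch: read TX error counters for the Geneve interface from sysfs.

Assumptions (not taken from the bug report): the node exposes
/sys/class/net/<iface>/statistics and the Geneve device is "genev_sys_6081".
"""
from pathlib import Path

IFACE = "genev_sys_6081"  # assumed interface name; adjust for the node
COUNTERS = ("tx_errors", "tx_dropped", "tx_carrier_errors", "tx_fifo_errors")

def read_counter(iface: str, counter: str) -> int:
    # Each statistics file holds a single integer counter.
    path = Path("/sys/class/net") / iface / "statistics" / counter
    return int(path.read_text().strip())

if __name__ == "__main__":
    for name in COUNTERS:
        try:
            print(f"{IFACE} {name}: {read_counter(IFACE, name)}")
        except FileNotFoundError:
            print(f"{IFACE} {name}: interface or counter not present on this node")
```

Running this periodically during a pod churn test would show whether the counters climb while the prometheus alert is firing.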
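The client-side EOF and i/o timeout errors come from POSTs to /api/v1/namespaces/nodevertical0/pods. Below is a rough reproduction sketch using the official Kubernetes Python client; the namespace comes from the reported error output, while the pod spec, image, and pod names are placeholders rather than details from the report.

```python
#!/usr/bin/env python3
"""Rough reproduction sketch for the client-side pod-creation errors.

Assumptions: the "nodevertical0" namespace exists (it appears in the reported
error URL); the pause image and pod names below are placeholders.
"""
from kubernetes import client, config

def create_test_pod(namespace: str, name: str):
    # A tiny pod body; any failure in the POST surfaces as an exception.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "containers": [
                {"name": "pause", "image": "k8s.gcr.io/pause:3.2"}  # placeholder image
            ]
        },
    }
    return client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

if __name__ == "__main__":
    config.load_kube_config()  # uses the current kubeconfig context
    for i in range(50):  # a small burst; the reported failures show up under churn
        try:
            create_test_pod("nodevertical0", f"churn-test-{i}")
        except Exception as exc:  # EOF / timeout from the API server lands here
            print(f"pod churn-test-{i} failed: {exc}")
```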
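The environmental checks describe an nmap sweep of 192.168.222.0/24 to rule out unexpected hosts. A plain TCP connect sweep (not nmap) can serve the same purpose; the sketch below assumes ports 22, 2379 and 6443 are the ones of interest, and comparing the results against the known node list is left to the operator.

```python
#!/usr/bin/env python3
"""Sketch of the subnet check, using a plain TCP connect sweep instead of nmap.

The subnet comes from the report; the port list and timeout are assumptions.
The sweep is sequential, so a full /24 can take several minutes.
"""
import ipaddress
import socket

SUBNET = "192.168.222.0/24"   # provisioning network from the report
PORTS = (22, 2379, 6443)      # assumed ports of interest: SSH, etcd, API server
TIMEOUT = 0.5                 # seconds per connection attempt

def responding_hosts(subnet: str, ports=PORTS, timeout=TIMEOUT):
    """Return {ip: [open ports]} for every host that accepts a TCP connection."""
    found = {}
    for ip in ipaddress.ip_network(subnet).hosts():
        open_ports = []
        for port in ports:
            try:
                with socket.create_connection((str(ip), port), timeout=timeout):
                    open_ports.append(port)
            except OSError:
                pass  # closed, filtered, or unreachable
        if open_ports:
            found[str(ip)] = open_ports
    return found

if __name__ == "__main__":
    for ip, ports in responding_hosts(SUBNET).items():
        print(ip, ports)
```

Any address that shows up here and is not one of the 3 masters or 21 workers would point back at an environmental cause.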