Description of problem:

We have a baremetal deployment of OCP 4.5 using OVNKubernetes as the SDN backend. After the deployment, we run a simple performance test that launches ~250 pods, deletes them, and repeats in a cycle. After a few iterations, the test fails due to TLS issues. More specifically, the client-side error is:

May 11 21:40:33.475: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/cluster/utils.go:176]: Unexpected error:
    <*url.Error | 0xc001bcb560>: {
        Op: "Post",
        URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
        Err: {s: "EOF"},
    }
    Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF
occurred

Grepping for EOF in the kube-apiserver logs:

[kni@e19-h24-b01-fc640 workloads]$ grep -inr EOF *
kube0.txt:562:I0511 21:43:11.176982 1 log.go:172] http: TLS handshake error from [::1]:56852: EOF
kube0.txt:563:I0511 21:43:11.178842 1 log.go:172] http: TLS handshake error from [::1]:57112: EOF
kube0.txt:564:I0511 21:43:11.178989 1 log.go:172] http: TLS handshake error from [::1]:57368: EOF
kube1.txt:562:I0511 21:40:45.470863 1 log.go:172] http: TLS handshake error from [::1]:48544: EOF
kube1.txt:563:I0511 21:40:45.473881 1 log.go:172] http: TLS handshake error from [::1]:47926: EOF

Around the same time we see TLS handshake errors on master-0 and master-1.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-04-113741

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster on BM with OVNKubernetes
2. Launch a few hundred pods in succession using nodevertical tests https://github.com/openshift-scale/workloads/blob/master/workloads/nodevertical.yml

Actual results:
After launching a few hundred pods, we consistently run into TLS issues.

Expected results:
Given that we have ample capacity on the cluster with 21 baremetal workers, we should not be seeing any errors when launching a few hundred pods.

Additional info:
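As a quick way to tell whether the failures are specific to the test harness or visible on the API VIP itself, a probe like the one below can run alongside the pod-churn loop (a rough sketch; the hostname is the one from this report, and /readyz is assumed to be reachable without authentication):

# poll the external API endpoint once a second and log HTTP status code + total request time
while true; do
  curl -sk -o /dev/null -w '%{http_code} %{time_total}\n' https://api.test769.myocp4.com:6443/readyz || echo "request failed"
  sleep 1
done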
Please provide must-gather logs for us to take a look.
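For reference, the standard collection command is roughly the following (a minimal sketch; --dest-dir is optional and the directory name here is just an example):

oc adm must-gather --dest-dir=./must-gather-test769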
I could not find anything specific to API server in the logs around the time of failure. Moving this to OVN to check for issues.
We are constantly seeing alerts from Prometheus about network transmit errors: https://snapshot.raintank.io/dashboard/snapshot/vmqPeuQ3AL8TDkorrC5wqDNe60Ap8tlp
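To pin down which nodes/NICs are reporting the errors behind that alert, the underlying node_exporter counter can be queried directly (a hedged sketch, assuming the default openshift-monitoring stack and a token with access to the Prometheus route):

PROM=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
TOKEN=$(oc whoami -t)
# any interface with a non-zero transmit-error rate over the last 5 minutes
curl -skG "https://${PROM}/api/v1/query" -H "Authorization: Bearer ${TOKEN}" \
  --data-urlencode 'query=rate(node_network_transmit_errs_total[5m]) > 0'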
In the OCP API server logs we keep seeing:

ocp-o.txt:324:W0513 18:29:33.496286 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:325:W0513 18:29:37.662837 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:326:W0513 18:29:37.994312 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:327:W0513 18:29:38.022618 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:328:W0513 18:29:38.210862 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:329:W0513 18:29:38.227472 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:330:W0513 18:29:38.318076 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:331:W0513 18:29:38.373034 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:332:W0513 18:29:38.397214 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:333:W0513 18:29:38.469380 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:334:W0513 18:29:38.512675 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:335:W0513 18:29:38.615945 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:336:W0513 18:29:38.811455 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:337:W0513 18:29:38.937376 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:338:W0513 18:29:39.163924 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:339:W0513 18:29:39.201738 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:340:W0513 18:29:39.217714 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:341:W0513 18:29:39.717367 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
ocp-o.txt:342:W0513 18:29:40.389601 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.222.10:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.222.10:2379: connect: connection refused". Reconnecting...
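Since the apiserver is being refused on 192.168.222.10:2379, it is worth checking etcd member health directly when this happens (a hedged sketch, assuming the 4.x layout where the openshift-etcd pods carry an etcdctl-capable container):

# pick one etcd pod and check cluster membership and endpoint health from inside it
ETCD_POD=$(oc -n openshift-etcd get pods -l app=etcd -o name | head -1)
oc -n openshift-etcd rsh "$ETCD_POD" etcdctl member list -w table
oc -n openshift-etcd rsh "$ETCD_POD" etcdctl endpoint health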
This happens more with OVNKubernetes, but we have seen it with OpenShiftSDN too, although not as frequently. There seems to be an API server issue here, and the fix https://github.com/openshift/machine-config-operator/commit/022933c07a4e37bed097f1cd1fa4cd2d637decc0#diff-5ba89d3d43c54f0b6f021120bef55bd2 did not help either.
Moving this back to API Server per Sai's comment.
As another data point, I redeployed the environment with OpenShiftSDN and no longer see this failure. At this point, I've done enough testing to say that this happens consistently on BM deployments using OVN, but either doesn't happen at all or happens *very rarely* when using OpenShiftSDN.
Even on a 4.4 cluster, we see this when running our mastervertical tests (which load the control plane with resources like deployments/services/namespaces):

error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
[] <nil> 0xc000e9f260 exit status 1 <nil> <nil> true [0xc00018a2f0 0xc00018a378 0xc00018a378] [0xc00018a2f0 0xc00018a378] [0xc00018a338 0xc00018a368] [0x9f1ea0 0x9f1fd0] 0xc001663020 <nil>}:
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
Some relevant messages in the kube-apiserver logs around the time the client saw the error...

E0601 14:48:00.761233 1 writers.go:105] apiserver was unable to write a JSON response: http: Handler timeout
E0601 14:48:00.761261 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E0601 14:48:00.762327 1 writers.go:118] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0601 14:48:00.763409 1 trace.go:116] Trace[17239209]: "Create" url:/apis/authentication.k8s.io/v1beta1/tokenreviews,user-agent:kube-rbac-proxy/v0.0.0 (linux/amd64) kubernetes/$Format,client:192.168.222.123 (started: 2020-06-01 14:44:33.399680067 +0000 UTC m=+1284.329956867) (total time: 3m27.363705818s):
Trace[17239209]: [3m27.361030956s] [3m27.361030956s] About to convert to expected version
So the client sees the errors around 14:49:11:

Jun 1 14:48:37.431: INFO: Running 'oc --namespace=mastervertical0 --config= process -f /root/workload/build-config-template.yaml -p IDENTIFIER=0'
Jun 1 14:48:41.553: INFO: Running 'oc --namespace=mastervertical0 --config= create -f /tmp/mastervertical0-config.json'
Jun 1 14:49:11.634: INFO: Error running &{/usr/bin/oc [oc --namespace=mastervertical0 --config= create -f /tmp/mastervertical0-config.json] []
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
[] <nil> 0xc000e9f260 exit status 1 <nil> <nil> true [0xc00018a2f0 0xc00018a378 0xc00018a378] [0xc00018a2f0 0xc00018a378] [0xc00018a338 0xc00018a368] [0x9f1ea0 0x9f1fd0] 0xc001663020 <nil>}:
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
error: unable to recognize "/tmp/mastervertical0-config.json": Get https://api.test652.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout
Jun 1 14:49:11.634: INFO: Unable to create template objects. Error: exit status 1

and the errors in the kube-apiserver logs are around 14:48:00:

I0601 14:47:47.846810 1 trace.go:116] Trace[1042896261]: "Create" url:/apis/authentication.k8s.io/v1beta1/tokenreviews,user-agent:kube-rbac-proxy/v0.0.0 (linux/amd64) kubernetes/$Format,client:192.168.222.83 (started: 2020-06-01 14:47:47.287923766 +0000 UTC m=+1478.218200564) (total time: 558.856282ms):
Trace[1042896261]: [558.168242ms] [558.168242ms] About to convert to expected version
E0601 14:48:00.761233 1 writers.go:105] apiserver was unable to write a JSON response: http: Handler timeout
E0601 14:48:00.761261 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E0601 14:48:00.762327 1 writers.go:118] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0601 14:48:00.763409 1 trace.go:116] Trace[17239209]: "Create" url:/apis/authentication.k8s.io/v1beta1/tokenreviews,user-agent:kube-rbac-proxy/v0.0.0 (linux/amd64) kubernetes/$Format,client:192.168.222.123 (started: 2020-06-01 14:44:33.399680067 +0000 UTC m=+1284.329956867) (total time: 3m27.363705818s):

To restate: we have now seen this error on both 4.5 and 4.4, but more frequently on 4.5. There also seems to be some correlation with the networking backend used (OpenShiftSDN vs. OVN). To verify that this is not related to our tooling/workloads: we can run the same workload without issues on the public cloud; this only happens on baremetal deployments.
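For the next run it may help to pull the same time window from every master's kube-apiserver at once, so the client-side 14:48-14:49 window can be lined up with the handler timeouts (a hedged sketch, assuming the standard openshift-kube-apiserver namespace and pod labels):

# dump the window from 14:44 onward on every apiserver instance and keep only timeouts/traces
for pod in $(oc -n openshift-kube-apiserver get pods -l apiserver -o name); do
  echo "== $pod =="
  oc -n openshift-kube-apiserver logs -c kube-apiserver "$pod" --since-time=2020-06-01T14:44:00Z \
    | grep -E 'Handler timeout|Trace\['
done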
Sai,

Could you try this fix? It modifies the health check in the LB config.
https://github.com/openshift/machine-config-operator/pull/1785/files

Also, your recent comment does not show TLS errors. Are you seeing a different error in your test, or are you seeing both errors?
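To check whether that change is in place on a given build, the rendered health-check stanza can be inspected directly on a master (a hedged sketch; the /etc/haproxy/haproxy.cfg path is what the on-prem MCO templates render, so adjust if it differs on this deployment):

oc debug node/master-0 -- chroot /host grep -B2 -A8 'check' /etc/haproxy/haproxy.cfg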
I'm seeing both TLS errors and i/o timeouts from the API, for example:

1. Unexpected error:
    <*url.Error | 0xc001bcb560>: {
        Op: "Post",
        URL: "https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods",
        Err: {s: "EOF"},
    }
    Post https://api.test769.myocp4.com:6443/api/v1/namespaces/nodevertical0/pods: EOF

2. Get https://api.test714.myocp4.com:6443/api?timeout=32s: dial tcp 192.168.222.3:6443: i/o timeout

Siva, can you confirm which 4.5 nightly has this fix?
Also, is this expected to be backported to 4.4?
Thanks for clarifying. The 4.5 PR is not yet merged into master; e2e is still running: https://github.com/openshift/machine-config-operator/pull/1787

Here is the 4.5 BZ with the PR linked: https://bugzilla.redhat.com/show_bug.cgi?id=1844384

It can probably be ported to 4.4; I will confirm this with Stefan on Monday.
Hi Venkata,

Any updates? This is blocking the scale team, and we need the fixes in 4.4.z, please.
Hi Rashid,

Sorry, I don't have a good answer for you. It looks like I need to gather more data to make a case for porting this change to 4.4. This is a high-priority item for me this week, and I will try to unblock the team.
Sai and I have been putting together data points for the platform and network backend combinations; these are his observations:

AWS + OpenShiftSDN  --> Test passes
AWS + OVNKubernetes --> Test passes
BM  + OpenShiftSDN  --> Test passes
BM  + OVNKubernetes --> Test fails

So I am not sure those changes are enough, given that Sai is not seeing the issue with BM + OpenShiftSDN.
Moving this to the OVN team to take a look, since the BM + OpenShiftSDN test passes.
Probably related: https://bugzilla.redhat.com/show_bug.cgi?id=1844576
Mike Cambria and Guillaume are actively working on this bug as a very high priority. Sai has offered their cluster, thanks. We have also reproduced it on AWS. I will ask Mike to update the BZ regularly.
We've been updating https://bugzilla.redhat.com/show_bug.cgi?id=1844576 with what we have seen so far. We can reproduce packet ordering issues with OVNKubernetes (Geneve) on BM, GCP, and AWS. The same test, with the same versions of OpenShift and OVS but using OpenShiftSDN (VXLAN), does not show the problem on GCP or AWS.
Packet ordering issue understood (see https://bugzilla.redhat.com/show_bug.cgi?id=1844576), and further places for improvement identified. I will set up the original load test on BM on Monday to try to capture TCP flows while the TLS errors are taking place, to see whether these still need to be addressed or whether they can be explained by the known TCP ordering problem.
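One way to capture those flows while the load test runs (a hedged sketch; the interface choice and the 600-second window are assumptions, and UDP 6081 is included to also catch the Geneve overlay traffic):

oc debug node/worker000 -- chroot /host \
  timeout 600 tcpdump -nn -i any -w /tmp/api-6443.pcap 'tcp port 6443 or udp port 6081'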
I think I know what's going on. The test has the pod connecting to the apiserver on the HOST network. To get from the OVN overlay to the host network, OVN(?) will NAT. <g>

The failing pod ran on worker000:

[kni@e19-h24-b01-fc640 ~]$ oc get pods -o wide
NAME                            READY   STATUS   RESTARTS   AGE    IP           NODE        NOMINATED NODE   READINESS GATES
scale-ci-mastervertical-dbfth   0/1     Error    0          111m   10.131.0.4   worker000   <none>           <none>
[kni@e19-h24-b01-fc640 ~]$

Things work at least once (probably more). Eventually things fail because the SYN/ACK from the apiserver to the pod doesn't have the dstIP translated back to the pod IP address.

Here is the SYN from the pod:

14:29:00.887656 IP (tos 0x0, ttl 64, id 29681, offset 0, flags [DF], proto TCP (6), length 60)
    10.131.0.4.44218 > 192.168.222.3.6443: Flags [S], cksum 0xa961 (incorrect -> 0xaccf), seq 1184011679, win 27200, options [mss 1360,sackOK,TS val 23127209 ecr 0,nop,wscale 7], length 0
14:29:00.887705 IP (tos 0x0, ttl 62, id 29681, offset 0, flags [DF], proto TCP (6), length 60)
    169.254.33.2.44218 > 192.168.222.3.6443: Flags [S], cksum 0x69db (incorrect -> 0xec55), seq 1184011679, win 27200, options [mss 1360,sackOK,TS val 23127209 ecr 0,nop,wscale 7], length 0
14:29:00.887724 IP (tos 0x0, ttl 61, id 29681, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.222.13.44218 > 192.168.222.3.6443: Flags [S], cksum 0x3d91 (incorrect -> 0x18a0), seq 1184011679, win 27200, options [mss 1360,sackOK,TS val 23127209 ecr 0,nop,wscale 7], length 0

The SYN/ACK from the apiserver:

14:29:00.887860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.222.3.6443 > 192.168.222.13.44218: Flags [S.], cksum 0x1781 (correct), seq 1512979706, ack 1184011680, win 28960, options [mss 1460,sackOK,TS val 3465793434 ecr 23111964,nop,wscale 7], length 0
14:29:00.887888 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.222.3.6443 > 169.254.33.2.44218: Flags [S.], cksum 0xeb36 (correct), seq 1512979706, ack 1184011680, win 28960, options [mss 1460,sackOK,TS val 3465793434 ecr 23111964,nop,wscale 7], length 0
14:29:00.887903 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.222.3.6443 > 169.254.33.2.44218: Flags [S.], cksum 0xeb36 (correct), seq 1512979706, ack 1184011680, win 28960, options [mss 1460,sackOK,TS val 3465793434 ecr 23111964,nop,wscale 7], length 0

This last 169.254.33.2 should have been converted to the pod IP of 10.131.0.4. When things work, this dstIP conversion takes place. After 5 retries, the client gives up.

I'm not able to "oc rsh scale-ci-mastervertical-dbfth" to try to connect manually. I started other pods on worker000 and they can connect to the apiserver without a problem; all of the NAT that is needed is performed.
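To confirm whether it is the reverse translation state that goes missing when this happens, the flow can be checked on the worker and the gateway-router NAT rules listed from an ovnkube-master pod (a hedged sketch; the GR_worker000 name follows the usual ovn-kubernetes GR_<node> convention, and the nbdb container choice is an assumption):

# host-side conntrack entry for the failing flow from the pod to the API VIP
oc debug node/worker000 -- chroot /host conntrack -L -p tcp --dport 6443 --src 10.131.0.4
# NAT rules configured on the node's gateway router
OVN_POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o name | head -1)
oc -n openshift-ovn-kubernetes exec "$OVN_POD" -c nbdb -- ovn-nbctl lr-nat-list GR_worker000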
You cannot oc rsh since the pod has already gone into Error. We might need to spin it up again; let me know over chat if we should do that.
Test passes on v4.5. The packet capture shows the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1826769, which also passes on 4.5. Marking as a duplicate.
*** This bug has been marked as a duplicate of bug 1826769 ***