Bug 1762137 - Get error: "TLS handshake timeout"
Summary: Get error: "TLS handshake timeout"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Stefan Schimanski
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-16 02:27 UTC by zhou ying
Modified: 2020-01-23 11:08 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:07:48 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:08:15 UTC)

Description zhou ying 2019-10-16 02:27:48 UTC
Description of problem:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/85

Version-Release number of selected component (if applicable):
release-openshift-ocp-installer-e2e-aws-proxy-4.2


[sig-apps] CronJob should delete successful finished jobs with limit of one successful job [Suite:openshift/conformance/parallel] [Suite:k8s]  17s
fail [k8s.io/kubernetes/test/e2e/e2e.go:104]: Unexpected error:
    <*url.Error | 0xc002e3a540>: {
        Op: "Get",
        URL: "https://api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0",
        Err: {},
    }
    Get https://api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: net/http: TLS handshake timeout
occurred
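
For reference, the failing GET corresponds to a plain client-go node list with a field selector. A minimal sketch of the equivalent call (written against a recent client-go; the kubeconfig path and variable names are illustrative, not taken from the test code):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Illustrative kubeconfig path; the e2e framework builds its config differently.
        config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        client, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }
        // Issues GET /api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0,
        // the same request that times out in the failures above.
        nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
            FieldSelector:   "spec.unschedulable=false",
            ResourceVersion: "0",
        })
        if err != nil {
            panic(err) // surfaces here as a *url.Error: "net/http: TLS handshake timeout"
        }
        fmt.Printf("%d schedulable nodes\n", len(nodes.Items))
    }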

[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-bindmounted] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with defaults [Suite:openshift/conformance/parallel] [Suite:k8s]  12s
fail [k8s.io/kubernetes/test/e2e/e2e.go:104]: Unexpected error:
    <*url.Error | 0xc003670330>: {
        Op: "Get",
        URL: "https://api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0",
        Err: {},
    }
    Get https://api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes?fieldSelector=spec.unschedulable%3Dfalse&resourceVersion=0: net/http: TLS handshake timeout
occurred
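
The error string itself comes from Go's HTTP transport: net/http.Transport enforces TLSHandshakeTimeout (10s on http.DefaultTransport, which client-go inherits) and returns "net/http: TLS handshake timeout" wrapped in a *url.Error when the TCP connection succeeds but the TLS handshake does not complete in time. A minimal sketch of how that failure shape arises (the target URL and the deliberately tiny timeout are illustrative):

    package main

    import (
        "errors"
        "fmt"
        "net/http"
        "net/url"
        "time"
    )

    func main() {
        client := &http.Client{
            Transport: &http.Transport{
                // http.DefaultTransport uses 10 * time.Second; an absurdly small
                // value forces the same failure mode against any HTTPS endpoint.
                TLSHandshakeTimeout: 1 * time.Millisecond,
            },
        }
        _, err := client.Get("https://example.com/")
        var uerr *url.Error
        if errors.As(err, &uerr) {
            // Prints the same three fields seen in the failures above:
            // Op "Get", the request URL, and "net/http: TLS handshake timeout".
            fmt.Println(uerr.Op, uerr.URL, uerr.Err)
        }
    }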

Oct 15 21:29:23.655 E ns/openshift-ingress pod/router-prometheus-897669695-xhrnh node/ip-10-0-142-143.ec2.internal container=router container exited with code 2 (Error):
go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:11.125738       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:16.140130       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:21.135372       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:26.139040       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:32.626815       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:37.605538       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:42.605837       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:47.621741       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:52.608262       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:28:57.619228       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:29:02.605424       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:29:07.606204       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:29:13.774262       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).
I1015 21:29:18.771430       1 router.go:561] Router reloaded: - Checking http://localhost:80 ... - Health check ok : 0 retry attempt(s).

Comment 1 Daneyon Hansen 2019-10-22 00:27:09 UTC
Similar to [1], I believe the test failure is due to a lack of resource capacity for the proxy created by the e2e-aws-proxy job. Eric Wolinetz has since increased that proxy's resource capacity. Please retest and update the bug with your findings.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1761677

Comment 2 Dan Mace 2019-10-31 20:01:05 UTC
Ingress doesn't own or manage the API load balancer; that's created by the installer. The similarity to https://bugzilla.redhat.com/show_bug.cgi?id=1765276 is interesting.

Moving over to the installer team. Could be the apiserver endpoint (as is possible in https://bugzilla.redhat.com/show_bug.cgi?id=1765276) or a networking issue.

Comment 3 Abhinav Dahiya 2019-11-04 19:58:39 UTC
The error is from hitting the kube-apiserver.

Comment 6 Daneyon Hansen 2019-11-11 21:54:16 UTC
It appears that the external apiserver URL is being used by the two failing tests. [1] removed the external apiserver from the default noProxy list. `proxyconnect` does not appear in either "TLS handshake timeout" failure, so the calls are not being proxied as expected (see the sketch after the links below). [2] was recently merged to revert [1]. Can you rerun the test with a payload that includes [2] and report back?

[1] https://github.com/openshift/cluster-network-operator/pull/328
[2] https://github.com/openshift/cluster-network-operator/pull/388
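
For context on the proxyconnect remark: when a Go client dials through a proxy and the CONNECT to the proxy fails, the transport wraps the error as a net.OpError with Op "proxyconnect", so that string shows up in the *url.Error text; its absence is why the failures above read as direct, unproxied connections. A minimal sketch of the noProxy decision using golang.org/x/net/http/httpproxy (the proxy host is illustrative; the apiserver host is taken from the failures above):

    package main

    import (
        "fmt"
        "net/url"

        "golang.org/x/net/http/httpproxy"
    )

    func main() {
        cfg := &httpproxy.Config{
            HTTPSProxy: "http://proxy.example.com:3128", // illustrative
            // Hosts listed here are dialed directly, bypassing the proxy.
            // [1] dropped the external apiserver from this list; [2] reverts that.
            NoProxy: "api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com",
        }
        proxyFor := cfg.ProxyFunc()
        target, _ := url.Parse("https://api.ci-op-pxrdxk91-9c5bf.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/nodes")
        proxyURL, err := proxyFor(target)
        if err != nil {
            panic(err)
        }
        // A nil result means a direct connection, so "proxyconnect" can never
        // appear in its errors; non-nil means the request is tunneled via the proxy.
        if proxyURL == nil {
            fmt.Println("direct connection to apiserver")
        } else {
            fmt.Println("tunneled via", proxyURL)
        }
    }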

Comment 7 Michal Fojtik 2019-11-21 12:34:04 UTC
Moving to MODIFIED to rerun the test and verify this as fixed.

Comment 9 zhou ying 2019-11-26 07:51:16 UTC
Built a cluster on AWS with a proxy and ran the automation locally; the issue could not be reproduced.

[zhouying@dhcp-140-138 origin]$ openshift-tests run-test "[sig-apps] CronJob should delete successful/failed finished jobs with limit of one job [Suite:openshift/conformance/parallel] [Suite:k8s]"
......
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-apps] CronJob
  /home/golang/src/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/apps/cronjob.go:55
[It] should delete successful/failed finished jobs with limit of one job [Suite:openshift/conformance/parallel] [Suite:k8s]
  /home/golang/src/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/apps/cronjob.go:233
STEP: Creating a AllowConcurrent cronjob with custom successful-jobs-history-limit
STEP: Ensuring a finished job exists
STEP: Ensuring a finished job exists by listing jobs explicitly
STEP: Ensuring this job and its pods does not exist anymore
STEP: Ensuring there is 1 finished job by listing jobs explicitly
STEP: Removing cronjob
STEP: Creating a AllowConcurrent cronjob with custom failed-jobs-history-limit
STEP: Ensuring a finished job exists
STEP: Ensuring a finished job exists by listing jobs explicitly
STEP: Ensuring this job and its pods does not exist anymore
STEP: Ensuring there is 1 finished job by listing jobs explicitly
STEP: Removing cronjob
[AfterEach] [sig-apps] CronJob
  /home/golang/src/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:152
Nov 26 15:23:31.096: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-cronjob-2666" for this suite.
Nov 26 15:23:32.372: INFO: Running AfterSuite actions on all nodes
Nov 26 15:23:32.372: INFO: Running AfterSuite actions on node 1

Comment 11 errata-xmlrpc 2020-01-23 11:07:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

