Created attachment 1513908 [details]
"TLS handshake error" in grafana-proxy container logs

Description of problem:
This bug is cloned from https://jira.coreos.com/browse/MON-495. Filing it again so the QE team can track the monitoring issue in Bugzilla.

After deploying cluster monitoring with the new installer on AWS, the grafana-proxy container logs many TLS errors such as:

2018/12/13 06:05:52 server.go:2753: http: TLS handshake error from 10.131.0.1:34500: EOF
2018/12/13 06:06:02 server.go:2753: http: TLS handshake error from 10.131.0.1:34540: EOF
2018/12/13 06:06:12 server.go:2753: http: TLS handshake error from 10.131.0.1:34584: EOF

The errors do not appear to affect functionality:

$ oc -n openshift-monitoring get pod | grep grafana
grafana-58456d859d-hcmj2   2/2   Running   0   1h

Checked on the node where the grafana pod runs; the connections are in TIME_WAIT:

$ netstat -anlp | grep 3000
tcp   0   0   10.131.0.1:39376   10.131.0.4:3000   TIME_WAIT   -
tcp   0   0   10.131.0.1:39428   10.131.0.4:3000   TIME_WAIT   -
tcp   0   0   10.131.0.1:39556   10.131.0.4:3000   TIME_WAIT   -
tcp   0   0   10.131.0.1:39596   10.131.0.4:3000   TIME_WAIT   -
tcp   0   0   10.131.0.1:39470   10.131.0.4:3000   TIME_WAIT   -
tcp   0   0   10.131.0.1:39510   10.131.0.4:3000   TIME_WAIT   -

Version-Release number of selected component (if applicable):
docker.io/grafana/grafana:5.2.4
docker.io/openshift/oauth-proxy:v1.1.0
docker.io/openshift/prometheus-alertmanager:v0.15.2
docker.io/openshift/prometheus-node-exporter:v0.16.0
docker.io/openshift/prometheus:v2.5.0
quay.io/coreos/configmap-reload:v0.0.1
quay.io/coreos/kube-rbac-proxy:v0.4.0
quay.io/coreos/kube-state-metrics:v1.4.0
quay.io/coreos/prom-label-proxy:v0.1.0
quay.io/coreos/prometheus-config-reloader:v0.26.0
quay.io/coreos/prometheus-operator:v0.26.0
quay.io/openshift/origin-configmap-reload:v3.11
quay.io/openshift/origin-telemeter:v4.0
quay.io/surbania/k8s-prometheus-adapter-amd64:326bf3c
quay.io/openshift-release-dev/ocp-v4.0@sha256:4f94db8849ed915994678726680fc39bdb47722d3dd570af47b666b0160602e5

How reproducible:
Always

Steps to Reproduce:
1. Deploy cluster monitoring with the new installer on AWS

Actual results:
The grafana-proxy container logs a "TLS handshake error ... EOF" line every 10 seconds.

Expected results:
No recurring TLS handshake errors in the grafana-proxy logs.

Additional info:
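The log entries arrive at 10-second intervals and always come from the node address (10.131.0.1), which is consistent with a periodic TCP-level health check (for example a kubelet tcpSocket probe or a load-balancer check) opening a connection to the proxy's TLS port and closing it without sending a ClientHello. A minimal, self-contained Go sketch (hypothetical code, not taken from oauth-proxy or the kubelet) that reproduces the exact "http: TLS handshake error ... EOF" log line:

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"log"
	"math/big"
	"net"
	"net/http"
	"time"
)

// selfSignedCert builds a throwaway in-memory certificate so the
// example is self-contained.
func selfSignedCert() tls.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		log.Fatal(err)
	}
	tmpl := x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "localhost"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}
}

func main() {
	srv := &http.Server{
		Addr:      "127.0.0.1:3000",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{selfSignedCert()}},
	}
	go srv.ListenAndServeTLS("", "") // cert and key come from TLSConfig
	time.Sleep(200 * time.Millisecond)

	// 1. What a tcpSocket probe does: connect, then close immediately.
	//    The server logs "http: TLS handshake error from ...: EOF".
	if conn, err := net.Dial("tcp", "127.0.0.1:3000"); err == nil {
		conn.Close()
	}

	// 2. A client that completes the TLS handshake before closing
	//    produces no such log line.
	cfg := &tls.Config{InsecureSkipVerify: true} // self-signed cert
	if conn, err := tls.Dial("tcp", "127.0.0.1:3000", cfg); err == nil {
		conn.Close()
	}

	time.Sleep(200 * time.Millisecond) // give the server time to log
}

Running the sketch prints a single "http: TLS handshake error from 127.0.0.1:...: EOF" line to stderr for the bare TCP close, and nothing for the completed handshake, matching the pattern in the attached logs.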
I suspect this will be fixed by https://github.com/openshift/installer/pull/924
I can confirm that https://github.com/openshift/installer/pull/924 does not fix this. I installed a cluster with that change included (I used v0.12.0), and the error still occurs despite the ${var.cluster_name}-api-int LB target group having these health check parameters:

Protocol: HTTPS
Path: /healthz
Port: 6443
Healthy threshold: 3
Unhealthy threshold: 3
Timeout: 10
Interval: 10
Success codes: 200-399
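For reference, those parameters describe a probe equivalent to the following Go sketch (hypothetical code, not the installer's or AWS's implementation; the host name is a placeholder). Note that this check targets the API servers on port 6443, while the errors in this bug come from connections to grafana-proxy on port 3000:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// ELB-style health checks do not validate the target's serving
	// certificate, hence InsecureSkipVerify here.
	client := &http.Client{
		Timeout: 10 * time.Second, // "Timeout: 10" from the target group
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	// "api-int.example.com" is a placeholder for the internal API endpoint.
	resp, err := client.Get("https://api-int.example.com:6443/healthz")
	if err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	defer resp.Body.Close()
	// "Success codes: 200-399" from the target group configuration.
	fmt.Println("healthy:", resp.StatusCode >= 200 && resp.StatusCode < 400)
}

Because this check completes a full TLS handshake (and a full HTTP request), it is not the kind of client that leaves "TLS handshake error ... EOF" behind.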
Not fixed, still seeing the error:

# oc -n openshift-monitoring logs grafana-78765ddcc7-7n8zz -c grafana-proxy
....................................................
2019/02/21 03:20:41 server.go:2923: http: TLS handshake error from 10.128.2.1:60838: EOF
2019/02/21 03:20:51 server.go:2923: http: TLS handshake error from 10.128.2.1:60892: EOF
2019/02/21 03:21:01 server.go:2923: http: TLS handshake error from 10.128.2.1:60946: EOF
2019/02/21 03:21:11 server.go:2923: http: TLS handshake error from 10.128.2.1:32874: EOF
2019/02/21 03:21:21 server.go:2923: http: TLS handshake error from 10.128.2.1:32928: EOF
2019/02/21 03:21:31 server.go:2923: http: TLS handshake error from 10.128.2.1:32984: EOF
2019/02/21 03:21:41 server.go:2923: http: TLS handshake error from 10.128.2.1:33036: EOF
2019/02/21 03:21:51 server.go:2923: http: TLS handshake error from 10.128.2.1:33088: EOF
2019/02/21 03:22:01 server.go:2923: http: TLS handshake error from 10.128.2.1:33170: EOF
2019/02/21 03:22:11 server.go:2923: http: TLS handshake error from 10.128.2.1:33224: EOF
2019/02/21 03:22:21 server.go:2923: http: TLS handshake error from 10.128.2.1:33276: EOF
2019/02/21 03:22:31 server.go:2923: http: TLS handshake error from 10.128.2.1:33342: EOF
2019/02/21 03:22:41 server.go:2923: http: TLS handshake error from 10.128.2.1:33542: EOF
2019/02/21 03:22:51 server.go:2923: http: TLS handshake error from 10.128.2.1:33626: EOF
2019/02/21 03:23:01 server.go:2923: http: TLS handshake error from 10.128.2.1:33710: EOF
2019/02/21 03:23:11 server.go:2923: http: TLS handshake error from 10.128.2.1:33784: EOF
2019/02/21 03:23:21 server.go:2923: http: TLS handshake error from 10.128.2.1:33868: EOF
2019/02/21 03:23:31 server.go:2923: http: TLS handshake error from 10.128.2.1:33940: EOF
2019/02/21 03:23:41 server.go:2923: http: TLS handshake error from 10.128.2.1:34010: EOF
2019/02/21 03:23:51 server.go:2923: http: TLS handshake error from 10.128.2.1:34084: EOF

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         24h     Cluster version is 4.0.0-0.nightly-2019-02-20-194410
This was observed in 4.4. It is still worth investigating, in my opinion: why waste customers' storage on these log entries, when looking into the error might uncover more serious issues?
Tested with 4.4.0-0.nightly-2020-01-24-141203; the issue is fixed:

# oc -n openshift-monitoring logs grafana-bbb6fcc-qf2j4 -c grafana-proxy
2020/01/26 23:42:28 provider.go:118: Defaulting client-id to system:serviceaccount:openshift-monitoring:grafana
2020/01/26 23:42:28 provider.go:123: Defaulting client-secret to service account token /var/run/secrets/kubernetes.io/serviceaccount/token
2020/01/26 23:42:28 provider.go:311: Delegation of authentication and authorization to OpenShift is enabled for bearer tokens and client certificates.
2020/01/26 23:42:28 oauthproxy.go:200: mapping path "/" => upstream "http://localhost:3001/"
2020/01/26 23:42:28 oauthproxy.go:221: compiled skip-auth-regex => "^/metrics"
2020/01/26 23:42:28 oauthproxy.go:227: OAuthProxy configured for Client ID: system:serviceaccount:openshift-monitoring:grafana
2020/01/26 23:42:28 oauthproxy.go:237: Cookie settings: name:_oauth_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> refresh:disabled
2020/01/26 23:42:28 http.go:96: HTTPS: listening on [::]:3000
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581