Description of problem:
We're suddenly observing the following sort of log entry in etcd member logs 4 times per second:

2020-07-08 14:16:57.396288 I | embed: rejected connection from "10.0.229.6:57222" (error "EOF", ServerName "")
2020-07-08 14:16:57.494467 I | embed: rejected connection from "10.0.181.60:37336" (error "EOF", ServerName "")
2020-07-08 14:16:57.810433 I | embed: rejected connection from "10.0.161.1:47454" (error "EOF", ServerName "")
2020-07-08 14:16:57.810806 I | embed: rejected connection from "[::1]:38488" (error "EOF", ServerName "")

This is new to 4.6. These entries are suspicious and should be explained to prove there's no functional regression somewhere in the stack.

Version-Release number of selected component (if applicable):

How reproducible:
Launch a new cluster and observe the etcd member logs.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
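To pull the entries from all members at once, something like the following works (a sketch; it assumes the etcd pods carry the etcd label in the openshift-etcd namespace, as used in the verification later in this bug):

# Dump the rejected-connection entries from every etcd member
for pod in $(oc get pods -n openshift-etcd -l etcd -o name); do
  oc logs -n openshift-etcd $pod -c etcd | grep 'embed: rejected connection'
done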
The following procedure stops the connection attempts:

oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
oc scale --replicas 0 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
oc delete -n openshift-kube-apiserver podnetworkconnectivitycheck --all

Then, this procedure starts the connection attempts back up:

oc scale --replicas 1 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator

So the podnetworkconnectivitycheck mechanism is responsible for the connections.
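For anyone following along, the check objects driving the probes can also be listed directly before deleting them (these are the same resources removed above; names vary per cluster):

oc get podnetworkconnectivitycheck -n openshift-kube-apiserver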
The errors in the logs are a result of the podnetworkconnectivitycheck dialing the etcd endpoint without presenting a client certificate, which the server listener specifies as required.

Given:
1. We don't want to set up mTLS in the probe, and
2. We don't want to push an upstream change to etcd's TLS client cert settings without strong justification

Here's a plan of action:
1. Tolerate the spam for now even though it's annoying and scary to users
2. Try to configure etcd logging to suppress the noise if possible
3. Switch the probe to use the hopefully forthcoming HTTP health check endpoint [1], which removes the client certificate requirement altogether

[1] https://github.com/etcd-io/etcd/issues/11993
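The 'error "EOF", ServerName ""' signature itself is easy to reproduce: any client that opens a TCP connection to the TLS port and closes it without completing a handshake produces the same entry. A minimal sketch, assuming 10.0.0.3 is a placeholder etcd member address reachable from where you run it:

# Open a raw TCP connection to the etcd client port and close it immediately,
# before any TLS handshake bytes are sent; the listener logs this as
#   embed: rejected connection from "..." (error "EOF", ServerName "")
bash -c ': </dev/tcp/10.0.0.3/2379'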
*** Bug 1857190 has been marked as a duplicate of this bug. ***
For the verifier: when verifying the kube-apiserver (KAS) PR for this bug, please also double-check the closed openshift-apiserver (OAS) log bug 1857190 to ensure that spam disappears as well. Thanks.
First, let's confirm that the fix is already in the latest payload.

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-12-155346 | grep kube-apiserver
cluster-kube-apiserver-operator  https://github.com/openshift/cluster-kube-apiserver-operator  57a1aa9e336d93876b7c4291431d603a0dd71abe
$ git log --date local --pretty="%h %an %cd - %s" 57a1aa9e | grep '#901'
3d4b6e98 OpenShift Merge Robot Thu Jul 16 13:34:16 2020 - Merge pull request #901 from sanchezl/point-to-point

The fix is in.

- Check the etcd logs:

$ pods=$(oc get pods -n openshift-etcd -l etcd -o name)
$ for pod in $pods; do oc logs -n openshift-etcd $pod -c etcd | grep 'embed: rejected connection'; done
2020-08-13 03:16:25.783603 I | embed: rejected connection from "10.0.0.5:42458" (error "read tcp 10.0.0.3:2379->10.0.0.5:42458: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905702 I | embed: rejected connection from "10.0.0.4:36848" (error "read tcp 10.0.0.3:2380->10.0.0.4:36848: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905831 I | embed: rejected connection from "10.0.0.5:46002" (error "read tcp 10.0.0.3:2380->10.0.0.5:46002: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905882 I | embed: rejected connection from "10.0.0.4:36850" (error "read tcp 10.0.0.3:2380->10.0.0.4:36850: use of closed network connection", ServerName "")
2020-08-13 05:00:31.198637 I | embed: rejected connection from "10.129.0.19:57252" (error "EOF", ServerName "")
2020-08-13 08:08:31.504605 I | embed: rejected connection from "10.129.0.19:34678" (error "EOF", ServerName "")
2020-08-13 03:11:25.765127 I | embed: rejected connection from "10.0.0.3:38382" (error "EOF", ServerName "")
2020-08-13 03:11:25.790598 I | embed: rejected connection from "10.0.0.3:38408" (error "EOF", ServerName "")
2020-08-13 03:17:32.144609 I | embed: rejected connection from "10.0.0.3:55652" (error "EOF", ServerName "")
2020-08-13 03:18:40.267284 I | embed: rejected connection from "10.0.0.3:45858" (error "EOF", ServerName "")
2020-08-13 03:25:03.553195 I | embed: rejected connection from "10.0.0.5:35680" (error "EOF", ServerName "")
2020-08-13 03:25:04.720663 I | embed: rejected connection from "10.0.0.3:44612" (error "EOF", ServerName "")
2020-08-13 03:26:25.094808 I | embed: rejected connection from "10.0.0.3:35020" (error "read tcp 10.0.0.4:2380->10.0.0.3:35020: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094844 I | embed: rejected connection from "10.0.0.3:35022" (error "read tcp 10.0.0.4:2380->10.0.0.3:35022: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094858 I | embed: rejected connection from "10.0.0.5:46782" (error "read tcp 10.0.0.4:2380->10.0.0.5:46782: use of closed network connection", ServerName "")
2020-08-13 03:26:25.096445 I | embed: rejected connection from "10.0.0.5:46780" (error "set tcp 10.0.0.4:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:37.171078 I | embed: rejected connection from "10.0.0.3:33226" (error "EOF", ServerName "")
2020-08-13 03:10:52.559597 I | embed: rejected connection from "10.0.0.4:33550" (error "EOF", ServerName "")
2020-08-13 03:11:01.301711 I | embed: rejected connection from "10.0.0.4:59838" (error "read tcp 10.0.0.5:2379->10.0.0.4:59838: read: connection reset by peer", ServerName "")
2020-08-13 03:23:57.190395 I | embed: rejected connection from "10.0.0.4:54104" (error "read tcp 10.0.0.5:2380->10.0.0.4:54104: use of closed network connection", ServerName "")
2020-08-13 03:23:57.190525 I | embed: rejected connection from "10.0.0.4:54102" (error "read tcp 10.0.0.5:2380->10.0.0.4:54102: use of closed network connection", ServerName "")
2020-08-13 03:23:57.191440 I | embed: rejected connection from "10.0.0.3:50804" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:23:57.192788 I | embed: rejected connection from "10.0.0.3:50802" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:32.006357 I | embed: rejected connection from "10.0.0.4:48118" (error "EOF", ServerName "")
2020-08-13 03:27:32.199004 I | embed: rejected connection from "10.0.0.4:48146" (error "EOF", ServerName "")
2020-08-13 08:08:31.507146 I | embed: rejected connection from "10.0.0.3:35046" (error "EOF", ServerName "")
2020-08-13 10:12:31.311770 I | embed: rejected connection from "10.0.0.3:54160" (error "EOF", ServerName "")

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-12-155346   True        False         7h5m    Cluster version is 4.6.0-0.nightly-2020-08-12-155346

The cluster has been up about 7 hours and there are only 27 records in total, far fewer than before, so the log spam is greatly reduced.

- Check the openshift-apiserver logs:

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5fd65468b5-jllrh   2/2     Running   0          6h49m
apiserver-5fd65468b5-mz85t   2/2     Running   0          6h41m
apiserver-5fd65468b5-t4t4g   2/2     Running   0          6h51m
$ for s in {jllrh,mz85t,t4t4g}; do oc logs -n openshift-apiserver apiserver-5fd65468b5-$s -c openshift-apiserver | grep 'TLS handshake error'; done
I0813 09:29:39.440719       1 log.go:181] http: TLS handshake error from 10.130.0.1:43310: EOF
I0813 09:08:43.714322       1 log.go:181] http: TLS handshake error from 10.129.0.1:56350: EOF
I0813 09:08:43.741269       1 log.go:181] http: TLS handshake error from 10.129.0.1:49626: EOF
I0813 09:29:39.469671       1 log.go:181] http: TLS handshake error from 10.128.0.1:35400: EOF

There are only 4 records in total, far fewer than before, so the log spam is greatly reduced.

Given the above verification, the fix works as expected; moving the bug to Verified.
*** Bug 1857723 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196