Bug 1855284

Summary: New unexplained, suspicious, and voluminous "rejected connection" etcd log entries present
Product: OpenShift Container Platform
Reporter: Dan Mace <dmace>
Component: kube-apiserver
Assignee: Dan Mace <dmace>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.6
CC: aos-bugs, kewang, mfojtik, sanchezl, xxia
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:13:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Dan Mace 2020-07-09 13:02:22 UTC
Description of problem:

We're suddenly observing log entries like the following in the etcd member logs, roughly 4 times per second:

2020-07-08 14:16:57.396288 I | embed: rejected connection from "10.0.229.6:57222" (error "EOF", ServerName "")
2020-07-08 14:16:57.494467 I | embed: rejected connection from "10.0.181.60:37336" (error "EOF", ServerName "")
2020-07-08 14:16:57.810433 I | embed: rejected connection from "10.0.161.1:47454" (error "EOF", ServerName "")
2020-07-08 14:16:57.810806 I | embed: rejected connection from "[::1]:38488" (error "EOF", ServerName "")

This is new to 4.6. These entries are suspicious and should be explained to prove there's no functional regression somewhere in the stack.


Version-Release number of selected component (if applicable):


How reproducible:

Launch a new cluster and observe the etcd member logs.
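
For example, a quick way to look for them (a sketch; the openshift-etcd namespace, the "etcd" pod label, and the etcd container name are taken from the verification commands later in this bug and are assumptions about the cluster layout):

  $ oc logs -n openshift-etcd -l etcd -c etcd --tail=200 | grep 'embed: rejected connection'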

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Dan Mace 2020-07-09 14:28:23 UTC
The following procedure stops the connection attempts:

  oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
  oc scale --replicas 0 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
  oc delete -n openshift-kube-apiserver podnetworkconnectivitycheck --all

Then, this procedure starts the connection attempts back up:

  oc scale --replicas 1 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator

So the podnetworkconnectivitycheck mechanism is responsible for the connections.
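
A quick way to confirm the connection attempts actually stop (a sketch, not part of the original procedure; the pod selector and container name are the same assumptions used in the verification below) is to count the entries per member and check that the count stops growing after scaling down:

  $ for pod in $(oc get pods -n openshift-etcd -l etcd -o name); do echo "$pod: $(oc logs -n openshift-etcd $pod -c etcd | grep -c 'embed: rejected connection')"; done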

Comment 2 Dan Mace 2020-07-09 15:40:33 UTC
The errors in the logs are a result of the podnetworkconnectivitycheck dialing the etcd endpoint without presenting a client certificate, which the server listener specifies as required. Given:

1. We don't want to set up mTLS in the probe, and
2. We don't want to push an upstream change to etcd's TLS client cert settings without strong justification

Here's a plan of action:

1. Tolerate the spam for now even though it's annoying and scary to users
2. Try to configure etcd logging to suppress the noise if possible
3. Switch the probe to use the hopefully forthcoming HTTP health check endpoint[1], which removes the client certificate requirement altogether

[1] https://github.com/etcd-io/etcd/issues/11993
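
For illustration, the rejected-connection entry described above can be reproduced by hand (a sketch, assuming network access to an etcd member's client port from a master node; the IP is just an example and the exact error string etcd logs may differ):

  # Open a TLS connection to the etcd client port without presenting a client certificate.
  $ openssl s_client -connect 10.0.229.6:2379 </dev/null

Because the listener requires client certificates, the handshake is rejected and the member is expected to log a "rejected connection" line similar to the ones in the description.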

Comment 5 Luis Sanchez 2020-08-03 20:10:19 UTC
*** Bug 1857190 has been marked as a duplicate of this bug. ***

Comment 6 Xingxing Xia 2020-08-04 02:21:08 UTC
For the verifier: when verifying this bug's kube-apiserver (KAS) PR, please also verify that the closed openshift-apiserver (OAS) log bug 1857190 no longer reproduces. Thanks.

Comment 7 Ke Wang 2020-08-13 10:31:41 UTC
First, let's confirm that the fix is already in the latest payload.

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-12-155346 |grep kube-apiserver
  cluster-kube-apiserver-operator  https://github.com/openshift/cluster-kube-apiserver-operator  57a1aa9e336d93876b7c4291431d603a0dd71abe

$ git log --date local --pretty="%h %an %cd - %s" 57a1aa9e | grep '#901'
3d4b6e98 OpenShift Merge Robot Thu Jul 16 13:34:16 2020 - Merge pull request #901 from sanchezl/point-to-point

The fix was in.

- Check the etcd logs:
$ pods=$(oc get pods -n openshift-etcd -l etcd -o name)

$ for pod in $pods;do oc logs -n openshift-etcd $pod -c etcd |grep 'embed: rejected connection';done
2020-08-13 03:16:25.783603 I | embed: rejected connection from "10.0.0.5:42458" (error "read tcp 10.0.0.3:2379->10.0.0.5:42458: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905702 I | embed: rejected connection from "10.0.0.4:36848" (error "read tcp 10.0.0.3:2380->10.0.0.4:36848: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905831 I | embed: rejected connection from "10.0.0.5:46002" (error "read tcp 10.0.0.3:2380->10.0.0.5:46002: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905882 I | embed: rejected connection from "10.0.0.4:36850" (error "read tcp 10.0.0.3:2380->10.0.0.4:36850: use of closed network connection", ServerName "")
2020-08-13 05:00:31.198637 I | embed: rejected connection from "10.129.0.19:57252" (error "EOF", ServerName "")
2020-08-13 08:08:31.504605 I | embed: rejected connection from "10.129.0.19:34678" (error "EOF", ServerName "")
2020-08-13 03:11:25.765127 I | embed: rejected connection from "10.0.0.3:38382" (error "EOF", ServerName "")
2020-08-13 03:11:25.790598 I | embed: rejected connection from "10.0.0.3:38408" (error "EOF", ServerName "")
2020-08-13 03:17:32.144609 I | embed: rejected connection from "10.0.0.3:55652" (error "EOF", ServerName "")
2020-08-13 03:18:40.267284 I | embed: rejected connection from "10.0.0.3:45858" (error "EOF", ServerName "")
2020-08-13 03:25:03.553195 I | embed: rejected connection from "10.0.0.5:35680" (error "EOF", ServerName "")
2020-08-13 03:25:04.720663 I | embed: rejected connection from "10.0.0.3:44612" (error "EOF", ServerName "")
2020-08-13 03:26:25.094808 I | embed: rejected connection from "10.0.0.3:35020" (error "read tcp 10.0.0.4:2380->10.0.0.3:35020: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094844 I | embed: rejected connection from "10.0.0.3:35022" (error "read tcp 10.0.0.4:2380->10.0.0.3:35022: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094858 I | embed: rejected connection from "10.0.0.5:46782" (error "read tcp 10.0.0.4:2380->10.0.0.5:46782: use of closed network connection", ServerName "")
2020-08-13 03:26:25.096445 I | embed: rejected connection from "10.0.0.5:46780" (error "set tcp 10.0.0.4:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:37.171078 I | embed: rejected connection from "10.0.0.3:33226" (error "EOF", ServerName "")
2020-08-13 03:10:52.559597 I | embed: rejected connection from "10.0.0.4:33550" (error "EOF", ServerName "")
2020-08-13 03:11:01.301711 I | embed: rejected connection from "10.0.0.4:59838" (error "read tcp 10.0.0.5:2379->10.0.0.4:59838: read: connection reset by peer", ServerName "")
2020-08-13 03:23:57.190395 I | embed: rejected connection from "10.0.0.4:54104" (error "read tcp 10.0.0.5:2380->10.0.0.4:54104: use of closed network connection", ServerName "")
2020-08-13 03:23:57.190525 I | embed: rejected connection from "10.0.0.4:54102" (error "read tcp 10.0.0.5:2380->10.0.0.4:54102: use of closed network connection", ServerName "")
2020-08-13 03:23:57.191440 I | embed: rejected connection from "10.0.0.3:50804" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:23:57.192788 I | embed: rejected connection from "10.0.0.3:50802" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:32.006357 I | embed: rejected connection from "10.0.0.4:48118" (error "EOF", ServerName "")
2020-08-13 03:27:32.199004 I | embed: rejected connection from "10.0.0.4:48146" (error "EOF", ServerName "")
2020-08-13 08:08:31.507146 I | embed: rejected connection from "10.0.0.3:35046" (error "EOF", ServerName "")
2020-08-13 10:12:31.311770 I | embed: rejected connection from "10.0.0.3:54160" (error "EOF", ServerName "")

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-12-155346   True        False         7h5m    Cluster version is 4.6.0-0.nightly-2020-08-12-155346

We can see the cluster uptime is about 7 hours and there are only 27 records in total, far fewer than before, so the log spam is greatly reduced.
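
To quantify this per member, the same loop can be turned into a count (a small convenience variation, not part of the original verification):

  $ for pod in $pods; do echo "$pod: $(oc logs -n openshift-etcd $pod -c etcd | grep -c 'embed: rejected connection')"; done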


- Check the openshift-apiserver logs:

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5fd65468b5-jllrh   2/2     Running   0          6h49m
apiserver-5fd65468b5-mz85t   2/2     Running   0          6h41m
apiserver-5fd65468b5-t4t4g   2/2     Running   0          6h51m

$ for s in {jllrh,mz85t,t4t4g}; do oc logs -n openshift-apiserver apiserver-5fd65468b5-$s -c openshift-apiserver | grep 'TLS handshake error'; done
I0813 09:29:39.440719       1 log.go:181] http: TLS handshake error from 10.130.0.1:43310: EOF
I0813 09:08:43.714322       1 log.go:181] http: TLS handshake error from 10.129.0.1:56350: EOF
I0813 09:08:43.741269       1 log.go:181] http: TLS handshake error from 10.129.0.1:49626: EOF
I0813 09:29:39.469671       1 log.go:181] http: TLS handshake error from 10.128.0.1:35400: EOF

There are only 4 records in total, far fewer than before, so the log spam is greatly reduced.
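
The same kind of per-pod count works here as well (again just a convenience variation on the loop above, assuming the same pod names):

  $ for s in {jllrh,mz85t,t4t4g}; do oc logs -n openshift-apiserver apiserver-5fd65468b5-$s -c openshift-apiserver | grep -c 'TLS handshake error'; done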

From the above verification, we can see the fix works as expected, so moving the bug to VERIFIED.

Comment 8 Ke Wang 2020-08-21 06:47:10 UTC
*** Bug 1857723 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2020-10-27 16:13:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196