Bug 1855284
Summary: New unexplained, suspicious, and voluminous "rejected connection" etcd log entries present
Product: OpenShift Container Platform
Component: kube-apiserver
Reporter: Dan Mace <dmace>
Assignee: Dan Mace <dmace>
QA Contact: Xingxing Xia <xxia>
CC: aos-bugs, kewang, mfojtik, sanchezl, xxia
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-10-27 16:13:10 UTC
Description
Dan Mace
2020-07-09 13:02:22 UTC
The following procedure stops the connection attempts:

oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator
oc scale --replicas 0 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator
oc delete -n openshift-kube-apiserver podnetworkconnectivitycheck --all

Then, this procedure starts the connection attempts back up:

oc scale --replicas 1 -n openshift-kube-apiserver-operator deployments/kube-apiserver-operator

So the podnetworkconnectivitycheck mechanism is responsible for the connections. The errors in the logs are a result of the podnetworkconnectivitycheck dialing the etcd endpoint without presenting a client certificate, which the server listener specifies as required.

Given:

1. We don't want to set up mTLS in the probe, and
2. We don't want to push an upstream change to etcd's TLS client cert settings without strong justification

Here's a plan of action:

1. Tolerate the spam for now even though it's annoying and scary to users
2. Try to configure etcd logging to suppress the noise if possible
3. Switch the probe to use the hopefully forthcoming HTTP health check endpoint[1], which removes the client certificate requirement altogether

[1] https://github.com/etcd-io/etcd/issues/11993

*** Bug 1857190 has been marked as a duplicate of this bug. ***

For the verifier: when verifying this bug against the KAS PR, please also re-verify the closed OAS log bug 1857190 to ensure that its log spam disappears as well. Thanks.

First, let's confirm whether the fix is already in the latest payload.

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-12-155346 | grep kube-apiserver
cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 57a1aa9e336d93876b7c4291431d603a0dd71abe

$ git log --date local --pretty="%h %an %cd - %s" 57a1aa9e | grep '#901'
3d4b6e98 OpenShift Merge Robot Thu Jul 16 13:34:16 2020 - Merge pull request #901 from sanchezl/point-to-point

The fix was in.
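The payload check above picks one component's commit out of the `oc adm release info --commits` listing by eye. A minimal sketch of the same lookup as a shell function follows. It assumes the three-column layout (component, repository URL, commit) seen in the output above, and the function name `component_commit` is hypothetical, not part of `oc`.

```shell
#!/bin/sh
# Hypothetical helper: print the commit recorded for one component.
# Reads `oc adm release info --commits <image>` output on stdin and assumes
# each relevant line has the form "<component> <repository URL> <commit>".
component_commit() {
    # Match the component name exactly so that other rows containing the
    # same substring are not picked up, as a plain grep would do.
    awk -v name="$1" '$1 == name { print $3 }'
}
```

Possible usage, with the release image taken from the comment above: `oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-12-155346 | component_commit cluster-kube-apiserver-operator`. The printed hash can then be passed to `git log` as shown.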
- Check the etcd log:

$ pods=$(oc get pods -n openshift-etcd -l etcd -o name)
$ for pod in $pods;do oc logs -n openshift-etcd $pod -c etcd |grep 'embed: rejected connection';done
2020-08-13 03:16:25.783603 I | embed: rejected connection from "10.0.0.5:42458" (error "read tcp 10.0.0.3:2379->10.0.0.5:42458: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905702 I | embed: rejected connection from "10.0.0.4:36848" (error "read tcp 10.0.0.3:2380->10.0.0.4:36848: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905831 I | embed: rejected connection from "10.0.0.5:46002" (error "read tcp 10.0.0.3:2380->10.0.0.5:46002: use of closed network connection", ServerName "")
2020-08-13 03:16:25.905882 I | embed: rejected connection from "10.0.0.4:36850" (error "read tcp 10.0.0.3:2380->10.0.0.4:36850: use of closed network connection", ServerName "")
2020-08-13 05:00:31.198637 I | embed: rejected connection from "10.129.0.19:57252" (error "EOF", ServerName "")
2020-08-13 08:08:31.504605 I | embed: rejected connection from "10.129.0.19:34678" (error "EOF", ServerName "")
2020-08-13 03:11:25.765127 I | embed: rejected connection from "10.0.0.3:38382" (error "EOF", ServerName "")
2020-08-13 03:11:25.790598 I | embed: rejected connection from "10.0.0.3:38408" (error "EOF", ServerName "")
2020-08-13 03:17:32.144609 I | embed: rejected connection from "10.0.0.3:55652" (error "EOF", ServerName "")
2020-08-13 03:18:40.267284 I | embed: rejected connection from "10.0.0.3:45858" (error "EOF", ServerName "")
2020-08-13 03:25:03.553195 I | embed: rejected connection from "10.0.0.5:35680" (error "EOF", ServerName "")
2020-08-13 03:25:04.720663 I | embed: rejected connection from "10.0.0.3:44612" (error "EOF", ServerName "")
2020-08-13 03:26:25.094808 I | embed: rejected connection from "10.0.0.3:35020" (error "read tcp 10.0.0.4:2380->10.0.0.3:35020: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094844 I | embed: rejected connection from "10.0.0.3:35022" (error "read tcp 10.0.0.4:2380->10.0.0.3:35022: use of closed network connection", ServerName "")
2020-08-13 03:26:25.094858 I | embed: rejected connection from "10.0.0.5:46782" (error "read tcp 10.0.0.4:2380->10.0.0.5:46782: use of closed network connection", ServerName "")
2020-08-13 03:26:25.096445 I | embed: rejected connection from "10.0.0.5:46780" (error "set tcp 10.0.0.4:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:37.171078 I | embed: rejected connection from "10.0.0.3:33226" (error "EOF", ServerName "")
2020-08-13 03:10:52.559597 I | embed: rejected connection from "10.0.0.4:33550" (error "EOF", ServerName "")
2020-08-13 03:11:01.301711 I | embed: rejected connection from "10.0.0.4:59838" (error "read tcp 10.0.0.5:2379->10.0.0.4:59838: read: connection reset by peer", ServerName "")
2020-08-13 03:23:57.190395 I | embed: rejected connection from "10.0.0.4:54104" (error "read tcp 10.0.0.5:2380->10.0.0.4:54104: use of closed network connection", ServerName "")
2020-08-13 03:23:57.190525 I | embed: rejected connection from "10.0.0.4:54102" (error "read tcp 10.0.0.5:2380->10.0.0.4:54102: use of closed network connection", ServerName "")
2020-08-13 03:23:57.191440 I | embed: rejected connection from "10.0.0.3:50804" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:23:57.192788 I | embed: rejected connection from "10.0.0.3:50802" (error "set tcp 10.0.0.5:2380: use of closed network connection", ServerName "")
2020-08-13 03:27:32.006357 I | embed: rejected connection from "10.0.0.4:48118" (error "EOF", ServerName "")
2020-08-13 03:27:32.199004 I | embed: rejected connection from "10.0.0.4:48146" (error "EOF", ServerName "")
2020-08-13 08:08:31.507146 I | embed: rejected connection from "10.0.0.3:35046" (error "EOF", ServerName "")
2020-08-13 10:12:31.311770 I | embed: rejected connection from "10.0.0.3:54160" (error "EOF", ServerName "")

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-12-155346   True        False         7h5m    Cluster version is 4.6.0-0.nightly-2020-08-12-155346

The cluster has been up for about 7 hours and there are only 27 records in total across the etcd pods, so there is less log spam than before.

- Check the openshift-apiserver log:

$ oc get pods -n openshift-apiserver
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-5fd65468b5-jllrh   2/2     Running   0          6h49m
apiserver-5fd65468b5-mz85t   2/2     Running   0          6h41m
apiserver-5fd65468b5-t4t4g   2/2     Running   0          6h51m

$ for s in {jllrh,mz85t,t4t4g}; do oc logs -n openshift-apiserver apiserver-5fd65468b5-$s -c openshift-apiserver | grep 'TLS handshake error'; done
I0813 09:29:39.440719 1 log.go:181] http: TLS handshake error from 10.130.0.1:43310: EOF
I0813 09:08:43.714322 1 log.go:181] http: TLS handshake error from 10.129.0.1:56350: EOF
I0813 09:08:43.741269 1 log.go:181] http: TLS handshake error from 10.129.0.1:49626: EOF
I0813 09:29:39.469671 1 log.go:181] http: TLS handshake error from 10.128.0.1:35400: EOF

There are only 4 records in total, so there is less log spam than before here as well.

From the above verification, we can see the fix works as expected, so I am moving the bug to Verified.

*** Bug 1857723 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
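The verification above judges the fix by counting "rejected connection" records in the etcd logs. A minimal sketch of a filter that tallies those records by error kind might look like this. It assumes the log line format shown in the comment above, and the function name `tally_rejections` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: tally etcd "rejected connection" records by error kind.
# Reads etcd log text on stdin and prints "<count> <kind>" lines, most
# frequent first. Assumes entries look like:
#   ... embed: rejected connection from "IP:PORT" (error "MESSAGE", ServerName "")
tally_rejections() {
    grep 'embed: rejected connection' |
        # Keep only the quoted error message.
        sed -n 's/.*(error "\([^"]*\)".*/\1/p' |
        # Drop the per-connection "read tcp A->B: " / "set tcp A: " prefix
        # (and any further "read: " prefix) so identical failures group together.
        sed -e 's/^read tcp .*: //' -e 's/^set tcp .*: //' |
        sort | uniq -c | sort -rn |
        # uniq -c pads the count with leading spaces; strip them.
        sed 's/^ *//'
}
```

Possible usage with the loop from the verification: `for pod in $pods; do oc logs -n openshift-etcd $pod -c etcd; done | tally_rejections`. Grouping by error kind separates the plain "EOF" entries from the peer-port (2380) "use of closed network connection" entries, which a single total does not distinguish.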