Description of problem:

After bootstrap is complete, the etcd-operator continues trying and failing to connect to the deleted bootstrap etcd member endpoint. This appears in logs with messages like:

W0507 13:05:33.927788       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.15.248:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.15.248:2379: connect: connection refused". Reconnecting...

The issue never corrects itself.

Version-Release number of selected component (if applicable):
4.4, 4.5

How reproducible:
Create a new IPI cluster on any platform.

Actual results:
After bootstrap, the operator will spew logs about failed connection attempts to the bootstrap node IP.

Expected results:
The operator should forget about the bootstrap node after bootstrap is complete.

Additional info:
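A quick way to confirm the symptom on a live cluster (a sketch; it assumes the operator runs as the etcd-operator deployment in the openshift-etcd-operator namespace, and the grep pattern matches the example log line above):

# Tail the etcd-operator logs and look for reconnect attempts against
# the deleted bootstrap member (substitute your own bootstrap IP for
# 10.0.15.248 when reading the output).
oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=500 \
  | grep 'addrConn.createTransport'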
*** This bug has been marked as a duplicate of bug 1835238 ***
I made a mistake: this is separate from https://bugzilla.redhat.com/show_bug.cgi?id=1835238, which is concerned with cleaning up the stale member references elsewhere.
The original fix for this (https://github.com/openshift/cluster-etcd-operator/pull/342) had a critical flaw that nevertheless let it pass CI. The fix is reverted in https://github.com/openshift/cluster-etcd-operator/pull/364 and needs to be reworked. The revert PR is using this bz to have a valid bug association for merging. Once the revert merges, I'll move this bug to 4.6, as the impact (log spam) doesn't justify blocking the 4.5 release.
The team has been busy with 4.5 bugs. Adding the UpcomingSprint keyword to continue this work in 4.6.
I believe the effect of this is more than log spam. I think it is having an effect on CodeReady Workspaces, making it unusable: I am unable to build any code base that requires pulling dependencies from Maven. Could this bug possibly be affecting the network connections of containers as well? This is what is displayed in the logs of freshly installed 4.4.3 - 4.4.6 clusters after simply trying any "Get Started" environment in CodeReady Workspaces:

time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4402')"
time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4400')"
time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4401')"
2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp 127.0.0.1:44434->127.0.0.1:4444: splice: connection reset by peer
2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp 10.131.0.29:4400->10.131.0.1:53026: use of closed network connection
2020/06/01 14:01:20 http: response.WriteHeader on hijacked connection from github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.(*ProxyHttpServer).ServeHTTP (proxy.go:149)
2020/06/01 14:01:20 http: response.Write on hijacked connection from io.copyBuffer (io.go:404)
2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp 127.0.0.1:44650->127.0.0.1:4444: splice: connection reset by peer
2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp 10.131.0.29:4400->10.131.0.1:53242: use of closed network connection
2020/06/01 14:01:28 http: response.WriteHeader on hijacked connection from github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.(*ProxyHttpServer).ServeHTTP (proxy.go:149)
2020/06/01 14:01:28 http: response.Write on hijacked connection from io.copyBuffer (io.go:404)
(In reply to Dean from comment #10)
> I believe the effect of this is more than log spam. I think it is having
> an effect on CodeReady Workspaces, making it unusable: I am unable to
> build any code base that requires pulling dependencies from Maven. Could
> this bug possibly be affecting the network connections of containers as
> well? This is what is displayed in the logs of freshly installed 4.4.3 -
> 4.4.6 clusters after simply trying any "Get Started" environment in
> CodeReady Workspaces:
>
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4402')"
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4400')"
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4401')"
> 2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp
> 127.0.0.1:44434->127.0.0.1:4444: splice: connection reset by peer
> 2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp
> 10.131.0.29:4400->10.131.0.1:53026: use of closed network connection
> 2020/06/01 14:01:20 http: response.WriteHeader on hijacked connection from
> github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.
> (*ProxyHttpServer).ServeHTTP (proxy.go:149)
> 2020/06/01 14:01:20 http: response.Write on hijacked connection from
> io.copyBuffer (io.go:404)
> 2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp
> 127.0.0.1:44650->127.0.0.1:4444: splice: connection reset by peer
> 2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp
> 10.131.0.29:4400->10.131.0.1:53242: use of closed network connection
> 2020/06/01 14:01:28 http: response.WriteHeader on hijacked connection from
> github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.
> (*ProxyHttpServer).ServeHTTP (proxy.go:149)
> 2020/06/01 14:01:28 http: response.Write on hijacked connection from
> io.copyBuffer (io.go:404)

I'm not sure I can conclude from your output that the defunct bootstrap member's presence is the cause of the problem you're having. Unfortunately I don't use CRC and we have no automated coverage in CI, so more investigation would probably be necessary. I'm not aware of any similar failure modes on other platforms that look like what you're seeing, but maybe it rings a bell for someone else on the team.
This is a bare metal install on vSphere. I do not use CRC either. I do a fresh install of a 4.4.3 (or 4.4.6) bare metal cluster with three masters and two worker nodes. I create two persistent NFS volumes (one for Postgres and another for the workspace). I start CodeReady Workspaces and spin up a Quarkus workspace using the "Get Started" template available within CodeReady Workspaces. I select "Package the application" from the menu shown above on the right side of the IDE. The error occurs. The process gets stuck at "Downloading from central: https://repo.maven.apache.org/maven2/io/quarkus/quarkus-bom/1.3.2.Final/quarkus-bom-1.3.2.Final.pom"
For the time being, is there any kind of manual workaround to take the bootstrap node out of the ALL_ETCD_ENDPOINTS list? I've tried a few things, but something keeps injecting it back in.
(In reply to Dean from comment #14)
> For the time being, is there any kind of manual workaround to take the
> bootstrap node out of the ALL_ETCD_ENDPOINTS list? I've tried a few
> things, but something keeps injecting it back in.

Now I wonder if we're talking about the same issue. Are you referring to https://bugzilla.redhat.com/show_bug.cgi?id=1835238 (which covers the stale member environment)?

This bug is about the etcd-operator's internal client never forgetting the bootstrap member, and I still don't have any evidence of a serious functional impact. If you're in fact talking about bug #1835238, we should probably move the conversation there. I don't have the full background on #1835238, so I can't speculate right now as to whether it would be related to the problems you're seeing. I'm not yet seeing any connection between the problem described in this bug and your symptoms.
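For what it's worth, if you want to see where the value you keep fighting comes from, a rough check (a sketch, assuming the rendered etcd static pod spec lives in the etcd-pod configmap, which is what I'd expect on 4.4):

# Look for ALL_ETCD_ENDPOINTS in the rendered etcd static pod spec.
# Hand-edits to this configmap get overwritten on the operator's next
# sync, which would explain the value being "injected back in".
oc get configmap etcd-pod -n openshift-etcd -o yaml | grep -A 2 ALL_ETCD_ENDPOINTS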
Just to be clear, the underlying cause of the errors described in this bug is the alpha.installer.openshift.io/etcd-bootstrap annotation on the openshift-etcd/hosts-etcd-2 endpoints resource (4.4) / the openshift-etcd/etcd-endpoints configmap (4.5+), and not the ALL_ETCD_ENDPOINTS (or any other) environment variable (e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1835238). Hope that helps disambiguate.
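If you want to check a cluster directly, something like this should show whether the annotation is still present (a sketch; the resource names are the ones mentioned above, and the dots in the annotation key need escaping in jsonpath):

# 4.4: the annotation lives on an endpoints resource
oc get endpoints hosts-etcd-2 -n openshift-etcd \
  -o jsonpath='{.metadata.annotations.alpha\.installer\.openshift\.io/etcd-bootstrap}'

# 4.5+: it moved to a configmap
oc get configmap etcd-endpoints -n openshift-etcd \
  -o jsonpath='{.metadata.annotations.alpha\.installer\.openshift\.io/etcd-bootstrap}'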
(In reply to Dan Mace from comment #15)
> (In reply to Dean from comment #14)
> > For the time being, is there any kind of manual workaround to take the
> > bootstrap node out of the ALL_ETCD_ENDPOINTS list? I've tried a few
> > things, but something keeps injecting it back in.
>
> Now I wonder if we're talking about the same issue. Are you referring to
> https://bugzilla.redhat.com/show_bug.cgi?id=1835238 (which covers the
> stale member environment)?
>
> This bug is about the etcd-operator's internal client never forgetting
> the bootstrap member, and I still don't have any evidence of a serious
> functional impact. If you're in fact talking about bug #1835238, we
> should probably move the conversation there. I don't have the full
> background on #1835238, so I can't speculate right now as to whether it
> would be related to the problems you're seeing. I'm not yet seeing any
> connection between the problem described in this bug and your symptoms.

I am not sure if this bug is causing the issue I am referring to. However, I have spent days trying to get past the CodeReady Workspaces issue (installing and reinstalling OpenShift multiple times), and the only error I see in the environment is this one. I would like to rule it out as a possible cause. If there is a workaround to simply remove the bootstrap node so my etcd-operator pod isn't throwing hundreds of errors a minute, I would like to at least eliminate this as the culprit.
(In reply to Dan Mace from comment #16)
> Just to be clear, the underlying cause of the errors described in this bug
> is the alpha.installer.openshift.io/etcd-bootstrap annotation on the
> openshift-etcd/hosts-etcd-2 endpoints resource (4.4) / the
> openshift-etcd/etcd-endpoints configmap (4.5+), and not the
> ALL_ETCD_ENDPOINTS (or any other) environment variable (e.g.
> https://bugzilla.redhat.com/show_bug.cgi?id=1835238).
>
> Hope that helps disambiguate.

I don't see the etcd-endpoints configmap. Does this look right to you?

[dpeterson@webservice nfs-client]$ oc get configmap
NAME                              DATA   AGE
config                            1      34h
config-2                          1      34h
etcd-ca-bundle                    1      34h
etcd-metrics-proxy-client-ca      1      34h
etcd-metrics-proxy-client-ca-2    1      34h
etcd-metrics-proxy-serving-ca     1      34h
etcd-metrics-proxy-serving-ca-2   1      34h
etcd-peer-client-ca               1      34h
etcd-peer-client-ca-2             1      34h
etcd-pod                          3      34h
etcd-pod-2                        3      34h
etcd-scripts                      4      34h
etcd-serving-ca                   1      34h
etcd-serving-ca-2                 1      34h
restore-etcd-pod                  3      34h
revision-status-1                 2      34h
revision-status-2                 2      34h
[dpeterson@webservice nfs-client]$
https://github.com/openshift/cluster-etcd-operator/pull/367 is the new attempt at a fix.
(In reply to Dan Mace from comment #19)
> https://github.com/openshift/cluster-etcd-operator/pull/367 is the new
> attempt at a fix.

Is that fix in any of the releases here yet? https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/

Any way to manually apply this or get around the problem in a 4.4.3 cluster?
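(One way to check whether a given release payload contains that PR is to map the release image to component commits with oc adm release info; a sketch, using a placeholder 4.6 pullspec rather than any release known to carry the fix:)

# Show the cluster-etcd-operator commit baked into a release image
# (substitute the release pullspec you actually care about).
oc adm release info quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 \
  --commits | grep cluster-etcd-operator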
Adding the Upgrades keyword, as this was the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=1841484
If there's a configuration where we suspect this is still happening, I recommend opening a new bug. If you do, providing at least an automated reproducer, and ideally examples from CI, will improve the chance of us investigating soon. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days