Bug 1832923 - etcd-operator continuously tries and fails to connect to defunct bootstrap etcd member endpoint [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Mace
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1859678
 
Reported: 2020-05-07 13:53 UTC by Dan Mace
Modified: 2021-01-18 14:27 UTC (History)
14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1859678 (view as bug list)
Environment:
Last Closed: 2020-10-27 15:58:53 UTC
Target Upstream Version:
palonsor: needinfo? (nbhatt)




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 367 0 None closed Bug 1832923: Forget bootstrap etcd member IP after bootstrap 2021-02-08 01:03:51 UTC
Red Hat Bugzilla 1852309 0 low CLOSED [Webscale] Degraded etcd-operator because etcd-bootstrap member 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 5161361 0 None None None 2020-06-16 15:54:14 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:59:14 UTC

Description Dan Mace 2020-05-07 13:53:18 UTC
Description of problem:

After bootstrap is complete, the etcd-operator continues trying and failing to connect to the deleted bootstrap etcd member endpoint.

This appears in logs with messages like:

W0507 13:05:33.927788       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.15.248:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.15.248:2379: connect: connection refused". Reconnecting...

The issue never corrects itself.
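The symptom above can be confirmed by filtering the operator logs for the gRPC reconnect warnings. A minimal sketch (the namespace and deployment names are assumptions based on a standard 4.x install, not taken from this report):

```shell
# On a live cluster, the warning lines could be pulled with something like
# (hypothetical namespace/deployment names):
#   oc logs -n openshift-etcd-operator deployment/etcd-operator \
#     | grep 'addrConn.createTransport'
#
# Extracting the defunct endpoint from a captured warning line:
line='W0507 13:05:33.927788 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://10.0.15.248:2379  <nil> 0 <nil>}.'
echo "$line" | grep -o 'https://[0-9.]*:2379'
# prints https://10.0.15.248:2379
```

If the extracted address matches the (deleted) bootstrap node rather than any current master, the cluster is exhibiting this bug.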

Version-Release number of selected component (if applicable):

4.4, 4.5

How reproducible:

Create a new IPI cluster on any platform.

Actual results:

After bootstrap, the operator will spew logs about failed connection attempts to the bootstrap node IP.

Expected results:

The operator should forget about the bootstrap node after bootstrap is complete.


Additional info:

Comment 1 Dan Mace 2020-05-18 12:41:49 UTC

*** This bug has been marked as a duplicate of bug 1835238 ***

Comment 2 Dan Mace 2020-05-20 19:46:11 UTC
I made a mistake — this is separate from https://bugzilla.redhat.com/show_bug.cgi?id=1835238 which is concerned with cleaning up elsewhere.

Comment 6 Dan Mace 2020-05-28 13:49:54 UTC
The original fix for this (https://github.com/openshift/cluster-etcd-operator/pull/342) had a critical flaw that let it slip past CI. The fix is reverted in https://github.com/openshift/cluster-etcd-operator/pull/364 and needs to be reworked.

The revert PR is using this bz to have a valid bug association for merging. Once the revert merges, I'll move this bug to 4.6 as the impact (log spam) doesn't justify blocking the 4.5 release.

Comment 8 Sam Batschelet 2020-05-29 12:40:29 UTC
The team has been busy with 4.5 bugs; adding the UpcomingSprint keyword to continue this work in 4.6.

Comment 10 Dean 2020-06-01 14:02:53 UTC
I believe the effect of this is more than log spam. I think it is affecting CodeReady Workspaces, making it unusable: I am unable to build any code base that requires pulling dependencies from Maven. Could this bug possibly be affecting the internet connections of containers as well? This is what is displayed in the logs of freshly installed 4.4.3 - 4.4.6 clusters after simply trying any Get Started environment in CodeReady Workspaces:

time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4402')" 
time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4400')" 
time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy (Listening on ':4401')" 
2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp 127.0.0.1:44434->127.0.0.1:4444: splice: connection reset by peer
2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp 10.131.0.29:4400->10.131.0.1:53026: use of closed network connection
2020/06/01 14:01:20 http: response.WriteHeader on hijacked connection from github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.(*ProxyHttpServer).ServeHTTP (proxy.go:149)
2020/06/01 14:01:20 http: response.Write on hijacked connection from io.copyBuffer (io.go:404)
2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp 127.0.0.1:44650->127.0.0.1:4444: splice: connection reset by peer
2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp 10.131.0.29:4400->10.131.0.1:53242: use of closed network connection
2020/06/01 14:01:28 http: response.WriteHeader on hijacked connection from github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.(*ProxyHttpServer).ServeHTTP (proxy.go:149)
2020/06/01 14:01:28 http: response.Write on hijacked connection from io.copyBuffer (io.go:404)

Comment 12 Dan Mace 2020-06-01 14:23:08 UTC
(In reply to Dean from comment #10)
> I believe the affect of this is more than log spam. I think it is having an
> effect on code ready workspaces; making it unusable. I am unable to build
> any code base that requires pulling dependencies from maven. Could this bug
> possible be affecting the internet connections of containers as well? This
> is what is displayed in the logs of freshly installed 4.4.3 - 4.4.6 clusters
> after simply trying any get started environment in code ready workspaces:
> 
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4402')" 
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4400')" 
> time="2020-06-01T13:51:23Z" level=info msg="Starting reverse proxy
> (Listening on ':4401')" 
> 2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp
> 127.0.0.1:44434->127.0.0.1:4444: splice: connection reset by peer
> 2020/06/01 14:01:20 [001] WARN: Websocket error: readfrom tcp
> 10.131.0.29:4400->10.131.0.1:53026: use of closed network connection
> 2020/06/01 14:01:20 http: response.WriteHeader on hijacked connection from
> github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.
> (*ProxyHttpServer).ServeHTTP (proxy.go:149)
> 2020/06/01 14:01:20 http: response.Write on hijacked connection from
> io.copyBuffer (io.go:404)
> 2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp
> 127.0.0.1:44650->127.0.0.1:4444: splice: connection reset by peer
> 2020/06/01 14:01:28 [002] WARN: Websocket error: readfrom tcp
> 10.131.0.29:4400->10.131.0.1:53242: use of closed network connection
> 2020/06/01 14:01:28 http: response.WriteHeader on hijacked connection from
> github.com/eclipse/che-jwtproxy/vendor/github.com/coreos/goproxy.
> (*ProxyHttpServer).ServeHTTP (proxy.go:149)
> 2020/06/01 14:01:28 http: response.Write on hijacked connection from
> io.copyBuffer (io.go:404)


I'm not sure I can conclude from your output that the defunct bootstrap member's presence is the cause of the problem you're having. Unfortunately I don't use CRC and we have no automated coverage in CI, so more investigation would probably be necessary. I'm not aware of any similar failure modes on other platforms that look like what you're seeing, but maybe it rings a bell for someone else on the team.

Comment 13 Dean 2020-06-01 14:25:25 UTC
This is a bare metal install on vSphere. I do not use CRC either.

I do a fresh install of a 4.4.3 (or 4.4.6) bare metal cluster with three masters and two worker nodes.
I create two persistent NFS volumes (one for Postgres and another for the workspace).
I start CodeReady Workspaces and spin up a Quarkus workspace using the "Get Started" template available within CodeReady Workspaces.
I select "Package the application" from the menu shown above on the right side of the IDE.
The error occurs. The process gets stuck at "Downloading from central: https://repo.maven.apache.org/maven2/io/quarkus/quarkus-bom/1.3.2.Final/quarkus-bom-1.3.2.Final.pom"

Comment 14 Dean 2020-06-01 16:41:22 UTC
For the time being, is there any kind of manual workaround to take the bootstrap node out of the ALL_ETCD_ENDPOINTS list? I've tried a few things, but something keeps injecting it back in.

Comment 15 Dan Mace 2020-06-01 17:05:23 UTC
(In reply to Dean from comment #14)
> For the time being, is there any kind of manual work around to take the
> bootstrap node out of the All_ETCD_ENDPOINTS list? I've tried a few things
> but something keeps injecting it back in.

Now I wonder if we're talking about the same issue. Are you referring to https://bugzilla.redhat.com/show_bug.cgi?id=1835238 (which covers stale member environment variables)?

This bug is about the etcd-operator's internal client never forgetting the bootstrap member, and I still don't have any evidence of a serious functional impact. If you're in fact talking about bug #1835238, we should probably move the conversation. I don't have a full background on #1835238 so I can't speculate right now as to whether it would be related to the problems you're seeing. I'm not yet seeing any connection between the problem described in this bug and your symptoms.

Comment 16 Dan Mace 2020-06-01 17:09:05 UTC
Just to be clear, the underlying cause of the errors described in this bug relate to the alpha.installer.openshift.io/etcd-bootstrap annotation on the openshift-etcd/hosts-etcd-2 endpoint (4.4)/openshift-etcd/etcd-endpoints configmap (4.5+) and not the ALL_ETCD_ENDPOINTS (or any other) environment variable (e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1835238).

Hope that helps disambiguate.
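A hedged sketch of inspecting the locations named above. The resource and annotation names come from this comment; the exact `oc` flag syntax and the sample value are assumptions for illustration:

```shell
# 4.4 - annotation on the endpoints object (jsonpath syntax is an assumption):
#   oc get endpoints hosts-etcd-2 -n openshift-etcd \
#     -o jsonpath='{.metadata.annotations.alpha\.installer\.openshift\.io/etcd-bootstrap}'
#
# 4.5+ - the etcd-endpoints configmap:
#   oc get configmap etcd-endpoints -n openshift-etcd -o yaml
#
# If the annotation still carries the defunct bootstrap address, stripping the
# scheme and client port recovers the bootstrap node IP seen in the log spam:
annotation='https://10.0.15.248:2379'   # hypothetical value for illustration
echo "$annotation" | sed -e 's#^https://##' -e 's#:2379$##'
# prints 10.0.15.248
```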

Comment 17 Dean 2020-06-01 17:33:19 UTC
(In reply to Dan Mace from comment #15)
> (In reply to Dean from comment #14)
> > For the time being, is there any kind of manual work around to take the
> > bootstrap node out of the All_ETCD_ENDPOINTS list? I've tried a few things
> > but something keeps injecting it back in.
> 
> Now I wonder if we're talking about the same issue. Are you referring to
> https://bugzilla.redhat.com/show_bug.cgi?id=1835238 (which covers stale
> member environment)?
> 
> This bug is about the etcd-operator's internal client never forgetting the
> bootstrap member, and I still don't have any evidence of a serious
> functional impact. If you're in fact talking about bug #1835238, we should
> probably move the conversation. I don't have a full background on #1835238
> so I can't speculate right now as to whether it would be related to the
> problems you're seeing. I'm not yet seeing any connection between the
> problem described in this bug and your symptoms.

I am not sure if this bug is causing the issue I am referring to. However, I have spent days trying to get past the CodeReady Workspaces issue (installing and reinstalling OpenShift multiple times), and the only error I see in the environment is this one. If there is a workaround to simply remove the bootstrap node so my etcd-operator pod isn't throwing hundreds of errors a minute, I would like to at least eliminate this as the culprit.

Comment 18 Dean 2020-06-01 18:22:26 UTC
(In reply to Dan Mace from comment #16)
> Just to be clear, the underlying cause of the errors described in this bug
> relate to the alpha.installer.openshift.io/etcd-bootstrap annotation on the
> openshift-etcd/hosts-etcd-2 endpoint (4.4)/openshift-etcd/etcd-endpoints
> configmap (4.5+) and not the ALL_ETCD_ENDPOINTS (or any other) environment
> variable (e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1835238).
> 
> Hope that helps disambiguate.

I don't see the etcd-endpoints configmap. Does this look right to you?

[dpeterson@webservice nfs-client]$ oc get configmap
NAME                              DATA   AGE
config                            1      34h
config-2                          1      34h
etcd-ca-bundle                    1      34h
etcd-metrics-proxy-client-ca      1      34h
etcd-metrics-proxy-client-ca-2    1      34h
etcd-metrics-proxy-serving-ca     1      34h
etcd-metrics-proxy-serving-ca-2   1      34h
etcd-peer-client-ca               1      34h
etcd-peer-client-ca-2             1      34h
etcd-pod                          3      34h
etcd-pod-2                        3      34h
etcd-scripts                      4      34h
etcd-serving-ca                   1      34h
etcd-serving-ca-2                 1      34h
restore-etcd-pod                  3      34h
revision-status-1                 2      34h
revision-status-2                 2      34h
[dpeterson@webservice nfs-client]$

Comment 19 Dan Mace 2020-06-01 18:42:46 UTC
https://github.com/openshift/cluster-etcd-operator/pull/367 is the new attempt at a fix.

Comment 20 Dean 2020-06-01 21:10:07 UTC
(In reply to Dan Mace from comment #19)
> https://github.com/openshift/cluster-etcd-operator/pull/367 is the new
> attempt at a fix.

Is that fix in any of the releases here yet? https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/

Any way to manually apply this or get around the problem in a 4.4.3 cluster?

Comment 21 Lalatendu Mohanty 2020-06-02 13:05:47 UTC
Adding the Upgrades keyword as this was the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=1841484

Comment 43 Dan Mace 2020-09-18 12:12:45 UTC
If there's a configuration where we suspect this is still happening, I recommend opening a new bug. If you do, to improve the chance of us investigating soon, I recommend at least providing an automated reproducer, and ideally examples from CI. Thanks.

Comment 45 errata-xmlrpc 2020-10-27 15:58:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

