Bug 1552332 - 15 minute delay to take master out of cluster during abrupt shutdown
Summary: 15 minute delay to take master out of cluster during abrupt shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.4.z
Assignee: Michal Fojtik
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On: 1550470 1561748 1561749 1564978
Blocks:
 
Reported: 2018-03-06 23:24 UTC by Robert Bost
Modified: 2021-06-10 15:06 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-18 07:00:36 UTC
Target Upstream Version:
Embargoed:




Links
System: Red Hat Product Errata   ID: RHBA-2018:1134   Last Updated: 2018-04-18 07:02:00 UTC

Description Robert Bost 2018-03-06 23:24:37 UTC
Description of problem: There appears to be a roughly 15-minute delay before a master node is taken out of the cluster after it is abruptly shut down (e.g., forced power-off or crash).
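
For anyone trying to observe this window, the transition can be watched from a surviving master with something along these lines (a sketch only: the etcd systemd unit name is an assumption about these hosts, and the peer ID is the one from the logs further down):

# Watch node status; Master1 should flip Ready -> NotReady only after the delay.
oc get nodes -w

# In parallel, follow etcd's view of the failed peer on a surviving master
# (assumes etcd runs as the 'etcd' systemd unit on the master hosts).
journalctl -u etcd -f | grep f1a8c1bd301a53b2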


Version-Release number of selected component (if applicable): OpenShift 3.4


How reproducible: Always, as far as we have seen


Steps to Reproduce:
1- Shut down Master1 abruptly.
2- Master2 and Master3 saw Master1 go down:
Mar  6 17:29:29 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream MsgApp v2 reader)
Mar  6 17:29:29 xxx00189 etcd: peer f1a8c1bd301a53b2 became inactive
Mar  6 17:29:29 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream Message reader)
Mar  6 17:30:02 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream MsgApp v2 writer)
Mar  6 17:30:12 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream Message writer)
3- Increased the replica count from 1 to 3 for one of the deployments, but nothing happened; it just sat there saying "Scaling to 3".
4- Deleted a pod and did not see a replacement spin up.
5- Ran 'oc get nodes'; the status for Master1 never changed from Ready.
After about 15 minutes (see the logs below), the cluster finally took Master1 out.

Mar  6 17:44:56 xxx00189 etcd: failed to reach the peerURL(https://10.245.160.88:2380) of member f1a8c1bd301a53b2 (Get https://10.245.160.88:2380/version: dial tcp 10.245.160.88:2380: i/o timeout)
Mar  6 17:45:01 xxx00189 etcd: failed to reach the peerURL(https://10.245.160.88:2380) of member f1a8c1bd301a53b2 (Get https://10.245.160.88:2380/version: dial tcp 10.245.160.88:2380: i/o timeout)
...
Mar  6 17:45:43 xxx00189 etcd: health check for peer f1a8c1bd301a53b2 could not connect: dial tcp 10.245.160.88:2380: i/o timeout

Once the messages above started flowing on Master2 and Master3, Master1's status became NotReady, the deleted pod was respawned, and the scale-up from 1 to 3 replicas finally took effect.
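
For reference, member health can also be polled directly while reproducing this. A sketch using the v2 etcdctl shipped with OCP 3.x, following the conventional client-certificate locations on a master host (paths and the endpoint placeholder are assumptions, not taken from this environment):

# Reports all members healthy until the health-check failures above begin,
# at which point the dead member shows as unreachable.
etcdctl --ca-file /etc/origin/master/master.etcd-ca.crt \
        --cert-file /etc/origin/master/master.etcd-client.crt \
        --key-file /etc/origin/master/master.etcd-client.key \
        --endpoints https://<surviving-master>:2379 \
        cluster-health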

Comment 11 Wang Haoran 2018-04-03 08:57:49 UTC
Verified with:
openshift v3.4.1.44.52
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
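
For the record, versions like those above can be read back on the hosts with commands along these lines (illustrative; output formats vary by build):

oc version        # reports the openshift and kubernetes versions
etcd --version    # reports the etcd build on each master host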

Comment 14 errata-xmlrpc 2018-04-18 07:00:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1134

