1578087 – Stopping/Restarting etcd leader cause master api and controllers pods restart multiple times

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.10.0
Assignee:	Jordan Liggitt
QA Contact:	Vikas Laad
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-14 19:00 UTC by Vikas Laad
Modified:	2018-07-30 19:15 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	No Doc Update
Clone Of:
Environment:
Last Closed:	2018-07-30 19:15:23 UTC
Target Upstream Version:
Embargoed:

Attachments
master journal log (3.12 MB, text/x-vhdl) 2018-05-14 19:03 UTC, Vikas Laad	no flags
exited api container log (86.92 KB, text/plain) 2018-05-14 19:04 UTC, Vikas Laad	no flags
master api pod log (69.31 KB, text/plain) 2018-05-14 19:04 UTC, Vikas Laad	no flags
controllers exited container log (1002.97 KB, text/plain) 2018-05-15 14:24 UTC, Vikas Laad	no flags
controller manager exited container log (1.55 KB, text/plain) 2018-05-15 14:24 UTC, Vikas Laad	no flags
api exited container log (294.85 KB, text/plain) 2018-05-15 14:25 UTC, Vikas Laad	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1816	0	None	None	None	2018-07-30 19:15:53 UTC

Description Vikas Laad 2018-05-14 19:00:46 UTC

Description of problem:
Restarting or stopping etcd causes the master api pod to restart multiple times.

NAME                                                            READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-49-98.us-west-2.compute.internal           1/1       Running   13         1h
master-controllers-ip-172-31-49-98.us-west-2.compute.internal   1/1       Running   4          1h
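
For comparing restart counts between runs, a listing like the one above can be parsed mechanically. The following is a minimal sketch, assuming the default whitespace-separated column layout of `oc get pods` with a header row; the function name `parse_restarts` is hypothetical and not part of any OpenShift tooling.

```python
def parse_restarts(listing: str) -> dict:
    """Map pod name to restart count from default `oc get pods` text output.

    Assumes the first non-empty line is a header with NAME and RESTARTS columns.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("NAME")
    restarts_col = header.index("RESTARTS")
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        counts[fields[name_col]] = int(fields[restarts_col])
    return counts


# Sample input copied from the listing in this report.
sample = """\
NAME                                                            READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-49-98.us-west-2.compute.internal           1/1       Running   13         1h
master-controllers-ip-172-31-49-98.us-west-2.compute.internal   1/1       Running   4          1h
"""

counts = parse_restarts(sample)
```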

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Always

Steps to Reproduce:
1. Create an OCP cluster with 3 etcd nodes (not co-located), 1 master, 1 infra and 2 compute nodes
2. Create a few pods, imagestreams, builds, etc.
3. Stop the etcd leader node from the AWS console
4. Watch the master api and controllers pods
5. Note that oc commands sometimes fail while the api pod is restarting
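
Step 4 can be scripted rather than watched by hand. The following is a minimal sketch, assuming the master pods run in the `kube-system` namespace (an assumption; this report does not name the namespace) and that `oc` is already logged in to the cluster. The helper names `restart_counts`, `increased`, and `watch` are hypothetical.

```python
import json
import subprocess
import time


def restart_counts(pod_list: dict) -> dict:
    """Sum container restartCount per pod from `oc get pods -o json` output."""
    counts = {}
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(
            s.get("restartCount", 0) for s in statuses
        )
    return counts


def increased(before: dict, after: dict) -> dict:
    """Return pods whose restart count grew, mapped to the size of the increase."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


def watch(namespace: str = "kube-system", interval: float = 10.0) -> None:
    """Poll restart counts and print any increase.

    The default namespace is an assumption, not taken from the report.
    """
    previous = None
    while True:
        try:
            out = subprocess.run(
                ["oc", "get", "pods", "-n", namespace, "-o", "json"],
                check=True, capture_output=True, text=True,
            ).stdout
            current = restart_counts(json.loads(out))
        except subprocess.CalledProcessError as err:
            # oc itself can fail while the api pod is restarting (step 5).
            print("oc failed:", err.stderr.strip())
        else:
            if previous is not None:
                for name, delta in increased(previous, current).items():
                    print(f"{name}: +{delta} restart(s)")
            previous = current
        time.sleep(interval)
```

Starting `watch()` before stopping the leader in step 3 prints each increase as it happens. Failures of `oc` are reported and skipped, since they are expected during the window described in step 5.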

Actual results:
Many restarts of master api and controllers pods

Expected results:
No restarts of master api and controllers pods

Additional info:
Attaching journal logs from the master node, master api and controller pod logs, and exited container logs.

Comment 1 Vikas Laad 2018-05-14 19:03:51 UTC

Created attachment 1436508
master journal log

Comment 2 Vikas Laad 2018-05-14 19:04:11 UTC

Created attachment 1436509
exited api container log

Comment 3 Vikas Laad 2018-05-14 19:04:31 UTC

Created attachment 1436510
master api pod log

Comment 5 Vikas Laad 2018-05-15 13:27:37 UTC

I will attach controller manager logs today.

Comment 6 Vikas Laad 2018-05-15 14:24:19 UTC

Created attachment 1436801
controllers exited container log

Comment 7 Vikas Laad 2018-05-15 14:24:42 UTC

Created attachment 1436802
controller manager exited container log

Comment 8 Vikas Laad 2018-05-15 14:25:03 UTC

Created attachment 1436803
api exited container log

Comment 17 Jordan Liggitt 2018-05-21 11:49:55 UTC

This seems like it might be related to the issue fixed by https://github.com/openshift/origin/pull/19638

Comment 18 Michal Fojtik 2018-05-21 12:15:54 UTC

Definitely. Moving this to QA to test that fix.

Vikas, can you try with the latest build?

Comment 20 Vikas Laad 2018-05-22 15:39:11 UTC

I did not see this problem in the following version; I tried restarting the etcd leader multiple times.

openshift v3.10.0-0.50.0
kubernetes v1.10.0+b81c8f8

Comment 22 errata-xmlrpc 2018-07-30 19:15:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
