Bug 1578087 - Stopping/Restarting etcd leader cause master api and controllers pods restart multiple times
Summary: Stopping/Restarting etcd leader cause master api and controllers pods restart...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Jordan Liggitt
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-14 19:00 UTC by Vikas Laad
Modified: 2018-07-30 19:15 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:15:23 UTC
Target Upstream Version:
Embargoed:


Attachments
master journal log (3.12 MB, text/x-vhdl)
2018-05-14 19:03 UTC, Vikas Laad
exited api container log (86.92 KB, text/plain)
2018-05-14 19:04 UTC, Vikas Laad
master api pod log (69.31 KB, text/plain)
2018-05-14 19:04 UTC, Vikas Laad
controllers exited container log (1002.97 KB, text/plain)
2018-05-15 14:24 UTC, Vikas Laad
controller manager exited container log (1.55 KB, text/plain)
2018-05-15 14:24 UTC, Vikas Laad
api exited container log (294.85 KB, text/plain)
2018-05-15 14:25 UTC, Vikas Laad


Links
System                   ID              Private   Priority   Status   Summary   Last Updated
Red Hat Product Errata   RHBA-2018:1816  0         None       None     None      2018-07-30 19:15:53 UTC

Description Vikas Laad 2018-05-14 19:00:46 UTC
Description of problem:
Restarting or stopping the etcd leader causes the master api and controllers pods to restart multiple times.

NAME                                                            READY     STATUS    RESTARTS   AGE
master-api-ip-172-31-49-98.us-west-2.compute.internal           1/1       Running   13         1h
master-controllers-ip-172-31-49-98.us-west-2.compute.internal   1/1       Running   4          1h

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.41.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Always

Steps to Reproduce:
1. Create an OCP cluster with 3 etcd nodes (not co-located), 1 master, 1 infra and 2 compute nodes
2. Create a few pods, imagestreams, builds, etc.
3. Stop the etcd leader node from the AWS console
4. Watch the master api and controllers pods
5. oc commands sometimes fail while the api pod is restarting
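
A minimal sketch for step 4, in case it helps anyone re-running this: it polls `oc get pods` and reports when a RESTARTS count goes up, instead of watching the listing by hand. This was not part of the original test. The `kube-system` namespace, the 10 second poll interval, and the helper names `parse_restarts`, `restart_deltas` and `watch` are all assumptions made for illustration; adjust them to wherever the master pods run in your cluster.

```python
import subprocess
import time
from typing import Dict, Optional


def parse_restarts(listing: str) -> Dict[str, int]:
    """Map pod name -> RESTARTS count from the text table printed by `oc get pods`."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_idx = header.index("NAME")
    restarts_idx = header.index("RESTARTS")
    counts: Dict[str, int] = {}
    for ln in lines[1:]:
        cols = ln.split()
        counts[cols[name_idx]] = int(cols[restarts_idx])
    return counts


def restart_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the pods whose restart count increased, with the increase."""
    return {
        name: count - before.get(name, 0)
        for name, count in after.items()
        if count > before.get(name, 0)
    }


def watch(namespace: str = "kube-system", interval: float = 10.0) -> None:
    """Poll `oc get pods` forever and print any new restarts.

    The namespace default is an assumption, not taken from this report.
    """
    previous: Optional[Dict[str, int]] = None
    while True:
        result = subprocess.run(
            ["oc", "get", "pods", "-n", namespace],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Matches step 5: oc itself may fail while the api pod restarts.
            print("oc failed:", result.stderr.strip())
        else:
            current = parse_restarts(result.stdout)
            if previous is not None:
                for name, delta in restart_deltas(previous, current).items():
                    print(f"{name}: +{delta} restart(s), now {current[name]}")
            previous = current
        time.sleep(interval)


if __name__ == "__main__":
    watch()
```

Start it before stopping the etcd leader in step 3. The first poll only records a baseline, so restarts that happened earlier are not reported.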

Actual results:
Many restarts of master api and controllers pods

Expected results:
No restarts of master api and controllers pods

Additional info:
Attaching journal logs from the master node, master api and controller pod logs, and exited container logs.

Comment 1 Vikas Laad 2018-05-14 19:03:51 UTC
Created attachment 1436508 [details]
master journal log

Comment 2 Vikas Laad 2018-05-14 19:04:11 UTC
Created attachment 1436509 [details]
exited api container log

Comment 3 Vikas Laad 2018-05-14 19:04:31 UTC
Created attachment 1436510 [details]
master api pod log

Comment 5 Vikas Laad 2018-05-15 13:27:37 UTC
I will attach controller manager logs today.

Comment 6 Vikas Laad 2018-05-15 14:24:19 UTC
Created attachment 1436801 [details]
controllers exited container log

Comment 7 Vikas Laad 2018-05-15 14:24:42 UTC
Created attachment 1436802 [details]
controller manager exited container log

Comment 8 Vikas Laad 2018-05-15 14:25:03 UTC
Created attachment 1436803 [details]
api exited container log

Comment 17 Jordan Liggitt 2018-05-21 11:49:55 UTC
This seems like it might be related to the issue fixed by https://github.com/openshift/origin/pull/19638

Comment 18 Michal Fojtik 2018-05-21 12:15:54 UTC
Definitely, moving on QA to test that fix. 

Vikas, can you try with the latest build?

Comment 20 Vikas Laad 2018-05-22 15:39:11 UTC
I did not see this problem in the following version; I tried restarting the etcd leader multiple times.

openshift v3.10.0-0.50.0
kubernetes v1.10.0+b81c8f8

Comment 22 errata-xmlrpc 2018-07-30 19:15:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

