Bug 1425211 - [3.4] Restarting atomic-openshift-master-api takes ~5 minutes to restart
Summary: [3.4] Restarting atomic-openshift-master-api takes ~5 minutes to restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.4.z
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3 1427532
 
Reported: 2017-02-20 20:50 UTC by Max Whittingham
Modified: 2018-04-23 18:53 UTC
CC: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Excessive logging to the journal.
Consequence: Masters took longer to restart.
Fix: Reduce the amount of logging that occurs when initial list/watch actions happen against etcd.
Result: The journal is no longer flooded with messages that get rate limited and dropped, and server restart time should be improved on clusters with larger data sets.
Clone Of:
Clones: 1427532
Environment:
Last Closed: 2017-03-15 20:02:48 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Origin (Github) 13126 0 None None None 2017-02-28 14:25:38 UTC
Red Hat Product Errata RHBA-2017:0512 0 normal SHIPPED_LIVE OpenShift Container Platform 3.4.1.10, 3.3.1.17, and 3.2.1.28 bug fix update 2017-03-16 00:01:17 UTC

Description Max Whittingham 2017-02-20 20:50:36 UTC
Description of problem:
Restarting atomic-openshift-master-api takes a long time, roughly 5 minutes for the service to come back.

Version-Release number of selected component (if applicable):


How reproducible:
systemctl restart atomic-openshift-master-api

Steps to Reproduce:
1. restart atomic-openshift-master-api.service

Actual results:
Sometimes, this lengthy restart time will cause a timeout and the restart will fail.

Expected results:
The service should restart much faster.

Additional info:
Had to increase TimeoutSec=360 in /usr/lib/systemd/system/atomic-openshift-master-api.service, otherwise the API server would time out before it finished reading and decoding all of the directories from etcd at startup.
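
A drop-in override is the more durable way to carry that timeout, since it survives package updates to the unit file shipped under /usr/lib. The drop-in path and daemon-reload step below are standard systemd convention rather than anything specific to this bug; the 360-second value is the one mentioned above.

/etc/systemd/system/atomic-openshift-master-api.service.d/timeout.conf:

[Service]
TimeoutSec=360

# pick up the drop-in, then restart the service
systemctl daemon-reload
systemctl restart atomic-openshift-master-api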

Comment 1 Derek Carr 2017-02-20 21:53:52 UTC
Are there any log details you can provide for when this happens? Right now there could be any number of things stuck during startup, and it's not possible to diagnose this without more detail.

Comment 2 Seth Jennings 2017-02-20 22:02:01 UTC
Are you running an RPM or containerized install?

Can you restart and provide the recent logs from the API service with

journalctl -u atomic-openshift-master-api --since "10 minutes ago"

(assuming the restart doesn't take longer than 10 minutes)
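
Also worth checking: journald logs a message when it rate-limits a chatty unit, so a generic check for whether API server messages are being dropped during the restart window (not specific to OpenShift) is:

journalctl -u systemd-journald --since "15 minutes ago" | grep -i suppressed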

Comment 4 Seth Jennings 2017-02-20 23:02:06 UTC
There are quite a few resources that the master has to read into the cache on startup, around 10,000. The process stalls in a highly periodic way during this read, every 30s: it will decode at :13, then stall, then start again at :43, then stall. My only guess at the moment is that some I/O scheduler, either at the host or AWS level, is at work here, blocking the process for using too much I/O.

Comment 8 Mike Fiedler 2017-02-22 16:37:14 UTC
If it is an etcd IOPS issue, the etcd member logs should show delayed heartbeats and/or leader re-elections. Can we get the etcd member logs during a master restart?

If it is AWS I/O throttling, iostat for the etcd snapshot storage device during restart might show something - though in my experience it does not show a symptom like the pattern mentioned in comment 4.
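
For example (the device name below is only a placeholder; substitute whatever actually backs /var/lib/etcd, and adjust the unit name if etcd does not run as its own service):

# watch the etcd storage device while the master restarts
iostat -dxm 5 /dev/xvdf

# look for slow heartbeats, elections, or slow syncs in the same window
journalctl -u etcd --since "15 minutes ago" | grep -iE 'heartbeat|election|sync'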

Comment 10 Max Whittingham 2017-02-22 19:42:04 UTC
(In reply to Mike Fiedler from comment #8)
> If it is an etcd iops issue, the etcd member logs should be showing delayed
> heartbeats and/or leader re-election.   Can we get etcd member logs during
> master restart?
> 
> If it is AWS I/O throttling, iostat for the etcd snapshot storage device
> during restart might show something - though in my experience it does not
> show a symptom like the pattern mentioned in comment 4.

Uploaded the etcd logs during a master restart.

Comment 19 Derek Carr 2017-02-27 18:25:09 UTC
UPSTREAM PR:
https://github.com/kubernetes/kubernetes/pull/42178

Comment 25 Troy Dawson 2017-03-07 19:58:38 UTC
This has been merged into OCP and is in OCP v3.4.1.10 or newer.

Comment 27 Zhang Cheng 2017-03-08 09:52:33 UTC
Verified.
[root@ip-172-18-0-40 ~]# oc version
oc v3.4.1.10
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-40.ec2.internal:8443
openshift v3.4.1.10
kubernetes v1.4.0+776c994
[root@ip-172-18-0-40 ~]# oc get node 
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h
[root@ip-172-18-0-40 ~]# systemctl restart atomic-openshift-node
[root@ip-172-18-0-40 ~]# oc get node 
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h
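
As an additional informal check of the original symptom, the API service restart itself can be timed on a master; this is a generic timing check, not part of the verification run above:

time systemctl restart atomic-openshift-master-api
journalctl -u atomic-openshift-master-api --since "5 minutes ago" --no-pager | tail -n 20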

Comment 29 errata-xmlrpc 2017-03-15 20:02:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0512

