Bug 1425211 - [3.4] Restarting atomic-openshift-master-api takes ~5 minutes to restart
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Release: 3.4.z
Assigned To: Derek Carr
QA Contact: DeShuai Ma
Whiteboard: OpsBlocker
Depends On:
Blocks: OSOPS_V3 1427532
Reported: 2017-02-20 15:50 EST by Max Whittingham
Modified: 2018-04-23 14:53 EDT
CC: 11 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Excessive logging to the journal.
Consequence: Masters took longer to restart.
Fix: Reduce the amount of logging that occurs when the initial list/watch actions happen against etcd.
Result: The journal is no longer flooded with messages that cause log entries to be rate limited and dropped. Server restart time should be improved on clusters with larger data sets.
Story Points: ---
Clone Of:
Clones: 1427532
Environment:
Last Closed: 2017-03-15 16:02:48 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Origin (Github) 13126 None None None 2017-02-28 09:25 EST
Red Hat Product Errata RHBA-2017:0512 normal SHIPPED_LIVE OpenShift Container Platform 3.4.1.10, 3.3.1.17, and 3.2.1.28 bug fix update 2017-03-15 20:01:17 EDT

Description Max Whittingham 2017-02-20 15:50:36 EST
Description of problem:
Restarting atomic-openshift-master-api takes a long time, roughly 5 minutes.

Version-Release number of selected component (if applicable):


How reproducible:
systemctl restart atomic-openshift-master-api

Steps to Reproduce:
1. restart atomic-openshift-master-api.service

Actual results:
Sometimes, this lengthy restart time will cause a timeout and the restart will fail.

Expected results:
The service should restart much faster.

Additional info:
Had to increase TimeoutSec=360 in /usr/lib/systemd/system/atomic-openshift-master-api.service; otherwise the API server would time out before it had finished reading and decoding everything from etcd.
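As an aside, the same workaround can be expressed as a systemd drop-in instead of editing the packaged unit file (which an RPM update would overwrite). A minimal sketch; the drop-in directory and TimeoutSec= option are standard systemd, and the file name timeout.conf is an arbitrary choice:

```ini
# /etc/systemd/system/atomic-openshift-master-api.service.d/timeout.conf
# Run `systemctl daemon-reload` after creating this file.
[Service]
TimeoutSec=360
```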
Comment 1 Derek Carr 2017-02-20 16:53:52 EST
Are there any log details you can provide for when this happens? Right now, any number of things could be stuck during startup, and it's not possible to diagnose this without more detail.
Comment 2 Seth Jennings 2017-02-20 17:02:01 EST
Are you running an RPM or containerized install?

Can you restart and provide the recent logs from the API service with

journalctl -u atomic-openshift-master-api --since "10 minutes ago"

(assuming the restart doesn't take longer than 10 minutes)
Comment 4 Seth Jennings 2017-02-20 18:02:06 EST
There are quite a few resources that the master has to read into the cache on startup, around 10,000. The process is stalling in a highly periodic way during this read, every 30s: it will decode at :13, then stall, then start again at :43, then stall. My only guess at the moment is that some I/O scheduler, either at the host or AWS level, is at work here, blocking the process for using too much I/O.
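The periodic gaps described above can be spotted mechanically from the journal. A minimal sketch, assuming lines whose first field is a unix timestamp (journalctl's short-unix output mode on newer systemd; on older systemd the timestamps would need converting first); the 10-second threshold is an arbitrary choice:

```shell
# Report gaps longer than 10 seconds between consecutive journal lines,
# given input where field 1 is a unix timestamp (journalctl -o short-unix).
detect_stalls() {
  awk '{ if (prev != "" && $1 - prev > 10) printf "%.0fs stall before line %d\n", $1 - prev, NR; prev = $1 }'
}

# Synthetic example reproducing a 30-second gap:
printf '100.0 decoding\n101.0 decoding\n131.0 decoding\n' | detect_stalls
# -> 30s stall before line 3
```

Piping `journalctl -u atomic-openshift-master-api -o short-unix` through a filter like this during a restart would confirm whether the stalls really land on a fixed 30-second period.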
Comment 8 Mike Fiedler 2017-02-22 11:37:14 EST
If it is an etcd iops issue, the etcd member logs should be showing delayed heartbeats and/or leader re-election.   Can we get etcd member logs during master restart?

If it is AWS I/O throttling, iostat for the etcd snapshot storage device during restart might show something - though in my experience it does not show a symptom like the pattern mentioned in comment 4.
Comment 10 Max Whittingham 2017-02-22 14:42:04 EST
(In reply to Mike Fiedler from comment #8)
> If it is an etcd iops issue, the etcd member logs should be showing delayed
> heartbeats and/or leader re-election.   Can we get etcd member logs during
> master restart?
> 
> If it is AWS I/O throttling, iostat for the etcd snapshot storage device
> during restart might show something - though in my experience it does not
> show a symptom like the pattern mentioned in comment 4.

Uploaded the etcd logs during a master restart.
Comment 19 Derek Carr 2017-02-27 13:25:09 EST
UPSTREAM PR:
https://github.com/kubernetes/kubernetes/pull/42178
Comment 25 Troy Dawson 2017-03-07 14:58:38 EST
This has been merged into OCP and is in OCP v3.4.1.10 or newer.
Comment 27 Zhang Cheng 2017-03-08 04:52:33 EST
Verified.
[root@ip-172-18-0-40 ~]# oc version
oc v3.4.1.10
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-0-40.ec2.internal:8443
openshift v3.4.1.10
kubernetes v1.4.0+776c994
[root@ip-172-18-0-40 ~]# oc get node 
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h
[root@ip-172-18-0-40 ~]# systemctl restart atomic-openshift-node
[root@ip-172-18-0-40 ~]# oc get node 
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h
Comment 29 errata-xmlrpc 2017-03-15 16:02:48 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0512
