Description of problem:
Restarting atomic-openshift-master-api takes a long time, about 5 minutes.

Version-Release number of selected component (if applicable):

How reproducible:
systemctl restart atomic-openshift-master-api

Steps to Reproduce:
1. Restart atomic-openshift-master-api.service
2.
3.

Actual results:
Sometimes the lengthy restart exceeds the systemd timeout and the restart fails.

Expected results:
The service should restart much faster.

Additional info:
Had to increase TimeoutSec=360 in /usr/lib/systemd/system/atomic-openshift-master-api.service; otherwise the API server would time out before it was able to read in all of the decoded directory contents.
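For reference, a drop-in override is a safer place for the timeout change than editing the packaged unit in /usr/lib/systemd/system, which an RPM update will overwrite. A minimal sketch (the file name timeout.conf is arbitrary):

```
# /etc/systemd/system/atomic-openshift-master-api.service.d/timeout.conf
[Service]
TimeoutSec=360
```

After creating the file, run `systemctl daemon-reload` so systemd picks up the override.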
Are there any log details you can provide for when this happens? Right now, any number of things could be stuck during startup, and it's not possible to diagnose this without more detail.
Are you running an RPM or containerized install? Can you restart and provide the recent logs from the API service with `journalctl -u atomic-openshift-master-api --since "10 minutes ago"` (assuming the restart doesn't take longer than 10 minutes)?
There are quite a few resources that the master has to read into the cache on startup, around 10,000. The process is stalling in a highly periodic way during this read: every 30s. The process will decode at :13, then stall, then start again at :43, then stall. My only guess at the moment is that some I/O scheduler, either at the host or AWS level, is at work here, blocking the process for using too much I/O.
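The 30s cadence described above can be checked mechanically from the journal output. A minimal sketch (the helper name `stall_gaps` and the 10-second threshold are illustrative, not from the original report), assuming syslog-style "MMM DD HH:MM:SS" prefixes as in journalctl's default output:

```python
# Hypothetical sketch: detect periodic stalls by measuring the gaps
# between consecutive timestamped log lines.
from datetime import datetime

def stall_gaps(lines, threshold_seconds=10):
    """Return gaps (in seconds) between consecutive log lines that
    exceed threshold_seconds -- candidates for I/O stalls."""
    stamps = []
    for line in lines:
        try:
            # First 15 chars of a default journalctl line: "Mar 01 10:00:13"
            stamps.append(datetime.strptime(line[:15], "%b %d %H:%M:%S"))
        except ValueError:
            continue  # skip lines without a parsable timestamp
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta >= threshold_seconds:
            gaps.append(delta)
    return gaps
```

If the returned gaps cluster tightly around 30.0, that supports the throttling theory; irregular gaps would point elsewhere.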
If it is an etcd iops issue, the etcd member logs should be showing delayed heartbeats and/or leader re-election. Can we get etcd member logs during master restart? If it is AWS I/O throttling, iostat for the etcd snapshot storage device during restart might show something - though in my experience it does not show a symptom like the pattern mentioned in comment 4.
(In reply to Mike Fiedler from comment #8)
> If it is an etcd iops issue, the etcd member logs should be showing delayed
> heartbeats and/or leader re-election. Can we get etcd member logs during
> master restart?
>
> If it is AWS I/O throttling, iostat for the etcd snapshot storage device
> during restart might show something - though in my experience it does not
> show a symptom like the pattern mentioned in comment 4.

Uploaded the etcd logs during a master restart.
UPSTREAM PR: https://github.com/kubernetes/kubernetes/pull/42178
This has been merged into ocp and is in OCP v3.4.1.10 or newer.
Verified.

[root@ip-172-18-0-40 ~]# oc version
oc v3.4.1.10
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip-172-18-0-40.ec2.internal:8443
openshift v3.4.1.10
kubernetes v1.4.0+776c994

[root@ip-172-18-0-40 ~]# oc get node
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h

[root@ip-172-18-0-40 ~]# systemctl restart atomic-openshift-node
[root@ip-172-18-0-40 ~]# oc get node
NAME                           STATUS                     AGE
ip-172-18-0-40.ec2.internal    Ready,SchedulingDisabled   2h
ip-172-18-5-189.ec2.internal   Ready                      2h
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0512