Bug 1441994

Summary: Rebooting nodes is slow
Product: OpenShift Container Platform Reporter: Marko Myllynen <myllynen>
Component: Node Assignee: Seth Jennings <sjenning>
Status: CLOSED WONTFIX QA Contact: Xiaoli Tian <xtian>
Severity: low Docs Contact:
Priority: unspecified    
Version: 3.4.1 CC: aos-bugs, eparis, gblomqui, jokerman, jswensso, mmccomas
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-07-03 17:56:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marko Myllynen 2017-04-13 09:17:21 UTC
Description of problem:
In a containerized HA setup (3 master/etcd nodes) on RHEL, rebooting a master for whatever reason (e.g., after applying OS-level upgrades) takes several minutes (~6) because several services shut down very slowly.

But if these services are stopped manually before the reboot, the whole reboot is almost instant:

systemctl stop atomic-openshift-master-controllers.service 
systemctl stop atomic-openshift-master-api.service 
systemctl stop atomic-openshift-node.service 
systemctl stop etcd_container.service 
systemctl stop openvswitch.service 

Note that an OS upgrade is not needed; this happens even when merely rebooting a master. Also, the above steps appear to be needed, but it could be that not all of them are necessary.
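
The manual steps above can be sketched as a small helper. This is only an illustration of the workaround, not a fix: the function name stop_units and the SYSTEMCTL override variable are hypothetical and exist only to allow a dry run; the unit names are the ones listed in this report.

```shell
#!/bin/sh
# Units from the description, in the order they were stopped manually.
UNITS="atomic-openshift-master-controllers.service
atomic-openshift-master-api.service
atomic-openshift-node.service
etcd_container.service
openvswitch.service"

# Hypothetical override: set SYSTEMCTL=echo to print the commands instead of running them.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stop the given units one by one, warning on stderr if a stop fails.
stop_units() {
    for unit in "$@"; do
        "$SYSTEMCTL" stop "$unit" || echo "warning: failed to stop $unit" >&2
    done
}

# Example (run as root on a master; the reboot itself is left to the operator):
#   stop_units $UNITS && systemctl reboot
```

On a non-master node the same helper could be called with only the node and openvswitch units.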

Please make sure rebooting a master node happens quickly even if not manually stopping services.

Version-Release number of selected component (if applicable):
v3.4.1.12 images, latest RHEL 7 RPMs.

Comment 1 Marko Myllynen 2017-04-13 12:12:25 UTC
This is also an issue on non-master nodes; there, stopping these helps:

systemctl stop atomic-openshift-node.service
systemctl stop openvswitch.service

But if there are containers still running, then shutting down docker.service will take a few minutes. Draining the node doesn't remove daemonset pods (e.g., fluentd from a typical basic installation) without some extra gymnastics (e.g., oc label node <node> logging-infra-fluentd=false --overwrite).

The latter may be a bit hard to automate, but the former is probably something that can be solved without going into specifics.

Comment 2 Eric Paris 2017-04-17 18:32:20 UTC
Hopefully the kube team can at least do some initial triage to figure out what is slow. Then we can re-assess who the BZ 'belongs' to.

Comment 5 Seth Jennings 2017-09-05 01:40:56 UTC
The ExecStop for all containerized openshift units is a "docker stop" command.

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_master/templates/docker-cluster

They have:

After={{ openshift.docker.service_name }}.service
Requires={{ openshift.docker.service_name }}.service
PartOf={{ openshift.docker.service_name }}.service

There might be some docker dependency that is not being expressed in the unit files, causing "docker stop" to hang and delaying shutdown.

There isn't enough information here to verify that theory.  I'm not sure exactly what information would be useful.  Just to verify, this is OCP 3.4?

If you watch the console when shutting down, it should show the units that are holding up the shutdown.  Can you make a list of those units?  Especially note if docker is one of them.  If docker is shut down before the atomic-openshift units, that would indicate a dependency issue.
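
One way to cross-check the ordering question is to ask systemd which dependencies it has recorded for a unit. systemd stops units in the reverse of their start order, so a unit with After=docker.service should be stopped before docker. The sketch below is illustrative only; the function name show_unit_ordering and the SYSTEMCTL override are hypothetical.

```shell
#!/bin/sh
# Hypothetical override: set SYSTEMCTL=echo to print the command instead of running it.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print the ordering and requirement properties systemd has loaded for a unit.
show_unit_ordering() {
    "$SYSTEMCTL" show -p After -p Before -p Requires -p PartOf "$1"
}

# Example:
#   show_unit_ordering atomic-openshift-node.service
```

If docker's service name is missing from the After= line of an atomic-openshift unit, that would support the dependency theory.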

As a workaround, one can set DefaultTimeoutStopSec lower in /etc/systemd/system.conf to speed up the shutdown process.
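
A minimal sketch of that workaround follows. The helper name set_default_stop_timeout and the 30s value in the example are assumptions made for illustration; pick a timeout that suits the environment. The helper rewrites an existing (possibly commented-out) DefaultTimeoutStopSec line, or appends one, which assumes the file contains only the [Manager] section, as the stock system.conf does.

```shell
#!/bin/sh
# Set DefaultTimeoutStopSec in a systemd manager config file.
#   $1: path to the config file (e.g. /etc/systemd/system.conf)
#   $2: timeout value (e.g. 30s)
set_default_stop_timeout() {
    conf="$1"
    value="$2"
    if grep -q '^#\{0,1\}DefaultTimeoutStopSec=' "$conf"; then
        # Replace the existing line, whether or not it is commented out.
        sed "s/^#\{0,1\}DefaultTimeoutStopSec=.*/DefaultTimeoutStopSec=$value/" "$conf" > "$conf.tmp" &&
            cat "$conf.tmp" > "$conf" &&
            rm -f "$conf.tmp"
    else
        # No such line yet: append it (assumes [Manager] is the last section).
        printf 'DefaultTimeoutStopSec=%s\n' "$value" >> "$conf"
    fi
}

# Example (as root; 30s is an illustrative value):
#   set_default_stop_timeout /etc/systemd/system.conf 30s
```

The running systemd instance likely needs to re-read its configuration (for example via systemctl daemon-reexec or a reboot) before the new default applies. Note that this only caps how long a hung stop is waited for; it does not address why the stop hangs.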

Comment 6 Marko Myllynen 2017-09-05 15:47:25 UTC
(In reply to Seth Jennings from comment #5)
> 
> There might be some docker dependency that is not being expressed in the
> unit files, causing "docker stop" to hang and delaying shutdown.
> 
> There isn't enough information here to verify that theory.  I'm not sure
> exactly what information would be useful.  Just to verify, this is OCP 3.4?

This reproduces on containerized OCP 3.6 as well.

> If you watch the console when shutting down, it should show the units that
> are holding up the shutdown.  Can you make a list of those units? 
> Especially note if docker is one of them.  If docker is shut down before the
> atomic-openshift units, that would indicate a dependency issue.

Shortly after a fresh OCP 3.6 installation I rebooted a master. First, things are slow with:

         Stopping atomic-openshift-node.service...
         Stopping Atomic OpenShift Master Controllers...

The next pause is with:

         Stopping openvswitch.service...
         Stopping Atomic OpenShift Master API...

Then with:

         Stopping The Etcd Server container...

After these I can see Docker stopping almost instantly:

         Stopping Docker Application Container Engine...

But if these are stopped manually one by one prior to the reboot, then all of them stop almost instantly. So it might be that starting to shut down some of the services in parallel triggers some sort of dependency issue which causes the slowdown.
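
To put numbers on the one-by-one case, each manual stop could be timed with something like the sketch below. The function name time_unit_stops and the SYSTEMCTL override are hypothetical; the timing uses whole seconds from date +%s, so very short stops show as 0s.

```shell
#!/bin/sh
# Hypothetical override: set SYSTEMCTL=true for a dry run that stops nothing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stop each unit in turn and print how many whole seconds each stop took.
time_unit_stops() {
    for unit in "$@"; do
        start=$(date +%s)
        "$SYSTEMCTL" stop "$unit"
        end=$(date +%s)
        echo "$unit $((end - start))s"
    done
}

# Example (as root on a master, before rebooting):
#   time_unit_stops atomic-openshift-node.service atomic-openshift-master-controllers.service
```

Comparing these per-unit times with the pauses seen on the console during a normal reboot would show whether the delay only appears when the units are stopped in parallel.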

> As a workaround, one can set DefaultTimeoutStopSec lower in
> /etc/systemd/system.conf to speed up the shutdown process.

Yes, thanks, this should be a good workaround.

Comment 8 Greg Blomquist 2019-07-03 17:56:59 UTC
No updates in 2 years, and the workaround was verified in comment #6.