1441994 – Rebooting nodes is slow

Summary: Rebooting nodes is slow

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Seth Jennings
QA Contact:	Xiaoli Tian
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-04-13 09:17 UTC by Marko Myllynen
Modified:	2019-07-03 17:56 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-03 17:56:59 UTC
Target Upstream Version:
Embargoed:

Description Marko Myllynen 2017-04-13 09:17:21 UTC

Description of problem:
In a containerized HA setup (3 master/etcd nodes) on RHEL, rebooting a master for whatever reason (e.g., after applying OS-level upgrades) takes several minutes (~6) because several services shut down very slowly.

But if these services are stopped manually before the reboot, the whole reboot is almost instant:

systemctl stop atomic-openshift-master-controllers.service 
systemctl stop atomic-openshift-master-api.service 
systemctl stop atomic-openshift-node.service 
systemctl stop etcd_container.service 
systemctl stop openvswitch.service 

Note that an OS upgrade is not needed; this happens even when merely rebooting a master. Also, the above steps appear to be needed, but it could be that not all of them are necessary.
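
The manual steps above could be wrapped in a small script along the following lines. This is only a sketch: the unit names and their order come from the list above, while the DRY_RUN switch (defaulting to printing the commands instead of running them) and the stop_units helper are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: stop the units from the report one by one before rebooting a master.
# DRY_RUN defaults to 1 (an assumption of this sketch), so commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Unit names and order are taken from the description above.
UNITS="atomic-openshift-master-controllers.service
atomic-openshift-master-api.service
atomic-openshift-node.service
etcd_container.service
openvswitch.service"

stop_units() {
    for unit in $UNITS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "systemctl stop $unit"
        else
            # Keep going if one unit fails to stop; the reboot would stop it anyway.
            systemctl stop "$unit" || echo "warning: could not stop $unit" >&2
        fi
    done
}

stop_units
# With DRY_RUN=0 one would follow this with: systemctl reboot
```

Stopping the units sequentially mirrors what was done by hand; it does not explain why the parallel shutdown is slow.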

Please make sure rebooting a master node is quick even without manually stopping services first.

Version-Release number of selected component (if applicable):
v3.4.1.12 images, latest RHEL 7 RPMs.

Comment 1 Marko Myllynen 2017-04-13 12:12:25 UTC

This is also an issue on non-master nodes; there, stopping these helps:

systemctl stop atomic-openshift-node.service
systemctl stop openvswitch.service

But if there are containers still running, then shutting down docker.service will take a few minutes. Draining the node doesn't remove daemonset pods (e.g., fluentd from a typical basic installation) without some extra gymnastics (e.g., oc label node <node> logging-infra-fluentd=false --overwrite).

The latter may perhaps be a bit hard to automate, but the former is probably something that can be solved without going into specifics.
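
For illustration, the steps described in this comment could be collected as below. This is a sketch that only prints commands and runs none of them. The label and systemctl commands are the ones quoted above; the drain invocation, the NODE placeholder and the prepare_node_cmds helper are assumptions. The oc commands would be run from a host with cluster-admin access, the systemctl commands on the node itself.

```shell
#!/bin/sh
# Sketch: print the commands for preparing a non-master node for reboot.
# NODE is a hypothetical placeholder; nothing is executed, only printed.
NODE="${NODE:-node1.example.com}"

prepare_node_cmds() {
    node="$1"
    # Evict regular pods; daemonset pods are left in place (assumed invocation).
    echo "oc adm drain $node --ignore-daemonsets"
    # From comment 1: relabel the node so the fluentd daemonset removes its pod.
    echo "oc label node $node logging-infra-fluentd=false --overwrite"
    # From comment 1: stop the node services before docker.service is stopped.
    echo "systemctl stop atomic-openshift-node.service"
    echo "systemctl stop openvswitch.service"
}

prepare_node_cmds "$NODE"
```

After the reboot the node would presumably need to be made schedulable again and the fluentd label restored.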

Comment 2 Eric Paris 2017-04-17 18:32:20 UTC

Hopefully the kube team can at least do some initial triage to figure out what is slow. Then we can reassess who the BZ 'belongs' to.

Comment 5 Seth Jennings 2017-09-05 01:40:56 UTC

The ExecStop for all containerized openshift units is a "docker stop" command.

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_master/templates/docker-cluster

They have:

After={{ openshift.docker.service_name }}.service
Requires={{ openshift.docker.service_name }}.service
PartOf={{ openshift.docker.service_name }}.service

There might be some docker dependency that is not being expressed in the unit files, causing "docker stop" to hang and delaying shutdown.

There isn't enough information here to verify that theory.  I'm not sure exactly what information would be useful.  Just to verify, this is OCP 3.4?

If you watch the console when shutting down, it should show the units that are holding up the shutdown.  Can you make a list of those units?  Especially note if docker is one of them.  If docker is shut down before the atomic-openshift units, that would indicate a dependency issue.
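
Besides watching the console, the ordering that systemd actually loaded can be inspected per unit. The sketch below prints the After, Requires and PartOf properties with systemctl show. The show_unit_deps helper and the overridable SYSTEMCTL variable are assumptions for illustration; the unit names in the example are those from the description.

```shell
#!/bin/sh
# Sketch: print ordering/dependency properties of the containerized units.
# SYSTEMCTL can be overridden (an assumption of this sketch, handy for testing).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

show_unit_deps() {
    for unit in "$@"; do
        echo "== $unit"
        # After/Requires/PartOf are the directives quoted from the unit templates.
        $SYSTEMCTL show -p After -p Requires -p PartOf "$unit"
    done
}

# Example (run on a master):
# show_unit_deps atomic-openshift-master-api.service atomic-openshift-node.service etcd_container.service
```

Since systemd stops units in the reverse of their start order, a unit listing the docker service under After= should be stopped before docker at shutdown.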

As a workaround, one can set DefaultTimeoutStopSec lower in /etc/systemd/system.conf to speed up the shutdown process.
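
A minimal sketch of applying that workaround follows. The set_stop_timeout helper and the 30s value are assumptions for illustration, not recommended settings, and sed -i is used as provided by GNU sed on RHEL.

```shell
#!/bin/sh
# Sketch: set DefaultTimeoutStopSec in a systemd manager configuration file.
# Usage: set_stop_timeout FILE VALUE   (the 30s below is an arbitrary example)
set_stop_timeout() {
    conf="$1"
    value="$2"
    if grep -q '^#\{0,1\}DefaultTimeoutStopSec=' "$conf"; then
        # Replace the existing (possibly commented-out) setting in place.
        sed -i "s/^#\{0,1\}DefaultTimeoutStopSec=.*/DefaultTimeoutStopSec=$value/" "$conf"
    else
        # system.conf has only the [Manager] section, so appending is enough.
        printf 'DefaultTimeoutStopSec=%s\n' "$value" >> "$conf"
    fi
}

# Example (as root):
# set_stop_timeout /etc/systemd/system.conf 30s
# systemctl daemon-reexec   # assumed necessary for the manager to pick up the change
```

A lower timeout only shortens how long systemd waits before forcibly killing a unit that is slow to stop; it does not address the underlying cause of the slow stop.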

Comment 6 Marko Myllynen 2017-09-05 15:47:25 UTC

(In reply to Seth Jennings from comment #5)
> 
> There might be some docker dependency that is not being expressed in the
> unit files, causing "docker stop" to hang and delaying shutdown.
> 
> There isn't enough information here to verify that theory.  I'm not sure
> exactly what information would be useful.  Just to verify, this is OCP 3.4?

This reproduces on containerized OCP 3.6 as well.

> If you watch the console when shutting down, it should show the units that
> are holding up the shutdown.  Can you make a list of those units? 
> Especially note if docker is one of them.  If docker is shut down before the
> atomic-openshift units, that would indicate a dependency issue.

Shortly after a fresh OCP 3.6 installation I rebooted a master. Things are first slow at:

         Stopping atomic-openshift-node.service...
         Stopping Atomic OpenShift Master Controllers...

The next pause is with:

         Stopping openvswitch.service...
         Stopping Atomic OpenShift Master API...

Then with:

         Stopping The Etcd Server container...

After these I can see Docker stopping almost instantly:

         Stopping Docker Application Container Engine...

But if I stop these manually one by one prior to the reboot, all of them stop almost instantly. So it might be that shutting down some of the services in parallel triggers some sort of dependency issue, which causes the slowdown.

> As a workaround, one can set DefaultTimeoutStopSec lower in
> /etc/systemd/system.conf to speed up the shutdown process.

Yes, thanks, this should be a good workaround.

Comment 8 Greg Blomquist 2019-07-03 17:56:59 UTC

No updates in 2 years, and the workaround was verified in comment #6.
