Bug 1441994
Summary: | Rebooting nodes is slow | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Marko Myllynen <myllynen> |
Component: | Node | Assignee: | Seth Jennings <sjenning> |
Status: | CLOSED WONTFIX | QA Contact: | Xiaoli Tian <xtian> |
Severity: | low | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.4.1 | CC: | aos-bugs, eparis, gblomqui, jokerman, jswensso, mmccomas |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2019-07-03 17:56:59 UTC | Type: | Bug |
Description
Marko Myllynen
2017-04-13 09:17:21 UTC
This is also an issue on non-master nodes. There, stopping these services first helps:

    systemctl stop atomic-openshift-node.service
    systemctl stop openvswitch.service

But if containers are still running, shutting down docker.service takes a few minutes. Draining the node doesn't remove daemonset pods (e.g., fluentd from a typical basic installation) without some extra gymnastics (e.g., oc label node <node> logging-infra-fluentd=false --overwrite). The latter may be a bit hard to automate, but the former can probably be solved without going into specifics.

Hopefully the kube team can at least do some initial triage to figure out what is slow. Then we can reassess who the BZ 'belongs' to.

The ExecStop for all containerized openshift units is a "docker stop" command.

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_master/templates/docker-cluster

They have:

    After={{ openshift.docker.service_name }}.service
    Requires={{ openshift.docker.service_name }}.service
    PartOf={{ openshift.docker.service_name }}.service

There might be some docker dependency that is not being expressed in the unit files, causing "docker stop" to hang and delaying shutdown.

There isn't enough information here to verify that theory. I'm not sure exactly what information would be useful. Just to verify, this is OCP 3.4?

If you watch the console when shutting down, it should show the units that are holding up the shutdown. Can you make a list of those units? Especially note if docker is one of them. If docker is shut down before the atomic-openshift units, that would indicate a dependency issue.

As a workaround, one can set DefaultTimeoutStopSec lower in /etc/systemd/system.conf to speed up the shutdown process.

(In reply to Seth Jennings from comment #5)
> There might be some docker dependency that is not being expressed in the
> unit files, causing "docker stop" to hang and delaying shutdown.
>
> There isn't enough information here to verify that theory. I'm not sure
> exactly what information would be useful. Just to verify, this is OCP 3.4?

This reproduces on containerized OCP 3.6 as well.

> If you watch the console when shutting down, it should show the units that
> are holding up the shutdown. Can you make a list of those units?
> Especially note if docker is one of them. If docker is shut down before the
> atomic-openshift units, that would indicate a dependency issue.

A bit after a fresh OCP 3.6 installation I rebooted a master. Things are first slow with:

    Stopping atomic-openshift-node.service...
    Stopping Atomic OpenShift Master Controllers...

The next pause is with:

    Stopping openvswitch.service...
    Stopping Atomic OpenShift Master API...

Then with:

    Stopping The Etcd Server container...

After these I can see Docker stopping almost instantly:

    Stopping Docker Application Container Engine...

But if I stop these manually, one by one, prior to the reboot, all of them stop almost instantly. So it might be that shutting down some of the services in parallel triggers some sort of dependency issue which causes the slowdown.

> As a workaround, one can set DefaultTimeoutStopSec lower in
> /etc/systemd/system.conf to speed up the shutdown process.

Yes, thanks, this should be a good workaround.

No updates in 2 years and workaround verified in comment #6.
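
For anyone wanting to script the manual workaround described above (stopping the units one by one before rebooting), here is a minimal sketch. The function name stop_units_in_order, the NODE_UNITS variable and the DRY_RUN switch are hypothetical names made up for illustration. The unit names are the two mentioned in this report for a non-master node; master hosts would need their own list, which this report does not spell out as unit names.

```shell
#!/bin/sh
# Sketch only: stop the given systemd units sequentially, in the order given,
# so that no two of them are being stopped in parallel.
# DRY_RUN=1 (hypothetical switch) prints the commands instead of running them.
stop_units_in_order() {
    for unit in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "systemctl stop $unit"
        else
            # Keep going if one unit fails to stop, but say so on stderr.
            systemctl stop "$unit" || echo "warning: failed to stop $unit" >&2
        fi
    done
}

# Units named in this report for a non-master node (assumed order).
NODE_UNITS="atomic-openshift-node.service openvswitch.service"

# Example usage (not executed here):
#   DRY_RUN=1 stop_units_in_order $NODE_UNITS   # preview
#   stop_units_in_order $NODE_UNITS && reboot   # stop, then reboot
```

This only automates what was observed to help in this report. It does not address the suspected dependency problem in the unit files, and it does not replace lowering DefaultTimeoutStopSec in /etc/systemd/system.conf.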