Description of problem:

In a containerized HA setup (3 master/etcd nodes) on RHEL, rebooting a master for whatever reason (e.g., after applying OS-level upgrades) takes several minutes (~6) because several services shut down very slowly. If these services are stopped manually before the reboot, the whole reboot is almost instant:

systemctl stop atomic-openshift-master-controllers.service
systemctl stop atomic-openshift-master-api.service
systemctl stop atomic-openshift-node.service
systemctl stop etcd_container.service
systemctl stop openvswitch.service

Note that an OS upgrade is not needed; this happens even when merely rebooting a master. Also, it looks like the above steps are needed, but it could be that not all of them are necessary.

Please make sure rebooting a master node happens quickly even without manually stopping services.

Version-Release number of selected component (if applicable):

v3.4.1.12 images, latest RHEL 7 RPMs.
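For anyone who wants to script the manual stop sequence from the description, here is a minimal sketch. The helper names (run, stop_units_then_reboot) and the DRY_RUN switch are illustrative assumptions, not part of any OpenShift tooling; the unit names and their order are taken from the report. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: stop the containerized OpenShift units one at a time, in the order
# listed in the report, then reboot. DRY_RUN defaults to 1 (print only);
# set DRY_RUN=0 to really run the commands. Helper names are hypothetical.
UNITS="atomic-openshift-master-controllers.service
atomic-openshift-master-api.service
atomic-openshift-node.service
etcd_container.service
openvswitch.service"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show the command instead of executing it.
        echo "$*"
    else
        "$@"
    fi
}

stop_units_then_reboot() {
    for unit in $UNITS; do
        # Keep going if one unit fails to stop; the reboot will still stop it.
        run systemctl stop "$unit" || echo "warning: failed to stop $unit" >&2
    done
    run systemctl reboot
}

stop_units_then_reboot
```

Stopping the units sequentially matches what was observed to be fast; whether all five stops are actually required is still open, as noted above.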
This is also an issue on non-master nodes; there, stopping these helps:

systemctl stop atomic-openshift-node.service
systemctl stop openvswitch.service

But if containers are still running, shutting down docker.service will take a few minutes. Draining the node doesn't remove daemonset pods (e.g., fluentd from a typical basic installation) without some extra gymnastics (e.g., oc label node <node> logging-infra-fluentd=false --overwrite). The latter may be a bit hard to automate, but the former is probably something that can be solved without going into specifics.
Hopefully the kube team can at least do some initial triage to figure out what is slow. Then we can reassess who the BZ 'belongs' to.
The ExecStop for all containerized openshift units is a "docker stop" command.

https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_master/templates/docker-cluster

They have:

After={{ openshift.docker.service_name }}.service
Requires={{ openshift.docker.service_name }}.service
PartOf={{ openshift.docker.service_name }}.service

There might be some docker dependency that is not being expressed in the unit files, causing "docker stop" to hang and delaying shutdown.

There isn't enough information here to verify that theory. I'm not sure exactly what information would be useful. Just to verify, this is OCP 3.4?

If you watch the console while shutting down, it should show the units that are holding up the shutdown. Can you make a list of those units? Especially note if docker is one of them. If docker is shut down before the atomic-openshift units, that would indicate a dependency issue.

As a workaround, one can set DefaultTimeoutStopSec lower in /etc/systemd/system.conf to speed up the shutdown process.
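A minimal sketch of applying the DefaultTimeoutStopSec workaround from a script. The helper name set_stop_timeout and the 30s value in the usage note are assumptions for illustration; pick a timeout that suits the environment. It rewrites an existing (possibly commented-out) DefaultTimeoutStopSec= line, or appends one if none is present.

```shell
#!/bin/sh
# Sketch: set DefaultTimeoutStopSec in a systemd manager config file.
# set_stop_timeout is a hypothetical helper, not a systemd tool.
# Usage: set_stop_timeout <conf-file> <value>
# The value must not contain "/" or "&", which are special to sed here.
set_stop_timeout() {
    conf=$1     # e.g. /etc/systemd/system.conf
    value=$2    # e.g. 30s (illustrative)
    if grep -q '^#\{0,1\}DefaultTimeoutStopSec=' "$conf"; then
        # Rewrite the existing setting, uncommenting it if necessary.
        tmp=$(mktemp) || return 1
        sed "s/^#\{0,1\}DefaultTimeoutStopSec=.*/DefaultTimeoutStopSec=$value/" \
            "$conf" > "$tmp" || { rm -f "$tmp"; return 1; }
        # Write back through the original file so ownership and mode are kept.
        cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        # No such line yet: append it (system.conf normally has only [Manager]).
        printf 'DefaultTimeoutStopSec=%s\n' "$value" >> "$conf"
    fi
}
```

For example, as root: set_stop_timeout /etc/systemd/system.conf 30s, followed by systemctl daemon-reexec (or a reboot) so the manager picks up the new default. This only changes the default for units that do not set their own TimeoutStopSec, and it bounds the delay rather than fixing whatever makes "docker stop" slow.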
(In reply to Seth Jennings from comment #5)
> There might be some docker dependency that is not being expressed in the
> unit files, causing "docker stop" to hang and delaying shutdown.
>
> There isn't enough information here to verify that theory. I'm not sure
> exactly what information would be useful. Just to verify, this is OCP 3.4?

This reproduces on containerized OCP 3.6 as well.

> If you watch the console while shutting down, it should show the units that
> are holding up the shutdown. Can you make a list of those units?
> Especially note if docker is one of them. If docker is shut down before the
> atomic-openshift units, that would indicate a dependency issue.

Shortly after a fresh OCP 3.6 installation I rebooted a master. First, things are slow with:

Stopping atomic-openshift-node.service...
Stopping Atomic OpenShift Master Controllers...

The next pause is with:

Stopping openvswitch.service...
Stopping Atomic OpenShift Master API...

Then with:

Stopping The Etcd Server container...

After these I can see Docker stopping almost instantly:

Stopping Docker Application Container Engine...

But if these are stopped manually one by one prior to the reboot, then all of them stop almost instantly. So it might be that shutting down some of the services in parallel triggers some sort of dependency issue that causes the slowdown.

> As a workaround, one can set DefaultTimeoutStopSec lower in
> /etc/systemd/system.conf to speed up the shutdown process.

Yes, thanks, this should be a good workaround.
No updates in 2 years, and the workaround was verified in comment #6.