Description of problem: Due to change in dependency tree, restarting iptables causes docker.service to try to restart, but it does not do so, resulting in a sigterm. Although the docs require a restart of the docker service when making iptables changes (to prevent networking between pods from failing), crashing the service is unexpected. Version-Release number of selected component (if applicable): 3.4.0 1. Add a rule to /etc/sysconfig/iptables or iptables.d 2. Restart the iptables service: systemctl restart iptables 3. Docker is sent a sigterm and tries to stop 4. Docker gets kill -9ed because it doesnt think it needs to stop 5. Docker restarts 6. Cluster is in disarray due to containers being yanked in and out of existence All of this is happening due to systemd dependency rulesets that have changed between OCP 3.3 and 3.4. Here is a OCP 3.4 system dependency chain snippet: docker.service ● ├─docker-storage-setup.service ● ├─rhel-push-plugin.socket ● ├─system.slice ● └─basic.target ● ├─iptables.service ● ├─microcode.service ● ├─rhel-autorelabel-mark.service ● ├─rhel-autorelabel.service ● ├─rhel-configure.service ● ├─rhel-dmesg.service ● ├─rhel-loadmodules.service ● ├─selinux-policy-migrate-local-changes ● ├─paths.target vs the same chain but for 3.4 docker.service ● ├─docker-storage-setup.service ● ├─iptables.service ● ├─rhel-push-plugin.socket ● ├─system.slice ● └─basic.target ● ├─iptables.service ● ├─microcode.service ● ├─rhel-autorelabel-mark.service ● ├─rhel-autorelabel.service ● ├─rhel-configure.service ● ├─rhel-dmesg.service ● ├─rhel-loadmodules.service ● ├─selinux-policy-migrate-local-changes ● ├─paths.target So when the bellow actions take place, systemd is trying to force docker to restart which it refuses to and eventually systemd more or less Kill -9s the process. Here is an example puppet run that updates iptables file and restarts the service and the consecutive calls to docker to stop: Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[adeluca]/ensure) created Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[wtmowery]/ensure) created Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[kromhout]/ensure) created Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[rebuild_iptables]) Triggered 'refresh' from 1 events Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[check_iptables]) Triggered 'refresh' from 1 events Jan 24 09:25:22 osapp1p systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument. Jan 24 09:25:22 osapp1p systemd: Stopping Docker Application Container Engine... Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561054004-05:00" level=info msg="stopping containerd after receiving terminated" Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561154302-05:00" level=info msg="Processing signal 'terminated'" Jan 24 09:26:53 osapp1p systemd: docker.service stop-final-sigterm timed out. Killing. Jan 24 09:26:53 osapp1p systemd: Unit docker.service entered failed state. Jan 24 09:26:53 osapp1p systemd: docker.service failed.
I've been unable to reproduce this error. Is this still reproducible in the existing environment? [root@ose3-master ~]# rpm -qa atomic-openshift atomic-openshift-3.4.1.9-1.git.0.9752e82.el7.x86_64 [root@ose3-master ~]# rpm -qa docker docker-1.12.6-11.el7.x86_64 [root@ose3-master ~]# systemctl restart iptables [root@ose3-master ~]# journalctl -u docker Mar 06 16:21:16 ose3-master systemd[1]: Stopping Docker Application Container Engine... Mar 06 16:21:16 ose3-master dockerd-current[17124]: time="2017-03-06T16:21:16.742811066-05:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing" Mar 06 16:21:17 ose3-master systemd[1]: Starting Docker Application Container Engine... Mar 06 16:21:17 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:17.322595988-05:00" level=info msg="libcontainerd: new containerd process, pid: 26176" Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390293736-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds" Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390775928-05:00" level=info msg="Loading containers: start." Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.398338507-05:00" level=info msg="Firewalld running: false" Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.508507689-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address" Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.581944386-05:00" level=info msg="Loading containers: done." Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582062323-05:00" level=info msg="Daemon has completed initialization" Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582100505-05:00" level=info msg="Docker daemon" commit="96d83a5/1.12.6" graphdriver=devicemapper version=1.12.6 Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.594740275-05:00" level=info msg="API listen on /var/run/docker.sock" Mar 06 16:21:18 ose3-master systemd[1]: Started Docker Application Container Engine.
Easiest way to check and see if this is going to happen or not is to view the systemd dependency tree for the docker service
[root@ose3-master ~]# systemctl list-dependencies docker docker.service ● ├─docker-cleanup.timer ● ├─docker-storage-setup.service ● ├─iptables.service ● ├─rhel-push-plugin.socket ● ├─system.slice ● └─basic.target ● ├─firewalld.service ● ├─iptables.service ● ├─microcode.service ● ├─rhel-autorelabel-mark.service ● ├─rhel-autorelabel.service ● ├─rhel-configure.service ● ├─rhel-dmesg.service ● ├─rhel-loadmodules.service ● ├─paths.target
Yeah so the above 100% breaks in 3.4.0.x, it also happened in our Dev cluster which is on 3.4.1.5, do not have *.9 installed yet so cannot confirm or deny wether its still a problem, but for sure in the other two version this causes all sorts of havoc as described in the BZ/case attached.
@Boris, could you attach a listing of all package versions installed in the effected environment. Thanks
all openshift related or all pkgs period?
All packages please. Something along the lines of: # rpm -qa | sort > installed-rpms.txt
Created attachment 1260920 [details] prod rpms
Created attachment 1260921 [details] dev rpms
Note that tomorrow early in the AM the -prod rpm list will match pretty closely to the -dev rpm list.
I have reproduced the described behavior. To correct this issue, the docker service should be less tightly coupled to the iptables service. To implement, the custom systemd unit file should be updated as follows. Replace 'Requires=' with 'Wants=' # vi /etc/systemd/system/docker.service.d/custom.conf [Unit] Wants=iptables.service After=iptables.service After editing the file, the systemd configuration should be reloaded: # systemctl daemon-reload Restarting iptables will no longer cause a restart of docker. A PR has been opened to address this and will be backported once approved. https://github.com/openshift/openshift-ansible/pull/3588
Commit pushed to master at https://github.com/openshift/openshift-ansible https://github.com/openshift/openshift-ansible/commit/7c8f77dbf4ea10a5b9034dbd2a6519d8e0d5e8ce Merge pull request #3588 from mtnbikenc/docker-fix Bug 1416156: Loosely couple docker to iptables service
Verified with version openshift-ansible-3.5.28-1.git.0.103513e. Prepare a env with "os_firewall_use_firewalld=false" to make sure OCP using iptables.service. Modified the custom.conf like this, can reproduce the bug: # cat /etc/systemd/system/docker.service.d/custom.conf [Unit] #Wants=iptables.service Requires=iptables.service After=iptables.service # systemctl daemon-reload # systemctl restart iptables # journalctl -xe -f -u docker Change the custom.conf to it just installed, docker.service won't restart when restarting iptables.service. # cat /etc/systemd/system/docker.service.d/custom.conf [Unit] Wants=iptables.service After=iptables.service # systemctl daemon-reload # systemctl restart iptables # journalctl -xe -f -u docker
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0903