Bug 1416156 - Modifying iptables causes ungraceful termination of docker.service
Summary: Modifying iptables causes ungraceful termination of docker.service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Russell Teague
QA Contact: Wenkai Shi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-24 17:49 UTC by Steven Walter
Modified: 2017-07-24 14:11 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A custom systemd unit file was used for the docker service specifying 'Requires=iptables'. Consequence: This resulted in the docker service being restarted whenever iptables was restarted. Fix: The custom systemd unit file was updated to specify 'Wants=iptables'. Result: iptables is still started before docker, but restarting iptables no longer causes docker to restart.
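
The one-line change described above can be sketched as a before/after view of the drop-in (the path is the one shown in Comment 12):

```ini
# /etc/systemd/system/docker.service.d/custom.conf

# Before (buggy): Requires= propagates an iptables restart to docker.
#[Unit]
#Requires=iptables.service
#After=iptables.service

# After (fixed): Wants= still orders iptables before docker at boot,
# but restarting iptables no longer restarts docker.
[Unit]
Wants=iptables.service
After=iptables.service
```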
Clone Of:
Environment:
Last Closed: 2017-04-12 18:49:16 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
prod rpms (24.09 KB, text/plain)
2017-03-07 18:51 UTC, Boris Kurktchiev
dev rpms (24.06 KB, text/plain)
2017-03-07 18:52 UTC, Boris Kurktchiev


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0903 0 normal SHIPPED_LIVE OpenShift Container Platform atomic-openshift-utils bug fix and enhancement 2017-04-12 22:45:42 UTC

Description Steven Walter 2017-01-24 17:49:27 UTC
Description of problem:
Due to a change in the dependency tree, restarting iptables causes docker.service to try to restart, but it does not do so cleanly, resulting in a SIGTERM. Although the docs require a restart of the docker service when making iptables changes (to prevent networking between pods from failing), crashing the service is unexpected.

Version-Release number of selected component (if applicable):
3.4.0



Steps to Reproduce:
1. Add a rule to /etc/sysconfig/iptables or iptables.d
2. Restart the iptables service: systemctl restart iptables
3. Docker is sent a SIGTERM and tries to stop
4. Docker gets killed with SIGKILL (kill -9) because it doesn't think it needs to stop
5. Docker restarts
6. The cluster is in disarray due to containers being yanked in and out of existence

All of this is happening due to systemd dependency rulesets that changed between OCP 3.3 and 3.4. Here is an OCP 3.3 systemd dependency chain snippet:
docker.service
● ├─docker-storage-setup.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─selinux-policy-migrate-local-changes
●   ├─paths.target

vs. the same chain on OCP 3.4:
docker.service
● ├─docker-storage-setup.service
● ├─iptables.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─selinux-policy-migrate-local-changes
●   ├─paths.target

So when the above actions take place, systemd tries to force docker to restart, which docker refuses, and eventually systemd more or less kill -9s the process.
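
The direct iptables.service edge in the 3.4 tree comes from a drop-in unit along these lines (a sketch; the actual file is shown in Comment 12 below):

```ini
# Drop-in for docker.service creating the direct dependency edge.
[Unit]
# Requires= propagates state: restarting or stopping iptables.service
# makes systemd restart or stop docker.service as well.
Requires=iptables.service
# After= only controls start ordering; it never propagates restarts.
After=iptables.service
```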

Here is an example puppet run that updates the iptables file and restarts the service, followed by the subsequent calls for docker to stop:
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[adeluca]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[wtmowery]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[kromhout]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[rebuild_iptables]) Triggered 'refresh' from 1 events
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[check_iptables]) Triggered 'refresh' from 1 events
Jan 24 09:25:22 osapp1p systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
Jan 24 09:25:22 osapp1p systemd: Stopping Docker Application Container Engine...
Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561054004-05:00" level=info msg="stopping containerd after receiving terminated"
Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561154302-05:00" level=info msg="Processing signal 'terminated'"
Jan 24 09:26:53 osapp1p systemd: docker.service stop-final-sigterm timed out. Killing.
Jan 24 09:26:53 osapp1p systemd: Unit docker.service entered failed state.
Jan 24 09:26:53 osapp1p systemd: docker.service failed.

Comment 2 Russell Teague 2017-03-06 21:23:50 UTC
I've been unable to reproduce this error.  Is this still reproducible in the existing environment?

[root@ose3-master ~]# rpm -qa atomic-openshift
atomic-openshift-3.4.1.9-1.git.0.9752e82.el7.x86_64

[root@ose3-master ~]# rpm -qa docker
docker-1.12.6-11.el7.x86_64

[root@ose3-master ~]# systemctl restart iptables

[root@ose3-master ~]# journalctl -u docker
Mar 06 16:21:16 ose3-master systemd[1]: Stopping Docker Application Container Engine...
Mar 06 16:21:16 ose3-master dockerd-current[17124]: time="2017-03-06T16:21:16.742811066-05:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Mar 06 16:21:17 ose3-master systemd[1]: Starting Docker Application Container Engine...
Mar 06 16:21:17 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:17.322595988-05:00" level=info msg="libcontainerd: new containerd process, pid: 26176"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390293736-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390775928-05:00" level=info msg="Loading containers: start."
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.398338507-05:00" level=info msg="Firewalld running: false"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.508507689-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.581944386-05:00" level=info msg="Loading containers: done."
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582062323-05:00" level=info msg="Daemon has completed initialization"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582100505-05:00" level=info msg="Docker daemon" commit="96d83a5/1.12.6" graphdriver=devicemapper version=1.12.6
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.594740275-05:00" level=info msg="API listen on /var/run/docker.sock"
Mar 06 16:21:18 ose3-master systemd[1]: Started Docker Application Container Engine.

Comment 3 Boris Kurktchiev 2017-03-06 21:26:57 UTC
The easiest way to check whether this is going to happen is to view the systemd dependency tree for the docker service:

Comment 4 Russell Teague 2017-03-06 21:36:49 UTC
[root@ose3-master ~]# systemctl list-dependencies docker
docker.service
● ├─docker-cleanup.timer
● ├─docker-storage-setup.service
● ├─iptables.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─firewalld.service
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─paths.target

Comment 5 Boris Kurktchiev 2017-03-07 14:54:50 UTC
Yeah, so the above 100% breaks in 3.4.0.x. It also happened in our dev cluster, which is on 3.4.1.5. We do not have *.9 installed yet, so I cannot confirm or deny whether it's still a problem, but in the other two versions this for sure causes all sorts of havoc, as described in the attached BZ/case.

Comment 6 Russell Teague 2017-03-07 15:03:09 UTC
@Boris, could you attach a listing of all package versions installed in the affected environment? Thanks.

Comment 7 Boris Kurktchiev 2017-03-07 17:39:12 UTC
all openshift related or all pkgs period?

Comment 8 Russell Teague 2017-03-07 18:43:36 UTC
All packages, please. Something along the lines of:
# rpm -qa | sort > installed-rpms.txt

Comment 9 Boris Kurktchiev 2017-03-07 18:51:45 UTC
Created attachment 1260920 [details]
prod rpms

Comment 10 Boris Kurktchiev 2017-03-07 18:52:03 UTC
Created attachment 1260921 [details]
dev rpms

Comment 11 Boris Kurktchiev 2017-03-07 18:52:42 UTC
Note that tomorrow early in the AM the -prod rpm list will match pretty closely to the -dev rpm list.

Comment 12 Russell Teague 2017-03-07 20:12:06 UTC
I have reproduced the described behavior. To correct this issue, the docker service should be less tightly coupled to the iptables service. To implement the fix, the custom systemd unit file should be updated as follows.

Replace 'Requires=' with 'Wants='

# vi /etc/systemd/system/docker.service.d/custom.conf
[Unit]
Wants=iptables.service
After=iptables.service

After editing the file, reload the systemd configuration:
# systemctl daemon-reload

Restarting iptables will no longer cause a restart of docker.
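
A quick way to confirm which coupling a host ended up with, without restarting anything, is to grep the drop-in. This is a sketch: it writes a demo copy under /tmp so it can run unprivileged; on a real host, point CONF at /etc/systemd/system/docker.service.d/custom.conf instead.

```shell
# Report whether a docker drop-in hard-requires iptables.
# Demo copy under /tmp so the check runs unprivileged; set CONF to the
# real path (/etc/systemd/system/docker.service.d/custom.conf) on a host.
CONF="${CONF:-/tmp/custom.conf}"
printf '[Unit]\nWants=iptables.service\nAfter=iptables.service\n' > /tmp/custom.conf

if grep -q '^Requires=iptables' "$CONF"; then
  echo "tightly coupled: restarting iptables will restart docker"
elif grep -q '^Wants=iptables' "$CONF"; then
  echo "loosely coupled: iptables restarts will not restart docker"
else
  echo "no iptables coupling found in $CONF"
fi
```

With the fixed drop-in, this prints the "loosely coupled" line.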

A PR has been opened to address this and will be backported once approved.
https://github.com/openshift/openshift-ansible/pull/3588

Comment 13 openshift-github-bot 2017-03-07 21:15:32 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/7c8f77dbf4ea10a5b9034dbd2a6519d8e0d5e8ce
Merge pull request #3588 from mtnbikenc/docker-fix

Bug 1416156: Loosely couple docker to iptables service

Comment 15 Wenkai Shi 2017-03-10 05:49:02 UTC
Verified with version openshift-ansible-3.5.28-1.git.0.103513e. Prepared an env with "os_firewall_use_firewalld=false" to make sure OCP is using iptables.service.
Modifying custom.conf like this reproduces the bug:
# cat /etc/systemd/system/docker.service.d/custom.conf 
[Unit]
#Wants=iptables.service
Requires=iptables.service
After=iptables.service
# systemctl daemon-reload
# systemctl restart iptables
# journalctl -xe -f -u docker

Changing custom.conf back to its as-installed contents, docker.service no longer restarts when iptables.service is restarted:
# cat /etc/systemd/system/docker.service.d/custom.conf 
[Unit]
Wants=iptables.service
After=iptables.service
# systemctl daemon-reload
# systemctl restart iptables
# journalctl -xe -f -u docker

Comment 17 errata-xmlrpc 2017-04-12 18:49:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0903

