Bug 1416156 - Modifying iptables causes ungraceful termination of docker.service
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Russell Teague
QA Contact: Wenkai Shi
Depends On:
Blocks:
Reported: 2017-01-24 12:49 EST by Steven Walter
Modified: 2017-07-24 10 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A custom systemd unit file was used for the docker service specifying 'Requires=iptables'.
Consequence: This resulted in the docker service being restarted when iptables was restarted.
Fix: The custom systemd unit file was updated to specify 'Wants=iptables'.
Result: This will still ensure that iptables is started before docker, but will not cause docker to restart when iptables is restarted.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-12 14:49:16 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
prod rpms (24.09 KB, text/plain) - 2017-03-07 13:51 EST, Boris Kurktchiev
dev rpms (24.06 KB, text/plain) - 2017-03-07 13:52 EST, Boris Kurktchiev


External Trackers
Tracker ID: Red Hat Product Errata RHBA-2017:0903
Priority: normal
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform atomic-openshift-utils bug fix and enhancement
Last Updated: 2017-04-12 18:45:42 EDT

Description Steven Walter 2017-01-24 12:49:27 EST
Description of problem:
Due to a change in the systemd dependency tree, restarting iptables causes docker.service to attempt a restart, but docker does not stop cleanly, resulting in a SIGTERM and eventually a SIGKILL. Although the docs require a restart of the docker service when making iptables changes (to prevent networking between pods from failing), crashing the service is unexpected.

Version-Release number of selected component (if applicable):
3.4.0



Steps to Reproduce:
1. Add a rule to /etc/sysconfig/iptables or iptables.d
2. Restart the iptables service: systemctl restart iptables
3. Docker is sent a sigterm and tries to stop
4. Docker gets kill -9'd because it doesn't think it needs to stop
5. Docker restarts
6. Cluster is in disarray due to containers being yanked in and out of existence
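
For reference, a minimal command sequence matching these steps on an affected node (the edit in step 1 is whatever rule change you need; journalctl is just one way to watch the result):
# vi /etc/sysconfig/iptables        # add or change any rule
# systemctl restart iptables
# journalctl -f -u docker           # on affected hosts, docker is stopped, killed, and restarted here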

All of this is happening due to systemd dependency rulesets that have changed between OCP 3.3 and 3.4. Here is an OCP 3.3 system dependency chain snippet:
docker.service
● ├─docker-storage-setup.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─selinux-policy-migrate-local-changes@targeted.service
●   ├─paths.target

vs. the same chain but for 3.4, where iptables.service is now a direct dependency of docker.service:
docker.service
● ├─docker-storage-setup.service
● ├─iptables.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─selinux-policy-migrate-local-changes@targeted.service
●   ├─paths.target

So when the actions below take place, systemd tries to force docker to restart, which docker refuses to do, and eventually systemd more or less kill -9s the process.
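
One way to confirm which dependency type is in effect on a given node is to inspect the unit and its drop-ins directly (standard systemd commands; the drop-in file name is whatever the installer laid down, e.g. custom.conf):
# systemctl cat docker                        # prints docker.service plus any drop-ins under docker.service.d/
# systemctl show docker -p Requires -p Wants  # Requires= on iptables.service propagates the restart; Wants= does not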

Here is an example puppet run that updates the iptables file and restarts the service, followed by the resulting attempt to stop docker:
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[adeluca]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[wtmowery]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Utils::Hiera_deploy/Ssh_authorized_key[kromhout]/ensure) created
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[rebuild_iptables]) Triggered 'refresh' from 1 events
Jan 24 09:25:22 osapp1p puppet-agent[20599]: (/Stage[main]/Firewall/Exec[check_iptables]) Triggered 'refresh' from 1 events
Jan 24 09:25:22 osapp1p systemd: Cannot add dependency job for unit microcode.service, ignoring: Unit is not loaded properly: Invalid argument.
Jan 24 09:25:22 osapp1p systemd: Stopping Docker Application Container Engine...
Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561054004-05:00" level=info msg="stopping containerd after receiving terminated"
Jan 24 09:25:22 osapp1p dockerd-current: time="2017-01-24T09:25:22.561154302-05:00" level=info msg="Processing signal 'terminated'"
Jan 24 09:26:53 osapp1p systemd: docker.service stop-final-sigterm timed out. Killing.
Jan 24 09:26:53 osapp1p systemd: Unit docker.service entered failed state.
Jan 24 09:26:53 osapp1p systemd: docker.service failed.
Comment 2 Russell Teague 2017-03-06 16:23:50 EST
I've been unable to reproduce this error.  Is this still reproducible in the existing environment?

[root@ose3-master ~]# rpm -qa atomic-openshift
atomic-openshift-3.4.1.9-1.git.0.9752e82.el7.x86_64

[root@ose3-master ~]# rpm -qa docker
docker-1.12.6-11.el7.x86_64

[root@ose3-master ~]# systemctl restart iptables

[root@ose3-master ~]# journalctl -u docker
Mar 06 16:21:16 ose3-master systemd[1]: Stopping Docker Application Container Engine...
Mar 06 16:21:16 ose3-master dockerd-current[17124]: time="2017-03-06T16:21:16.742811066-05:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Mar 06 16:21:17 ose3-master systemd[1]: Starting Docker Application Container Engine...
Mar 06 16:21:17 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:17.322595988-05:00" level=info msg="libcontainerd: new containerd process, pid: 26176"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390293736-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.390775928-05:00" level=info msg="Loading containers: start."
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.398338507-05:00" level=info msg="Firewalld running: false"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.508507689-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.581944386-05:00" level=info msg="Loading containers: done."
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582062323-05:00" level=info msg="Daemon has completed initialization"
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.582100505-05:00" level=info msg="Docker daemon" commit="96d83a5/1.12.6" graphdriver=devicemapper version=1.12.6
Mar 06 16:21:18 ose3-master dockerd-current[26170]: time="2017-03-06T16:21:18.594740275-05:00" level=info msg="API listen on /var/run/docker.sock"
Mar 06 16:21:18 ose3-master systemd[1]: Started Docker Application Container Engine.
Comment 3 Boris Kurktchiev 2017-03-06 16:26:57 EST
The easiest way to check whether this is going to happen is to view the systemd dependency tree for the docker service.
Comment 4 Russell Teague 2017-03-06 16:36:49 EST
[root@ose3-master ~]# systemctl list-dependencies docker
docker.service
● ├─docker-cleanup.timer
● ├─docker-storage-setup.service
● ├─iptables.service
● ├─rhel-push-plugin.socket
● ├─system.slice
● └─basic.target
●   ├─firewalld.service
●   ├─iptables.service
●   ├─microcode.service
●   ├─rhel-autorelabel-mark.service
●   ├─rhel-autorelabel.service
●   ├─rhel-configure.service
●   ├─rhel-dmesg.service
●   ├─rhel-loadmodules.service
●   ├─paths.target
Comment 5 Boris Kurktchiev 2017-03-07 09:54:50 EST
Yeah, so the above 100% breaks in 3.4.0.x. It also happened in our Dev cluster, which is on 3.4.1.5; we do not have *.9 installed yet, so I cannot confirm or deny whether it's still a problem, but for sure in the other two versions this causes all sorts of havoc as described in the BZ/case attached.
Comment 6 Russell Teague 2017-03-07 10:03:09 EST
@Boris, could you attach a listing of all package versions installed in the affected environment? Thanks.
Comment 7 Boris Kurktchiev 2017-03-07 12:39:12 EST
All OpenShift-related or all pkgs, period?
Comment 8 Russell Teague 2017-03-07 13:43:36 EST
All packages please.  Something along the lines of:
# rpm -qa | sort > installed-rpms.txt
Comment 9 Boris Kurktchiev 2017-03-07 13:51 EST
Created attachment 1260920 [details]
prod rpms
Comment 10 Boris Kurktchiev 2017-03-07 13:52 EST
Created attachment 1260921 [details]
dev rpms
Comment 11 Boris Kurktchiev 2017-03-07 13:52:42 EST
Note that tomorrow early in the AM the -prod rpm list will match pretty closely to the -dev rpm list.
Comment 12 Russell Teague 2017-03-07 15:12:06 EST
I have reproduced the described behavior.  To correct this issue, the docker service should be less tightly coupled to the iptables service.  To implement this, the custom systemd unit file should be updated as follows.

Replace 'Requires=' with 'Wants='

# vi /etc/systemd/system/docker.service.d/custom.conf
[Unit]
Wants=iptables.service
After=iptables.service

After editing the file, the systemd configuration should be reloaded:
# systemctl daemon-reload

Restarting iptables will no longer cause a restart of docker.
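
A quick way to confirm docker was left alone is to compare its main PID before and after restarting iptables (a sketch; any equivalent check, such as watching journalctl, works just as well):
# systemctl show docker -p MainPID
# systemctl restart iptables
# systemctl show docker -p MainPID   # an unchanged PID means docker was not restarted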

A PR has been opened to address this and will be backported once approved.
https://github.com/openshift/openshift-ansible/pull/3588
Comment 13 openshift-github-bot 2017-03-07 16:15:32 EST
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/7c8f77dbf4ea10a5b9034dbd2a6519d8e0d5e8ce
Merge pull request #3588 from mtnbikenc/docker-fix

Bug 1416156: Loosely couple docker to iptables service
Comment 15 Wenkai Shi 2017-03-10 00:49:02 EST
Verified with version openshift-ansible-3.5.28-1.git.0.103513e. Prepared an env with "os_firewall_use_firewalld=false" to make sure OCP is using iptables.service.
Modifying custom.conf like this reproduces the bug:
# cat /etc/systemd/system/docker.service.d/custom.conf 
[Unit]
#Wants=iptables.service
Requires=iptables.service
After=iptables.service
# systemctl daemon-reload
# systemctl restart iptables
# journalctl -xe -f -u docker

With custom.conf restored to what was just installed, docker.service won't restart when restarting iptables.service:
# cat /etc/systemd/system/docker.service.d/custom.conf 
[Unit]
Wants=iptables.service
After=iptables.service
# systemctl daemon-reload
# systemctl restart iptables
# journalctl -xe -f -u docker
Comment 17 errata-xmlrpc 2017-04-12 14:49:16 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0903
