Bug 1571184 - starting docker.service takes over a minute with restricted network
Summary: starting docker.service takes over a minute with restricted network
Alias: None
Product: Fedora
Classification: Fedora
Component: docker
Version: 28
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Lokesh Mandvekar
QA Contact: Fedora Extras Quality Assurance
Depends On:
Reported: 2018-04-24 09:07 UTC by Martin Pitt
Modified: 2018-04-25 13:22 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-04-25 13:22:52 UTC
Type: Bug


Description Martin Pitt 2018-04-24 09:07:03 UTC
Description of problem: There is a regression in recent Fedora 28 with respect to stopping/starting docker. Starting now takes over a minute and produces some error messages (although those might be unrelated).

Version-Release number of selected component (if applicable):


How reproducible: Always

Steps to Reproduce:
1. systemctl stop docker
2. systemctl start docker

Actual results:

starting hangs for a long time. During that time, docker.service has this status:

● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
   Active: activating (start) since Tue 2018-04-24 04:55:01 EDT; 43s ago
     Docs: http://docs.docker.com
  Process: 2224 ExecStartPre=/bin/sh -c echo -n >> /etc/resolv.conf (code=exited, status=0/SUCCESS)
 Main PID: 2225 (dockerd-current)
    Tasks: 7 (limit: 8192)
   Memory: 6.1M
      CPU: 40ms
   CGroup: /system.slice/docker.service
           └─2225 /usr/bin/dockerd-current --add-runtime oci=/usr/libexec/docker/docker-runc-current --default-runtime=oci --authorization-plugin=rhel-push-plugin>

The journal shows a few warnings:

Apr 24 04:55:01 localhost.localdomain systemd[1]: Starting Docker Application Container Engine...
Apr 24 04:55:01 localhost.localdomain dockerd-current[2225]: time="2018-04-24T04:55:01.405816100-04:00" level=warning msg="could not change group /var/run/docker.>

time="2018-04-24T04:55:01.405816100-04:00" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2018-04-24T04:56:01.497619996-04:00" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2018-04-24T04:56:01.884860951-04:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2018-04-24T04:56:01.887574814-04:00" level=warning msg="Your kernel does not support cgroup rt period"
time="2018-04-24T04:56:01.887737443-04:00" level=warning msg="Your kernel does not support cgroup rt runtime"
time="2018-04-24T04:56:01.889550502-04:00" level=info msg="Loading containers: start."
time="2018-04-24T04:56:01.901566960-04:00" level=info msg="Firewalld running: true"
time="2018-04-24T04:56:02.507683812-04:00" level=info msg="Default bridge (docker0) is assigned with an IP address Daemon option --bip can be used to set a preferred IP address"
time="2018-04-24T04:56:02.851019455-04:00" level=info msg="Loading containers: done."
time="2018-04-24T04:56:02.986695538-04:00" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.17.0\n"
time="2018-04-24T04:56:02.987037993-04:00" level=info msg="Daemon has completed initialization"
time="2018-04-24T04:56:02.987177164-04:00" level=info msg="Docker daemon" commit=c301b04-unsupported graphdriver=overlay2 version=1.13.1
Started Docker Application Container Engine.
time="2018-04-24T04:56:03.034892577-04:00" level=info msg="API listen on /var/run/docker.sock"
time="2018-04-24T04:56:03.080536802-04:00" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.17.0\n"
time="2018-04-24T04:56:03.549823496-04:00" level=warning msg="failed to retrieve docker-init version: unknown output format: tini version 0.17.0\n"

Expected results: docker.service starts fast

Additional info: This did not happen with our (Cockpit test VM images) previous Fedora 28 image that was built on March 29.

Comment 1 Martin Pitt 2018-04-24 09:11:35 UTC
This also happens on the initial start during system boot, so it's not related to stopping docker first, nor specific to "re"starting it. Adjusting the title accordingly.

Comment 3 Ed Santiago 2018-04-24 12:52:32 UTC
Cannot reproduce on rawhide docker-1.13.1-52.git89b0e65.fc29; I don't have easy access to f28 right now but will try to spin one up. While I do... quick question about this line:

    Process: 2224 ExecStartPre=/bin/sh -c echo -n >> /etc/resolv.conf

I can't find ExecStartPre in any Fedora docker unit files. Is it possible that you have a custom one in /etc/systemd/system/docker* ?

Comment 4 Martin Pitt 2018-04-24 13:03:30 UTC
Indeed, sorry, I forgot to mention that. It's a workaround for an old bug:

# HACK: docker fails without /etc/resolv.conf
# https://bugzilla.redhat.com/show_bug.cgi?id=1448331
mkdir -p /etc/systemd/system/docker.service.d
printf "[Service]\nExecStartPre=/bin/sh -c 'echo -n >> /etc/resolv.conf'\n" > /etc/systemd/system/docker.service.d/resolv.conf
systemctl daemon-reload

This ensures that /etc/resolv.conf exists before starting docker.
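For clarity, the effect of that ExecStartPre command can be demonstrated outside systemd. A minimal sketch, run against a scratch directory rather than the real /etc (the temp-dir setup is illustration only):

```shell
tmpdir=$(mktemp -d)

# Pre-populate a resolv.conf with content, as on a normal system:
printf 'nameserver 10.0.2.3\n' > "$tmpdir/resolv.conf"

# The workaround command: appends nothing, so existing content is untouched.
sh -c "echo -n >> $tmpdir/resolv.conf"
cat "$tmpdir/resolv.conf"

# If the file is missing, the same command creates it empty:
rm "$tmpdir/resolv.conf"
sh -c "echo -n >> $tmpdir/resolv.conf"
ls -l "$tmpdir/resolv.conf"
```

In other words, `echo -n >>` is a safe "touch": it guarantees the file exists without truncating or modifying it.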

Comment 5 Martin Pitt 2018-04-24 13:05:23 UTC
BTW, you are welcome to use our Cockpit test VM for examination:

Comment 6 Ed Santiago 2018-04-24 13:18:51 UTC
> # HACK: docker fails without /etc/resolv.conf

Thanks for the quick reply. I've spun up a fresh f28, same docker/systemd/kernel as above; recreated your resolv.conf unit file; still can't reproduce the delay:

    # time systemctl start docker

    real    0m0.427s
    user    0m0.002s
    sys     0m0.007s

I'm going to bow out now, but just out of curiosity, have you tried (or could you try):

    # rm -rf /etc/systemd/system/docker*;systemctl daemon-reload;time systemctl restart docker

Comment 7 Martin Pitt 2018-04-24 20:09:54 UTC
The docker.service.d/resolv.conf drop-in is irrelevant, but indeed I now discovered the relevant part: In a "normal" networking VM docker.service actually works fine. It now hangs in our Cockpit test environment, where we start VMs with

   -netdev user,id=base0,restrict=on,net=,hostname=m1.cockpit.lan,hostfwd=tcp:,hostfwd=tcp: -device rtl8139,netdev=base0,bus=pci.0,addr=0x0e

"restrict=on" disables all network access except for the forwarded 22 and 9090 ports; this makes sure that nothing accidentally talks to the network (this would introduce instability and make things less reproducible).

Unfortunately this cannot be reproduced by simply taking down eth0 or removing the default route, as docker then fails fast. Apparently docker.service only tries to talk to the network *if* a default route exists. I'll try to investigate this more closely.
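One way to approximate restrict=on on a normally networked VM might be to silently drop outbound DNS rather than take the interface down. This is an untested sketch (the iptables rules are standard, but that they reproduce exactly the same hang is an assumption): a DROP rule blackholes packets the way restrict=on does, whereas a downed interface returns "network unreachable" immediately, which would explain the fail-fast behavior.

```shell
# Silently drop outbound DNS queries (blackhole, no ICMP error),
# mimicking the packet drop behavior of qemu's restrict=on:
iptables -I OUTPUT -p udp --dport 53 -j DROP
iptables -I OUTPUT -p tcp --dport 53 -j DROP

time systemctl restart docker   # expected to hang until DNS times out

# Clean up:
iptables -D OUTPUT -p udp --dport 53 -j DROP
iptables -D OUTPUT -p tcp --dport 53 -j DROP
```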

I made comment 5 public again. These VM images are public, as is the knowledge about that IP address, so there's nothing to hide.

Comment 8 Martin Pitt 2018-04-25 08:54:44 UTC
Ah, strace gives it away:

1689  04:43:45.482391 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("")}, 16) = 0
1689  04:43:45.482470 epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=644705632, u64=139909204372832}}) = 0
1689  04:43:45.482539 getsockname(7, {sa_family=AF_INET, sin_port=htons(47455), sin_addr=inet_addr("")}, [112->16]) = 0
1689  04:43:45.482612 getpeername(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("")}, [112->16]) = 0
1689  04:43:45.482747 write(7, "Q\36\1\0\0\1\0\0\0\0\0\0\10registry\6access\6redhat\3com\7cockpit\3lan\0\0\1\0\1", 56) = 56
1689  04:43:45.482976 read(7, 0xc420241800, 512) = -1 EAGAIN (Resource temporarily unavailable)

(The connect() target on port 53 above is the nameserver from resolv.conf.)

/etc/containers/registries.conf:registries = ['docker.io', 'registry.fedoraproject.org', 'quay.io', 'registry.access.redhat.com']

So apparently the new behavior is that docker startup now blocks on resolving and pinging these registries. If I remove them from /etc/containers/registries.conf, startup is fast again.
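That workaround can be scripted. A sketch that empties the default registry list, demonstrated on a scratch copy of the file (point at the real /etc/containers/registries.conf to apply it; whether the daemon is happy with an empty list is an assumption worth verifying):

```shell
# Scratch copy for demonstration; use /etc/containers/registries.conf
# (after taking a backup) to apply for real.
conf=$(mktemp)
printf "registries = ['docker.io', 'registry.fedoraproject.org', 'quay.io', 'registry.access.redhat.com']\n" > "$conf"

# Empty the registry list so dockerd has nothing to resolve/ping at startup:
sed -i "s/^registries = \[.*\]/registries = []/" "$conf"
grep '^registries' "$conf"
```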

So this is very minor indeed, and might be a wontfix.

Comment 9 Daniel Walsh 2018-04-25 13:22:52 UTC
Yes. Just before I read your last comment, I was reminded of the one rule I always have with software: if it works but pauses for a long time, it is always name resolution that is broken.
