Bug 1341906 - Running kubernetes density e2e causes docker 1.10 to wedge, completely unresponsive.
Summary: Running kubernetes density e2e causes docker 1.10 to wedge, completely unresponsive.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On: 1340519
Blocks:
 
Reported: 2016-06-02 02:10 UTC by Timothy St. Clair
Modified: 2019-03-06 00:35 UTC
27 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1340519
Environment:
Last Closed: 2016-06-23 16:19:18 UTC
Target Upstream Version:


Attachments (Terms of Use)
Docker daemon stacktrace dumped while the daemon was hung (166.95 KB, text/plain)
2016-06-02 18:09 UTC, Steve Kuznetsov
no flags Details
Journald logs from the Docker daemon before and up to a hang (1.75 MB, text/plain)
2016-06-02 18:09 UTC, Steve Kuznetsov
no flags Details
Another stacktrace (169.47 KB, text/plain)
2016-06-02 18:20 UTC, Steve Kuznetsov
no flags Details
Another journal (1.48 MB, text/plain)
2016-06-02 18:22 UTC, Steve Kuznetsov
no flags Details
docker-1.10.3-32 stacktrace (165.02 KB, text/plain)
2016-06-07 19:55 UTC, Steve Kuznetsov
no flags Details
docker-1.10.3-32 journal (1.05 MB, text/x-vhdl)
2016-06-07 19:56 UTC, Steve Kuznetsov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1274 0 normal SHIPPED_LIVE docker bug fix and enhancement update 2016-06-23 20:12:28 UTC

Comment 2 Daniel Walsh 2016-06-02 15:54:38 UTC
We know this is an issue on Fedora caused by the latest changes to systemd controlling the number of tasks (threads) in a unit file.  We don't see how this could be the same issue on RHEL.

We have not been able to reproduce on RHEL.  Can we get the information Antonio wanted on RHEL?  A reproducer?

Comment 3 Antonio Murdaca 2016-06-02 16:05:22 UTC
Right, unfortunately, I suspect the TasksMax issue in Fedora was just another bug being hit. I just tried reproducing by running 200 containers on RHEL and everything went fine with docker-1.10.3-28.el7.x86_64.

I suspect there's another issue (at least in RHEL and possibly in Fedora) related to Steve's issue here https://github.com/openshift/origin/issues/9107 - though I cannot find anything useful from the wedged machine and I cannot reproduce by just running Docker.

Comment 4 Timothy St. Clair 2016-06-02 17:13:11 UTC
The reproducer is sporadic and uses the e2es.  OpenShift CI is catching this in ~1/20 runs - https://github.com/openshift/origin/issues/9107 . Rob's reproducer was also sporadic, but once it happens the host is dead.

Comment 5 Ian McLeod 2016-06-02 17:31:24 UTC
(In reply to Timothy St. Clair from comment #4)
> The reproducer is sporadic and uses the e2es.  OpenShift CI is catching this
> in ~1/20 runs - https://github.com/openshift/origin/issues/9107 . Rob's
> reproducer was also sporadic, but once it happens the host is dead.

Since this is sporadic, and the decision about this being a blocker is tied to whether or not it is a regression, I'd like to ask for a clarification.

Can someone please confirm whether a comparable number of runs (significantly > 20) of the exact same triggering tests have been run on a setup using docker 1.9?

Comment 6 Steve Kuznetsov 2016-06-02 17:39:48 UTC
(In reply to Ian McLeod from comment #5)
> Can someone please confirm whether a comparable number of runs
> (significantly > 20) of the exact same triggering tests have been run on a
> setup using docker 1.9?

The OpenShift CI had been running our extended conformance suite[1] reliably, without hitting this issue, for roughly 1,000 runs until we upgraded Docker versions. Unfortunately, by the time I picked up the flake issue we were no longer holding enough history on our tests to pinpoint exactly when the issue began, but it seems to coincide with the docker 1.10.3 upgrade, and to my knowledge we have not made other large changes to our devenv AMIs in the same time frame.

[1]https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance

Comment 7 Timothy St. Clair 2016-06-02 17:56:36 UTC
We never saw this behavior in 1.9 across the entire perf+scale and QE test runs.

Comment 8 Steve Kuznetsov 2016-06-02 18:09:19 UTC
Created attachment 1164147 [details]
Docker daemon stacktrace dumped while the daemon was hung

Comment 9 Steve Kuznetsov 2016-06-02 18:09:57 UTC
Created attachment 1164148 [details]
Journald logs from the Docker daemon before and up to a hang

Comment 10 Steve Kuznetsov 2016-06-02 18:12:11 UTC
I've attached a sample Docker daemon stacktrace and journald logs from a hang we experienced just now in the OpenShift CI queue, and will attempt to reproduce the hang on the affected AWS instance as was possible previously. In previous cases, once a machine was affected I was able to reproduce the hang with 100% success by restarting the daemon and re-running the tests. Unfortunately the instances on which this was possible are no longer around on AWS, but hopefully I will be able to sequester others for access by Antonio et al.

Comment 11 Antonio Murdaca 2016-06-02 18:19:50 UTC
Steve, are you testing with the latest docker (from brew, I think)? The latest should be the -28 build.

Comment 12 Steve Kuznetsov 2016-06-02 18:20:47 UTC
Created attachment 1164152 [details]
Another stacktrace

Comment 13 Steve Kuznetsov 2016-06-02 18:22:05 UTC
Created attachment 1164153 [details]
Another journal

Added another set of stacktrace/journal -- won't add more unless they seem substantially different.

Comment 14 Steve Kuznetsov 2016-06-02 18:23:15 UTC
(In reply to Antonio Murdaca from comment #11)
> Steve, are you testing with the latest docker (from brew, I think)? The
> latest should be the -28 build.

$ uname -r
3.10.0-229.7.2.el7.x86_64
$ docker version
Client:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-25.el7.x86_64
 Go version:      go1.4.2
 Git commit:      86bbf84/1.10.3
 Built:           
 OS/Arch:         linux/amd64

Server:
 Version:         1.10.3
 API version:     1.22
 Package version: docker-common-1.10.3-25.el7.x86_64
 Go version:      go1.4.2
 Git commit:      86bbf84/1.10.3
 Built:           
 OS/Arch:         linux/amd64

Comment 15 Steve Kuznetsov 2016-06-02 18:31:17 UTC
(In reply to Antonio Murdaca from comment #11)
> Steve, are you testing with the latest docker (from brew, I think)? The
> latest should be the -28 build.

We are installing the version of Docker we use from the following repository, as set up in the vagrant-openshift[1] plugin:

$ yum provides docker
...
docker-1.10.3-25.el7.x86_64 : Automates deployment of containerized applications
Repo        : @dockerextra

$ cat /etc/yum.repos.d/dockerextra.repo 
[dockerextra]
name=RHEL Docker Extra
baseurl=https://mirror.openshift.com/enterprise/rhel/dockerextra/x86_64/os/
enabled=1
gpgcheck=0
failovermethod=priority
sslverify=False
sslclientcert=/var/lib/yum/client-cert.pem
sslclientkey=/var/lib/yum/client-key.pem


[1]https://github.com/openshift/vagrant-openshift/blob/master/lib/vagrant-openshift/action/install_origin_base_dependencies.rb#L45-L56

Comment 16 Daniel Walsh 2016-06-02 18:39:10 UTC
Antonio, is the latest docker built with golang 1.6, and does it eliminate the journald forwarder?

Comment 17 Antonio Murdaca 2016-06-02 18:40:51 UTC
(In reply to Daniel Walsh from comment #16)
> Antonio, is the latest docker built with golang 1.6, and does it eliminate
> the journald forwarder?

No, as far as I know docker on RHEL is built using golang 1.4.2, which needs the journald forwarder.

Comment 18 Daniel Walsh 2016-06-02 18:45:55 UTC
Well, did we use that in docker-1.9?  Did we use journald for logging?  Just trying to think whether there is anything we have done to docker-1.10 that was not in the old version that could be causing this issue.

Comment 19 Timothy St. Clair 2016-06-02 19:10:23 UTC
To add more specific details, in our environment we are building latest kube 1.3 using golang 1.6, and we've set journald as the log driver using docker-latest-1.10.3-22.el7.x86_64

Comment 20 Antonio Murdaca 2016-06-02 19:19:04 UTC
(In reply to Timothy St. Clair from comment #19)
> To add more specific details, in our environment we are building latest kube
> 1.3 using golang 1.6, and we've set journald as the log driver using
> docker-latest-1.10.3-22.el7.x86_64

The journald log driver comes as the default in the docker.service unit file. Did you change anything else in there?

Comment 21 Timothy St. Clair 2016-06-02 19:20:46 UTC
(In reply to Antonio Murdaca from comment #20)
> The journald log driver comes as the default in the docker.service unit
> file. Did you change anything else in there?

nope

Comment 22 Steve Kuznetsov 2016-06-02 19:30:53 UTC
(In reply to Antonio Murdaca from comment #20)
> (In reply to Timothy St. Clair from comment #19)
> > To add more specific details, in our environment we are building latest kube
> > 1.3 using golang 1.6, and we've set journald as the log driver using
> > docker-latest-1.10.3-22.el7.x86_64
> 
> The journald log driver comes as the default in the docker.service unit
> file. Did you change anything else in there?

I don't think we explicitly set anything to do with the log driver in the OpenShift CI; the only changes I can find to the unit file happen in the vagrant-openshift plugin[1] and amend the initial startup timeout.

Here is the content of the unit file:
$ cat /lib/systemd/system/docker.service 
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
ExecStart=/bin/sh -c '/usr/bin/docker-current daemon \
          --authorization-plugin=rhel-push-plugin \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY \
          2>&1 | /usr/bin/forward-journald -tag docker'
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
MountFlags=slave
TimeoutStartSec=10min
Restart=on-abnormal
StandardOutput=null
StandardError=null

[Install]
WantedBy=multi-user.target

[1]https://github.com/openshift/vagrant-openshift/blob/master/lib/vagrant-openshift/action/install_origin_base_dependencies.rb#L262-L266
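The `Environment=GOTRACEBACK=crash` setting in the unit file above, together with a signal sent to the daemon, is what makes all-goroutine stack dumps like the attached ones possible. As a rough analogue (for illustration only; the real daemon relies on the Go runtime, not Python), Python's `faulthandler` module can register the same kind of on-signal, all-threads stack dump:

```python
import faulthandler
import os
import signal

# Illustrative analogue of an on-signal stack dump; "stacks.txt" is an
# arbitrary file name chosen for this sketch.
with open("stacks.txt", "w") as dump:
    # Dump every thread's stack to the file whenever SIGUSR1 arrives.
    faulthandler.register(signal.SIGUSR1, file=dump)
    os.kill(os.getpid(), signal.SIGUSR1)   # simulate `kill -USR1 <pid>`
    faulthandler.unregister(signal.SIGUSR1)

# The dump starts with a "Current thread ... (most recent call first):" header.
print(open("stacks.txt").read().splitlines()[0])
```

The advantage of this approach, in either language, is that a wedged but still-running process can be inspected without a debugger: the dump shows exactly which threads (goroutines) are blocked and where.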

Comment 23 Steve Kuznetsov 2016-06-02 20:21:13 UTC
It seems like the kernel issues that I had run into previously when attempting to debug this issue directly stem from attaching `strace` to the Docker daemon. I will file a separate BZ against the kernel for that.

Comment 24 Daniel Walsh 2016-06-02 20:27:46 UTC
I was thinking one possibility was sharing the same mount namespace with the host, but the docker.service file you just showed is still using

MountFlags=slave

which we have removed from our docker.service file.  You should probably not be setting it either, but I guess it is not a difference between the docker-1.9 setup and docker-1.10.

Comment 25 Daniel Walsh 2016-06-02 20:29:01 UTC
If you are using docker-latest, you did update /etc/sysconfig/docker-latest-storage to use real devices, not loopback, correct?

Comment 26 Eric Paris 2016-06-02 20:38:44 UTC
@dwalsh RHEL and Atomic PMs are still working on 1.6.  7.2.6 at best at this point :-(

Comment 27 Steve Kuznetsov 2016-06-02 20:45:31 UTC
(In reply to Daniel Walsh from comment #25)
> If you are using docker-latest, you did update
> /etc/sysconfig/docker-latest-storage to use real devices, not loopback, correct?

We do change /etc/sysconfig/docker-storage to use devicemapper with direct LVM in the vagrant-openshift plugin[1]. The file looks like this on an affected system:

$ sudo cat /etc/sysconfig/docker-storage
# This file may be automatically generated by an installation program.
# Please DO NOT edit this file directly. Instead edit
# /etc/sysconfig/docker-storage-setup and/or refer to
# "man docker-storage-setup".

# By default, Docker uses a loopback-mounted sparse file in
# /var/lib/docker.  The loopback makes it slower, and there are some
# restrictive defaults, such as 100GB max storage.

DOCKER_STORAGE_OPTIONS='-s devicemapper --storage-opt dm.datadev=/dev/docker-vg/docker-data --storage-opt dm.metadatadev=/dev/docker-vg/docker-metadata'


[1]https://github.com/openshift/vagrant-openshift/blob/master/lib/vagrant-openshift/action/install_origin_base_dependencies.rb#L217-L229

Comment 28 Daniel Walsh 2016-06-02 21:12:57 UTC
Steve are you testing against docker-latest though?

If yes, then you should be modifying
/etc/sysconfig/docker-latest-storage

Comment 29 Steve Kuznetsov 2016-06-03 12:58:44 UTC
(In reply to Daniel Walsh from comment #28)
> Steve are you testing against docker-latest though?
> 
> If yes then you should be modifying 
> sudo cat /etc/sysconfig/docker-latest-storage*

We aren't testing against docker-latest. 

This is the /bin/docker entry:

$ cat $(which docker)
#!/bin/sh
. /etc/sysconfig/docker
[ -e "${DOCKERBINARY}" ] || DOCKERBINARY=/usr/bin/docker-current
if [ ! -f /usr/bin/docker-current ]; then
    DOCKERBINARY=/usr/bin/docker-latest
fi
if [[ ${DOCKERBINARY} != "/usr/bin/docker-current" && ${DOCKERBINARY} != /usr/bin/docker-latest ]]; then
    echo "DOCKERBINARY has been set to an invalid value:" $DOCKERBINARY
    echo ""
    echo "Please set DOCKERBINARY to /usr/bin/docker-current or /usr/bin/docker-latest
by editing /etc/sysconfig/docker"
else
    exec ${DOCKERBINARY} "$@"
fi

This is how we are starting the Docker daemon:

/usr/bin/docker-current daemon --authorization-plugin=rhel-push-plugin
                               --insecure-registry=172.30.0.0/16 
                               --insecure-registry=ci.dev.openshift.redhat.com:5000 
                               --selinux-enabled 
                               -s devicemapper 
                               --storage-opt dm.datadev=/dev/docker-vg/docker-data 
                               --storage-opt dm.metadatadev=/dev/docker-vg/docker-metadata

Comment 30 Daniel Walsh 2016-06-03 17:53:12 UTC
Ok the logs above indicated docker-latest, which was causing confusion.

Comment 31 Steve Kuznetsov 2016-06-03 19:06:51 UTC
(In reply to Daniel Walsh from comment #30)
> Ok the logs above indicated docker-latest, which was causing confusion.

Which logs? Everything is running docker-current, there is no docker-latest even installed on the system.

Comment 32 Daniel Walsh 2016-06-03 20:16:42 UTC
In comment 1

Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: process finished code=0

Comment 33 Steve Kuznetsov 2016-06-03 20:28:51 UTC
@Antonio

Running the tests again, trying to get the flakes to show up with the upstream version of docker 1.10.3[1] and also with the old docker 1.9.1[2].

In the configuration with the upstream docker 1.10.3, as evidenced by the diff in the above link, I had to manually create the unit file for the daemon to mirror what we had previously. In the process, I removed the dependency on the `rhel-push-plugin`, which was previously provided to the daemon as `--authorization-plugin=rhel-push-plugin`, as noted above.

I will report back when I can determine if I am able to get the flakes to reproduce with either setup.

[1] https://github.com/openshift/vagrant-openshift/pull/452
[2] https://github.com/openshift/vagrant-openshift/pull/453

Comment 34 Antonio Murdaca 2016-06-03 20:33:16 UTC
Steve, awesome

Comment 35 Steve Kuznetsov 2016-06-06 14:55:42 UTC
The Docker 1.9.1 RPM from the rhel-server-extras repository fails `docker pull` with the following message:

--> Image centos:centos7 was not found, pulling ...
error: API error (500): repository name component must match "[a-z0-9](?:-*[a-z0-9])*(?:[._][a-z0-9](?:-*[a-z0-9])*)*"
!!! Error in hack/build-base-images.sh:28
    '"${oc}" ex dockerbuild $2 $1' exited with status 1
Call stack:
    1: hack/build-base-images.sh:28 build(...)
    2: hack/build-base-images.sh:32 main(...)
Exiting with status 1

I'm not sure I'll be able to continue testing with that version of Docker and OpenShift at origin/master:HEAD.
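The 500 error above quotes the exact repository-name-component pattern the daemon enforces. A quick way to check which name components pass it (a standalone sketch; `valid_component` is a hypothetical helper written for this illustration, not part of Docker):

```python
import re

# Pattern quoted verbatim from the API error message above.
COMPONENT = re.compile(r"[a-z0-9](?:-*[a-z0-9])*(?:[._][a-z0-9](?:-*[a-z0-9])*)*")

def valid_component(name):
    """True iff the entire component matches the daemon's grammar."""
    return COMPONENT.fullmatch(name) is not None

for name in ("centos", "centos7", "my-image", "Centos", "-bad", "a..b"):
    print(name, valid_component(name))
```

Lowercase alphanumeric names such as `centos` pass, while uppercase letters, a leading dash, or doubled separators fail, so a 500 here suggests some component of the name the client sent fell outside that grammar.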

Comment 36 Antonio Murdaca 2016-06-06 15:56:10 UTC
I think I've found a possible deadlock in one of our patches - the stack dump shows 2 goroutines waiting to acquire a lock. I still have to test this out with Steve.

Comment 37 Antonio Murdaca 2016-06-06 16:19:12 UTC
Sent out the patch with my guess at how to fix this to Lokesh, who is going to rebuild docker for rhel7 so that Steve can test it out.

Comment 38 Antonio Murdaca 2016-06-07 07:15:59 UTC
Opened a PR to fix this wedge: https://github.com/projectatomic/docker/pull/161

Waiting on test results from Steve as well.

Comment 39 Daniel Walsh 2016-06-07 15:05:42 UTC
Once Steve confirms the fix, we will merge.

Comment 40 Steve Kuznetsov 2016-06-07 17:32:08 UTC
Testing with Antonio's patch yielded:
  4/100 conformance test failures unrelated to the Docker daemon
 41/100 failures in correctly provisioning the environment (not related)[1][2]
  7/100 failures in `docker rm`[3]

Since we did not have nearly 100 clean runs, I will do another batch right now, but preliminary results look good.

[1] https://github.com/openshift/origin/issues/8571
[2] https://github.com/openshift/origin/issues/9202
[3] https://github.com/openshift/origin/issues/9201

Comment 41 Steve Kuznetsov 2016-06-07 19:45:24 UTC
After further testing, the daemon built with Antonio's first set of patches (docker-1.10.3-23, built by Lokesh) has again exhibited the Docker daemon hang described in this bug.

Comment 42 Daniel Walsh 2016-06-07 19:49:48 UTC
If the problem is still happening, could you attach new logs and a traceback so we can see whether we have a new issue or the same one.

Comment 43 Steve Kuznetsov 2016-06-07 19:55:41 UTC
Created attachment 1165757 [details]
docker-1.10.3-32 stacktrace

Comment 44 Steve Kuznetsov 2016-06-07 19:56:25 UTC
Created attachment 1165758 [details]
docker-1.10.3-32 journal

Comment 45 Antonio Murdaca 2016-06-07 20:08:37 UTC
Let's try my second patch - the one Dan merged in projectatomic/docker - I think that's better.

Comment 46 Antonio Murdaca 2016-06-07 20:10:42 UTC
The patch in https://github.com/projectatomic/docker/pull/161, which is the newer one Steve is going to try, should avoid the deadlock entirely because we use a copy of the containers list. Let's test that!
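The idea of iterating over a copy of the containers list, as described here, can be sketched as follows (a hypothetical illustration of the locking shape, not the actual Docker code; `ToyDaemon` and its methods are invented for this sketch):

```python
import threading

class ToyDaemon:
    """Toy daemon whose 'containers' list is guarded by one non-reentrant
    lock, like a Go sync.Mutex."""

    def __init__(self):
        self.lock = threading.Lock()
        self.containers = ["web", "db", "cache"]

    def remove(self, name):
        # Per-container work takes the daemon lock.
        with self.lock:
            self.containers.remove(name)

    def remove_all_buggy(self):
        # Deadlock-prone shape: iterate while holding the lock, then call
        # remove(), which tries to re-acquire the same non-reentrant lock.
        with self.lock:
            for name in list(self.containers):
                self.remove(name)  # would block forever here -- never call this

    def remove_all_fixed(self):
        # The fix's idea: snapshot the list under the lock, release it,
        # then do the per-container work without holding the outer lock.
        with self.lock:
            snapshot = list(self.containers)
        for name in snapshot:
            self.remove(name)

d = ToyDaemon()
d.remove_all_fixed()   # remove_all_buggy() would hang instead
print(d.containers)
```

The snapshot costs one shallow copy per pass but removes the lock-ordering hazard, which matches the "avoid deadlocking at all" reasoning in this comment.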

Comment 47 Steve Kuznetsov 2016-06-08 14:35:57 UTC
From 100 tests this morning (some categories overlap, so the counts don't add up to 100):
 17 failures from `yum`[1]
 78 failures from `devicemapper`[2]
  9 test failures [3][4][5][6][7][8]

In none of the test cases, however, did we see the Docker daemon hang.

[1] https://github.com/openshift/origin/issues/9203
[2] https://github.com/openshift/origin/issues/9201
[3] https://github.com/openshift/origin/issues/9221
[4] https://github.com/openshift/origin/issues/9222
[5] https://github.com/openshift/origin/issues/9223
[6] https://github.com/openshift/origin/issues/9224
[7] https://github.com/openshift/origin/issues/9225
[8] https://github.com/kubernetes/kubernetes/issues/23591

Comment 48 Antonio Murdaca 2016-06-08 14:37:02 UTC
Fixed in the latest rhel7-1.10.3 branch. Assigning to Lokesh to rebuild and ship the errata in RHEL.

Comment 52 errata-xmlrpc 2016-06-23 16:19:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1274

