Bug 1389829 - docker hangs and USR1 kills it
Summary: docker hangs and USR1 kills it
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jhon Honce
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-28 19:17 UTC by Luke Meyer
Modified: 2016-10-28 20:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-28 20:51:32 UTC
Target Upstream Version:



Description Luke Meyer 2016-10-28 19:17:56 UTC
Description of problem:

The main problem is that, very sporadically, docker seems to hang (e.g. docker ps and docker info never return), and pods on the system get "stuck" in terminating. The typical experience has been that restarting docker doesn't work, it's impossible to clear out /var/lib/docker and start over (something remains mounted in it), and even a soft reboot hangs, requiring a hard reboot.

In addition to that problem, we're having trouble gathering useful diagnostics about what is going on. Sending kill -USR1 to the docker process does not produce a stack trace; instead, the process exits. strace on the process doesn't show any activity at all.

Version-Release number of selected component (if applicable):
docker-1.10.3-44.el7.x86_64
OSE 3.2.1

How reproducible:
Every few weeks among the customer's many nodes.

Additional info:
This seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1380011 but it's a newer version of docker and OSE, and a simple restart doesn't usually work.

Also, sending USR1 to docker makes it exit even when it's not in this hung state, which is very surprising, to me at least.

Going to attach some logs and diagnostics from the customer.

Comment 5 Luke Meyer 2016-10-28 20:24:35 UTC
I set up a test system with this exact version of docker (docker-1.10.3-44.el7.x86_64) and tested with kill -USR1. It exited. I then upgraded to docker-1.10.3-46.el7.14 and tested again, and it gave the expected stack trace. So that is one mystery solved.

Comment 6 Jhon Honce 2016-10-28 20:51:32 UTC
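The unit file from -44 contains a pipeline in ExecStart, whereas the one from -46 does not. So in -44 the MainPID that systemd reports is the shell that invoked the docker process, not the docker process itself, as it is in -46. The USR1 was therefore delivered to the shell, which has no handler for it and exits.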
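
To illustrate why signalling the wrapper shell ends the process instead of producing a stack trace, here is a minimal sketch in Python. It does not use docker at all: `sleep 30 | cat` is a stand-in pipeline, and the function name `signal_pipeline_shell` is made up for this example. It assumes a POSIX system with `sh` on the PATH. The default action for SIGUSR1 is to terminate the process, and a plain shell running a pipeline installs no handler for it.

```python
import os
import signal
import subprocess
import time


def signal_pipeline_shell(sig=signal.SIGUSR1):
    """Run a shell pipeline, signal the shell itself, and return its exit status."""
    # The shell stays alive as the parent of both pipeline members, which is
    # the role the shell plays when ExecStart contains a pipeline.
    shell = subprocess.Popen(
        ["sh", "-c", "sleep 30 | cat"],  # stand-in pipeline, not the real unit file
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        start_new_session=True,  # own process group, so cleanup is simple
    )
    try:
        time.sleep(0.2)  # give the shell a moment to start the pipeline
        os.kill(shell.pid, sig)  # this is what signalling the reported main PID does
        return shell.wait(timeout=5)
    finally:
        # Remove whatever is left of the pipeline (the orphaned sleep and cat).
        try:
            os.killpg(shell.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass


if __name__ == "__main__":
    status = signal_pipeline_shell()
    # A negative return code means "terminated by that signal number".
    print("shell exit status:", status)
```

A negative status equal to `-signal.SIGUSR1` means the shell was terminated by the signal rather than handling it, which matches the behaviour seen with -44.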

Using kill -USR1 against the docker process in -44 works as expected.
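
On a node still running -44, the daemon can be located as a child of the reported main PID before sending the signal. The following is a rough sketch, assuming Linux with /proc mounted. The helper names `read_proc_table` and `find_daemon_pid` are hypothetical, and the process name "docker" in `names` is an assumption that should be checked against the actual binary name on the node.

```python
import os
import signal


def read_proc_table(proc_root="/proc"):
    """Return {pid: (ppid, comm)} parsed from /proc/<pid>/stat (Linux only)."""
    table = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # comm is wrapped in parentheses and may itself contain spaces or ')'.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        fields = stat[stat.rindex(")") + 2 :].split()
        table[int(entry)] = (int(fields[1]), comm)  # fields[0] is state, fields[1] is ppid
    return table


def find_daemon_pid(main_pid, table, names=("docker",)):
    """Return main_pid if its comm matches, else the nearest matching descendant."""
    pending, seen = [main_pid], set()
    while pending:
        pid = pending.pop(0)
        if pid in seen:
            continue
        seen.add(pid)
        if pid in table and table[pid][1] in names:
            return pid
        # Breadth-first: queue the direct children of this pid.
        pending.extend(sorted(p for p, (ppid, _) in table.items() if ppid == pid))
    return None


def send_usr1_to_daemon(main_pid):
    """Signal the daemon found under main_pid; return the pid signalled, or None."""
    pid = find_daemon_pid(main_pid, read_proc_table())
    if pid is not None:
        os.kill(pid, signal.SIGUSR1)
    return pid
```

The main PID itself can be read with `systemctl show -p MainPID docker`. If the main PID is already the daemon (as in -46), `find_daemon_pid` returns it unchanged, so the same call works on both versions.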

