Bug 1389829

Summary: docker hangs and USR1 kills it
Product: OpenShift Container Platform Reporter: Luke Meyer <lmeyer>
Component: Containers Assignee: Jhon Honce <jhonce>
Status: CLOSED NOTABUG QA Contact: DeShuai Ma <dma>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.2.1 CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-10-28 20:51:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Luke Meyer 2016-10-28 19:17:56 UTC
Description of problem:

The main problem is that, very sporadically, docker appears to hang (e.g. docker ps and docker info never return), and pods on the system get "stuck" in terminating. The typical experience has been that restarting docker doesn't work, it's impossible to clear out /var/lib/docker and start over (something remains mounted in it), and even a soft reboot hangs, requiring a hard reboot.

In addition to that problem, we're having trouble gathering useful diagnostics about what is going on. Sending kill -USR1 to the docker process does not produce a stack trace; instead, the process exits. strace on the process shows no system call activity at all.

Version-Release number of selected component (if applicable):
docker-1.10.3-44.el7.x86_64
OSE 3.2.1

How reproducible:
Every few weeks among the customer's many nodes.

Additional info:
This seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=1380011 but it's a newer version of docker and OSE, and a simple restart doesn't usually work.

Also, sending USR1 to docker makes it exit even when it's not in this hung state, which is very surprising, to me at least.

Going to attach some logs and diagnostics from the customer.

Comment 5 Luke Meyer 2016-10-28 20:24:35 UTC
I set up a test system with this exact version of docker (docker-1.10.3-44.el7.x86_64) and tested with kill -USR1. It exited. I then upgraded to docker-1.10.3-46.el7.14 and tested again, and it produced the expected stack trace. So that is one mystery solved.

Comment 6 Jhon Honce 2016-10-28 20:51:32 UTC
The unit file from -44 contains a pipeline in ExecStart, whereas -46 does not. So in -44 the reported MainPID is the shell that invoked the docker process, not the docker process itself, as it is in -46.
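
The effect can be reproduced without docker. The following is a minimal Python sketch, not the actual unit file: it assumes a stand-in pipeline (sleep 30 | cat) run through /bin/sh, sends SIGUSR1 to the shell's PID (the analogue of the MainPID in -44), and reports how the shell ended. The function name and the stand-in pipeline are hypothetical. A shell with no trap installed takes the default action for SIGUSR1, which is to terminate.

```python
import os
import signal
import subprocess
import time


def usr1_to_wrapper_shell(pipeline="sleep 30 | cat"):
    """Send SIGUSR1 to a shell that is running a pipeline; return its exit status.

    The pipeline is a hypothetical stand-in for an ExecStart that wraps
    the daemon in a shell, as described for the -44 unit file.
    """
    proc = subprocess.Popen(
        ["/bin/sh", "-c", pipeline],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # own process group, so cleanup is easy
    )
    try:
        time.sleep(0.5)  # give the shell time to start the pipeline
        os.kill(proc.pid, signal.SIGUSR1)  # signal the shell, not its children
        return proc.wait(timeout=5)
    finally:
        # Clean up whatever is left of the pipeline (sleep, cat).
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass


if __name__ == "__main__":
    status = usr1_to_wrapper_shell()
    print("wrapper shell exit status:", status)
```

A negative return value is the negated number of the signal that killed the process, so a result equal to -SIGUSR1 means the wrapper shell itself died from the signal. Presumably systemd then treats the main process as gone and stops the service, which would match the exit observed on -44; that last step is an inference, not something confirmed in this report.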

Using kill -USR1 against the docker process in -44 works as expected.
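
To signal the daemon rather than the wrapper on -44, the processes running under the reported MainPID have to be identified first. A minimal sketch, assuming a Linux /proc filesystem; child_pids is a hypothetical helper, and the MainPID value would come from something like systemctl show -p MainPID docker:

```python
import os


def child_pids(parent_pid):
    """Return the PIDs whose parent is parent_pid, by scanning /proc/*/stat."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # The command name is wrapped in parentheses and may itself contain
        # spaces or ')', so split on the last ')' before reading fields.
        fields = stat.rsplit(")", 1)[1].split()
        ppid = int(fields[1])  # fields[0] is the state, fields[1] the parent PID
        if ppid == parent_pid:
            children.append(int(entry))
    return sorted(children)


if __name__ == "__main__":
    # Example: list the children of the current process's parent.
    print(child_pids(os.getppid()))
```

Which of the returned PIDs is the docker daemon still has to be checked (for example by reading /proc/<pid>/cmdline) before sending USR1 to it.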