Description of problem:
The main problem is that, very sporadically, docker hangs (e.g. "docker ps" and "docker info" never return), and pods on the node get stuck in Terminating. The typical experience is that restarting docker doesn't help, it's impossible to clear out /var/lib/docker and start over (something remains mounted under it), and even a soft reboot hangs, requiring a hard reboot.

In addition, we're having trouble gathering useful diagnostics about what is going on. Sending kill -USR1 to the docker process does not produce a stack trace; instead, the process exits. strace on the process shows no system-call activity at all.

Version-Release number of selected component (if applicable):
docker-1.10.3-44.el7.x86_64
OSE 3.2.1

How reproducible:
Every few weeks among the customer's many nodes.

Additional info:
This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1380011, but it involves newer versions of docker and OSE, and a simple restart doesn't usually work. Also, sending USR1 to docker makes it exit even when it's not in this hung state, which is very surprising to me at least. Going to attach some logs and diagnostics from the customer.
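For context, the checks run on an affected node were roughly the following (the timeout value and process name are illustrative, not exact transcripts; the daemon may be named docker or docker-current depending on the build):

  # does the daemon respond at all? bound it so the shell itself doesn't hang
  timeout 30 docker info || echo "docker appears hung"

  # anything still mounted under /var/lib/docker that would block wiping it?
  grep /var/lib/docker /proc/mounts
  findmnt -R /var/lib/docker

  # what is the daemon doing at the syscall level? (in our case: nothing)
  strace -f -p $(pidof docker)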
I set up a test system with this exact version of docker (docker-1.10.3-44.el7.x86_64) and sent kill -USR1 to the daemon: it exited. I then upgraded to docker-1.10.3-46.el7.14 and repeated the test, and it produced the expected stack trace. So that is one mystery solved.
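For the record, the test was just sending the signal to the PID systemd reports as the unit's main process and checking the journal for the goroutine dump (unit name and timing are as on my test box):

  MAINPID=$(systemctl show docker -p MainPID | cut -d= -f2)
  kill -USR1 "$MAINPID"
  # on -46.el7.14 the daemon stays up and dumps goroutine stacks to its log
  journalctl -u docker --since "-2 min" | grep -c goroutine
  # on -44 the process named by MainPID simply exits (see the next comment for why)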
The unit file in -44 uses a shell pipeline for ExecStart, whereas the one in -46 does not. As a result, the MainPID reported for -44 is the shell that invoked the docker daemon, not the daemon itself as it is in -46. Sending kill -USR1 directly to the actual docker daemon process in -44 works as expected.
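A minimal sketch of the difference (paraphrased, not the literal unit contents; the shipped units carry many more options and the daemon binary name depends on the build):

  # -44 style: ExecStart wraps the daemon in a shell pipeline to forward its
  # output to the journal, so the MainPID systemd tracks is the wrapping shell
  ExecStart=/bin/sh -c '/usr/bin/docker-current daemon $OPTIONS 2>&1 | /usr/bin/forward-journald -tag docker'

  # -46 style: ExecStart runs the daemon directly, so MainPID is the daemon itself
  ExecStart=/usr/bin/docker-current daemon $OPTIONS

So on -44, a signal aimed at MainPID lands on the wrapper shell, which exits and takes the pipeline down with it. To get the goroutine dump there, the signal has to go to the daemon process itself, found via something like pgrep -f 'docker.* daemon' or from the process tree shown by systemctl status docker.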