1126874 – Unable to delete container properly after docker daemon kill with started container on device-mapper

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1126874 - Unable to delete container properly after docker daemon kill with started container on device-mapper

Summary: Unable to delete container properly after docker daemon kill with started con...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	docker
Sub Component:
Version:	7.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Vivek Goyal
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	1194909
Blocks:
TreeView+	depends on / blocked

Reported:	2014-08-05 13:51 UTC by Jiri Zupka
Modified:	2019-03-06 01:27 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-09-29 13:56:04 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Jiri Zupka 2014-08-05 13:51:28 UTC

Description of problem:
Unable to delete container properly after killing of docker daemon with the started container.

Version-Release number of selected component (if applicable):
docker.x86_64 0:1.1.2-6.el7

How reproducible:
always


Steps to Reproduce:
1. start docker daemon
2. docker run -i -t --name=rm_test bash -c "while [ true ]; do sleep 1; done"  
     start infinite container in docker daemon 
     a) (process which not accept SIGTERM) (problem with process PID 1) PID namespaces
3. systemctl restart docker.service      
          try to restart docker daemon  (after 10s sends SIGKILL)
*****!!! container disk is still mounted because docker daemon was KILLED
4. docker rm rm_test
     should raise exception, but command docker ps show that container is successfully removed
5. systemctl restart docker.service      
6. docker ps
     show removed container but with autogenerated name.

Actual results:
     show removed container but with autogenerated name.


Expected results:
     container is properly removed.
 

Additional info:

Comment 2 Jiri Zupka 2014-08-05 14:49:57 UTC

will be reproducible by autotest docker test restart.

git clone –recursive git://github.com/autotest/autotest.git
cd autotest/client/
./autotest-local run docker --args="docker_daemon/restart"

This pull req countaine docker_daemon/restart tests
https://github.com/autotest/autotest-docker/pull/255

Comment 3 Daniel Walsh 2014-09-12 19:15:31 UTC

Could you recheck with docker-1.2

Comment 4 Matthias Clasen 2014-09-30 14:26:01 UTC

moving docker bugs off alexl

Comment 5 Mike Snitzer 2014-11-20 20:17:47 UTC

This is waaaaay too docker specific for me to make any sense of how this is a problem with DM thinp (kernel) code.

The comment#0 is very difficult for me to understand.  Not understanding why vbatts cannot own fixing this problem but consult me as needed.

Reassigning to Dan Walsh... ;)

Comment 6 Vivek Goyal 2014-11-24 18:25:25 UTC

Ok, So there seems to be two problems here.

- Should an init inside the container still be ignore SIGTERM if it is sent from host or any of the parent pid namespaces. I am not sure about the answer to that question but I kind of feel that it sounds wrong. How is one supposed to stop a container otherwise.

- There seem to be a problem on docker side too. If device is busy devmapper backend will return error to docker  and "docker rm" operation should fail. For some reason it looks like docker might be ignoring that code and still continue ahead with deletion of some metadata associated with the container. And when docker later starts, it probably is loading all the metadata and it finds the files associated with container in devmapper backend. As it has lost the original name of container, it generates one.

I think second behavior should get in docker for sure. Fail "docker rm" if backend can not finish the operation and returns error.

Comment 7 Vivek Goyal 2014-11-24 19:07:43 UTC

Ok, w.r.t first problem (init in pid namespace ignores SIGTERM), there has been some discussion.

https://lists.openvz.org/pipermail/devel/2010-May/024843.html
http://lwn.net/Articles/532748/

Bottom line seems to be that init needs to establish a SIGTERM handler. I am not sure sure if SIGTERM will be blocked from processes in same container or child/peer container or not. I hope it will be.

If that's the case, can we modify docker init to establish SIGTERM handler and
hopefully that will take care of the problem.

Irrespective of that, docker still needs to be fixed to handle error from devmapper backend though.

Comment 8 Vivek Goyal 2014-11-25 15:23:36 UTC

I guess following code is the problem.

Destroy() {
        // Deregister the container before removing its directory, to avoid race conditions
        daemon.idIndex.Delete(container.ID)
        daemon.containers.Delete(container.ID)
        container.derefVolumes()
        if _, err := daemon.containerGraph.Purge(container.ID); err != nil {
                log.Debugf("Unable to remove container from link graph: %s", err)
        }

        if err := daemon.driver.Remove(container.ID); err != nil {
                return fmt.Errorf("Driver %s failed to remove root filesystem %s: %s", daemon.driver, container.ID, err)
        }

}

we seem to delete container first from containerGraph and cleanup some
other metadata before we actually call into driver to Remove().

I would think that if we call into driver first to Remove() and cleanup
container metadata only after a successful call this problem should be fixed.

Not sure if there  are any other implications of changing this order.

Comment 10 Vivek Goyal 2015-03-10 20:16:34 UTC

Second part of the problem should be fixed by this commit. If device removal failed from device mapper, then docker will report it instead of silently removing container data from its metadata.

https://github.com/docker/docker/pull/11137

This is being tracked in following bz.

https://bugzilla.redhat.com/show_bug.cgi?id=1194909

Comment 11 Vivek Goyal 2015-04-02 19:45:50 UTC

Ok, this bug is combination of 3 issues.

- Restarting docker service did not kill the running container.
- As container was not killed, underlying device was still busy, hence docker rm operation failed but container was deleted from the view and it should not have been.
- When docker service was restarted, deleted container reappeared with a new regenerated name.

First problem seems to have been fixed upstream. I am testing latest docker bits and now container gets killed. I am not sure which set of patches fixed it.

Second problem should be fixed in 1.6.0 where my patches are there to not remove container from active view.

I think third problem is still there. I see that container reappears with new name. I will have to dig deeper into it.

Comment 13 Daniel Walsh 2015-06-03 11:56:08 UTC

Vivek any update on this bug?

Note You need to log in before you can comment on or make changes to this bug.