What version of docker is this happening on?
(In reply to Daniel Walsh from comment #1)
> What version of docker is this happening on?

# docker --version
Docker version 1.12.2-rc2, build 19dbe35

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 1.12.2-rc2
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 2.693 GB
 Data Space Total: 34.32 GB
 Data Space Available: 31.63 GB
 Metadata Space Used: 544.8 kB
 Metadata Space Total: 88.08 MB
 Metadata Space Available: 87.54 MB
 Thin Pool Minimum Free Space: 3.432 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-09-28)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: overlay bridge null host
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: runc docker-runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 15.26 GiB
Name: ip-172-31-58-210.us-west-2.compute.internal
ID: 5BHK:LWOJ:6HOD:TCPX:YAPL:MBSO:WLKM:E4I7:A5YR:JCMI:Z42L:W7J6
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.ops.openshift.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 registry.ops.openshift.com
 127.0.0.0/8
Registries: registry.ops.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)
This looks like the docker daemon is going down, or the container loses access to the docker socket?
Make sure the docker daemon is running in a slave mount namespace. Check the docker.service file and make sure MountFlags=slave is there. Also, how do I reproduce the problem? I need a low-level reproducer to be able to dive deeper.
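A quick way to verify both halves of that (a sketch; "dockerd-current" is the daemon binary name used on these RHEL nodes, adjust if yours differs):

# Confirm the unit sets slave mount propagation
grep MountFlags /usr/lib/systemd/system/docker.service

# Confirm the running daemon is in a different mount namespace than the host (PID 1)
readlink /proc/1/ns/mnt
readlink /proc/$(pidof dockerd-current)/ns/mnt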
(In reply to Vivek Goyal from comment #4)
> Make sure docker daemon is running in slave mount namespace. Check
> docker.service file and make sure MountFlags=slave is there.

Here is what I have in /usr/lib/systemd/system/docker.service on the application node:

# cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target
Wants=docker-storage-setup.service
Requires=rhel-push-plugin.socket

[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
          --add-runtime docker-runc=/usr/libexec/docker/docker-runc \
          --default-runtime=docker-runc \
          --authorization-plugin=rhel-push-plugin \
          --exec-opt native.cgroupdriver=systemd \
          $OPTIONS \
          $DOCKER_STORAGE_OPTIONS \
          $DOCKER_NETWORK_OPTIONS \
          $ADD_REGISTRY \
          $BLOCK_REGISTRY \
          $INSECURE_REGISTRY
ExecReload=/bin/kill -s HUP $MAINPID
TasksMax=infinity
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
MountFlags=slave

[Install]
WantedBy=multi-user.target

========================================

> Also how do I reproduce the problem. I need a low level reproducer to be
> able to dive deeper.

Basically, on a recent OCP 3.4 master node with devicemapper as the docker storage driver, create a new project on the OpenShift master node, then run:

oc new-app --template=dancer-mysql-example

Then start a new build:

oc start-build dancer-mysql-example

You may need to repeat the oc start-build command 2-3 times if the build is successful before you can reproduce it. In my case I run reliability testing, where we are creating projects, deploying new applications, and accessing the applications in the background.

I can give you access to my testbed if you email me (wabouham) and send me your id_rsa.pub public key so I can add it to the authorized_keys file on my node.
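Since the failure is intermittent, a small loop like this can help hit it (a sketch using the template above; "repro-test" is an arbitrary project name and the sleep is a crude stand-in for waiting on build completion):

oc new-project repro-test
oc new-app --template=dancer-mysql-example

for i in 1 2 3; do
    oc start-build dancer-mysql-example
    sleep 300    # give the build time to finish before starting the next one
done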
This is beginning to look like a leaked mount point causing the issue. Someone needs to make sure that docker/docker-containerd, and potentially docker-shim, are running in the same mount namespace, and that this mount namespace is different from the host's.

ls -l /proc/PID/ns/mnt
ls -l /proc/self/ns/mnt
lrwxrwxrwx. 1 root root 0 Nov 10 16:06 /proc/self/ns/mnt -> 'mnt:[4026531840]'
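Something like this (a sketch; the process names are assumptions based on this system's packaging) prints the mount namespace of each relevant process for comparison:

# PID 1 gives the host's mount namespace; the docker processes should share
# a single, different mnt:[...] value if MountFlags=slave is working.
for p in 1 $(pidof dockerd-current docker-containerd-current); do
    echo -n "pid $p: "
    readlink /proc/$p/ns/mnt
done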
So we are running with MountFlags=slave. That's good. But it looks like the mount point is still leaking into some process, which keeps the dir/device busy. Can we try to figure out which process this mount point is leaking into?

When this error happens, try to grep for the failing container id in /proc as follows. This basically goes through all processes on the system, checks their mount namespaces, and sees which process might be seeing the device.

find /proc/*/mounts | xargs grep <container-id>

Once you see a bunch of entries, extract the pids from them and check which process names these pids belong to. Something as simple as the following will work:

ps -eaf | grep <pid>

BTW, I was debugging a similar issue upstream and I suspect it could be the same. Upstream is trying to fix it.

https://github.com/docker/docker/issues/27381
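Putting those two steps together, a sketch like this prints the owning processes directly (<container-id> is a placeholder for the id from the error message):

CID=<container-id>
# Find every process whose mount table references the container, then show it
for f in $(grep -l "$CID" /proc/*/mounts 2>/dev/null); do
    pid=$(echo "$f" | cut -d/ -f3)
    ps -p "$pid" -o pid,ppid,comm,args --no-headers
done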
Upstream has fixed this in containerd:

https://github.com/docker/containerd/pull/348

Hopefully this fixes the docker issue. We will have to make sure it shows up in docker-1.12.
We tested this today with OpenShift on OpenStack; this time it blocked all the builds. All the builds failed with:

error: Execution of post execute step failed
warning: Failed to remove container "0dc2039735d15f5785a8335b83ff61403d01b4d785e4d805984b290166a25636": Error response from daemon: {"message":"Driver devicemapper failed to remove root filesystem 0dc2039735d15f5785a8335b83ff61403d01b4d785e4d805984b290166a25636: Device is Busy"}
error: build error: building haowang/dancer-mysql-example-1:03406c8a failed when committing the image due to error: Cannot connect to the Docker daemon. Is the docker daemon running on this host?

# docker version
Client:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-8.el7.x86_64
 Go version:      go1.6.2
 Git commit:      8b91553-redhat
 Built:           Wed Nov 16 11:35:08 2016
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.3
 API version:     1.24
 Package version: docker-common-1.12.3-8.el7.x86_64
 Go version:      go1.6.2
 Git commit:      8b91553-redhat
 Built:           Wed Nov 16 11:35:08 2016
 OS/Arch:         linux/amd64

# openshift version
openshift v3.4.0.32+d349492
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
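Since the build also reports "Cannot connect to the Docker daemon", it is worth checking whether the daemon itself died at that moment (a sketch using standard systemd tooling, nothing bug-specific):

systemctl is-active docker     # should print "active"
journalctl -u docker -n 50     # recent daemon log; look for panics or restarts
ls -l /var/run/docker.sock     # socket should exist and be owned by root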
I think with docker 1.10 this is a generic layer reference counting issue (and not a storage driver issue). The error code ErrActiveMount maps to the string "mount still active", and I suspect that in this case it is coming from the release() function in layer/mounted_layer.go.

Also, I think something has gone wrong with reference counting. Maybe some bug is present. 1.12 does not seem to have that bug, as we get the "Device Busy" message, and that suggests the device is still busy because it is mounted in some other namespace.

Antonio, would you have some thoughts on this layer reference counting issue in 1.10.3?

func (rl *referencedRWLayer) release() error {
	rl.activityL.Lock()
	defer rl.activityL.Unlock()

	// If the layer still has active mounts, refuse to release it
	// with ErrActiveMount ("mount still active").
	if rl.activityCount > 0 {
		return ErrActiveMount
	}
	rl.activityCount = -1
	return nil
}
Thanks, Antonio. So we don't know why the reference counting is going wrong. That means we need a reproducer so that we can try to narrow it down.

Walid, can you give me a Dockerfile which can reproduce this issue? I can try running it in my VM and, if the issue reproduces, debug it.
Vivek, Antonio,

I created a separate BZ to track the Docker 1.10.3 issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1403027
Title: Build fails with Driver devicemapper failed to remove root filesystem: mount still active

My cluster is also still available if you need to run some tests on it. Please refer to BZ 1403027 so we can keep the two issues separate.

Thanks,
Walid
Just going through the logs. I think the containers which are giving the "mount still active" error have not been stopped properly through docker. I am not seeing a "stop" message for these containers.

For example, following are some of the messages for container "eafe4f3abbd3491e21444c":

Dec 7 06:58:37 ip-172-31-7-32 docker-current: time="2016-12-07T06:58:37.053695653-05:00" level=info msg="{Action=attach, ID=eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615, LoginUID=4294967295, PID=8672}"
Dec 7 06:58:37 ip-172-31-7-32 docker-current: time="2016-12-07T06:58:37.058841359-05:00" level=info msg="{Action=start, ID=eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615, LoginUID=4294967295, PID=8672,..........
Dec 7 06:58:41 ip-172-31-7-32 docker-current: time="2016-12-07T06:58:41.098285980-05:00" level=info msg="{Action=wait, ID=eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615, LoginUID=4294967295, PID=8672}"
Dec 7 07:01:44 ip-172-31-7-32 systemd: Stopped docker container eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615.
Dec 7 07:03:46 ip-172-31-7-32 docker-current: time="2016-12-07T07:03:46.663777541-05:00" level=info msg="{Action=remove, ID=eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615, LoginUID=4294967295, PID=8672}"

Note: we got create, start, wait, and remove directly. There is no "stop" command for the container. The only related message is from systemd, saying the container stopped. I am not sure who told systemd to stop that container. Did we not call "docker stop" to stop the container? I suspect that we did not go through docker to stop the container, and hence the mount is still active.

I am still wondering how systemd came to report that it stopped the above container while I don't see any log of a stop message. How was systemd told to stop this container? Mrunal, would you have any idea about this?
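To run the same check quickly on any other failing container, the per-action audit messages can be pulled out of the log like this (a sketch; the log path matches the excerpts above):

CID=eafe4f3abbd3491e21444cc10a985473f54fa4d625183f507819df02b6927615
grep "$CID" /var/log/messages | grep -o 'Action=[a-z]*'
# a clean lifecycle should show a stop action before Action=remove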
Vivek, docker has a check that doesn't allow removal unless force is passed as well. I think that the container is being force-removed in the log above.
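That is, the difference between these two (a sketch; <container-id> is a placeholder):

docker rm <container-id>      # refused by the daemon while the container runs
docker rm -f <container-id>   # kills the container and removes it in one step,
                              # bypassing the check above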
So docker 1.12 gives the device busy error message, and this most likely means that the mount point has leaked into some other mount namespace. Now I will need some help from OpenShift engineers to provide the following:

- Provide a reproducer which I can run on my local system to reproduce the problem.

- If a reproducer is not in sight yet, then just after the failure, run the following on the node where the failure happened:

  $ find /proc/*/mounts | xargs grep <container-id>

  Here <container-id> is taken from the error message.

- The above command will list a bunch of files which contain a pid in them. Pick any one pid and run "ps -eaf | grep <pid>" to see which process has that pid. That should give some idea about where we might be leaking the mount point (see also the sketch after this list).
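Once a suspect pid is found, its mount namespace can be inspected directly, e.g. (a sketch; <pid> and <container-id> are placeholders):

# Run findmnt inside the suspect process's mount namespace and look for
# the leaked devicemapper mount
nsenter -t <pid> -m findmnt | grep <container-id>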
Antonio, Lokesh, do we have the latest containerd fixes in docker-1.12? That might help fix this issue.

Specifically, upstream I was debugging a similar issue:

https://github.com/docker/docker/issues/27381

And Michael Crosby said that a containerd PR might fix it:

https://github.com/docker/containerd/pull/348

This was merged very recently (Nov 14, 2016). Do we have those fixes? If not, I think we should take those fixes, retry, and see whether the problem is resolved.
(In reply to Vivek Goyal from comment #32)
> Antonio, Lokesh, Do we have latest containerd fixes in docker-1.12. That
> might help fix this issue.
>
> Specifically, upstream I was debugging similar issue.
>
> https://github.com/docker/docker/issues/27381
>
> And crosby michael said that a containerd PR might fix it.
>
> https://github.com/docker/containerd/pull/348
>
> This was merged very recently (Nov 14, 2016).
>
> Do we have those fixes? If not, I think we should try to take those fixes
> and retry and see if problem is resolved or not.

The containerd commit has been included since at least 1.12.5-3.
Backported:

https://github.com/projectatomic/docker/commit/e7e50ff6d9b7638679fbcc7835fb810ca8fd9101

Also proposed upstream to include your fix in the 1.13.x release:

https://github.com/docker/docker/pull/29995
This should be working in OCP v3.5.0.10 or newer.
Verified on OCP cluster:

openshift v3.5.0.8+736adae
kubernetes v1.5.2+43a9be4
docker version: 1.12.6-1

This issue was not observed after 6 days of continuous testing.
Is there any possibility of getting this fix as a backport into 3.4?
So the following error message comes because this mount point (dir) is still mounted in another mount namespace:

Feb 09 16:49:37 node2 dockerd-current[92384]: time="2017-02-09T16:49:37.481109614Z" level=error msg="Handler for DELETE /containers/df4aa648247b095818bf75422e1b1825a570ee8b81614550a8c1df50641aa342 returned error: Driver devicemapper failed to remove root filesystem df4aa648247b095818bf75422e1b1825a570ee8b81614550a8c1df50641aa342: remove /var/lib/docker/devicemapper/mnt/707fc7d514f8136c0985152a76fb51232ebb7b8a035c37e82e1a084ab057995d: device or resource busy"

So it could mean a few things:

- The container is still somehow running, and that's why the container rootfs is still mounted on this path.

- Or the container mount point has leaked into some other mount namespace, and that keeps this directory busy, so container deletion fails.

So I have a few questions:

- Does this problem fail the build? If not, then it should not be a huge issue, as these containers should be cleaned up by the cron job which runs periodically and tries to clean up dead containers (a sketch of such a pass follows this list). This assumes the situation is temporary, and that as soon as the process keeping the mount points busy exits, the container is removable again.

- Can we upgrade to the latest docker? I remember there were some fixes for containerd so that containers could be properly stopped before we tried to remove them. It is possible this problem belongs to that class, so let's upgrade to the latest and rule out that possibility.

- If the problem persists even after the upgrade, then I will need a small reproducer which I can run constantly to try to narrow down the problem.

Again, it's not a code bug as such in devicemapper. This whole exercise will take place just to narrow down who is keeping the mount point busy and why. Devicemapper is doing the right thing here.
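For reference, a sketch of such a cleanup pass over dead/exited containers (a plain "docker rm" here will still fail while the mount is held elsewhere, which is exactly the temporary condition described above):

# Remove containers that are no longer running; xargs -r avoids running
# docker rm with no arguments when nothing matches
docker ps -aq --filter status=dead --filter status=exited | xargs -r docker rm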
Reset status to 'assigned' to re-open issue.
If it can be of any help, we have an OpenShift cluster stuck where we can trigger this behavior on demand, 100% of the time. I say "stuck" because it actually causes our builds to fail.
The fix to change the timeout in s2i is here:

https://github.com/openshift/source-to-image/pull/693

It will need to be rebased into origin after it merges.
Also, please stop using forced removal of containers. That option should really not be used, IMO.
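That is, prefer a graceful stop followed by a plain remove (a sketch; <container-id> is a placeholder):

docker stop <container-id> && docker rm <container-id>

rather than "docker rm -f <container-id>", which kills and removes in one step and, per the discussion above, appears to be implicated in the busy-mount failures.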
Ben, in relation to comment #109, do we have a time frame for a fix of this issue? I am getting pressure from the SA, as he is having to provide information to his customer.

Kind regards,

Aaron Ship
Critical Situation Manager
Red Hat - Customer Experience & Engagement
aship
Mobile/SMS: +420 773003556
Office: +420 532294755
fix for 3.5: https://github.com/openshift/origin/pull/13172 and master: https://github.com/openshift/origin/pull/13170
This has been merged into OCP and is in OCP v3.5.0.38 or newer.
Completed testing on 3.5.0.50; did not see this issue.
Do we expect to pull this back into 3.4.x?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884