Bug 1328204

Summary: docker/docker-latest daemons unable to clean up thinp resources
Product: Red Hat Enterprise Linux 7 Reporter: Qian Cai <qcai>
Component: dockerAssignee: Stephen Chung <schung>
Status: CLOSED DUPLICATE QA Contact: atomic-bugs <atomic-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.2CC: ajia, amurdaca, jrieden, lsm5, vgoyal
Target Milestone: rcKeywords: Extras
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-06-20 13:35:39 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
the docker-latest journal file when it is in a mess
none
docker-latest daemon journal with debug enabled none

Description Qian Cai 2016-04-18 17:37:19 UTC
Description of problem:
Right now, it is possible to start either docker or docker-latest daemon if there is another docker/docker-latest is running. The second one will take over the first one became the one that is running, i.e.,

# systemctl start docker
# systemctl start docker-latest
(Now, docker-latest is running, and docker is dead.)

After this happened, it created a mess including systemctl stop docker/docker-latest or docker/docker-latest stop $(docker/docker-latest ps -a -q) unable to clean up thinp resources. For example,

# docker-latest run -d nginx
# dmsetup info
Name:              rhel--latest-docker--latest--pool
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-11skPfGYN2KRf5gk6ioa79x0edYyxteFMlKw3QbD3wS1O8ueAFHPaVtfTgmnRo4g-tpool
...
Name:              docker-8:1-8571283-425cb51f6f604209c0dbc26691918c9dfe685f16188a29107352fbb39670ebdf
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1

# docker-latest stop $(docker-latest ps -a -q)
(No docker process is running after that.)
# dmsetup info
Name:              rhel--latest-docker--latest--pool
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-11skPfGYN2KRf5gk6ioa79x0edYyxteFMlKw3QbD3wS1O8ueAFHPaVtfTgmnRo4g-tpool
...
Name:              docker-8:1-8571283-425cb51f6f604209c0dbc26691918c9dfe685f16188a29107352fbb39670ebdf
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 6
Number of targets: 1

# systemctl stop docker-latest
# dmsetup info
(Same output as the above.)

As the results, there is no way to destroy the thinp,

# lvremove rhel-latest
  Logical volume rhel-latest/docker-latest-pool is used by another device.

until issued one of the followings,

1) "docker-latest rm"
2) reboot
3) manually deleted all docker-8:1-xxx devices using "dmsetup remove -f"

How reproducible:
always

Comment 1 Daniel Walsh 2016-04-18 17:48:32 UTC
Vivek any ideas?

Comment 2 Qian Cai 2016-04-18 17:53:33 UTC
Created attachment 1148260 [details]
the docker-latest journal file when it is in a mess

Comment 4 Daniel Walsh 2016-04-18 17:57:12 UTC
Antonio from the log file above it looks like we have the parseint error.

Comment 5 Vivek Goyal 2016-04-18 17:59:21 UTC
Looks like some error at higher level fails to cleanup container storage hence making thin pool busy.

Comment 6 Antonio Murdaca 2016-04-18 18:12:21 UTC
Dan is that happening with a docker 1.9 daemon? Not sure we merged the fix in rhel 1.9

Comment 7 Daniel Walsh 2016-04-18 18:18:11 UTC
Which patch?

Comment 8 Antonio Murdaca 2016-04-18 18:20:07 UTC
https://github.com/projectatomic/docker/pull/88

We didn't end up merging this in Fedora branch, nor we merged it for rhel in 1.9

I'll make a PR against the rhel branch for inclusion

Comment 9 Qian Cai 2016-04-18 18:41:31 UTC
OK, it turned out it does not even need docker/docker-latest daemons to take over each other to reproduce this. Just use all legit operations.

# rpm -qa | grep docker
docker-selinux-1.9.1-28.el7.x86_64
docker-1.9.1-28.el7.x86_64
docker-utils-1.9.1-28.el7.x86_64
docker-forward-journald-1.9.1-28.el7.x86_64
docker-latest-1.10.3-9.el7.x86_64

# cat /etc/sysconfig/docker-storage-setup
DEVS=/dev/vda
VG=rhel
# cat /etc/sysconfig/docker-latest-storage-setup
DEVS=/dev/vdb
VG=rhel-latest

Everything starts from scratch.
# systemctl stop docker
# systemctl stop docker-latest
# rm -rf /var/lib/docker/*
# rm -rf /var/lib/docker-latest/*
# rm -f /etc/sysconfig/docker-storage
# rm -f /etc/sysconfig/docker-latest-storage
# lvs
(Empty as expected.)
# dmsetup info | grep 'docker-8:1'
(Empty as expected.)

# systemctl start docker
# lvs
  LV          VG   Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool rhel twi-a-t--- 3.19g             1.82   0.33                            
# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-04-18 14:11:06 EDT; 5s ago

# Easy to reproduce by running a lot of containers.
# for i in `seq 1 100`; do docker run -d nginx; done
# docker stop $(docker ps -a -q)
# dmsetup info | grep 'docker-8:1'
(Empty as expected.)

# docker info
Containers: 100
Images: 8
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: rhel-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 2.992 GB
 Data Space Total: 5.931 GB
 Data Space Available: 2.939 GB
 Metadata Space Used: 3.027 MB
 Metadata Space Total: 12.58 MB
 Metadata Space Available: 9.556 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: Employee SKU
CPUs: 2
Total Memory: 3.702 GiB
Name: rhel7-00.localdomain
ID: BOU7:OGFB:IV5G:Z5SH:EI6T:NGXN:RDKR:MHWM:IJUE:DYBU:ZATU:GENW
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

# systemctl stop docker

# systemctl start docker-latest
# lvs
  LV                 VG          Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool        rhel        twi-a-t--- 4.60g             59.27  24.06                           
  docker-latest-pool rhel-latest twi-a-t--- 3.19g             0.60   0.33                            
# systemctl status docker-latest
● docker-latest.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker-latest.service; disabled; vendor preset: disabled)
   Active: active (running) since Mon 2016-04-18 14:14:10 EDT; 4s ago

# for i in `seq 1 100`; do docker run -d nginx; done
# docker-latest stop $(docker-latest ps -a -q)
# dmsetup info | grep 'docker-8:1'
Name:              docker-8:1-41991960-b54610d86248a72e653dbbd34f222352d5690faa0a8464869722db94b88e4466
Name:              docker-8:1-41991960-404f2097bb6a53625f708907036d7c00339811939157734c759c826482f84187
Name:              docker-8:1-41991960-40fc396d0b2aa5dab237c2f2e76af84b7d48ab37deafd3ac3a2c43fe2f34e2f7
...

# dmsetup info
Name:              rhel--latest-docker--latest--pool
State:             ACTIVE
Read Ahead:        8192
Tables present:    LIVE
Open count:        99

# docker-latest info
Containers: 100
 Running: 0
 Paused: 0
 Stopped: 100
Images: 1
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: rhel--latest-docker--latest--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 3.102 GB
 Data Space Total: 3.427 GB
 Data Space Available: 324.5 MB
 Metadata Space Used: 3.039 MB
 Metadata Space Total: 12.58 MB
 Metadata Space Available: 9.544 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2015-12-01)
Execution Driver: native-0.2
Logging Driver: journald
Plugins: 
 Volume: local
 Network: host bridge null
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Operating System: Employee SKU
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 0
CPUs: 2
Total Memory: 3.702 GiB
Name: rhel7-00.localdomain
ID: BOU7:OGFB:IV5G:Z5SH:EI6T:NGXN:RDKR:MHWM:IJUE:DYBU:ZATU:GENW
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Registries: docker.io (secure)

Comment 10 Qian Cai 2016-04-18 18:48:34 UTC
(In reply to CAI Qian from comment #9)
> # for i in `seq 1 100`; do docker run -d nginx; done
Also tried docker-latest instead docker, and still same issue.
> # docker-latest stop $(docker-latest ps -a -q)
> # dmsetup info | grep 'docker-8:1'
> Name:             
> docker-8:1-41991960-
> b54610d86248a72e653dbbd34f222352d5690faa0a8464869722db94b88e4466
> Name:             
> docker-8:1-41991960-
> 404f2097bb6a53625f708907036d7c00339811939157734c759c826482f84187
> Name:             
> docker-8:1-41991960-
> 40fc396d0b2aa5dab237c2f2e76af84b7d48ab37deafd3ac3a2c43fe2f34e2f7
> ...

# ps aux | grep docker
...
root      8868  2.6  1.8 5259952 70544 ?       Sl   14:35   0:19 /usr/bin/docker-latest daemon --exec-opt native.cgroupdriver=systemd -g /var/lib/docker-latest --selinux-enabled --log-driver=journald --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/rhel--latest-docker--latest--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true

Comment 11 Qian Cai 2016-04-18 19:16:18 UTC
This time lots of those errors in the journal,

Apr 18 14:38:14 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:14.885603306-04:00" level=error msg="devmapper: Error unmounting device 421409e18955c41801c0ea5e9a9c18d08f9d2b7908804d06f86ec93d0bb3215e: invalid argument"
Apr 18 14:38:14 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:14.885640183-04:00" level=error msg="Error unmounting container 842e029fbc03ac6ee7cf1b3cdfe79c39ccbab41ded849e1c9c35abb7f3b8bc1a: invalid argument"
Apr 18 14:38:14 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:14.887038753-04:00" level=error msg="Failed to get pwuid struct: strconv.ParseInt: parsing \"Ϩ\": invalid syntax"
Apr 18 14:38:14 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:14.887059613-04:00" level=info msg="{Action=stop, ID=2d50e31a73c9f60b78f05fc9b87a483c0a2e3ed75577c236e60cf33f21f4207f, LoginUID=1000, PID=14470}"
Apr 18 14:38:15 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:15.148741288-04:00" level=error msg="devmapper: Error unmounting device 346562c043c92d1b64982c9a1c20f0491c68192bd42492fd7ccd289be9df424b: invalid argument"
Apr 18 14:38:15 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:15.148773849-04:00" level=error msg="Error unmounting container 2d50e31a73c9f60b78f05fc9b87a483c0a2e3ed75577c236e60cf33f21f4207f: invalid argument"
Apr 18 14:38:48 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:38:48.603985368-04:00" level=error msg="Failed to get pwuid struct: strconv.ParseInt: parsing \"Ϩ\": invalid syntax"
Apr 18 14:45:01 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T14:45:01.191656527-04:00" level=error msg="Failed to get pwuid struct: strconv.ParseInt: parsing \"Ϩ\": invalid syntax"

Comment 12 Qian Cai 2016-04-18 19:22:29 UTC
Again, # docker-latest rm $(docker-latest ps -a -q) can clean all the mess and destroyed all the dm-devices.

Comment 13 Qian Cai 2016-04-18 19:49:06 UTC
Occasionally, we saw those errors in the docker-latest journal too,

Apr 18 15:27:59 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T15:27:59.407826974-04:00" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker-latest/containers/eb90b7099eaa2a18fe3968551a4cee19515c02160ced417a040748238702e47d/shm: invalid argument"
Apr 18 15:27:59 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T15:27:59.407852802-04:00" level=error msg="Error unmounting container eb90b7099eaa2a18fe3968551a4cee19515c02160ced417a040748238702e47d: not mounted"
Apr 18 15:27:59 rhel7-00.localdomain forward-journal[8869]: time="2016-04-18T15:27:59.408034711-04:00" level=error msg="Handler for POST /v1.22/containers/eb90b7099eaa2a18fe3968551a4cee19515c02160ced417a040748238702e47d/start returned error: Container command not found or does not exist.

Comment 14 Qian Cai 2016-04-18 21:45:53 UTC
Created attachment 1148284 [details]
docker-latest daemon journal with debug enabled

Comment 18 Daniel Walsh 2016-06-03 13:05:21 UTC
Stephen can you take a look at this one.

Comment 19 Qian Cai 2016-06-20 13:35:39 UTC

*** This bug has been marked as a duplicate of bug 1347821 ***