Bug 1477486 - [DOCKER] /var/lib/docker gets full over time when docker uses max-file > 1 and the OpenShift web console keeps watching logs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.2.1
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Nalin Dahyabhai
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-02 08:40 UTC by Nicolas Nosenzo
Modified: 2021-12-10 15:11 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-09 14:31:11 UTC
Target Upstream Version:
Embargoed:



Description Nicolas Nosenzo 2017-08-02 08:40:15 UTC
Description of problem:
Docker does not properly rotate logs when max-file is > 1 and the log file descriptors are held open (e.g., the logs are being watched through the web console) at the moment of rotation.

If this continues for several days, /var/lib/docker fills up and the kubelet reports an "OutOfSpace" error.

Stopping the log watch from the GUI does not free up the files reported as "deleted".

As a workaround to recover disk space in /var/lib/docker we can restart docker, but that is not sustainable in the long term.
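
For reference, a minimal sketch of how the restart workaround can be applied and verified on an affected node (assuming root access; the lsof/df checks simply confirm that the space held by the (deleted) files is returned once docker's descriptors are closed):

# df -h /var/lib/docker
# systemctl restart docker
# lsof -a +L1 | grep /var/lib/docker/
# df -h /var/lib/docker

After the restart the lsof command should return nothing and df should show the reclaimed space.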


Version-Release number of selected component (if applicable):
Docker 1.10.3-59
OpenShift 3.2.1-15

How reproducible:
100%

Steps to Reproduce:

Current configuration:

/usr/bin/docker-current daemon --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --selinux-enabled --log-driver=json-file --log-opt max-size=50m --log-opt max-file=3 --storage-driver devicemapper --storage-opt dm.fs=xfs --storage-opt dm.thinpooldev=/dev/mapper/docker--vg-docker--pool --storage-opt dm.use_deferred_removal=true --storage-opt dm.use_deferred_deletion=true -b=lbr0 --mtu=1450 --add-registry registry.access.redhat.com


Procedure to replicate the issue:

I have the following logs:

# ls -il /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log*
787851 -r--------. 1 root root 40233476 Jul 24 11:54 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log
787243 -r--------. 1 root root 50000052 Jul 24 11:49 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.1
788262 -r--------. 1 root root 50000039 Jul 24 11:43 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2

Now start watching the related pod's log from the OpenShift web console:

# ls -il /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log*
787851 -r--------. 1 root root 49874783 Jul 24 11:55 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log
787243 -r--------. 1 root root 50000052 Jul 24 11:49 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.1
788262 -r--------. 1 root root 50000039 Jul 24 11:43 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2


Since the rotation occurred, we now see the first "deleted" file, which still holds disk space even though it is no longer accessible except through Docker's open file descriptor:
 
# lsof -a +L1 | grep /var/lib/docker/
docker-cu 2624    root   55r   REG  253,1 50000039     0  788262 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)

# ls -il /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log*
787878 -r--------. 1 root root 15362832 Jul 24 11:57 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log
787851 -r--------. 1 root root 50000187 Jul 24 11:55 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.1
787243 -r--------. 1 root root 50000052 Jul 24 11:49 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2


This is about to rotate again:

# ls -il /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log*
787878 -r--------. 1 root root 48125666 Jul 24 12:01 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log
787851 -r--------. 1 root root 50000187 Jul 24 11:55 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.1
787243 -r--------. 1 root root 50000052 Jul 24 11:49 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2

A rotation occurred:

# ls -il /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log*
788260 -r--------. 1 root root  1820555 Jul 24 12:01 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log
787878 -r--------. 1 root root 50000501 Jul 24 12:01 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.1
787851 -r--------. 1 root root 50000187 Jul 24 11:55 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2

Now there are 2 handles left open on deleted log files:

# lsof -a +L1 | grep /var/lib/docker/
docker-cu 2624    root   55r   REG  253,1 50000039     0  788262 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)
docker-cu 2624    root   70r   REG  253,1 50000052     0  787243 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)

And so on after each subsequent rotation:
# lsof -a +L1 | grep /var/lib/docker/
docker-cu 2624    root   55r   REG  253,1 50000039     0  788262 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)
docker-cu 2624    root   70r   REG  253,1 50000052     0  787243 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)
docker-cu 2624    root   71r   REG  253,1 50000187     0  787851 /var/lib/docker/containers/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c/5891494323dcef09e051473e2f4e1172683f4a655e82c32d189ca9f6f973707c-json.log.2 (deleted)
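
(For context: "lsof +L1" selects open files whose link count is less than one, i.e. files that have been unlinked but are still held open by a process. A rough way to total the space these deleted log files still pin down, sketched here by summing the SIZE column of the output above with awk, is:

# lsof -a +L1 | grep /var/lib/docker/ | awk '{sum += $7} END {printf "%.1f MiB held by deleted files\n", sum/1024/1024}'

With the three handles shown above this reports roughly 143 MiB that cannot be reclaimed until docker is restarted.)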

Actual results:

Open file descriptors are not released on the next Docker log rotation. The deleted files keep occupying space on the filesystem and can eventually fill /var/lib/docker.

Expected results:

The file descriptors of rotated-out files are closed during a Docker log rotation, so that the files shown as (deleted) in lsof output actually release their disk space.

Additional info:

Comment 1 Ian McLeod 2017-08-02 15:10:08 UTC
This looks like expected VFS behavior. If you hold a file descriptor open and then delete the file, the blocks remain in use until the descriptor is closed.
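
A minimal shell sketch of that behavior outside of docker, using a throwaway file and tail(1) to hold a descriptor open:

# dd if=/dev/zero of=/tmp/fd-demo.log bs=1M count=100
# tail -f /tmp/fd-demo.log &
# rm /tmp/fd-demo.log
# lsof -a +L1 | grep fd-demo.log
# df -h /tmp
# kill %1

The lsof line lists the file as (deleted), and df does not drop until the tail process, i.e. the last open descriptor, goes away.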

You have asserted: "Stopping the log watch from the GUI does not free up the files reported as 'deleted'."

This is odd. It suggests that stopping the GUI watch doesn't actually close the FD. We should get to the bottom of that if it is, indeed, happening.

Can you reproduce with a series of lsof outputs both before and _after_ the GUI log watch process is closed?

Comment 2 Nicolas Nosenzo 2017-08-04 08:05:50 UTC
@Ian, I'm in the process of reproducing it; I will let you know my findings shortly.

Comment 3 Nicolas Nosenzo 2017-08-04 13:17:08 UTC
Ok, I could reproduce it:

--- About to rotate:

# tail rot.logs -n3
-rw-------. 1 root root 1000159 Aug  3 21:35 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.2
-rw-r--r--. 1 root root 1000165 Aug  4 05:37 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.1
-rw-r--r--. 1 root root 994780 Aug  4 13:36 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log

If I check the open files that have been unlinked:

# lsof -a +L1 |  grep bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69

Shows nothing.

However, if I run the following (just to show that there are, at least, open files with one link):

# lsof -a +L2 | grep bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69

docker-cu 125850     root   10w      REG  253,0    992706     1   17725690 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log
docker-cu 125850     root   28r      REG   0,25         0     1   21312546 /sys/fs/cgroup/memory/system.slice/docker-bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69.scope/memory.oom_control
docker-cu 125850     root   37r      REG  253,0   1000159     1   17725702 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.2
docker-cu 125850     root   47r      REG  253,0   1000165     1   17924919 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.1
docker-cu 125850     root   50r      REG  253,0    992706     1   17725690 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log
fluentd   129272     root   16r      REG  253,0    992706     1   17725690 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log


--- Rotated (with the console's watch process still open):

# tail rot.logs -n3
-rw-r--r--. 1 root root 1000165 Aug  4 05:37 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.2
-rw-r--r--. 1 root root 1000141 Aug  4 13:39 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.1
-rw-r--r--. 1 root root 11242 Aug  4 13:44 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log

# lsof -a +L1 | grep containers
docker-cu 125850 root   37r   REG  253,0  1000159     0 17725702 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.2 (deleted)


--- After logging out of the console, the file descriptor is still there:


# lsof -a +L1 |  grep bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69
docker-cu 125850 root   37r   REG  253,0  1000159     0 17725702 /var/lib/docker/containers/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69/bfc9b78e9d2bbd7a3bc39d5900e28850c837646dfd87201cfa5a100190707a69-json.log.2 (deleted)

Comment 4 Nicolas Nosenzo 2017-08-23 08:33:14 UTC
As a workaround, changing the docker configuration on the nodes to use "max-file=1" (instead of "max-file=3"), so that docker always writes to the same log file descriptor, seems to avoid this bug.
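
A sketch of applying that change on a node, assuming the daemon flags live on the OPTIONS line of /etc/sysconfig/docker as in the configuration quoted in the description (docker has to be restarted for the new value to take effect, which also releases any descriptors it is still holding):

# sed -i 's/--log-opt max-file=3/--log-opt max-file=1/' /etc/sysconfig/docker
# systemctl restart docker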

Comment 6 Nicolas Nosenzo 2018-09-09 14:31:11 UTC
Tested it in OCP 3.6 and the issue is gone. Closing this BZ as current release.

