1722562 – ulimit nofile option impacts process execution time

Bug 1722562 - ulimit nofile option impacts process execution time

Summary: ulimit nofile option impacts process execution time

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Ansible
Sub Component:
Version:	3.2
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	4.0
Assignee:	Dimitri Savineau
QA Contact:	Vasishta
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-06-20 16:02 UTC by Dimitri Savineau
Modified:	2020-01-31 12:46 UTC (History)
CC List:	16 users (show)
Fixed In Version:	ceph-ansible-4.0.0-0.1.rc14.el8cp
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-01-31 12:46:20 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ceph ceph-ansible pull 4317	'None'	closed	ceph-osd: Add ulimit nofile on container start	2021-01-04 14:24:02 UTC
Github	ceph ceph-ansible pull 4372	'None'	closed	ceph-osd: Add ulimit nofile on container start (bp #4317)	2021-01-04 14:24:02 UTC
Red Hat Product Errata	RHBA-2020:0312	None	None	None	2020-01-31 12:46:56 UTC

Description Dimitri Savineau 2019-06-20 16:02:22 UTC

Description of problem:
When deploying Ceph on containerized setup with docker and using ceph-volume to create the OSD on LV devices, the execution time is very slow compared to baremetal.

Version-Release number of selected component (if applicable):
# docker --version
Docker version 1.13.1, build b2f74b2/1.13.1
# rpm -qa docker
docker-1.13.1-96.gitb2f74b2.el7.x86_64
# docker info
(...)
Server Version: 1.13.1
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Init Binary: /usr/libexec/docker/docker-init-current
containerd version:  (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1)
runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f)
init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
(...)

RHCS 3.2 (ceph)
# docker images
REPOSITORY                                         TAG                 IMAGE ID            CREATED             SIZE
registry.access.redhat.com/rhceph/rhceph-3-rhel7   3-27                aa5f82a46d71        2 weeks ago         622 MB
# docker run --rm --entrypoint=ceph registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 --version
ceph version 12.2.8-128.el7cp (030358773c5213a14c1444a5147258672b2dc15f) luminous (stable)

How reproducible:
100%

Steps to Reproduce:
1. Run ceph-ansible containerized deployment with multiple OSD devices on a node and configured with lvm.

Actual results:
Creating 25 OSDs with ceph-volume from the container takes 2993.24s

Expected results:
On baremetal it only takes 624.10s

Additional info:

- lvm2 is installed on the host and vg/lv are created like this:
  * PV: /dev/sdb
    VG: vgsdb
      LV: data
      LV: db
      LV: wal
  * PV: /dev/sdc
    VG: vgsdc
      LV: data
      LV: db
      LV: wal
    ...
  * PV: /dev/sdz
    VG: vgsdz
      LV: data
      LV: db
      LV: wal

- the container has access to the lvm devices from the host (via bind mounts) in order to run ceph-volume command from the container.

When trying to list the lvm OSD devices from ceph-volume in the container:

time docker run --rm --net=host --privileged=true --pid=host --ipc=host -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -v /var/run/udev/:/var/run/udev/:z -v /run/lvm/lvmetad.socket:/run/lvm/lvmetad.socket --entrypoint=ceph-volume registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 lvm list --format json
(...)
real	3m44.120s

Now when using the ulimit docker option to cap the max open files then the execution time is completely different.

time docker run --rm --ulimit nofile=1024:1024 --net=host --privileged=true --pid=host --ipc=host -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -v /var/run/udev/:/var/run/udev/:z -v /run/lvm/lvmetad.socket:/run/lvm/lvmetad.socket --entrypoint=ceph-volume registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 lvm list --format json
(...)
real	0m32.796s

I don't know why the ulimit nofile option has a such impact on ceph-volume.

I noticed that the max open files values (soft & hard) aren't the same on the host and in the container when using the same user.

Host:
# ulimit -Sn
1024
# ulimit -Hn
4096

Container:
# ulimit -Sn
1048576
# ulimit -Hn
1048576

I tried with different docker ulimit nofile values for the ceph-volume lvm list command:

1048576 3m44.120s (default)
 524288 2m3.393s
 262144 1m17.870s
 131072 0m56.228s
  65536 0m44.566s
  32768 0m38.738s
  16384 0m36.141s
   8192 0m34.320s
   4096 0m33.148s
   2048 0m32.977s
   1024 0m32.796s

With the ulimit nofile option then OSDs creation decrease from 2993.24s to 1065.67s

Comment 2 Daniel Walsh 2019-06-20 17:05:19 UTC

This seems to be more of a cephs issue then a Docker issue.

Comment 3 Guillaume Abrioux 2019-06-21 07:09:16 UTC

Hi Daniel,

could you elaborate a bit more? Any pointers would be appreciated.
Thanks!

Comment 4 Daniel Walsh 2019-06-21 11:33:32 UTC

My reading of your description above is that manipulating the ULimits changes the performance of the tool.  
Unless you are looking to change the default settings for Docker, I don't see how this is a Docker issue.

BTW Have you tried to experiment with Podman to see how it performs?

Comment 5 Dimitri Savineau 2019-06-21 13:17:32 UTC

> BTW Have you tried to experiment with Podman to see how it performs?

Podman has exactly the same issue.

Comment 6 Daniel Walsh 2019-06-21 18:50:41 UTC

Ok, again this is not an issue with the container engines.  It might be an issue with rhceph container running with different Ulimits.

Comment 7 Dimitri Savineau 2019-06-21 19:12:22 UTC

AFAIK rhceph container doesn't run with ulimits.

BTW the default ulimit nofile value in the container (1048576 as mentionned in #1) is the same when using the default RHEL 7 container image.

Comment 9 Daniel Walsh 2019-06-23 09:27:29 UTC

Containers are normal processes on a system.  Container engines setup default contraints on processes, like NOFILE, to control what those processes on the system do.
They also take away the power of root for these proceses.  For example Root has the CAPABILITY SYS_RESOURCE, which allows it to ignore constraints like NOFILE.

BUT when a root process runs in a container, the container runtime by default removes SYS_RESOUCE Capability, so the root processes inside of the container
can not ignore it.

Bottom line on these processes is that the container engine (podman, docker, buildah, cri-o) just sets the defaults Ulimits Capabilties and other namepsaces, cgroups and security options on the kernel 
and then launches the processes.  From that point on the container engine has little to do with the performance of the processes. That is more about the Kernel and the code of the processes.

If the cephs code does not run well within these constraints, its code needs to be fixed or the person running the container with cephs needs to modify the container engines commands to make it run better.

The container engines code does NOT need to be modified, and their is no bug in the Container Engine, in my opinion.

Comment 10 Giuseppe Scrivano 2019-06-23 18:59:20 UTC

I agree, this doesn't seem to have anything to do with the container runtime.  It only sets the limit without any further intervention.  Do you hit the same issue if you just increase the ulimit without running in a container?

The issue could also be in the kernel, although in the past I've ran different tests setting high ulimits values and I've never hit a similar issue.

Comment 11 Dimitri Savineau 2019-06-26 19:41:23 UTC

I tried to run the same on baremetal (still 25 OSDs) and increasing the ulimit nofile values

With default values:

# ulimit -Sn && ulimit -Hn
1024
4096
# time ceph-volume lvm list --format json
(...)
real	0m30.832s
user	0m8.690s
sys	0m20.301

We almost have the same result than with container + ulimit nofile (1024) parameter

Now let's increase the max open files value

# ulimit -n 1048576
# ulimit -Sn && ulimit -Hn
1048576
1048576
# time ceph-volume lvm list --format json
(...)
real	3m36.879s
user	1m2.011s
sys	2m31.822s

Which is close to the default execution time on container (ie: without the ulimit parameter)

So it's definitly not something related to container engine.

Comment 13 Alfredo Deza 2019-08-21 11:47:05 UTC

Going through the details in this ticket, I don't see how ceph-volume can help here, as I consider this a system configuration rather than specific to the execution of ceph-volume. Dmitri, not sure to what/where do you want to re-assign this (it is unclear to me), let me know if you want to discuss this further.

Comment 18 Sage Weil 2019-11-13 15:51:43 UTC

An update, for posterity:

This is bad behavior on the part of python2's subprocess module when close_fds=True (which is needed for other reasons I'm told):

gnit:~ (master) 09:43 AM $ cat t.py
import subprocess
subprocess.call(['true'], close_fds=True)

gnit:~ (master) 09:43 AM $ ulimit -n 100000 ; rm -f xx ; time strace -f -oxx python t.py ; grep -c close xx

real    0m2.672s
user    0m0.901s
sys     0m2.327s
100106
gnit:~ (master) 09:43 AM $ ulimit -n 100 ; rm -f xx ; time strace -f -oxx python t.py ; grep -c close xx

real    0m0.079s
user    0m0.029s
sys     0m0.060s
207

The good news is that python3 does not have this super-lame behavior:

gnit:~ (master) 09:44 AM $ ulimit -n 100000 ; rm -f xx ; time strace -f -oxx python3 t.py ; grep -c close xx

real    0m0.093s
user    0m0.060s
sys     0m0.040s
87
gnit:~ (master) 09:44 AM $ ulimit -n 1000 ; rm -f xx ; time strace -f -oxx python3 t.py ; grep -c close xx

real    0m0.086s
user    0m0.047s
sys     0m0.045s
86

Comment 23 errata-xmlrpc 2020-01-31 12:46:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312

Note You need to log in before you can comment on or make changes to this bug.