Description of problem: When deploying Ceph on containerized setup with docker and using ceph-volume to create the OSD on LV devices, the execution time is very slow compared to baremetal. Version-Release number of selected component (if applicable): # docker --version Docker version 1.13.1, build b2f74b2/1.13.1 # rpm -qa docker docker-1.13.1-96.gitb2f74b2.el7.x86_64 # docker info (...) Server Version: 1.13.1 Storage Driver: overlay2 Backing Filesystem: xfs Supports d_type: true Native Overlay Diff: true Logging Driver: journald Cgroup Driver: systemd Plugins: Volume: local Network: bridge host macvlan null overlay Authorization: rhel-push-plugin Swarm: inactive Runtimes: docker-runc runc Default Runtime: docker-runc Init Binary: /usr/libexec/docker/docker-init-current containerd version: (expected: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1) runc version: 9c3c5f853ebf0ffac0d087e94daef462133b69c7 (expected: 9df8b306d01f59d3a8029be411de015b7304dd8f) init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: 949e6facb77383876aeff8a6944dde66b3089574) (...) RHCS 3.2 (ceph) # docker images REPOSITORY TAG IMAGE ID CREATED SIZE registry.access.redhat.com/rhceph/rhceph-3-rhel7 3-27 aa5f82a46d71 2 weeks ago 622 MB # docker run --rm --entrypoint=ceph registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 --version ceph version 12.2.8-128.el7cp (030358773c5213a14c1444a5147258672b2dc15f) luminous (stable) How reproducible: 100% Steps to Reproduce: 1. Run ceph-ansible containerized deployment with multiple OSD devices on a node and configured with lvm. Actual results: Creating 25 OSDs with ceph-volume from the container takes 2993.24s Expected results: On baremetal it only takes 624.10s Additional info: - lvm2 is installed on the host and vg/lv are created like this: * PV: /dev/sdb VG: vgsdb LV: data LV: db LV: wal * PV: /dev/sdc VG: vgsdc LV: data LV: db LV: wal ... * PV: /dev/sdz VG: vgsdz LV: data LV: db LV: wal - the container has access to the lvm devices from the host (via bind mounts) in order to run ceph-volume command from the container. When trying to list the lvm OSD devices from ceph-volume in the container: time docker run --rm --net=host --privileged=true --pid=host --ipc=host -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -v /var/run/udev/:/var/run/udev/:z -v /run/lvm/lvmetad.socket:/run/lvm/lvmetad.socket --entrypoint=ceph-volume registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 lvm list --format json (...) real 3m44.120s Now when using the ulimit docker option to cap the max open files then the execution time is completely different. time docker run --rm --ulimit nofile=1024:1024 --net=host --privileged=true --pid=host --ipc=host -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -v /var/run/udev/:/var/run/udev/:z -v /run/lvm/lvmetad.socket:/run/lvm/lvmetad.socket --entrypoint=ceph-volume registry.access.redhat.com/rhceph/rhceph-3-rhel7:3-27 lvm list --format json (...) real 0m32.796s I don't know why the ulimit nofile option has a such impact on ceph-volume. I noticed that the max open files values (soft & hard) aren't the same on the host and in the container when using the same user. Host: # ulimit -Sn 1024 # ulimit -Hn 4096 Container: # ulimit -Sn 1048576 # ulimit -Hn 1048576 I tried with different docker ulimit nofile values for the ceph-volume lvm list command: 1048576 3m44.120s (default) 524288 2m3.393s 262144 1m17.870s 131072 0m56.228s 65536 0m44.566s 32768 0m38.738s 16384 0m36.141s 8192 0m34.320s 4096 0m33.148s 2048 0m32.977s 1024 0m32.796s With the ulimit nofile option then OSDs creation decrease from 2993.24s to 1065.67s
This seems to be more of a cephs issue then a Docker issue.
Hi Daniel, could you elaborate a bit more? Any pointers would be appreciated. Thanks!
My reading of your description above is that manipulating the ULimits changes the performance of the tool. Unless you are looking to change the default settings for Docker, I don't see how this is a Docker issue. BTW Have you tried to experiment with Podman to see how it performs?
> BTW Have you tried to experiment with Podman to see how it performs? Podman has exactly the same issue.
Ok, again this is not an issue with the container engines. It might be an issue with rhceph container running with different Ulimits.
AFAIK rhceph container doesn't run with ulimits. BTW the default ulimit nofile value in the container (1048576 as mentionned in #1) is the same when using the default RHEL 7 container image.
Containers are normal processes on a system. Container engines setup default contraints on processes, like NOFILE, to control what those processes on the system do. They also take away the power of root for these proceses. For example Root has the CAPABILITY SYS_RESOURCE, which allows it to ignore constraints like NOFILE. BUT when a root process runs in a container, the container runtime by default removes SYS_RESOUCE Capability, so the root processes inside of the container can not ignore it. Bottom line on these processes is that the container engine (podman, docker, buildah, cri-o) just sets the defaults Ulimits Capabilties and other namepsaces, cgroups and security options on the kernel and then launches the processes. From that point on the container engine has little to do with the performance of the processes. That is more about the Kernel and the code of the processes. If the cephs code does not run well within these constraints, its code needs to be fixed or the person running the container with cephs needs to modify the container engines commands to make it run better. The container engines code does NOT need to be modified, and their is no bug in the Container Engine, in my opinion.
I agree, this doesn't seem to have anything to do with the container runtime. It only sets the limit without any further intervention. Do you hit the same issue if you just increase the ulimit without running in a container? The issue could also be in the kernel, although in the past I've ran different tests setting high ulimits values and I've never hit a similar issue.
I tried to run the same on baremetal (still 25 OSDs) and increasing the ulimit nofile values With default values: # ulimit -Sn && ulimit -Hn 1024 4096 # time ceph-volume lvm list --format json (...) real 0m30.832s user 0m8.690s sys 0m20.301 We almost have the same result than with container + ulimit nofile (1024) parameter Now let's increase the max open files value # ulimit -n 1048576 # ulimit -Sn && ulimit -Hn 1048576 1048576 # time ceph-volume lvm list --format json (...) real 3m36.879s user 1m2.011s sys 2m31.822s Which is close to the default execution time on container (ie: without the ulimit parameter) So it's definitly not something related to container engine.
Going through the details in this ticket, I don't see how ceph-volume can help here, as I consider this a system configuration rather than specific to the execution of ceph-volume. Dmitri, not sure to what/where do you want to re-assign this (it is unclear to me), let me know if you want to discuss this further.
An update, for posterity: This is bad behavior on the part of python2's subprocess module when close_fds=True (which is needed for other reasons I'm told): gnit:~ (master) 09:43 AM $ cat t.py import subprocess subprocess.call(['true'], close_fds=True) gnit:~ (master) 09:43 AM $ ulimit -n 100000 ; rm -f xx ; time strace -f -oxx python t.py ; grep -c close xx real 0m2.672s user 0m0.901s sys 0m2.327s 100106 gnit:~ (master) 09:43 AM $ ulimit -n 100 ; rm -f xx ; time strace -f -oxx python t.py ; grep -c close xx real 0m0.079s user 0m0.029s sys 0m0.060s 207 The good news is that python3 does not have this super-lame behavior: gnit:~ (master) 09:44 AM $ ulimit -n 100000 ; rm -f xx ; time strace -f -oxx python3 t.py ; grep -c close xx real 0m0.093s user 0m0.060s sys 0m0.040s 87 gnit:~ (master) 09:44 AM $ ulimit -n 1000 ; rm -f xx ; time strace -f -oxx python3 t.py ; grep -c close xx real 0m0.086s user 0m0.047s sys 0m0.045s 86
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0312