Description of problem:

ceph-disk repeatedly invokes a find command that brings a RHOSP computeosd node to its knees. System disk utilization hits 100% and docker commands take a very long time to run.

Version-Release number of selected component (if applicable):

RHOSP 13 GA dated 06-21
RHCS 3.0 container tag 3-9
Yes, I know this isn't the latest; we are working on that.

How reproducible:

Every time in our cluster, which is very large (476 OSDs).

Steps to Reproduce:
1. Deploy a large RHOSP 13 HCI cluster.
2. Populate the cluster with 147 TB of Cinder-RBD data (16% space used).
3. Not sure if there is a step 3, but I've never seen a time when this wasn't happening on the cluster after it was populated.

Actual results:

The system disk looks like this all the time:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdak              0.00    70.00    0.00  210.00     0.00     4.49    43.81   158.30  727.47    0.00  727.47   4.76  99.95

Expected results:

This should only happen once; each container should not be running the same find command over the same large set of directories and files. Note that /var/lib/ceph is shared by the containers. If you look at /usr/share/ceph-osd-run.sh you see:

/usr/bin/docker run \
...
-v /var/lib/ceph:/var/lib/ceph:z
...

which means that the OSD containers are all accessing the same /var/lib/ceph tree on the computeosd host.

Additional info:

The command appears to originate from /usr/lib/python2.7/site-packages/ceph_disk/main.py in the main_fix() subroutine. The symptom is that a large number of find processes appear and start to consume significant system resources; they look like this:

[root@overcloud-compute-11 ~]# ps awux | grep find | grep -v grep
root      186656  8.6  0.0  23740  8188 ?   S  15:44  0:05 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      237569  8.9  0.0  23740  8188 ?   S  15:45  0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      237570  9.0  0.0  23740  8184 ?   S  15:45  0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      248529  8.6  0.0  23872  8192 ?   S  15:45  0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      251811  8.9  0.0  23872  8188 ?   S  15:45  0:01 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;

I have seen as many as 12 of these running. These find processes take so long to run because of this (it happens on all computeosd nodes):

[root@overcloud-compute-11 ~]# find /var/lib/ceph/tmp -name 'tmp.*' | wc -l
23871

Why are there so many of these? I have to keep deleting them with

# find /var/lib/ceph/tmp -name 'tmp.*' -exec rmdir {} \;

or else the system becomes unusable. This is probably a separate bz in itself. I could not find where these directories were being created in the ceph-disk code.
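For anyone clearing these by hand, here is the rmdir loop above written a bit more defensively as a small script; restricting it to empty directories and to entries older than an hour are my own safety assumptions, not anything ceph-disk guarantees:

#!/bin/bash
# Sketch only: prune stale ceph-disk temp directories under /var/lib/ceph/tmp.
# Only touches empty tmp.* directories at the top level, and only ones older
# than 60 minutes (an arbitrary margin to avoid racing a live OSD activation).
find /var/lib/ceph/tmp -mindepth 1 -maxdepth 1 -type d -name 'tmp.*' \
     -mmin +60 -empty -exec rmdir {} \;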
I've noticed over 100,000 of these messages in /var/log/messages on the cluster; perhaps this is related to why there are so many of these /var/lib/ceph/tmp/tmp.* directories:

Sep 19 15:10:28 overcloud-compute-11 ceph-osd-run.sh: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long
Sep 19 15:10:28 overcloud-compute-11 journal: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long

Why is this coming from /usr/share/ceph-osd-run.sh? It only executes 2 docker run commands. Why is the file name too long?
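My guess at why the name is too long (the trace further down bears this out): the container resolves the OSD device with a wildcard, gets a list of partition UUIDs instead of one, and hands the whole space-joined list to mount as a single source path, whose final component then exceeds the kernel's 255-byte NAME_MAX. A throwaway sketch with generated UUIDs that reproduces just the length arithmetic:

#!/bin/bash
# Sketch only, with freshly generated UUIDs: build the kind of mount source the
# failing trace shows (one by-partuuid prefix, ten space-separated UUIDs).
uuids=$(for i in $(seq 10); do uuidgen; done | paste -sd' ' -)
src="/dev/disk/by-partuuid/${uuids}"
echo "mount source is ${#src} bytes"   # ~390 bytes, far past NAME_MAX (255),
                                       # which is why mount reports "File name too long"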
I was able to stop creation of these directories on one host, overcloud-compute-6, and I now have a theory about what's wrong. Adding Alfredo because he might be able to fill in the missing pieces of the puzzle.

To stop the directory creation, we had to "ceph-disk zap /dev/sda" because of the problem described below. I then found that Ceph kept trying to activate /dev/sdaj, which had no partition table! So I stopped and disabled ceph-osd, and now no directories are being created.

When containerized Ceph tries to start an OSD on /dev/sda and there are more than 26 OSDs, device naming wraps around and the kernel starts handing out names like /dev/sdaa, /dev/sdab, ... Apparently something in the container image gets totally confused by that and matches multiple devices; I know this because we typically have 34 OSDs per node in this cluster. This is a major bug. The assumption seems to be that sda is the system disk, or that there aren't enough other disk devices for the wildcarding to misbehave. I'm hypothesizing that the /var/lib/ceph/tmp/tmp.* directories (which started this bz) are left over from this process, and am verifying that this is the root cause. The same problem could be recreated in a virtual machine environment if there were enough OSD devices available (i.e. > 26). BTW, we updated the software version and it still happens.

Background: although I haven't looked at the source code, I think this is how Ceph OSD startup works. If it sees a partition with a PART_ENTRY_NAME attribute containing "ceph data", or perhaps a PART_ENTRY_TYPE of 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, it assumes that the partition is an OSD and tries to start it. This command shows which devices contain OSDs:

# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_TYPE -s PART_ENTRY_NAME $n ; done | grep sda
/dev/sda1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdaa1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdab1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdac1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdad1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdae1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdaf1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdag1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdah1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdai1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
/dev/sdak1: PART_ENTRY_TYPE="0x83"
...

/dev/sdak is the system disk here; that's just the way the scale lab works.

For each such OSD partition, it first mounts the filesystem, then reads the OSD number from the "whoami" file, then finds the journal device via the "journal" symlink. It should then be able to do what you and I did and start the OSD, as is done in /usr/share/ceph-osd-run.sh. But this doesn't work. Why? It's simple: it is using a wildcard, so it matches more than one device. A minimal demo of the effect is sketched below; the trace in the next comment shows it happening for real.
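A minimal demonstration of the wildcard effect I suspect (the device name is just an example; I have not confirmed the exact glob the container image uses):

#!/bin/bash
# Sketch only: an unanchored glob on a short device name also matches the
# wrapped-around names, while anchoring on the partition number does not.
dev=sda                       # hypothetical OSD device name passed to the script
ls -d /dev/${dev}*            # matches /dev/sda, /dev/sda1, /dev/sdaa, /dev/sdaa1, ...
ls -d /dev/${dev}[0-9]*       # matches only /dev/sda1, /dev/sda2, ... (partitions of sda)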
See this output:

# bash -x /usr/share/ceph-osd-run.sh sda
+ DOCKER_ENV=
+ expose_partitions sda
++ docker run --rm --net=host --name expose_partitions_sda --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda 192.168.24.1:8787/rhceph:3-12 disk_list
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 129df299-0142-4091-9a54-01411450df3a fbe6babc-1164-40bc-86b4-0422c18d9b90 2735bdda-c14d-4245-b1c2-3b8098ddd7bf 7b9bba9a-8c0a-4152-be57-476005beb126 1e9647ff-f09a-4ee2-8105-af17a220da12 046aae50-35a4-4335-a461-fced600503e5 bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.gy28EWV48j failed: File name too long
+ DOCKER_ENV=
+ docker rm -f expose_partitions_sda
Error response from daemon: No such container: expose_partitions_sda
+ /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=5g --cpu-quota=100000 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -e OSD_FILESTORE=1 -e OSD_DMCRYPT=0 -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE --name=ceph-osd-overcloud-compute-12-sda 192.168.24.1:8787/rhceph:3-12
2018-10-01 22:19:37 /entrypoint.sh: static: does not generate config
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 129df299-0142-4091-9a54-01411450df3a fbe6babc-1164-40bc-86b4-0422c18d9b90 2735bdda-c14d-4245-b1c2-3b8098ddd7bf 7b9bba9a-8c0a-4152-be57-476005beb126 1e9647ff-f09a-4ee2-8105-af17a220da12 046aae50-35a4-4335-a461-fced600503e5 bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.cq8VM6tZNo failed: File name too long

Note that this is exactly what would happen if you matched /dev/sda*: you'd also get /dev/sdaa, /dev/sdab, ... The list below matches precisely what is up above.

# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_UUID $n ; done | grep /dev/sda
/dev/sda1: PART_ENTRY_UUID="867ef9c4-2ae1-4c61-b99f-786646807fba"
/dev/sdaa1: PART_ENTRY_UUID="d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61"
/dev/sdab1: PART_ENTRY_UUID="5ad89aa8-0dcf-42b6-8e12-3aa916dee91c"
/dev/sdac1: PART_ENTRY_UUID="129df299-0142-4091-9a54-01411450df3a"
/dev/sdad1: PART_ENTRY_UUID="fbe6babc-1164-40bc-86b4-0422c18d9b90"
/dev/sdae1: PART_ENTRY_UUID="2735bdda-c14d-4245-b1c2-3b8098ddd7bf"
/dev/sdaf1: PART_ENTRY_UUID="7b9bba9a-8c0a-4152-be57-476005beb126"
/dev/sdag1: PART_ENTRY_UUID="1e9647ff-f09a-4ee2-8105-af17a220da12"
/dev/sdah1: PART_ENTRY_UUID="046aae50-35a4-4335-a461-fced600503e5"
/dev/sdai1: PART_ENTRY_UUID="bed87a5c-d799-4cc1-8d6a-1edcbfca0d93"
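For contrast, a hedged sketch of a per-device lookup that cannot wrap around to /dev/sdaa and friends, because it asks the kernel which partitions belong to the exact block device instead of globbing on the device name (the lsblk columns are real; the script itself is illustrative, not the actual ceph-container code):

#!/bin/bash
# Sketch only: list by-partuuid paths for the partitions of one specific device,
# rather than matching /dev/sda* (which also picks up /dev/sdaa, /dev/sdab, ...).
dev=/dev/sda                          # hypothetical input device
lsblk -ln -o NAME,PARTUUID "${dev}" \
    | awk 'NF == 2 {print "/dev/disk/by-partuuid/" $2}'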
The OSD names wouldn't have to be consistent across the nodes if bz 1438590 were fixed (discover OSDs by rule instead of enumerating them); I reported that a year and a half ago. For example: "make all the 2-TB HDDs into OSDs, use the NVM devices as journals". This is the only approach that works for a large, heterogeneous cluster.
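To make the rule-based idea concrete, here is the kind of selection I mean, sketched with lsblk; the 2-TB size window and keying on the rotational flag are my own illustration, not necessarily what bz 1438590 proposes verbatim:

#!/bin/bash
# Sketch only: discover OSD candidates by rule (whole rotational disks of ~2 TB)
# instead of enumerating device names. The size window is illustrative.
lsblk -dnb -o NAME,SIZE,ROTA,TYPE \
    | awk '$4 == "disk" && $3 == 1 && $2 > 1.8e12 && $2 < 2.2e12 {print "/dev/" $1}'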
We cannot verify this bug because we lack the hardware needed to check it. @bengland, will you be able to retest this scenario in the scale lab?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4354
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days