Bug 1628652
| Summary: | ceph container with more than 26 osds on a host fails to run | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | Container | Assignee: | Dimitri Savineau <dsavinea> |
| Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.0 | CC: | adeza, ceph-eng-bugs, ceph-qe-bugs, dhill, dsavinea, dwilson, gabrioux, gfidente, johfulto, mmurthy, nojha, pasik, tchandra, tserlin, twilkins, ykaul, yrabl |
| Target Milestone: | z2 | Keywords: | Performance |
| Target Release: | 3.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhceph:ceph-3.3-rhel-7-containers-candidate-48122-20191024075852 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1784047 (view as bug list) | Environment: | |
| Last Closed: | 2019-12-19 17:44:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1578730, 1784047 | | |
Description
Ben England
2018-09-13 16:01:26 UTC
I've noticed over 100,000 of these messages in /var/log/messages on the cluster; perhaps this is related to why there are so many of these /var/lib/ceph/tmp/tmp.* directories:

    Sep 19 15:10:28 overcloud-compute-11 ceph-osd-run.sh: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long
    Sep 19 15:10:28 overcloud-compute-11 journal: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long

Why is this coming from /usr/share/ceph-osd-run.sh? It only executes 2 docker run commands. And why is the file name too long?

I was able to stop the creation of these directories on one host, overcloud-compute-6, and I now have a theory about what's wrong. Adding Alfredo because he might be able to fill in the missing pieces of the puzzle. To stop the directory creation, we had to "ceph-disk zap /dev/sda" because of the problem described below. I then found that Ceph kept trying to activate /dev/sdaj, which had no partition table! So I stopped and disabled ceph-osd, and now no directories are being created.

When containerized Ceph tries to start an OSD on /dev/sda and there are more than 26 OSDs on the host, device naming wraps around and the kernel hands out names like /dev/sdaa, /dev/sdab, and so on. Something in the container image gets completely confused by that and matches multiple devices. I know this because this cluster typically has 34 OSDs per host. This is a major bug: the code assumes that sda is the system disk, or that there are never enough other disk devices for this wildcard matching to break. I'm hypothesizing that the /var/lib/ceph/tmp/tmp.* directories (which started this bz) are left over from this process; I am verifying that this is the root cause. The same problem could be recreated in a virtual machine environment with enough OSD devices available (i.e., more than 26). BTW, we updated the software version and it still happens.

Background: although I haven't looked at the source code, I think this is how Ceph OSD startup works. If it sees a partition with a PART_ENTRY_NAME attribute containing "ceph data", or perhaps a PART_ENTRY_TYPE containing 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, it assumes that the partition is an OSD and tries to start it. This command shows which devices contain OSDs:

    # for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_TYPE -s PART_ENTRY_NAME $n ; done | grep sda
    /dev/sda1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdaa1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdab1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdac1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdad1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdae1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdaf1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdag1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdah1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdai1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d"
    /dev/sdak1: PART_ENTRY_TYPE="0x83"
    ...

/dev/sdak is the system disk here; it's just the way the scale lab works.
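The collision seen in the output above can be reproduced with a plain shell one-liner. The snippet below is only an illustration of the matching problem; the grep patterns are mine, not the ones used inside the container image. An unanchored match on "/dev/sda" also picks up /dev/sdaa1, /dev/sdab1, and so on, while anchoring on the partition number restricts the match to sda's own partitions:

    # Illustration only: these patterns are examples, not the container's code.
    DEV=/dev/sda

    # Unanchored prefix match: on a host with more than 26 data disks this also
    # matches /dev/sdaa1, /dev/sdab1, ... because they share the "/dev/sda" prefix.
    for p in /dev/sd*[0-9]; do blkid -p -s PART_ENTRY_NAME "$p"; done | grep "^${DEV}"

    # Anchored match: only partitions of /dev/sda itself (sda1, sda2, ...).
    for p in /dev/sd*[0-9]; do blkid -p -s PART_ENTRY_NAME "$p"; done | grep "^${DEV}[0-9]"

Run as root on a host like the one above, the first loop prints the sdaa1..sdai1 partitions as well, even though only sda was asked for; the second loop prints sda1 alone.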
For each such OSD partition, it first mounts the filesystem, then discovers the OSD number from the "whoami" file, then discovers the journal device from the "journal" symlink. It should then be able to do what you and I did and start the OSD, as is done in /usr/share/ceph-osd-run.sh. But this doesn't work. Why? It's simple: it uses a wildcard, so it matches more than one device. See this output:

    # bash -x /usr/share/ceph-osd-run.sh sda
    + DOCKER_ENV=
    + expose_partitions sda
    ++ docker run --rm --net=host --name expose_partitions_sda --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda 192.168.24.1:8787/rhceph:3-12 disk_list
    mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 129df299-0142-4091-9a54-01411450df3a fbe6babc-1164-40bc-86b4-0422c18d9b90 2735bdda-c14d-4245-b1c2-3b8098ddd7bf 7b9bba9a-8c0a-4152-be57-476005beb126 1e9647ff-f09a-4ee2-8105-af17a220da12 046aae50-35a4-4335-a461-fced600503e5 bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.gy28EWV48j failed: File name too long
    + DOCKER_ENV=
    + docker rm -f expose_partitions_sda
    Error response from daemon: No such container: expose_partitions_sda
    + /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=5g --cpu-quota=100000 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -e OSD_FILESTORE=1 -e OSD_DMCRYPT=0 -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE --name=ceph-osd-overcloud-compute-12-sda 192.168.24.1:8787/rhceph:3-12
    2018-10-01 22:19:37 /entrypoint.sh: static: does not generate config
    mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 129df299-0142-4091-9a54-01411450df3a fbe6babc-1164-40bc-86b4-0422c18d9b90 2735bdda-c14d-4245-b1c2-3b8098ddd7bf 7b9bba9a-8c0a-4152-be57-476005beb126 1e9647ff-f09a-4ee2-8105-af17a220da12 046aae50-35a4-4335-a461-fced600503e5 bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.cq8VM6tZNo failed: File name too long

Note that this is exactly what would happen if you matched /dev/sda*: you would also get /dev/sdaa, /dev/sdab, and so on. The list below matches precisely the UUIDs in the mount errors above:

    # for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_UUID $n ; done | grep /dev/sda
    /dev/sda1: PART_ENTRY_UUID="867ef9c4-2ae1-4c61-b99f-786646807fba"
    /dev/sdaa1: PART_ENTRY_UUID="d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61"
    /dev/sdab1: PART_ENTRY_UUID="5ad89aa8-0dcf-42b6-8e12-3aa916dee91c"
    /dev/sdac1: PART_ENTRY_UUID="129df299-0142-4091-9a54-01411450df3a"
    /dev/sdad1: PART_ENTRY_UUID="fbe6babc-1164-40bc-86b4-0422c18d9b90"
    /dev/sdae1: PART_ENTRY_UUID="2735bdda-c14d-4245-b1c2-3b8098ddd7bf"
    /dev/sdaf1: PART_ENTRY_UUID="7b9bba9a-8c0a-4152-be57-476005beb126"
    /dev/sdag1: PART_ENTRY_UUID="1e9647ff-f09a-4ee2-8105-af17a220da12"
    /dev/sdah1: PART_ENTRY_UUID="046aae50-35a4-4335-a461-fced600503e5"
    /dev/sdai1: PART_ENTRY_UUID="bed87a5c-d799-4cc1-8d6a-1edcbfca0d93"

The OSD names wouldn't have to be consistent across nodes if bz 1438590 (discover OSDs by rule, don't enumerate them) were fixed; I reported that a year and a half ago. For example: "make all the 2-TB HDDs into OSDs, use NVM devices as journals." That is the only approach that works for a large, heterogeneous cluster.

We cannot verify this bug due to lack of the hardware necessary to check it.
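To connect the wildcard match to the "File name too long" error, here is a minimal sketch of the failure mode, assuming the container's disk_list step builds a /dev/disk/by-partuuid/ path from whatever UUIDs its device match returns. It is not the actual entrypoint code from the image, and the lsblk call at the end is one hypothetical way to scope the lookup to the requested device, not the fix that shipped.

    # A sketch of the failure mode, not the container's actual code.
    OSD_DEVICE=/dev/sda

    # An unanchored match on "sda" collects the data-partition UUIDs of sdaa,
    # sdab, ... as well, joined here into one space-separated string.
    uuids=$(for p in /dev/sd*[0-9]; do blkid -p -s PART_ENTRY_UUID "$p"; done \
            | grep "^${OSD_DEVICE}" \
            | sed 's/.*PART_ENTRY_UUID="\([^"]*\)".*/\1/' \
            | tr '\n' ' ')

    # Print the mount invocation that results: the "source" is a single path
    # component built from ~10 UUIDs, far longer than the 255 bytes a file name
    # may be, hence mount failing with "File name too long".
    echo mount "/dev/disk/by-partuuid/${uuids}" /var/lib/ceph/tmp/tmp.example

    # A hypothetical, unambiguous alternative: ask lsblk for the partitions that
    # are children of the requested device instead of matching on the name prefix.
    lsblk -n -p -o NAME,PARTUUID "${OSD_DEVICE}"

Whatever form the real fix takes, the key point is that the lookup has to be anchored to the requested device itself rather than to its name prefix.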
@bengland, will you be able to retest this scenario in the scale lab?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4354

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.