Bug 1628652 - ceph container with more than 26 osds on a host fails to run
Summary: ceph container with more than 26 osds on a host fails to run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 3.3
Assignee: Dimitri Savineau
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks: 1578730 1784047
 
Reported: 2018-09-13 16:01 UTC by Ben England
Modified: 2023-09-18 00:14 UTC
CC List: 17 users

Fixed In Version: rhceph:ceph-3.3-rhel-7-containers-candidate-48122-20191024075852
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1784047 (view as bug list)
Environment:
Last Closed: 2019-12-19 17:44:56 UTC
Embargoed:




Links
Github ceph/ceph-container pull 1461 (closed): disk_list.sh: Don't use wildcard on OSD device (last updated 2020-10-25 17:05:59 UTC)
Github ceph/ceph-container pull 1464 (closed): disk_list.sh: Don't use wildcard on OSD device (last updated 2020-10-25 17:06:13 UTC)
Red Hat Issue Tracker RHCEPH-6309 (last updated 2023-03-24 14:27:08 UTC)
Red Hat Product Errata RHBA-2019:4354 (last updated 2019-12-19 17:45:01 UTC)

Description Ben England 2018-09-13 16:01:26 UTC
Description of problem:

ceph-disk repeatedly invokes a find command that brings a RHOSP computeosd node to its knees: system disk utilization hits 100% and docker commands take a very long time to run.


Version-Release number of selected component (if applicable):

RHOSP 13 GA dated 06-21
RHCS 3.0 container tag 3-9

Yes, I know this isn't the latest; we are working on that.

How reproducible:

every time in our cluster, which is very large (476 OSDs)

Steps to Reproduce:
1. deploy large RHOSP 13 HCI cluster
2. populate the cluster with 147 TB of Cinder-RBD data (16% space used)
3. 

Not sure if there is a step 3, but I've never seen a time when this wasn't happening on the cluster after it was populated.

Actual results:

system disk looks like this all the time:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdak              0.00    70.00    0.00  210.00     0.00     4.49    43.81   158.30  727.47    0.00  727.47   4.76  99.95



Expected results:

This should only happen once; each container should not be running the same find command over the same large set of directories and files!  Note that /var/lib/ceph is shared by the containers.  If you look at /usr/share/ceph-osd-run.sh you see:

/usr/bin/docker run \
...
  -v /var/lib/ceph:/var/lib/ceph:z
...

which means that the OSD containers are all accessing the same /var/lib/ceph tree on the computeosd host.
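
One way to confirm that all the OSD containers on a host share the same /var/lib/ceph bind mount (illustrative only; substitute a real container name from docker ps):

# docker inspect --format '{{ .HostConfig.Binds }}' <osd container name> | tr ' ' '\n' | grep /var/lib/ceph

Every OSD container checked this way should report the same /var/lib/ceph:/var/lib/ceph:z bind.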



Additional info:

The command appears to originate from:

/usr/lib/python2.7/site-packages/ceph_disk/main.py 

in the main_fix() subroutine.
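
To confirm that these come from inside the OSD containers rather than from the host, one can look at the cgroup of one of the find PIDs from the ps listing below; on a RHEL 7 docker host the cgroup path of a containerized process names the owning container (PID and commands illustrative only):

# grep docker /proc/186656/cgroup
# docker ps --no-trunc | grep <container id printed by the previous command>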

The symptom is that a large number of find processes appear and start to consume significant system resources; they look like this:

[root@overcloud-compute-11 ~]# ps awux | grep find | grep -v grep                                                                                                            
root      186656  8.6  0.0  23740  8188 ?        S    15:44   0:05 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;                                     
root      237569  8.9  0.0  23740  8188 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;                                     
root      237570  9.0  0.0  23740  8184 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      248529  8.6  0.0  23872  8192 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      251811  8.9  0.0  23872  8188 ?        S    15:45   0:01 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;

I have seen as many as 12 of these running.  These find processes take so long to run because of the sheer number of leftover tmp directories (this happens on all computeosd nodes):

[root@overcloud-compute-11 ~]# find /var/lib/ceph/tmp -name 'tmp.*' | wc -l
23871

Why are there so many of these?  I have to keep deleting them with

# find /var/lib/ceph/tmp -name 'tmp.*' -exec rmdir {} \;

Otherwise the system becomes unusable.  This in itself is probably a separate bz.  I could not find where these directories were being created in the ceph-disk code.
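
A slightly safer variant of that cleanup, which only removes empty tmp.* directories more than a day old (illustrative; adjust the age as needed):

# find /var/lib/ceph/tmp -maxdepth 1 -type d -name 'tmp.*' -mtime +1 -empty -exec rmdir {} \;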

Comment 3 Ben England 2018-09-19 15:18:55 UTC
I've noticed over 100,000 of these messages in /var/log/messages on the cluster; perhaps this is related to why there are so many of these /var/lib/ceph/tmp/tmp.* directories.

Sep 19 15:10:28 overcloud-compute-11 ceph-osd-run.sh: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long
Sep 19 15:10:28 overcloud-compute-11 journal: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long

Why is this coming from /usr/share/ceph-osd-run.sh?   It only executes 2 docker run commands.

Why is the file name too long?

Comment 4 Ben England 2018-10-02 16:44:01 UTC
I was able to stop creation of these directories on 1 host, overcloud-compute-6.  I now have a theory about what's wrong.  Adding Alfredo because he might be able to fill in the missing pieces of the puzzle.  

To stop the directory creation, we had to "ceph-disk zap /dev/sda" because of the problem described below.  I then found that Ceph kept trying to activate /dev/sdaj, which had no partition table!  So I stopped and disabled ceph-osd, and now there are no directories being created.

When containerized Ceph tries to start an OSD on /dev/sda and there are more than 26 disks on the host, kernel device naming wraps past /dev/sdz and starts using names like /dev/sdaa, /dev/sdab, ...  Something in the container image gets completely confused by that and matches multiple devices.  I know this because we typically have 34 OSDs per host in this cluster.  This is a major bug.  The code apparently assumes that sda is the system disk, or that there are never enough other disk devices for the wildcarding to break.
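
To see the wraparound directly (illustrative output, trimmed; the exact names depend on the host):

# echo /dev/sda*
/dev/sda /dev/sda1 /dev/sdaa /dev/sdaa1 /dev/sdab /dev/sdab1 ...

Anything that matches on the unanchored prefix /dev/sda therefore also sees /dev/sdaa through /dev/sdak; the nine wrapped data disks (sdaa through sdai) plus sda itself account for the ten partition UUIDs that show up in the failed mount output below.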

I'm hypothesizing that the /var/lib/ceph/tmp/tmp.* directories (which started this bz) are left over from this process.  Am verifying that this is the root cause. 

The same problem could be recreated in a virtual machine environment if there were enough OSD devices available (i.e., more than 26).

BTW we updated the software version and it still happens.

Background: Although I haven't looked at the source code, I think this is how Ceph OSD startup works.  If it sees a partition with a PART_ENTRY_NAME attribute containing "ceph data", or perhaps a PART_ENTRY_TYPE of 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, then it assumes that it is an OSD and tries to start it.  This command shows which devices contain OSDs:


# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_TYPE -s PART_ENTRY_NAME $n ; done | grep sda
/dev/sda1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdaa1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdab1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdac1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdad1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdae1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdaf1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdag1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdah1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdai1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdak1: PART_ENTRY_TYPE="0x83" 
...

/dev/sdak is the system disk here; it's just the way the scale lab works.

For each such OSD partition, it first mounts the filesystem, then discovers the OSD number in the "whoami" file.  It then discovers the journal device from the "journal" softlink.  It should then be able to do what you and I did and start the OSD running as is done in /usr/share/ceph-osd-run.sh.
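
By hand, that sequence looks roughly like this for a single partition (a sketch of the steps described above, not what the container actually runs; the mount point is made up):

# mkdir -p /mnt/osd-probe
# mount /dev/sda1 /mnt/osd-probe
# cat /mnt/osd-probe/whoami        (prints the OSD id)
# readlink /mnt/osd-probe/journal  (prints the journal device, a /dev/disk/by-partuuid link)
# umount /mnt/osd-probe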

But this doesn't work.  Why?  It's simple: it is using a wildcard, so it matches more than one device.  See this output:

# bash -x /usr/share/ceph-osd-run.sh sda 
+ DOCKER_ENV= 
+ expose_partitions sda 
++ docker run --rm --net=host --name expose_partitions_sda --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda 192.168.24.1:8787/rhceph:3-12 disk_list 
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba 
d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 
5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 
129df299-0142-4091-9a54-01411450df3a 
fbe6babc-1164-40bc-86b4-0422c18d9b90 
2735bdda-c14d-4245-b1c2-3b8098ddd7bf 
7b9bba9a-8c0a-4152-be57-476005beb126 
1e9647ff-f09a-4ee2-8105-af17a220da12 
046aae50-35a4-4335-a461-fced600503e5 
bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.gy28EWV48j failed: File name too long 
+ DOCKER_ENV= 
+ docker rm -f expose_partitions_sda 
Error response from daemon: No such container: expose_partitions_sda 
+ /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=5g --cpu-quota=100000 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -e OSD_FILESTORE=1 -e OSD_DMCRYPT=0 -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE --name=ceph-osd-overcloud-compute-12-sda 192.168.24.1:8787/rhceph:3-12 
2018-10-01 22:19:37  /entrypoint.sh: static: does not generate config 
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba 
d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 
5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 
129df299-0142-4091-9a54-01411450df3a 
fbe6babc-1164-40bc-86b4-0422c18d9b90 
2735bdda-c14d-4245-b1c2-3b8098ddd7bf 
7b9bba9a-8c0a-4152-be57-476005beb126 
1e9647ff-f09a-4ee2-8105-af17a220da12 
046aae50-35a4-4335-a461-fced600503e5 
bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.cq8VM6tZNo failed: File name too long

Note that this is what would happen if you matched /dev/sda*: you'd get /dev/sdaa, /dev/sdab, ...  The list below matches precisely the partition UUIDs shown up above.

# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_UUID $n ; done | grep /dev/sda
/dev/sda1: PART_ENTRY_UUID="867ef9c4-2ae1-4c61-b99f-786646807fba" 
/dev/sdaa1: PART_ENTRY_UUID="d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61" 
/dev/sdab1: PART_ENTRY_UUID="5ad89aa8-0dcf-42b6-8e12-3aa916dee91c" 
/dev/sdac1: PART_ENTRY_UUID="129df299-0142-4091-9a54-01411450df3a" 
/dev/sdad1: PART_ENTRY_UUID="fbe6babc-1164-40bc-86b4-0422c18d9b90" 
/dev/sdae1: PART_ENTRY_UUID="2735bdda-c14d-4245-b1c2-3b8098ddd7bf" 
/dev/sdaf1: PART_ENTRY_UUID="7b9bba9a-8c0a-4152-be57-476005beb126" 
/dev/sdag1: PART_ENTRY_UUID="1e9647ff-f09a-4ee2-8105-af17a220da12" 
/dev/sdah1: PART_ENTRY_UUID="046aae50-35a4-4335-a461-fced600503e5" 
/dev/sdai1: PART_ENTRY_UUID="bed87a5c-d799-4cc1-8d6a-1edcbfca0d93"
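
One way to avoid the wraparound is to anchor the match to the device name plus a partition number, in the spirit of the linked disk_list.sh pull requests ("Don't use wildcard on OSD device"); a minimal sketch of the idea, not the actual patch:

# ls /dev/sda*          (wildcard: also matches /dev/sdaa*, /dev/sdab*, ... once the host has more than 26 disks)
# ls /dev/sda[0-9]*     (anchored: matches only /dev/sda's own partitions, e.g. /dev/sda1)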

Comment 9 Ben England 2018-12-14 13:30:13 UTC
The OSD names wouldn't have to be consistent across the nodes if bz 1438590 were fixed (discover OSDs by rule, don't enumerate them).  I reported this a year and a half ago.  For example, "make all the 2-TB HDDs into OSDs, use NVM devices as journals".  This is the only solution that works for a large, heterogeneous cluster.

Comment 13 Yogev Rabl 2019-11-26 16:48:16 UTC
We cannot verify this bug due to lack of hardware necessary to check it.
@bengland, will you be able to retest this scenario in the scale lab?

Comment 21 errata-xmlrpc 2019-12-19 17:44:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4354

Comment 23 Red Hat Bugzilla 2023-09-18 00:14:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

