Bug 1628652

Summary: ceph container with more than 26 osds on a host fails to run

Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Container
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Status: CLOSED ERRATA
Reporter: Ben England <bengland>
Assignee: Dimitri Savineau <dsavinea>
QA Contact: Vasishta <vashastr>
Docs Contact:
CC: adeza, ceph-eng-bugs, ceph-qe-bugs, dhill, dsavinea, dwilson, gabrioux, gfidente, johfulto, mmurthy, nojha, pasik, tchandra, tserlin, twilkins, ykaul, yrabl
Target Milestone: z2
Target Release: 3.3
Keywords: Performance
Whiteboard:
Fixed In Version: rhceph:ceph-3.3-rhel-7-containers-candidate-48122-20191024075852
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1784047 (view as bug list)
Environment:
Last Closed: 2019-12-19 17:44:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730, 1784047

Description Ben England 2018-09-13 16:01:26 UTC
Description of problem:

ceph-disk repeatedly invokes a find command that brings a RHOSP computeosd node to its knees: system disk utilization hits 100% and docker commands take a very long time to run.


Version-Release number of selected component (if applicable):

RHOSP 13 GA dated 06-21
RHCS 3.0 container tag 3-9

Yes, I know this isn't the latest; we are working on that.

How reproducible:

every time in our cluster, which is very large (476 OSDs)

Steps to Reproduce:
1. deploy large RHOSP 13 HCI cluster
2. populate the cluster with 147 TB of Cinder-RBD data (16% space used)
3. 

Not sure if there is a step 3, but I've never seen a time when this wasn't happening on the cluster after it was populated.

Actual results:

system disk looks like this all the time:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdak              0.00    70.00    0.00  210.00     0.00     4.49    43.81   158.30  727.47    0.00  727.47   4.76  99.95



Expected results:

This should only happen once; each container should not be running the same find command over the same large set of directories and files. Note that /var/lib/ceph is shared by the containers. If you look at /usr/share/ceph-osd-run.sh you see:

/usr/bin/docker run \
...
  -v /var/lib/ceph:/var/lib/ceph:z
...

which means that the OSD containers are all accessing the same /var/lib/ceph tree on the computeosd host.
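
If the ownership fix really does need to run at container start, a marker file on the shared volume would keep it to one pass per host instead of one pass per OSD container. This is only a sketch of the idea (the marker path is made up, and I haven't checked what the entrypoint actually does):

# sketch: run the chown pass once per host, not once per container
MARKER=/var/lib/ceph/.ownership-fixed        # hypothetical marker file on the shared volume
if [ ! -e "$MARKER" ]; then
    find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} +
    touch "$MARKER"
fi

Using "-exec ... {} +" also batches the chown instead of forking it once per file, as the "{} \;" form does.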



Additional info:

The command appears to originate from:

/usr/lib/python2.7/site-packages/ceph_disk/main.py 

in the main_fix() subroutine.

The symptom is that a large number of find processes appear and start to consume significant system resources; they look like this:

[root@overcloud-compute-11 ~]# ps awux | grep find | grep -v grep                                                                                                            
root      186656  8.6  0.0  23740  8188 ?        S    15:44   0:05 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;                                     
root      237569  8.9  0.0  23740  8188 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;                                     
root      237570  9.0  0.0  23740  8184 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      248529  8.6  0.0  23872  8192 ?        S    15:45   0:02 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;
root      251811  8.9  0.0  23872  8188 ?        S    15:45   0:01 find -L /var/lib/ceph/ -mindepth 1 -maxdepth 3 -exec chown ceph. {} ;

I have seen as many as 12 of these running at once. These find processes take so long to run because of the following (this happens on all computeosd nodes):

[root@overcloud-compute-11 ~]# find /var/lib/ceph/tmp -name 'tmp.*' | wc -l
23871

Why are there so many of these?  I have to keep deleting them with

# find /var/lib/ceph/tmp -name 'tmp.*' -exec rmdir {} \;

Otherwise the system becomes unusable. This is probably a separate bz in itself; I could not find where these directories are being created in the ceph-disk code.
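
Until the root cause is fixed, a slightly safer variant of that cleanup (just a sketch, assuming the stale directories are empty and a one-day age cutoff is acceptable) would only touch old, empty tmp.* directories:

# find /var/lib/ceph/tmp -maxdepth 1 -type d -name 'tmp.*' -mtime +1 -empty -exec rmdir {} +

That way a tmp.* directory that an in-flight activation is actually using doesn't get pulled out from under it.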

Comment 3 Ben England 2018-09-19 15:18:55 UTC
I've noticed over 100,000 of these messages in /var/log/messages on the cluster; perhaps this is related to why there are so many /var/lib/ceph/tmp/tmp.* directories.

Sep 19 15:10:28 overcloud-compute-11 ceph-osd-run.sh: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long
Sep 19 15:10:28 overcloud-compute-11 journal: 3d76bfca-9f0f-4f6d-b040-acf362bb40df on /var/lib/ceph/tmp/tmp.t47SoP9RAm failed: File name too long

Why is this coming from /usr/share/ceph-osd-run.sh? It only executes two docker run commands.

Why is the file name too long?

Comment 4 Ben England 2018-10-02 16:44:01 UTC
I was able to stop creation of these directories on 1 host, overcloud-compute-6.  I now have a theory about what's wrong.  Adding Alfredo because he might be able to fill in the missing pieces of the puzzle.  

To stop the directory creation, we had to "ceph-disk zap /dev/sda" because of the problem described below. I then found that Ceph kept trying to activate /dev/sdaj, which had no partition table! So I stopped and disabled ceph-osd, and now no directories are being created.

When containerized Ceph tries to start an OSD on /dev/sda and there are more than 26 OSDs, device naming wraps around and it starts using device names like /dev/sdaa, /dev/sdab, and so on. Apparently something in the container image gets totally confused by that and matches multiple devices. I know this because we typically have 34 OSDs per host in this cluster. This is a major bug. The assumption seems to be that sda is the system disk, or that there aren't enough other disk devices for this wildcarding to break.
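
The wraparound is easy to illustrate from the shell (the listing below is hypothetical, and I haven't read the pattern the container image actually uses):

# ls -d /dev/sda*
/dev/sda  /dev/sda1  /dev/sda2  /dev/sdaa  /dev/sdaa1  /dev/sdab  /dev/sdab1  ...
# lsblk -ln -o NAME /dev/sda
sda
sda1
sda2

A glob anchored at "sda" picks up the wraparound names as well, whereas asking the kernel for the partitions of /dev/sda itself is unambiguous.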

I'm hypothesizing that the /var/lib/ceph/tmp/tmp.* directories (which started this bz) are left over from this process. I am verifying that this is the root cause.

The same problem could be recreated in a virtual machine environment if there were enough OSD devices available (i.e., more than 26).

BTW, we updated the software version and it still happens.

Background: Although I haven't looked at the source code, I think this is how Ceph OSD startup works. If it sees a partition whose PART_ENTRY_NAME attribute contains "ceph data" (or perhaps whose PART_ENTRY_TYPE contains 4fbd7e29-9d25-41b8-afd0-062c0ceff05d), it assumes the partition is an OSD and tries to start it. This command shows which devices contain OSDs:


# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_TYPE -s PART_ENTRY_NAME $n ; done | grep sda
/dev/sda1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdaa1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdab1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdac1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdad1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdae1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdaf1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdag1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdah1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdai1: PART_ENTRY_NAME="ceph data" PART_ENTRY_TYPE="4fbd7e29-9d25-41b8-afd0-062c0ceff05d" 
/dev/sdak1: PART_ENTRY_TYPE="0x83" 
...

/dev/sdak is the system disk here; that's just the way the scale lab works.

For each such OSD partition, it first mounts the filesystem, then discovers the OSD number in the "whoami" file. It then discovers the journal device via the "journal" symlink. It should then be able to do what you and I did and start the OSD, as is done in /usr/share/ceph-osd-run.sh.
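
Roughly, the manual equivalent of that activation path looks like this (a sketch only; the mount point is made up and the real entrypoint does more than this):

mkdir -p /mnt/osd-probe                  # hypothetical scratch mount point
mount /dev/sda1 /mnt/osd-probe           # the "ceph data" partition
cat /mnt/osd-probe/whoami                # prints the OSD id, e.g. 12
readlink -f /mnt/osd-probe/journal       # resolves to the journal partition
umount /mnt/osd-probe

With that OSD id in hand, ceph-osd-run.sh (or a docker run with CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE) can start the daemon for that device.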

But this doesn't work. Why? It's simple: it is using a wildcard, so it matches more than one device. See this output:

# bash -x /usr/share/ceph-osd-run.sh sda 
+ DOCKER_ENV= 
+ expose_partitions sda 
++ docker run --rm --net=host --name expose_partitions_sda --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda 192.168.24.1:8787/rhceph:3-12 disk_list 
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba 
d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 
5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 
129df299-0142-4091-9a54-01411450df3a 
fbe6babc-1164-40bc-86b4-0422c18d9b90 
2735bdda-c14d-4245-b1c2-3b8098ddd7bf 
7b9bba9a-8c0a-4152-be57-476005beb126 
1e9647ff-f09a-4ee2-8105-af17a220da12 
046aae50-35a4-4335-a461-fced600503e5 
bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.gy28EWV48j failed: File name too long 
+ DOCKER_ENV= 
+ docker rm -f expose_partitions_sda 
Error response from daemon: No such container: expose_partitions_sda 
+ /usr/bin/docker run --rm --net=host --privileged=true --pid=host --memory=5g --cpu-quota=100000 -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -v /var/lib/ceph:/var/lib/ceph:z -v /etc/ceph:/etc/ceph:z -v /var/run/ceph:/var/run/ceph:z -e OSD_FILESTORE=1 -e OSD_DMCRYPT=0 -e CLUSTER=ceph -e OSD_DEVICE=/dev/sda -e CEPH_DAEMON=OSD_CEPH_DISK_ACTIVATE --name=ceph-osd-overcloud-compute-12-sda 192.168.24.1:8787/rhceph:3-12 
2018-10-01 22:19:37  /entrypoint.sh: static: does not generate config 
mount: mount /dev/disk/by-partuuid/867ef9c4-2ae1-4c61-b99f-786646807fba 
d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61 
5ad89aa8-0dcf-42b6-8e12-3aa916dee91c 
129df299-0142-4091-9a54-01411450df3a 
fbe6babc-1164-40bc-86b4-0422c18d9b90 
2735bdda-c14d-4245-b1c2-3b8098ddd7bf 
7b9bba9a-8c0a-4152-be57-476005beb126 
1e9647ff-f09a-4ee2-8105-af17a220da12 
046aae50-35a4-4335-a461-fced600503e5 
bed87a5c-d799-4cc1-8d6a-1edcbfca0d93 on /var/lib/ceph/tmp/tmp.cq8VM6tZNo failed: File name too long

Note that this is what would happen if you matched /dev/sda*: you'd get /dev/sdaa, /dev/sdab, and so on. The list below matches precisely what is shown above.

# for n in /dev/sd*[a-z]1 ; do blkid -p -s PART_ENTRY_UUID $n ; done | grep /dev/sda
/dev/sda1: PART_ENTRY_UUID="867ef9c4-2ae1-4c61-b99f-786646807fba" 
/dev/sdaa1: PART_ENTRY_UUID="d7c2aff0-9fe6-4dba-8e9d-03c0b2e95a61" 
/dev/sdab1: PART_ENTRY_UUID="5ad89aa8-0dcf-42b6-8e12-3aa916dee91c" 
/dev/sdac1: PART_ENTRY_UUID="129df299-0142-4091-9a54-01411450df3a" 
/dev/sdad1: PART_ENTRY_UUID="fbe6babc-1164-40bc-86b4-0422c18d9b90" 
/dev/sdae1: PART_ENTRY_UUID="2735bdda-c14d-4245-b1c2-3b8098ddd7bf" 
/dev/sdaf1: PART_ENTRY_UUID="7b9bba9a-8c0a-4152-be57-476005beb126" 
/dev/sdag1: PART_ENTRY_UUID="1e9647ff-f09a-4ee2-8105-af17a220da12" 
/dev/sdah1: PART_ENTRY_UUID="046aae50-35a4-4335-a461-fced600503e5" 
/dev/sdai1: PART_ENTRY_UUID="bed87a5c-d799-4cc1-8d6a-1edcbfca0d93"
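
For what it's worth, the "File name too long" error is consistent with all of those matched partuuids being concatenated into a single mount source path: ten 36-character UUIDs plus separators is well past NAME_MAX (255 bytes). A back-of-the-envelope check along those lines (a sketch; the glob here is specific to this host's device names):

# for n in /dev/sda1 /dev/sda[a-i]1 ; do blkid -p -o value -s PART_ENTRY_UUID $n ; done | paste -s -d, | wc -c

For the ten UUIDs shown above this comes out around 370 bytes, i.e. well past the limit.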

Comment 9 Ben England 2018-12-14 13:30:13 UTC
The OSD names wouldn't have to be consistent across the nodes if bz 1438590 were fixed (discover OSDs by rule, don't enumerate them); I reported that a year and a half ago. For example: "make all the 2-TB HDDs into OSDs, use the NVM devices as journals". This is the only approach that works for a large, heterogeneous cluster.
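
For illustration, that kind of rule can be expressed against lsblk output instead of an enumerated device list. This is just a sketch (the "rotational and terabyte-sized" test is made up and would need to match the real hardware profile):

# lsblk -dn -o NAME,SIZE,ROTA,TYPE | awk '$4 == "disk" && $3 == 1 && $2 ~ /T$/ {print "/dev/" $1}'

This prints every rotational whole disk whose reported size is in the terabyte range, which is roughly the "make all the 2-TB HDDs into OSDs" rule above.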

Comment 13 Yogev Rabl 2019-11-26 16:48:16 UTC
We cannot verify this bug due to lack of hardware necessary to check it.
@bengland, will you be able to retest this scenario in the scale lab?

Comment 21 errata-xmlrpc 2019-12-19 17:44:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4354

Comment 23 Red Hat Bugzilla 2023-09-18 00:14:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days