Bug 1386905 - Research the issues/obstacles to allowing OSPd managed Ceph Node Scaling to at least 25 nodes [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-heat
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: Upstream M3
Target Release: 11.0 (Ocata)
Assignee: leseb
QA Contact: Yogev Rabl
Derek
URL:
Whiteboard:
Depends On: 1372589 1430002
Blocks: 1387431 1414466 1422721
 
Reported: 2016-10-19 19:15 UTC by Jeff Brown
Modified: 2017-11-21 16:44 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-17 19:36:00 UTC
slinaber: needinfo? (shan)




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1245 normal SHIPPED_LIVE Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory 2017-05-17 23:01:50 UTC

Description Jeff Brown 2016-10-19 19:15:47 UTC
Description of problem:

OSP10 has a best-practice limit of scaling up to 10 Ceph storage nodes. For OSP11 we need to investigate what limits the level of scale and improve it to support up to 25 Ceph nodes. This starts as an investigation and will generate a series of BZs that will need to be resolved to allow for the increased number of Ceph nodes supported.

We do not yet know exactly which components will require modification, so the currently set component, puppet-heat, is a placeholder.



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

A usable OSP solution that scales to at least 25 Ceph nodes will be supported. The usability of the solution will be validated by the Storage consulting field support team.

Additional info:

Comment 2 Steve Linabery 2017-01-16 15:52:00 UTC
please add 'fixed in version' info

Comment 3 Ben England 2017-01-24 21:32:46 UTC
To support scaling to 25 Ceph nodes with as many as 36 OSDs per node, you need to address bz 1372589 and increase the file descriptor limit for compute nodes from its default of 1024. If you do this, you may be able to get much farther than 25 nodes - Tim Wilkinson and I got to 29 nodes and 1043 OSDs this way with an external Ceph cluster connected to OpenStack.

Note that http://tracker.ceph.com/issues/17573 means you don't immediately find out that something is wrong if you don't fix this, because your VM doesn't create all the Ceph OSD sockets immediately - instead it creates them on demand, until it runs out of file descriptors for Ceph sockets, at which point you may see different behaviors, including hangs. You can see that this happened in /var/log/libvirt/qemu/*.log, but it's better not to let it happen at all, and a simple config change is all it takes.
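As a rough back-of-the-envelope sketch (not part of this bug's fix; the overhead figure is an illustrative assumption), a QEMU process attached to RBD volumes can need roughly one socket per OSD, so the OSD count from the scale test above already exceeds the default per-process limit:

```shell
# Compare the number of file descriptors a single QEMU process may need
# (roughly one socket per OSD, plus some allowance) against the common
# default nofile soft limit of 1024.
osds=1043            # OSD count reached in the scale test described above
default_limit=1024   # common default soft limit for open files
overhead=100         # hypothetical extra descriptors (images, logs, eventfds)
needed=$((osds + overhead))
if [ "$needed" -gt "$default_limit" ]; then
    echo "raise the nofile limit to at least $needed"
fi
```

On RHEL-based compute nodes the limit can typically be raised via /etc/security/limits.conf or a systemd LimitNOFILE override for libvirtd; the actual fix is tracked in bz 1372589.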

Originally, duplicate bz 1389503 was filed as part of the Red Hat OpenStack scale lab project to get to > 1000 OSDs across 29 servers. 1389502 (kernel.pid_max) appears to have been fixed, but was also necessary for this goal.
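For reference, the pid_max side can be inspected on a compute node with a check like the one below (a sketch only; the 4194304 value in the comment is an illustrative example, not a setting recommended by this bug):

```shell
# Read the current kernel.pid_max. Every QEMU and OSD thread consumes a
# PID, so this limit also caps scale (the issue tracked in bz 1389502).
current=$(cat /proc/sys/kernel/pid_max)
echo "kernel.pid_max is currently $current"
# To persist a larger value (run as root; 4194304 is only an example):
#   echo 'kernel.pid_max = 4194304' > /etc/sysctl.d/90-pid-max.conf
#   sysctl --system
```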

Comment 4 Ben England 2017-02-06 12:51:47 UTC
cc'ing Rick and Andy - this impacts OpenStack Performance & Scale Release Criteria.

https://docs.google.com/document/d/1I4l1UzDoykh4o9jUQJdzYkSji3iamERk8OtPDxjKj9E/edit#heading=h.gu9zkkcebb85

Comment 7 errata-xmlrpc 2017-05-17 19:36:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245

