Currently there is no official documentation covering the tuning needed to reach deployments with 3 controllers + 97 or more computes. Some of the bottlenecks detected were resolved by increasing rpc_response_timeout to at least 3600 in the heat, ironic and nova configuration, enabling the memcached cache, and increasing the heat engine RPC workers to 48, as the default (2) was too low. With these changes it was possible to grow the cloud to 80-90 compute nodes, at which point the deployment started hitting haproxy timeouts on the controllers, which needed values above the defaults. With all this information we could create a basic solution article explaining some of the tunables that can be used. Still, as more details may appear, this should be treated as a whole and included in the standard documentation: it is not uncommon for customers to ask about scaling the platform (especially cloud providers and partners running their own products at scale), and we expect this to become even more common in the future.
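As a rough illustration of the tunables listed above, the following sketch generates the per-service overrides as INI text. It is only a sketch under stated assumptions: the report confirms rpc_response_timeout and the values 3600 and 48, but the [DEFAULT] section placement and the heat option name num_engine_workers are assumptions, and the helper names build_overrides and render are hypothetical. The memcached and haproxy settings are left out because the report gives no specific options or values for them.

```python
import configparser
import io

# Values quoted in the report above.
RPC_TIMEOUT = "3600"
HEAT_ENGINE_WORKERS = "48"


def build_overrides():
    """Return one ConfigParser per service holding the suggested overrides."""
    overrides = {}
    for service in ("heat", "ironic", "nova"):
        cfg = configparser.ConfigParser()
        # Assumption: the option lives in the [DEFAULT] section of each service.
        cfg["DEFAULT"] = {"rpc_response_timeout": RPC_TIMEOUT}
        overrides[service] = cfg
    # Assumption: num_engine_workers is the option behind the
    # "heat engine rpc workers" mentioned in the report.
    overrides["heat"]["DEFAULT"]["num_engine_workers"] = HEAT_ENGINE_WORKERS
    return overrides


def render(cfg):
    """Serialize a ConfigParser to INI text."""
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    for service, cfg in build_overrides().items():
        print(f"# {service}.conf overrides")
        print(render(cfg))
```

The generated fragments are meant for review before being merged into the real configuration by whatever mechanism the deployment uses; the values are the ones that worked in this particular environment, not general recommendations.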
Can you share what is failing here? Any sort of "tweaks" needed should be pushed back into the product vs having some sort of one-off documentation somewhere.
There is now a general OSP 10 scale guide at https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/recommendations_for_large_deployments/index
Closing this BZ; please reopen if needed.