Bug 1856981
| Field | Value |
|---|---|
| Summary | 9GB of RAM are wasted for TASK [ceph-facts : set_fact ceph_current_status (convert to json)] |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | Ceph-Ansible |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.1 |
| Target Release | 4.2 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | ceph-ansible-4.0.32-1.el8cp, ceph-ansible-4.0.32-1.el7cp |
| Reporter | John Fulton <johfulto> |
| Assignee | Guillaume Abrioux <gabrioux> |
| QA Contact | Vasishta <vashastr> |
| CC | aschoen, bdobreli, bengland, ceph-eng-bugs, dsavinea, gabrioux, gmeno, nthomas, smalleni, tserlin, vereddy, ykaul |
| Type | Bug |
| Bug Blocks | 1760354 |
| Last Closed | 2021-01-12 14:56:02 UTC |
Description
John Fulton
2020-07-14 20:29:15 UTC
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

This happened while scaling the overcloud compute count from 472 to 474 on an already deployed Ceph cluster. Also, I don't believe we were using -e ceph_ansible_limit, so the client role was probably running across all the compute nodes.

I can give you that data, but note that we've since scaled that cluster further and are now at 660+ compute nodes, so this might not exactly match the output from when the profiling was done at the 472-node scale. Current output from the cluster: https://gist.githubusercontent.com/smalleni/e8828ade26ce679df799d02847ef5035/raw/372ff53d0b6e49c8c83c932d691d8a5fd5d63f05/gistfile1.txt

What is the problem with using 9 GB of RAM for deploying a *huge* cluster? It's only needed during the deployment, right? A typical midrange server has > 100 GB of RAM. If that's the only problem, why don't we just recommend additional memory in such cases? Also, I thought fact gathering was only done on the ansible-playbook host and that we weren't asking each host for facts about all the other hosts (which would be O(N^2)). Is that so? What is the --forks parameter?

I saw this in an e-mail chain: "FTR, in my testing the memory consumption is likely related to the number of forks and the async waits that we perform. Example in my basic reproducer (with a whole 3 tasks): forks=80 against 20 hosts used ~1.5G, whereas the default forks against 20 hosts used ~900M according to cgroup_memory_recap." So why not use --forks=40 instead? What are the requirements on overall deploy or scale-up time, and how is this impacted by --forks? Is elapsed time proportional to 1/forks, as we might guess, or are there other factors? RHCS 5 will be based on cephadm, not ceph-ansible, so I don't think we want to invest a ton of time in fixing ceph-ansible; we need data on cephadm scalability, right?

There is a related issue about the default ansible.cfg forks we use, which is calculated as 10*CPU_COUNT. That is problematic on a 64-core undercloud, since 640 Ansible forks can end up consuming ALL of the memory in some cases. I opened: https://bugzilla.redhat.com/show_bug.cgi?id=1857451

Actually, ignore my previous comment: the forks used by ceph-ansible are different from the forks used by tripleo-ansible. Looks like my ceph_ansible_command.sh has ANSIBLE_FORKS=25.

Now that https://github.com/ceph/ceph-ansible/pull/5663 has merged into the stable 4 branch, can it be included in the coming v4.0.30 release and the bug moved to POST?

Based on comment 30, moving to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:0081
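
For context on the task named in the summary: the ceph-facts role runs a cluster status command and turns its output into a JSON fact so later tasks can inspect the existing cluster. The sketch below is only an approximation of that pattern; the task layout, register name ceph_status_out, and the 'mons' group reference are illustrative and are not copied from the ceph-ansible source. It shows the kind of place where a large status document, held once per host and per fork on the controller, can add up to the memory figures discussed above.

```yaml
# Hedged sketch, not the actual ceph-ansible code: gather the cluster status
# once and convert it to a JSON fact. Variable and group names are examples.
- name: get current ceph status
  command: ceph status --format json
  register: ceph_status_out
  changed_when: false
  run_once: true
  delegate_to: "{{ groups['mons'][0] }}"

- name: set_fact ceph_current_status (convert to json)
  set_fact:
    ceph_current_status: "{{ ceph_status_out.stdout | from_json }}"
  run_once: true
```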
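
On the --forks question raised above, the fork count can be capped in a few places; which one applies depends on how ceph-ansible is invoked (per the comments, TripleO's generated ceph_ansible_command.sh sets ANSIBLE_FORKS). The value 40 below is the one floated in the discussion, not a tested recommendation, and the playbook and inventory names are placeholders.

```sh
# Illustrative only: 40 is the value suggested above, and the playbook and
# inventory paths are placeholders, not the exact deployment command.

# Via environment variable (the mechanism ceph_ansible_command.sh uses):
ANSIBLE_FORKS=40 ansible-playbook -i inventory site-container.yml

# Via the command-line flag:
ansible-playbook -i inventory --forks 40 site-container.yml

# Via ansible.cfg:
#   [defaults]
#   forks = 40
```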