Bug 1651415

Summary: ceph-ansible task "generate ceph configuration file: {{ cluster }}.conf" slow with large clusters
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Patrick Donnelly <pdonnell>
Component: Ceph-Ansible
Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED NOTABUG
QA Contact: Vasishta <vashastr>
Severity: high
Priority: unspecified
Version: 4.0
CC: anharris, aschoen, ceph-eng-bugs, gmeno, nthomas, pdonnell
Target Milestone: rc
Target Release: 4.0
Hardware: All
OS: All
Type: Bug
Last Closed: 2019-09-26 17:58:29 UTC

Description Patrick Donnelly 2018-11-20 02:32:49 UTC
Description of problem:

In a Linode cluster [1] configured with 256 ceph-osd daemons, 0 clients, 3 ceph-mon daemons, and 2 ceph-mgr daemons, I have observed the task named in the summary [2] take a very long time to complete. The time taken appears to grow linearly, or perhaps quadratically, with the size of the cluster.

[1] https://github.com/batrick/ceph-linode
[2] https://github.com/ceph/ceph-ansible/blob/098f42f2334c442bf418f09d3f4b3b99750c7ba0/roles/ceph-config/tasks/main.yml#L77-L93
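For reference, the task in question has roughly the following shape (a simplified sketch of the ceph-config role task, not the exact code at [2]; ownership, mode, and notify handlers are omitted):

    # Simplified sketch of the "generate ceph configuration file" task; see [2]
    # for the real code in roles/ceph-config/tasks/main.yml.
    - name: "generate ceph configuration file: {{ cluster }}.conf"
      action: config_template
      args:
        src: ceph.conf.j2
        dest: "/etc/ceph/{{ cluster }}.conf"
        config_overrides: "{{ ceph_conf_overrides }}"
        config_type: ini

The task runs once per host, so if the ceph.conf.j2 template itself iterates over facts for every host in the cluster, rendering would scale quadratically with cluster size. That is only a hypothesis consistent with the observed behavior, not something I have confirmed.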

Version-Release number of selected component (if applicable):

4.0 (master)

How reproducible:

100%

Steps to Reproduce:
1. Create a large cluster following the steps outlined in ceph-linode's README. A cluster with 256 OSDs reliably reproduces the issue.

Other notes:

--forks=50 [3] is not the cause. I've observed the same issue with --forks=5.

[3] https://github.com/batrick/ceph-linode/blob/master/ansible-env.bash#L8
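For reference, ceph-linode passes the fork count on the ansible-playbook command line [3]; the equivalent ansible.cfg setting (shown here only for illustration) would be:

    # ansible.cfg -- equivalent of passing --forks=5 on the command line
    [defaults]
    forks = 5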

Comment 3 Dimitri Savineau 2019-02-26 20:16:47 UTC
Could you give us more information about the setup?
  - ansible version
  - containerized deployment
  - any other useful configuration variables (ceph overrides, OSD scenarios, etc.)

When you say 256 OSDs, I assume you mean OSD devices rather than OSD nodes, right? If so, how many dedicated nodes are you using?
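To illustrate the distinction: with ceph-ansible, each host in the osds inventory group usually lists its disks via the devices variable, so "256 OSDs" could mean anything from 256 single-disk nodes to a handful of dense nodes. A hypothetical example:

    # group_vars/osds.yml (hypothetical example): 4 OSD devices per node,
    # so 64 such nodes would yield 256 OSDs in total
    devices:
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
      - /dev/sde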

I took a quick look, and I don't see any reason why the ceph.conf template generation would take more time as the number of OSDs grows.
I'm more concerned about the OSD count's impact on the ceph_volume task [1] than about the template creation.

Also, it could be interesting to run ceph-ansible with the configuration from ansible.cfg [2] (because it is overridden by linode's launch.sh script) and to look at the task timing via the profile_tasks callback.

[1] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-config/tasks/main.yml#L22-L39
[2] https://github.com/ceph/ceph-ansible/blob/master/ansible.cfg
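For completeness: profile_tasks is a stock Ansible callback, so if you are not running with the repo's ansible.cfg it can be enabled with something like the following (illustrative snippet):

    # ansible.cfg -- enable per-task timing output via the profile_tasks callback
    [defaults]
    callback_whitelist = profile_tasks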

Comment 5 Giridhar Ramaraju 2019-08-05 13:09:37 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri
