Description of problem:
In a Linode cluster [1] configured with 256 ceph-osds, 0 clients, 3 ceph-mons, and 2 ceph-mgrs, I have observed the task in $subject [2] take a very long time to complete. Its run time appears to grow linearly or quadratically with the size of the cluster.

[1] https://github.com/batrick/ceph-linode
[2] https://github.com/ceph/ceph-ansible/blob/098f42f2334c442bf418f09d3f4b3b99750c7ba0/roles/ceph-config/tasks/main.yml#L77-L93

Version-Release number of selected component (if applicable):
4.0 (master)

How reproducible:
100%

Steps to Reproduce:
1. Create a large cluster using the steps outlined in ceph-linode's README. 256 OSDs reliably reproduces the issue.

Other notes:
--forks=50 [3] is not the cause. I've observed the same issue with --forks=5.

[3] https://github.com/batrick/ceph-linode/blob/master/ansible-env.bash#L8
Could you give us more information about the setup?
- ansible version
- containerized deployment or not
- any other useful configuration variables (ceph overrides, osd scenarios, etc.)

When you say 256 OSDs, I suppose that means the number of OSD devices and not the number of OSD nodes, right? If so, how many dedicated nodes are you using?

I took a quick look and I don't see any reason why generating the ceph.conf template would take longer as the number of OSDs grows. I am more concerned about the effect of the OSD count on the ceph_volume task [1] than on the template creation.

It would also be interesting to run ceph-ansible with the configuration from ansible.cfg [2] (it is currently overridden by linode's launch.sh script) and look at the task timing via the profile_tasks callback.

[1] https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-config/tasks/main.yml#L22-L39
[2] https://github.com/ceph/ceph-ansible/blob/master/ansible.cfg
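For reference, a minimal sketch of what enabling the profile_tasks callback can look like in an ansible.cfg. This assumes an Ansible release that still uses the callback_whitelist setting (newer releases call it callbacks_enabled); it is not a copy of the ceph-ansible file linked in [2], which may differ.

```ini
# Sketch only: enable per-task timing output.
# Assumes an Ansible version where the option is named callback_whitelist;
# on newer releases the equivalent option is callbacks_enabled.
[defaults]
callback_whitelist = profile_tasks
```

With the callback enabled, each task's elapsed time is printed as the play runs and a summary of the slowest tasks is printed at the end, which should show whether the task in $subject dominates the run.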
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri