Description of problem:
LUN allocation tasks should run simultaneously, up to the ansible fork limit, across the gateways. In the BAGL environment this isn't happening - instead the task is scheduled on one node, and the other node does not receive the job until the first node completes.

Each LUN allocation task has an owning gateway, which performs the rbd create/resize and updates the config object. When the tasks run in a more serial pattern, the non-owning host times out waiting for the config update to happen.

Version-Release number of selected component (if applicable):
1.3+

How reproducible:
Only seen so far in BAGL cluster

Steps to Reproduce:
1. Run a normal playbook to allocate LUNs
2.
3.

Actual results:
- timeouts during the allocate task are observed.
- looking at the times the jobs start to run on the gateways (journalctl -t ansible-igw_lun) shows that there is a lag starting jobs

Expected results:
All jobs within a task should be scheduled simultaneously, up to the fork limit set in /etc/ansible/ansible.cfg, so this issue should not happen.

Additional info:
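For anyone checking which fork limit is actually configured, here is a minimal Python sketch that reads the `forks` value from ansible.cfg-style text. The helper name `read_forks` is hypothetical; the fallback of 5 is Ansible's documented default when `forks` is not set. A `-f` value on the command line overrides whatever the file says, so this only reports the configured baseline.

```python
import configparser


def read_forks(cfg_text, default=5):
    """Return the forks value from the [defaults] section of ansible.cfg text.

    Hypothetical helper for illustration. Falls back to `default` when the
    section or the option is missing.
    """
    # Interpolation is disabled so '%' characters in other options are harmless.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return parser.getint("defaults", "forks", fallback=default)


if __name__ == "__main__":
    # Path taken from the report; adjust if the config lives elsewhere.
    with open("/etc/ansible/ansible.cfg") as handle:
        print(read_forks(handle.read()))
```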
Created attachment 1215663 [details]
playbook run with -f 10 and -vvv

I ran this playbook with -f 10 -vvv and it worked.
Purged the config and reran again ... same issue.
Attached a copy of the -vvv output.
A couple more observations:
1. When I purge the configuration but leave the rbds in place and rerun, I see the same timeouts. This suggests that the rbd create process has nothing to do with the delay being seen.
2. The issue is seen whether the playbook is run under the root or ansible account.
3. With -vvv, under either root or ansible, the playbook completes properly. Without -vvv ... timeouts!
Talking the issue through with Alfredo
Ran a test playbook to remove the ceph-iscsi logic from the picture:

---
- name: test timing
  hosts: pctest
  tasks:
    - name: sleep across systems for a changing amount of time
      command: sleep {{ item }}
      with_items:
        - 5
        - 10
        - 15
        - 20

The scheduling issue is still there. This confirms that the issue is *not* related to the ceph-iscsi-* rpms/playbooks but is instead build/environmental.

To further test this theory, I installed Ansible 1.9-4 on another box within the BAGL environment (gprfc088) and ran the above playbook. On these servers ansible is scheduling as expected.
Created attachment 1216263 [details] test playbook based on the bash sleep command
Created attachment 1216264 [details] timings showing the delays in scheduling during the test playbook
Created attachment 1216265 [details] test playbook running on different servers in BAGL environment - successful
This is caused by "UserKnownHostsFile=/dev/null" being set in "/etc/ssh/ssh_config". That is the delta between the two ansible hosts. This is not an issue with ceph-iscsi-ansible; it is known and expected behavior when ansible believes you will need to type "yes" to confirm that you know the host, because the known hosts file is not being persisted. As far as I know, ansible serializes its connections in that case so the prompts cannot interleave, which would match the scheduling lag seen here. I believe this should be re-tested and closed by QE once the configuration issue is addressed.
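To compare the ssh client configuration between two ansible hosts, a small check along these lines may help. It is only a sketch: the function names are hypothetical, it looks at a single file's text, and it does not follow Include directives or evaluate Host/Match blocks. For the effective per-host settings, the output of `ssh -G <host>` is a better source.

```python
import re


def known_hosts_settings(config_text):
    """Return every UserKnownHostsFile value found in ssh_config-style text."""
    values = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # ssh_config separates keyword and value with whitespace or '='.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) == 2 and parts[0].lower() == "userknownhostsfile":
            # The option accepts several whitespace-separated file names.
            values.extend(v.strip('"') for v in parts[1].split())
    return values


def discards_host_keys(config_text):
    """True if host keys are sent to /dev/null, i.e. never persisted."""
    return "/dev/null" in known_hosts_settings(config_text)


if __name__ == "__main__":
    # Path taken from the comment above.
    with open("/etc/ssh/ssh_config") as handle:
        print(discards_host_keys(handle.read()))
```

Running this on both the affected host and the working one (gprfc088 in the earlier test) should show the difference, assuming the setting lives in the main file on each.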
Paul,

I don't believe we are seeing this issue anymore. If you feel that any specific QE testing is needed for this, please let us know. Otherwise this is fixed from the QE side.

Thanks,
Tejas
Moving this to CLOSED/NOTABUG due to environment configuration issue.
Agreed - this is not an issue. Jason was spot on in identifying it as a local ssh configuration problem.