Bug 1390025 - [ceph-iscsi-ansible] lun creation is not running in parallel, causing task failures
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat
Component: ceph-ansible
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 2
Assignee: Paul Cuzner
QA Contact: Tejas
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-30 22:58 UTC by Paul Cuzner
Modified: 2016-11-15 04:33 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-03 17:14:47 UTC
Target Upstream Version:
kurs: needinfo+


Attachments
playbook run with -f 10 and -vvv (28.19 KB, text/plain)
2016-10-31 04:01 UTC, Paul Cuzner
test playbook based on the bash sleep command (233 bytes, text/plain)
2016-11-01 23:23 UTC, Paul Cuzner
timings showing the delays in scheduling during the test playbook (2.31 KB, text/plain)
2016-11-01 23:25 UTC, Paul Cuzner
test playbook running on different servers in BAGL environment - successful (2.85 KB, text/plain)
2016-11-01 23:27 UTC, Paul Cuzner

Description Paul Cuzner 2016-10-30 22:58:08 UTC
Description of problem:
LUN allocation tasks should run simultaneously across the gateways, up to the ansible fork limit. In the BAGL environment this isn't happening - instead the task is scheduled on one node, and the other node does not receive the job until the first node completes.

Each LUN allocation task has an owning gateway, which performs the rbd create/resize and updates the config object. When the tasks run serially rather than in parallel, the non-owning host times out waiting for the config update to happen.
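
The timeout described above can be illustrated with a small timing model. This is only a sketch: the function name, its parameters, and the 10-second and 30-second figures are hypothetical and are not taken from ceph-iscsi-ansible.

```python
def non_owner_times_out(owner_start, owner_duration, non_owner_start, timeout):
    """Return True if the non-owning gateway gives up before the owner finishes.

    All arguments are seconds measured from the start of the task.
    The values used below are illustrative assumptions only.
    """
    config_updated_at = owner_start + owner_duration
    non_owner_deadline = non_owner_start + timeout
    return config_updated_at > non_owner_deadline


# Parallel scheduling: both gateways start together, the owner needs 10s.
print(non_owner_times_out(0, 10, 0, 30))   # False

# Serial scheduling: the owner only starts once the non-owner's job has
# ended, which here means after the non-owner has used its whole timeout.
print(non_owner_times_out(30, 10, 0, 30))  # True
```

In this model, if the non-owning gateway is scheduled first and the owner is held back until it finishes, the wait can never succeed, whatever the timeout value.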



Version-Release number of selected component (if applicable):
1.3+

How reproducible:
Only seen so far in BAGL cluster

Steps to Reproduce:
1. Run a normal playbook to allocate LUNs

Actual results:
- timeouts are observed during the allocate task.
- the times at which the jobs start on the gateways (journalctl -t ansible-igw_lun) show a lag between job starts.

Expected results:
All jobs within a task should be scheduled simultaneously, up to the fork limit set in /etc/ansible/ansible.cfg, so this issue should not happen.



Additional info:

Comment 2 Paul Cuzner 2016-10-31 04:01:36 UTC
Created attachment 1215663
playbook run with -f 10 and -vvv

I ran this playbook with -f 10 -vvv and it worked.
I then purged the config and reran it ... same issue.

Attached a copy of the -vvv output

Comment 3 Paul Cuzner 2016-10-31 04:14:28 UTC
A few more observations:

1. When I purge the configuration but leave the rbd images in place and rerun, I see the same timeouts. This suggests that the rbd create process has nothing to do with the delay being seen.
2. The issue is seen whether the playbook is run under the root or ansible account.
3. With -vvv, under either root or ansible, the playbook completes properly. Without -vvv ... timeouts!

Comment 5 Paul Cuzner 2016-11-01 20:32:13 UTC
Talking the issue through with Alfredo

Comment 6 Paul Cuzner 2016-11-01 23:22:17 UTC
Ran a test playbook to remove the ceph-iscsi logic from the picture

---
  - name: test timing
    hosts: pctest
    tasks:
      - name: sleep across systems for a changing amount of time
        command: sleep {{ item }}
        with_items:
          - 5
          - 10
          - 15
          - 20

The scheduling issue is still there. This is confirmation that the issue is *not* related to the ceph-iscsi-* rpms/playbooks but is instead build/environmental.

To further test this theory, I installed Ansible 1.9-4 on another box within the BAGL environment (gprfc088) and ran the above playbook. On these servers, ansible schedules the jobs as expected.
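
As a rough guide to what the test playbook should show, the expected wall-clock time can be estimated as below. This is a sketch under stated assumptions: the function name is hypothetical, the pctest group is assumed to hold two hosts, each host is assumed to run its with_items loop one item at a time, and SSH and module overhead are ignored.

```python
import math


def expected_wall_time(sleep_items, host_count, forks):
    """Estimate the task's wall-clock time in seconds.

    Assumes each host works through the item list sequentially and that
    up to `forks` hosts run at the same time (overhead is ignored).
    """
    per_host = sum(sleep_items)              # one host runs every item in turn
    batches = math.ceil(host_count / forks)  # hosts beyond the fork limit wait
    return batches * per_host


items = [5, 10, 15, 20]

# Two hosts (an assumption) scheduled in parallel.
print(expected_wall_time(items, host_count=2, forks=10))  # 50

# The same two hosts when scheduling is effectively serialized.
print(expected_wall_time(items, host_count=2, forks=1))   # 100
```

A run that takes close to the second figure, despite a fork limit above the host count, points to serialized scheduling.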

Comment 7 Paul Cuzner 2016-11-01 23:23:31 UTC
Created attachment 1216263
test playbook based on the bash sleep command

Comment 8 Paul Cuzner 2016-11-01 23:25:06 UTC
Created attachment 1216264
timings showing the delays in scheduling during the test playbook

Comment 9 Paul Cuzner 2016-11-01 23:27:48 UTC
Created attachment 1216265
test playbook running on different servers in BAGL environment - successful

Comment 11 Jason Dillaman 2016-11-02 16:25:41 UTC
This is caused by "UserKnownHostsFile=/dev/null" being set in "/etc/ssh/ssh_config". That is the delta between the two ansible hosts. This is not an issue with ceph-iscsi-ansible; it is known and expected behavior when ansible believes you will need to type "yes" to confirm that you know the host, because the known hosts file is not persisted. I believe this should be re-tested and closed by QE once the configuration issue is addressed.
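
To check other ansible hosts for the same setting, something like the following could be used. It is a minimal sketch: the function name is hypothetical, and the parsing is deliberately simplified (it ignores quoting, Include directives, and per-user config files).

```python
import re


def find_discarded_known_hosts(config_text):
    """Return (line_number, value) pairs where UserKnownHostsFile is /dev/null.

    Simplified ssh_config parsing: keywords are case-insensitive and are
    separated from their arguments by whitespace or a single '='.
    """
    hits = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) != 2:
            continue  # keyword without an argument
        keyword, value = parts
        if keyword.lower() == "userknownhostsfile" and "/dev/null" in value.split():
            hits.append((lineno, value))
    return hits


# Example input; on a real host, read /etc/ssh/ssh_config instead.
sample = """Host *
    UserKnownHostsFile=/dev/null
    StrictHostKeyChecking no
# UserKnownHostsFile /dev/null
"""
print(find_discarded_known_hosts(sample))  # [(2, '/dev/null')]
```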

Comment 12 Tejas 2016-11-03 16:41:35 UTC
Paul,

    I don't believe we are seeing this issue anymore. If you feel that any specific QE testing is needed for this, please let us know.
Otherwise, this is fixed from the QE side.

Thanks,
Tejas

Comment 13 Jason Dillaman 2016-11-03 17:14:47 UTC
Moving this to CLOSED/NOTABUG due to environment configuration issue.

Comment 14 Paul Cuzner 2016-11-15 04:33:17 UTC
Agree - this is not an issue. Jason was spot on in identifying it as a local ssh configuration problem.

