Description of problem
======================

When gdeploy hits some unexpected issue and fails, it's not possible to rerun
the operation. One would expect gdeploy to operate in an idempotent[1] way,
since it uses Ansible to execute all operations.

Consider this scenario, where a failure happens on a large trusted storage pool:

* admin runs gdeploy
* a failure happens and gdeploy stops in the middle
* admin fixes the issue
* admin has no obvious/easy way to rerun the gdeploy operation to finish the setup

[1] "idempotent" means that one can rerun Ansible playbooks even when the
system is already halfway (or entirely) in the desired state; see the Ansible
docs:

> The resource models are ‘idempotent’ meaning change commands are not run
> unless needed, and Ansible will bring the system back to a desired state
> regardless of the actual state – rather than you having to tell it how to get
> to the state.

Version-Release number of selected component (if applicable)
=============================================================
gdeploy-1.0-12.el6rhs.noarch

How reproducible
================
100 %

Steps to Reproduce
==================
1. Create a gluster.conf file for gdeploy so that it fails somewhere in the middle
2. Run gdeploy for the first time: `gdeploy -c gluster.conf`
3. After the expected failure, try to rerun the operation: `gdeploy -c gluster.conf`

Actual results
==============

The first gdeploy run fails (as expected, for the sake of reproducing this issue):

~~~
... initial successfully changed tasks skipped ...

TASK: [Start glusterd in all the hosts (if not started already)] **************
changed: [node-128.storage.example.com]
changed: [node-131.storage.example.com]
changed: [node-130.storage.example.com]
changed: [node-129.storage.example.com]

PLAY [master] *****************************************************************

TASK: [Creates a Trusted Storage Pool] ****************************************
failed: [node-131.storage.example.com] => {"failed": true}
msg: peer probe: failed: Probe returned with unknown errno 107

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=12   changed=12   unreachable=0   failed=0
node-129.storage.example.com : ok=12   changed=12   unreachable=0   failed=0
node-130.storage.example.com : ok=12   changed=12   unreachable=0   failed=0
node-131.storage.example.com : ok=12   changed=12   unreachable=0   failed=1
~~~

Here I fix the issue (if needed) and try to rerun gdeploy:

~~~
[root@node-129 ~]# gdeploy -c gluster.conf
/usr/lib/python2.6/site-packages/argparse.py:1575: DeprecationWarning: The "version" argument to ArgumentParser is deprecated. Please use "add_argument(..., action='version', version="N", ...)" instead
  """instead""", DeprecationWarning)
INFO: Back-end setup triggered
Warning: Using mountpoint itself as the brick in one or more hosts since force is specified, although not recommended.
INFO: Peer management(action: probe) triggered
INFO: Volume management(action: create) triggered
INFO: FUSE mount of volume triggered.

PLAY [gluster_servers] ********************************************************

TASK: [Create Physical Volume on all the nodes] *******************************
failed: [node-129.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-128.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-130.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-131.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=0   changed=0   unreachable=0   failed=1
node-129.storage.example.com : ok=0   changed=0   unreachable=0   failed=1
node-130.storage.example.com : ok=0   changed=0   unreachable=0   failed=1
node-131.storage.example.com : ok=0   changed=0   unreachable=0   failed=1

[root@node-129 ~]# gdeploy -c gluster.conf
~~~

Expected results
================
The 2nd run should go through the steps that were already completed successfully
(Ansible "ok" state), making it possible to recover from the failure and complete
the setup.
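For illustration only, here is a minimal sketch of how a task like the failing
"Create Physical Volume on all the nodes" one could be written idempotently, so
that a rerun reports "ok"/"skipped" instead of "failed". This is not gdeploy's
actual playbook; the group name, device path and task name are simply taken from
the output above:

~~~
# Hypothetical, simplified playbook (NOT gdeploy's actual one): create the
# physical volume only when it does not exist yet, so a rerun does not fail
# with "Physical Volume Exists!".
- hosts: gluster_servers
  become: true
  tasks:
    - name: Check whether /dev/vdb is already a physical volume
      command: pvs /dev/vdb
      register: pv_check
      failed_when: false      # a missing PV is expected on the first run
      changed_when: false

    - name: Create Physical Volume on all the nodes
      command: pvcreate /dev/vdb
      when: pv_check.rc != 0  # already done -> task is skipped, play continues
~~~

With a pattern like this, the second run would simply skip the create step on
nodes where the work was already done, which is the behaviour described above.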
This will be fixed in the next release. We will ignore the errors that come up when rerunning the same config again.
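As a task-level sketch of what "ignoring the errors on a rerun" could look like
at the Ansible level (an illustration of the approach only, not the actual
gdeploy change):

~~~
# Hypothetical illustration, not the actual gdeploy fix: keep the play going
# when the task fails because the work was already done on a previous run
# (e.g. "/dev/vdb Physical Volume Exists!").
- name: Create Physical Volume on all the nodes
  command: pvcreate /dev/vdb
  register: pv_create
  ignore_errors: true   # a stricter variant would use failed_when to match
                        # only the "already exists" error text
~~~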
Fixed in release 1.1.
Reruns are possible with gdeploy-2.0-5.el7rhgs.noarch. Marking this as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1250