Bug 1277250
Summary: | gdeploy doesn't provide way to recover from failure during setup (is not idempotent) | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Martin Bukatovic <mbukatov>
Component: | gdeploy | Assignee: | Sachidananda Urs <surs>
Status: | CLOSED ERRATA | QA Contact: | Anush Shetty <ashetty>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | bmohanra, nvarma, rcyriac, rhinduja, smohan
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.1.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | gdeploy-2.0-2 | Doc Type: | Bug Fix
Doc Text: | In case of failure, gdeploy would stop at the point of the error, and the user had to log in to the machines and manually undo the partially completed operations before restarting the program, which was a cumbersome manual exercise. With this release, gdeploy continues from where it left off, and all steps that completed earlier are skipped. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2016-06-23 05:29:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1299184 | |
This will be fixed in the next release. We will ignore errors that occur while rerunning the same config again.

Fixed in release 1.1.

Reruns are possible with gdeploy-2.0-5.el7rhgs.noarch. Marking this as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1250
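The resume behaviour described above — continue from where the previous run stopped, skipping steps that already completed — can be sketched as a runner that persists the names of finished steps. This is only an illustration of the pattern; the `run_steps` helper and the JSON state file are hypothetical, not gdeploy's actual implementation:

```python
import json
import os


def run_steps(steps, state_file):
    """Run (name, fn) steps in order, skipping any recorded as done.

    On failure the state file keeps the list of completed steps, so a
    rerun resumes right after the last successful one instead of
    redoing (and tripping over) work that already happened.
    """
    done = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue                      # completed in an earlier run
        fn()                              # may raise and abort this run
        done.append(name)
        with open(state_file, "w") as f:  # persist progress after each step
            json.dump(done, f)
```

On the first invocation every step runs; after a failure, a second invocation with the same state file skips the steps already marked done and retries only the failed one and those after it.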
Description of problem
======================

When gdeploy fails on some unexpected issue, it's not possible to rerun the operation. One would expect gdeploy to operate in an idempotent[1] way, since it uses Ansible to execute all operations.

Consider this scenario, where a failure happens on a large trusted storage pool:

* admin runs gdeploy
* failure happens and gdeploy stops in the middle
* admin fixes the issue
* admin has no obvious/easy way to rerun the gdeploy operation to finish setup

[1] "idempotent" means that one can rerun Ansible playbooks even when the system is halfway (or entirely) in the desired state, see Ansible docs:

> The resource models are 'idempotent' meaning change commands are not run
> unless needed, and Ansible will bring the system back to a desired state
> regardless of the actual state – rather than you having to tell it how to get
> to the state.

Version-Release number of selected component (if applicable)
============================================================

gdeploy-1.0-12.el6rhs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a gluster.conf file for gdeploy so that it fails somewhere in the middle
2. Run gdeploy for the first time: `gdeploy -c gluster.conf`
3. After the expected failure, try to rerun the operation: `gdeploy -c gluster.conf`

Actual results
==============

The first gdeploy run fails (as expected for the sake of reproducing this issue):

~~~
... initial successfully changed tasks skipped ...
TASK: [Start glusterd in all the hosts (if not started already)] **************
changed: [node-128.storage.example.com]
changed: [node-131.storage.example.com]
changed: [node-130.storage.example.com]
changed: [node-129.storage.example.com]

PLAY [master] *****************************************************************

TASK: [Creates a Trusted Storage Pool] ****************************************
failed: [node-131.storage.example.com] => {"failed": true}
msg: peer probe: failed: Probe returned with unknown errno 107

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-129.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-130.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-131.storage.example.com : ok=12  changed=12  unreachable=0  failed=1
~~~

Here I optionally fix the issue and try to rerun gdeploy:

~~~
[root@node-129 ~]# gdeploy -c gluster.conf
/usr/lib/python2.6/site-packages/argparse.py:1575: DeprecationWarning: The "version" argument to ArgumentParser is deprecated. Please use "add_argument(..., action='version', version="N", ...)" instead
  """instead""", DeprecationWarning)
INFO: Back-end setup triggered
Warning: Using mountpoint itself as the brick in one or more hosts since force is specified, although not recommended.
INFO: Peer management(action: probe) triggered
INFO: Volume management(action: create) triggered
INFO: FUSE mount of volume triggered.
PLAY [gluster_servers] ********************************************************

TASK: [Create Physical Volume on all the nodes] *******************************
failed: [node-129.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-128.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-130.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-131.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-129.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-130.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-131.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
[root@node-129 ~]# gdeploy -c gluster.conf
~~~

Expected results
================

The 2nd run goes through the steps which were already successfully completed with Ansible "ok" state, making it possible to recover from the failure and complete the setup.
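The rerun fails because the "Create Physical Volume" task treats an already-existing PV as an error rather than as the desired state already being met. An idempotent task instead checks current state first and reports ok/unchanged when there is nothing to do. A minimal sketch of that check-before-change pattern — `ensure_pv` and the in-memory `pvs` set are hypothetical stand-ins for running `pvcreate`, not gdeploy code:

```python
def ensure_pv(device, existing_pvs):
    """Idempotently 'create' a physical volume on device.

    Returns an Ansible-style result: changed=True only when work was
    actually done; an already-existing PV is 'ok', not a failure.
    """
    if device in existing_pvs:
        return {"changed": False, "msg": "%s already a PV" % device}
    existing_pvs.add(device)             # stand-in for running pvcreate
    return {"changed": True, "msg": "created PV on %s" % device}


pvs = set()
first = ensure_pv("/dev/vdb", pvs)   # first run: does the work
second = ensure_pv("/dev/vdb", pvs)  # rerun: no-op instead of failing
```

With this shape, a rerun after a mid-setup failure reports every completed step as unchanged and proceeds to the step that actually needs doing.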