Bug 1277250 - gdeploy doesn't provide way to recover from failure during setup (is not idempotent)
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gdeploy
Version: 3.1
Hardware: Unspecified  OS: Unspecified
Priority: unspecified  Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assigned To: Sachidananda Urs
QA Contact: Anush Shetty
Keywords: ZStream
Depends On:
Blocks: 1299184
Reported: 2015-11-02 14:47 EST by Martin Bukatovic
Modified: 2016-06-23 01:29 EDT (History)
CC: 5 users

See Also:
Fixed In Version: gdeploy-2.0-2
Doc Type: Bug Fix
Doc Text:
Previously, when a failure occurred, gdeploy stopped at that point and the user had to manually undo the partially completed operations before rerunning the program. This was a cumbersome manual exercise that required logging in to each machine. With this release, gdeploy continues from where it left off, skipping all previously completed steps.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-23 01:29:00 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Martin Bukatovic 2015-11-02 14:47:00 EST
Description of problem
======================

When gdeploy fails on some unexpected issue, it's not possible to
rerun the operation. One would expect gdeploy to operate in an
idempotent[1] way, since it uses Ansible to execute all operations.

Consider this scenario, in which a failure happens on a large trusted storage pool:

 * admin runs gdeploy
 * failure happens and gdeploy stops in the middle
 * admin fixes the issue
 * admin has no obvious/easy way to rerun gdeploy operation to finish setup

[1] "idempotent" means that one can rerun ansible playbooks even when the
system is halfway (or entirely) in desired state, see Ansible docs:

> The resource models are ‘idempotent’ meaning change commands are not run
> unless needed, and Ansible will bring the system back to a desired state
> regardless of the actual state – rather than you having to tell it how to get
> to the state.
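
As a minimal illustration of the difference, in plain shell rather than gdeploy (a throwaway directory stands in for the provisioning step, analogous to `pvcreate` failing on a rerun because the physical volume already exists):

```shell
#!/bin/sh
# Stand-in for a provisioning step: creating a brick directory.
dir=$(mktemp -d)/brick

# Non-idempotent form: plain mkdir errors out when the state already
# exists, just like rerunning pvcreate on an existing physical volume.
mkdir "$dir"
mkdir "$dir" 2>/dev/null || echo "rerun failed: already exists"

# Idempotent form: mkdir -p converges on the desired state regardless
# of the current state, so reruns are safe.
mkdir -p "$dir"
mkdir -p "$dir" && echo "rerun ok"
```

An idempotent tool applies the second pattern to every step: check (or tolerate) the current state, and only change what is not already in the desired state.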

Version-Release number of selected component (if applicable)
============================================================

gdeploy-1.0-12.el6rhs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a gluster.conf file for gdeploy so that it
   fails somewhere in the middle
2. Run gdeploy for the first time: `gdeploy -c gluster.conf`
3. After expected failure, try to rerun the operation again:
   `gdeploy -c gluster.conf`
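
A gluster.conf for this scenario might look roughly like the following. The hostnames and /dev/vdb match the logs below; the section and key names are my assumptions about the gdeploy 1.x format, not taken from this report, so check them against your gdeploy version:

```ini
[hosts]
node-128.storage.example.com
node-129.storage.example.com
node-130.storage.example.com
node-131.storage.example.com

[devices]
/dev/vdb

[volume]
action=create
volname=glustervol
```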

Actual results
==============

First gdeploy fails (as expected for the sake of reproducing this issue):

~~~
... initial successfully changed tasks skipped ...

TASK: [Start glusterd in all the hosts (if not started already)] ************** 
changed: [node-128.storage.example.com]
changed: [node-131.storage.example.com]
changed: [node-130.storage.example.com]
changed: [node-129.storage.example.com]

PLAY [master] ***************************************************************** 

TASK: [Creates a Trusted Storage Pool] **************************************** 
failed: [node-131.storage.example.com] => {"failed": true}
msg: peer probe: failed: Probe returned with unknown errno 107


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-129.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-130.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-131.storage.example.com : ok=12   changed=12   unreachable=0    failed=1   
~~~

At this point I optionally fix the issue and try to rerun gdeploy:

~~~
[root@node-129 ~]# gdeploy -c gluster.conf
/usr/lib/python2.6/site-packages/argparse.py:1575: DeprecationWarning: The "version" argument to ArgumentParser is deprecated. Please use "add_argument(..., action='version', version="N", ...)" instead
  """instead""", DeprecationWarning)

INFO: Back-end setup triggered

Warning: Using mountpoint itself as the brick in one or more hosts since force is specified, although not recommended.

INFO: Peer management(action: probe) triggered
INFO: Volume management(action: create) triggered
INFO: FUSE mount of volume triggered.

PLAY [gluster_servers] ******************************************************** 

TASK: [Create Physical Volume on all the nodes] ******************************* 
failed: [node-129.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-128.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-130.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-131.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-129.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-130.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-131.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   

[root@node-129 ~]# gdeploy -c gluster.conf
~~~
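
With gdeploy-1.0, the only way out was to manually undo the partial backend setup on every node before rerunning. A dry-run sketch of that cleanup follows; the device /dev/vdb is from the logs above, but the volume group name, brick mount point, and the exact set of commands are assumptions, so verify them against your own layout before running anything for real:

```shell
#!/bin/sh
# Dry-run sketch of the per-node manual cleanup an admin had to perform
# before rerunning gdeploy. VG and BRICK are hypothetical names.
DEVICE=/dev/vdb
VG=gluster_vg        # hypothetical volume group name
BRICK=/mnt/brick     # hypothetical brick mount point

run() {
    # Dry run: only print the command. Drop the echo to execute it.
    echo "would run: $*"
}

run umount "$BRICK"          # unmount the brick filesystem
run lvremove -ff "$VG"       # remove logical volumes / thin pools
run vgremove -ff "$VG"       # remove the volume group
run pvremove -ff "$DEVICE"   # remove the physical volume label
run wipefs -a "$DEVICE"      # clear any remaining signatures
```

Only after the device was returned to a clean state would `gdeploy -c gluster.conf` get past the "Create Physical Volume" task again.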

Expected results
================

The second run should pass through the steps that already completed
successfully (reporting Ansible's "ok" state), making it possible to
recover from the failure and finish the setup.
Comment 1 Nandaja Varma 2015-11-09 00:26:37 EST
This will be fixed in the next release. We will ignore the errors that come up when rerunning the same config.
Comment 2 Nandaja Varma 2015-11-24 02:09:26 EST
Fixed in release 1.1.
Comment 4 Anush Shetty 2016-04-12 03:57:51 EDT
Reruns are possible with gdeploy-2.0-5.el7rhgs.noarch. Marking this as verified.
Comment 6 errata-xmlrpc 2016-06-23 01:29:00 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1250
