Bug 1277250
Summary: | gdeploy doesn't provide way to recover from failure during setup (is not idempotent) | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Martin Bukatovic <mbukatov>
Component: | gdeploy | Assignee: | Sachidananda Urs <surs>
Status: | CLOSED ERRATA | QA Contact: | Anush Shetty <ashetty>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | rhgs-3.1 | CC: | bmohanra, nvarma, rcyriac, rhinduja, smohan
Target Milestone: | --- | Keywords: | ZStream
Target Release: | RHGS 3.1.3 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | gdeploy-2.0-2 | Doc Type: | Bug Fix
Doc Text: | In case of failure, gdeploy would stop at the point of the error, and the user had to log in to the machines and manually undo the partially completed operations before restarting the program, which was a cumbersome manual exercise. With this release, gdeploy continues from where it left off, and all steps that completed earlier are skipped. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2016-06-23 05:29:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1299184 | |
This will be fixed in the next release. We will ignore errors that occur while rerunning the same config again.

Fixed in release 1.1.

Reruns are possible with gdeploy-2.0-5.el7rhgs.noarch. Marking this as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1250
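The resume behaviour described above — continue from where the previous run stopped, skipping steps that already completed — can be sketched as a runner that persists the names of finished steps. This is only an illustration of the pattern; the `run_steps` helper and the JSON state file are hypothetical, not gdeploy's actual implementation:

```python
import json
import os


def run_steps(steps, state_file):
    """Run (name, fn) steps in order, skipping any recorded as done.

    On failure the state file keeps the list of completed steps, so a
    rerun resumes right after the last successful one instead of
    redoing (and tripping over) work that already happened.
    """
    done = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue                      # completed in an earlier run
        fn()                              # may raise and abort this run
        done.append(name)
        with open(state_file, "w") as f:  # persist progress after each step
            json.dump(done, f)
```

On the first invocation every step runs; after a failure, a second invocation with the same state file skips the steps already marked done and retries only the failed one and those after it.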
Description of problem
======================

When gdeploy fails on some unexpected issue, it's not possible to rerun the operation. One would expect gdeploy to operate in an idempotent[1] way, since it uses Ansible to execute all operations.

Consider this scenario, where a failure happens on a large trusted storage pool:

* admin runs gdeploy
* failure happens and gdeploy stops in the middle
* admin fixes the issue
* admin has no obvious/easy way to rerun the gdeploy operation to finish setup

[1] "idempotent" means that one can rerun Ansible playbooks even when the system is halfway (or entirely) in the desired state, see Ansible docs:

> The resource models are 'idempotent' meaning change commands are not run
> unless needed, and Ansible will bring the system back to a desired state
> regardless of the actual state – rather than you having to tell it how to get
> to the state.

Version-Release number of selected component (if applicable)
============================================================

gdeploy-1.0-12.el6rhs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a gluster.conf file for gdeploy so that it fails somewhere in the middle
2. Run gdeploy for the first time: `gdeploy -c gluster.conf`
3. After the expected failure, try to rerun the operation: `gdeploy -c gluster.conf`

Actual results
==============

The first gdeploy run fails (as expected for the sake of reproducing this issue):

~~~
... initial successfully changed tasks skipped ...
TASK: [Start glusterd in all the hosts (if not started already)] **************
changed: [node-128.storage.example.com]
changed: [node-131.storage.example.com]
changed: [node-130.storage.example.com]
changed: [node-129.storage.example.com]

PLAY [master] *****************************************************************

TASK: [Creates a Trusted Storage Pool] ****************************************
failed: [node-131.storage.example.com] => {"failed": true}
msg: peer probe: failed: Probe returned with unknown errno 107

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-129.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-130.storage.example.com : ok=12  changed=12  unreachable=0  failed=0
node-131.storage.example.com : ok=12  changed=12  unreachable=0  failed=1
~~~

Here I optionally fix the issue and try to rerun gdeploy:

~~~
[root@node-129 ~]# gdeploy -c gluster.conf
/usr/lib/python2.6/site-packages/argparse.py:1575: DeprecationWarning: The "version" argument to ArgumentParser is deprecated. Please use "add_argument(..., action='version', version="N", ...)" instead
  """instead""", DeprecationWarning)
INFO: Back-end setup triggered
Warning: Using mountpoint itself as the brick in one or more hosts since force is specified, although not recommended.
INFO: Peer management(action: probe) triggered
INFO: Volume management(action: create) triggered
INFO: FUSE mount of volume triggered.
PLAY [gluster_servers] ********************************************************

TASK: [Create Physical Volume on all the nodes] *******************************
failed: [node-129.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-128.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-130.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-131.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-129.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-130.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
node-131.storage.example.com : ok=0  changed=0  unreachable=0  failed=1
[root@node-129 ~]# gdeploy -c gluster.conf
~~~

Expected results
================

The 2nd run goes through the steps which were already successfully completed with Ansible "ok" state, making it possible to recover from the failure and complete the setup.
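The rerun fails because the "Create Physical Volume" task treats an already-existing PV as an error rather than as the desired state already being met. An idempotent task instead checks current state first and reports ok/unchanged when there is nothing to do. A minimal sketch of that check-before-change pattern — `ensure_pv` and the in-memory `pvs` set are hypothetical stand-ins for running `pvcreate`, not gdeploy code:

```python
def ensure_pv(device, existing_pvs):
    """Idempotently 'create' a physical volume on device.

    Returns an Ansible-style result: changed=True only when work was
    actually done; an already-existing PV is 'ok', not a failure.
    """
    if device in existing_pvs:
        return {"changed": False, "msg": "%s already a PV" % device}
    existing_pvs.add(device)             # stand-in for running pvcreate
    return {"changed": True, "msg": "created PV on %s" % device}


pvs = set()
first = ensure_pv("/dev/vdb", pvs)   # first run: does the work
second = ensure_pv("/dev/vdb", pvs)  # rerun: no-op instead of failing
```

With this shape, a rerun after a mid-setup failure reports every completed step as unchanged and proceeds to the step that actually needs doing.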