Bug 1277250 - gdeploy doesn't provide way to recover from failure during setup (is not idempotent)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gdeploy
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Sachidananda Urs
QA Contact: Anush Shetty
URL:
Whiteboard:
Depends On:
Blocks: 1299184
 
Reported: 2015-11-02 19:47 UTC by Martin Bukatovic
Modified: 2016-06-23 05:29 UTC (History)
5 users

Fixed In Version: gdeploy-2.0-2
Doc Type: Bug Fix
Doc Text:
Previously, when a failure occurred, gdeploy stopped at that point and the user had to manually undo the partially completed operations before restarting the program. This was a cumbersome exercise that required logging in to each machine to fix things by hand. With this release, gdeploy continues from where it left off, skipping all steps that have already completed.
Clone Of:
Environment:
Last Closed: 2016-06-23 05:29:00 UTC
Embargoed:


Links
System ID: Red Hat Product Errata RHEA-2016:1250
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: gdeploy update for Red Hat Gluster Storage 3.1 update 3
Last Updated: 2016-06-23 09:11:59 UTC

Description Martin Bukatovic 2015-11-02 19:47:00 UTC
Description of problem
======================

When gdeploy fails on some unexpected issue, it's not possible to
rerun the operation. One would expect gdeploy to operate in an idempotent[1]
way, since it uses Ansible to execute all operations.

Consider this scenario, where a failure happens on a large trusted storage pool:

 * admin runs gdeploy
 * failure happens and gdeploy stops in the middle
 * admin fixes the issue
 * admin has no obvious/easy way to rerun gdeploy operation to finish setup

[1] "idempotent" means that one can rerun ansible playbooks even when the
system is halfway (or entirely) in desired state, see Ansible docs:

> The resource models are ‘idempotent’ meaning change commands are not run
> unless needed, and Ansible will bring the system back to a desired state
> regardless of the actual state – rather than you having to tell it how to get
> to the state.
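
To make the expectation concrete, here is a minimal sketch of an idempotent Ansible play (hypothetical, not gdeploy's actual playbook; the volume group name RHS_vg1 is made up): the lvg module checks the current state first and only makes changes when needed, so the play can be rerun safely after a failure.

~~~
# Hypothetical play illustrating idempotency; not gdeploy's actual playbook.
# If the volume group already exists, lvg reports "ok" and does nothing,
# so rerunning this play after a partial failure is harmless.
- hosts: gluster_servers
  tasks:
    - name: Create a volume group on the brick device (no-op if it exists)
      lvg: vg=RHS_vg1 pvs=/dev/vdb
~~~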

Version-Release number of selected component (if applicable)
============================================================

gdeploy-1.0-12.el6rhs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a gluster.conf file for gdeploy so that it
   fails somewhere in the middle (see the sketch after these steps)
2. Run gdeploy for the first time: `gdeploy -c gluster.conf`
3. After expected failure, try to rerun the operation again:
   `gdeploy -c gluster.conf`
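
For reference, a sketch of the kind of gluster.conf used in this reproducer. The hostnames, the /dev/vdb device, and the force warning all appear in the output below; the volume name, mount point, and exact option set are assumptions, following the gdeploy 1.x conf format.

~~~
# gluster.conf -- illustrative sketch; hostnames and /dev/vdb are taken
# from the run shown below, the remaining values are examples.

# All nodes of the trusted storage pool
[hosts]
node-128.storage.example.com
node-129.storage.example.com
node-130.storage.example.com
node-131.storage.example.com

# Back-end setup: gdeploy creates a PV/VG/LV and a brick on this device
[devices]
/dev/vdb

# Peer-probe all hosts into one trusted storage pool
[peer]
action=probe

# Create the volume across all hosts (force=yes triggers the warning below)
[volume]
action=create
volname=testvol
force=yes

# FUSE-mount the new volume on a client
[clients]
action=mount
volname=testvol
hosts=node-129.storage.example.com
client_mount_points=/mnt/gluster
~~~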

Actual results
==============

The first gdeploy run fails (as expected, for the sake of reproducing this issue):

~~~
... initial successfully changed tasks skipped ...

TASK: [Start glusterd in all the hosts (if not started already)] ************** 
changed: [node-128.storage.example.com]
changed: [node-131.storage.example.com]
changed: [node-130.storage.example.com]
changed: [node-129.storage.example.com]

PLAY [master] ***************************************************************** 

TASK: [Creates a Trusted Storage Pool] **************************************** 
failed: [node-131.storage.example.com] => {"failed": true}
msg: peer probe: failed: Probe returned with unknown errno 107


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-129.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-130.storage.example.com : ok=12   changed=12   unreachable=0    failed=0   
node-131.storage.example.com : ok=12   changed=12   unreachable=0    failed=1   
~~~

Here I optionally fix the issue and try to rerun gdeploy:

~~~
[root@node-129 ~]# gdeploy -c gluster.conf
/usr/lib/python2.6/site-packages/argparse.py:1575: DeprecationWarning: The "version" argument to ArgumentParser is deprecated. Please use "add_argument(..., action='version', version="N", ...)" instead
  """instead""", DeprecationWarning)

INFO: Back-end setup triggered

Warning: Using mountpoint itself as the brick in one or more hosts since force is specified, although not recommended.

INFO: Peer management(action: probe) triggered
INFO: Volume management(action: create) triggered
INFO: FUSE mount of volume triggered.

PLAY [gluster_servers] ******************************************************** 

TASK: [Create Physical Volume on all the nodes] ******************************* 
failed: [node-129.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-128.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-130.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']
failed: [node-131.storage.example.com] => {"failed": true}
msg: ['/dev/vdb Physical Volume Exists!']

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/ansible_playbooks.retry

node-128.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-129.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-130.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   
node-131.storage.example.com : ok=0    changed=0    unreachable=0    failed=1   

[root@node-129 ~]# gdeploy -c gluster.conf
~~~

Expected results
================

The second run should pass over the steps that were already completed
(reporting them with Ansible's "ok" state), making it possible to recover
from the failure and complete the setup.
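
One common Ansible pattern for achieving this (a hypothetical sketch, not necessarily how gdeploy implements the fix): probe the current state first, then only run the change command when needed, so a rerun reports "ok" instead of failing on the already-existing physical volume.

~~~
# Hypothetical tasks illustrating the recovery pattern; gdeploy's actual
# implementation may differ. `pvs <device>` exits non-zero when the device
# is not yet a physical volume.
- name: Check whether /dev/vdb is already a Physical Volume
  command: pvs --noheadings /dev/vdb
  register: pv_check
  failed_when: false
  changed_when: false

- name: Create Physical Volume only when it does not exist yet
  command: pvcreate /dev/vdb
  when: pv_check.rc != 0
~~~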

Comment 1 Nandaja Varma 2015-11-09 05:26:37 UTC
This will be fixed in the next release. We will ignore the errors that come up when rerunning the same config.

Comment 2 Nandaja Varma 2015-11-24 07:09:26 UTC
Fixed in release 1.1.

Comment 4 Anush Shetty 2016-04-12 07:57:51 UTC
Reruns are possible with gdeploy-2.0-5.el7rhgs.noarch. Marking this as verified.

Comment 6 errata-xmlrpc 2016-06-23 05:29:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1250

