Bug 1588357 - Sometimes import flow and unmanage flow is failing
Summary: Sometimes import flow and unmanage flow is failing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-commons
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1570048
Reported: 2018-06-07 06:47 UTC by gowtham
Modified: 2018-09-04 07:08 UTC (History)

Fixed In Version: tendrl-commons-1.6.3-7.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:07:28 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | https://github.com/Tendrl/commons/issues/983 | 0 | None | None | None | 2018-06-07 06:49:28 UTC
Red Hat Product Errata | RHSA-2018:2616 | 0 | None | None | None | 2018-09-04 07:08:29 UTC

Description gowtham 2018-06-07 06:47:31 UTC
Description of problem:
If a setup keeps running without interruption, the unmanage and import flows work fine. If node-agent is stopped, or the node or server is down, for some time, the import and unmanage flows fail.

Version-Release number of selected component (if applicable):


How reproducible:
It is reproducible using the following steps:

Steps to Reproduce:
1. Stop node-agent on the server and on a few storage nodes for 200 sec. After 200 sec the status watcher field in the node_context object is deleted. The nodes are not marked as down, because only the server monitors the nodes and marks them as down, and in this case the server is also down. So the status watcher field for the node_context object in etcd is not recreated.

2. Start node-agent on the server and on the other storage nodes. The node-agent sync then fails with this exception:
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: Traceback (most recent call last):
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: self.run()
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/node_agent/node_sync/__init__.py", line 52, in run
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: NS.node_context.save(ttl=_sync_ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/node_context/__init__.py", line 98, in save
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: etcd_utils.refresh(status, ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/etcd_utils.py", line 82, in refresh
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: raise ex
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: EtcdKeyNotFound: Key not found : /nodes/3d389d10-2e02-43f1-9e40-9104825798a4/NodeContext/status

3. The status watcher field no longer exists, but node-agent still tries to refresh it, so a KeyNotFound exception is raised.

4. If an import was already done, fire unmanage. It fails: the job is not picked up by the nodes, so it times out.

5. If import was not done yet, fire import. It also fails for the same reason.

6. node-agent is effectively down, so all flows fail.
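
The failure in steps 1 to 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tendrl-commons code and not necessarily what the upstream fix does: `FakeTTLStore`, `KeyNotFound`, `save_status` and the "UP" status value are hypothetical stand-ins, and the store is an in-memory imitation of etcd keys with a TTL. It shows a refresh failing once the key has expired, and one possible way to cope with that by recreating the key.

```python
import time


class KeyNotFound(Exception):
    """Hypothetical stand-in for the EtcdKeyNotFound seen in the traceback."""


class FakeTTLStore(object):
    """In-memory imitation of etcd keys with a TTL (not the real client)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def write(self, key, value, ttl):
        # Create or overwrite the key and set its expiry.
        self._data[key] = (value, self._clock() + ttl)

    def read(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            # Expired keys behave as if they were deleted.
            self._data.pop(key, None)
            raise KeyNotFound(key)
        return entry[0]

    def refresh(self, key, ttl):
        # Only extends an existing key; raises KeyNotFound otherwise.
        value = self.read(key)
        self.write(key, value, ttl)


def save_status(store, key, ttl, status="UP"):
    """Refresh the status key, recreating it if it has already expired."""
    try:
        store.refresh(key, ttl)
    except KeyNotFound:
        # Assumption: recreating the key is acceptable once the agent is back.
        store.write(key, status, ttl)


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo does not need to sleep
    store = FakeTTLStore(clock=lambda: now[0])
    key = "/nodes/example-node/NodeContext/status"  # illustrative key only
    store.write(key, "UP", ttl=200)
    now[0] += 201  # node-agent was stopped for longer than the TTL
    save_status(store, key, ttl=200)  # a bare refresh would raise here
    print(store.read(key))
```

With a bare `refresh` call the sync would raise as in the traceback above; the fallback to `write` is one option, chosen here only because it keeps the sketch short.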


Actual results:
When I stop the server and some storage nodes for some time and then start them again, the import and unmanage flows fail.


Expected results:
Import and unmanage should work once the server and nodes are up again.

Additional info:

Comment 2 gowtham 2018-06-07 06:52:43 UTC
The upstream PR for this is merged: https://github.com/Tendrl/commons/pull/984

Comment 3 Martin Bukatovic 2018-06-07 07:01:11 UTC
(In reply to gowtham from comment #0)
> Version-Release number of selected component (if applicable):

Could you provide the versions of the affected builds by running the following
on a machine from the cluster:

```
# rpm -qa | grep tendrl | sort
```

Comment 6 Nishanth Thomas 2018-06-07 08:25:14 UTC
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 10 Filip Balák 2018-06-25 09:03:45 UTC
I tested the reproducer from this BZ and BZ 1570048 a few times and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 12 errata-xmlrpc 2018-09-04 07:07:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

