Bug 1588357

Summary: Sometimes import flow and unmanage flow is failing
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: gowtham <gshanmug>
Component: web-admin-tendrl-commons Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA QA Contact: Filip Balák <fbalak>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: fbalak, gshanmug, mbukatov, nthomas, rhs-bugs, sankarshan
Target Milestone: --- Keywords: TestBlocker
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-commons-1.6.3-7.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 07:07:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1503137, 1570048    

Description gowtham 2018-06-07 06:47:31 UTC
Description of problem:
If a setup keeps running without interruption, the unmanage and import flows work fine. If node-agent is stopped, or a node or the server is down for some time, the import and unmanage flows fail.

Version-Release number of selected component (if applicable):


How reproducible:
It is reproducible using the following steps:

Steps to Reproduce:
1. Stop node-agent on the server and a few storage nodes for 200 seconds. After 200 seconds, the status watcher field in the node_context object is deleted. The nodes are not marked as down, because only the server monitors the nodes and marks them as down, and in this case the server is down as well. So the status watcher field for the node_context object in etcd is not created.

2. Start node-agent on the server and the other storage nodes. The node-agent sync then fails with this exception:
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: Traceback (most recent call last):
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: self.run()
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/node_agent/node_sync/__init__.py", line 52, in run
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: NS.node_context.save(ttl=_sync_ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/objects/node_context/__init__.py", line 98, in save
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: etcd_utils.refresh(status, ttl)
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/etcd_utils.py", line 82, in refresh
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: raise ex
Jun 07 06:22:50 tendrl-server tendrl-node-agent[6578]: EtcdKeyNotFound: Key not found : /nodes/3d389d10-2e02-43f1-9e40-9104825798a4/NodeContext/status

3. The status watcher field no longer exists, but node-agent still tries to refresh it, so an EtcdKeyNotFound exception is raised.

4. If an import has already been done, fire unmanage. It fails because the job is not picked up by the nodes, so it times out.

5. If import has not been done yet, fire import. It also fails for the same reason.

6. node-agent is down, so all flows fail.
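
The failure in steps 1 to 3 can be sketched with a small in-memory model. This is only an illustration, not the change made in the upstream PR: `FakeStore`, `KeyNotFound`, `ManualClock`, and `save_status` are hypothetical names, the store is a stand-in for etcd rather than the real client, and the "UP" value and the `<node-id>` placeholder are assumptions. It shows a key with a 200 second TTL expiring while node-agent is stopped, a plain refresh failing afterwards, and one possible way to recover by recreating the key.

```python
import time


class KeyNotFound(Exception):
    """Hypothetical stand-in for the etcd 'key not found' error."""


class ManualClock:
    """Controllable clock so the example does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class FakeStore:
    """In-memory stand-in for etcd keys with a TTL (not the real client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # expired keys disappear
            return None
        return entry

    def write(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def read(self, key):
        entry = self._live(key)
        if entry is None:
            raise KeyNotFound(key)
        return entry[0]

    def refresh(self, key, ttl):
        # Refresh only extends the TTL of a key that still exists.
        entry = self._live(key)
        if entry is None:
            raise KeyNotFound(key)
        self._data[key] = (entry[0], self._clock() + ttl)


def save_status(store, key, value, ttl):
    """Refresh the TTL, recreating the key if it has already expired."""
    try:
        store.refresh(key, ttl)
    except KeyNotFound:
        store.write(key, value, ttl)


clock = ManualClock()
store = FakeStore(clock)
key = "/nodes/<node-id>/NodeContext/status"  # placeholder node id

store.write(key, "UP", ttl=200)
clock.advance(201)  # node-agent stopped for longer than the TTL

try:
    store.refresh(key, 200)  # what the failing sync effectively does
    refresh_failed = False
except KeyNotFound:
    refresh_failed = True

save_status(store, key, "UP", 200)  # fallback recreates the missing key
```

With the fallback in place, the sync in this model keeps running after a long outage instead of stopping on the first refresh.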


Actual results:
When I stop the server and some storage nodes for some time and then start them again, the import and unmanage flows fail.


Expected results:
Import and unmanage should work when the server and nodes are up.

Additional info:

Comment 2 gowtham 2018-06-07 06:52:43 UTC
The upstream PR for this is merged: https://github.com/Tendrl/commons/pull/984

Comment 3 Martin Bukatovic 2018-06-07 07:01:11 UTC
(In reply to gowtham from comment #0)
> Version-Release number of selected component (if applicable):

Could you provide the versions of the affected builds by running the following on
a machine from the cluster:

```
# rpm -qa | grep tendrl | sort
```

Comment 6 Nishanth Thomas 2018-06-07 08:25:14 UTC
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 10 Filip Balák 2018-06-25 09:03:45 UTC
I tested the reproducer from this BZ and BZ 1570048 a few times and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 12 errata-xmlrpc 2018-09-04 07:07:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616