Bug 1570048

Summary: Unmanage task always fails after import failure
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Lubos Trilety <ltrilety>
Component: web-admin-tendrl-commons
Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: unspecified
Priority: unspecified
Version: rhgs-3.4
CC: fbalak, gshanmug, mbukatov, rhs-bugs, sankarshan
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-09-04 07:04:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1588357
Bug Blocks: 1503137, 1526338
Attachments:
unmanage task error (flags: none)
task page (flags: none)

Description Lubos Trilety 2018-04-20 13:57:54 UTC
Created attachment 1424509 [details]
unmanage task error

Description of problem:
The unmanage task always fails when the cluster import was not successful.

Version-Release number of selected component (if applicable):
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Change the tendrl-node-devel repository baseurl to an invalid one (see the sketch after these steps)
2. Start cluster import
3. Import fails
4. Start cluster unmanage task
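
A minimal sketch of step 1, assuming the repository file lives at /etc/yum.repos.d/tendrl_node.repo (the path mentioned in Comment 14) and contains a tendrl-node-devel section; the section name and the broken URL are illustrative assumptions:
```
# Break the tendrl-node-devel repository on a node so that the
# subsequent cluster import fails (step 1 of the reproducer).
# The repo file path and section name are assumptions; adjust to
# your environment.
import configparser

REPO_FILE = "/etc/yum.repos.d/tendrl_node.repo"  # path from Comment 14
SECTION = "tendrl-node-devel"                    # assumed section name

config = configparser.ConfigParser()
config.read(REPO_FILE)
# Point baseurl at a host that does not exist; yum then fails to fetch
# repo metadata and the import job's package installation breaks.
config[SECTION]["baseurl"] = "http://nonexistent.example.com/repo"
with open(REPO_FILE, "w") as f:
    config.write(f)
```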

Actual results:
The task fails with the following events or similar ones:
 - info:
   Stop service jobs on cluster(<id0>) not yet complete on all nodes([u'<id1>', u'<id2>', u'<id3>', u'<id4>', u'<id5>', u'<id6>']). Timing out
 - error:
   Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 
 - error:
   Failure in Job 8b497580-00b3-44a8-9ee2-7649f2d79625 Flow tendrl.flows.UnmanageCluster with error:
   Traceback (most recent call last):
     File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
       the_flow.run()
     File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
       raise ex
   AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Expected results:
The unmanage task finishes successfully.

Additional info:

Comment 4 Filip Balák 2018-04-25 11:50:33 UTC
With the provided reproducer, the unmanage task failed as described. --> ASSIGNED

Tested with:
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Comment 5 Nishanth Thomas 2018-04-25 13:51:42 UTC
Please provide a screenshot of the task-details page and the logs.

Comment 6 Filip Balák 2018-04-25 14:58:31 UTC
Created attachment 1426730 [details]
task page

Comment 7 Filip Balák 2018-04-25 14:59:50 UTC
Created attachment 1426731 [details]
logs and configuration files

Comment 8 Filip Balák 2018-04-25 15:03:15 UTC
@nthomas I tried to reproduce this issue again and unmanage passed, so the reproducer is not 100% reliable. After another try I was able to reproduce it again, so the problem remains. I sent you a PM with access to my configuration.

Comment 9 gowtham 2018-05-02 07:19:29 UTC
This issue is fixed.

Comment 11 Filip Balák 2018-05-10 08:35:39 UTC
With the provided reproducer, the unmanage job fails with the following errors when the cluster had no short name specified:
```
Failure in Job e42f19cd-d8e2-4190-b813-2dff1c8ef98c Flow tendrl.flows.UnmanageCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 233, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
```

When the cluster had a name specified, this error also appeared (together with the previous errors):
```
Clearing monitoring data for cluster ([cluster_name]) not yet complete. Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
ansible-2.5.2-1.el7ae.noarch

Comment 12 gowtham 2018-05-16 12:30:47 UTC
This issue is fixed.

Comment 13 Filip Balák 2018-05-21 08:26:48 UTC
I changed the tendrl-node-devel repository baseurl to an invalid one on some nodes and left it unchanged on the other nodes. The import fails as expected, but unmanage fails as well, with these errors:

```
Failure in Job 1467b057-5a69-4828-9631-ea7d303ae405 Flow tendrl.flows.UnmanageCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Stop service jobs on cluster(cluster1) not yet complete on all nodes([u'dd17e478-1530-4e76-8ef5-a2543e2b1ada', u'4286fdbd-e814-461e-aa5d-c389d5092388', u'8872254b-071c-4855-a5e8-c3190c92b291', u'93dd5041-338c-48f9-ba81-289be1cf45bf', u'd59c1492-288d-4b66-bed6-4ddc74cf4c9c', u'da56311e-51d6-4d64-ae11-854b9bffa060']). Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch

Comment 14 Filip Balák 2018-06-04 08:08:17 UTC
I changed the baseurl in /etc/yum.repos.d/tendrl_node.repo on two of six nodes. The import failed as expected, but unmanage failed again with the same error:
```
Failure in Job a879da9c-e3f2-40cd-8bfb-d2346af6fb96 Flow tendrl.flows.UnmanageCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Stop service jobs on cluster(cl1) not yet complete on all nodes([u'a2d6c641-b314-4d1a-8bdf-7947811576e7', u'b250cc18-20f5-4fd2-aae6-7f0e3dcd219b', u'e1a370b7-6ae6-423e-af80-cfed0c58ddd6', u'e9acdf48-57d1-4761-92e7-107a7727a14f', u'1a93660d-402c-4994-8068-f6e920548200', u'60ef19eb-92ee-482b-a345-c113aa3c6ded']). Timing out
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 15 gowtham 2018-06-05 17:43:56 UTC
This time the unmanage failure is actually not caused by the wrong repo change. Please check that even without the wrong repo, the import fails and the unmanage fails as well.

This does not always fail, but it can be reproduced by the following steps:
   1. Start tendrl-node-agent on the server and the storage nodes.
   2. After a few minutes, stop node-agent on the server and a few storage nodes.
   3. Wait 150 to 200 seconds for the TTL to delete the status watcher field from node_context (see the sketch after these steps).
   4. Start node-agent on the server and the storage nodes (all the nodes where node-agent was stopped).
   5. Start the import flow; after a few minutes the import will fail.
   6. Start the unmanage flow; after a few minutes the unmanage will also fail (timeout).
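
A minimal illustration of step 3, assuming a python-etcd client; the key path and TTL values are hypothetical, chosen only to show why waiting 150 to 200 seconds makes the status watcher field disappear:
```
# Step 3 illustrated: the node-agent keeps its status key alive with a
# TTL; once the agent is stopped and the TTL elapses, etcd deletes the
# key. The key path and TTL are hypothetical.
import time
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)
STATUS_KEY = "/nodes/<node-id>/NodeContext/status"  # hypothetical key

client.write(STATUS_KEY, "UP", ttl=150)  # refreshed while the agent runs

time.sleep(200)  # agent stopped: nothing refreshes the TTL

try:
    client.read(STATUS_KEY)
except etcd.EtcdKeyNotFound:
    print("status watcher field was TTL-deleted")
```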

Because the watcher status field is deleted by the TTL while the server is also down, the node is never marked as down. When node-agent starts again, the first node_context.save() in the manager tries to update the object, but nothing in the object has changed, so nothing is written and the watcher status field is not recreated. The node-agent sync then tries to save node_context with a TTL; again nothing in the object has changed, so the write is skipped, but the save method still tries to refresh the TTL on the watcher status field, which no longer exists, so it raises an EtcdKeyNotFound exception.
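
A minimal sketch of the race described above, assuming python-etcd and a hypothetical save() helper; this is not the actual tendrl-commons code, only the failure mode it describes:
```
# Sketch of the described race: after restart, save() sees no change in
# the object, skips the write, and only tries to refresh the TTL of the
# status key -- which the TTL already deleted. Names are hypothetical.
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)
STATUS_KEY = "/nodes/<node-id>/NodeContext/status"  # hypothetical key

def save_node_context(obj, cached, ttl=180):
    """Hypothetical save(): write only when the object changed,
    otherwise just refresh the TTL on the supposedly existing key."""
    if obj == cached:
        # The refresh assumes the key still exists; after TTL deletion
        # python-etcd raises etcd.EtcdKeyNotFound here, the node is
        # never marked UP again, and unmanage jobs time out on it.
        client.refresh(STATUS_KEY, ttl=ttl)
    else:
        client.write(STATUS_KEY, obj["status"], ttl=ttl)
```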

Because of this problem, some nodes do not pick up the job.

The actual problem reported in this bug, that unmanage fails when a wrong repo is given, is fixed.

Comment 16 Filip Balák 2018-06-07 07:11:58 UTC
Based on Comment 15, I am moving this BZ back to ON_QA and will test it again after BZ 1588357 is resolved.

Comment 17 Filip Balák 2018-06-25 09:07:16 UTC
I tested the reproducer from this BZ a few times with different configurations (different numbers of nodes, volume present/not present, cluster name set/not set) and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 19 errata-xmlrpc 2018-09-04 07:04:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616