Bug 1570048 - unmanage task always fails after import failure
Summary: unmanage task always fails after import failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-commons
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On: 1588357
Blocks: 1503137 1526338
 
Reported: 2018-04-20 13:57 UTC by Lubos Trilety
Modified: 2018-09-04 07:05 UTC
CC List: 5 users

Fixed In Version: tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:04:50 UTC
Embargoed:


Attachments
unmanage task error (352.81 KB, image/png) - 2018-04-20 13:57 UTC, Lubos Trilety
task page (105.62 KB, image/png) - 2018-04-25 14:58 UTC, Filip Balák


Links
Github Tendrl commons issues 932 - 2018-05-02 07:19:29 UTC
Github Tendrl commons issues 965 - 2018-05-16 12:30:47 UTC
Github Tendrl commons issues 977 - 2018-05-25 04:39:51 UTC
Github Tendrl commons issues 983 - 2018-06-05 18:05:52 UTC
Github Tendrl commons pull 933 - 2018-05-02 07:18:12 UTC
Github Tendrl gluster-integration issues 650 - 2018-05-28 17:38:51 UTC
Github Tendrl monitoring-integration issues 468 - 2018-05-25 04:39:15 UTC
Github Tendrl node-agent issues 824 - 2018-06-05 18:06:29 UTC
Red Hat Bugzilla 1514442 (CLOSED) - Successive attempts to import the same cluster on the same webadmin server fail - 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2616 - 2018-09-04 07:05:55 UTC

Internal Links: 1514442

Description Lubos Trilety 2018-04-20 13:57:54 UTC
Created attachment 1424509 [details]
unmanage task error

Description of problem:
The un-manage task always fails when the cluster import was not successful.

Version-Release number of selected component (if applicable):
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Change the tendrl-node-devel repository baseurl to an invalid one (a minimal sketch of this edit follows these steps)
2. Start the cluster import
3. The import fails
4. Start the cluster unmanage task
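
For reference, step 1 amounts to pointing the repo file later identified in Comment 14 (/etc/yum.repos.d/tendrl_node.repo) at an unreachable baseurl. Below is a minimal Python 3 sketch of that edit; the section name tendrl-node-devel and the replacement URL are assumptions, adjust them for your environment:
```
# Hypothetical helper (not part of tendrl): break the tendrl-node-devel repo
# so that package installation, and therefore the import flow, fails.
# File path taken from Comment 14; section name and URL are assumptions.
import configparser

REPO_FILE = "/etc/yum.repos.d/tendrl_node.repo"

repo = configparser.ConfigParser()
repo.read(REPO_FILE)
repo["tendrl-node-devel"]["baseurl"] = "http://does-not-exist.example.com/tendrl/"

with open(REPO_FILE, "w") as fh:
    repo.write(fh)
```
With the baseurl broken, package installation on that node fails, which is what makes the import fail in step 3.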

Actual results:
The task fails with the following events, or similar ones:
 - info:
   Stop service jobs on cluster(<id0>) not yet complete on all nodes([u'<id1>', u'<id2>', u'<id3>', u'<id4>', u'<id5>', u'<id6>']). Timing out
 - error:
   Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 
 - error:
   Failure in Job 8b497580-00b3-44a8-9ee2-7649f2d79625 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Expected results:
The unmanage task finishes successfully.

Additional info:

Comment 4 Filip Balák 2018-04-25 11:50:33 UTC
With the provided reproducer the unmanage task failed as described. --> ASSIGNED

Tested with:
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Comment 5 Nishanth Thomas 2018-04-25 13:51:42 UTC
Please provide a screenshot of the task-details page and the logs.

Comment 6 Filip Balák 2018-04-25 14:58:31 UTC
Created attachment 1426730 [details]
task page

Comment 7 Filip Balák 2018-04-25 14:59:50 UTC
Created attachment 1426731 [details]
logs and configuration files

Comment 8 Filip Balák 2018-04-25 15:03:15 UTC
@nthomas I tried to reproduce this issue again and unmanage passed, so the reproducer is not 100% reliable. After another try I was able to reproduce it again, so the problem remains. I sent you a PM with access to my configuration.

Comment 9 gowtham 2018-05-02 07:19:29 UTC
This issue is fixed

Comment 11 Filip Balák 2018-05-10 08:35:39 UTC
With the provided reproducer, the unmanage job fails with the following errors when the cluster had no short name specified:
```
Failure in Job e42f19cd-d8e2-4190-b813-2dff1c8ef98c Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 233, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
```

When the cluster had a name specified, this error also appeared (in addition to the previous errors):
```
Clearing monitoring data for cluster ([cluster_name]) not yet complete. Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
ansible-2.5.2-1.el7ae.noarch

Comment 12 gowtham 2018-05-16 12:30:47 UTC
This issue is fixed

Comment 13 Filip Balák 2018-05-21 08:26:48 UTC
I changed the tendrl-node-devel repository baseurl to an invalid one on some nodes and left it unchanged on the others. The import fails as expected, but unmanage fails as well with these errors:

```
Failure in Job 1467b057-5a69-4828-9631-ea7d303ae405 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Stop service jobs on cluster(cluster1) not yet complete on all nodes([u'dd17e478-1530-4e76-8ef5-a2543e2b1ada', u'4286fdbd-e814-461e-aa5d-c389d5092388', u'8872254b-071c-4855-a5e8-c3190c92b291', u'93dd5041-338c-48f9-ba81-289be1cf45bf', u'd59c1492-288d-4b66-bed6-4ddc74cf4c9c', u'da56311e-51d6-4d64-ae11-854b9bffa060']). Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch

Comment 14 Filip Balák 2018-06-04 08:08:17 UTC
I changed the baseurl on two of six nodes in /etc/yum.repos.d/tendrl_node.repo. Import failed as expected, but unmanage failed again with the same error:
```
Failure in Job a879da9c-e3f2-40cd-8bfb-d2346af6fb96 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Stop service jobs on cluster(cl1) not yet complete on all nodes([u'a2d6c641-b314-4d1a-8bdf-7947811576e7', u'b250cc18-20f5-4fd2-aae6-7f0e3dcd219b', u'e1a370b7-6ae6-423e-af80-cfed0c58ddd6', u'e9acdf48-57d1-4761-92e7-107a7727a14f', u'1a93660d-402c-4994-8068-f6e920548200', u'60ef19eb-92ee-482b-a345-c113aa3c6ded']). Timing out
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 15 gowtham 2018-06-05 17:43:56 UTC
This time the unmanage failure is actually not caused by the wrong repo change. Please check without the wrong repo as well: the import fails and unmanage fails too.

This does not always fail, but it can be reproduced with the following steps:
   1. Start tendrl-node-agent on the server and the storage nodes
   2. After a few minutes, stop node-agent on the server and a few storage nodes
   3. Wait 150 to 200 seconds for the TTL to delete the status watcher field from node_context
   4. Start node-agent on the server and the storage nodes (all nodes where node-agent was stopped)
   5. Start the import flow; after a few minutes the import will fail
   6. Start the unmanage flow; after a few minutes the unmanage will also fail (timeout)

Because the watcher status field is deleted by its TTL while the server is also down, the node is never marked as down. When node-agent starts again, the first node_context.save() in the manager tries to update the object, but the object is unchanged, so nothing is written and the watcher status field is not updated either. Later, the node-agent sync tries to save node_context with a TTL; again the object is unchanged, so the save does not happen, but the save method of the node_context object then tries to refresh the watcher status field and raises an EtcdKeyNotFound exception.

Because of this problem, some nodes do not pick up the job.
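
The failure mode described above can be illustrated outside tendrl with a small python-etcd snippet. This is a minimal sketch, not tendrl code: it assumes an etcd server reachable at 127.0.0.1:2379, and the key path and TTL values are hypothetical stand-ins for the node_context status watcher field.
```
# Standalone sketch of the TTL race: a key written with a TTL expires while
# the node-agent is down, and a later refresh of that key raises EtcdKeyNotFound.
import time
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)  # assumed local etcd
key = "/nodes/example-node-id/NodeContext/status"  # hypothetical key path

# node-agent sync stores the status watcher field with a TTL
client.write(key, "UP", ttl=5)

# node-agent (and the node) stay down longer than the TTL, so etcd
# silently expires the key and nothing marks the node as down
time.sleep(6)

# after restart, saving an unchanged object only refreshes the TTL of the
# existing key; the key is already gone, so the refresh fails
try:
    client.refresh(key, ttl=5)
except etcd.EtcdKeyNotFound:
    print("status key expired while node-agent was down")
```
Per the description above, tendrl hits the same refresh path inside node_context.save() when the object is unchanged, which is why some nodes never pick up the stop-service jobs and the unmanage flow times out.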

The actual problem reported in this bug, unmanage failing after a wrong repo was given, is fixed.

Comment 16 Filip Balák 2018-06-07 07:11:58 UTC
Based on Comment 15, I am moving this BZ back to ON_QA and will test it again after BZ 1588357 is resolved.

Comment 17 Filip Balák 2018-06-25 09:07:16 UTC
I tested the reproducer from this BZ a few times with different configurations (different numbers of nodes, volume present/not present, cluster name set/not set) and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 19 errata-xmlrpc 2018-09-04 07:04:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

