Bug 1570048 - unmanage task always fails after import failure
Summary: unmanage task always fails after import failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-commons
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: gowtham
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On: 1588357
Blocks: 1503137 1526338
 
Reported: 2018-04-20 13:57 UTC by Lubos Trilety
Modified: 2018-09-04 07:05 UTC
CC List: 5 users

Fixed In Version: tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:04:50 UTC
Embargoed:


Attachments
unmanage task error (352.81 KB, image/png) - 2018-04-20 13:57 UTC, Lubos Trilety
task page (105.62 KB, image/png) - 2018-04-25 14:58 UTC, Filip Balák


Links
Github Tendrl commons issues 932 - 2018-05-02 07:19:29 UTC
Github Tendrl commons issues 965 - 2018-05-16 12:30:47 UTC
Github Tendrl commons issues 977 - 2018-05-25 04:39:51 UTC
Github Tendrl commons issues 983 - 2018-06-05 18:05:52 UTC
Github Tendrl commons pull 933 - 2018-05-02 07:18:12 UTC
Github Tendrl gluster-integration issues 650 - 2018-05-28 17:38:51 UTC
Github Tendrl monitoring-integration issues 468 - 2018-05-25 04:39:15 UTC
Github Tendrl node-agent issues 824 - 2018-06-05 18:06:29 UTC
Red Hat Bugzilla 1514442 (CLOSED) - Successive attempts to import the same cluster on the same webadmin server fail - 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2616 - 2018-09-04 07:05:55 UTC

Internal Links: 1514442

Description Lubos Trilety 2018-04-20 13:57:54 UTC
Created attachment 1424509 [details]
unmanage task error

Description of problem:
The un-manage task always fails when the cluster import was not successful.

Version-Release number of selected component (if applicable):
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Change the tendrl-node-devel repository baseurl to an invalid one (a minimal sketch of this edit follows these steps)
2. Start the cluster import
3. The import fails
4. Start the cluster unmanage task
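
For reference, step 1 amounts to pointing the repo file later identified in Comment 14 (/etc/yum.repos.d/tendrl_node.repo) at an unreachable baseurl. Below is a minimal Python 3 sketch of that edit; the section name tendrl-node-devel and the replacement URL are assumptions, adjust them for your environment:
```
# Hypothetical helper (not part of tendrl): break the tendrl-node-devel repo
# so that package installation, and therefore the import flow, fails.
# File path taken from Comment 14; section name and URL are assumptions.
import configparser

REPO_FILE = "/etc/yum.repos.d/tendrl_node.repo"

repo = configparser.ConfigParser()
repo.read(REPO_FILE)
repo["tendrl-node-devel"]["baseurl"] = "http://does-not-exist.example.com/tendrl/"

with open(REPO_FILE, "w") as fh:
    repo.write(fh)
```
With the baseurl broken, package installation on that node fails, which is what makes the import fail in step 3.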

Actual results:
The task fails with the following events, or similar ones:
 - info:
   Stop service jobs on cluster(<id0>) not yet complete on all nodes([u'<id1>', u'<id2>', u'<id3>', u'<id4>', u'<id5>', u'<id6>']). Timing out
 - error:
   Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 
 - error:
   Failure in Job 8b497580-00b3-44a8-9ee2-7649f2d79625 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Expected results:
The unmanage task finishes successfully.

Additional info:

Comment 4 Filip Balák 2018-04-25 11:50:33 UTC
With the provided reproducer the unmanage task failed as described. --> ASSIGNED

Tested with:
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Comment 5 Nishanth Thomas 2018-04-25 13:51:42 UTC
Please provide a screenshot of the task-details page and the logs.

Comment 6 Filip Balák 2018-04-25 14:58:31 UTC
Created attachment 1426730 [details]
task page

Comment 7 Filip Balák 2018-04-25 14:59:50 UTC
Created attachment 1426731 [details]
logs and configuration files

Comment 8 Filip Balák 2018-04-25 15:03:15 UTC
@nthomas I tried to reproduce this issue again and unmanage passed, so the reproducer is not 100% reliable. After another try I was able to reproduce it again, so the problem remains. I sent you a PM with access to my configuration.

Comment 9 gowtham 2018-05-02 07:19:29 UTC
This issue is fixed

Comment 11 Filip Balák 2018-05-10 08:35:39 UTC
With the provided reproducer, the unmanage job fails with the following errors when the cluster had no short name specified:
```
Failure in Job e42f19cd-d8e2-4190-b813-2dff1c8ef98c Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 233, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
```

When the cluster had a name specified, this error also appeared (in addition to the previous errors):
```
Clearing monitoring data for cluster ([cluster_name]) not yet complete. Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
ansible-2.5.2-1.el7ae.noarch

Comment 12 gowtham 2018-05-16 12:30:47 UTC
This issue is fixed

Comment 13 Filip Balák 2018-05-21 08:26:48 UTC
I changed the tendrl-node-devel repository baseurl to an invalid one on some nodes and left it unchanged on the others. The import fails as expected, but unmanage fails as well with these errors:

```
Failure in Job 1467b057-5a69-4828-9631-ea7d303ae405 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Stop service jobs on cluster(cluster1) not yet complete on all nodes([u'dd17e478-1530-4e76-8ef5-a2543e2b1ada', u'4286fdbd-e814-461e-aa5d-c389d5092388', u'8872254b-071c-4855-a5e8-c3190c92b291', u'93dd5041-338c-48f9-ba81-289be1cf45bf', u'd59c1492-288d-4b66-bed6-4ddc74cf4c9c', u'da56311e-51d6-4d64-ae11-854b9bffa060']). Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch

Comment 14 Filip Balák 2018-06-04 08:08:17 UTC
I changed the baseurl on two of six nodes in /etc/yum.repos.d/tendrl_node.repo. Import failed as expected, but unmanage failed again with the same error:
```
Failure in Job a879da9c-e3f2-40cd-8bfb-d2346af6fb96 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Stop service jobs on cluster(cl1) not yet complete on all nodes([u'a2d6c641-b314-4d1a-8bdf-7947811576e7', u'b250cc18-20f5-4fd2-aae6-7f0e3dcd219b', u'e1a370b7-6ae6-423e-af80-cfed0c58ddd6', u'e9acdf48-57d1-4761-92e7-107a7727a14f', u'1a93660d-402c-4994-8068-f6e920548200', u'60ef19eb-92ee-482b-a345-c113aa3c6ded']). Timing out
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 15 gowtham 2018-06-05 17:43:56 UTC
This time the unmanage failure is actually not caused by the wrong repo change. Please check without the wrong repo as well: the import fails and unmanage fails too.

This does not always fail, but it can be reproduced with the following steps:
   1. Start tendrl-node-agent on the server and the storage nodes
   2. After a few minutes, stop node-agent on the server and a few storage nodes
   3. Wait 150 to 200 seconds for the TTL to delete the status watcher field from node_context
   4. Start node-agent on the server and the storage nodes (all nodes where node-agent was stopped)
   5. Start the import flow; after a few minutes the import will fail
   6. Start the unmanage flow; after a few minutes the unmanage will also fail (timeout)

Because the watcher status field is deleted by its TTL while the server is also down, the node is never marked as down. When node-agent starts again, the first node_context.save() in the manager tries to update the object, but the object is unchanged, so nothing is written and the watcher status field is not updated either. Later, the node-agent sync tries to save node_context with a TTL; again the object is unchanged, so the save does not happen, but the save method of the node_context object then tries to refresh the watcher status field and raises an EtcdKeyNotFound exception.

Because of this problem, some nodes do not pick up the job.
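
The failure mode described above can be illustrated outside tendrl with a small python-etcd snippet. This is a minimal sketch, not tendrl code: it assumes an etcd server reachable at 127.0.0.1:2379, and the key path and TTL values are hypothetical stand-ins for the node_context status watcher field.
```
# Standalone sketch of the TTL race: a key written with a TTL expires while
# the node-agent is down, and a later refresh of that key raises EtcdKeyNotFound.
import time
import etcd

client = etcd.Client(host="127.0.0.1", port=2379)  # assumed local etcd
key = "/nodes/example-node-id/NodeContext/status"  # hypothetical key path

# node-agent sync stores the status watcher field with a TTL
client.write(key, "UP", ttl=5)

# node-agent (and the node) stay down longer than the TTL, so etcd
# silently expires the key and nothing marks the node as down
time.sleep(6)

# after restart, saving an unchanged object only refreshes the TTL of the
# existing key; the key is already gone, so the refresh fails
try:
    client.refresh(key, ttl=5)
except etcd.EtcdKeyNotFound:
    print("status key expired while node-agent was down")
```
Per the description above, tendrl hits the same refresh path inside node_context.save() when the object is unchanged, which is why some nodes never pick up the stop-service jobs and the unmanage flow times out.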

The actual problem reported in this bug, unmanage failing after a wrong repo was given, is fixed.

Comment 16 Filip Balák 2018-06-07 07:11:58 UTC
Based on Comment 15, I am moving this BZ back to ON_QA and will test it again after BZ 1588357 is resolved.

Comment 17 Filip Balák 2018-06-25 09:07:16 UTC
I tested the reproducer from this BZ a few times with different configurations (different numbers of nodes, volume present/not present, cluster name set/not set) and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 19 errata-xmlrpc 2018-09-04 07:04:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

