Bug 1570048 - unmanaged task always fails after import failure
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-commons
Version: 3.4
Hardware: Unspecified Unspecified
Priority: unspecified Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assigned To: gowtham
QA Contact: Filip Balák
Docs Contact:
Depends On: 1588357
Blocks: 1503137 1526338
Reported: 2018-04-20 09:57 EDT by Lubos Trilety
Modified: 2018-09-04 03:05 EDT

See Also:
Fixed In Version: tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 03:04:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
unmanage task error (352.81 KB, image/png)
2018-04-20 09:57 EDT, Lubos Trilety
task page (105.62 KB, image/png)
2018-04-25 10:58 EDT, Filip Balák


External Trackers
Tracker ID Priority Status Summary Last Updated
Github Tendrl/commons/issues/932 None None None 2018-05-02 03:19 EDT
Github Tendrl/commons/issues/965 None None None 2018-05-16 08:30 EDT
Github Tendrl/commons/issues/977 None None None 2018-05-25 00:39 EDT
Github Tendrl/commons/issues/983 None None None 2018-06-05 14:05 EDT
Github Tendrl/commons/pull/933 None None None 2018-05-02 03:18 EDT
Github Tendrl/gluster-integration/issues/650 None None None 2018-05-28 13:38 EDT
Github Tendrl/monitoring-integration/issues/468 None None None 2018-05-25 00:39 EDT
Github Tendrl/node-agent/issues/824 None None None 2018-06-05 14:06 EDT
Red Hat Product Errata RHSA-2018:2616 None None None 2018-09-04 03:05 EDT

Description Lubos Trilety 2018-04-20 09:57:54 EDT
Created attachment 1424509
unmanage task error

Description of problem:
The unmanage task always fails when the preceding import was not successful.

Version-Release number of selected component (if applicable):
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Change the tendrl-node-devel repository baseurl to an invalid one
2. Start the cluster import
3. The import fails
4. Start the cluster unmanage task

Actual results:
The task fails with the following events, or similar ones:
 - info:
   Stop service jobs on cluster(<id0>) not yet complete on all nodes([u'<id1>', u'<id2>', u'<id3>', u'<id4>', u'<id5>', u'<id6>']). Timing out
 - error:
   Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 
 - error:
   Failure in Job 8b497580-00b3-44a8-9ee2-7649f2d79625 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Expected results:
Unmanage task finishes successfully

Additional info:
Comment 4 Filip Balák 2018-04-25 07:50:33 EDT
With the provided reproducer, the unmanage task failed as described. --> ASSIGNED

Tested with:
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
Comment 5 Nishanth Thomas 2018-04-25 09:51:42 EDT
Please provide a screenshot of the task details page and the logs.
Comment 6 Filip Balák 2018-04-25 10:58 EDT
Created attachment 1426730
task page
Comment 7 Filip Balák 2018-04-25 10:59 EDT
Created attachment 1426731
logs and configuration files
Comment 8 Filip Balák 2018-04-25 11:03:15 EDT
@nthomas@redhat.com I tried to reproduce this issue again and unmanage passed, so the reproducer does not trigger the failure 100% of the time. On another try I was able to reproduce it again, so the problem remains. I sent you a PM with access to my configuration.
Comment 9 gowtham 2018-05-02 03:19:29 EDT
This issue is fixed
Comment 11 Filip Balák 2018-05-10 04:35:39 EDT
With the provided reproducer, the unmanage job fails with the following errors when the cluster has no short name specified:
```
Failure in Job e42f19cd-d8e2-4190-b813-2dff1c8ef98c Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 233, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
```

When the cluster had a name specified, this error also appeared (along with the previous errors):
```
Clearing monitoring data for cluster ([cluster_name]) not yet complete. Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
ansible-2.5.2-1.el7ae.noarch
Comment 12 gowtham 2018-05-16 08:30:47 EDT
This issue is fixed
Comment 13 Filip Balák 2018-05-21 04:26:48 EDT
I changed the tendrl-node-devel repository baseurl to an invalid one on some nodes and left it unchanged on the others. The import fails as expected, but unmanage fails as well, with these errors:

```
Failure in Job 1467b057-5a69-4828-9631-ea7d303ae405 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster 

Stop service jobs on cluster(cluster1) not yet complete on all nodes([u'dd17e478-1530-4e76-8ef5-a2543e2b1ada', u'4286fdbd-e814-461e-aa5d-c389d5092388', u'8872254b-071c-4855-a5e8-c3190c92b291', u'93dd5041-338c-48f9-ba81-289be1cf45bf', u'd59c1492-288d-4b66-bed6-4ddc74cf4c9c', u'da56311e-51d6-4d64-ae11-854b9bffa060']). Timing out.
```

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch
Comment 14 Filip Balák 2018-06-04 04:08:17 EDT
I changed baseurl on two of the six nodes in /etc/yum.repos.d/tendrl_node.repo. Import failed as expected, but unmanage failed again with the same error:
```
Failure in Job a879da9c-e3f2-40cd-8bfb-d2346af6fb96 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run raise ex AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Stop service jobs on cluster(cl1) not yet complete on all nodes([u'a2d6c641-b314-4d1a-8bdf-7947811576e7', u'b250cc18-20f5-4fd2-aae6-7f0e3dcd219b', u'e1a370b7-6ae6-423e-af80-cfed0c58ddd6', u'e9acdf48-57d1-4761-92e7-107a7727a14f', u'1a93660d-402c-4994-8068-f6e920548200', u'60ef19eb-92ee-482b-a345-c113aa3c6ded']). Timing out
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
Comment 15 gowtham 2018-06-05 13:43:56 EDT
This time the unmanage failure is not actually caused by the wrong repo change. Please check: even without a wrong repo, import fails and unmanage fails as well.

This does not always fail, but it can be reproduced with the following steps:
   1. Start tendrl-node-agent on the server and on the storage nodes.
   2. After a few minutes, stop node-agent on the server and on a few storage nodes.
   3. Wait 150 to 200 seconds for the TTL to delete the watcher status field from node_context.
   4. Start node-agent again on the server and on every storage node where it was stopped.
   5. Start the import flow; after a few minutes the import will fail.
   6. Start the unmanage flow; after a few minutes unmanage will also fail (timeout).

The watcher status field is deleted by the TTL, and because the server is also down, nothing marks that node as down. When node-agent starts, the first node_context.save() in the manager tries to update the object, but the object has not changed, so nothing happens and the watcher status field is not updated either. Later, the node-agent sync tries to save node_context with a TTL. Again the object has not changed, so no save happens, but the save method of the node_context object still tries to refresh the watcher status field, which raises an EtcdKeyNotFound exception.

Because of this problem, some nodes do not pick up the job.

The problem originally reported in this bug, unmanage failing when a wrong repo is given, is fixed.
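
The failure mode described in this comment can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the tendrl code: FakeStore, NodeContext, KeyNotFound, the key layout, the stored digest used to detect "no change", and the 120-second TTL are all hypothetical stand-ins chosen for illustration. It shows how a save that skips unchanged objects but still refreshes a TTL-guarded field can fail once that field has expired.

```python
import time


class KeyNotFound(Exception):
    """Stand-in for the EtcdKeyNotFound exception named in the comment."""


class FakeStore:
    """Hypothetical in-memory key/value store with an optional TTL per key."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time or None)

    def _expire(self, key):
        entry = self._data.get(key)
        if entry and entry[1] is not None and entry[1] <= self._clock():
            del self._data[key]

    def write(self, key, value, ttl=None):
        expiry = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expiry)

    def read(self, key):
        self._expire(key)
        if key not in self._data:
            raise KeyNotFound(key)
        return self._data[key][0]

    def refresh(self, key, ttl):
        # A refresh only extends an existing key; it never recreates one.
        self.write(key, self.read(key), ttl)


class NodeContext:
    """Hypothetical object whose status field is stored with a TTL."""

    def __init__(self, store, node_id, status="UP"):
        self.store = store
        self.node_id = node_id
        self.status = status

    def save(self, ttl=None):
        status_key = "nodes/%s/status" % self.node_id
        digest_key = "nodes/%s/digest" % self.node_id
        digest = "%s:%s" % (self.node_id, self.status)
        try:
            stored = self.store.read(digest_key)
        except KeyNotFound:
            stored = None
        if stored != digest:
            # The object changed (or was never saved): write everything.
            self.store.write(status_key, self.status, ttl)
            self.store.write(digest_key, digest)
        elif ttl is not None:
            # Unchanged object: only the TTL of the status field is refreshed.
            self.store.refresh(status_key, ttl)


if __name__ == "__main__":
    now = [0.0]  # manual clock so the run needs no real waiting
    store = FakeStore(clock=lambda: now[0])
    NodeContext(store, "node-1").save(ttl=120)

    now[0] = 200.0  # agent stopped for longer than the TTL
    restarted = NodeContext(store, "node-1")
    restarted.save()  # unchanged object, no TTL: nothing happens
    try:
        restarted.save(ttl=120)  # unchanged object, refresh of a missing key
    except KeyNotFound as exc:
        print("refresh failed, key is gone:", exc)
```

In this sketch the digest key has no TTL, so after a restart the object still looks unchanged even though its status key has expired. That mismatch is what makes the refresh fail; how the real fix resolves it is not described in this report.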
Comment 16 Filip Balák 2018-06-07 03:11:58 EDT
Based on Comment 15, I am moving this BZ back to ON_QA and will test it again after BZ 1588357 is resolved.
Comment 17 Filip Balák 2018-06-25 05:07:16 EDT
I tested the reproducer from this BZ a few times with different configurations (different number of nodes, volume present/not present, cluster name set/not set) and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
Comment 19 errata-xmlrpc 2018-09-04 03:04:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616
