Bug 1570048
Summary: | unmanaged task always fails after import failure | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lubos Trilety <ltrilety> | ||||||
Component: | web-admin-tendrl-commons | Assignee: | gowtham <gshanmug> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> | ||||||
Severity: | unspecified | Docs Contact: | |||||||
Priority: | unspecified | ||||||||
Version: | rhgs-3.4 | CC: | fbalak, gshanmug, mbukatov, rhs-bugs, sankarshan | ||||||
Target Milestone: | --- | ||||||||
Target Release: | RHGS 3.4.0 | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | tendrl-ui-1.6.3-3.el7rhgs tendrl-gluster-integration-1.6.3-4.el7rhgs tendrl-monitoring-integration-1.6.3-4.el7rhgs tendrl-commons-1.6.3-6.el7rhgs tendrl-node-agent-1.6.3-6.el7rhgs | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2018-09-04 07:04:50 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | 1588357 | ||||||||
Bug Blocks: | 1503137, 1526338 | ||||||||
Attachments: |
|
With provided reproducer the unmanage task failed as described. --> ASSIGNED

Tested with:
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

Require Screen-shot of the task-details page and the logs

Created attachment 1426730 [details]
task page

Created attachment 1426731 [details]
logs and configuration files

@nthomas I tried to reproduce this issue again and unmanage passed, so the reproducer is not 100% reliable. But after another try I was able to reproduce it again, so the problem remains. I sent you a PM with access to my configuration.

This issue is fixed

This is also part of this fix https://github.com/Tendrl/monitoring-integration/pull/437/commits/5d9a5279095055552f96d7034d03c5ba02b0b537

With the provided reproducer, the unmanage job fails with the following errors when the cluster had no short name specified:
```
Failure in Job e42f19cd-d8e2-4190-b813-2dff1c8ef98c Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 233, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
```
When the cluster had a name specified, this error also appeared (together with the previous errors):
```
Clearing monitoring data for cluster ([cluster_name]) not yet complete. Timing out.
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
ansible-2.5.2-1.el7ae.noarch

This issue is fixed

I changed the tendrl-node-devel repository baseurl to a bad one on some nodes and left it unchanged on the others.
The import fails as expected but unmanage fails as well with errors:
```
Failure in Job 1467b057-5a69-4828-9631-ea7d303ae405 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
Stop service jobs on cluster(cluster1) not yet complete on all nodes([u'dd17e478-1530-4e76-8ef5-a2543e2b1ada', u'4286fdbd-e814-461e-aa5d-c389d5092388', u'8872254b-071c-4855-a5e8-c3190c92b291', u'93dd5041-338c-48f9-ba81-289be1cf45bf', u'd59c1492-288d-4b66-bed6-4ddc74cf4c9c', u'da56311e-51d6-4d64-ae11-854b9bffa060']). Timing out.
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch

I changed baseurl on two of six nodes in /etc/yum.repos.d/tendrl_node.repo.
Import failed as expected but unmanage failed again with the same error:
```
Failure in Job a879da9c-e3f2-40cd-8bfb-d2346af6fb96 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 242, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
    raise ex
AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
Stop service jobs on cluster(cl1) not yet complete on all nodes([u'a2d6c641-b314-4d1a-8bdf-7947811576e7', u'b250cc18-20f5-4fd2-aae6-7f0e3dcd219b', u'e1a370b7-6ae6-423e-af80-cfed0c58ddd6', u'e9acdf48-57d1-4761-92e7-107a7727a14f', u'1a93660d-402c-4994-8068-f6e920548200', u'60ef19eb-92ee-482b-a345-c113aa3c6ded']). Timing out
```
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

This time the unmanage failure is not actually caused by the wrong repo change. Please check: even without the wrong repo, import fails and unmanage fails as well. It does not always fail, but it can be reproduced with the following steps:
1. Start tendrl-node-agent on the server and storage nodes.
2. After a few minutes, stop node-agent on the server and on a few storage nodes.
3. Wait 150 to 200 seconds for the TTL to delete the watcher status field from node_context.
4. Start node-agent on the server and storage nodes (all the nodes where node-agent was stopped).
5. Start the import flow; after a few minutes the import will fail.
6. Start the unmanage flow; after a few minutes the unmanage will also fail (timeout).

The watcher status field is deleted by the TTL, and because the server is also down, it does not mark that node as down. When node-agent is started, the first node_context.save() in the manager tries to update, but there is no change in the object, so nothing happens and the watcher status field is not updated either. After that, the node-agent sync tries to save node_context with a TTL. Here too there is no change in the object, so the save does not happen, but the save method of the node_context object still tries to refresh the watcher status field, so it raises an EtcdKeyNotFound exception. Because of this problem, some nodes do not pick up the job.

The actual problem reported in this bug (unmanage failing when a wrong repo is given) is fixed.

Based on Comment 15 I am moving this BZ back to ON_QA and I will test it again after BZ 1588357 is resolved.

I tested the reproducer from this BZ a few times with different configurations (different number of nodes, volume present/not present, cluster name set/not set) and everything seems OK. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2018:2616 |
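
The root cause described in the comment about the TTL-expired watcher status field can be illustrated with a small model. This is only a sketch under stated assumptions: `TTLStore`, `KeyNotFound`, `NodeContext`, the key names, and the 120-second TTL are all hypothetical stand-ins, not code from tendrl-commons or etcd, and `save_tolerant` is just one possible guard, not necessarily what the actual fix does.

```python
import time


class KeyNotFound(KeyError):
    """Hypothetical stand-in for the EtcdKeyNotFound exception."""


class TTLStore:
    """Tiny in-memory key-value store with per-key expiry (not etcd)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def _alive(self, key):
        entry = self._data.get(key)
        if entry is None:
            return False
        expires = entry[1]
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # the key was removed by its TTL
            return False
        return True

    def write(self, key, value, ttl=None):
        expires = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires)

    def read(self, key):
        if not self._alive(key):
            raise KeyNotFound(key)
        return self._data[key][0]

    def refresh(self, key, ttl):
        # Extends the TTL of an existing key; it cannot recreate a lost one.
        value = self.read(key)
        self._data[key] = (value, self._clock() + ttl)


class NodeContext:
    FINGERPRINT_KEY = "node_context/fingerprint"  # assumed: stored without TTL
    STATUS_KEY = "node_context/status"            # assumed: stored with a TTL

    def __init__(self, store, status="UP"):
        self.store = store
        self.status = status

    def _unchanged(self):
        try:
            return self.store.read(self.FINGERPRINT_KEY) == self.status
        except KeyNotFound:
            return False

    def save(self, ttl):
        if self._unchanged():
            # Object looks unchanged, so nothing is written; only the
            # TTL refresh runs, and it fails if the key already expired.
            self.store.refresh(self.STATUS_KEY, ttl)
            return
        self.store.write(self.FINGERPRINT_KEY, self.status)
        self.store.write(self.STATUS_KEY, self.status, ttl=ttl)

    def save_tolerant(self, ttl):
        # One possible guard: rewrite the status key when refresh fails.
        try:
            self.save(ttl)
        except KeyNotFound:
            self.store.write(self.STATUS_KEY, self.status, ttl=ttl)


def demo():
    now = [0.0]  # fake clock so the example does not need to sleep
    store = TTLStore(clock=lambda: now[0])
    NodeContext(store).save(ttl=120)   # first save writes both keys
    now[0] += 200                      # agent stopped longer than the TTL
    restarted = NodeContext(store)     # same state after the restart
    try:
        restarted.save(ttl=120)
        outcome = "saved"
    except KeyNotFound:
        outcome = "key not found"
    restarted.save_tolerant(ttl=120)
    return outcome, store.read(NodeContext.STATUS_KEY)
```

Calling `demo()` walks through the sequence from the comment: the status key expires while the agent is stopped, the next `save()` skips the write because the object looks unchanged, and the refresh then fails on the missing key.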
Created attachment 1424509 [details]
unmanage task error

Description of problem:
Un-manage task always fails when import was not successful.

Version-Release number of selected component (if applicable):
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Change the tendrl-node-devel repository baseurl to a bad one
2. Start cluster import
3. Import fails
4. Start cluster unmanage task

Actual results:
The task fails with the following events or similar ones:
- info: Stop service jobs on cluster(<id0>) not yet complete on all nodes([u'<id1>', u'<id2>', u'<id3>', u'<id4>', u'<id5>', u'<id6>']). Timing out
- error: Failed atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster
- error: Failure in Job 8b497580-00b3-44a8-9ee2-7649f2d79625 Flow tendrl.flows.UnmanageCluster with error: Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
      the_flow.run()
    File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/unmanage_cluster/__init__.py", line 91, in run
      raise ex
  AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.StopMonitoringServices on flow: Unmanage a Gluster Cluster

Expected results:
Unmanage task finishes successfully

Additional info:
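
Step 1 of the reproducer (pointing the tendrl-node-devel repository at a bad baseurl) could be scripted roughly as follows. This is a minimal sketch, assuming the repo file is a plain INI file whose section id is `tendrl-node-devel`; the helper name `break_baseurl`, the sample file contents, and both URLs are hypothetical and not taken from the report.

```python
import configparser
import io

# Hypothetical sample of a yum repo file; the real file contents are not
# shown in this report.
SAMPLE = """\
[tendrl-node-devel]
name=Tendrl node (sample)
baseurl=http://repo.example.com/tendrl/
enabled=1
"""


def break_baseurl(repo_text, section, bad_url):
    """Return repo_text with the baseurl of `section` replaced by bad_url."""
    # interpolation=None keeps yum variables such as $basearch untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(repo_text)
    if not parser.has_section(section):
        raise KeyError("no such repository section: %s" % section)
    parser.set(section, "baseurl", bad_url)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


broken = break_baseurl(SAMPLE, "tendrl-node-devel", "http://bad.invalid/")
```

The function works on text rather than on a path, so it can be tried without touching `/etc/yum.repos.d`; applying the result on only some nodes matches the partial-failure variant described in the comments.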