1571244 – Import cluster job fails for a while but then finishes successfully

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-notifier
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	RHGS 3.4.0
Assignee:	gowtham
QA Contact:	Filip Balák
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1576717
Depends On:
Blocks:	1503137 1602858

Reported:	2018-04-24 11:49 UTC by Filip Balák
Modified:	2018-09-04 07:05 UTC
CC List:	6 users
Fixed In Version:	tendrl-commons-1.6.3-8.el7rhgs
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-09-04 07:04:50 UTC
Embargoed:
Dependent Products:

Attachments
task page during import (115.30 KB, image/png) 2018-04-26 11:38 UTC, Filip Balák
task page after import (114.86 KB, image/png) 2018-04-26 11:39 UTC, Filip Balák
01 Import cluster fail (79.06 KB, image/png) 2018-06-28 11:36 UTC, Filip Balák
02 ImportCluster later passed (66.79 KB, image/png) 2018-06-28 11:37 UTC, Filip Balák
01 Import cluster fail - new (108.01 KB, image/png) 2018-07-11 11:52 UTC, Filip Balák
02 ImportCluster later passed - new (110.64 KB, image/png) 2018-07-11 11:53 UTC, Filip Balák

Links
System	ID	Priority	Status	Summary	Last Updated
Github	Tendrl commons issues 951	None	None	None	2018-04-30 13:25:34 UTC
Github	Tendrl commons issues 953	None	None	None	2018-05-02 18:07:08 UTC
Red Hat Product Errata	RHSA-2018:2616	None	None	None	2018-09-04 07:05:55 UTC

Description Filip Balák 2018-04-24 11:49:18 UTC

Description of problem:
An error like the following appears during cluster import:
```
Failure in Job fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 53, in run
    _cluster.integration_id
FlowExecutionFailedError: Another job in progress for cluster, please wait till the job finishes (job_id: fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4) (integration_id: 5d8640f5-8d33-42f5-a11e-bd35e2758fa3)
```
After this error appears, the job is marked as `failed`, but it keeps running and after a while finishes successfully.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

How reproducible:
It seems to appear at random and does not depend on how long the machines have been running. After I spotted it a few times, I ran 15 automated installations of tendrl with import. The issue appeared in 2 of them.

Steps to Reproduce:
1. Install tendrl.
2. Prepare a gluster cluster with a distributed replicated volume.
3. Import the cluster.

Actual results:
The error `Another job in progress for cluster, please wait till the job finishes` may appear, marked with a red cross. The job is marked as failed but continues running.

Expected results:
This should not be reported as an error, and it should not mark the entire job as failed. If a job fails, it should not continue and later finish successfully.
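The expected behaviour can be illustrated with a small state machine in which `failed` and `finished` are terminal. This is only a sketch under assumptions: the `Job` class, the status names and the transition table are hypothetical and are not taken from the tendrl-commons code.

```python
# Hypothetical job life cycle; not the tendrl-commons implementation.
TERMINAL = {"finished", "failed"}
ALLOWED = {
    "new": {"processing"},
    "processing": {"finished", "failed"},
}


class InvalidTransition(Exception):
    """Raised when a status change would violate the life cycle."""


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.status = "new"

    def transition(self, new_status):
        # A job in a terminal state must never change status again.
        if self.status in TERMINAL:
            raise InvalidTransition(
                "job %s is already %s" % (self.job_id, self.status))
        if new_status not in ALLOWED.get(self.status, set()):
            raise InvalidTransition(
                "%s -> %s is not allowed" % (self.status, new_status))
        self.status = new_status


job = Job("demo-job")
job.transition("processing")
job.transition("failed")

rejected = False
try:
    # The behaviour reported in this bug: a failed job later finishing.
    job.transition("finished")
except InvalidTransition:
    rejected = True

print(job.status, rejected)
```

With such a guard, a job that has been marked as failed stays failed, and any later attempt to finish it is rejected instead of silently succeeding.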

Additional info:

Comment 1 Martin Bukatovic 2018-04-25 10:28:51 UTC

Add a screenshot.

Comment 2 Filip Balák 2018-04-26 11:38:30 UTC

Created attachment 1427153 [details]
task page during import

Comment 3 Filip Balák 2018-04-26 11:39:08 UTC

Created attachment 1427154 [details]
task page after import

Comment 4 Filip Balák 2018-04-26 11:40:11 UTC

Created attachment 1427155 [details]
logs and configuration files

Comment 7 gowtham 2018-05-01 07:03:00 UTC

I reproduced this problem. It happens because the same job is executed by two different nodes.
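A common way to avoid two nodes executing the same job is to make claiming the job an atomic check-and-set, so that exactly one node wins. The sketch below is an assumption-laden illustration: `JobStore`, `try_claim` and the status names are hypothetical, and an in-process lock stands in for whatever atomic operation the shared store provides. It is not the change made in the Tendrl pull request.

```python
import threading


class JobStore:
    """Hypothetical in-memory stand-in for a shared job queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> {"status": ..., "locked_by": ...}

    def add(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"status": "new", "locked_by": None}

    def try_claim(self, job_id, node_id):
        """Atomically move a job from 'new' to 'processing' for one node.

        Returns True only for the single caller that wins the claim.
        """
        with self._lock:
            job = self._jobs[job_id]
            if job["status"] != "new":
                return False
            job["status"] = "processing"
            job["locked_by"] = node_id
            return True

    def owner(self, job_id):
        with self._lock:
            return self._jobs[job_id]["locked_by"]


def race(store, job_id, node_ids):
    """Let several nodes try to claim the same job concurrently."""
    winners = []
    winners_lock = threading.Lock()

    def worker(node_id):
        if store.try_claim(job_id, node_id):
            with winners_lock:
                winners.append(node_id)

    threads = [threading.Thread(target=worker, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


store = JobStore()
store.add("import-cluster-job")
winners = race(store, "import-cluster-job", ["node-1", "node-2", "node-3"])
print(winners)
```

Which node wins is not deterministic, but only one node ever does. The losing nodes see that the job is no longer `new` and skip it rather than raising an error against a job that is already running.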

Comment 8 gowtham 2018-05-03 08:59:44 UTC

This issue is fixed in https://github.com/Tendrl/commons/pull/954

Comment 10 Nishanth Thomas 2018-05-10 13:14:48 UTC

*** Bug 1576717 has been marked as a duplicate of this bug. ***

Comment 12 Filip Balák 2018-06-06 09:44:38 UTC

I ran the script that discovered this issue 25 times (the cluster was imported successfully 23 times) and the issue didn't appear.
--> VERIFIED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
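As a rough sanity check on the verification in comment 12, one can ask how likely it would be to see no occurrence at all if the issue still appeared at the rate from the original report (2 of 15 runs). The sketch below assumes independent runs and treats 2/15 as the true rate, which is itself a very uncertain estimate from a small sample. The function name is made up for illustration.

```python
def prob_no_occurrence(rate, runs):
    """Probability of zero occurrences in `runs` independent runs."""
    return (1.0 - rate) ** runs


# Rate from the original report: 2 occurrences in 15 automated runs.
observed_rate = 2 / 15

# Comment 12 mentions 25 runs, 23 of which imported the cluster successfully.
p25 = prob_no_occurrence(observed_rate, 25)
p23 = prob_no_occurrence(observed_rate, 23)

print("P(no occurrence in 25 runs) = %.3f" % p25)
print("P(no occurrence in 23 runs) = %.3f" % p23)
```

Both probabilities come out at a few percent, so a clean series of this length is unlikely if nothing had changed, but it cannot rule out a rarer remaining failure mode.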

Comment 13 Filip Balák 2018-06-28 11:36:39 UTC

Created attachment 1455248 [details]
01 Import cluster fail

Comment 14 Filip Balák 2018-06-28 11:37:30 UTC

Created attachment 1455249 [details]
02 ImportCluster later passed

Comment 15 Filip Balák 2018-06-28 11:48:57 UTC

This issue is seen with the following reproducer:
1. Change the tendrl-node-devel repository baseurl to an invalid one.
2. Start cluster import.

Cluster import fails for a while and is shown as Failed (so the user can run unmanage cluster and import it again, as described in BZ 1570048), but after a few minutes the job continues and finishes successfully. The cluster appears as imported in the UI, but tendrl-gluster-integration is not installed on the node with the error. This behaviour is visible in attachments 1455248 and 1455249.

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
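The symptom from comment 15 (a job shown as failed that later finishes) can be detected automatically from a sequence of observed job statuses. This is a minimal sketch: the helper name and the status strings are assumptions, and how the statuses are collected (for example by polling during an automated import) is left out.

```python
def resurrected_after_failure(observed_statuses):
    """Return True if 'finished' is observed after an earlier 'failed'."""
    seen_failed = False
    for status in observed_statuses:
        if status == "failed":
            seen_failed = True
        elif status == "finished" and seen_failed:
            return True
    return False


# Hypothetical status histories collected while polling a job.
healthy = ["new", "processing", "processing", "finished"]
buggy = ["new", "processing", "failed", "failed", "finished"]

print(resurrected_after_failure(healthy))
print(resurrected_after_failure(buggy))
```

A check like this in an automated import test would flag the run as affected even when the final status looks fine.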

Comment 16 gowtham 2018-07-01 04:07:01 UTC

The PR https://github.com/Tendrl/commons/pull/1016 is under review. I had this change in my previous PR, but due to last-minute confusion I removed it :)

Comment 17 Filip Balák 2018-07-11 11:52:35 UTC

Created attachment 1458054 [details]
01 Import cluster fail - new

Comment 18 Filip Balák 2018-07-11 11:53:04 UTC

Created attachment 1458055 [details]
02 ImportCluster later passed - new

Comment 20 Filip Balák 2018-07-11 11:57:46 UTC

The issue appeared again when I was trying to import a cluster with 6 nodes and 2 volumes. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

Comment 21 gowtham 2018-07-11 12:25:42 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1571244
@rohan, I need help with this. I tried a lot, but I have no idea why this bug appears again.

The bug shown in the last screenshot is really hard to reproduce and debug.

I fixed this bug in a lot of scenarios; only the case shown in the last screenshot is really difficult to fix.

@rohan and @shubhendu, I need help.

Comment 22 gowtham 2018-07-11 18:00:24 UTC

Filip, without clear reproduction steps I am struggling to proceed further with this issue. I tried many times but I can't find clear reproduction steps for it.

Comment 23 Filip Balák 2018-07-13 14:03:47 UTC

It happens very rarely. I accidentally found it when importing a cluster with 6 nodes and 2 volumes and with the cluster name set.

Comment 25 Nishanth Thomas 2018-07-18 10:22:45 UTC

Patches have already gone in to fix the issue. Comment https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c12 says that the issue is no longer seen and moved the bug to 'Verified', which means the situation has improved. The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20 is not reproducible, and the submitter also commented that it is a rare case. I had a discussion with Martin and we decided to split the BZ into two so that the original BZ can be verified. The new BZ will be fixed when we have a clear procedure to reproduce the issue.

Comment 26 Martin Bukatovic 2018-07-18 15:15:43 UTC

Classifying this as medium severity because errors like this without a clear
root cause could raise the cost of debugging and problem solving in
production.

Comment 27 Martin Bukatovic 2018-07-18 16:28:20 UTC

Based on the discussion with Nishanth (as noted in comment 25), I'm limiting the
scope of this BZ to the fixes created by Gowtham. QE is expected to verify that
the problem described in this BZ is less likely to happen compared to the
original report (verifying that the fixes improved the situation significantly).

New BZ 1602858 is created to track effort on:

 * figuring out root cause of remaining part of this issue
 * figuring out a reproducer or clarifying/improving the likelihood of this
   problem to appear
 * fixing the problem entirely

I keep this BZ in ON QA state and I will wait for Filip to discuss if we can
verify this BZ based on previous reports or if we need to run new verification.

My opinion is that this should be retested unless Filip has a high confidence
that this is not necessary.

Comment 28 Martin Bukatovic 2018-07-18 16:30:09 UTC

Filip, either retest or mark this BZ as verified based on previous testing, as
noted in comment 27.

Comment 29 Filip Balák 2018-08-07 10:42:01 UTC

I tested this again but was unable to reproduce it. Based on comment 27, I am marking this BZ as VERIFIED. If I am able to reproduce it in the future, I will add the information to BZ 1602858.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Comment 31 errata-xmlrpc 2018-09-04 07:04:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616
