Bug 1571244

Summary:

Import cluster job fails for a while but then finishes successfully

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Filip Balák <fbalak>

Component:

web-admin-tendrl-notifier

Assignee:

gowtham <gshanmug>

Status:

CLOSED ERRATA

QA Contact:

Filip Balák <fbalak>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

rhgs-3.4

CC:

dahorak, fbalak, mbukatov, nthomas, rhs-bugs, sankarshan

Target Milestone:

---

Target Release:

RHGS 3.4.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

tendrl-commons-1.6.3-8.el7rhgs

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-09-04 07:04:50 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1503137, 1602858

Attachments:

Description	Flags
task page during import	none
task page after import	none
01 Import cluster fail	none
02 ImportCluster later passed	none
01 Import cluster fail - new	none
02 ImportCluster later passed - new	none

Description Filip Balák 2018-04-24 11:49:18 UTC

Description of problem:
During cluster import, an error like the following appears:
```
Failure in Job fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4 Flow tendrl.flows.ImportCluster with error: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 53, in run
    _cluster.integration_id
FlowExecutionFailedError: Another job in progress for cluster, please wait till the job finishes (job_id: fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4) (integration_id: 5d8640f5-8d33-42f5-a11e-bd35e2758fa3)
```
After this error appears, the job is marked as `failed`, but it keeps running and after a while finishes successfully.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

How reproducible:
The issue appears at random and does not depend on how long the machines have been running. After seeing it a few times, I ran 15 automated installations of tendrl with cluster import; the issue appeared in 2 of them.

Steps to Reproduce:
1. Install tendrl.
2. Prepare a gluster cluster with a distributed replicated volume.
3. Import the cluster.

Actual results:
The error `Another job in progress for cluster, please wait till the job finishes` may appear, marked with a red cross. The job is then marked as failed but keeps running.

Expected results:
This condition should not be reported as an error, and it should not mark the entire job as failed. If a job really fails, it should not continue and later finish successfully.
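
The expectation above can be sketched as a small state machine in which `failed` is a terminal status. This is only an illustration, not Tendrl's actual job model: the class `Job`, the table `ALLOWED_TRANSITIONS`, and the status names other than `failed` are assumptions made for the example.

```python
# Hypothetical status model: "failed" and "finished" are terminal,
# so a job that has failed can never be reported as finished later.
ALLOWED_TRANSITIONS = {
    "new": {"processing"},
    "processing": {"finished", "failed"},
    "finished": set(),
    "failed": set(),
}


class Job(object):
    """Illustrative job object (not the Tendrl class)."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.status = "new"

    def transition(self, new_status):
        # Reject any move that the table does not explicitly allow.
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(
                "job {0}: illegal transition {1} -> {2}".format(
                    self.job_id, self.status, new_status
                )
            )
        self.status = new_status
```

With such a guard, the sequence seen in the screenshots (failed, then finished) would raise an error instead of silently changing the job status.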

Additional info:

Comment 1 Martin Bukatovic 2018-04-25 10:28:51 UTC

Add a screenshot.

Comment 2 Filip Balák 2018-04-26 11:38:30 UTC

Created attachment 1427153 [details]
task page during import

Comment 3 Filip Balák 2018-04-26 11:39:08 UTC

Created attachment 1427154 [details]
task page after import

Comment 4 Filip Balák 2018-04-26 11:40:11 UTC

Created attachment 1427155 [details]
logs and configuration files

Comment 7 gowtham 2018-05-01 07:03:00 UTC

I reproduced this problem. It happens because the same job is executed by two different nodes.
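
A minimal sketch of one common way to keep two nodes from running the same job: each node claims the job with an atomic compare-and-swap on its status, and only the node whose swap succeeds runs it. This is not the code from the fix. `JobStore`, `try_claim`, `simulate`, and the status names are hypothetical, and the in-memory store only stands in for whatever shared store the nodes use.

```python
import threading


class JobStore(object):
    """Hypothetical in-memory stand-in for a shared job store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def add(self, job_id):
        with self._lock:
            self._status[job_id] = "new"

    def compare_and_swap(self, job_id, expected, new):
        """Set the status to `new` only if it currently equals `expected`."""
        with self._lock:
            if self._status.get(job_id) != expected:
                return False
            self._status[job_id] = new
            return True


def try_claim(store, job_id, node, winners):
    # Only the node whose swap succeeds is allowed to run the job.
    if store.compare_and_swap(job_id, "new", "processing"):
        winners.append(node)


def simulate(nodes=("node-1", "node-2")):
    """Let several nodes race for one job and return those that claimed it."""
    store = JobStore()
    store.add("import-cluster-job")
    winners = []
    threads = [
        threading.Thread(
            target=try_claim,
            args=(store, "import-cluster-job", node, winners),
        )
        for node in nodes
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners
```

However the threads are scheduled, `simulate()` returns a list with exactly one node, because the check and the status update happen under the same lock.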

Comment 8 gowtham 2018-05-03 08:59:44 UTC

This issue is fixed in https://github.com/Tendrl/commons/pull/954

Comment 10 Nishanth Thomas 2018-05-10 13:14:48 UTC

*** Bug 1576717 has been marked as a duplicate of this bug. ***

Comment 12 Filip Balák 2018-06-06 09:44:38 UTC

I ran the script that originally discovered this issue 25 times (the cluster was imported successfully 23 times) and the issue didn't appear.
--> VERIFIED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
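
As a rough check on how strong this evidence is, the sketch below estimates the chance of seeing no failure at all if nothing had been fixed. It assumes that runs are independent and that the unfixed failure rate equals the 2 in 15 seen in the original report. Both are simplifications made for this example, not claims from this BZ, and the function name is hypothetical.

```python
def prob_no_failures(failure_rate, runs):
    """Probability of zero failures in `runs` independent runs."""
    return (1.0 - failure_rate) ** runs


# Assumed unfixed failure rate: 2 failures in 15 runs (original report).
observed_rate = 2.0 / 15

# All 25 runs, and only the 23 runs in which the import succeeded.
p_25 = prob_no_failures(observed_rate, 25)
p_23 = prob_no_failures(observed_rate, 23)

print("P(no failure in 25 runs) = {0:.3f}".format(p_25))
print("P(no failure in 23 runs) = {0:.3f}".format(p_23))
```

Under these assumptions both probabilities come out at a few percent, so a clean series of this length would be unlikely, though not impossible, if the fix had no effect.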

Comment 13 Filip Balák 2018-06-28 11:36:39 UTC

Created attachment 1455248 [details]
01 Import cluster fail

Comment 14 Filip Balák 2018-06-28 11:37:30 UTC

Created attachment 1455249 [details]
02 ImportCluster later passed

Comment 15 Filip Balák 2018-06-28 11:48:57 UTC

This issue can be seen with the following reproducer:
1. Change the baseurl of the tendrl-node-devel repository to an invalid one.
2. Start cluster import.

Cluster import fails for a while and is shown as Failed (so the user can unmanage the cluster and import it again, as described in BZ 1570048), but after a few minutes the job continues and finishes successfully. The cluster appears as imported in the UI, but tendrl-gluster-integration is not installed on the node with the error. This behaviour is visible in attachments 1455248 and 1455249.

--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
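
The behaviour one would expect in this reproducer can be sketched as follows: the import job counts as finished only if every node reports that the integration package was installed. The function `overall_import_status` and its input format are assumptions made for illustration, not Tendrl code.

```python
def overall_import_status(node_results):
    """Combine per-node results into one job status.

    `node_results` maps a node name to True (integration installed)
    or False (installation failed). Returns (status, failed_nodes).
    """
    if not node_results:
        # Nothing was imported, so the job cannot count as finished.
        return ("failed", [])
    failed_nodes = sorted(
        node for node, ok in node_results.items() if not ok
    )
    status = "failed" if failed_nodes else "finished"
    return (status, failed_nodes)
```

For example, `overall_import_status({"n1": True, "n2": False})` returns `("failed", ["n2"])`, so a cluster with one broken repository would not be shown as imported.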

Comment 16 gowtham 2018-07-01 04:07:01 UTC

The PR is under review: https://github.com/Tendrl/commons/pull/1016. I had this change in my previous PR, but I removed it due to last-minute confusion :)

Comment 17 Filip Balák 2018-07-11 11:52:35 UTC

Created attachment 1458054 [details]
01 Import cluster fail - new

Comment 18 Filip Balák 2018-07-11 11:53:04 UTC

Created attachment 1458055 [details]
02 ImportCluster later passed - new

Comment 20 Filip Balák 2018-07-11 11:57:46 UTC

The issue appeared again when I was trying to import a cluster with 6 nodes and 2 volumes. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

Comment 21 gowtham 2018-07-11 12:25:42 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1571244
@rohan, I need help with this. I have tried a lot, but I have no idea why this bug has come back.

The bug shown in the last screenshot is really hard to reproduce and debug.

I have fixed this bug in many scenarios; only the case shown in the last screenshot is really difficult to fix.

@rohan and @shubhendu, I need help.

Comment 22 gowtham 2018-07-11 18:00:24 UTC

Filip, without clear reproduction steps I am struggling to make progress on this issue. I have tried many times, but I can't find a clear reproducer.

Comment 23 Filip Balák 2018-07-13 14:03:47 UTC

It happens very rarely. I accidentally found it when importing a cluster with 6 nodes and 2 volumes and with the cluster name set.

Comment 25 Nishanth Thomas 2018-07-18 10:22:45 UTC

Patches have already been merged to fix the issue. Comment https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c12 says that the issue was no longer seen, and the bug was moved to 'Verified', which means the situation has improved. The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20 is not reproducible, and the submitter himself commented that it is a rare case. After a discussion with Martin, we decided to split the BZ into two so that the original BZ can be verified. The new BZ will be fixed when we have a clear procedure to reproduce the issue.

Comment 26 Martin Bukatovic 2018-07-18 15:15:43 UTC

Classifying this as medium severity because errors like this without a clear
root cause could raise the cost of debugging and problem solving in
production.

Comment 27 Martin Bukatovic 2018-07-18 16:28:20 UTC

Based on the discussion with Nishanth (as noted in comment 25), I'm limiting the scope
of this BZ to the fixes created by Gowtham. QE is expected to verify that the
problem described in this BZ is less likely to happen compared to the
original report (verifying that the fixes improved the situation significantly).

New BZ 1602858 is created to track effort on:

 * figuring out root cause of remaining part of this issue
 * figuring out a reproducer or clarifying/improving the likelihood of this
   problem to appear
 * fixing the problem entirely

I keep this BZ in ON QA state and I will wait for Filip to discuss if we can
verify this BZ based on previous reports or if we need to run new verification.

My opinion is that this should be retested unless Filip has a high confidence
that this is not necessary.

Comment 28 Martin Bukatovic 2018-07-18 16:30:09 UTC

Filip, either retest or mark this BZ as verified based on previous testing, as
noted in comment 27.

Comment 29 Filip Balák 2018-08-07 10:42:01 UTC

I tested this again but was unable to reproduce it. Based on comment 27, I am marking this BZ as VERIFIED. If I am able to reproduce it again in the future, I will add the information to BZ 1602858.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Comment 31 errata-xmlrpc 2018-09-04 07:04:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616