Description of problem:

During cluster import, an error like the following can appear:

```
Failure in Job fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 53, in run
    _cluster.integration_id
FlowExecutionFailedError: Another job in progress for cluster, please wait till the job finishes (job_id: fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4) (integration_id: 5d8640f5-8d33-42f5-a11e-bd35e2758fa3)
```

After this error appears, the job is marked as `failed`, yet it keeps running and after a while finishes successfully.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-8.el7rhgs.x86_64
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-api-1.6.3-1.el7rhgs.noarch
tendrl-api-httpd-1.6.3-1.el7rhgs.noarch
tendrl-commons-1.6.3-2.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-1.el7rhgs.noarch
tendrl-node-agent-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch

How reproducible:
The issue appears at random and does not seem related to how long the machines have been running. After spotting it a few times, I ran 15 automated installations of Tendrl with import; the issue appeared 2 times.

Steps to Reproduce:
1. Install Tendrl.
2. Prepare a Gluster cluster with a distributed-replicated volume.
3. Import the cluster.

Actual results:
The error `Another job in progress for cluster, please wait till the job finishes` may appear marked with a red cross; the job then continues but is marked as failed.

Expected results:
This should not be reported as an error and should not mark the entire job as failed. If a job fails, it should not keep running and finish successfully after a time.

Additional info:
Screenshots attached.
Created attachment 1427153 [details] task page during import
Created attachment 1427154 [details] task page after import
Created attachment 1427155 [details] logs and configuration files
I reproduced this problem. It happens because the same job is executed by two different nodes.
This issue is fixed by https://github.com/Tendrl/commons/pull/954
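For context, the race described above (two node agents picking up the same job) is typically prevented by having each agent atomically claim a job before running it. The sketch below is illustrative only, not the actual Tendrl code from the PR; the `JobStore` class and its names are hypothetical stand-ins for a shared store such as etcd.

```python
# Hedged sketch of the atomic job-claim pattern; not the actual Tendrl fix.
# Several "node agents" race to claim one job; a compare-and-swap style
# claim guarantees that exactly one of them executes the flow.
import threading

class JobStore:
    """Stands in for a shared store such as etcd; names are hypothetical."""
    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}  # job_id -> node_id that claimed it

    def claim(self, job_id, node_id):
        """Atomically record the claimant iff the job is unclaimed."""
        with self._lock:
            if job_id in self._claims:
                return False  # another node already picked up this job
            self._claims[job_id] = node_id
            return True

def run_agent(store, job_id, node_id, executed):
    # Every agent sees the new job, but only the claim winner runs it.
    if store.claim(job_id, node_id):
        executed.append(node_id)

store = JobStore()
executed = []
threads = [threading.Thread(target=run_agent,
                            args=(store, "job-1", "node-%d" % i, executed))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(executed))  # exactly one agent executed the job
```

Without such a claim step, both agents run the flow, and the second one fails with "Another job in progress" while the first keeps running, which matches the observed behaviour (job marked failed, then finishing successfully).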
*** Bug 1576717 has been marked as a duplicate of this bug. ***
I ran the script that discovered this issue 25 times (the cluster was imported successfully 23 times) and the issue did not appear. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
Created attachment 1455248 [details] 01 Import cluster fail
Created attachment 1455249 [details] 02 ImportCluster later passed
This issue is seen with the following reproducer:
1. Change the tendrl-node-devel repository baseurl to a bad one.
2. Start cluster import.

The cluster import fails for a while and is shown as Failed (so the user can run unmanage cluster and import it again, as described in BZ 1570048), but after a few minutes the job continues and finishes successfully. The cluster then appears as imported in the UI, but tendrl-gluster-integration is not installed on the node with the error. This behaviour is visible in attachments 1455248 and 1455249. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
The PR is under review: https://github.com/Tendrl/commons/pull/1016. I had this change in my previous PR, but due to last-minute confusion I removed it. :)
Created attachment 1458054 [details] 01 Import cluster fail - new
Created attachment 1458055 [details] 02 ImportCluster later passed - new
The issue appeared again when I was trying to import a cluster with 6 nodes and 2 volumes. --> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch
https://bugzilla.redhat.com/show_bug.cgi?id=1571244
@rohan, I need help with this. I have tried a lot, but I have no idea why this bug comes back. The bug shown in the last screenshot is really hard to reproduce and debug. I have fixed this bug in a lot of scenarios, but the issue visible in the last screenshot is really difficult to fix. @rohan and @shubhendu, I need help.
Filip, without clear reproduction steps I am struggling to proceed further on this issue. I have tried many times, but I cannot find clear steps to reproduce it.
It happens very rarely. I found it by accident when importing a cluster with 6 nodes and 2 volumes, with a cluster name set.
Patches to fix the issue have already gone in. Comment https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c12 says the issue is no longer seen and moved the bug to 'Verified', which means the situation has improved. The issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20 is not reproducible, and the submitter himself commented that it is a rare case. After a discussion with Martin, we decided to split the BZ into two so that the original BZ can be verified. The new BZ will be fixed once we have a clear procedure to reproduce the issue.
Classifying this as medium severity because errors like this, without a clear root cause, can raise the cost of debugging and problem solving in production.
Based on discussion with Nishanth (as noted in comment 25), I'm limiting the scope of this BZ to the fixes created by Gowtham. QE is expected to verify that the problem described in this BZ is less likely to happen compared to the original report (i.e. that the fixes improved the situation significantly). New BZ 1602858 is created to track effort on:
* figuring out the root cause of the remaining part of this issue
* figuring out a reproducer, or clarifying/improving the likelihood of this problem appearing
* fixing the problem entirely

I keep this BZ in ON QA state and will wait for Filip to discuss whether we can verify this BZ based on previous reports or need to run new verification. My opinion is that this should be retested unless Filip has high confidence that it is not necessary.
Filip, either retest or mark this BZ as verified based on previous testing, as noted in comment 27.
I tested this again but was unable to reproduce it. Based on comment 27, I mark this BZ as VERIFIED. If I am able to reproduce it again in the future, I will add the info to BZ 1602858.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616