Description of problem
======================

This BZ is created out of BZ 1571244, to track finding and then fixing the root cause of BZ 1571244 (along with all error reporting, locking and other bugs) so that the problem is fixed for all supported use cases. Please see the original BZ and all comments if you need to understand the context. The gist of the problem is:

Version-Release number of selected component
============================================

tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

based on https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20

How reproducible
================

Very rarely, see: https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c23

Steps to Reproduce
==================

1. Install Tendrl.
2. Prepare a Gluster cluster with a distributed replicated volume.
3. Import the cluster.

Actual results
==============

During cluster import, an error appears that looks like:

```
Failure in Job fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 53, in run
    _cluster.integration_id
FlowExecutionFailedError: Another job in progress for cluster, please wait till the job finishes (job_id: fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4) (integration_id: 5d8640f5-8d33-42f5-a11e-bd35e2758fa3)
```
After this error appears, the job is marked as `failed`, yet it actually continues and after a while finishes successfully.

Expected results
================

This should not be reported as an error, and it should not mark the entire job as failed. Conversely, if the job fails, it should not keep running and finish successfully later.

We understand the root cause of the problem. When we try to reproduce the problem via a script that runs the scenario e.g. 100 times, the problem doesn't happen. This error should not come up during regular regression testing.
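The double execution described above is a classic check-then-act race between nodes competing for the same job. A minimal sketch of how an atomic "create only if absent" operation (which etcd offers, roughly what its test-and-set semantics provide) avoids it; `MiniStore` and `try_lock_job` are hypothetical names, not Tendrl code:

```python
import threading


class MiniStore:
    """Tiny in-memory stand-in for etcd's atomic create-if-absent."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def create(self, key, value):
        """Succeed only if the key does not exist yet; atomic, so at
        most one caller can ever win for a given key."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def try_lock_job(store, job_id, node_id):
    # Whichever node's create() wins owns the job; the loser should
    # simply skip the job rather than marking it failed.
    return store.create("/queue/%s/locked_by" % job_id, node_id)


store = MiniStore()
print(try_lock_job(store, "42", "node-a"))  # True: node-a owns the job
print(try_lock_job(store, "42", "node-b"))  # False: node-b backs off
```

The key design point is that the existence check and the write happen as one atomic step; a separate "read, then write" sequence reintroduces the race.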
Both the dependent BZs are verified and closed now. I suggest marking this one closed as well. Martin?
(In reply to Shubhendu Tripathi from comment #3)
> Both the dependent BZs are verified and closed now. Suggest to mark this
> closed.

No. This bug was specifically created because we weren't able to fix the original problem 100%. But if you have a root cause analysis and a fix, you can attach an upstream pull request to this bug and we can consider it for a future batch update.
job.load() does not update the object with new values from etcd; it always keeps the values that were set at object initialization time. The problem is that even when the job is locked by some node, job.load() does not refresh the job object, so every node thinks it locked the job and the same job is executed by different nodes. I have modified the code to read from etcd into an empty object so that it gets initialized properly.

PR: https://github.com/Tendrl/commons/pull/1083
Having read through the entire comment history of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1571244, I understand that the original bug was not fixed 100%, but was still moved to /conditionally/ verified because the frequency of occurrence had decreased, plus there were no clear reproducer steps for the rare scenario in which it was actually hit. @Gowtham, it would help to have reproducer steps for this issue. They would not just help me understand the patch that has gone in; the same steps could be executed (multiple times) on builds without and with the fix, to confidently move this bug to its final state.
Can we CLOSE-WONTFIX this?
(In reply to Yaniv Kaul from comment #29)
> Can we CLOSE-WONTFIX this?

OK to close, given the low probability.
Closing per the discussion above.