Bug 1602858 - Root cause of problem with import cluster job failure which then finishes successfully needs to be identified and the problem fixed in 100% of cases
Summary: Root cause of problem with import cluster job failure which then finishes successfully needs to be identified and the problem fixed in 100% of cases
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-notifier
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Timothy Asir
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On: 1515276 1571244
Blocks:
 
Reported: 2018-07-18 16:26 UTC by Martin Bukatovic
Modified: 2019-11-07 09:45 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Job object is not updated with the latest values from etcd due to a small logic problem in the implementation, so all nodes try to execute the same job. With this fix, the Job object always reads the latest value from etcd, and once a job is marked as processing by one node, no other node will pick it up.
Clone Of:
Environment:
Last Closed: 2019-11-07 09:45:45 UTC
Embargoed:



Description Martin Bukatovic 2018-07-18 16:26:11 UTC
Description of problem
======================

This BZ was created out of BZ 1571244 to track finding and then fixing the root
cause of BZ 1571244 (along with all related error reporting, locking and other
bugs) so that the problem is fixed for all supported use cases.

Please see the original BZ and all comments if you need to understand the
context.

The gist of the problem is that an import cluster job is reported as failed
with an "Another job in progress" error, yet the job keeps running and
eventually finishes successfully.

Version-Release number of selected component
============================================

tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

based on https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c20

How reproducible
================

very rarely

see: https://bugzilla.redhat.com/show_bug.cgi?id=1571244#c23

Steps to Reproduce
==================

1. Install Tendrl.
2. Prepare a Gluster cluster with a distributed replicated volume.
3. Import the cluster.

Actual results
==============

During cluster import, an error appears that looks like this:

```
Failure in Job fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4 Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 213, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 53, in run
    _cluster.integration_id
FlowExecutionFailedError: Another job in progress for cluster, please wait till the job finishes (job_id: fafdf3ce-8a0f-46d7-b3ed-aa78eecf9ba4) (integration_id: 5d8640f5-8d33-42f5-a11e-bd35e2758fa3)
```

After this error appears, the job is marked as `failed`, but it nevertheless continues and after a while finishes successfully.

Expected results
================

This should not be reported as an error, and it should not mark the entire job
as failed. If a job fails, it shouldn't continue and then finish successfully
after a while.

The root cause of the problem is identified and understood.

When the reproducer is run via a script, e.g. 100 times in a row, the problem
doesn't happen. This error should not occur during regular regression testing.
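
For reference, a repetition harness along those lines could look like the
sketch below. `trigger_import_and_wait()` is a hypothetical placeholder for
whatever starts the ImportCluster flow (for example through tendrl-api) and
reports the final job status; it is not an existing Tendrl helper.

```
# Sketch of a repetition harness for checking the expected result above.
# trigger_import_and_wait() is a hypothetical placeholder, not an existing
# Tendrl helper; replace it with whatever starts the ImportCluster flow
# (e.g. through tendrl-api) and returns the final job status.
import sys


def trigger_import_and_wait():
    # Stub: replace with the real import trigger and status polling.
    return "finished"


def main(runs=100):
    failures = 0
    for i in range(runs):
        if trigger_import_and_wait() == "failed":
            failures += 1
            print("run %d: import job reported as failed" % i)
    print("%d of %d runs hit the spurious failure" % (failures, runs))
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```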

Comment 3 Shubhendu Tripathi 2018-11-19 12:48:40 UTC
Both the dependent BZs are verified and closed now. Suggest to mark this closed.
Martin?

Comment 4 Martin Bukatovic 2018-12-13 17:36:18 UTC
(In reply to Shubhendu Tripathi from comment #3)
> Both the dependent BZs are verified and closed now. Suggest to mark this
> closed.

No. This bug was specifically created because we weren't able to fix the
original problem 100%.

But if you have a root cause analysis and a fix, you can attach the upstream
pull request to this bug and we can consider it for a future batch update.

Comment 5 gowtham 2019-04-01 16:54:31 UTC
job.load() is not updating the object with new values from etcd; it always keeps the values that were given at object initialization time. The problem is that the job gets locked by one node, but because job.load() does not refresh the job object properly, every node thinks it has locked the job, so the same job is executed by different nodes. I have modified the code to read from etcd into an empty object so that the object is initialized properly.

   PR: https://github.com/Tendrl/commons/pull/1083
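
For illustration, the difference between refreshing an already populated
object and loading into a fresh (empty) one can be sketched as below. This is
a minimal, self-contained model: a plain dict stands in for etcd, and `Job`,
`load_buggy`, `load_fixed`, and `should_process` are illustrative names, not
the actual Tendrl code changed in the PR.

```
# Illustrative sketch of the stale-load problem and the fix idea above.
# A plain dict stands in for etcd; Job is not the real Tendrl class.

store = {"job/1/status": "processing", "job/1/locked_by": "node-a"}


class Job(object):
    def __init__(self, job_id, status="new", locked_by=None):
        self.job_id = job_id
        self.status = status
        self.locked_by = locked_by

    def load_buggy(self):
        # Buggy behaviour: keeps the values set at initialization time
        # instead of overwriting them with what etcd currently holds.
        return self

    def load_fixed(self):
        # Fixed behaviour: build a fresh (empty) object purely from the
        # store, so stale in-memory values cannot mask the current state.
        return Job(self.job_id,
                   status=store["job/%s/status" % self.job_id],
                   locked_by=store["job/%s/locked_by" % self.job_id])


def should_process(job, node_id):
    # With an up-to-date view of the job, a node can skip anything that is
    # already being processed by another node.
    return job.status == "new" and job.locked_by in (None, node_id)


job = Job("1")                                       # node-b's in-memory copy
print(should_process(job.load_buggy(), "node-b"))    # True  -> duplicate execution
print(should_process(job.load_fixed(), "node-b"))    # False -> node-b skips the job
```

The PR linked above applies the fresh-read idea inside tendrl-commons; the sketch only models the general pattern.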

Comment 11 Sweta Anandpara 2019-07-08 06:18:59 UTC
Having read through the entire comment history of BZ https://bugzilla.redhat.com/show_bug.cgi?id=1571244, I understand that the original bug was not fixed 100%, but it was still moved to /conditionally/ verified because the frequency of occurrence had gone down and there were no clear reproducer steps for the rare scenario in which it was actually hit.

@Gowtham, it would help to have reproducer steps for this issue. That would not only help me better understand the patch that has gone in; the same steps could also be executed (multiple times) on builds without and with the fix, to confidently move this bug to its final state.

Comment 29 Yaniv Kaul 2019-10-28 09:24:50 UTC
Can we CLOSE-WONTFIX this?

Comment 30 Patric Uebele 2019-11-07 08:20:16 UTC
(In reply to Yaniv Kaul from comment #29)
> Can we CLOSE-WONTFIX this?

OK to close, given the low probability of occurrence.

Comment 31 Nishanth Thomas 2019-11-07 09:45:45 UTC
Closing per the discussion above

