Bug 1559365 - If import cluster fails due to time out, the current job is not marked properly
Summary: If import cluster fails due to time out, the current job is not marked properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-commons
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Shubhendu Tripathi
QA Contact: Filip Balák
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2018-03-22 12:05 UTC by Shubhendu Tripathi
Modified: 2018-09-04 07:02 UTC
CC List: 5 users

Fixed In Version: tendrl-commons-1.6.1-3.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:01:53 UTC
Embargoed:




Links
Github: Tendrl/commons issue 875 (https://github.com/Tendrl/commons/issues/875), last updated 2018-03-28 12:05:47 UTC
Red Hat Product Errata: RHSA-2018:2616 (https://access.redhat.com/errata/RHSA-2018:2616), last updated 2018-09-04 07:02:56 UTC

Description Shubhendu Tripathi 2018-03-22 12:05:10 UTC
Description of problem:
If an import cluster flow fails due to a time out, the current job under the cluster should be marked properly so that the UI can display the status correctly.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Currently the UI is not able to display the cluster status properly when an import fails due to a time out.

Expected results:
The UI should display the cluster status properly if the import fails due to a time out.

Additional info:

Comment 2 Martin Bukatovic 2018-03-28 09:38:07 UTC
Could you provide more details about:

* Where to get value of the timeout in question here?
* How to invoke job timeout for reproducing this issue?
* Link to upstream merge request or issue.
* Version where the bug is present.
* Is it possible to reproduce this on previously released RHGS WA?

Comment 3 Shubhendu Tripathi 2018-03-28 12:05:48 UTC
Martin,

>> Where to get value of the timeout in question here?
The import cluster flow works like this: one node from the cluster picks the task and creates import cluster jobs for the rest of the nodes in the cluster. The node which picked the job initially becomes the parent task holder and waits for the other nodes to finish their import cluster flow within a defined time limit of 6 minutes (this is currently a hard-coded value inside the flow logic). If any node is not able to finish its import job (installation of tendrl-gluster-integration, configuration and refresh of the collectd plugins, and the first round of sync for tendrl-gluster-integration), the parent comes to know about it because it is waiting for that job/task. In that situation the overall import cluster task would be marked as failed due to time out.
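
For illustration only, here is a minimal sketch of that wait-and-time-out behaviour; the helper names (get_child_job_statuses, mark_parent_job) are hypothetical and not the actual tendrl-commons API:
```
# Minimal sketch only, not the actual tendrl-commons implementation.
# get_child_job_statuses() and mark_parent_job() are hypothetical helpers.
import time

WAIT_LIMIT_SECONDS = 6 * 60  # the hard-coded 6 minute limit mentioned above


def wait_for_child_imports(child_job_ids, get_child_job_statuses, mark_parent_job):
    """Parent node waits for the per-node import jobs; if they do not all
    finish in time, the parent import job itself must be marked 'failed'
    (before the fix it could be left in 'in_progress')."""
    deadline = time.time() + WAIT_LIMIT_SECONDS
    while time.time() < deadline:
        statuses = get_child_job_statuses(child_job_ids)
        if all(status == "finished" for status in statuses.values()):
            mark_parent_job("finished")
            return True
        time.sleep(10)  # poll again until the deadline
    # Time out: this is the state transition this bug is about.
    mark_parent_job("failed")
    return False
```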

>> How to invoke job timeout for reproducing this issue
To simulate the issue, I personally tweaked the sync sleep in tendrl-gluster-integration. Also, if the repository from which tendrl-gluster-integration gets installed is quite slow, there is a chance of a time out during the installation itself.
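
As a purely illustrative sketch of the first approach (the wrapper below is hypothetical; the real tendrl-gluster-integration sync code is organised differently), one could delay the sync so it exceeds the 6 minute window:
```
# Purely illustrative way to force the time out while reproducing this bug.
# slowed_down_sync() is a placeholder wrapper, not part of the real code base.
import time


def slowed_down_sync(original_sync, delay_seconds=7 * 60):
    """Wrap the gluster-integration sync so its first run takes longer than
    the 6 minute limit the parent import job waits for."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)  # exceed the hard-coded 6 minute window
        return original_sync(*args, **kwargs)
    return wrapper
```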

>> Link to upstream merge request or issue
Done. A link to the upstream issue has been added, which provides all the PR details required for this.

>> Version where the bug is present
The release 1.6.1-1 had this issue of time out not being handled properly.

>> Is it possible to reproduce this on previously released RHGS WA
Yes, this can be reproduced, but a few small fixes around race conditions have gone in since then.

Comment 7 Filip Balák 2018-05-31 15:34:37 UTC
Do I understand it correctly that:
 * status of import job when the job timed out was `finished` prior to the fix,
 * desired status of the timed out job is `failed` (nothing else needs to be tested)?

Comment 8 Shubhendu Tripathi 2018-06-19 02:11:34 UTC
@Filip, earlier the job sometimes used to remain in `in_progress` in case of a time out. As you correctly mentioned, with the fixes in place the job should now be marked correctly as `failed` if there is a time out on some nodes.
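
In other words, what verification needs to assert is roughly the following (the get_job_status helper here is a placeholder, not the Tendrl API):
```
# Hypothetical assertion capturing the expected behaviour after the fix;
# get_job_status() is a placeholder, not the real Tendrl API.
def check_import_job_after_time_out(get_job_status, job_id):
    status = get_job_status(job_id)
    assert status != "in_progress", "job left hanging (pre-fix behaviour)"
    assert status == "failed", "a timed-out import must be marked failed"
```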

Comment 9 Filip Balák 2018-07-13 12:22:46 UTC
Timing out during import was tested by shutting down a node, which made it unavailable. This generated the following error:
```
Import jobs on cluster(d911cea0-6401-48c1-9d32-df120191ef6a) not yet complete on all nodes([u'44d42e32-ae15-421e-9a20-b55854b08253', u'f0df51a7-d09c-47f2-a03c-59c2e6db3076', u'0690f209-d5b0-4fb6-b383-c7a6e6fd3a6f', u'0d1db8d0-a7a0-4629-b7ba-1aef95401ee8', u'319d58ec-b76d-4cfa-acfe-e721f4cd0eff', u'00972075-6140-4498-9ee9-5cb670c5b21e']). Timing out.
```

Making the repository unavailable resulted in this error:
```
Error importing the cluster (integration_id: d911cea0-6401-48c1-9d32-df120191ef6a). Error: Could not install tendrl-gluster-integration on fbalak-usm1-gl1.usmqe.lab.eng.brq.redhat.com. Error: http://badurl.com/rcm-guest/puddles/RHGS/3.4.0-RHEL-7/2018-07-08.1/Server-RH-Gluster-3.4-NodeAgent/x86_64/os/repodata/repomd.xml: [Errno 14] curl#6 - "Could not resolve host: badurl.com; Unknown error" Trying other mirror. http://badurl.com/rcm-guest/puddles/RHGS/3.4.0-RHEL-7/2018-07-08.1/Server-RH-Gluster-3.4-NodeAgent/x86_64/os/Packages/tendrl-gluster-integration-1.6.3-6.el7rhgs.noarch.rpm: [Errno 14] curl#6 - "Could not resolve host: badurl.com; Unknown error" Trying other mirror. Error downloading packages: tendrl-gluster-integration-1.6.3-6.el7rhgs.noarch: [Errno 256] No more mirrors to try. 
```

These cases were tested several times and the status of the job was always marked correctly as `failed`. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

Comment 11 errata-xmlrpc 2018-09-04 07:01:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

