Bug 1584593

Summary:

Import of 3 Node (100 Vols & 300 bricks) fails with Time-Out Error

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Shekhar Berry <shberry>

Component:

web-admin-tendrl-gluster-integration

Assignee:

gowtham <gshanmug>

Status:

CLOSED WONTFIX

QA Contact:

sds-qe-bugs

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

rhgs-3.4

CC:

mbukatov, mpillai, nthomas, psuriset, rhs-bugs, rsussman, sankarshan, shberry, shtripat

Target Milestone:

---

Keywords:

Performance, ZStream

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-05-08 15:45:41 UTC

Type:

Bug

Attachments:

Description	Flags
Screenshot of import failing due to time out	none

Description Shekhar Berry 2018-05-31 09:31:12 UTC

Created attachment 1446199
Screenshot of import failing due to time out

Description of problem:
-----------------------
In a CNS environment, the scale we recommend is 1000 volumes on a 3-node gluster cluster. The purpose of this test is to measure the resources consumed on the RHGSWA server as the number of volumes on a 3-node gluster cluster grows.
In the first run, 100 volumes were created on a 3-node gluster cluster, and the 3 nodes were then imported.

The import task failed with a time-out error. See the message below and the attached screenshot:

'Import jobs on cluster(c58ca47b-910e-493f-b699-643dc90737fe) not yet complete on all nodes([u'ef059316-ec09-408a-8f5d-d368274804fc', u'e162cc03-e603-4972-bf73-6a8bc6d1c471', u'986070c6-28ba-4c41-a688-da10247b4250']). Timing out.'

On further investigation, we see that each node gets about 6 minutes to complete the import; if the import is not completed within that time, the task times out.

Here is the piece of code that I believe is causing the issue. It's in /usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/import_cluster/__init__.py

                loop_count = 0
                # Wait for (no of nodes) * 6 minutes for import to complete
                wait_count = (len(node_list) - 1) * 36
                while True:
                    parent_job = NS.tendrl.objects.Job(
                        job_id=self.parameters['job_id']
                    ).load()
                    if loop_count >= wait_count:
                        logger.log(
                            "info",
                            NS.publisher_id,
                            {"message": "Import jobs on cluster(%s) not yet "
                             "complete on all nodes(%s). Timing out." %
                             (_cluster.short_name, str(node_list))},
                            job_id=self.parameters['job_id'],
                            flow_id=self.parameters['flow_id']
                        )
                        return False
                    time.sleep(10)
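
For reference, the time budget implied by the excerpt above can be worked out directly. This is only a sketch: the function name import_timeout_seconds is made up for illustration, the constants 36 and 10 are copied from the quoted code, and it assumes loop_count is incremented once per iteration (the increment is not visible in the excerpt).

```python
def import_timeout_seconds(node_count, loops_per_node=36, sleep_seconds=10):
    """Sleep time implied by wait_count = (len(node_list) - 1) * 36
    combined with time.sleep(10) on every loop iteration.

    Assumption: loop_count grows by one per iteration, which the
    quoted excerpt does not show.
    """
    wait_count = (node_count - 1) * loops_per_node
    return wait_count * sleep_seconds


if __name__ == "__main__":
    total = import_timeout_seconds(3)
    print("%d seconds (%d minutes)" % (total, total // 60))
```

For a 3-node cluster this gives 720 seconds, i.e. 12 minutes of sleep time. Note that the budget depends only on the number of nodes and not on the number of volumes or bricks.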



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
On Tendrl Server
----------------
rpm -qa | grep tendrl
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch


On Tendrl Nodes
---------------
rpm -qa | grep tendrl
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch


How reproducible:
-----------------
Always


Steps to Reproduce:
-------------------
1. Create 100 replica-3 volumes on 3 node cluster
2. Try to Import nodes to RHGSWA
3. Observe
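
Step 1 could be scripted roughly as follows. This is a sketch, not the exact procedure used in the test: the host names, the brick root /bricks and the volume naming scheme are all hypothetical, and the helper only builds the gluster CLI command strings without running them.

```python
def volume_create_commands(hosts, count, brick_root="/bricks"):
    """Build 'gluster volume create' and 'gluster volume start' commands
    for `count` replicated volumes, with one brick per host per volume.

    Host names, brick_root and the volN naming are illustrative only.
    """
    commands = []
    for i in range(1, count + 1):
        name = "vol%d" % i
        bricks = " ".join("%s:%s/%s" % (host, brick_root, name) for host in hosts)
        commands.append(
            "gluster volume create %s replica %d %s" % (name, len(hosts), bricks)
        )
        commands.append("gluster volume start %s" % name)
    return commands


if __name__ == "__main__":
    # Hypothetical host names for a 3-node cluster.
    for command in volume_create_commands(["node1", "node2", "node3"], 100):
        print(command)
```

With 3 hosts and count=100 this yields 100 replica-3 volumes and 300 bricks in total, matching the scale described in this report.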

Actual results:
---------------
The import fails when we scale to 100 vols (300 bricks) on a 3-node cluster.

Expected results:
-----------------

Import should have been successful.

Additional info:
----------------
The import started at about 7:08 and failed at about 7:21, i.e. after approximately 13 minutes, which matches 6 minutes per node for two nodes. So it apparently waited on two nodes for 6 minutes each and then failed without ever getting to the first node, the one from which the gluster cluster was detected.

Comment 3 gowtham 2018-11-19 05:53:56 UTC

This issue was resolved as part of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1600092; the import flow's waiting time is now based on the number of volumes.
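
To illustrate the idea of a volume-based waiting time, here is a purely hypothetical sketch. It is not the code from the fix for bug 1600092: the function name wait_count_for, the loops_per_volume parameter and its default value are assumptions made up for this example. Only the per-node term is taken from the excerpt in the description.

```python
def wait_count_for(node_count, volume_count, loops_per_node=36, loops_per_volume=1):
    """Hypothetical loop budget that grows with the number of volumes.

    The per-node term mirrors the original (len(node_list) - 1) * 36;
    the per-volume term is an illustrative assumption, not the real fix.
    """
    return (node_count - 1) * loops_per_node + volume_count * loops_per_volume


if __name__ == "__main__":
    for volumes in (0, 100, 400):
        loops = wait_count_for(3, volumes)
        # Each loop iteration sleeps 10 seconds in the original code.
        print("%d volumes: %d loops, %d seconds" % (volumes, loops, loops * 10))
```

The point of the sketch is only that the budget is no longer constant for a fixed node count, so a cluster with more volumes is given proportionally more time before the flow times out.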

Comment 6 gowtham 2018-12-04 06:32:20 UTC

As part of import flow performance testing, we also tested up to 400 volumes and 1200 bricks.