Bug 1584593

Summary: Import of 3 Node (100 Vols & 300 bricks) fails with Time-Out Error
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shekhar Berry <shberry>
Component: web-admin-tendrl-gluster-integration
Assignee: gowtham <gshanmug>
Status: CLOSED WONTFIX
QA Contact: sds-qe-bugs
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: mbukatov, mpillai, nthomas, psuriset, rhs-bugs, rsussman, sankarshan, shberry, shtripat
Target Milestone: ---
Keywords: Performance, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-08 15:45:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Screenshot of import failing due to time out (flags: none)

Description Shekhar Berry 2018-05-31 09:31:12 UTC
Created attachment 1446199
Screenshot of import failing due to time out

Description of problem:
-----------------------
In a CNS environment, the scale we recommend is 1000 volumes for a 3-node gluster cluster. The test being done here is to find out what resources are consumed on the RHGSWA server as we scale the number of volumes on a 3-node gluster cluster.
In the first run, 100 volumes were created on a 3-node gluster cluster and these 3 nodes were imported.

The import task failed with a time-out error. See the message below and the attached screenshot:

'Import jobs on cluster(c58ca47b-910e-493f-b699-643dc90737fe) not yet complete on all nodes([u'ef059316-ec09-408a-8f5d-d368274804fc', u'e162cc03-e603-4972-bf73-6a8bc6d1c471', u'986070c6-28ba-4c41-a688-da10247b4250']). Timing out.'

On further investigation, we see that each node gets about 6 minutes to complete the import; if the import is not completed within that time, it results in a time-out.

Here is the piece of code that I believe is causing the issue. It is in /usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/import_cluster/__init__.py

                loop_count = 0
                # Wait for (no of nodes) * 6 minutes for import to complete
                wait_count = (len(node_list) - 1) * 36
                while True:
                    parent_job = NS.tendrl.objects.Job(
                        job_id=self.parameters['job_id']
                    ).load()
                    if loop_count >= wait_count:
                        logger.log(
                            "info",
                            NS.publisher_id,
                            {"message": "Import jobs on cluster(%s) not yet "
                             "complete on all nodes(%s). Timing out." %
                             (_cluster.short_name, str(node_list))},
                            job_id=self.parameters['job_id'],
                            flow_id=self.parameters['flow_id']
                        )
                        return False
                    time.sleep(10)



Version-Release number of selected component (if applicable):
-------------------------------------------------------------
On Tendrl Server
----------------
rpm -qa | grep tendrl
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch


On Tendrl Nodes
---------------
rpm -qa | grep tendrl
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch


How reproducible:
-----------------
Always


Steps to Reproduce:
-------------------
1. Create 100 replica-3 volumes on a 3-node cluster
2. Try to import the nodes into RHGSWA
3. Observe

Actual results:
---------------
Import fails when we try to scale to 100 vols (300 bricks) in a 3-node cluster.

Expected results:
-----------------

Import should have been successful.

Additional info:
----------------
The import started at about 7:08 and failed at about 7:21, so after approximately 13 minutes, which is consistent with 6 minutes per node for two nodes. So it appears to have waited on two nodes for 6 minutes each and then failed without going to the first node, from which the gluster cluster was detected.
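
The arithmetic behind that estimate can be checked with a small sketch based only on the snippet quoted above: wait_count is (number of nodes - 1) * 36 and each loop iteration sleeps 10 seconds. The function and constant names here are made up for illustration and are not part of tendrl; the sketch also ignores the time spent loading the job on each iteration, so the real elapsed time would be somewhat longer.

```python
# Values taken from the quoted snippet; the names are hypothetical.
SLEEP_SECONDS = 10        # time.sleep(10) per loop iteration
ITERATIONS_PER_NODE = 36  # multiplier in the wait_count expression


def import_timeout_seconds(node_count):
    """Return the total sleep time before the quoted loop gives up."""
    wait_count = (node_count - 1) * ITERATIONS_PER_NODE
    return wait_count * SLEEP_SECONDS


if __name__ == "__main__":
    seconds = import_timeout_seconds(3)
    print("%d seconds (%.0f minutes)" % (seconds, seconds / 60.0))
```

For a 3-node cluster this gives 720 seconds (12 minutes) of sleeping, which is in line with the roughly 13 minutes observed once per-iteration overhead is added.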

Comment 3 gowtham 2018-11-19 05:53:56 UTC
This issue was resolved while fixing https://bugzilla.redhat.com/show_bug.cgi?id=1600092; the import waiting time for the flow is now based on the number of volumes.
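
As a rough illustration of what a volume-based wait could look like (this is NOT the actual change made for bug 1600092; the function name, parameters, and the per-volume and base budgets below are all assumptions chosen for the example), a polling loop might size its time budget from the volume count:

```python
import time


def wait_for_import(is_complete, volume_count, seconds_per_volume=10,
                    base_seconds=360, poll_seconds=10, sleep=time.sleep):
    """Poll is_complete() until it returns True or the budget runs out.

    All names and default values are hypothetical. The budget grows
    linearly with the number of volumes instead of the number of nodes.
    """
    budget = base_seconds + volume_count * seconds_per_volume
    waited = 0
    while waited < budget:
        if is_complete():
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    # One last check after the final sleep before reporting a time-out.
    return bool(is_complete())
```

Passing the sleep function as a parameter is only a convenience so the loop can be exercised without real delays; with these assumed defaults, 100 volumes would get a budget of 1360 seconds.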

Comment 6 gowtham 2018-12-04 06:32:20 UTC
As part of import flow performance testing, we also tested up to 400 volumes and 1200 bricks.