Created attachment 1446199
Screenshot of import failing due to time out

Description of problem:
-----------------------
In a CNS environment, the scale we recommend is 1000 volumes for a 3-node gluster cluster. The test being done here is to find out what resources are consumed on the RHGSWA server as we scale the number of volumes on a 3-node gluster cluster.

In the first run, 100 volumes were created on a 3-node gluster cluster and these 3 nodes were imported. The import task failed with a time-out error. See below and the attached screenshot:

'Import jobs on cluster(c58ca47b-910e-493f-b699-643dc90737fe) not yet complete on all nodes([u'ef059316-ec09-408a-8f5d-d368274804fc', u'e162cc03-e603-4972-bf73-6a8bc6d1c471', u'986070c6-28ba-4c41-a688-da10247b4250']). Timing out.'

On further investigation, each node appears to get about 6 minutes to complete the import; if the import does not finish within that time, the task times out.

Here is the piece of code that I believe is causing the issue. It is in /usr/lib/python2.7/site-packages/tendrl/commons/objects/cluster/atoms/import_cluster/__init__.py

    loop_count = 0
    # Wait for (no of nodes) * 6 minutes for import to complete
    wait_count = (len(node_list) - 1) * 36
    while True:
        parent_job = NS.tendrl.objects.Job(
            job_id=self.parameters['job_id']
        ).load()
        if loop_count >= wait_count:
            logger.log(
                "info",
                NS.publisher_id,
                {"message": "Import jobs on cluster(%s) not yet "
                 "complete on all nodes(%s). Timing out."
                 % (_cluster.short_name, str(node_list))},
                job_id=self.parameters['job_id'],
                flow_id=self.parameters['flow_id']
            )
            return False
        time.sleep(10)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

On Tendrl Server
----------------
rpm -qa | grep tendrl
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-ansible-1.6.3-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch

On Tendrl Nodes
---------------
rpm -qa | grep tendrl
tendrl-gluster-integration-1.6.3-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create 100 replica-3 volumes on a 3-node cluster
2. Try to import the nodes into RHGSWA
3. Observe

Actual results:
---------------
Import fails when we try to scale to 100 volumes (300 bricks) in a 3-node cluster.

Expected results:
-----------------
Import should have been successful.

Additional info:
----------------
The import started at about 7:08 and failed at about 7:21, i.e. after approximately 13 minutes, which is consistent with 6 minutes per node. So it appears to have waited 6 minutes each for two nodes and then failed, without allowing any time for the first node, from which the gluster cluster was detected.
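
The timing reasoning above can be checked with a small standalone calculation. This is only a sketch: it assumes that loop_count is incremented once per loop iteration (the increment is not visible in the quoted excerpt), and the function and constant names are made up for illustration; they are not part of tendrl.

```python
# Standalone arithmetic check; none of these names exist in tendrl.
POLL_INTERVAL_SECONDS = 10  # matches time.sleep(10) in the quoted loop
POLLS_PER_NODE = 36         # matches the "* 36" multiplier in wait_count


def import_wait_budget_seconds(node_count):
    """Total wait before time-out, assuming one loop_count step per sleep."""
    wait_count = (node_count - 1) * POLLS_PER_NODE
    return wait_count * POLL_INTERVAL_SECONDS


budget = import_wait_budget_seconds(3)
print("%d seconds (%.0f minutes)" % (budget, budget / 60.0))
```

Under that assumption a 3-node cluster gets 72 polls of 10 seconds, that is 12 minutes in total, which is close to the roughly 13 minutes observed between 7:08 and 7:21.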
This issue was resolved as part of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1600092. The wait time for the import flow is now based on the number of volumes.
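
To illustrate what a volume-based wait could look like, here is a purely hypothetical sketch. The formula, the function name, and both constants are assumptions chosen for the example; they are not taken from the actual fix in the bug referenced above.

```python
# Hypothetical illustration only; not the formula used by the real fix.
POLL_INTERVAL_SECONDS = 10


def volume_based_wait_count(volume_count, base_polls=36, polls_per_volume=2):
    """Number of polls to wait: a fixed base plus an allowance per volume.

    base_polls and polls_per_volume are made-up values for this sketch.
    """
    return base_polls + volume_count * polls_per_volume


for volumes in (100, 400):
    polls = volume_based_wait_count(volumes)
    print(volumes, "volumes ->", polls * POLL_INTERVAL_SECONDS, "seconds")
```

The point of such a scheme is that the time-out grows with the amount of data to be imported rather than with the node count alone.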
As part of import flow performance testing, we also tested with up to 400 volumes and 1200 bricks.