Description of problem:

I'm not able to import a 24-node cluster. The import job always fails on a timeout. The first error looks like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Timing out import job, Cluster data still not fully updated
(node: 3b61ef35-d513-4c55-ab68-d972711e6e01)
(integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare a cluster with a higher number of Gluster Storage Nodes (24 Storage Nodes, 4 vCPUs, 6 GB RAM in my case).
2. Prepare the RHGS WA Server (12 vCPUs, 24 GB RAM in my case).
3. Run the import cluster task.

Actual results:
The import cluster task fails (timeout) in less than 10 minutes.

Expected results:
It should be possible to import such a cluster.

Additional info:
All errors from the ImportCluster tasks (parsed for better readability):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* error  Failure in Job 7aecbb0a-99ff-4168-a9ad-a1392cd7efb7 Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 234, in process_job
      the_flow.run()
    File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
      exc_traceback)
  FlowExecutionFailedError: [
    'Traceback (most recent call last):\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n  super(ImportCluster, self).run()\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n  (atom_fqn, self._defs[\'help\'])\n',
    'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
  11 Jul 2018 01:24:18

* error  Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:18

* error  Child jobs failed are [u'0bb75dfe-89d2-48da-a2df-dddd8534d1c5']
  11 Jul 2018 01:24:16

* error  Failure in Job 0bb75dfe-89d2-48da-a2df-dddd8534d1c5 Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 234, in process_job
      the_flow.run()
    File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
      exc_traceback)
  FlowExecutionFailedError: [
    'Traceback (most recent call last):\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n  super(ImportCluster, self).run()\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 227, in run\n  "Error executing post run function: %s" % atom_fqn\n',
    'AtomExecutionFailedError: Atom Execution failed. Error: Error executing post run function: tendrl.objects.Cluster.atoms.CheckSyncDone\n']
  11 Jul 2018 01:24:15

* error  Failed post-run: tendrl.objects.Cluster.atoms.CheckSyncDone for flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:15

* error  Timing out import job, Cluster data still not fully updated (node: 3b61ef35-d513-4c55-ab68-d972711e6e01) (integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
  11 Jul 2018 01:24:15
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
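To illustrate the failure mode (this is a sketch, not the actual Tendrl code): the "Timing out import job" error is raised by the CheckSyncDone post-run, which appears to poll per-node sync state against a fixed deadline, so a larger cluster has more data to write within the same time budget. A minimal Python sketch of that pattern, with hypothetical names (wait_for_cluster_sync, is_node_synced, TIMEOUT_SECS):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Illustrative sketch only -- not the actual Tendrl implementation.
import time

TIMEOUT_SECS = 600        # assumed fixed budget for the whole sync check
POLL_INTERVAL_SECS = 10   # assumed polling period

def wait_for_cluster_sync(nodes, is_node_synced):
    """Poll until every node reports its data as synced, or fail on timeout.

    `nodes` is a list of node ids; `is_node_synced` is a callable checking a
    single node -- both are hypothetical stand-ins for the real etcd lookups.
    """
    deadline = time.time() + TIMEOUT_SECS
    pending = set(nodes)
    while True:
        pending = {n for n in pending if not is_node_synced(n)}
        if not pending:
            return
        if time.time() > deadline:
            # A 24-node cluster simply has more data to sync, so a fixed
            # deadline can expire even though the sync would eventually finish.
            raise RuntimeError(
                "Timing out import job, Cluster data still not fully updated "
                "(node: %s)" % sorted(pending)[0])
        time.sleep(POLL_INTERVAL_SECS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~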
Created attachment 1458062: Errors_in_ImportCluster_task_02
Created attachment 1458063: Errors_in_ImportCluster_task_01
The failure seen in this bug is caused by BZ 1600790, as Nishanth pointed out during today's bug triage. That said, there seems to be an opportunity to improve error reporting on the WA side even if BZ 1600790 is not fixed.
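One hypothetical shape such an improvement could take: the current timeout message names only a single node, so on a 24-node cluster it is hard to tell how far the sync actually got. A sketch of a message that lists every node still pending (illustrative only, not a patch):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical message builder, not taken from the Tendrl source.
def timeout_message(pending_nodes, integration_id):
    return ("Timing out import job, Cluster data still not fully updated "
            "(%d nodes pending: %s) (integration_id: %s)"
            % (len(pending_nodes), ", ".join(sorted(pending_nodes)),
               integration_id))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~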
I've hit the same error when trying to import a cluster with geo-replication configured (see bug 1578716 and bug 1603175). The clusters were really small in this case (2 and 6 storage nodes). This means we will also have to validate the geo-replication scenario as part of the verification of this BZ.

Version-Release number of selected component:

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
glusterfs-3.12.2-14.el7rhgs.x86_64
glusterfs-api-3.12.2-14.el7rhgs.x86_64
glusterfs-cli-3.12.2-14.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-14.el7rhgs.x86_64
glusterfs-events-3.12.2-14.el7rhgs.x86_64
glusterfs-fuse-3.12.2-14.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-14.el7rhgs.x86_64
glusterfs-libs-3.12.2-14.el7rhgs.x86_64
glusterfs-rdma-3.12.2-14.el7rhgs.x86_64
glusterfs-server-3.12.2-14.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
The scenario from comment 9 was tested during the verification of Bug 1603175, comment 3.
Tested and verified on:

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-events-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Importing a 24-node cluster (the same as described in Comment 0) works correctly.

>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616