Bug 1600092 - Importing bigger cluster failing: Timing out import job, Cluster data still not fully updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-commons
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Shubhendu Tripathi
QA Contact: Daniel Horák
URL:
Whiteboard:
Depends On: 1600790 1607783
Blocks: 1503137 1560875
Reported: 2018-07-11 12:03 UTC by Daniel Horák
Modified: 2018-09-04 07:09 UTC

Fixed In Version: tendrl-gluster-integration-1.6.3-8.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 07:08:56 UTC
Embargoed:


Attachments
Errors_in_ImportCluster_task_01 (deleted)
2018-07-11 12:03 UTC, Daniel Horák
Errors_in_ImportCluster_task_02 (99.08 KB, image/png)
2018-07-11 12:03 UTC, Daniel Horák
Errors_in_ImportCluster_task_01 (103.36 KB, image/png)
2018-07-11 12:04 UTC, Daniel Horák


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1600790 | 0 | unspecified | CLOSED | Segmentation fault while using gfapi while getting volume utilization | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHSA-2018:2616 | 0 | None | None | None | 2018-09-04 07:09:57 UTC

Internal Links: 1600790

Description Daniel Horák 2018-07-11 12:03:09 UTC
Description of problem:
  I'm not able to import a 24-node cluster.

  The import job always fails with a timeout.

  The first error looks like this:
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Timing out import job, Cluster data still not fully updated
  (node: 3b61ef35-d513-4c55-ab68-d972711e6e01)
  (integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):
  RHGS WA Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  tendrl-ansible-1.6.3-5.el7rhgs.noarch
  tendrl-api-1.6.3-4.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
  tendrl-commons-1.6.3-8.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
  tendrl-node-agent-1.6.3-8.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-6.el7rhgs.noarch

  Gluster Storage Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  Red Hat Gluster Storage Server 3.4.0
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-13.el7rhgs.x86_64
  glusterfs-api-3.12.2-13.el7rhgs.x86_64
  glusterfs-cli-3.12.2-13.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
  glusterfs-events-3.12.2-13.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
  glusterfs-libs-3.12.2-13.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
  glusterfs-server-3.12.2-13.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
  python2-gluster-3.12.2-13.el7rhgs.x86_64
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-8.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-6.el7rhgs.noarch
  tendrl-node-agent-1.6.3-8.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
  100%

Steps to Reproduce:
1. Prepare a cluster with a large number of Gluster Storage Nodes
  (24 Storage Nodes, 4 vCPUs, 6GB RAM in my case)
2. Prepare the RHGS WA Server (12 vCPUs, 24GB RAM in my case)
3. Run the import cluster task.

Actual results:
  The cluster import fails with a timeout in less than 10 minutes.

Expected results:
  It should be possible to import such a cluster.

Additional info:
All errors from ImportCluster tasks (parsed for better readability):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* error
  Failure in Job 7aecbb0a-99ff-4168-a9ad-a1392cd7efb7
  Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py",
    line 234, in process_job the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py",
    line 131, in run exc_traceback)
  FlowExecutionFailedError: [
  'Traceback (most recent call last):\n',
  ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py",
    line 98, in run\n super(ImportCluster, self).run()\n',
  ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py",
    line 186, in run\n (atom_fqn, self._defs[\'help\'])\n',
  'AtomExecutionFailedError: Atom Execution failed. Error:
    Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow:
    Import existing Gluster Cluster\n']
  11 Jul 2018 01:24:18
* error
  Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:18

* error
  Child jobs failed are [u'0bb75dfe-89d2-48da-a2df-dddd8534d1c5']
  11 Jul 2018 01:24:16
* error
  Failure in Job 0bb75dfe-89d2-48da-a2df-dddd8534d1c5
  Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py",
    line 234, in process_job the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py",
    line 131, in run exc_traceback)
  FlowExecutionFailedError: [
  'Traceback (most recent call last):\n',
  ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py",
    line 98, in run\n super(ImportCluster, self).run()\n',
  ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py",
    line 227, in run\n "Error executing post run function: %s" % atom_fqn\n',
  'AtomExecutionFailedError: Atom Execution failed. Error:
    Error executing post run function: tendrl.objects.Cluster.atoms.CheckSyncDone\n']
  11 Jul 2018 01:24:15
* error
  Failed post-run: tendrl.objects.Cluster.atoms.CheckSyncDone for flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:15
* error
  Timing out import job, Cluster data still not fully updated
  (node: 3b61ef35-d513-4c55-ab68-d972711e6e01)
  (integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
  11 Jul 2018 01:24:15
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
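
For triage, the node and integration IDs can be pulled out of a timeout message like the one above. This is only a minimal sketch: it assumes the message keeps the `(node: ...)` / `(integration_id: ...)` layout shown in this report, and `parse_timeout_error` is a hypothetical helper name, not part of Tendrl.

```python
import re

# The pattern mirrors the message layout seen in this report. That layout is
# an assumption taken from the log excerpt, not a documented Tendrl contract.
_TIMEOUT_RE = re.compile(
    r"Timing out import job, Cluster data still not fully updated\s*"
    r"\(node:\s*(?P<node>[0-9a-f-]+)\)\s*"
    r"\(integration_id:\s*(?P<integration_id>[0-9a-f-]+)\)"
)


def parse_timeout_error(message):
    """Return a dict with 'node' and 'integration_id', or None if no match."""
    match = _TIMEOUT_RE.search(message)
    return match.groupdict() if match else None


sample = """Timing out import job, Cluster data still not fully updated
(node: 3b61ef35-d513-4c55-ab68-d972711e6e01)
(integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)"""

print(parse_timeout_error(sample))
```

Grouping the parsed `node` values across repeated import attempts may help show whether the same storage node is the one that never finishes syncing.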

Comment 1 Daniel Horák 2018-07-11 12:03:35 UTC
Created attachment 1458062 [details]
Errors_in_ImportCluster_task_02

Comment 2 Daniel Horák 2018-07-11 12:04:17 UTC
Created attachment 1458063 [details]
Errors_in_ImportCluster_task_01

Comment 3 Martin Bukatovic 2018-07-13 09:30:08 UTC
The failure seen in this bug is caused by BZ 1600790, as Nishanth pointed out
at today's bug triage.

That said, there seems to be an opportunity to improve error reporting on the
WA side even if BZ 1600790 is not fixed.
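
As an illustration of what more informative error reporting could look like, here is a minimal sketch of a poll-with-deadline loop that names the nodes still pending when it gives up. It is not Tendrl's actual CheckSyncDone code: `wait_for_sync`, `is_synced`, and the timeout and poll parameters are all hypothetical.

```python
import time


def wait_for_sync(node_ids, is_synced, timeout_s, poll_s,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until every node reports synced, or raise naming the laggards.

    `is_synced` is a hypothetical callable (node_id -> bool). Nothing here
    reflects how Tendrl really checks sync state; it only shows the idea of
    listing the unsynced nodes in the timeout error.
    """
    deadline = clock() + timeout_s
    pending = set(node_ids)
    while True:
        # Re-check only the nodes that have not reported synced yet.
        pending = {n for n in pending if not is_synced(n)}
        if not pending:
            return
        if clock() >= deadline:
            raise RuntimeError(
                "Import timed out after %.0fs; nodes not synced: %s"
                % (timeout_s, ", ".join(sorted(pending)))
            )
        sleep(poll_s)
```

The `clock` and `sleep` parameters are injectable so the loop can be exercised in tests without real waiting.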

Comment 9 Daniel Horák 2018-07-19 12:15:11 UTC
I've hit the same error when trying to import a cluster with geo-replication
configured (see bug 1578716 and bug 1603175). The cluster was really small
in this case (2 and 6 storage nodes).

This means that we will also have to validate the geo-replication scenario
as part of the validation of this BZ.

Version-Release number of selected component:
  RHGS WA Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  tendrl-ansible-1.6.3-5.el7rhgs.noarch
  tendrl-api-1.6.3-4.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
  tendrl-commons-1.6.3-9.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-8.el7rhgs.noarch

  Gluster Storage Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  Red Hat Gluster Storage Server 3.4.0
  glusterfs-3.12.2-14.el7rhgs.x86_64
  glusterfs-api-3.12.2-14.el7rhgs.x86_64
  glusterfs-cli-3.12.2-14.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-14.el7rhgs.x86_64
  glusterfs-events-3.12.2-14.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-14.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-14.el7rhgs.x86_64
  glusterfs-libs-3.12.2-14.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-14.el7rhgs.x86_64
  glusterfs-server-3.12.2-14.el7rhgs.x86_64
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-9.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch

Comment 11 Daniel Horák 2018-08-03 08:42:12 UTC
The scenario from comment 9 was tested during verification of Bug 1603175 (see comment 3 there).

Comment 12 Daniel Horák 2018-08-03 09:10:39 UTC
Tested and verified on:
RHGS WA Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  etcd-3.2.7-1.el7.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  rubygem-etcd-0.3.0-2.el7rhgs.noarch
  tendrl-ansible-1.6.3-6.el7rhgs.noarch
  tendrl-api-1.6.3-5.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
  tendrl-commons-1.6.3-11.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-9.el7rhgs.noarch

Gluster Storage Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  Red Hat Gluster Storage Server 3.4.0
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-15.el7rhgs.x86_64
  glusterfs-api-3.12.2-15.el7rhgs.x86_64
  glusterfs-cli-3.12.2-15.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
  glusterfs-events-3.12.2-15.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
  glusterfs-libs-3.12.2-15.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
  glusterfs-server-3.12.2-15.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
  python2-gluster-3.12.2-15.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-11.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Importing the 24-node cluster (the same as described in Comment 0) works correctly.

>> VERIFIED

Comment 14 errata-xmlrpc 2018-09-04 07:08:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

