Description of problem:

I'm not able to import a 24-node cluster. The import job always fails on a timeout. The first error looks like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Timing out import job, Cluster data still not fully updated
(node: 3b61ef35-d513-4c55-ab68-d972711e6e01)
(integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare a cluster with a higher number of Gluster Storage Nodes (24 Storage Nodes, 4 vCPUs, 6 GB RAM in my case).
2. Prepare the RHGS WA Server (12 vCPUs, 24 GB RAM in my case).
3. Run the import cluster task.

Actual results:
The import cluster task fails (timeout) in less than 10 minutes.

Expected results:
It should be possible to import such a cluster.

Additional info:
All errors from the ImportCluster tasks (parsed for better readability):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* error  Failure in Job 7aecbb0a-99ff-4168-a9ad-a1392cd7efb7 Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 234, in process_job
      the_flow.run()
    File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
      exc_traceback)
  FlowExecutionFailedError: [
    'Traceback (most recent call last):\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n  super(ImportCluster, self).run()\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n  (atom_fqn, self._defs[\'help\'])\n',
    'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
  11 Jul 2018 01:24:18

* error  Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:18

* error  Child jobs failed are [u'0bb75dfe-89d2-48da-a2df-dddd8534d1c5']
  11 Jul 2018 01:24:16

* error  Failure in Job 0bb75dfe-89d2-48da-a2df-dddd8534d1c5 Flow tendrl.flows.ImportCluster with error:
  Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 234, in process_job
      the_flow.run()
    File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
      exc_traceback)
  FlowExecutionFailedError: [
    'Traceback (most recent call last):\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n  super(ImportCluster, self).run()\n',
    ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 227, in run\n  "Error executing post run function: %s" % atom_fqn\n',
    'AtomExecutionFailedError: Atom Execution failed. Error: Error executing post run function: tendrl.objects.Cluster.atoms.CheckSyncDone\n']
  11 Jul 2018 01:24:15

* error  Failed post-run: tendrl.objects.Cluster.atoms.CheckSyncDone for flow: Import existing Gluster Cluster
  11 Jul 2018 01:24:15

* error  Timing out import job, Cluster data still not fully updated (node: 3b61ef35-d513-4c55-ab68-d972711e6e01) (integration_id: 260d83b2-d4a7-4c90-9c86-8ccbbb01e024)
  11 Jul 2018 01:24:15
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
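To illustrate the failure mode (this is a sketch, not the actual Tendrl code): the "Timing out import job" error is raised by the CheckSyncDone post-run, which appears to poll per-node sync state against a fixed deadline, so a larger cluster has more data to write within the same time budget. A minimal Python sketch of that pattern, with hypothetical names (wait_for_cluster_sync, is_node_synced, TIMEOUT_SECS):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Illustrative sketch only -- not the actual Tendrl implementation.
import time

TIMEOUT_SECS = 600        # assumed fixed budget for the whole sync check
POLL_INTERVAL_SECS = 10   # assumed polling period

def wait_for_cluster_sync(nodes, is_node_synced):
    """Poll until every node reports its data as synced, or fail on timeout.

    `nodes` is a list of node ids; `is_node_synced` is a callable checking a
    single node -- both are hypothetical stand-ins for the real etcd lookups.
    """
    deadline = time.time() + TIMEOUT_SECS
    pending = set(nodes)
    while True:
        pending = {n for n in pending if not is_node_synced(n)}
        if not pending:
            return
        if time.time() > deadline:
            # A 24-node cluster simply has more data to sync, so a fixed
            # deadline can expire even though the sync would eventually finish.
            raise RuntimeError(
                "Timing out import job, Cluster data still not fully updated "
                "(node: %s)" % sorted(pending)[0])
        time.sleep(POLL_INTERVAL_SECS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~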
Created attachment 1458062: Errors_in_ImportCluster_task_02
Created attachment 1458063: Errors_in_ImportCluster_task_01
The failure seen in this bug is caused by BZ 1600790, as Nishanth pointed out during today's bug triage. That said, there seems to be an opportunity to improve error reporting on the WA side even if BZ 1600790 is not fixed.
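One hypothetical shape such an improvement could take: the current timeout message names only a single node, so on a 24-node cluster it is hard to tell how far the sync actually got. A sketch of a message that lists every node still pending (illustrative only, not a patch):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical message builder, not taken from the Tendrl source.
def timeout_message(pending_nodes, integration_id):
    return ("Timing out import job, Cluster data still not fully updated "
            "(%d nodes pending: %s) (integration_id: %s)"
            % (len(pending_nodes), ", ".join(sorted(pending_nodes)),
               integration_id))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~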
I've hit the same error when trying to import a cluster with geo-replication configured (see bug 1578716 and bug 1603175). The clusters were really small in this case (2 and 6 storage nodes). This means we will also have to validate the geo-replication scenario as part of the verification of this BZ.

Version-Release number of selected component:

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
glusterfs-3.12.2-14.el7rhgs.x86_64
glusterfs-api-3.12.2-14.el7rhgs.x86_64
glusterfs-cli-3.12.2-14.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-14.el7rhgs.x86_64
glusterfs-events-3.12.2-14.el7rhgs.x86_64
glusterfs-fuse-3.12.2-14.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-14.el7rhgs.x86_64
glusterfs-libs-3.12.2-14.el7rhgs.x86_64
glusterfs-rdma-3.12.2-14.el7rhgs.x86_64
glusterfs-server-3.12.2-14.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
The scenario from comment 9 was tested during the verification of Bug 1603175, comment 3.
Tested and verified on:

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-15.el7rhgs.x86_64
glusterfs-api-3.12.2-15.el7rhgs.x86_64
glusterfs-cli-3.12.2-15.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-15.el7rhgs.x86_64
glusterfs-events-3.12.2-15.el7rhgs.x86_64
glusterfs-fuse-3.12.2-15.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-15.el7rhgs.x86_64
glusterfs-libs-3.12.2-15.el7rhgs.x86_64
glusterfs-rdma-3.12.2-15.el7rhgs.x86_64
glusterfs-server-3.12.2-15.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-15.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

Importing a 24-node cluster (the same as described in Comment 0) works correctly.

>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616