Bug 1622461

Summary: tendrl-node-agent (and gluster-integration) crash if WA Server (etcd database) not available
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Daniel Horák <dahorak>
Component: web-admin-tendrl-node-agent
Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED WONTFIX
QA Contact: sds-qe-bugs
Severity: low
Priority: low
Version: rhgs-3.4
CC: dahorak, fbalak, gshanmug, nthomas, rhs-bugs, sankarshan, shtripat
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Known Issue
Doc Text:
When the central store (etcd) becomes unavailable, either because the etcd service is stopped or because the Web Administration server node itself is shut down, all Web Administration services start reporting exceptions about etcd being unreachable. As a consequence, the Web Administration services crash while etcd is not reachable. Workaround: once etcd is available again, restart the Web Administration services.
Story Points: ---
Last Closed: 2019-05-08 19:50:46 UTC
Type: Bug
Bug Blocks: 1503143

Description Daniel Horák 2018-08-27 09:22:53 UTC
Description of problem:
  If the WA Server (Tendrl server) is temporarily unavailable (because of a network
  issue, a maintenance reboot, ...), the tendrl-* services on the Storage nodes crash
  and do not recover once the WA server is up and available again.


Version-Release number of selected component (if applicable):
  RHGS WA Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  grafana-4.3.2-3.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  tendrl-ansible-1.6.3-7.el7rhgs.noarch
  tendrl-api-1.6.3-5.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-11.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-11.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-11.el7rhgs.noarch

  Gluster Storage Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-17.el7rhgs.x86_64
  glusterfs-api-3.12.2-17.el7rhgs.x86_64
  glusterfs-cli-3.12.2-17.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-17.el7rhgs.x86_64
  glusterfs-events-3.12.2-17.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-17.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-17.el7rhgs.x86_64
  glusterfs-libs-3.12.2-17.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-17.el7rhgs.x86_64
  glusterfs-server-3.12.2-17.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
  python2-gluster-3.12.2-17.el7rhgs.x86_64
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch


How reproducible:
  100%


Steps to Reproduce:
1. Prepare and install Gluster Storage Cluster and RHGS Web Administration.
2. Import Gluster Storage cluster into WA.
3. Power off Web Administration Server.
4. Check logs and tendrl-* services on Storage Servers.


Actual results:
  After a few minutes, both tendrl-node-agent and tendrl-gluster-integration on
  all Gluster Storage Servers are down (crashed):

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # systemctl status -l tendrl-node-agent tendrl-gluster-integration
  ● tendrl-node-agent.service - A python agent local to every managed storage node in the sds cluster
     Loaded: loaded (/usr/lib/systemd/system/tendrl-node-agent.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Mon 2018-08-27 10:44:55 CEST; 28min ago
       Docs: https://github.com/Tendrl/node-agent/tree/master/doc/source
    Process: 21776 ExecStart=/usr/bin/tendrl-node-agent (code=exited, status=0/SUCCESS)
   Main PID: 21776 (code=exited, status=0/SUCCESS)

  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: Traceback (most recent call last):
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.run()
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 765, in run
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.__target(*self.__args, **self.__kwargs)
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/central_store/utils.py", line 95, in watch
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: for change in NS._int.client.eternal_watch(key):
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 795, in eternal_watch
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: local_index = response.modifiedIndex + 1
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: AttributeError: 'NoneType' object has no attribute 'modifiedIndex'

  ● tendrl-gluster-integration.service - Tendrl Gluster Daemon to Manage gluster tasks
     Loaded: loaded (/usr/lib/systemd/system/tendrl-gluster-integration.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2018-08-27 10:44:55 CEST; 28min ago
    Process: 7693 ExecStart=/usr/bin/tendrl-gluster-integration (code=exited, status=1/FAILURE)
   Main PID: 7693 (code=exited, status=1/FAILURE)

  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup NodeContext for namespace.tendrl
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup TendrlContext for namespace.tendrl
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
  Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service: main process exited, code=exited, status=1/FAILURE
  Aug 27 10:44:55 gl1.example.com systemd[1]: Stopped Tendrl Gluster Daemon to Manage gluster tasks.
  Aug 27 10:44:55 gl1.example.com systemd[1]: Unit tendrl-gluster-integration.service entered failed state.
  Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service failed.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Expected results:
  The tendrl-* services on the Gluster Storage nodes shouldn't crash if the WA Server
  is temporarily unavailable, and they should be able to recover once the WA server
  is back up and running again.


Additional info:
  Once the WA server is back up and running, the imported Gluster Storage Cluster
  appears to be in a Healthy state in the Web UI and all hosts are "UP", but no
  Volumes are visible.

Comment 4 Daniel Horák 2018-08-27 13:39:21 UTC
From my (QE) point of view, it is not a blocker, because (re)starting the tendrl-node-agent and tendrl-gluster-integration services works as expected.

But it might be worth documenting it as a known issue/troubleshooting scenario, because at first glance it might not be clear where the problem is: all the hosts are in the "Up" state and only the information about volumes disappears.

Comment 5 Shubhendu Tripathi 2018-08-28 02:51:34 UTC
@sankarshan I agree with Daniel: after a restart of the WA server node, if we start the tendrl-node-agent and tendrl-gluster-integration services on the storage nodes, they work as expected. So yes, I feel it is not a blocker.

Comment 8 Shubhendu Tripathi 2018-11-19 06:03:01 UTC
We need to handle the exception while connecting to etcd from the tendrl components: on a connection exception they should not crash, but rather report the error and keep retrying.

Changing the severity to low as restarting the services fixes things.
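
A minimal sketch of the report-and-retry approach described in comment 8, assuming python-etcd's etcd.EtcdException and a hypothetical resilient_watch() helper wrapped around the eternal_watch() loop visible in the traceback above:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import logging
import time

import etcd

log = logging.getLogger(__name__)

RETRY_INTERVAL = 5  # seconds between reconnection attempts (assumed value)


def resilient_watch(client, key, handle_change):
    """Watch `key` forever; report etcd outages and retry instead of crashing."""
    while True:
        try:
            for change in client.eternal_watch(key):
                handle_change(change)
        except etcd.EtcdException as exc:
            # Central store (etcd) is down or unreachable: report and keep trying.
            log.warning("etcd not reachable while watching %s: %s; retrying in %ss",
                        key, exc, RETRY_INTERVAL)
            time.sleep(RETRY_INTERVAL)
        except AttributeError as exc:
            # Guards the observed crash where eternal_watch() gets a None response
            # and dereferences response.modifiedIndex.
            log.warning("watch on %s returned no response (%s); retrying", key, exc)
            time.sleep(RETRY_INTERVAL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The existing watch() helper in tendrl/commons/utils/central_store/utils.py could delegate to such a wrapper, so that a central-store outage only degrades monitoring instead of terminating the daemon.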