Bug 1622461 - tendrl-node-agent (and gluster-integration) crash if WA Server (etcd database) not available
Summary: tendrl-node-agent (and gluster-integration) crash if WA Server (etcd database) not available
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-node-agent
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1503143
 
Reported: 2018-08-27 09:22 UTC by Daniel Horák
Modified: 2019-10-16 02:56 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
When the central store (etcd) is stopped, which can happen either because etcd itself is stopped or because the Web Administration server node is shut down, all Web Administration services start reporting exceptions about etcd reachability. As a consequence, the Web Administration services crash because etcd is not reachable. Workaround: Once etcd is back, restart the Web Administration services.
Clone Of:
Environment:
Last Closed: 2019-05-08 19:50:46 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1647386 0 medium CLOSED AttributeError: 'NoneType' object has no attribute 'modifiedIndex' Traceback 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1647393 0 medium CLOSED AttributeError: 'NoneType' object has no attribute 'modifiedIndex' Traceback 2021-02-22 00:41:40 UTC

Internal Links: 1647386 1647393

Description Daniel Horák 2018-08-27 09:22:53 UTC
Description of problem:
  If the WA Server (Tendrl server) is temporarily unavailable (because of a
  network issue, maintenance reboot, ...), the tendrl-* services on the Storage
  nodes crash and do not recover once the WA server is up and available again.


Version-Release number of selected component (if applicable):
  RHGS WA Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  grafana-4.3.2-3.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  tendrl-ansible-1.6.3-7.el7rhgs.noarch
  tendrl-api-1.6.3-5.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-11.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-11.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-11.el7rhgs.noarch

  Gluster Storage Server:
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-17.el7rhgs.x86_64
  glusterfs-api-3.12.2-17.el7rhgs.x86_64
  glusterfs-cli-3.12.2-17.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-17.el7rhgs.x86_64
  glusterfs-events-3.12.2-17.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-17.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-17.el7rhgs.x86_64
  glusterfs-libs-3.12.2-17.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-17.el7rhgs.x86_64
  glusterfs-server-3.12.2-17.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
  python2-gluster-3.12.2-17.el7rhgs.x86_64
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch


How reproducible:
  100%


Steps to Reproduce:
1. Prepare and install Gluster Storage Cluster and RHGS Web Administration.
2. Import Gluster Storage cluster into WA.
3. Power off Web Administration Server.
4. Check logs and tendrl-* services on Storage Servers.


Actual results:
  After a few minutes, both tendrl-node-agent and tendrl-gluster-integration on
  all Gluster Storage Servers are down (crashed):

  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # systemctl status -l tendrl-node-agent tendrl-gluster-integration
  ● tendrl-node-agent.service - A python agent local to every managed storage node in the sds cluster
     Loaded: loaded (/usr/lib/systemd/system/tendrl-node-agent.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Mon 2018-08-27 10:44:55 CEST; 28min ago
       Docs: https://github.com/Tendrl/node-agent/tree/master/doc/source
    Process: 21776 ExecStart=/usr/bin/tendrl-node-agent (code=exited, status=0/SUCCESS)
   Main PID: 21776 (code=exited, status=0/SUCCESS)

  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: Traceback (most recent call last):
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.run()
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 765, in run
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.__target(*self.__args, **self.__kwargs)
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/central_store/utils.py", line 95, in watch
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: for change in NS._int.client.eternal_watch(key):
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 795, in eternal_watch
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: local_index = response.modifiedIndex + 1
  Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: AttributeError: 'NoneType' object has no attribute 'modifiedIndex'

  ● tendrl-gluster-integration.service - Tendrl Gluster Daemon to Manage gluster tasks
     Loaded: loaded (/usr/lib/systemd/system/tendrl-gluster-integration.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2018-08-27 10:44:55 CEST; 28min ago
    Process: 7693 ExecStart=/usr/bin/tendrl-gluster-integration (code=exited, status=1/FAILURE)
   Main PID: 7693 (code=exited, status=1/FAILURE)

  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup NodeContext for namespace.tendrl
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup TendrlContext for namespace.tendrl
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
  Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
  Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service: main process exited, code=exited, status=1/FAILURE
  Aug 27 10:44:55 gl1.example.com systemd[1]: Stopped Tendrl Gluster Daemon to Manage gluster tasks.
  Aug 27 10:44:55 gl1.example.com systemd[1]: Unit tendrl-gluster-integration.service entered failed state.
  Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service failed.
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
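
  The AttributeError above comes from python-etcd's eternal_watch() loop, which
  assumes every watch() call returns a response object. The following is a
  simplified sketch of that code path (paraphrased for illustration from the
  traceback, not copied from the installed library); it shows why an
  unreachable etcd surfaces as an unhandled crash of the watcher thread rather
  than a reported error:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Paraphrased shape of etcd/client.py eternal_watch(), per the traceback.
def eternal_watch(client, key, index=None):
    local_index = index
    while True:
        # With the WA server (etcd) down, the watch can come back without a
        # usable result, i.e. `response` may be None.
        response = client.watch(key, index=local_index, timeout=0)
        # Unguarded attribute access: a None response raises
        # AttributeError: 'NoneType' object has no attribute 'modifiedIndex'
        # and the watcher thread (and with it the tendrl service) dies.
        local_index = response.modifiedIndex + 1
        yield response
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~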


Expected results:
  The tendrl-* services on the Gluster Storage nodes should not crash if the
  WA Server is temporarily unavailable, and they should recover once the WA
  server is back up and running again.


Additional info:
  Once the WA server is back up and running, the imported Gluster Storage
  Cluster appears in the Web UI to be in a Healthy state and all hosts are
  "Up", but no Volumes are visible.

Comment 4 Daniel Horák 2018-08-27 13:39:21 UTC
From my (QE) point of view, it is not a blocker, because (re)starting the tendrl-node-agent and tendrl-gluster-integration services works as expected.

But it might be worth documenting it as a known issue/troubleshooting scenario, because at first glance it might not be clear where the problem is: all the hosts are in the "Up" state and only the information about volumes disappears.

Comment 5 Shubhendu Tripathi 2018-08-28 02:51:34 UTC
@sankarshan I agree with Daniel: after a restart of the WA server node, if we start the tendrl-node-agent and tendrl-gluster-integration services on the storage nodes, they work as expected. So yes, I feel it is not a blocker.

Comment 8 Shubhendu Tripathi 2018-11-19 06:03:01 UTC
We need to handle exceptions while connecting to etcd from the tendrl components: on a connection exception the service should not crash, but rather report the error and keep retrying.
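
A minimal sketch of that approach, assuming the python-etcd client seen in the traceback (the wrapper name, exception selection, and retry interval here are illustrative, not the actual tendrl-commons change):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import logging
import time

import etcd  # python-etcd, as used by tendrl via NS._int.client


def watch_with_retry(client, key, retry_interval=10):
    """Watch `key` indefinitely; report etcd outages and retry instead of crashing."""
    while True:
        try:
            for change in client.eternal_watch(key):
                yield change
        except (etcd.EtcdException, AttributeError) as exc:
            # etcd (the central store on the WA server) is unreachable or the
            # watch returned an unusable response; log the error and retry
            # after a pause instead of letting the watcher thread die.
            logging.warning("etcd watch on %s failed (%s); retrying in %s seconds",
                            key, exc, retry_interval)
            time.sleep(retry_interval)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the failing code path above, something like this would wrap the NS._int.client.eternal_watch(key) call in tendrl/commons/utils/central_store/utils.py.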

Changing the severity to low as a restart of the services fixes things.

