Description of problem:

If the WA server (Tendrl server) is temporarily unavailable (because of a network issue, a maintenance reboot, ...), the tendrl-* services on the storage nodes crash and do not recover once the WA server is up and available again.

Version-Release number of selected component (if applicable):

RHGS WA Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-11.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch

Gluster Storage Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-17.el7rhgs.x86_64
glusterfs-api-3.12.2-17.el7rhgs.x86_64
glusterfs-cli-3.12.2-17.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-17.el7rhgs.x86_64
glusterfs-events-3.12.2-17.el7rhgs.x86_64
glusterfs-fuse-3.12.2-17.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-17.el7rhgs.x86_64
glusterfs-libs-3.12.2-17.el7rhgs.x86_64
glusterfs-rdma-3.12.2-17.el7rhgs.x86_64
glusterfs-server-3.12.2-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.2-17.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare and install a Gluster Storage cluster and RHGS Web Administration.
2. Import the Gluster Storage cluster into WA.
3. Power off the Web Administration server.
4. Check the logs and the tendrl-* services on the storage servers.
Actual results:

After a few minutes, both tendrl-node-agent and tendrl-gluster-integration on all Gluster Storage servers are down (crashed):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# systemctl status -l tendrl-node-agent tendrl-gluster-integration
● tendrl-node-agent.service - A python agent local to every managed storage node in the sds cluster
   Loaded: loaded (/usr/lib/systemd/system/tendrl-node-agent.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2018-08-27 10:44:55 CEST; 28min ago
     Docs: https://github.com/Tendrl/node-agent/tree/master/doc/source
  Process: 21776 ExecStart=/usr/bin/tendrl-node-agent (code=exited, status=0/SUCCESS)
 Main PID: 21776 (code=exited, status=0/SUCCESS)

Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: Traceback (most recent call last):
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:   File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:     self.run()
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:   File "/usr/lib64/python2.7/threading.py", line 765, in run
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:     self.__target(*self.__args, **self.__kwargs)
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:   File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/central_store/utils.py", line 95, in watch
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:     for change in NS._int.client.eternal_watch(key):
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:   File "/usr/lib/python2.7/site-packages/etcd/client.py", line 795, in eternal_watch
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]:     local_index = response.modifiedIndex + 1
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: AttributeError: 'NoneType' object has no attribute 'modifiedIndex'

● tendrl-gluster-integration.service - Tendrl Gluster Daemon to Manage gluster tasks
   Loaded: loaded (/usr/lib/systemd/system/tendrl-gluster-integration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-08-27 10:44:55 CEST; 28min ago
  Process: 7693 ExecStart=/usr/bin/tendrl-gluster-integration (code=exited, status=1/FAILURE)
 Main PID: 7693 (code=exited, status=1/FAILURE)

Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup NodeContext for namespace.tendrl
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup TendrlContext for namespace.tendrl
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service: main process exited, code=exited, status=1/FAILURE
Aug 27 10:44:55 gl1.example.com systemd[1]: Stopped Tendrl Gluster Daemon to Manage gluster tasks.
Aug 27 10:44:55 gl1.example.com systemd[1]: Unit tendrl-gluster-integration.service entered failed state.
Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service failed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expected results:

The tendrl-* services on the Gluster Storage nodes should not crash if the WA server is temporarily unavailable, and they should recover once the WA server is back up and running again.

Additional info:

Once the WA server is back up and running, the imported Gluster Storage cluster appears to be in a Healthy state in the web UI and all hosts are "Up", but no volumes are visible.
From my (QE) point of view, this is not a blocker, because (re)starting the tendrl-node-agent and tendrl-gluster-integration services works as expected. However, it might be worth documenting as a known issue/troubleshooting scenario, because at first glance it may not be obvious where the problem is: all hosts are shown in the "Up" state and only the volume information disappears.
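For troubleshooting, one quick way to confirm the underlying cause from a storage node would be to check whether the central store (etcd on the WA server) is reachable at all. A minimal sketch using the python-etcd client that appears in the traceback above; the hostname and port are placeholders, not values from this setup:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hypothetical connectivity check, run on a storage node; replace the host
# and port with the actual WA server address and etcd port.
import etcd

client = etcd.Client(host='wa-server.example.com', port=2379)
try:
    client.read('/')  # any read is enough to prove the central store answers
    print("central store on the WA server is reachable")
except etcd.EtcdConnectionFailed as exc:
    print("central store on the WA server is unreachable: %s" % exc)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~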
@sankarshan I agree with Daniel: after a restart of the WA server node, starting the tendrl-node-agent and tendrl-gluster-integration services on the storage nodes makes everything work as expected again. So yes, I don't consider it a blocker either.
We need to handle the exception raised while connecting to etcd from the Tendrl components: on a connection error they should not crash, but rather report the error and keep retrying. Changing the severity to low, as restarting the services fixes things.
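For illustration only, a minimal sketch of the kind of guard meant here, written against the python-etcd client from the traceback above. The helper name and retry interval are made up for this example and are not taken from the Tendrl code; the AttributeError is caught because, as the traceback shows, eternal_watch() dereferences a None response while the central store is down:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Illustrative sketch, not the actual Tendrl patch.  Keeps a central-store
# watch alive across WA server outages instead of letting the watch thread die.
import logging
import time

import etcd

log = logging.getLogger(__name__)


def resilient_watch(client, key, retry_delay=10):
    """Yield changes for `key`, retrying whenever the etcd server is unreachable."""
    while True:
        try:
            for change in client.eternal_watch(key):
                yield change
        except (etcd.EtcdException, AttributeError) as exc:
            # EtcdConnectionFailed (a subclass of EtcdException) is raised when
            # etcd cannot be reached; the AttributeError covers the None response
            # dereferenced inside eternal_watch() during the outage.
            log.error("Lost connection to the central store while watching %s: %s; "
                      "retrying in %s seconds", key, exc, retry_delay)
            time.sleep(retry_delay)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~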