When the central store (etcd) stops, either because the etcd service itself is stopped or because the Web Administration server node is shut down, all Web Administration services start reporting exceptions about etcd reachability. As a consequence, the Web Administration services crash because etcd is not reachable.
Workaround: Once etcd is back, restart the Web Administration services.
Description of problem:
If the WA Server (Tendrl server) is temporarily unavailable (because of a network
issue, maintenance reboot...), tendrl-* services on Storage nodes crash and
do not recover once the WA server is up and available again.
Version-Release number of selected component (if applicable):
RHGS WA Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-11.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
Gluster Storage Server:
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-17.el7rhgs.x86_64
glusterfs-api-3.12.2-17.el7rhgs.x86_64
glusterfs-cli-3.12.2-17.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-17.el7rhgs.x86_64
glusterfs-events-3.12.2-17.el7rhgs.x86_64
glusterfs-fuse-3.12.2-17.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-17.el7rhgs.x86_64
glusterfs-libs-3.12.2-17.el7rhgs.x86_64
glusterfs-rdma-3.12.2-17.el7rhgs.x86_64
glusterfs-server-3.12.2-17.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.2-17.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
How reproducible:
100%
Steps to Reproduce:
1. Prepare and install Gluster Storage Cluster and RHGS Web Administration.
2. Import Gluster Storage cluster into WA.
3. Power off Web Administration Server.
4. Check logs and tendrl-* services on Storage Servers.
Actual results:
After a few minutes, both tendrl-node-agent and tendrl-gluster-integration on
all Gluster Storage Servers are down (crashed):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# systemctl status -l tendrl-node-agent tendrl-gluster-integration
● tendrl-node-agent.service - A python agent local to every managed storage node in the sds cluster
Loaded: loaded (/usr/lib/systemd/system/tendrl-node-agent.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Mon 2018-08-27 10:44:55 CEST; 28min ago
Docs: https://github.com/Tendrl/node-agent/tree/master/doc/source
Process: 21776 ExecStart=/usr/bin/tendrl-node-agent (code=exited, status=0/SUCCESS)
Main PID: 21776 (code=exited, status=0/SUCCESS)
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: Traceback (most recent call last):
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.run()
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: self.__target(*self.__args, **self.__kwargs)
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/tendrl/commons/utils/central_store/utils.py", line 95, in watch
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: for change in NS._int.client.eternal_watch(key):
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: File "/usr/lib/python2.7/site-packages/etcd/client.py", line 795, in eternal_watch
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: local_index = response.modifiedIndex + 1
Aug 27 10:44:54 gl1.example.com tendrl-node-agent[21776]: AttributeError: 'NoneType' object has no attribute 'modifiedIndex'
● tendrl-gluster-integration.service - Tendrl Gluster Daemon to Manage gluster tasks
Loaded: loaded (/usr/lib/systemd/system/tendrl-gluster-integration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2018-08-27 10:44:55 CEST; 28min ago
Process: 7693 ExecStart=/usr/bin/tendrl-gluster-integration (code=exited, status=1/FAILURE)
Main PID: 7693 (code=exited, status=1/FAILURE)
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup NodeContext for namespace.tendrl
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.NodeContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Setup TendrlContext for namespace.tendrl
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Aug 27 10:44:55 gl1.example.com tendrl-gluster-integration[7693]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service: main process exited, code=exited, status=1/FAILURE
Aug 27 10:44:55 gl1.example.com systemd[1]: Stopped Tendrl Gluster Daemon to Manage gluster tasks.
Aug 27 10:44:55 gl1.example.com systemd[1]: Unit tendrl-gluster-integration.service entered failed state.
Aug 27 10:44:55 gl1.example.com systemd[1]: tendrl-gluster-integration.service failed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expected results:
Tendrl-* services on Gluster Storage nodes shouldn't crash if the WA Server is
temporarily unavailable, and they should be able to recover once the WA server
is back up and running again.
Additional info:
Once the WA server is back up and running, the imported Gluster Storage Cluster
appears in the Web UI to be in a Healthy state and all hosts are "UP", but no
Volumes are visible.
From my (QE) point of view, it is not a blocker, because (re)starting the tendrl-node-agent and tendrl-gluster-integration services works as expected.
But it might be worth documenting it as a known issue/troubleshooting scenario, because at first glance it might not be clear where the problem is: all the hosts are in the "Up" state and only the information about volumes disappears.
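As a troubleshooting aid for the scenario above, the following is a minimal Python sketch that checks the two services on a storage node and restarts whichever is not active. It is not part of Tendrl; the names `SERVICES`, `is_active`, and `restart_inactive` are hypothetical, and it assumes it runs as root on a systemd host. It relies only on `systemctl is-active --quiet`, which exits with 0 when the unit is active.

```python
import subprocess

# Services observed down in this report (hypothetical constant name).
SERVICES = ("tendrl-node-agent", "tendrl-gluster-integration")


def is_active(service, run=subprocess.call):
    """Return True if systemd reports the unit as active."""
    # `systemctl is-active --quiet` exits with 0 only for an active unit.
    return run(["systemctl", "is-active", "--quiet", service]) == 0


def restart_inactive(services=SERVICES, run=subprocess.call):
    """Restart every unit that is not active; return the restarted names."""
    restarted = []
    for service in services:
        if not is_active(service, run):
            run(["systemctl", "restart", service])
            restarted.append(service)
    return restarted
```

Calling `restart_inactive()` once the WA server is reachable again mirrors the manual restart described above. The `run` parameter exists only so the command runner can be replaced, for example in a dry run.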
Comment 5 Shubhendu Tripathi
2018-08-28 02:51:34 UTC
@sankarshan I agree with Daniel: after the WA server node is restarted, starting the tendrl-node-agent and tendrl-gluster-integration services on the storage nodes makes everything work as expected. So yes, I feel it's not a blocker.
Comment 8 Shubhendu Tripathi
2018-11-19 06:03:01 UTC
We need to handle exceptions raised while connecting to etcd from the tendrl components. On a connection exception, a component should not crash; rather, it should report the error and keep retrying.
Changing the severity to low, as restarting the services fixes things.
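The "report the error and continue trying" behaviour suggested in this comment could look roughly like the sketch below. It is an illustration only, not the actual Tendrl fix: `watch_with_retry`, `watch_once`, and `on_change` are hypothetical names, and the etcd client is deliberately left out so the example stays self-contained. In Tendrl, `watch_once` would presumably wrap the `eternal_watch(key)` call seen in the traceback, and `retry_on` would list the connection-related exceptions to tolerate.

```python
import logging
import time

LOG = logging.getLogger(__name__)


def watch_with_retry(watch_once, on_change, retry_on=(Exception,),
                     delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Consume changes from watch_once(), retrying instead of crashing.

    watch_once: callable returning an iterable of changes (hypothetical).
    on_change:  callable invoked for every change received.
    retry_on:   exception types treated as "store unreachable".
    """
    wait = delay
    while True:
        try:
            for change in watch_once():
                on_change(change)
                wait = delay  # a successful read resets the backoff
            return  # the watch ended normally
        except retry_on as exc:
            # Report the error and keep trying rather than letting the
            # watcher thread die.
            LOG.error("central store unreachable: %s; retry in %ss",
                      exc, wait)
            sleep(wait)
            wait = min(wait * 2, max_delay)  # capped exponential backoff
```

The capped exponential backoff is a design choice of this sketch: it avoids hammering the WA server while it reboots, yet still reconnects within `max_delay` seconds once etcd is back.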