Description of problem:
When the node-agent service starts, the socket file "message.sock" is created under /var/run/tendrl. The node-agent message socket server uses this socket file to listen for and accept connections. Sometimes, when node-agent is restarted, the socket file is not created and the node-agent message handler thread fails. I have seen this scenario a few times on a fresh machine after installation with tendrl-ansible. All other components on the node then keep raising the following exception:

Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: Traceback (most recent call last):
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: File "/usr/lib/python2.7/site-packages/tendrl/commons/event.py", line 38, in _write
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: self.sock.connect(self.socket_path)
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: return getattr(self._sock,name)(*args)
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: error: [Errno 2] No such file or directory
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: Unable to pass the message into socket.{"integration_id": null, "publisher": "monitoring_integration", "job_id": null, "timestamp": "2018-09-16T03:42:02.051258+00:00", "caller": {"function": "load_definition", "line_no": 50, "filename": "/usr/lib/python2.7/site-packages/tendrl/commons/objects/__init__.py"}, "payload": {"message": "Load definitions (.yml) for namespace.tendrl.objects.Job"}, "priority": "debug", "parent_id": null, "node_id": "a6ed6301-260d-4a4d-8a35-62f980122ee1", "flow_id": null, "message_id": "a178660d-7333-43d0-a7ef-aace791fbf30"}

Version-Release number of selected component (if applicable):

How reproducible:
I don't have a clear reproducer for this problem.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
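For readers who want to see the failure in isolation: the "[Errno 2] No such file or directory" in the traceback is what connect() on a Unix domain socket reports when the socket path does not exist. The following is a minimal Python 3 sketch of that behaviour, not the tendrl code itself. The helper name try_send is hypothetical, and the use of a SOCK_STREAM socket is an assumption, since the report does not state the socket type.

```python
import errno
import os
import socket
import tempfile


def try_send(socket_path, payload):
    """Send payload (bytes) to a Unix socket.

    Returns None on success, or the errno of the failure.
    Hypothetical helper for illustration; SOCK_STREAM is an assumption.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_path)
        sock.sendall(payload)
        return None
    except OSError as exc:  # socket.error is an alias of OSError in Python 3
        return exc.errno
    finally:
        sock.close()


if __name__ == "__main__":
    # A path inside a fresh temporary directory: the directory exists,
    # but nothing has bound a socket there, as in the reported scenario.
    missing = os.path.join(tempfile.mkdtemp(), "message.sock")
    result = try_send(missing, b"{}")
    print("ENOENT" if result == errno.ENOENT else result)
```

A caller that checks the returned errno can distinguish "socket file missing" (ENOENT) from "file present but nobody listening" (ECONNREFUSED), which helps when reading logs like the one above.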
In the current implementation, the message socket file is "/var/run/tendrl/message.sock". This file, together with the folder "tendrl", is created when node-agent starts. When node-agent is stopped, both the folder "tendrl" and the file "message.sock" are deleted.

Sometimes the folder "tendrl" is created but the message.sock file is not. This happens when a temporary network issue occurs while node-agent is trying to connect to etcd.

I have reproduced this scenario in another way. I stopped the etcd service, so the node-agent service went down after a few retries. Then I started the node-agent service (start, not restart) with "service tendrl-node-agent start", which started node-agent again. After that I started the other tendrl services and saw the traceback above being raised continuously. I checked the message socket directory "/var/run/tendrl/", and the message.sock file was missing there.

After the crash, "service tendrl-node-agent start" starts only the node-agent service; it does not start tendrl-node-agent.socket. So the socket file is not created.

I have seen the same problem in the log file from the customer machine: after the node-agent restart log message, all the message socket errors appear, and the log file also contains messages about a temporary etcd connection issue.
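The state described above (folder "tendrl" present, message.sock absent) can be told apart from the other possible states with a small check. This is a Python 3 sketch for diagnosis only; the function name message_socket_state and the returned strings are hypothetical and not part of tendrl.

```python
import os
import stat


def message_socket_state(socket_path):
    """Classify the on-disk state of a Unix socket path.

    Hypothetical diagnostic helper; the returned strings are illustrative.
    """
    parent = os.path.dirname(socket_path)
    if not os.path.isdir(parent):
        return "directory missing"
    try:
        mode = os.stat(socket_path).st_mode
    except FileNotFoundError:
        # The state reported in this bug: directory created, socket not.
        return "directory present, socket missing"
    if stat.S_ISSOCK(mode):
        return "socket present"
    return "path exists but is not a socket"


if __name__ == "__main__":
    print(message_socket_state("/var/run/tendrl/message.sock"))
```

Note that "socket present" only says the file exists; it does not prove that a process is still listening on it.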
PR is under review: https://github.com/Tendrl/node-agent/pull/851
Steps to reproduce:
1. Check that the file /var/run/tendrl/message.sock exists
2. Stop the etcd service
3. After a few minutes, node-agent will go down
4. Start the etcd service
5. Then use the command: service node-agent start (don't use restart)
6. Check the directory /var/run/tendrl/ (the message.sock file won't be created)
7. Start the other tendrl services and check the log file

The reason is that the tendrl-node-agent service is supposed to start node-agent.sock as well, but in this case node-agent.sock is not called.

The same thing happened on the customer machine, with a small difference: there, node-agent started again by itself during a temporary etcd connection issue. I can't reproduce that exact scenario, but mine is similar to it.
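Step 6 follows from how Unix domain sockets work: the socket file appears on disk only when some process binds the path, so if the unit that owns the listening socket is not started, nothing creates the file. The Python 3 sketch below illustrates that general behaviour with a temporary path. It is not the fix from the PR; bind_message_socket is a hypothetical name, and SOCK_STREAM is an assumption.

```python
import os
import socket
import tempfile


def bind_message_socket(socket_path):
    """Bind and listen on a Unix socket path, creating the file on disk.

    Hypothetical helper for illustration, not the tendrl implementation.
    """
    parent = os.path.dirname(socket_path)
    os.makedirs(parent, exist_ok=True)
    if os.path.exists(socket_path):
        # A stale file from a previous run would make bind() fail.
        os.unlink(socket_path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(socket_path)  # this call is what creates the socket file
    server.listen(5)
    return server


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "tendrl", "message.sock")
    os.makedirs(os.path.dirname(demo_path))
    print("before bind:", os.path.exists(demo_path))
    server = bind_message_socket(demo_path)
    print("after bind:", os.path.exists(demo_path))
    server.close()
    os.unlink(demo_path)
```

Closing the listening socket does not remove the file, which is why the sketch unlinks it explicitly at the end.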
QE team will retest this based on reproducer from comment 4.
I was able to reproduce this issue with the reproducer from comment 4 on the older version: /var/run/tendrl/message.sock was not created, and tendrl-monitoring-integration was reporting errors related to that.

With the current version, the message.sock file is created correctly and tendrl-monitoring-integration works without errors related to this file (but there are now tracebacks described in BZ 1647393 and BZ 1647386).

--> VERIFIED

Older version:
tendrl-monitoring-integration-1.6.3-14.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-7.el7rhgs.noarch
tendrl-api-httpd-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-14.el7rhgs.noarch
tendrl-ansible-1.6.3-8.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch

Current version:
tendrl-monitoring-integration-1.6.3-15.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-8.el7rhgs.noarch
tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-15.el7rhgs.noarch
tendrl-ansible-1.6.3-9.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-11.el7rhgs.noarch
tendrl-ui-1.6.3-12.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:3829