Bug 1630344

Summary: Sometimes node-agent message socket file "message.sock" is missing
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: gowtham <gshanmug>
Component: web-admin-tendrl-node-agent Assignee: gowtham <gshanmug>
Status: CLOSED ERRATA QA Contact: Filip Balák <fbalak>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: fbalak, mbukatov, nthomas, rhs-bugs, sankarshan
Target Milestone: --- Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-node-agent-1.6.3-11.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-17 17:06:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description gowtham 2018-09-18 12:37:09 UTC
Description of problem:

When the node-agent service starts, the socket file "message.sock" is created under /var/run/tendrl. The node-agent message socket server listens on this socket file and accepts connections through it. Sometimes, when node-agent is restarted, the socket file is not created and the node-agent message handler thread fails.

I have seen this scenario a few times on a fresh machine after an installation done with tendrl-ansible.

All other components on the node keep raising the following exception:

Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: Traceback (most recent call last):
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: File "/usr/lib/python2.7/site-packages/tendrl/commons/event.py", line 38, in _write
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: self.sock.connect(self.socket_path)
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: File "/usr/lib64/python2.7/socket.py", line 224, in meth
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: return getattr(self._sock,name)(*args)
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: error: [Errno 2] No such file or directory
Sep 16 03:42:02 GlusterWebAdmin tendrl-monitoring-integration: Unable to pass the message into socket.{"integration_id": null, "publisher": "monitoring_integration", "job_id": null, "timestamp": "2018-09-16T03:42:02.051258+00:00", "caller": {"function": "load_definition", "line_no": 50, "filename": "/usr/lib/python2.7/site-packages/tendrl/commons/objects/__init__.py"}, "payload": {"message": "Load definitions (.yml) for namespace.tendrl.objects.Job"}, "priority": "debug", "parent_id": null, "node_id": "a6ed6301-260d-4a4d-8a35-62f980122ee1", "flow_id": null, "message_id": "a178660d-7333-43d0-a7ef-aace791fbf30"}
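
For illustration only, the failure in the traceback can be reproduced outside of tendrl with a minimal sketch. It assumes nothing about the real code in tendrl/commons/event.py; the helper name try_send and the temporary path are hypothetical. It only shows that connecting to a Unix domain socket path that does not exist fails with errno 2 (ENOENT), the same error reported above.

```python
import errno
import os
import socket
import tempfile


def try_send(socket_path, payload):
    """Hypothetical helper: send payload over a Unix domain socket.

    Returns None on success, or the errno of the socket error on failure.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_path)
        sock.sendall(payload)
        return None
    except socket.error as exc:  # alias of OSError on Python 3
        return exc.errno
    finally:
        sock.close()


# A path inside a fresh temporary directory stands in for a missing
# /var/run/tendrl/message.sock (the directory exists, the file does not).
missing_path = os.path.join(tempfile.mkdtemp(), "message.sock")
result = try_send(missing_path, b"{}")
print(result == errno.ENOENT)
```

A client written this way can log or retry on ENOENT instead of raising on every message, but whether tendrl should do that is a separate design question from the missing socket file itself.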

Version-Release number of selected component (if applicable):


How reproducible:
I don't have a clear reproducer for this problem.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 gowtham 2018-09-20 08:54:51 UTC
In the current implementation, the message socket file is located at "/var/run/tendrl/message.sock". The file and its folder "tendrl" are created when node-agent starts. When node-agent is stopped, both the folder "tendrl" and the file "message.sock" are deleted. Sometimes the folder "tendrl" is created but the file "message.sock" is not.

This issue occurs when there is a temporary network issue while node-agent is trying to connect to etcd.

I have reproduced this scenario in another way:
    I stopped the etcd service, so the node-agent service went down after a few retries. Then I started the node-agent service (start, not restart):
   service tendrl-node-agent start

This starts node-agent again. Then I also started the other tendrl services and saw that node-agent continuously raised the traceback above. I checked the message socket directory "/var/run/tendrl/" and the message.sock file was missing there.


After the crash, "service tendrl-node-agent start" starts only the node-agent service; it does not start tendrl-node-agent.socket. So the socket file is not created.



I have seen the same problem in the log file from a customer machine: after the node-agent restart log message, I saw all the message socket errors, and the log file also contained messages about a temporary etcd connection issue.

Comment 3 gowtham 2018-09-20 09:20:12 UTC
PR is under review: https://github.com/Tendrl/node-agent/pull/851

Comment 4 gowtham 2018-09-20 09:33:07 UTC
Steps to reproduce:
  1. Check that the file /var/run/tendrl/message.sock exists
  2. Stop the etcd service
  3. After a few minutes, node-agent will go down
  4. Start the etcd service
  5. Then use the command: service tendrl-node-agent start (don't use restart)
  6. Check the directory /var/run/tendrl/ (the message.sock file won't be created)
  7. Start the other tendrl services and check the log file
  

The reason: normally the tendrl-node-agent service also starts node-agent.sock, but in this case node-agent.sock is not started.


The same thing happened on the customer machine, with a small difference: there node-agent restarted itself during a temporary etcd connection issue. I can't reproduce that exact scenario, but mine is similar to it.
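
The checks in steps 1 and 6 of the reproducer can be scripted. The following is only a sketch, not part of tendrl: the function name socket_file_state and its return strings are hypothetical, and the path is the one given in this report. It distinguishes the state described in comment 2 (folder "tendrl" present, "message.sock" absent) from a healthy socket file.

```python
import os
import stat


def socket_file_state(path):
    """Hypothetical check: classify the state of a Unix socket file.

    Returns one of 'ok', 'missing-file', 'missing-dir', 'not-a-socket'.
    """
    directory = os.path.dirname(path) or "."
    if not os.path.isdir(directory):
        return "missing-dir"
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # The directory exists but the socket file does not.
        return "missing-file"
    return "ok" if stat.S_ISSOCK(mode) else "not-a-socket"


# Path taken from this bug report.
print(socket_file_state("/var/run/tendrl/message.sock"))
```

On an affected node after step 5, this check would be expected to report 'missing-file', since the folder exists and only the socket file is absent.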

Comment 5 Martin Bukatovic 2018-10-30 10:34:16 UTC
QE team will retest this based on reproducer from comment 4.

Comment 8 Filip Balák 2018-11-07 11:54:20 UTC
I was able to reproduce this issue with the reproducer from comment 4 on the older version: /var/run/tendrl/message.sock was not created and tendrl-monitoring-integration reported errors related to that. With the current version, the message.sock file is created correctly and tendrl-monitoring-integration works without errors related to this file (but there are now tracebacks described in BZ 1647393 and BZ 1647386). --> VERIFIED

Older version:
tendrl-monitoring-integration-1.6.3-14.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-7.el7rhgs.noarch
tendrl-api-httpd-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-14.el7rhgs.noarch
tendrl-ansible-1.6.3-8.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch

Current version:
tendrl-monitoring-integration-1.6.3-15.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.3-8.el7rhgs.noarch
tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-15.el7rhgs.noarch
tendrl-ansible-1.6.3-9.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-11.el7rhgs.noarch
tendrl-ui-1.6.3-12.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch

Comment 9 errata-xmlrpc 2018-12-17 17:06:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3829