Bug 1513931

Summary: collectd: SyntaxError: unexpected EOF while parsing
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Lubos Trilety <ltrilety>
Component: web-admin-tendrl-node-agent
Assignee: Nishanth Thomas <nthomas>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: dahorak, ltrilety, mbukatov, mkudlej, rhs-bugs, rkanade, sanandpa, sankarshan, ssaha
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tendrl-node-agent-1.5.4-9.el7rhgs.noarch
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-18 04:39:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Lubos Trilety 2017-11-16 10:05:44 UTC
Description of problem:
Probably after a bigger data load (2 bricks full), the following error appears in the collectd log:
Nov 15 22:02:27 <hostname> collectd[10477]: Exception in thread Thread-10989:
Nov 15 22:02:27 <hostname> collectd[10477]: Traceback (most recent call last):
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 15 22:02:27 <hostname> collectd[10477]: self.run()
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 15 22:02:27 <hostname> collectd[10477]: self.__target(*self.__args, **self.__kwargs)
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_det
Nov 15 22:02:27 <hostname> collectd[10477]: self.get_brick_devices(brick_path)
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 94, in get_brick_devices
Nov 15 22:02:27 <hostname> collectd[10477]: return self.fetch_brick_devices(brick_path)
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 70, in fetch_brick_device
Nov 15 22:02:27 <hostname> collectd[10477]: brick_path.replace('/', '_').replace("_", "", 1)
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/python2.7/ast.py", line 49, in literal_eval
Nov 15 22:02:27 <hostname> collectd[10477]: node_or_string = parse(node_or_string, mode='eval')
Nov 15 22:02:27 <hostname> collectd[10477]: File "/usr/lib64/python2.7/ast.py", line 37, in parse
Nov 15 22:02:27 <hostname> collectd[10477]: return compile(source, filename, mode, PyCF_ONLY_AST)
Nov 15 22:02:27 <hostname> collectd[10477]: File "<unknown>", line 0
Nov 15 22:02:27 <hostname> collectd[10477]: ^
Nov 15 22:02:27 <hostname> collectd[10477]: SyntaxError: unexpected EOF while parsing
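
The traceback bottoms out in ast.literal_eval(), which means the string handed to it could not be parsed as a Python literal. A minimal sketch of the failure mode, assuming the value read from etcd came back empty or truncated (the actual input in this deployment is unknown, and parse_devices is a hypothetical stand-in for what fetch_brick_devices does with the value):

```python
import ast

def parse_devices(raw):
    """Parse a device list stored as a Python literal string.

    Hypothetical stand-in for the literal_eval call inside
    fetch_brick_devices; the real code reads `raw` from etcd.
    """
    return ast.literal_eval(raw)

parse_devices("['/dev/vdb1']")  # a well-formed value parses fine

try:
    parse_devices("")  # an empty/truncated value cannot be parsed
except SyntaxError:
    print("empty value raises SyntaxError, matching the log above")
```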

BTW in my case the error repeats regularly several times (on all gluster machines) and then stops appearing, and collectd seems to keep working.


Version-Release number of selected component (if applicable):
tendrl-collectd-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.4-2.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-2.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-2.el7rhgs.noarch
collectd-ping-5.7.0-4.el7rhgs.x86_64
collectd-5.7.0-4.el7rhgs.x86_64
libcollectdclient-5.7.0-4.el7rhgs.x86_64


How reproducible:

Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Lubos Trilety 2017-11-16 10:30:37 UTC
I was able to reproduce the issue. I used dd in the directory where the gluster volume is mounted. I created a file so big that the dd output says:
dd: error writing ‘file’: No space left on device
dd: closing output file ‘file’: Read-only file system

After some time I found the collectd error on gluster nodes.

Comment 2 Rohan Kanade 2017-11-16 17:44:10 UTC
Lubos,

Did you run the dd command on the partition or the device directly? It seems that the issue is caused because no partition details are gathered after you run the dd command.

Comment 3 Rohan Kanade 2017-11-17 08:01:56 UTC
Adding to my comment (https://bugzilla.redhat.com/show_bug.cgi?id=1513931#c2)

In case of bad partition tables on brick devices (which can be caused by overwriting the partition tables via the "dd" command, as suspected in this issue): as of today, Tendrl does not take responsibility for discovering disks with bad partition tables or alerting users about this problem.

Comment 5 Lubos Trilety 2017-11-20 06:51:15 UTC
(In reply to Rohan Kanade from comment #2)
> Lubos,
> 
> Did you run the dd command on the partition or the device directly? It seems
> that the issue is caused because no partition details are gathered after
> you run the dd command.

I ran the dd command on the mounted volume, not directly on the device. It should not touch the partition table, and it did not, as the volume is still running after the dd command.

Comment 6 Lubos Trilety 2017-11-20 06:56:07 UTC
What's true here is that the file is not closed correctly, so the unexpected EOF makes sense. The file looks correct in ls, but it isn't; if nothing else, its size is wrong, as the whole file was not written to the volume. That said, closing as NOTABUG.

Comment 8 Martin Bukatovic 2017-11-21 08:35:49 UTC
(In reply to Lubos Trilety from comment #6)
> What's true here is that the file is not closed correctly, so the unexpected
> EOF makes sense. The file looks correct in ls, but it isn't; if nothing else,
> its size is wrong, as the whole file was not written to the volume. That
> said, closing as NOTABUG.

I disagree with the conclusion that this is not a bug. How is an admin who reviews
the logs supposed to figure out what's wrong?

Comment 9 Daniel Horák 2017-11-29 14:59:01 UTC
I saw the same Traceback on multiple nodes of two freshly installed and imported clusters today, so reopening this Bug.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nov 29 06:56:21 localhost collectd: Exception in thread Thread-15:
Nov 29 06:56:21 localhost collectd: Traceback (most recent call last):
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
Nov 29 06:56:21 localhost collectd: self.run()
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/python2.7/threading.py", line 765, in run
Nov 29 06:56:21 localhost collectd: self.__target(*self.__args, **self.__kwargs)
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 181, in populate_disk_details
Nov 29 06:56:21 localhost collectd: self.get_brick_devices(brick_path)
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 94, in get_brick_devices
Nov 29 06:56:21 localhost collectd: return self.fetch_brick_devices(brick_path)
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 70, in fetch_brick_devices
Nov 29 06:56:21 localhost collectd: brick_path.replace('/', '_').replace("_", "", 1)
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/python2.7/ast.py", line 49, in literal_eval
Nov 29 06:56:21 localhost collectd: node_or_string = parse(node_or_string, mode='eval')
Nov 29 06:56:21 localhost collectd: File "/usr/lib64/python2.7/ast.py", line 37, in parse
Nov 29 06:56:21 localhost collectd: return compile(source, filename, mode, PyCF_ONLY_AST)
Nov 29 06:56:21 localhost collectd: File "<unknown>", line 0
Nov 29 06:56:21 localhost collectd: ^
Nov 29 06:56:21 localhost collectd: SyntaxError: unexpected EOF while parsing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'm still not sure about the exact reproduction scenario, but there is at least the following semantic error in the file:
  /usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py

If the except clause around line 75 is meant to handle both etcd.EtcdKeyNotFound and SyntaxError exceptions, they should be written as a tuple (in parentheses)[1].
The current implementation handles only etcd.EtcdKeyNotFound and binds the caught exception to a variable named SyntaxError.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
51     def fetch_brick_devices(self, brick_path):
52         mount_point = self.get_brick_source_and_mount(
53             brick_path
54         )
55         try:
           ....
75         except etcd.EtcdKeyNotFound, SyntaxError:
76             return [], [], mount_point
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
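
To illustrate the point: in Python 2, `except A, B:` catches only A and binds the caught instance to the name B. A self-contained sketch of both the buggy semantics and the intended tuple form, using a local stand-in class for etcd.EtcdKeyNotFound (the real etcd module is not imported here):

```python
class EtcdKeyNotFound(Exception):
    """Stand-in for etcd.EtcdKeyNotFound, for illustration only."""

# Python 2's "except etcd.EtcdKeyNotFound, SyntaxError:" is equivalent
# to this: only EtcdKeyNotFound is caught, and the caught instance
# shadows the built-in SyntaxError inside the handler.
def broken(exc_cls):
    try:
        raise exc_cls("boom")
    except EtcdKeyNotFound as SyntaxError:  # mirrors the buggy semantics
        return "caught"

# The intended behaviour needs a tuple, as [1] describes:
def fixed(exc_cls):
    try:
        raise exc_cls("boom")
    except (EtcdKeyNotFound, SyntaxError):
        return "caught"

assert broken(EtcdKeyNotFound) == "caught"
assert fixed(SyntaxError) == "caught"  # broken(SyntaxError) would re-raise
```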

My second concern about the code between the try and except clauses in the tendrl_gluster_brick_disk_stats.py file is that it fetches data from etcd and passes it directly to the literal_eval() function.
Lots of problems might happen there, related for example to the network connection to etcd, and they are hard to debug, because any such problem is masked by the SyntaxError from the literal_eval function (and moreover will be completely ignored once the semantic error described above is fixed).

[1] https://docs.python.org/2.7/tutorial/errors.html#handling-exceptions

Comment 11 Rohan Kanade 2017-11-29 15:33:12 UTC
We should be handling any network issues before syntax errors here. I'll send a fix

Comment 12 Rohan Kanade 2017-12-01 06:01:29 UTC
This issue is fixed upstream. We are now handling all the known exceptions

Comment 14 Lubos Trilety 2017-12-06 14:33:28 UTC
Tested with:
tendrl-node-agent-1.5.4-9.el7rhgs.noarch

The exception handling is corrected

cat /usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py
...
        except urllib3.exceptions.TimeoutError:
            _msg = "Timeout error while fetching brick devices"
            collectd.error(_msg)
            collectd.error(traceback.format_exc())
            return [], [], mount_point

        except SyntaxError:
            _msg = "Unable to parse brick device data for integration_id (%s), peer_name (%s), brick (%s)" % (self.CONFIG['integration_id'],
                                                                                                              self.CONFIG['peer_name'],
                                                                                                              brick_path.replace('/', '_').replace("_", "", 1))
            collectd.error(_msg)
            collectd.error(traceback.format_exc())
            return [], [], mount_point

        except etcd.EtcdKeyNotFound:
            _msg = "Unable to parse brick device data for integration_id (%s), peer_name (%s), brick (%s)" % (self.CONFIG['integration_id'],
                                                                                                              self.CONFIG['peer_name'],
                                                                                                              brick_path.replace('/', '_').replace("_", "", 1))
            collectd.error(_msg)
            collectd.error(traceback.format_exc())
            return [], [], mount_point
...
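
As a side note, the SyntaxError and etcd.EtcdKeyNotFound branches above are identical, so with the tuple form now used correctly they could be collapsed into a single handler. A sketch, not the shipped code: EtcdKeyNotFound, log_error, and the fetch callable are placeholders for the real etcd class, collectd.error logging, and the etcd read:

```python
class EtcdKeyNotFound(Exception):
    """Stand-in for etcd.EtcdKeyNotFound, for illustration only."""

def log_error(msg):
    # Placeholder for collectd.error(msg).
    print(msg)

def fetch_brick_devices(fetch, brick_path, mount_point):
    """Sketch: one handler covers both parse and cache-miss failures."""
    try:
        return fetch()
    except (SyntaxError, EtcdKeyNotFound):
        log_error(
            "Unable to parse brick device data for brick (%s)"
            % brick_path.replace('/', '_').replace("_", "", 1)
        )
        return [], [], mount_point

def failing_fetch():
    raise EtcdKeyNotFound("key missing")  # simulate a cache miss

assert fetch_brick_devices(failing_fetch, "/mnt/brick", "/mnt") == ([], [], "/mnt")
```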

Comment 15 Lubos Trilety 2017-12-07 09:24:27 UTC
I have to re-open this one again.
The exception handling is corrected, but when I checked the collectd journal I found the following errors there:

Dec 07 02:37:46 <hostname> collectd[24468]: Unable to parse brick device data for integration_id (db34a3b7-c260-4758-bd13-2bec86f0bc1a), peer_name (<hostname>), brick (mnt_brick_gama_disperse_2_2)
Dec 07 02:37:46 <hostname> collectd[24468]: Traceback (most recent call last):
Dec 07 02:37:46 <hostname> collectd[24468]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 71, in fetch_brick_devices
Dec 07 02:37:46 <hostname> collectd[24468]: brick_path.replace('/', '_').replace("_", "", 1)
Dec 07 02:37:46 <hostname> collectd[24468]: File "/usr/lib64/python2.7/ast.py", line 49, in literal_eval
Dec 07 02:37:46 <hostname> collectd[24468]: node_or_string = parse(node_or_string, mode='eval')
Dec 07 02:37:46 <hostname> collectd[24468]: File "/usr/lib64/python2.7/ast.py", line 37, in parse
Dec 07 02:37:46 <hostname> collectd[24468]: return compile(source, filename, mode, PyCF_ONLY_AST)
Dec 07 02:37:46 <hostname> collectd[24468]: File "<unknown>", line 0
Dec 07 02:37:46 <hostname> collectd[24468]: ^
Dec 07 02:37:46 <hostname> collectd[24468]: SyntaxError: unexpected EOF while parsing
Dec 07 02:37:46 <hostname> collectd[24468]: Failed to fetch device details for brick <hostname>:/mnt/brick_gama_disperse_2/2 of volume volume_gama_disperse_4_plus_2x2

The same happens for all other bricks, regardless of volume type. In this particular case I didn't perform any action on the machines, and still the errors are there.

Comment 16 Rohan Kanade 2017-12-07 09:46:30 UTC
Lubos,

The Tendrl gluster collectd plugin "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py"

tried to fetch brick devices for a certain brick, but it failed to do so and logged the error as

"Dec 07 02:37:46 <hostname> collectd[24468]: Unable to parse brick device data for integration_id (db34a3b7-c260-4758-bd13-2bec86f0bc1a), peer_name (<hostname>), brick (mnt_brick_gama_disperse_2_2)
"


Now, why the brick devices were not found depends entirely on the deployment. The Tendrl gluster collectd plugin has handled the error, logged it, and continues to run, which is what this BZ is about, right?

If you want to investigate why brick (mnt_brick_gama_disperse_2_2) devices aren't available in etcd, we can open another bug.

Comment 17 Lubos Trilety 2017-12-07 11:00:47 UTC
Rohan, that may be what was required when the issue was re-opened by Daniel, but looking at the first comment, I was just saying that there are errors in the logs. In other words, the original reason for this bug was to fix all issues which lead to these errors.
It was closed because I didn't have a valid reproducer. Now it happened when I didn't do anything: I just installed RHGSWA, imported the cluster and let it run.

Comment 18 Lubos Trilety 2017-12-08 09:15:17 UTC
Tested with:
tendrl-node-agent-1.5.4-14.el7rhgs.noarch

The occurrence of the error is quite sporadic, but from time to time it appears in the log. In my case some appeared in the logs right away; for another occurrence I had to wait about half an hour. Moreover, it doesn't happen on all machines.

Dec 08 04:03:18 <hostname> collectd[15635]: Unable to parse brick device data for integration_id (e345648c-3cd6-4ba5-9301-f989289a2e2e), peer_name (<hostname>), brick (mnt_brick_beta_arbiter_1_1)
Dec 08 04:03:18 <hostname> collectd[15635]: Traceback (most recent call last):
Dec 08 04:03:18 <hostname> collectd[15635]: File "/usr/lib64/collectd/gluster/tendrl_gluster_brick_disk_stats.py", line 71, in fetch_brick_devices
Dec 08 04:03:18 <hostname> collectd[15635]: brick_path.replace('/', '_').replace("_", "", 1)
Dec 08 04:03:18 <hostname> collectd[15635]: File "/usr/lib64/python2.7/ast.py", line 49, in literal_eval
Dec 08 04:03:18 <hostname> collectd[15635]: node_or_string = parse(node_or_string, mode='eval')
Dec 08 04:03:18 <hostname> collectd[15635]: File "/usr/lib64/python2.7/ast.py", line 37, in parse
Dec 08 04:03:18 <hostname> collectd[15635]: return compile(source, filename, mode, PyCF_ONLY_AST)
Dec 08 04:03:18 <hostname> collectd[15635]: File "<unknown>", line 0
Dec 08 04:03:18 <hostname> collectd[15635]: ^
Dec 08 04:03:18 <hostname> collectd[15635]: SyntaxError: unexpected EOF while parsing
Dec 08 04:03:18 <hostname> collectd[15635]: Failed to fetch device details for brick <hostname>:/mnt/brick_beta_arbiter_1/1 of volume volume_beta_arbiter_2_plus_1x2

Comment 19 Rohan Kanade 2017-12-08 09:32:06 UTC
@Lubos, few questions for you

1) Do you see data for brick_beta_arbiter_1/1 in Tendrl grafana dashboard and UI?
2) Do you see data related to volume "volume_beta_arbiter_2_plus_1x2" in Tendrl grafana dashboard and UI? 

3) Do you get alerts saying "brick_beta_arbiter_1/1" not found or status of "brick_beta_arbiter_1/1" or "volume_beta_arbiter_2_plus_1x2" is stopped or failed?



The Tendrl central store is a cache and not a database; cache misses/errors which are logged in collectd or tendrl agents at some point in time are normal.

Comment 20 Lubos Trilety 2017-12-08 12:44:23 UTC
(In reply to Rohan Kanade from comment #19)
> @Lubos, few questions for you
> 
> 1) Do you see data for brick_beta_arbiter_1/1 in Tendrl grafana dashboard
> and UI?
> 2) Do you see data related to volume "volume_beta_arbiter_2_plus_1x2" in
> Tendrl grafana dashboard and UI? 
> 
> 3) Do you get alerts saying "brick_beta_arbiter_1/1" not found or status of
> "brick_beta_arbiter_1/1" or "volume_beta_arbiter_2_plus_1x2" is stopped or
> failed?
> 
> 
> 
> The Tendrl central store is a cache and not a database; cache misses/errors
> which are logged in collectd or tendrl agents at some point in
> time are normal.

Sorry, I don't have the cluster any more. However I see the same errors on my other cluster. And for that:
1) I can see some data on Grafana dashboard for the brick for which error appears in log
2) Yes there are some data for the volume on UI and on Grafana too
3) Nope, no alerts received about a bad state of any brick. When I kill the brick process I get the stop-state alert correctly, and the same for the clearing alert when I start the brick daemon again.

That said, if it's normal to have an error about a missing device about once per hour, then I am OK with moving this BZ to VERIFIED.

Comment 21 Rohan Kanade 2017-12-08 13:31:42 UTC
Please move to verified.

Comment 22 Lubos Trilety 2017-12-08 13:49:35 UTC
Tested with:
tendrl-node-agent-1.5.4-14.el7rhgs.noarch

With Comment 19 and Comment 20 in mind it seems to be working properly.

Comment 24 errata-xmlrpc 2017-12-18 04:39:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478