+++ This bug was initially created as a clone of Bug #1359619 +++

Description of problem:

Sometimes the gstatus command below gives a traceback instead of the proper output. The issue is that glusterd gives malformed xml output to the gstatus script.

# gstatus
Traceback (most recent call last):
  File "/usr/bin/gstatus", line 221, in <module>
    main()
  File "/usr/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 638, in update_state
    self.calc_connections()
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 713, in calc_connections
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Version-Release number of selected component (if applicable):
RHGS 3.1.3 on RHEL 7

How reproducible:
Not always reproducible.

Steps to Reproduce:
1. Install RHGS
2. Install gstatus
3. Run gstatus

Actual results:
gstatus at times gives a traceback instead of the cluster health status.

Expected results:
glusterd should provide proper xml output and gstatus should work properly.

Additional info:

--- Additional comment from Atin Mukherjee on 2016-07-25 03:19:25 EDT ---

Given that the customer hasn't captured the xml output when the issue was hit and it's not reproducible, this will be hard to debug. We'll do a code walkthrough of the XML generation part to see if anything is buggy here.

--- Additional comment from Atin Mukherjee on 2016-07-26 00:45:49 EDT ---

I see the following error entries in the CLI log:

[2016-07-21 05:58:45.598277] E [cli-rpc-ops.c:7744:gf_cli_status_cbk] 0-cli: Error outputting to xml
[2016-07-21 05:58:45.598341] E [cli-rpc-ops.c:8036:gf_cli_status_volume_all] 0-cli: status all failed

This indicates that while dumping the xml output some dict_get_* call failed. The CLI doesn't log these failures, so it's difficult to figure out which key was missing. The GlusterD log, however, shows no evidence of an error at the same timestamp.

Having said this, IMO there is an issue in the way we generate the xml output:

<snip>
        /* <volume> */
        ret = xmlTextWriterStartElement (local->writer, (xmlChar *)"volume");
        XML_RET_CHECK_AND_GOTO (ret, out);

        ret = dict_get_str (dict, "volname", &volname);
        if (ret)
                goto out;
        ret = xmlTextWriterWriteFormatElement (local->writer,
                                               (xmlChar *)"volName",
                                               "%s", volname);
        ................
        ................
out:
        gf_log ("cli", GF_LOG_DEBUG, "Returning %d", ret);
        return ret;

From this we can clearly see that if dict_get_str fails we don't close the xml tag. Once some other application (gStatus in this case) tries to parse the output, this can result in a traceback as well.

--- Additional comment from Atin Mukherjee on 2016-09-01 08:20:07 EDT ---

Riyas,

Can you also file a bug on the gstatus component for handling exceptions? gStatus needs to handle these exceptions as well and should not crash.

Thanks,
Atin
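For illustration only, here is a minimal sketch of the kind of defensive parsing being asked for above. The helper name and error messages are hypothetical; this is not taken from the actual gstatus code:

# Hypothetical sketch: run a gluster CLI command with --xml and fail with a
# readable error instead of an unhandled traceback when the output is empty
# or malformed.
import subprocess
import sys
import xml.etree.ElementTree as ETree

def run_gluster_xml(args):
    proc = subprocess.Popen(["gluster"] + args + ["--xml"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    if proc.returncode != 0 or not out.strip():
        # The CLI failed or printed nothing; there is no XML to parse.
        sys.exit("gluster %s --xml failed (rc=%d), nothing to parse"
                 % (" ".join(args), proc.returncode))
    try:
        return ETree.fromstring(out)
    except ETree.ParseError as err:
        # Malformed XML (e.g. unclosed tags) should be reported, not crash.
        sys.exit("gluster %s --xml returned malformed XML: %s"
                 % (" ".join(args), err))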
--- Additional comment from Atin Mukherjee on 2016-09-01 13:40:28 EDT ---

Riyas,

Unfortunately this will not help either. After a detailed code walkthrough I see that the xml output only gets dumped to stdout at xmlEndWriterElement, which doesn't get called in this case, as highlighted in comment 1.

Looking at case 01674704, I see it mentions the command timing out, which is very well possible if the volume has a large list of connected clients. But I am not able to map the exit code 90 here; on my setup the cli doesn't show anything if the command fails, but echo $? shows 1 instead of 90.

I've identified a fix on the CLI side to write the xml output and will be sending a patch by tomorrow. However, I'll continue to analyse why we end up with exit code 90.

--- Additional comment from Atin Mukherjee on 2016-09-06 00:34:06 EDT ---

http://review.gluster.org/15384 posted upstream for review.

--- Additional comment from Atin Mukherjee on 2016-09-19 09:03:51 EDT ---

Upstream mainline : http://review.gluster.org/15384
Upstream 3.8      : http://review.gluster.org/15428

And the fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

--- Additional comment from Anil Shah on 2016-10-19 07:32:28 EDT ---

The gluster volume status all clients --xml command now gives output in xml format:

[root@rhs-client46 ~]# gluster volume status all clients --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volStatus>
    <volumes>
      <volume>
        <volName>arvol</volName>
        <nodeCount>7</nodeCount>
        <node>
          <hostname>10.70.36.70</hostname>
          <path>/rhs/brick1/b1</path>
          <peerid>02cff39e-b4c9-435d-85f0-8b9fe3b33193</peerid>
          <status>1</status>
          <port>49152</port>
          <ports>
            <tcp>49152</tcp>
            <rdma>N/A</rdma>
          </ports>
          <pid>25364</pid>
          <clientsStatus>
            <clientCount>10</clientCount>
            <client>
              <hostname>10.70.44.7:1015</hostname>
              <bytesRead>9678485624</bytesRead>
              <bytesWrite>23700708</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:1023</hostname>
              <bytesRead>18708</bytesRead>
              <bytesWrite>16012</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:1021</hostname>
              <bytesRead>1032</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:984</hostname>
              <bytesRead>25708</bytesRead>
              <bytesWrite>22956</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:1021</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:994</hostname>
              <bytesRead>32308</bytesRead>
              <bytesWrite>28108</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:991</hostname>
              <bytesRead>21680</bytesRead>
              <bytesWrite>19412</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:974</hostname>
              <bytesRead>49564</bytesRead>
              <bytesWrite>44564</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:1014</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:985</hostname>
              <bytesRead>46260</bytesRead>
              <bytesWrite>41644</bytesWrite>
            </client>
          </clientsStatus>

gstatus command output
======================================

[root@rhs-client46 ~]# gstatus -a

     Product: RHGS Server v3.1Update3   Capacity:  3.90 TiB(raw bricks)
      Status: HEALTHY                             38.00 GiB(raw used)
   Glusterfs: 3.8.4                                1.40 TiB(usable from volumes)
  OverCommit: No                Snapshots: 9

   Nodes       :  4/ 4          Volumes:  2 Up
   Self Heal   :  4/ 2                    0 Up(Degraded)
   Bricks      :  8/ 8                    0 Up(Partial)
   Connections :  4/ 68                   0 Down

   Volume Information
        arvol            UP - 6/6 bricks up - Distributed-Replicate
                         Capacity: (1% used) 13.00 GiB/1.20 TiB (used/total)
                         Snapshots: 9
                         Self Heal:  6/ 6
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 60 tcp connections

        testvol          UP - 2/2 bricks up - Replicate
                         Capacity: (0% used) 33.00 MiB/199.00 GiB (used/total)
                         Snapshots: 0
                         Self Heal:  2/ 2
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 8 tcp connections

   Status Messages
   - Cluster is HEALTHY, all_bricks checks successful

Bug verified on build glusterfs-3.8.4-2.el7rhgs.x86_64
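As an aside, the "Connections : 4/ 68" figure reported above is derived from the per-brick clientCount entries in the same 'status all clients' XML that caused the original traceback. A rough sketch of that roll-up, illustrative only and not the actual gstatus calc_connections code:

# Sum clientCount across all <node> elements of
# 'gluster volume status all clients --xml' output.
import xml.etree.ElementTree as ETree

def total_client_connections(xml_text):
    root = ETree.fromstring(xml_text)
    counts = root.findall(".//node/clientsStatus/clientCount")
    return sum(int(c.text) for c in counts)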
Verified on GlusterFS 3.7.16 (CentOS Storage SIG packages):

[root@shockley ~]# gluster vol status all clients
Client connections for volume ctdb
----------------------------------------------
Brick : read.gluster.private:/srv/glusterfs/disk0/ctdb_brick
Clients connected : 3
Hostname                BytesRead    BytesWritten
--------                ---------    ------------
172.25.5.21:49125            1032             624
172.25.5.22:49127          199404          177816
172.25.5.23:49116            1020             624
----------------------------------------------
Brick : hall.gluster.private:/srv/glusterfs/disk0/ctdb_brick
Clients connected : 3
Hostname                BytesRead    BytesWritten
--------                ---------    ------------
172.25.5.21:49122            1032             624
172.25.5.23:49114          199404          177816
172.25.5.22:49124            1020             624
----------------------------------------------
Brick : shockley.gluster.private:/srv/glusterfs/disk0/ctdb_arbiter_brick
Clients connected : 3
Hostname                BytesRead    BytesWritten
--------                ---------    ------------
172.25.5.21:49112          199428          180340
172.25.5.23:49034            1032             648
172.25.5.22:49049            1032             648
----------------------------------------------
... (Output cut for brevity)

[root@shockley ~]# echo $?
0
[root@shockley ~]# gluster vol status all clients --xml
[root@shockley ~]# echo $?
96
[root@shockley ~]# gstatus
Traceback (most recent call last):
  File "/bin/gstatus", line 221, in <module>
    main()
  File "/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 542, in update_state
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I didn't find any trace of the 3.8.x fixes being backported to 3.7.17 or newer; sorry if I missed something.
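The "no element found: line 1, column 0" message lines up with the empty --xml output above: when the CLI prints nothing (the exit-code-96 case), gstatus ends up handing an empty string to the XML parser. A minimal, standalone reproduction, assuming nothing about gstatus itself:

# Parsing an empty string raises the same ParseError seen in the traceback.
import xml.etree.ElementTree as ETree

try:
    ETree.fromstring("")   # equivalent to parsing the empty CLI stdout
except ETree.ParseError as err:
    print(err)             # -> no element found: line 1, column 0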
I can confirm that http://review.gluster.org/15384 applies cleanly to the latest 3.7.17 sources, and the resulting packages no longer exhibit the problem (gstatus runs as expected).
REVIEW: http://review.gluster.org/15820 (cli: fix volume status xml generation) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)
COMMIT: http://review.gluster.org/15820 committed in release-3.7 by Atin Mukherjee (amukherj)
------
commit a4ecc73a5cd7029dd5ad4a8545a4ec41365ffce9
Author: Atin Mukherjee <amukherj>
Date:   Fri Sep 2 10:42:44 2016 +0530

    cli: fix volume status xml generation

    While generating xml, if CLI fails in between xml output doesn't get
    dumped into stdout. Fix is to invoke cli_xml_output_vol_status_end ()
    in such failures.

    >Reviewed-on: http://review.gluster.org/15384
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Smoke: Gluster Build System <jenkins.org>
    >Reviewed-by: Samikshan Bairagya <samikshan>
    >Reviewed-by: Prashanth Pai <ppai>

    Change-Id: I7cb3097f5ae23092e6d20f68bd75aa190c31ed88
    BUG: 1392181
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/15820
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.18, please open a new bug report.

glusterfs-3.7.18 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-December/029427.html
[2] https://www.gluster.org/pipermail/gluster-users/