Description of problem:
Sometimes the gstatus command throws the traceback below instead of proper output. The issue is that glusterd is giving malformed XML output to the gstatus script.

# gstatus
Traceback (most recent call last):
  File "/usr/bin/gstatus", line 221, in <module>
    main()
  File "/usr/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 638, in update_state
    self.calc_connections()
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 713, in calc_connections
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Version-Release number of selected component (if applicable):
RHGS 3.1.3 on RHEL 7

How reproducible:
Not always reproducible.

Steps to Reproduce:
1. Install RHGS
2. Install gstatus
3. Run gstatus

Actual results:
gstatus gives a traceback at times instead of the cluster health status.

Expected results:
glusterd should provide proper XML output, and gstatus should work properly.

Additional info:
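For reference, "no element found: line 1, column 0" is exactly what ElementTree raises when it is handed an empty document, i.e. when the gluster command wrote no XML to stdout at all. A minimal illustration (not gstatus code):

import xml.etree.ElementTree as ETree

try:
    # Simulates gstatus parsing an empty stdout from the gluster CLI.
    ETree.fromstring('')
except ETree.ParseError as err:
    print(err)  # no element found: line 1, column 0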
Given that the customer hasn't captured the XML output when the issue was hit, and that it's not reproducible, this will be hard to debug. We'll do a code walkthrough of the XML generation part to see if anything is buggy there.
I see the following error entries in the CLI log:

[2016-07-21 05:58:45.598277] E [cli-rpc-ops.c:7744:gf_cli_status_cbk] 0-cli: Error outputting to xml
[2016-07-21 05:58:45.598341] E [cli-rpc-ops.c:8036:gf_cli_status_volume_all] 0-cli: status all failed

This indicates that some dict_get_* call failed while dumping the XML output. The CLI doesn't log these failures, so it's difficult to figure out which key was missing. The glusterd log, however, shows no evidence of an error at the same timestamp.

Having said this, IMO there is an issue in the way we generate the XML output:

<snip>
        /* <volume> */
        ret = xmlTextWriterStartElement (local->writer,
                                         (xmlChar *)"volume");
        XML_RET_CHECK_AND_GOTO (ret, out);

        ret = dict_get_str (dict, "volname", &volname);
        if (ret)
                goto out;

        ret = xmlTextWriterWriteFormatElement (local->writer,
                                               (xmlChar *)"volName",
                                               "%s", volname);
        ................
        ................
out:
        gf_log ("cli", GF_LOG_DEBUG, "Returning %d", ret);
        return ret;
</snip>

From this we can clearly see that if dict_get_str fails, we never close the <volume> XML tag. Once some other application (gstatus in this case) tries to parse the output, this can result in a traceback like the one above.
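To illustrate the consumer-side consequence: a document cut off with unclosed elements fails to parse with the same ParseError. A sketch using a made-up truncated document, not actual CLI output:

import xml.etree.ElementTree as ETree

# Hypothetical truncated output: <volume> was opened, but the error path
# jumped to the 'out' label before any closing tags were written.
truncated = '<cliOutput><volStatus><volumes><volume>'

try:
    ETree.fromstring(truncated)
except ETree.ParseError as err:
    print(err)  # e.g. "no element found: line 1, column 39"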
Riyas,

Can you also file a bug on the gstatus component for handling exceptions? gstatus needs to handle these exceptions as well and should not crash; see the sketch below.

Thanks,
Atin
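A minimal sketch of the kind of defensive handling meant here (parse_cli_xml is a hypothetical helper, not the actual gstatus API; gstatus would guard its ETree.fromstring() call in glustercmd.py the same way):

import sys
import xml.etree.ElementTree as ETree

def parse_cli_xml(raw_output):
    """Parse gluster CLI --xml output, failing gracefully if it is malformed."""
    try:
        return ETree.fromstring(raw_output)
    except ETree.ParseError as err:
        # Report the problem instead of dying with a raw traceback.
        sys.stderr.write("gstatus: malformed XML from gluster CLI: %s\n" % err)
        sys.exit(1)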
Riyas,

Unfortunately this will not help either: after a detailed code walkthrough I see that the XML output doesn't get dumped to stdout, since xmlTextWriterEndElement doesn't get called in this case, as highlighted in comment 1.

Looking at case 01674704, I see it mentions the command timing out, which is quite possible if the volume has a large number of connected clients. But I am not able to map the exit code 90 here; in my setup the CLI doesn't show anything if the command fails, and echo $? shows 1 instead of 90.

I've identified a fix on the CLI side to write the XML output and will be sending a patch by tomorrow. However, I'll continue to analyse why we ended up with exit code 90.
http://review.gluster.org/15384 posted upstream for review.
Upstream mainline : http://review.gluster.org/15384
Upstream 3.8 : http://review.gluster.org/15428

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
The gluster volume status all clients --xml command gives output in XML format.

[root@rhs-client46 ~]# gluster volume status all clients --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volStatus>
    <volumes>
      <volume>
        <volName>arvol</volName>
        <nodeCount>7</nodeCount>
        <node>
          <hostname>10.70.36.70</hostname>
          <path>/rhs/brick1/b1</path>
          <peerid>02cff39e-b4c9-435d-85f0-8b9fe3b33193</peerid>
          <status>1</status>
          <port>49152</port>
          <ports>
            <tcp>49152</tcp>
            <rdma>N/A</rdma>
          </ports>
          <pid>25364</pid>
          <clientsStatus>
            <clientCount>10</clientCount>
            <client>
              <hostname>10.70.44.7:1015</hostname>
              <bytesRead>9678485624</bytesRead>
              <bytesWrite>23700708</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:1023</hostname>
              <bytesRead>18708</bytesRead>
              <bytesWrite>16012</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:1021</hostname>
              <bytesRead>1032</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:984</hostname>
              <bytesRead>25708</bytesRead>
              <bytesWrite>22956</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:1021</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:994</hostname>
              <bytesRead>32308</bytesRead>
              <bytesWrite>28108</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:991</hostname>
              <bytesRead>21680</bytesRead>
              <bytesWrite>19412</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:974</hostname>
              <bytesRead>49564</bytesRead>
              <bytesWrite>44564</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:1014</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:985</hostname>
              <bytesRead>46260</bytesRead>
              <bytesWrite>41644</bytesWrite>
            </client>
          </clientsStatus>

gstatus command output
======================================

[root@rhs-client46 ~]# gstatus -a

     Product: RHGS Server v3.1Update3  Capacity:   3.90 TiB(raw bricks)
      Status: HEALTHY                      38.00 GiB(raw used)
   Glusterfs: 3.8.4                         1.40 TiB(usable from volumes)
  OverCommit: No               Snapshots:   9

   Nodes       :  4/  4      Volumes:   2 Up
   Self Heal   :  4/  2                 0 Up(Degraded)
   Bricks      :  8/  8                 0 Up(Partial)
   Connections :  4/ 68                 0 Down

Volume Information
        arvol            UP - 6/6 bricks up - Distributed-Replicate
                         Capacity: (1% used) 13.00 GiB/1.20 TiB (used/total)
                         Snapshots: 9
                         Self Heal:  6/ 6
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 60 tcp connections

        testvol          UP - 2/2 bricks up - Replicate
                         Capacity: (0% used) 33.00 MiB/199.00 GiB (used/total)
                         Snapshots: 0
                         Self Heal:  2/ 2
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 8 tcp connections

Status Messages
  - Cluster is HEALTHY, all_bricks checks successful

Bug verified on build glusterfs-3.8.4-2.el7rhgs.x86_64
LGTM :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html