Description of problem:
Sometimes the gstatus command throws the traceback below instead of proper output. The issue is that glusterd is giving malformed XML output to the gstatus script.

# gstatus
Traceback (most recent call last):
  File "/usr/bin/gstatus", line 221, in <module>
    main()
  File "/usr/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 638, in update_state
    self.calc_connections()
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 713, in calc_connections
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Version-Release number of selected component (if applicable):
RHGS 3.1.3 on RHEL 7

How reproducible:
Not always reproducible.

Steps to Reproduce:
1. Install RHGS
2. Install gstatus
3. Run gstatus

Actual results:
gstatus gives a traceback at times instead of the cluster health status.

Expected results:
glusterd should provide proper XML output, and gstatus should work properly.

Additional info:
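For reference, "no element found: line 1, column 0" is exactly what ElementTree raises when it is handed an empty document, i.e. when the gluster command wrote no XML to stdout at all. A minimal illustration (not gstatus code):

import xml.etree.ElementTree as ETree

try:
    # Simulates gstatus parsing an empty stdout from the gluster CLI.
    ETree.fromstring('')
except ETree.ParseError as err:
    print(err)  # no element found: line 1, column 0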
Given that the customer hasn't captured the XML output when the issue was hit, and that it's not reproducible, this will be hard to debug. We'll do a code walkthrough of the XML generation part to see if anything is buggy there.
I see the following error entries in the CLI log:

[2016-07-21 05:58:45.598277] E [cli-rpc-ops.c:7744:gf_cli_status_cbk] 0-cli: Error outputting to xml
[2016-07-21 05:58:45.598341] E [cli-rpc-ops.c:8036:gf_cli_status_volume_all] 0-cli: status all failed

This indicates that some dict_get_* call failed while dumping the XML output. The CLI doesn't log these failures, so it's difficult to figure out which key was missing. The glusterd log, however, shows no evidence of an error at the same timestamp.

Having said this, IMO there is an issue in the way we generate the XML output:

<snip>
        /* <volume> */
        ret = xmlTextWriterStartElement (local->writer,
                                         (xmlChar *)"volume");
        XML_RET_CHECK_AND_GOTO (ret, out);

        ret = dict_get_str (dict, "volname", &volname);
        if (ret)
                goto out;

        ret = xmlTextWriterWriteFormatElement (local->writer,
                                               (xmlChar *)"volName",
                                               "%s", volname);
        ................
        ................
out:
        gf_log ("cli", GF_LOG_DEBUG, "Returning %d", ret);
        return ret;
</snip>

From this we can clearly see that if dict_get_str fails, we never close the <volume> XML tag. Once some other application (gstatus in this case) tries to parse the output, this can result in a traceback like the one above.
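To illustrate the consumer-side consequence: a document cut off with unclosed elements fails to parse with the same ParseError. A sketch using a made-up truncated document, not actual CLI output:

import xml.etree.ElementTree as ETree

# Hypothetical truncated output: <volume> was opened, but the error path
# jumped to the 'out' label before any closing tags were written.
truncated = '<cliOutput><volStatus><volumes><volume>'

try:
    ETree.fromstring(truncated)
except ETree.ParseError as err:
    print(err)  # e.g. "no element found: line 1, column 39"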
Riyas,

Can you also file a bug on the gstatus component for handling exceptions? gstatus needs to handle these exceptions as well and should not crash; see the sketch below.

Thanks,
Atin
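A minimal sketch of the kind of defensive handling meant here (parse_cli_xml is a hypothetical helper, not the actual gstatus API; gstatus would guard its ETree.fromstring() call in glustercmd.py the same way):

import sys
import xml.etree.ElementTree as ETree

def parse_cli_xml(raw_output):
    """Parse gluster CLI --xml output, failing gracefully if it is malformed."""
    try:
        return ETree.fromstring(raw_output)
    except ETree.ParseError as err:
        # Report the problem instead of dying with a raw traceback.
        sys.stderr.write("gstatus: malformed XML from gluster CLI: %s\n" % err)
        sys.exit(1)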
Riyas,

Unfortunately this will not help either: after a detailed code walkthrough I see that the XML output doesn't get dumped to stdout, since xmlTextWriterEndElement doesn't get called in this case, as highlighted in comment 1.

Looking at case 01674704, I see it mentions the command timing out, which is quite possible if the volume has a large number of connected clients. But I am not able to map the exit code 90 here; in my setup the CLI doesn't show anything if the command fails, and echo $? shows 1 instead of 90.

I've identified a fix on the CLI side to write the XML output and will be sending a patch by tomorrow. However, I'll continue to analyse why we ended up with exit code 90.
http://review.gluster.org/15384 posted upstream for review.
Upstream mainline : http://review.gluster.org/15384
Upstream 3.8 : http://review.gluster.org/15428

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
The gluster volume status all clients --xml command gives output in XML format.

[root@rhs-client46 ~]# gluster volume status all clients --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volStatus>
    <volumes>
      <volume>
        <volName>arvol</volName>
        <nodeCount>7</nodeCount>
        <node>
          <hostname>10.70.36.70</hostname>
          <path>/rhs/brick1/b1</path>
          <peerid>02cff39e-b4c9-435d-85f0-8b9fe3b33193</peerid>
          <status>1</status>
          <port>49152</port>
          <ports>
            <tcp>49152</tcp>
            <rdma>N/A</rdma>
          </ports>
          <pid>25364</pid>
          <clientsStatus>
            <clientCount>10</clientCount>
            <client>
              <hostname>10.70.44.7:1015</hostname>
              <bytesRead>9678485624</bytesRead>
              <bytesWrite>23700708</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:1023</hostname>
              <bytesRead>18708</bytesRead>
              <bytesWrite>16012</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:1021</hostname>
              <bytesRead>1032</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:984</hostname>
              <bytesRead>25708</bytesRead>
              <bytesWrite>22956</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:1021</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:994</hostname>
              <bytesRead>32308</bytesRead>
              <bytesWrite>28108</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:991</hostname>
              <bytesRead>21680</bytesRead>
              <bytesWrite>19412</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:974</hostname>
              <bytesRead>49564</bytesRead>
              <bytesWrite>44564</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:1014</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:985</hostname>
              <bytesRead>46260</bytesRead>
              <bytesWrite>41644</bytesWrite>
            </client>
          </clientsStatus>

gstatus command output
======================================

[root@rhs-client46 ~]# gstatus -a

     Product: RHGS Server v3.1Update3  Capacity:   3.90 TiB(raw bricks)
      Status: HEALTHY                      38.00 GiB(raw used)
   Glusterfs: 3.8.4                         1.40 TiB(usable from volumes)
  OverCommit: No               Snapshots:   9

   Nodes       :  4/  4      Volumes:   2 Up
   Self Heal   :  4/  2                 0 Up(Degraded)
   Bricks      :  8/  8                 0 Up(Partial)
   Connections :  4/ 68                 0 Down

Volume Information
        arvol            UP - 6/6 bricks up - Distributed-Replicate
                         Capacity: (1% used) 13.00 GiB/1.20 TiB (used/total)
                         Snapshots: 9
                         Self Heal:  6/ 6
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 60 tcp connections

        testvol          UP - 2/2 bricks up - Replicate
                         Capacity: (0% used) 33.00 MiB/199.00 GiB (used/total)
                         Snapshots: 0
                         Self Heal:  2/ 2
                         Tasks Active: None
                         Protocols: glusterfs:on  NFS:off  SMB:on
                         Gluster Connectivty: 4 hosts, 8 tcp connections

Status Messages
  - Cluster is HEALTHY, all_bricks checks successful

Bug verified on build glusterfs-3.8.4-2.el7rhgs.x86_64
LGTM :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html