Bug 1392181 - "gluster vol status all clients --xml" get malformed at times, causes gstatus to fail
Summary: "gluster vol status all clients --xml" get malformed at times, causes gstatus...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: cli
Version: 3.7.15
Hardware: All
OS: All
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact:
URL:
Whiteboard:
Depends On: 1372553
Blocks:
 
Reported: 2016-11-05 18:08 UTC by Giuseppe Ragusa
Modified: 2016-12-13 07:19 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.7.18
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1359619
Environment:
Last Closed: 2016-12-13 07:19:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
Red Hat Bugzilla 1372553 (Priority: medium, Status: CLOSED): "gluster vol status all clients --xml" doesn't generate xml if there is a failure in between (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1372553

Description Giuseppe Ragusa 2016-11-05 18:08:48 UTC
+++ This bug was initially created as a clone of Bug #1359619 +++

Description of problem:

Sometimes the gstatus command below produces a traceback instead of the proper output. The issue is that glusterd gives malformed XML output to the gstatus script.

# gstatus

Traceback (most recent call last):
  File "/usr/bin/gstatus", line 221, in <module>
    main()
  File "/usr/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 638, in update_state
    self.calc_connections()
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 713, in calc_connections
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0


Version-Release number of selected component (if applicable):

RHGS 3.1.3 on RHEL 7 

How reproducible:

Not always reproducible. 

Steps to Reproduce:

1. Install RHGS 
2. Install gstatus
3. Run gstatus

Actual results:

gstatus gives a traceback at times instead of the cluster health status.

Expected results:

glusterd should provide the proper xml output and gstatus should work properly. 
Additional info:

--- Additional comment from Atin Mukherjee on 2016-07-25 03:19:25 EDT ---

Given that the customer hasn't captured the XML output when the issue is hit and it's not reproducible, this will be hard to debug. We'll do a code walkthrough of the XML generation part to see if anything is buggy here.

--- Additional comment from Atin Mukherjee on 2016-07-26 00:45:49 EDT ---

I see the following error entries in the CLI log.

[2016-07-21 05:58:45.598277] E [cli-rpc-ops.c:7744:gf_cli_status_cbk] 0-cli: Error outputting to xml
[2016-07-21 05:58:45.598341] E [cli-rpc-ops.c:8036:gf_cli_status_volume_all] 0-cli: status all failed

This indicates that some dict_get_* call failed while dumping the XML output. The CLI doesn't log these failures, so it's difficult to figure out which key was missing. The GlusterD log, however, doesn't show any evidence of an error at the same timestamp. Having said this, IMO there is an issue in the way we generate XML output.

<snip>

/* <volume> */                                                              
        ret = xmlTextWriterStartElement (local->writer, (xmlChar *)"volume");   
        XML_RET_CHECK_AND_GOTO (ret, out);                                          
                                                                                    
        ret = dict_get_str (dict, "volname", &volname);                             
        if (ret)                                                                    
                goto out;                                                           
        ret = xmlTextWriterWriteFormatElement (local->writer,                       
                                               (xmlChar *)"volName", "%s",          
                                               volname);


................
................
out:                                                                            
        gf_log ("cli", GF_LOG_DEBUG, "Returning %d", ret);                      
        return ret; 


From this we can clearly see that if dict_get_str fails, we don't close the XML tag. And once some other application (gstatus in this case) tries to parse the output, this may result in a traceback as well.
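For reference, here is a small stand-alone Python sketch of my own (it mirrors the ETree.fromstring() call gstatus makes in glustercmd.py, but the guard function and its name are hypothetical, not gstatus code). It shows that both an empty document and a truncated one with an unclosed <volume> element raise exactly the ParseError seen in the tracebacks in this bug:

import xml.etree.ElementTree as ETree

def parse_cli_xml(raw):
    """Return the parsed document, or None if the CLI output is malformed."""
    try:
        return ETree.fromstring(raw)
    except ETree.ParseError as err:
        # An empty string fails with "no element found: line 1, column 0",
        # exactly as in the gstatus tracebacks above; a document whose
        # <volume> element was never closed fails the same way at the end
        # of the truncated input.
        print('malformed --xml output: %s' % err)
        return None

parse_cli_xml('')                                          # empty CLI output
parse_cli_xml('<cliOutput><volStatus><volumes><volume>')   # unclosed <volume> element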

--- Additional comment from Atin Mukherjee on 2016-09-01 08:20:07 EDT ---

Riyas,

Can you also file a bug on the gstatus component for handling exceptions? gstatus also needs to handle these exceptions and should not crash.

Thanks,
Atin

--- Additional comment from Atin Mukherjee on 2016-09-01 13:40:28 EDT ---

Riyas,

Unfortunately this will not help either: after a detailed code walkthrough I see that the XML output only gets dumped to stdout at the xmlTextWriterEndElement call, which doesn't get called in this case, as highlighted in comment 1.

Looking at case 01674704, I see it mentions the command getting timed out, which is very well possible if the volume has a large list of connected clients. But I am not able to map the exit code 90 here; in my setup the CLI doesn't show anything if the command fails, but echo $? shows 1 instead of 90.

I've identified a fix on the CLI side to write the XML output and will be sending a patch by tomorrow. However, I'll continue to analyse why we ended up with exit code 90.
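For whoever hits this next, a minimal diagnostic sketch (my own suggestion, not part of gstatus or the gluster CLI; the dump path is arbitrary) that records the CLI's exit code together with its raw stdout/stderr, so the malformed or empty --xml output can be attached to the bug the next time the timeout or the unexplained exit code shows up:

import subprocess

def run_gluster_xml(args, dump_path='/tmp/gluster-status-xml.dump'):
    # Run a gluster CLI command with --xml and keep the raw output around.
    cmd = ['gluster'] + list(args) + ['--xml']
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    # Save whatever came back, even if it is empty or truncated XML,
    # so the failing output can be examined later.
    with open(dump_path, 'wb') as dump:
        dump.write(out)
    print('exit code: %d, stdout bytes: %d, stderr bytes: %d'
          % (proc.returncode, len(out), len(err)))
    return proc.returncode, out, err

# The command discussed in this bug:
run_gluster_xml(['volume', 'status', 'all', 'clients'])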

--- Additional comment from Atin Mukherjee on 2016-09-06 00:34:06 EDT ---

http://review.gluster.org/15384 posted upstream for review.

--- Additional comment from Atin Mukherjee on 2016-09-19 09:03:51 EDT ---

Upstream mainline : http://review.gluster.org/15384
Upstream 3.8 : http://review.gluster.org/15428

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

--- Additional comment from Anil Shah on 2016-10-19 07:32:28 EDT ---

The gluster volume status all clients --xml command now gives output in XML format:

[root@rhs-client46 ~]# gluster volume status all clients --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volStatus>
    <volumes>
      <volume>
        <volName>arvol</volName>
        <nodeCount>7</nodeCount>
        <node>
          <hostname>10.70.36.70</hostname>
          <path>/rhs/brick1/b1</path>
          <peerid>02cff39e-b4c9-435d-85f0-8b9fe3b33193</peerid>
          <status>1</status>
          <port>49152</port>
          <ports>
            <tcp>49152</tcp>
            <rdma>N/A</rdma>
          </ports>
          <pid>25364</pid>
          <clientsStatus>
            <clientCount>10</clientCount>
            <client>
              <hostname>10.70.44.7:1015</hostname>
              <bytesRead>9678485624</bytesRead>
              <bytesWrite>23700708</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:1023</hostname>
              <bytesRead>18708</bytesRead>
              <bytesWrite>16012</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:1021</hostname>
              <bytesRead>1032</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:984</hostname>
              <bytesRead>25708</bytesRead>
              <bytesWrite>22956</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:1021</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:994</hostname>
              <bytesRead>32308</bytesRead>
              <bytesWrite>28108</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:991</hostname>
              <bytesRead>21680</bytesRead>
              <bytesWrite>19412</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:974</hostname>
              <bytesRead>49564</bytesRead>
              <bytesWrite>44564</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:1014</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:985</hostname>
              <bytesRead>46260</bytesRead>
              <bytesWrite>41644</bytesWrite>
            </client>
          </clientsStatus>


gstatus command output 
======================================

[root@rhs-client46 ~]# gstatus -a
 
     Product: RHGS Server v3.1Update3  Capacity:   3.90 TiB(raw bricks)
      Status: HEALTHY                       38.00 GiB(raw used)
   Glusterfs: 3.8.4                          1.40 TiB(usable from volumes)
  OverCommit: No                Snapshots:   9

   Nodes       :  4/  4		  Volumes:   2 Up
   Self Heal   :  4/  2		             0 Up(Degraded)
   Bricks      :  8/  8		             0 Up(Partial)
   Connections :  4/  68                     0 Down

Volume Information
	arvol            UP - 6/6 bricks up - Distributed-Replicate
	                 Capacity: (1% used) 13.00 GiB/1.20 TiB (used/total)
	                 Snapshots: 9
	                 Self Heal:  6/ 6
	                 Tasks Active: None
	                 Protocols: glusterfs:on  NFS:off  SMB:on
	                 Gluster Connectivty: 4 hosts, 60 tcp connections

	testvol          UP - 2/2 bricks up - Replicate
	                 Capacity: (0% used) 33.00 MiB/199.00 GiB (used/total)
	                 Snapshots: 0
	                 Self Heal:  2/ 2
	                 Tasks Active: None
	                 Protocols: glusterfs:on  NFS:off  SMB:on
	                 Gluster Connectivty: 4 hosts, 8 tcp connections


Status Messages
  - Cluster is HEALTHY, all_bricks checks successful


Bug verified on build glusterfs-3.8.4-2.el7rhgs.x86_64

Comment 1 Giuseppe Ragusa 2016-11-05 18:14:45 UTC
Reproduced on GlusterFS 3.7.16 (CentOS Storage SIG packages):

[root@shockley ~]# gluster vol status all clients
Client connections for volume ctdb
----------------------------------------------
Brick : read.gluster.private:/srv/glusterfs/disk0/ctdb_brick
Clients connected : 3
Hostname                                               BytesRead    BytesWritten
--------                                               ---------    ------------
172.25.5.21:49125                                           1032             624
172.25.5.22:49127                                         199404          177816
172.25.5.23:49116                                           1020             624
----------------------------------------------
Brick : hall.gluster.private:/srv/glusterfs/disk0/ctdb_brick
Clients connected : 3
Hostname                                               BytesRead    BytesWritten
--------                                               ---------    ------------
172.25.5.21:49122                                           1032             624
172.25.5.23:49114                                         199404          177816
172.25.5.22:49124                                           1020             624
----------------------------------------------
Brick : shockley.gluster.private:/srv/glusterfs/disk0/ctdb_arbiter_brick
Clients connected : 3
Hostname                                               BytesRead    BytesWritten
--------                                               ---------    ------------
172.25.5.21:49112                                         199428          180340
172.25.5.23:49034                                           1032             648
172.25.5.22:49049                                           1032             648
----------------------------------------------
...
(Output cut for brevity)
[root@shockley ~]# echo $?
0
[root@shockley ~]# gluster vol status all clients --xml
[root@shockley ~]# echo $?
96
[root@shockley ~]# gstatus

Traceback (most recent call last):
  File "/bin/gstatus", line 221, in <module>
    main()
  File "/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line 542, in update_state
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py", line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I didn't find any trace of the 3.8.x fixes being backported to the 3.7.x series; sorry if I missed something.

Comment 2 Giuseppe Ragusa 2016-11-06 06:34:04 UTC
I can confirm that http://review.gluster.org/15384 applies cleanly to the latest 3.7.17 sources and the resulting packages no longer exhibit the problem (gstatus executes as expected).

Comment 3 Worker Ant 2016-11-10 12:55:41 UTC
REVIEW: http://review.gluster.org/15820 (cli: fix volume status xml generation) posted (#1) for review on release-3.7 by Atin Mukherjee (amukherj)

Comment 4 Worker Ant 2016-11-14 11:09:40 UTC
COMMIT: http://review.gluster.org/15820 committed in release-3.7 by Atin Mukherjee (amukherj) 
------
commit a4ecc73a5cd7029dd5ad4a8545a4ec41365ffce9
Author: Atin Mukherjee <amukherj>
Date:   Fri Sep 2 10:42:44 2016 +0530

    cli: fix volume status xml generation
    
    While generating xml, if CLI fails in between xml output doesn't get dumped into
    stdout. Fix is to invoke cli_xml_output_vol_status_end () in such failures.
    
    >Reviewed-on: http://review.gluster.org/15384
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Smoke: Gluster Build System <jenkins.org>
    >Reviewed-by: Samikshan Bairagya <samikshan>
    >Reviewed-by: Prashanth Pai <ppai>
    
    Change-Id: I7cb3097f5ae23092e6d20f68bd75aa190c31ed88
    BUG: 1392181
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/15820
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>

Comment 5 Kaushal 2016-12-13 07:19:35 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.18, please open a new bug report.

glusterfs-3.7.18 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-December/029427.html
[2] https://www.gluster.org/pipermail/gluster-users/

