Description of problem:
========================
In my systemic environment (https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609) I was trying to get the heal info using the --xml format. There are a lot of files/entries to be healed, since one brick from each replica pair of the 4x2 volume was brought down. I issued "heal info --xml" and even after half an hour no output has been displayed yet. I checked it against the normal "heal info" (without --xml), which starts displaying output within a minute. This may be a real problem if a customer wants to use XML formatting.

I wanted to check the CPU usage, but did not find much consumption (so we are good there):

 9450 root      20   0  823000  88192   4648 S   1.0  0.5   0:02.78 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.82 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.7  0.5   0:02.87 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.0  0.5   0:02.90 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.94 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.98 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:03.02 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:03.06 glfsheal

[root@dhcp35-191 ~]# gluster v status
Status of volume: distrepvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick1/distrepvol   N/A       N/A        N       N/A
Brick 10.70.37.187:/rhs/brick1/distrepvol   49154     0          Y       3867
Brick 10.70.35.3:/rhs/brick1/distrepvol     N/A       N/A        N       N/A
Brick 10.70.37.150:/rhs/brick1/distrepvol   49154     0          Y       3918
Brick 10.70.35.191:/rhs/brick2/distrepvol   49155     0          Y       7568
Brick 10.70.37.187:/rhs/brick2/distrepvol   N/A       N/A        N       N/A
Brick 10.70.35.3:/rhs/brick2/distrepvol     49155     0          Y       5341
Brick 10.70.37.150:/rhs/brick2/distrepvol   N/A       N/A        N       N/A
Snapshot Daemon on localhost                49152     0          Y       8358
Self-heal Daemon on localhost               N/A       N/A        Y       7588
Quota Daemon on localhost                   N/A       N/A        Y       8211
Snapshot Daemon on 10.70.35.3               49152     0          Y       5858
Self-heal Daemon on 10.70.35.3              N/A       N/A        Y       5361
Quota Daemon on 10.70.35.3                  N/A       N/A        Y       5762
Snapshot Daemon on 10.70.37.150             49152     0          Y       4477
Self-heal Daemon on 10.70.37.150            N/A       N/A        Y       3957
Quota Daemon on 10.70.37.150                N/A       N/A        Y       4380
Snapshot Daemon on 10.70.37.187             49152     0          Y       4428
Self-heal Daemon on 10.70.37.187            N/A       N/A        Y       3907
Quota Daemon on 10.70.37.187                N/A       N/A        Y       4330

Task Status of Volume distrepvol
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):
[root@dhcp35-191 ~]# rpm -qa|grep gluster
glusterfs-libs-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-3.8.4-1.el7rhgs.x86_64
glusterfs-api-3.8.4-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-1.el7rhgs.x86_64
glusterfs-events-3.8.4-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-1.el7rhgs.x86_64
glusterfs-server-3.8.4-1.el7rhgs.x86_64
python-gluster-3.8.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-1.el7rhgs.x86_64
[root@dhcp35-191 ~]#

Found this as part of my QA test plan validation of Bug 1366128 - "heal info --xml" not showing the brick name of offline bricks.
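For reference, a rough sketch of the comparison described above (the volume name distrepvol and the timings are specific to this setup; the output file paths are only examples):

time gluster volume heal distrepvol info > /tmp/heal-info.txt        # plain heal info starts printing entries within about a minute
time gluster volume heal distrepvol info --xml > /tmp/heal-info.xml  # the --xml variant produced no output even after ~30 minutes

# CPU usage of the glfsheal process while the --xml run is in progress (stays low, as the top snippet above shows)
top -b -n 1 -p "$(pgrep -n glfsheal)"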
[root@dhcp37-150 glusterfs]# tailf glfsheal-distrepvol.log
[2016-10-07 11:47:07.332315] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-7: connection to 10.70.37.150:49155 failed (Connection refused)
[2016-10-07 11:47:07.335381] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-2: changing port to 49154 (from 0)
[2016-10-07 11:47:07.340317] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-2: connection to 10.70.35.3:49154 failed (Connection refused)
[2016-10-07 11:47:09.340486] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-distrepvol-snapd-client: DNS resolution failed on host /var/run/glusterd.socket
[2016-10-07 11:47:10.349650] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-7: changing port to 49155 (from 0)
[2016-10-07 11:47:10.349914] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-5: changing port to 49155 (from 0)
[2016-10-07 11:47:10.352448] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-7: connection to 10.70.37.150:49155 failed (Connection refused)
[2016-10-07 11:47:10.355497] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-5: connection to 10.70.37.187:49155 failed (Connection refused)
[2016-10-07 11:47:10.355554] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-2: changing port to 49154 (from 0)
[2016-10-07 11:47:10.358502] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-2: connection to 10.70.35.3:49154 failed (Connection refused)
[2016-10-07 11:47:12.356667] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-distrepvol-snapd-client: DNS resolution failed on host /var/run/glusterd.socket
As per our last discussion about this BZ, there seem to be new entries continuously getting added for healing, right? i.e., will the command ever end in that case?
Statedumps available at:

[qe@rhsqe-repo nchilaka]$ chmod -R 0777 /home/repo/sosreports/nchilaka/bug.1382686
[qe@rhsqe-repo nchilaka]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com

(4 statedumps taken, one every 30 min, for all 4 nodes)
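For completeness, a sketch of how statedumps like these are typically collected (this assumes the default statedump directory /var/run/gluster; whether the short-lived glfsheal helper honours SIGUSR1 the same way as the long-running daemons is an assumption here):

gluster volume statedump distrepvol    # triggers a statedump of the volume's brick processes
kill -SIGUSR1 <pid>                    # glusterfs daemons (e.g. the self-heal daemon) dump state on SIGUSR1
ls -lrt /var/run/gluster/              # the newest *.dump.* files are the statedumps just taken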
Tested on 3.8.4-13, for a volume which has a lot of files in heal pending:

[root@dhcp35-37 ~]# time gluster v heal distrep info|grep ntries
Number of entries: 112484
Number of entries: 113455
Number of entries: 112327
Number of entries: 113872

heal info shows the above pending entries, and it keeps streaming the output instead of buffering it. With heal info --xml, I still see the issue: the output is not streamed and instead everything is dumped at the end. Hence failing the fix (discussed with Pranith).

[root@dhcp35-37 ~]# gluster v info distrep

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: df5319f0-d889-4030-bb39-b8a41936a726
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick1/distrep
Brick2: 10.70.35.116:/rhs/brick1/distrep
Brick3: 10.70.35.37:/rhs/brick2/distrep
Brick4: 10.70.35.116:/rhs/brick2/distrep
Options Reconfigured:
cluster.self-heal-daemon: disable
performance.readdir-ahead: on
nfs.disable: on

[root@dhcp35-37 ~]# gluster v status distrep
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick1/distrep       49153     0          Y       600
Brick 10.70.35.116:/rhs/brick1/distrep      49152     0          Y       32269
Brick 10.70.35.37:/rhs/brick2/distrep       49154     0          Y       620
Brick 10.70.35.116:/rhs/brick2/distrep      49153     0          Y       32288

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-37 ~]#
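A quick way to see the streaming vs. buffering difference described above (a sketch, using the distrep volume from this test; head is only there to show when the first bytes of output arrive):

gluster volume heal distrep info | head -n 20        # entries appear almost immediately; output is streamed brick by brick
gluster volume heal distrep info --xml | head -n 20  # nothing is printed until the whole XML document is assembled, so even head blocks until the command completes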
I'm closing this since the BZ is old and there are no immediate plans to look at it. If the issue occurs in a recent RHGS version and you feel it is important to be looked at, please re-open.