Description of problem:
========================
In my systemic environment (https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609) I was trying to get the heal info using the --xml format. There are a lot of files/entries to be healed, since one brick from each replica pair of the 4x2 volume was brought down. I issued "heal info --xml" and even after half an hour no output has been displayed yet. I checked it against the normal "heal info" (without --xml), which starts displaying output within a minute. This may be a real problem if a customer wants to use XML formatting.

I wanted to check the CPU usage, but did not find much consumption (so we are good there):

 9450 root      20   0  823000  88192   4648 S   1.0  0.5   0:02.78 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.82 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.7  0.5   0:02.87 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.0  0.5   0:02.90 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.94 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:02.98 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:03.02 glfsheal
 9450 root      20   0  823000  88192   4648 S   1.3  0.5   0:03.06 glfsheal

[root@dhcp35-191 ~]# gluster v status
Status of volume: distrepvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.191:/rhs/brick1/distrepvol   N/A       N/A        N       N/A
Brick 10.70.37.187:/rhs/brick1/distrepvol   49154     0          Y       3867
Brick 10.70.35.3:/rhs/brick1/distrepvol     N/A       N/A        N       N/A
Brick 10.70.37.150:/rhs/brick1/distrepvol   49154     0          Y       3918
Brick 10.70.35.191:/rhs/brick2/distrepvol   49155     0          Y       7568
Brick 10.70.37.187:/rhs/brick2/distrepvol   N/A       N/A        N       N/A
Brick 10.70.35.3:/rhs/brick2/distrepvol     49155     0          Y       5341
Brick 10.70.37.150:/rhs/brick2/distrepvol   N/A       N/A        N       N/A
Snapshot Daemon on localhost                49152     0          Y       8358
Self-heal Daemon on localhost               N/A       N/A        Y       7588
Quota Daemon on localhost                   N/A       N/A        Y       8211
Snapshot Daemon on 10.70.35.3               49152     0          Y       5858
Self-heal Daemon on 10.70.35.3              N/A       N/A        Y       5361
Quota Daemon on 10.70.35.3                  N/A       N/A        Y       5762
Snapshot Daemon on 10.70.37.150             49152     0          Y       4477
Self-heal Daemon on 10.70.37.150            N/A       N/A        Y       3957
Quota Daemon on 10.70.37.150                N/A       N/A        Y       4380
Snapshot Daemon on 10.70.37.187             49152     0          Y       4428
Self-heal Daemon on 10.70.37.187            N/A       N/A        Y       3907
Quota Daemon on 10.70.37.187                N/A       N/A        Y       4330

Task Status of Volume distrepvol
------------------------------------------------------------------------------
There are no active volume tasks

Version-Release number of selected component (if applicable):
[root@dhcp35-191 ~]# rpm -qa|grep gluster
glusterfs-libs-3.8.4-1.el7rhgs.x86_64
glusterfs-fuse-3.8.4-1.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-1.el7rhgs.x86_64
glusterfs-3.8.4-1.el7rhgs.x86_64
glusterfs-api-3.8.4-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-1.el7rhgs.x86_64
glusterfs-events-3.8.4-1.el7rhgs.x86_64
glusterfs-rdma-3.8.4-1.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-1.el7rhgs.x86_64
glusterfs-server-3.8.4-1.el7rhgs.x86_64
python-gluster-3.8.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-1.el7rhgs.x86_64
[root@dhcp35-191 ~]#

Found this as part of my QA test plan validation of Bug 1366128 - "heal info --xml" not showing the brick name of offline bricks.
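For reference, a rough sketch of the comparison described above (the volume name distrepvol and the timings are specific to this setup; the output file paths are only examples):

time gluster volume heal distrepvol info > /tmp/heal-info.txt        # plain heal info starts printing entries within about a minute
time gluster volume heal distrepvol info --xml > /tmp/heal-info.xml  # the --xml variant produced no output even after ~30 minutes

# CPU usage of the glfsheal process while the --xml run is in progress (stays low, as the top snippet above shows)
top -b -n 1 -p "$(pgrep -n glfsheal)"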
[root@dhcp37-150 glusterfs]# tailf glfsheal-distrepvol.log
[2016-10-07 11:47:07.332315] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-7: connection to 10.70.37.150:49155 failed (Connection refused)
[2016-10-07 11:47:07.335381] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-2: changing port to 49154 (from 0)
[2016-10-07 11:47:07.340317] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-2: connection to 10.70.35.3:49154 failed (Connection refused)
[2016-10-07 11:47:09.340486] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-distrepvol-snapd-client: DNS resolution failed on host /var/run/glusterd.socket
[2016-10-07 11:47:10.349650] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-7: changing port to 49155 (from 0)
[2016-10-07 11:47:10.349914] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-5: changing port to 49155 (from 0)
[2016-10-07 11:47:10.352448] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-7: connection to 10.70.37.150:49155 failed (Connection refused)
[2016-10-07 11:47:10.355497] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-5: connection to 10.70.37.187:49155 failed (Connection refused)
[2016-10-07 11:47:10.355554] I [rpc-clnt.c:1947:rpc_clnt_reconfig] 0-distrepvol-client-2: changing port to 49154 (from 0)
[2016-10-07 11:47:10.358502] E [socket.c:2309:socket_connect_finish] 0-distrepvol-client-2: connection to 10.70.35.3:49154 failed (Connection refused)
[2016-10-07 11:47:12.356667] E [name.c:262:af_inet_client_get_remote_sockaddr] 0-distrepvol-snapd-client: DNS resolution failed on host /var/run/glusterd.socket
As per our last discussion about this BZ, there seem to be new entries continuously getting added for healing, right? i.e., will the command ever end in that case?
Statedumps available at:

[qe@rhsqe-repo nchilaka]$ chmod -R 0777 /home/repo/sosreports/nchilaka/bug.1382686
[qe@rhsqe-repo nchilaka]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com

(4 statedumps taken, one every 30 min, for all 4 nodes)
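For completeness, a sketch of how statedumps like these are typically collected (this assumes the default statedump directory /var/run/gluster; whether the short-lived glfsheal helper honours SIGUSR1 the same way as the long-running daemons is an assumption here):

gluster volume statedump distrepvol    # triggers a statedump of the volume's brick processes
kill -SIGUSR1 <pid>                    # glusterfs daemons (e.g. the self-heal daemon) dump state on SIGUSR1
ls -lrt /var/run/gluster/              # the newest *.dump.* files are the statedumps just taken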
Tested on 3.8.4-13, for a volume which has a lot of files in heal pending:

[root@dhcp35-37 ~]# time gluster v heal distrep info|grep ntries
Number of entries: 112484
Number of entries: 113455
Number of entries: 112327
Number of entries: 113872

heal info shows the above pending entries, and it keeps streaming the output instead of buffering it. With heal info --xml, I still see the issue: the output is not streamed and instead everything is dumped at the end. Hence failing the fix (discussed with Pranith).

[root@dhcp35-37 ~]# gluster v info distrep

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: df5319f0-d889-4030-bb39-b8a41936a726
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick1/distrep
Brick2: 10.70.35.116:/rhs/brick1/distrep
Brick3: 10.70.35.37:/rhs/brick2/distrep
Brick4: 10.70.35.116:/rhs/brick2/distrep
Options Reconfigured:
cluster.self-heal-daemon: disable
performance.readdir-ahead: on
nfs.disable: on

[root@dhcp35-37 ~]# gluster v status distrep
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick1/distrep       49153     0          Y       600
Brick 10.70.35.116:/rhs/brick1/distrep      49152     0          Y       32269
Brick 10.70.35.37:/rhs/brick2/distrep       49154     0          Y       620
Brick 10.70.35.116:/rhs/brick2/distrep      49153     0          Y       32288

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-37 ~]#
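A quick way to see the streaming vs. buffering difference described above (a sketch, using the distrep volume from this test; head is only there to show when the first bytes of output arrive):

gluster volume heal distrep info | head -n 20        # entries appear almost immediately; output is streamed brick by brick
gluster volume heal distrep info --xml | head -n 20  # nothing is printed until the whole XML document is assembled, so even head blocks until the command completes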
I'm closing this since the BZ is old and there are no immediate plans to look at it. If the issue occurs in a recent RHGS version and you feel it is important to be looked at, please re-open.