Bug 1463108 - Regression: Heal info takes longer time when a brick is down
Summary: Regression: Heal info takes longer time when a brick is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: 3.3.0-devel-freeze-exception
Depends On:
Blocks: 1417151 1464091 1465854
 
Reported: 2017-06-20 07:11 UTC by Nag Pavan Chilakam
Modified: 2018-08-16 06:50 UTC (History)
CC: 5 users

Fixed In Version: glusterfs-3.8.4-32
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1464091 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:59:42 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-06-20 07:11:10 UTC
Description of problem:
=========================
(A possible regression)
When a brick is down, the heal info of a disperse volume takes significantly longer.
For example, with no pending heals (a fresh setup) and all bricks up, the heal info response time for a 1x(4+2) volume is hardly 1 sec,
whereas with one brick down the same setup takes 11 sec.

In one case where pending heal entries existed, heal info hung (there were hardly 10 entries on each brick).
If I hit the hang again, I will raise a separate bug.

Version-Release number of selected component (if applicable):
===
3.8.4-28

How reproducible:
==
always

Steps to Reproduce:
1. Create a 4+2 EC volume and check the heal info time (it will be hardly 1 sec).
2. Kill one brick.
3. Check the heal info time again (takes about 11 sec).
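The timing comparison in the steps above can be sketched with a small wrapper that reports elapsed wall-clock seconds. The volume name "ecv" and the gluster invocation in the comment are illustrative stand-ins taken from the transcripts later in this report; the demo below uses a harmless `sleep` so the sketch is runnable anywhere.

```shell
#!/bin/sh
# Sketch: report how many wall-clock seconds a command takes.
# On a real setup you would time the heal info query itself, e.g.:
#   time_cmd gluster volume heal ecv info
# (volume name "ecv" is illustrative).
time_cmd() {
    start=$(date +%s)          # seconds since epoch before the command
    "$@" > /dev/null 2>&1      # run the command, discarding its output
    end=$(date +%s)            # seconds since epoch after the command
    echo $((end - start))      # print elapsed seconds
}

# Harmless stand-in for the gluster CLI call:
elapsed=$(time_cmd sleep 1)
echo "elapsed: ${elapsed}s"
```

Running this once with all bricks up and once with a brick killed makes the regression obvious: the reported elapsed time jumps from about 1 second to about 11 seconds on the affected build.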

Comment 2 Nag Pavan Chilakam 2017-06-20 09:20:53 UTC
Checked on a 3.2 build and I don't notice the problem.
Hence marking as a regression.

Last login: Tue Jun 20 14:40:40 2017 from dhcp35-77.lab.eng.blr.redhat.com
[root@dhcp41-212 ~]# cat /etc/redhat-*
cat: /etc/redhat-access-insights: Is a directory
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Red Hat Gluster Storage Server 3.2.0
[root@dhcp41-212 ~]# rpm -qa|grep gluster
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-fuse-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
nfs-ganesha-gluster-2.4.1-9.el7rhgs.x86_64
glusterfs-cli-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-rdma-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-client-xlators-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-server-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
python-gluster-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-api-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-debuginfo-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
samba-vfs-glusterfs-4.4.6-5.el7rhgs.x86_64
glusterfs-libs-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64
glusterfs-geo-replication-3.8.4-18.4.el7rhgs.1.HOTFIX.CASE01604226.BZ1428936.x86_64

Comment 9 Nag Pavan Chilakam 2017-06-22 14:35:16 UTC
With 2 bricks down and one entry to be healed, the output takes almost 2 minutes to respond and complete. This is too big a drop.
[root@dhcp35-45 ecv]# time gluster v heal ecv info
Brick 10.70.35.45:/rhs/brick1/ecv
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.23:/rhs/brick1/ecv
/dir1 
Status: Connected
Number of entries: 1

Brick 10.70.35.122:/rhs/brick1/ecv
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.35.45:/rhs/brick2/ecv
/dir1 
Status: Connected
Number of entries: 1

Brick 10.70.35.23:/rhs/brick2/ecv
/dir1 
Status: Connected
Number of entries: 1

Brick 10.70.35.122:/rhs/brick2/ecv
/dir1 
Status: Connected
Number of entries: 1


real	2m2.552s
user	0m36.088s
sys	0m33.221s
[root@dhcp35-45 ecv]# 

Following is the glfs log output:


[2017-06-22 14:28:39.901976] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-ecv-readdir-ahead: option 'parallel-readdir' is not recognized
[2017-06-22 14:28:39.902111] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 64686370-3335-2d34-352e-6c61622e656e (0) coming up
[2017-06-22 14:28:39.902136] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-0: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.905465] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-1: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.906050] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-0: changing port to 49160 (from 0)
[2017-06-22 14:28:39.908135] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-2: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.910916] E [socket.c:2360:socket_connect_finish] 0-ecv-client-0: connection to 10.70.35.45:49160 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:39.911327] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-1: changing port to 49155 (from 0)
[2017-06-22 14:28:39.913509] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-3: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.916339] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ecv-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-22 14:28:39.916479] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-2: changing port to 49152 (from 0)
[2017-06-22 14:28:39.918835] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-4: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.921243] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ecv-client-1: Connected to ecv-client-1, attached to remote volume '/rhs/brick1/ecv'.
[2017-06-22 14:28:39.921286] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ecv-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-22 14:28:39.921520] E [socket.c:2360:socket_connect_finish] 0-ecv-client-2: connection to 10.70.35.122:49152 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:39.921596] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ecv-client-1: Server lk version = 1
[2017-06-22 14:28:39.921830] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-3: changing port to 49158 (from 0)
[2017-06-22 14:28:39.924133] I [MSGID: 114020] [client.c:2356:notify] 0-ecv-client-5: parent translators are ready, attempting connect on transport
[2017-06-22 14:28:39.926850] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ecv-client-3: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-22 14:28:39.927535] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-4: changing port to 49153 (from 0)
[2017-06-22 14:28:39.927630] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ecv-client-3: Connected to ecv-client-3, attached to remote volume '/rhs/brick2/ecv'.
[2017-06-22 14:28:39.927660] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ecv-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-22 14:28:39.929419] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ecv-client-3: Server lk version = 1
[2017-06-22 14:28:39.929816] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-5: changing port to 49154 (from 0)
[2017-06-22 14:28:39.932243] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ecv-client-4: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-22 14:28:39.933224] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ecv-client-4: Connected to ecv-client-4, attached to remote volume '/rhs/brick2/ecv'.
[2017-06-22 14:28:39.933259] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ecv-client-4: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-22 14:28:39.933557] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ecv-client-4: Server lk version = 1
[2017-06-22 14:28:39.935075] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-ecv-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-06-22 14:28:39.935870] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-ecv-client-5: Connected to ecv-client-5, attached to remote volume '/rhs/brick2/ecv'.
[2017-06-22 14:28:39.935901] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-ecv-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2017-06-22 14:28:39.936145] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-ecv-client-5: Server lk version = 1
[2017-06-22 14:28:43.893978] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-0: changing port to 49160 (from 0)
[2017-06-22 14:28:43.896933] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-2: changing port to 49152 (from 0)
[2017-06-22 14:28:43.898922] E [socket.c:2360:socket_connect_finish] 0-ecv-client-0: connection to 10.70.35.45:49160 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:43.901701] E [socket.c:2360:socket_connect_finish] 0-ecv-client-2: connection to 10.70.35.122:49152 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:47.899547] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-0: changing port to 49160 (from 0)
[2017-06-22 14:28:47.903300] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-2: changing port to 49152 (from 0)
[2017-06-22 14:28:47.904508] E [socket.c:2360:socket_connect_finish] 0-ecv-client-0: connection to 10.70.35.45:49160 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:47.907162] E [socket.c:2360:socket_connect_finish] 0-ecv-client-2: connection to 10.70.35.122:49152 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:50.902197] I [MSGID: 122061] [ec.c:323:ec_up] 0-ecv-disperse-0: Going UP
[2017-06-22 14:28:50.905125] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (5)
The message "W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (5)" repeated 2 times between [2017-06-22 14:28:50.905125] and [2017-06-22 14:28:50.907589]
[2017-06-22 14:28:50.908376] I [MSGID: 104041] [glfs-resolve.c:914:__glfs_active_subvol] 0-ecv: switched to graph 64686370-3335-2d34-352e-6c61622e656e (0)
[2017-06-22 14:28:50.911613] W [MSGID: 122035] [ec-common.c:464:ec_child_select] 0-ecv-disperse-0: Executing operation with some subvolumes unavailable (5)
[2017-06-22 14:28:51.909942] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-0: changing port to 49160 (from 0)
[2017-06-22 14:28:51.913016] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-ecv-client-2: changing port to 49152 (from 0)
[2017-06-22 14:28:51.915213] E [socket.c:2360:socket_connect_finish] 0-ecv-client-0: connection to 10.70.35.45:49160 failed (Connection refused); disconnecting socket
[2017-06-22 14:28:51.917594] E [socket.c:2360:socket_connect_finish] 0-ecv-client-2: connection to 10.70.35.122:49152 failed (Connection refused); disconnecting socket

Comment 11 Atin Mukherjee 2017-06-27 03:48:58 UTC
upstream patch : https://review.gluster.org/17606

Comment 14 Nag Pavan Chilakam 2017-07-05 11:52:40 UTC
Retested the test case and saw that heal info now reports in about a second.
Hence moving to verified.
Test version: 3.8.4-32

Comment 16 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

