Bug 1643559

Summary:	Heal info summary takes a very long time (more than 5 hrs), which defeats its purpose
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Nag Pavan Chilakam <nchilaka>
Component:	replicate	Assignee:	Ravishankar N <ravishankar>
Status:	CLOSED DUPLICATE	QA Contact:	Nag Pavan Chilakam <nchilaka>
Severity:	high	Docs Contact:
Priority:	high
Version:	rhgs-3.4	CC:	amukherj, jstrunk, pkarampu, pprakash, ravishankar, rhs-bugs, rkavunga, sheggodu, storage-qa-internal
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-12-03 08:34:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Nag Pavan Chilakam 2018-10-26 14:34:31 UTC

Description of problem:
=======================
Heal info summary takes a very long time to display its output.
In my test bed (during an upgrade test), it took more than 5 hours.
I understand that the time depends on the number of pending heals, but 5+ hours is simply not acceptable.
More than 5 lakh (500,000) files were pending heal when heal info summary was triggered, but by the time the output was displayed, only about 50K files were still pending.

The summary is supposed to give a clear, concise count of pending heals (approximate numbers would be fine).
As it stands, the summary output is useless for that purpose.

We need to understand how the summary is calculated: if entries are scanned continuously while they are being healed, that could explain the delay and why the final output takes so long.

[root@dhcp35-140 ~]# for i in $(gluster v list);do echo "#####  vol $i ###############";time gluster v heal $i info|grep ntries;echo "#####################";done
#####  vol arbo ###############
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0

real	0m1.540s
user	0m0.216s
sys	0m0.220s
#####################
#####  vol basevol-10 ###############
Number of entries: 52193
Number of entries: 0
Number of entries: 0

real	301m18.394s
user	4m50.940s
sys	6m20.620s
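
The loop above prints one "Number of entries" line per brick. As a minimal sketch for reading such output, the helper below totals those lines per volume. The name sum_entries is hypothetical, and it assumes only the "Number of entries: <count>" line format shown in the transcript; lines whose count is not a plain number are skipped.

```shell
#!/usr/bin/env bash
# Sum the per-brick "Number of entries:" counts found on stdin.
# sum_entries is a hypothetical helper name, not part of the gluster CLI.
sum_entries() {
    # Only add values that are plain non-negative integers; anything else is skipped.
    awk '/Number of entries:/ && $NF ~ /^[0-9]+$/ { sum += $NF }
         END { print sum + 0 }'
}

# Usage (assumes the gluster CLI is installed and the volumes exist):
#   for vol in $(gluster v list); do
#       printf '%s: %s\n' "$vol" "$(gluster v heal "$vol" info | sum_entries)"
#   done
```

This only post-processes the output; it does not make the underlying heal info command any faster.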




Version-Release number of selected component (if applicable):
====================
3.4.0 -> 3.4.1 upgrade (command issued on 3.4.1)

How reproducible:
==================
1/1, but it should be fairly reproducible

Steps to Reproduce:
1. Create a 6-node brick-mux setup.
2. Create six 1x3 volumes and one 2x(2+1) arbiter volume.
3. Mount each volume on one client.
4. Run I/O on the mounts (mostly untars).
5. Upgrade 1 or 2 nodes at a time (making sure nodes under maintenance together do not belong to the same replica set).
6. Issue heal info summary while the upgrade cycle is in progress (i.e. 4 nodes on 3.4.1 and 2 on 3.4.0).

Actual results:
===============
Heal info summary takes a very long time (more than 5 hours in this run).

Expected results:
=================
Heal info summary should return a clear, concise count of pending heals quickly, even if the numbers are approximate.

Comment 2 Atin Mukherjee 2018-11-11 18:28:53 UTC

Ravi/Karthik - Can one of you have a look at this BZ and do the first pass analysis?

Comment 3 Nag Pavan Chilakam 2018-11-28 11:23:57 UTC

Proposing this for 3.4.3, as the performance leaves customers with a bad experience.

Comment 7 Pranith Kumar K 2019-12-03 08:34:01 UTC

This bug is being fixed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1721355. If this issue is seen even after the fix, please feel free to re-open this bug.

*** This bug has been marked as a duplicate of bug 1721355 ***