1643081 – RFE: Improve heal info performance(with serving even an approx number)


Summary: RFE: Improve heal info performance(with serving even an approx number)

Keywords:
Status:	CLOSED DUPLICATE of bug 1721355
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ravishankar N
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-10-25 13:26 UTC by Nag Pavan Chilakam
Modified:	2019-12-03 08:34 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-12-03 08:34:48 UTC
Embargoed:
Dependent Products:


Description Nag Pavan Chilakam 2018-10-25 13:26:32 UTC

Description of problem:
A user runs heal info to check the number of entries pending heal.
If there are thousands of files, the output takes a long time because of glfs behavior.

However, a user often just wants an approximate number.
Running ls -l in indices/xattrop takes less than a second, whereas heal info for about 60k files can take anywhere between 10 and 30 minutes, or more.
While ls on indices/xattrop may not give the exact number of files pending heal, it at least gives an approximate number.

We need a way to display heal info stats (a summary) that need not be accurate but must be served very quickly.
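
As a rough illustration of the approximation described above, the following Python sketch counts the entries in one brick's indices/xattrop directory. It is not how heal info works; it only mirrors the ls-based estimate. The function name `approx_pending_heals` is hypothetical, and the code assumes that the directory is readable on the brick and that any entry whose name does not start with `xattrop-` stands for one pending entry.

```python
import os


def approx_pending_heals(xattrop_dir):
    """Return an approximate count of entries pending heal on one brick.

    Assumption (not confirmed by this report): every directory entry
    whose name does not start with "xattrop-" represents one entry
    pending heal. In-flight operations can make the number inexact.
    """
    count = 0
    with os.scandir(xattrop_dir) as entries:
        for entry in entries:
            # Skip what is assumed to be the base index file.
            if not entry.name.startswith("xattrop-"):
                count += 1
    return count
```

The caller passes the full path to the brick's indices/xattrop directory. The result is per brick, so a volume-wide estimate would need the counts from every brick.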



Version-Release number of selected component (if applicable):
============
3.12.2-24

Comment 2 Krutika Dhananjay 2018-12-10 11:07:06 UTC

I'm curious to know if this is already solved through `gluster volume heal <VOL> statistics heal-count`.
And if it's true that it is not solved through this command, then I'd like to know why.

-Krutika

Comment 3 Ravishankar N 2018-12-10 11:21:15 UTC

statistics command gives historical data only of completed crawls and not currently pending heals. It is maintained in-memory and is lost if shd is restarted. Also it is not officially tested/supported downstream. Bhumika is working on implementing `gluster volume heal <VOL> info summary` to be a lot faster than what it is today.

Comment 4 Krutika Dhananjay 2018-12-10 12:06:54 UTC

(In reply to Ravishankar N from comment #3)
> statistics command gives historical data only of completed crawls and not
> currently pending heals. It is maintained in-memory and is lost if shd is
> restarted. Also it is not officially tested/supported downstream. Bhumika is
> working on implementing `gluster volume heal <VOL> info summary` to be a lot
> faster than what it is today.

Are you sure? As per code in upstream master, afr_xl_op() calls afr_shd_get_index_count() when the op is GF_SHD_OP_STATISTICS_HEAL_COUNT. This in turn fetches the number of hard links on the index file under indices/xattrop from each brick, effectively returning the number of files pending heal. This number may or may not be accurate depending on whether the in-flight fops put their indices under indices/dirty or indices/xattrop. But for the purpose of addressing the request in this rfe, shouldn't this be good enough?

-Krutika
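
The hard-link approach mentioned in comment 4 can be sketched in Python as below. This is a minimal illustration, not the afr_shd_get_index_count() implementation: the function name `index_link_count` is hypothetical, and the code assumes that the base index file has a name starting with `xattrop-` and that each pending entry is one extra hard link to it.

```python
import os


def index_link_count(xattrop_dir):
    """Estimate pending entries from the base index file's link count.

    Assumptions: the base file's name starts with "xattrop-", and each
    pending entry is one additional hard link to that file. Entries
    recorded under indices/dirty are not reflected here.
    """
    base_names = [n for n in os.listdir(xattrop_dir) if n.startswith("xattrop-")]
    if not base_names:
        # No base file found: nothing to count under these assumptions.
        return 0
    st = os.stat(os.path.join(xattrop_dir, base_names[0]))
    # Subtract the base file's own directory entry from the link count.
    return max(st.st_nlink - 1, 0)
```

A single stat call replaces a full directory listing, which is why this kind of count stays cheap even with tens of thousands of entries.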

Comment 5 Ravishankar N 2018-12-10 12:35:43 UTC

(In reply to Krutika Dhananjay from comment #4)
> (In reply to Ravishankar N from comment #3)
> > statistics command gives historical data only of completed crawls and not
> > currently pending heals. It is maintained in-memory and is lost if shd is
> > restarted. Also it is not officially tested/supported downstream. Bhumika is
> > working on implementing `gluster volume heal <VOL> info summary` to be a lot
> > faster than what it is today.
> 
> Are you sure? As per code in upstream master, afr_xl_op() calls
> afr_shd_get_index_count() when the op is GF_SHD_OP_STATISTICS_HEAL_COUNT.
> This in turn fetches the number of hard links on the index file under
> indices/xattrop from each brick, effectively returning the number of files
> pending heal. This number may or may not be accurate depending on whether
> the in-flight fops put their indices under indices/dirty or indices/xattrop.

Sorry, you're right. I missed the 'heal-count' part in your comment. Yes, GF_SHD_OP_STATISTICS_HEAL_COUNT should give the right number if there are no entries in dirty.

> But for the purpose of addressing the request in this rfe, shouldn't this be
> good enough?
> 

I think it depends on how the information is consumed. If we want a way to definitely know that there are no pending heals so that a rolling upgrade can be performed, then we need to process indices/dirty as well (under appropriate locks).

> -Krutika

Comment 6 John Strunk 2019-01-16 01:17:23 UTC

Ravi wanted me to add my observations of 'heal statistics heal-count' vs 'heal info summary' from OSIO...

I use the count of pending heals as a part of the Ansible automation for automated rolling upgrades. Once a server has been upgraded, it is rebooted, and one of the above heal commands is issued in a loop until the count is zero. Afterward, the next server is upgraded.

Initially, I had used "statistics heal-count", but the observed behavior was that the returned value would (sometimes... 50%?) never decrease to zero even though the files had been healed. However, if I issued "heal info" in a separate terminal (same server), the next iteration of "heal-count" would return the correct count of zero.

Since learning that "heal-count" is not supported and at Ravi's suggestion, I moved to using "info summary". This provides the correct counts, but I have found that the time required for the command to execute seems to be proportional to the number of outstanding heals for the volume.

For my use case, I'm really just looking for a binary signal of whether there are outstanding heals or not. A small lag (1 min or less) is acceptable in the case where the true count is zero, but the command could indicate heals remain. The other direction (false negative) would not be acceptable though... If there are heals, the command must reflect that.
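
The polling loop and the "no false negatives" requirement described above could look roughly like the Python sketch below. It is an illustration under assumptions, not the Ansible automation itself: `wait_until_healed` and the `get_pending_count` callable are hypothetical names, and how the count is obtained (for example by parsing the output of a heal command) is left to the caller. Any failure to obtain a count is treated as "heals may remain", so the loop never reports success on missing data.

```python
import time


def wait_until_healed(get_pending_count, poll_seconds=10.0,
                      timeout_seconds=3600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the pending-heal count is zero or the timeout expires.

    get_pending_count is a caller-supplied (hypothetical) callable that
    returns an integer count. Returns True only when a count of exactly
    zero was observed; returns False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            pending = get_pending_count()
        except Exception:
            # Unknown state: be conservative and assume heals remain.
            pending = None
        if pending == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A rolling-upgrade script would call this once per upgraded server and move on to the next server only when it returns True.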

Comment 7 Yaniv Kaul 2019-07-22 08:58:36 UTC

Do we have any estimate of the work? It seems to be a very popular feature request.

Comment 8 Ravishankar N 2019-07-23 06:07:07 UTC

(In reply to Yaniv Kaul from comment #7)
> Do we have any estimate of the work? It seems to be a very popular feature
> request.

Yaniv, you are right, there are quite a lot of bugs for improving heal info performance:
Bug 1643559 - Heal info summary taking very long time(more than 5 hrs) hence rendering its purpose not useful
Bug 1483977 - [afr]: info split-brain takes longer time about (1m 15secs) to show the output with 0 entries
Bug 1721355 - Heal Info is hung when I/O is in progress on a gluster block volume

We are aiming to make heal info {summary|split-brain} itself more responsive for rhgs-3.5.1. I think once that is done, we might not need an entirely new command. But we will re-evaluate this bug once the heal info fixes are in.

Comment 9 Pranith Kumar K 2019-12-03 08:34:48 UTC

This bug is being fixed as part of https://bugzilla.redhat.com/show_bug.cgi?id=1721355. If this issue is seen even after the fix, please feel free to re-open this bug.

*** This bug has been marked as a duplicate of bug 1721355 ***
