Bug 1347257 - spurious heal info as pending heal entries never end on an EC volume while IOs are going on
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks: 1351522 1366815 1383913
 
Reported: 2016-06-16 11:57 UTC by Nag Pavan Chilakam
Modified: 2019-04-03 09:28 UTC (History)
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-3
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1366815 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:36:57 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Nag Pavan Chilakam 2016-06-16 11:57:09 UTC
Description of problem:
================
The heal info command on a disperse (EC) volume shows files that are currently being written as pending heal entries.
When we want to update a 3-node EC cluster, we need to follow these steps:
1) identify a node and make sure that at most the redundant number of bricks are brought down
2) kill glusterfs, glusterfsd and glusterd on that node
3) update the node
4) bring glusterd back up
After this we need to update the next node, but before that we have to make sure healing is complete on the node that was just updated.
For this we use the heal info command.
But if IO is going on at the same time, heal info on an EC volume, unlike on an afr volume, never drops to zero entries, because the files currently being written keep showing up as pending heal entries.
Because of this, an admin can never be sure when it is safe to update the next node, since it looks as though heals are still pending. However, these entries are spurious.
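For reference, a minimal sketch of the per-node sequence (the volume name "ecvol", the mount point and the yum command are placeholders for the actual environment; a systemd-based node is assumed):

    # on the node being updated: stop all gluster processes
    systemctl stop glusterd
    pkill glusterfs
    pkill glusterfsd

    # update the node (placeholder for the actual update procedure)
    yum update 'glusterfs*'

    # bring glusterd back up; it restarts the local bricks
    systemctl start glusterd

    # from any node: wait for heal info to report zero entries
    # on every brick before moving to the next node
    watch -n 10 'gluster volume heal ecvol info | grep -i "number of entries"'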

Steps and other details are available in bug 1347251 - IO error seen with Rolling or non-disruptive upgrade of a distribute-disperse(EC) volume from 3.1.2 to 3.1.3


Note: this can probably also be hit (not tested) when we bring down bricks of an EC volume, bring them back up, and then keep waiting for heal to complete while IO is going on.
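A rough sketch of that untested variant, with a hypothetical volume "ecvol" (staying within the redundancy count when killing bricks):

    # find the PID of one brick process
    gluster volume status ecvol

    # kill that brick while IO keeps running on the client mount
    kill <brick-pid-from-status-output>

    # bring the killed brick back up
    gluster volume start ecvol force

    # repeatedly check whether the pending entries ever drain to zero
    gluster volume heal ecvol info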

Comment 2 Ravishankar N 2016-07-04 11:06:40 UTC
I was not able to hit the issue when I did a rolling upgrade of a 2-node 4+2 disperse volume from rhgs-3.1.2 to 3.1.3 with IO (dd'ing to a file) happening from a 3.1.2 client. Heal info was showing entries as long as IO was happening (shd was waiting for locks). Once the IO stopped, healing resumed and the entry count came down to zero.
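For context, the IO load was of roughly this shape (mount path and file name below are placeholders, not the exact commands used):

    # continuous write from the old (3.1.2) client mount during the upgrade
    dd if=/dev/urandom of=/mnt/ecvol/testfile bs=1M count=10240

    # in parallel, from any server node
    gluster volume heal ecvol info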

Nag, do you have a more consistent reproducer for the issue?
Also, please provide the getfattr outputs of the files from all bricks, along with the logs. If the heal-info entries are spurious, then the trusted.ec* attributes of the file must be the same on all bricks (indicating that no heal is actually pending).
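For reference, a sketch of how those xattrs can be collected (brick path and file name are placeholders; run it on every brick that holds the file):

    getfattr -d -m . -e hex /bricks/brick1/path/to/file

    # compare the trusted.ec.* values (e.g. trusted.ec.version, trusted.ec.dirty)
    # across bricks; identical values indicate no heal is really pending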

Comment 4 Ashish Pandey 2016-10-12 04:02:56 UTC
The patch has been posted and merged upstream:
http://review.gluster.org/#/c/15543/

Comment 6 Atin Mukherjee 2016-10-17 04:00:10 UTC
3.9 upstream patch : http://review.gluster.org/15627

Comment 9 Nag Pavan Chilakam 2017-02-06 13:57:51 UTC
QA verification:
I am not seeing this issue of spurious entries anymore on 3.8.4-13.
Note that the file currently being written can be seen in heal info due to a timing issue, which is acceptable.
Hence moving to Verified.
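One way to watch this during verification (volume name "ecvol" is a placeholder) is to poll the per-brick entry counts while IO is running and confirm they drain once the writes to a file complete:

    watch -n 5 'gluster volume heal ecvol info | grep -i "number of entries"'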

Comment 10 Nag Pavan Chilakam 2017-02-06 14:26:10 UTC
(In reply to nchilaka from comment #9)
> QA verification:
> I am not seeing this issue of spurious entries anymore on 3.8.4-13.
> Note that the file currently being written can be seen in heal info due to a
> timing issue, which is acceptable.
Note: I am no longer seeing the file being written show up as heal pending (which was the whole point of this bz); however, entries can still be seen when there is a network partition, which is expected.
> Hence moving to Verified.

So the fix is working.

Comment 12 errata-xmlrpc 2017-03-23 05:36:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

