Bug 1046022

Summary: "gluster volume heal <vol-name> info" doesn't respond until self-heal is complete on the volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: glusterfs
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: spandura
Severity: high
Docs Contact: Anjana Suparna Sriram <asriram>
Priority: high
Version: 2.1
CC: grajaiya, pkarampu, sharne, spandura, vagarwal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0.57rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-25 08:10:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description SATHEESARAN 2013-12-23 08:57:34 UTC
Description of problem:

"gluster volume heal <vol-name> info" does not respond until self-heal has completed.

Consequences:
1. To a casual user, it looks like the command has hung, though in reality it has not.

2. The progress of self-heal can neither be tracked nor known exactly.

But once self-heal is complete, "heal info" responds with "Number of entries: 0" for all bricks, confirming that the self-heal is complete.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.51rhs.el6rhs

How reproducible:
Happened all the 5 times I tried

Steps to Reproduce:
I hit this bug in a "virt rhev" environment, so I am providing the steps for that setup.

1. Created a Trusted Storage Pool with 4 RHSS nodes
(i.e.) gluster peer probe <RHSS-Node>

2. Created a distribute-replicate volume with 8 bricks (2 bricks per RHSS node)
(i.e.) gluster volume create <vol-name> replica 2 <brick1> .. <brick8>

3. Optimized the volume for virt store
(i.e.) gluster volume set <vol-name> group virt

NOTE: Ownership of this volume is also set to 36:36, just for the RHEV environment

4. Started the volume
(i.e.) gluster volume start <vol-name>

5. Used this volume as the Storage Domain (Data Domain) in the datacenter
(i.e.) the domain used to store data images of VMs

6. Created 2 VMs and installed RHEL 6.5 on them

7. Brought down server2 (RHSS2) and server4 (RHSS4), so that at least one brick of each replica pair is up

8. Created 2 more VMs and installed RHEL 6.5 on them

9. Powered up the RHSS nodes that were brought down in step 7

10. Once the nodes were up, triggered self-heal (though background self-heal is on)
(i.e.) gluster volume heal <vol-name>

11. Checked the heal info
(i.e.) gluster volume heal <vol-name> info
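Step 7 only leaves one brick of every replica pair up if the two bricks of each pair sit on different nodes. With "replica 2", gluster groups consecutive bricks on the command line into replica sets, so the brick order in step 2 matters. The following is a minimal Python sketch of building that brick order; the helper name, volume name, host names and brick directories are hypothetical and not taken from the reported setup.

```python
def build_create_command(volume, node_pairs, brick_dirs, replica=2):
    """Build a 'gluster volume create' argument list.

    Consecutive bricks form one replica set, so each brick directory is
    emitted once per node of a pair, back to back.
    All names passed in are illustrative assumptions.
    """
    bricks = []
    for pair in node_pairs:
        for brick_dir in brick_dirs:
            for node in pair:
                bricks.append(f"{node}:{brick_dir}")
    return ["gluster", "volume", "create", volume,
            "replica", str(replica), *bricks]


cmd = build_create_command(
    "demo-vol",                                   # hypothetical volume name
    [("rhss1", "rhss2"), ("rhss3", "rhss4")],     # hypothetical node pairs
    ["/rhs/brick1/dir1", "/rhs/brick2/dir2"],     # hypothetical brick paths
)
print(" ".join(cmd))
```

With this ordering, taking down the second node of each pair (as in step 7) still leaves the first brick of every replica set online.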


Actual results:
"gluster volume heal <vol-name> info" did not respond for more than 15 minutes.
Initially I thought that the command had hung or died.

After some time (about 20 minutes), I got the output with "Number of entries: 0" for all bricks.

Expected results:
1. "gluster volume heal <vol-name> info" should respond immediately.
2. The progress of self-heal must be available to the user, or there should be some
indication that self-heal is in progress on the volume.
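Until the command itself responds promptly, a caller can at least bound how long it waits. The sketch below is an operator-side workaround, not the fix for this bug; the helper name run_bounded and the 60-second limit are assumptions made for illustration.

```python
import subprocess


def run_bounded(cmd, timeout_secs):
    """Run cmd and return its stdout, or None if it did not finish in time.

    subprocess.run() kills the child process when the timeout expires.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_secs
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# Hypothetical usage on a storage node (volume name is a placeholder):
#   out = run_bounded(["gluster", "volume", "heal", "<vol-name>", "info"], 60)
#   if out is None:
#       print("heal info did not respond within 60 seconds")
```

A None result only says that the command was slow; it says nothing about whether self-heal is still running.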

Additional info:

Comment 1 Pranith Kumar K 2013-12-23 09:56:48 UTC
As a way to fix false positives in heal info, we started taking locks to figure out whether files need self-heal or not. If the self-heal daemon wins the lock before heal info for every file, then this can happen. The bug description is a bit inaccurate: the locks are taken per file. Let's say we have files a, b and c which need self-heal. Both heal info (to find out whether a file needs self-heal) and the self-heal daemon (to do the actual heal) want to take locks. If, for each file, the self-heal daemon always gets the lock before heal info, it seems as though heal info doesn't respond until heal on the volume is complete. There are still false positives for metadata and entry self-heal.
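The per-file contention described above can be illustrated with a toy model. This is not gluster's implementation; it assumes that both processes walk the same files in the same order, that the self-heal daemon holds a file's lock for the whole heal, and that heal info holds it only for a short inspection. The function name and all durations are made up for illustration.

```python
def heal_info_finish_time(heal_secs, inspect_secs=1, daemon_wins_ties=True):
    """Return when heal info finishes inspecting every file (toy model).

    heal_secs: per-file heal durations for the self-heal daemon.
    Each file has its own lock; whoever reaches the file first takes it,
    and the other waits until it is released.
    """
    daemon_clock = 0
    info_clock = 0
    for heal in heal_secs:
        daemon_first = daemon_clock < info_clock or (
            daemon_clock == info_clock and daemon_wins_ties
        )
        if daemon_first:
            # Daemon holds the lock for the entire heal of this file.
            daemon_clock += heal
            # heal info waits for the release, then inspects.
            info_clock = max(info_clock, daemon_clock) + inspect_secs
        else:
            # heal info inspects first; the daemon waits briefly, then heals.
            info_clock += inspect_secs
            daemon_clock = max(daemon_clock, info_clock) + heal
    return info_clock


files = [600, 900, 1500]  # hypothetical heal times in seconds for a, b, c
print(heal_info_finish_time(files, daemon_wins_ties=True))   # 3001
print(heal_info_finish_time(files, daemon_wins_ties=False))  # 3
```

In the first call the daemon wins every lock, so heal info finishes only one inspection after the last heal completes, which matches the "no response until heal is complete" symptom. In the second call heal info gets ahead once and is never blocked again.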

Comment 2 Shalaka 2014-01-06 11:56:40 UTC
Please add DocText for this Known Issue.

Comment 3 SATHEESARAN 2014-01-08 15:42:02 UTC
(In reply to Pranith Kumar K from comment #1)
> As a way to fix false +ves in heal info, we started taking locks to figure
> out whether files need self-heal or not. If for all the files
> self-heal-daemon wins taking lock before self-heal-info, then this can
> happen. The bug description is a bit inaccurate. The locks are taken per
> file. lets say we have file a, b, c which need self-heal, both heal info (to
> find whether it needs self-heal) and self-heal-daemon (to do the actual
> heal) want to take locks. Now for each file if self-heal-daemon always gets
> the lock on the files before heal info. It seems like heal info doesn't
> respond until heal on the volume is complete. There are still false +ves for
> metadata and entry self-heal.

Pranith,

I hit a scenario where "gluster volume heal <vol-name> info" takes more than 50 minutes to respond, and I think this is too long.

Check the timestamps in the command output below,

<< Note Timestamp here when command was triggered,
[Wed Jan  8 20:08:57 UTC 2014 root.37.187:~ ] # gluster volume heal dr-imgstore info
Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/c33c0d51-e8f5-409d-9a52-fea048db0645/images/94b388b5-2906-43e8-b372-bd6bfce099f6/ff2fffbf-a14f-4727-9bea-8afa672e9bc8
Number of entries: 1

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/c33c0d51-e8f5-409d-9a52-fea048db0645/images/94b388b5-2906-43e8-b372-bd6bfce099f6/ff2fffbf-a14f-4727-9bea-8afa672e9bc8
Number of entries: 1

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
Number of entries: 0

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
Number of entries: 0

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/addbrick1/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/addbrick1/
Number of entries: 0

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/addbrick2/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/addbrick2/
Number of entries: 0

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/addbrick3/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/addbrick3/
Number of entries: 0

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/addbrick4/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/addbrick4/
Number of entries: 0

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/addbrick5/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/addbrick5/
Number of entries: 0

<<<<############### long hang 

[Wed Jan  8 20:59:50 UTC 2014 root.37.187:~ ] #  <<< Note timestamp
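For reference, the gap between the two prompt timestamps above can be computed as follows. This is a small Python sketch; the format string is an assumption about the shell prompt layout shown in this comment, and it relies on English month and weekday names.

```python
from datetime import datetime

# Layout of the date part of the prompt, e.g. "Wed Jan  8 20:08:57 UTC 2014"
PROMPT_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def elapsed_seconds(start, end):
    """Return the whole seconds between two prompt timestamps."""
    t0 = datetime.strptime(start, PROMPT_FORMAT)
    t1 = datetime.strptime(end, PROMPT_FORMAT)
    return int((t1 - t0).total_seconds())


secs = elapsed_seconds("Wed Jan  8 20:08:57 UTC 2014",
                       "Wed Jan  8 20:59:50 UTC 2014")
minutes, seconds = divmod(secs, 60)
print(f"{minutes} min {seconds} s")  # 50 min 53 s
```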

Comment 4 Gowrishankar Rajaiyan 2014-01-10 07:42:47 UTC
This may have an impact on documentation. Please check the relevant sections in the administration guide.

Comment 6 Pranith Kumar K 2014-01-13 08:20:52 UTC
This bug was introduced after bigbend and fixed before corbett, so there is no need to add any doc text. Please set the doc-text flag to '-'.

Comment 7 SATHEESARAN 2014-01-14 15:42:06 UTC
Tested with glusterfs-3.4.0.57rhs-1.el6rhs

"gluster volume heal <vol-name>" no longer hangs for a long time; it returns immediately.

[Tue Jan 14 15:49:30 UTC 2014 root.37.187:~ ] # gluster volume heal drvol
Launching heal operation to perform index self heal on volume drvol has been successful 
Use heal info commands to check status

[Tue Jan 14 15:50:51 UTC 2014 root.37.187:~ ] # gluster volume heal drvol info
Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
Number of entries: 1

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/6d845676-0267-4b44-9856-712feda16035/27f64b50-b1c1-4ce7-a3a6-08523efa1dfc - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/6d845676-0267-4b44-9856-712feda16035/27f64b50-b1c1-4ce7-a3a6-08523efa1dfc - Possibly undergoing heal
Number of entries: 1

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick3/drdir3/
Number of entries: 0

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick3/drdir3/
Number of entries: 0

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick4/drdir4/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/de44188d-1ed1-40cc-9373-cca801b23d6d/2f8fafc7-d755-4b5a-9cfe-fb0ce83b54d8 - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick4/drdir4/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/de44188d-1ed1-40cc-9373-cca801b23d6d/2f8fafc7-d755-4b5a-9cfe-fb0ce83b54d8 - Possibly undergoing heal
Number of entries: 1

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/add-dir1/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/add-dir1/
Number of entries: 0

[Tue Jan 14 15:51:12 UTC 2014 root.37.187:~ ] # gluster volume heal drvol info
Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
Number of entries: 1

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/6d845676-0267-4b44-9856-712feda16035/27f64b50-b1c1-4ce7-a3a6-08523efa1dfc - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick2/drdir2/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/6d845676-0267-4b44-9856-712feda16035/27f64b50-b1c1-4ce7-a3a6-08523efa1dfc - Possibly undergoing heal
Number of entries: 1

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick3/drdir3/
Number of entries: 0

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick3/drdir3/
Number of entries: 0

Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick4/drdir4/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/de44188d-1ed1-40cc-9373-cca801b23d6d/2f8fafc7-d755-4b5a-9cfe-fb0ce83b54d8 - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick4/drdir4/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/de44188d-1ed1-40cc-9373-cca801b23d6d/2f8fafc7-d755-4b5a-9cfe-fb0ce83b54d8 - Possibly undergoing heal
Number of entries: 1

Brick rhss3.lab.eng.blr.redhat.com:/rhs/brick1/add-dir1/
Number of entries: 0

Brick rhss4.lab.eng.blr.redhat.com:/rhs/brick1/add-dir1/
Number of entries: 0

As the problem reported in this bug is solved, this bug can be closed.

However, "gluster volume heal <vol-name> info" lists a few entries with the message "Possibly undergoing heal", and there are other entries without this message.
What is the significance of the entries with this message?
If this is the intended behavior, it has to be documented.
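To see at a glance which entries carry the "Possibly undergoing heal" marker, the output can be split per brick. The sketch below is a Python parser written against the output format pasted in this comment only; the function name is hypothetical, and other glusterfs versions may print a different layout.

```python
SUFFIX = " - Possibly undergoing heal"

# Two bricks copied from the heal info output in this comment.
SAMPLE = """\
Brick rhss1.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/1217280d-e8d5-4f79-826f-64514e6f5c56
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/e1503573-d342-442b-902d-f5cb55e48edc
/0218725d-3846-4c6d-b9d7-c05bd55c031b/master/vms/fe34200e-3614-4fbf-ab46-62ba6e39b20e
Number of entries: 4

Brick rhss2.lab.eng.blr.redhat.com:/rhs/brick1/drdir1/
/0218725d-3846-4c6d-b9d7-c05bd55c031b/images/7e4d8003-9248-4e82-8c41-9c4093de1623/b2dc01a7-4833-41c8-9e0f-84102f97b80d - Possibly undergoing heal
Number of entries: 1
"""


def parse_heal_info(text):
    """Map each brick to a list of (path, possibly_undergoing_heal) tuples."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = []
        elif line.startswith("/") and current is not None:
            healing = line.endswith(SUFFIX)
            path = line[:-len(SUFFIX)] if healing else line
            bricks[current].append((path, healing))
    return bricks


for brick, entries in parse_heal_info(SAMPLE).items():
    marked = sum(1 for _, healing in entries if healing)
    print(f"{brick}: {len(entries)} entries, {marked} possibly undergoing heal")
```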

Comment 8 Shalaka 2014-01-20 05:16:18 UTC
Cancelling need_info as requires_doc_text flag is set to '-' based on comment 6.

Comment 10 errata-xmlrpc 2014-02-25 08:10:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html