Bug 1597648

Summary: "gluster vol heal <volname> info" is locked forever
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Vijay Avuthu <vavuthu>
Component: replicateAssignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA QA Contact: Vijay Avuthu <vavuthu>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.3CC: amukherj, rcyriac, rhinduja, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vavuthu
Target Milestone: ---Keywords: Regression, ZStream
Target Release: RHGS 3.3.1 Async   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-54.14 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1597654 1598189 (view as bug list) Environment:
Last Closed: 2018-07-19 06:00:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1597654, 1598189    

Description Vijay Avuthu 2018-07-03 10:56:56 UTC
Description of problem:

While running the automation cases, found out "gluster vol heal <volname> info" command is not returning the o/p even though there is not much heal pending

Version-Release number of selected component (if applicable):

glusterfs-3.8.4-54.13.el7rhgs.x86_64

How reproducible:

Always

Steps to Reproduce:

Automation test ID : test_self_heal_while_running_dbench

subcase :

EXECUTING IO ON MOUNT: dbench_case8


Actual results:

# date
Tue Jul  3 06:43:34 EDT 2018
#
# tail -f distaf_tests_afr_glusterfs_9.log

2018-07-02 12:48:40,271 INFO run Executing gluster volume heal testvol info > /tmp/heal_info on rhsauto056.lab.eng.blr.redhat.com


Expected results:


Additional info:

From state dump:

[root@rhsauto047 gluster]# less bricks-brick0-testvol_brick3.15456.dump.1530610309

inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 27761, owner=0000000000
000000, client=0x7ff9380bb370, connection-id=rhsauto028.lab.eng.blr.redhat.com-27361-2018/07/0
2-15:06:17:567933-testvol-client-3-0-8, granted at 2018-07-02 16:53:38
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 27761, owner=40080fd8ea7f0000, client=0x7ff9380bb370, connection-id=rhsauto028.lab.eng.blr.redhat.com-27361-2018/07/02-15:06:17:567933-testvol-client-3-0-8, blocked at 2018-07-03 09:22:03

brick log:

[2018-07-02 16:53:38.542862] E [inodelk.c:304:__inode_unlock_lock] 0-testvol-locks:  Matching lock not found for unlock 0-9223372036854775807, by 40080fd8ea7f0000 on 0x7ff9380bb370
[2018-07-02 16:53:38.542899] E [MSGID: 115053] [server-rpc-fops.c:325:server_finodelk_cbk] 0-testvol-server: 4274661: FINODELK 4 (eac1b574-bbfd-4bad-a358-96727d37d822) ==> (Invalid argument) [Invalid argument]

Comment 3 Ravishankar N 2018-07-03 11:05:27 UTC
Karthik tells me that Pranith found what the issue is and is working on the fix, hence assigning it to him.

Comment 5 Pranith Kumar K 2018-07-03 16:28:00 UTC
hi Vijay,
    As per code I could only find one possibility of the issue we saw happening. As discussed I created test rpms which I believe should fix the issue. Please let me know how this build performs in that test-case.

Link to test-rpm:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16982140

Pranith

Comment 7 Vijay Avuthu 2018-07-04 16:02:41 UTC
Update:
========

I have run the same test case on below build ( took from comment 5 ) 

glusterfs-3.8.4-54.13.3.git8f9ead154.el7rhgs.x86_64

and the test case passed , there is no lagging in getting volume heal info

2018-07-04 11:13:11,495 INFO wrapper Ending test: test_self_heal_while_running_dbench
2018-07-04 11:13:11,496 INFO wrapper Test execution time for test_self_heal_while_running_dbench: 5:41:22
2018-07-04 11:13:11,497 INFO <module> Total Tests Executed : 1
2018-07-04 11:13:11,497 INFO <module> Total Tests Passed : 1
2018-07-04 11:13:11,497 INFO <module> Total Tests Failed : 0
2018-07-04 11:13:11,498 INFO <module> List of Failed tests: [""]


> logs with new test-rpm which shows "heal info" ( this is the point where heal info stuck last time ).

2018-07-04 08:29:09,501 INFO run root.eng.blr.redhat.com (cp): gluster volume heal testvol info > /tmp/heal_info
2018-07-04 08:29:22,928 INFO run RETCODE: 0
2018-07-04 08:29:22,929 INFO get_heal_info Successfully got heal info on the volume testvol
2018-07-04 08:29:22,929 INFO run root.eng.blr.redhat.com (cp): gluster volume info testvol --xml
2018-07-04 08:29:23,135 INFO run RETCODE: 0

Comment 9 Vijay Avuthu 2018-07-10 11:27:31 UTC
Update:
========

glusterfs-3.8.4-54.14.el7rhgs.x86_64


> Ran the same test in above build and test cases passed without any errors

 INFO wrapper Testcase test_self_heal_while_running_dbench passed
 INFO wrapper Ending test: test_self_heal_while_running_dbench
 INFO wrapper Test execution time for test_self_heal_while_running_dbench: 3:18:08


Hence moving to Verified

Comment 11 errata-xmlrpc 2018-07-19 06:00:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2222