Bug 1412136

Summary: [RHV-RHGS]: Rebalance process hung infinitely, which triggered after adding bricks.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Byreddy <bsrirama>
Component: distribute
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED WORKSFORME
QA Contact: Prasad Desala <tdesala>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: bsrirama, nbalacha, rcyriac, rgowdapp, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-22 08:24:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Byreddy 2017-01-11 10:43:20 UTC
Description of problem:
========================
The rebalance process, which was triggered after adding bricks, hung indefinitely.

I was testing a VM use case: I converted the 1*3 volume to a 2*3 volume and triggered the rebalance from the RHEV engine.

The rebalance operation did not complete.

It looks like this issue happened because of a stale lock.
The frame status remains the same in the volume statedumps taken at a 15-minute interval.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.8.4-11

How reproducible:
=================
One time


Steps to Reproduce:
===================
1. Have an RHV-RHGS setup with 3 RHGS nodes and 2 clients (hosts).
2. Create a 1*3 volume.
3. Create an application VM using the storage created in step 2.
4. Convert the volume to a 2*3 volume and trigger the rebalance.


Actual results:
===============
The rebalance process, which was triggered after adding bricks, hung indefinitely.


Expected results:
=================
The rebalance process should not hang.

Additional info:

Comment 3 Byreddy 2017-01-11 10:49:49 UTC
The frame status remains the same in the volume statedumps taken at a 15-minute interval.

[root@dhcp42-35 gluster]# grep -A2 -B2 -E "ACTIVE|BLOCK" ./rhs-brick2-br1.16325.dump.1484126221
lock-dump.domain.domain=Dis-Rep1-replicate-1:metadata
lock-dump.domain.domain=dht.layout.heal
inodelk.inodelk[0](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22549, owner=8084ba1bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:20
inodelk.inodelk[1](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22544, owner=78abb91bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[2](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=9c58b51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[3](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=5cccb51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:15
inodelk.inodelk[4](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=a8a7410e827f0000, client=0x7f0fe410c710, connection-id=dhcp43-245.lab.eng.blr.redhat.com-23190-2017/01/11-05:57:18:82599-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[5](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=9854e5d9147f0000, client=0x7f0fdc000cb0, connection-id=dhcp42-105.lab.eng.blr.redhat.com-7831-2017/01/11-05:57:18:96386-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[6](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=20e66ccb027f0000, client=0x7f0fe40f13c0, connection-id=dhcp42-35.lab.eng.blr.redhat.com-24794-2017/01/11-05:57:13:32093-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:24
lock-dump.domain.domain=Dis-Rep1-replicate-1

[root@dhcp42-35 gluster]# 
[root@dhcp42-35 gluster]# grep -A2 -B2 -E "ACTIVE|BLOCK"  ./rhs-brick2-br1.16325.dump.1484129309
lock-dump.domain.domain=Dis-Rep1-replicate-1:metadata
lock-dump.domain.domain=dht.layout.heal
inodelk.inodelk[0](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22549, owner=8084ba1bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:20
inodelk.inodelk[1](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22544, owner=78abb91bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[2](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=9c58b51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[3](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=5cccb51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:15
inodelk.inodelk[4](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=a8a7410e827f0000, client=0x7f0fe410c710, connection-id=dhcp43-245.lab.eng.blr.redhat.com-23190-2017/01/11-05:57:18:82599-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[5](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=9854e5d9147f0000, client=0x7f0fdc000cb0, connection-id=dhcp42-105.lab.eng.blr.redhat.com-7831-2017/01/11-05:57:18:96386-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[6](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=20e66ccb027f0000, client=0x7f0fe40f13c0, connection-id=dhcp42-35.lab.eng.blr.redhat.com-24794-2017/01/11-05:57:13:32093-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:24
lock-dump.domain.domain=Dis-Rep1-replicate-1

[root@dhcp42-35 gluster]#
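
The comparison of the two dumps above can also be done mechanically. The following is a minimal sketch, not part of the original investigation: it parses the inodelk lines in the format shown in the grep output and reports the entries that appear unchanged in both dumps. The function names (parse_locks, signed64, unchanged_locks) are hypothetical helpers introduced here for illustration. The signed64 helper shows that the pid 18446744073709551613 on the BLOCKED entries is -3 when read as a signed 64-bit value; that this negative pid identifies the rebalance process is an assumption, not something the dumps themselves state.

```python
import re
import sys

# Matches one inodelk entry in the format seen in the grep output above.
LOCK_RE = re.compile(
    r"inodelk\.inodelk\[\d+\]\((?P<state>ACTIVE|BLOCKED)\)="
    r"type=(?P<type>\w+),.*?pid = (?P<pid>\d+), owner=(?P<owner>[0-9a-f]+),"
    r".*?connection-id=(?P<conn>\S+), (?:granted|blocked) at "
    r"(?P<when>[\d-]+ [\d:]+)"
)


def parse_locks(text):
    """Return the set of lock entries found in statedump text."""
    locks = set()
    for line in text.splitlines():
        m = LOCK_RE.search(line)
        if m:
            locks.add((m["state"], m["type"], int(m["pid"]),
                       m["owner"], m["conn"], m["when"]))
    return locks


def signed64(pid):
    """Reinterpret an unsigned 64-bit pid as a signed value."""
    return pid - (1 << 64) if pid >= (1 << 63) else pid


def unchanged_locks(dump_a, dump_b):
    """Lock entries present, byte-for-byte identical, in both dumps."""
    return parse_locks(dump_a) & parse_locks(dump_b)


if __name__ == "__main__":
    # Usage (assumed invocation): python compare_locks.py <dump1> <dump2>
    with open(sys.argv[1]) as a, open(sys.argv[2]) as b:
        stuck = unchanged_locks(a.read(), b.read())
    for state, ltype, pid, owner, conn, when in sorted(stuck):
        print(state, ltype, signed64(pid), owner, conn, when)
```

Entries that survive the intersection were granted or blocked at the same timestamp in both dumps, which is the pattern described above: four ACTIVE READ locks held through the rhs-client30 connection and three BLOCKED WRITE locks queued behind them.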

Comment 4 Nithya Balachandran 2017-01-11 10:53:23 UTC
Assigning this to Raghavendra G as he has already looked at the setup.

Comment 6 Byreddy 2017-01-11 11:41:46 UTC
Rebalance CLI command status:
=============================
 ~]# gluster volume rebalance Dis-Rep1 status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                6       260.5MB            18             0             0          in progress        5:43:34
       dhcp43-245.lab.eng.blr.redhat.com                0        0Bytes             0             0             0          in progress        5:43:34
       dhcp42-105.lab.eng.blr.redhat.com                0        0Bytes             0             0             0          in progress        5:43:34
volume rebalance: Dis-Rep1: success
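
To check whether the counters in this table move at all, two captures of the status output can be compared. The sketch below is an illustration under assumptions, not a tool used in this report: it parses rows in the column layout shown above and lists the nodes that are still "in progress" but whose scanned count did not change between captures. The names parse_status and no_progress are hypothetical, and only one capture is recorded in this bug, so the second capture would have to be taken separately.

```python
import re

# One data row of `gluster volume rebalance <vol> status` as laid out above:
# node, rebalanced files, size, scanned, failures, skipped, status, h:m:s.
ROW_RE = re.compile(
    r"^\s*(?P<node>\S+)\s+(?P<files>\d+)\s+(?P<size>\S+)\s+(?P<scanned>\d+)"
    r"\s+(?P<failures>\d+)\s+(?P<skipped>\d+)\s+(?P<status>.+?)\s+"
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s*$"
)


def parse_status(text):
    """Map node name -> scanned count, status and run time in seconds."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:  # header, separator and summary lines do not match
            rows[m["node"]] = {
                "scanned": int(m["scanned"]),
                "status": m["status"],
                "seconds": int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"]),
            }
    return rows


def no_progress(earlier, later):
    """Nodes still in progress whose scanned count did not advance."""
    before, after = parse_status(earlier), parse_status(later)
    return sorted(
        node for node, row in after.items()
        if row["status"] == "in progress"
        and node in before
        and row["scanned"] == before[node]["scanned"]
    )
```

An unchanged scanned count is only a heuristic for a hang; a node can legitimately stay at the same count for a while when it is migrating a large file.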

Comment 10 Byreddy 2017-01-16 09:41:27 UTC
I hit this issue only once in one week of RHV-RHGS testing.