Description of problem:
========================
The rebalance process that was triggered after adding bricks hung indefinitely. I was testing the VM use case: I converted the 1*3 volume to a 2*3 volume and triggered the rebalance from the RHEV engine. The rebalance operation did not complete. This looks like it happened because of a stale lock. The frame status remains the same in the volume state dumps taken at a 15-minute interval.

Version-Release number of selected component (if applicable):
============================================================
glusterfs-3.8.4-11

How reproducible:
=================
One time

Steps to Reproduce:
===================
1. Have an RHV-RHGS setup with 3 RHGS nodes and 2 clients (hosts).
2. Create a 1*3 volume.
3. Create an application VM using the storage created in step 2.
4. Convert the volume to a 2*3 volume and trigger the rebalance.

Actual results:
===============
The rebalance process that was triggered after adding bricks hung indefinitely.

Expected results:
=================
The rebalance process should not hang.

Additional info:
The frame status remains the same in the volume state dumps taken at a 15-minute interval.

[root@dhcp42-35 gluster]# grep -A2 -B2 -E "ACTIVE|BLOCK" ./rhs-brick2-br1.16325.dump.1484126221
lock-dump.domain.domain=Dis-Rep1-replicate-1:metadata
lock-dump.domain.domain=dht.layout.heal
inodelk.inodelk[0](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22549, owner=8084ba1bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:20
inodelk.inodelk[1](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22544, owner=78abb91bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[2](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=9c58b51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[3](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=5cccb51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:15
inodelk.inodelk[4](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=a8a7410e827f0000, client=0x7f0fe410c710, connection-id=dhcp43-245.lab.eng.blr.redhat.com-23190-2017/01/11-05:57:18:82599-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[5](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=9854e5d9147f0000, client=0x7f0fdc000cb0, connection-id=dhcp42-105.lab.eng.blr.redhat.com-7831-2017/01/11-05:57:18:96386-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[6](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=20e66ccb027f0000, client=0x7f0fe40f13c0, connection-id=dhcp42-35.lab.eng.blr.redhat.com-24794-2017/01/11-05:57:13:32093-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:24
lock-dump.domain.domain=Dis-Rep1-replicate-1
[root@dhcp42-35 gluster]#

[root@dhcp42-35 gluster]# grep -A2 -B2 -E "ACTIVE|BLOCK" ./rhs-brick2-br1.16325.dump.1484129309
lock-dump.domain.domain=Dis-Rep1-replicate-1:metadata
lock-dump.domain.domain=dht.layout.heal
inodelk.inodelk[0](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22549, owner=8084ba1bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:20
inodelk.inodelk[1](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22544, owner=78abb91bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[2](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=9c58b51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:16
inodelk.inodelk[3](ACTIVE)=type=READ, whence=0, start=0, len=0, pid = 22527, owner=5cccb51bd57f0000, client=0x7f0fdc0081e0, connection-id=rhs-client30.lab.eng.blr.redhat.com-11057-2017/01/10-06:35:34:507910-Dis-Rep1-client-3-15-0, granted at 2017-01-11 05:49:15
inodelk.inodelk[4](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=a8a7410e827f0000, client=0x7f0fe410c710, connection-id=dhcp43-245.lab.eng.blr.redhat.com-23190-2017/01/11-05:57:18:82599-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[5](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=9854e5d9147f0000, client=0x7f0fdc000cb0, connection-id=dhcp42-105.lab.eng.blr.redhat.com-7831-2017/01/11-05:57:18:96386-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:23
inodelk.inodelk[6](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551613, owner=20e66ccb027f0000, client=0x7f0fe40f13c0, connection-id=dhcp42-35.lab.eng.blr.redhat.com-24794-2017/01/11-05:57:13:32093-Dis-Rep1-client-3-0-0, blocked at 2017-01-11 05:57:24
lock-dump.domain.domain=Dis-Rep1-replicate-1
[root@dhcp42-35 gluster]#
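
For anyone who wants to repeat the comparison without eyeballing the grep output, here is a rough sketch of how two brick state dumps could be diffed for locks that have not changed. It is not part of glusterfs: the helper names (parse_locks, stuck_locks) are hypothetical, and the regular expression assumes the inodelk line layout shown above, with one entry per line.

```python
import re
import sys

# Matches one inodelk entry as it appears in the dumps above, e.g.
# inodelk.inodelk[0](ACTIVE)=type=READ, ..., owner=8084ba1bd57f0000, ..., granted at 2017-01-11 05:49:20
# Assumption: the statedump line layout is the same as in this report.
LOCK_RE = re.compile(
    r"inodelk\.inodelk\[\d+\]\((?P<state>ACTIVE|BLOCKED)\)="
    r"type=(?P<type>\w+),.*?owner=(?P<owner>[0-9a-f]+),.*?"
    r"(?:granted|blocked) at (?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def parse_locks(text):
    """Return the set of (state, type, owner, timestamp) tuples found in a dump."""
    return {
        (m["state"], m["type"], m["owner"], m["when"])
        for m in LOCK_RE.finditer(text)
    }


def stuck_locks(older_dump, newer_dump):
    """Return the locks that appear, unchanged, in both dumps."""
    return parse_locks(older_dump) & parse_locks(newer_dump)


# Hypothetical command-line use: python stuck_locks.py <older dump> <newer dump>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as older, open(sys.argv[2]) as newer:
        for lock in sorted(stuck_locks(older.read(), newer.read())):
            print(lock)
```

Run against the two dumps above, this should report all seven entries (four ACTIVE READ and three BLOCKED WRITE), since they are identical in both files. The locks are keyed on owner plus timestamp rather than on the inodelk index, because the index could shift if other locks come and go between dumps.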
Assigning this to Raghavendra G as he has already looked at the setup.
rebalance cli cmd status:
=========================
~]# gluster volume rebalance Dis-Rep1 status
                             Node  Rebalanced-files         size      scanned     failures      skipped        status  run time in h:m:s
                        ---------       -----------  -----------  -----------  -----------  -----------  ------------     --------------
                        localhost                 6      260.5MB           18            0            0   in progress            5:43:34
dhcp43-245.lab.eng.blr.redhat.com                 0       0Bytes            0            0            0   in progress            5:43:34
dhcp42-105.lab.eng.blr.redhat.com                 0       0Bytes            0            0            0   in progress            5:43:34
volume rebalance: Dis-Rep1: success
I hit this issue only once in one week of RHV-RHGS testing.