Description of problem:
On a three-node cluster with 40 volumes (10 replica, 10 arbiter, 10 distributed-replicate and 10 distributed), mounted 2 volumes of each type on two clients. Performed an in-service upgrade (from RHGS 3.4.0 to RHGS 3.4.0-async) one node at a time. Two nodes upgraded successfully and the files healed successfully on those two nodes. On the third node, only 20-25 files healed after the upgrade; all the remaining files that require heal stay pending. Healing of files remains in the pending state even after manually triggering heal on the volume.

Version-Release number of selected component (if applicable):
glusterfs-server-3.12.2-18.1.el7rhgs.x86_64

How reproducible:
1/1
The setup is in the same state and has been handed over to DEV for debugging.

Steps to Reproduce:
1. On a three-node setup, create 40 volumes as mentioned in the description.
2. Upgrade the nodes one by one from RHGS 3.4.0 (live) to 3.4.0-async, starting with the data-brick nodes.
3. After each upgrade, wait for heal to complete.
4. Heal completed successfully on the first two nodes.
5. Upgrade the node which contains the arbiter brick.

Actual results:
Upgraded the node (which has the arbiter brick) successfully and rebooted it. After the reboot, healing of files remains pending even after manually triggering heal. The self-heal daemon and the bricks are running.

Expected results:
After the upgrade and reboot, heal should be triggered and the files should be healed.

Additional info:
[root@dhcp37-94 arb_1]# gluster vol info arb_1

Volume Name: arb_1
Type: Replicate
Volume ID: 0b5647dd-744d-4c8f-832c-b14f8cdd4619
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: dhcp37-75.lab.eng.blr.redhat.com:/bricks/brick0/arb_1
Brick2: dhcp37-213.lab.eng.blr.redhat.com:/bricks/brick0/arb_1
Brick3: dhcp37-94.lab.eng.blr.redhat.com:/bricks/brick0/arb_1 (arbiter)
Options Reconfigured:
diagnostics.client-log-level: DEBUG
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: disable

# gluster volume heal arb_1 info summary
Brick dhcp37-75.lab.eng.blr.redhat.com:/bricks/brick0/arb_1
Status: Connected
Total Number of entries: 11620
Number of entries in heal pending: 11620
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick dhcp37-213.lab.eng.blr.redhat.com:/bricks/brick0/arb_1
Status: Connected
Total Number of entries: 11620
Number of entries in heal pending: 11620
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick dhcp37-94.lab.eng.blr.redhat.com:/bricks/brick0/arb_1
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0
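For reference, a minimal sketch of the CLI calls typically used to re-trigger and monitor the heal on the affected volume (arb_1 is used as the example here; the other volumes would be checked the same way):

# gluster volume status arb_1              (confirm the bricks and the self-heal daemon are online)
# gluster volume heal arb_1                (trigger an index heal)
# gluster volume heal arb_1 full           (trigger a full heal if the index heal makes no progress)
# gluster volume heal arb_1 info summary   (re-check the pending entry counts per brick)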
What's the relation of this bug with the async change we did in glusterd?
Ravi,
I haven't restarted or rebooted anything on the cluster.

Attaching the sosreports, brick statedumps and shd statedumps:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1635967/
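For completeness, a sketch of how such statedumps are typically generated (the default dump location is /var/run/gluster/; the pgrep pattern is an assumption about how the shd process appears on this setup):

# gluster volume statedump arb_1          (dump the brick processes of arb_1 to /var/run/gluster/)
# kill -USR1 $(pgrep -f glustershd)       (make the self-heal daemon process write its own statedump)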
(In reply to Bala Konda Reddy M from comment #6)
> Ravi,
> I haven't restarted or rebooted anything on the cluster.
>
> Attaching the sosreports, brick statedumps and shd statedumps:
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1635967/

The brick statedump taken earlier shows ACTIVE locks on the bricks dated 2018-10-05:

inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=402201c8af7f0000, client=0x7f8a800a6e20, connection-id=dhcp37-75.lab.eng.blr.redhat.com-2118-2018/10/03-08:40:08:786539-arb_1-client-0-0-0, granted at 2018-10-05 16:58:43
lock-dump.domain.domain=arb_1-replicate-0:metadata

I haven't checked the shd logs yet, but I think there might have been a network disconnect between the shd and the bricks, which should have caused these locks to be released. At this point I'm almost sure that, with the test description in this BZ resembling https://bugzilla.redhat.com/show_bug.cgi?id=1637802#c0 (which is for fixing the one raised by Vijay, BZ 1636902), the same fix should address this issue as well.
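As an illustration only (not verified on this setup; the stale lock fix referenced above is the proper resolution), such granted locks can usually be located in the brick statedumps and, if necessary, released with the clear-locks CLI. The statedump filename and the path/range arguments below are placeholders:

# grep -B3 "ACTIVE" /var/run/gluster/bricks-brick0-arb_1.*.dump.*   (find granted inodelks in the brick statedump)
# gluster volume clear-locks arb_1 / kind granted inode 0,0-0       (release granted inode locks held on the given path)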
Hi Bala,
Further to comment #7, do you have any objections to closing this as a duplicate of BZ 1636902? The issue should not occur with glusterfs-3.12.2-23, which contains the fix for the stale lock issue.
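For verification on the cluster nodes, something along these lines can confirm whether a build containing the fix is installed (the exact NVR suffix is an assumption; anything at or above 3.12.2-23 should have it):

# rpm -q glusterfs-server          (expect glusterfs-server-3.12.2-23.el7rhgs or later)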
(In reply to Ravishankar N from comment #8)
> Hi Bala,
> Further to comment #7, do you have any objections to closing this as a
> duplicate of BZ 1636902? The issue should not occur with glusterfs-3.12.2-23,
> which contains the fix for the stale lock issue.

Yes, please mark it as a duplicate.
*** This bug has been marked as a duplicate of bug 1636902 ***