Description of problem:
========================
On my systemic (non-functional) test setup, I was running tests to validate the rpc fixes. I rebooted one of the server nodes and saw that, post reboot, a couple of entries are not getting healed at all.

IO details are in the "Nag Cluster IOs" worksheet:
https://docs.google.com/spreadsheets/d/17Yf9ZRWnWOpbRyFQ2ZYxAAlp9I_yarzKZdjN8idBJM0/edit#gid=1472913705

[root@rhs-client19 ~]# gluster v heal rpcx3 info
Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Status: Connected
Number of entries: 0

Brick rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53
Status: Connected
Number of entries: 2

Brick rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53
Status: Connected
Number of entries: 2

Brick rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53
Status: Connected
Number of entries: 1

[root@rhs-client19 ~]# gluster v info

Volume Name: rpcx3
Type: Distributed-Replicate
Volume ID: f7532c65-63d0-4e4a-a5b5-c95238635eff
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/rpcx3
Brick4: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick5: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick6: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick2/rpcx3
Brick7: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick8: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Brick9: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/rpcx3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
features.uss: enable
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
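To narrow down why these two directories stay in heal info, their AFR state can be dumped directly on the brick3 bricks (rhs-client32, rhs-client38 and rhs-client19). This is only a diagnostic sketch, not output captured from this setup:

getfattr -d -m . -e hex /gluster/brick3/rpcx3/IOs/samedir/level1.1/level2.1/level3.11/level4.35
getfattr -d -m . -e hex /gluster/brick3/rpcx3/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53

Non-zero trusted.afr.rpcx3-client-* values (or a directory missing on one of the three bricks) would show which replica is still marked pending for these entries.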
Version-Release number of selected component (if applicable):
=================
3.12.2-43

How reproducible:
=================
hit it once on the systemic setup

Steps to Reproduce:
=================
1. created a single 3x3 volume on a 4-node setup (brick-mux disabled, all other settings at defaults)
2. mounted the volume on 8 clients and triggered the IOs below (a rough sketch of the directory/file workload follows this list):
   1) linux untar from all mounts ---> note that on 4 clients this was run as a non-root user, after enabling access through ACLs
   2) collected resource consumption and appended it to individual files on the mount point
   3) continuous lookups from all clients
   4) creation of the same deep directory path from all 8 clients in parallel
   -- kept the IOs going for about 2 days, and then, as a random health check --
3. after a day or so, started creating zero-byte files from a new client (~2.5 million)
4. after 2 days, enabled quota and uss
5. rebooted one node
6. even after 3 days, the 2 entries are not getting healed
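For reference, a minimal sketch of the directory/file part of the workload from steps 2.4) and 3 (the mount path, the zero-byte target directory and the loop form are assumptions, not the exact commands that were run):

# assumed FUSE mount point on each client
MNT=/mnt/rpcx3
# step 2.4): same deep directory path created from all 8 clients in parallel
mkdir -p "$MNT"/IOs/samedir/level1.1/level2.1/level3.11/level4.35/level5.53
# step 3: ~2.5 million zero-byte files from a new client
mkdir -p "$MNT"/IOs/zerobyte
for ((i=1; i<=2500000; i++)); do touch "$MNT/IOs/zerobyte/file_$i"; done

Step 4 presumably corresponds to the standard CLI, consistent with the "Options Reconfigured" output above:

gluster volume quota rpcx3 enable
gluster volume set rpcx3 features.uss enable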