Description of problem:
Had a 6-node cluster with a 1x3 replicate volume 'ozone' created on node1, node2 and node3. The setup had brick multiplexing (brick-mux) enabled, and the volume was configured with the 'gluster-block' group profile.
A handful of blocks (<10) had been created while verifying bz 1514344 and bz 1545049. Started deleting the blocks one by one, and the 'gluster-block delete' command timed out for block 'ob10'. Unable to see anything amiss in /var/log/gluster-block or in /var/log/messages, I restarted all the services and tried again. This time the command succeeded for a few blocks but again timed out for block 'ob9'.
The only evident similarity between the two blocks that timed out is their fairly large size: 1E and 1P.
The entire system slows down after the command fails, presumably because it internally keeps retrying the intended operation without getting through. In other words, every 'gluster-block' command issued after the failure takes 2-3 minutes to produce output. Restarting the gluster-block daemon brings the system back to normal.
I was not able to gather much from the logs; maybe we need better logging there.
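The daemon restart mentioned above can be scripted. A minimal sketch, assuming the systemd unit is named gluster-blockd (the command is echoed rather than executed here):

```shell
# Restart the gluster-block daemon to clear the stalled-delete state.
# Dry run: the command is only stored and printed; drop the echo and run
# "$restart_cmd" directly on a node where gluster-blockd is installed.
restart_cmd="systemctl restart gluster-blockd"
echo "$restart_cmd"
```

On a real cluster this would need to be run on each node serving the block, since each runs its own gluster-blockd instance.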
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create blocks of sizes 1K, 1M, 1G, 1T, 1P and 1E on a replica 3 volume
2. Execute 'gluster-block delete' on each of the blocks created above
Actual results:
Block delete succeeds for blocks of sizes 1K, 1M, 1G and 1T, but times out on 1P and 1E.
Expected results:
Either block delete should succeed promptly, or creation of such large blocks should be disallowed if it affects functionality.
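The repro steps above can be sketched as a dry-run loop. The volume name 'ozone' and hosts node1-node3 come from this report; the 'gluster-block create <vol>/<block> ha <count> <hosts> <size>' syntax is assumed from the CLI as best recalled, so verify against 'gluster-block help' before running. The commands are collected and printed, not executed:

```shell
# Dry-run sketch of the repro: create, then delete, one block per size.
# Nothing is executed; the command list is built into $cmds and printed.
VOL=ozone                      # volume name from this report
HOSTS=node1,node2,node3        # replica hosts from this report
cmds=""
for SIZE in 1K 1M 1G 1T 1P 1E; do
  cmds="${cmds}gluster-block create ${VOL}/blk-${SIZE} ha 3 ${HOSTS} ${SIZE}\n"
  cmds="${cmds}gluster-block delete ${VOL}/blk-${SIZE}\n"
done
printf "%b" "$cmds"
```

To actually reproduce, pipe the printed lines to a shell on a node with gluster-block installed; per this report, the delete of the 1P and 1E blocks is where the timeout shows up.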
Sosreports and gluster-block logs will be copied at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/swetas/<bugnumber>
Could you disable sharding and redo this test? I suspect this has to do with the sharding xlator taking a lot of time to delete the individual shards. Krutika is working on doing unlinks in the background as part of https://bugzilla.redhat.com/show_bug.cgi?id=1520882 for 3.4.0.
(In reply to Pranith Kumar K from comment #3)
> Could you disable sharding and redo this test? I am suspecting that this
> has to do with sharding xlator taking lot of time to delete the individual
> shards. Krutika is working on doing unlinks in background as part of
> https://bugzilla.redhat.com/show_bug.cgi?id=1520882 for 3.4.0.
Please note that you need to both create and delete the block volume while sharding is disabled for us to confirm that the delay was introduced because of sharding.
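The sharding-disabled retest described above can be sketched as a dry run. 'features.shard' is the gluster volume option that toggles sharding; the gluster-block create/delete syntax is assumed as best recalled, and the commands are only printed, not executed:

```shell
# Dry-run sketch: disable sharding on the volume, then create AND delete
# the block while sharding stays off, per the comment above.
VOL=ozone   # volume name from this report
off_cmd="gluster volume set ${VOL} features.shard off"
create_cmd="gluster-block create ${VOL}/noshard-1P ha 3 node1,node2,node3 1P"
delete_cmd="gluster-block delete ${VOL}/noshard-1P"
on_cmd="gluster volume set ${VOL} features.shard on"
echo "$off_cmd"; echo "$create_cmd"; echo "$delete_cmd"; echo "$on_cmd"
```

If the delete completes promptly with sharding off, that would confirm the delay comes from the shard xlator unlinking individual shards.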
Note: The fixes to this issue have been merged upstream: https://review.gluster.org/#/q/status:merged+project:glusterfs+branch:master+topic:ref-1568521
Moving this bug to POST state.
The fix for this issue is already merged, and the other bug, BZ 1520882, is ON_QA.
It would make sense to move this bug to ON_QA as well, since the fix addresses this issue too.
Why has this bug not been moved to ON_QA?
(In reply to SATHEESARAN from comment #18)
> The fix for this issue is already merged and the other bug BZ 1520882 is
> It would be more relevant to have this bug too on ON_QA, as the fix
> addresses this issue too.
> Why is that, this bug is not moved ON_QA ?
Ok. I don't completely understand the process, but shouldn't this be done only once all 3 acks are in place? Let me know if that is not the case.
Created attachment 1586539 [details]
Verification logs on rhgs3.5.0
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.