Description of problem:
-----------------------
An attempt was made to reproduce the customer scenario, where large disks residing on gluster volumes are deleted from RHV. During one such attempt, it was found that the fuse mount process had crashed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.0.5
RHGS 3.4.2 ( glusterfs-3.12.2-32.el7rhgs )

How reproducible:
-----------------
1/1. Hit it once.

Steps to Reproduce:
-------------------
1. Configure the RHV storage domain to use a gluster arbitrated replicate volume with sharding enabled
2. Create a disk of size 1TB from the RHV Manager UI
3. Delete the disk from the RHV Manager UI

Actual results:
---------------
The gluster fuse mount process crashed on one of the hypervisors

Expected results:
-----------------
No gluster process should crash
Created attachment 1516688 [details] Fuse mount log 1
Created attachment 1516689 [details] Fuse mount log 2
So there is no core dump and I can't tell much from just the logs.

From:
[root@dhcp37-127 ~]# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %i

Seems like this should be set to a valid path for us to get the core dump.

Would be great if you can change this value to a meaningful path and recreate the issue.

-Krutika
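For reference, one way to redirect cores to a plain file path looks like the sketch below. The /var/crash directory and the core.%e.%p.%t pattern are arbitrary example choices, not a project convention; alternatively, abrt can be left in place and pointed at a valid dump directory through its own configuration.

```shell
# Run as root on the hypervisor. /var/crash and the %e.%p.%t pattern
# below are illustrative choices only.
mkdir -p /var/crash
sysctl -w kernel.core_pattern='/var/crash/core.%e.%p.%t'

# Also make sure the process is permitted to dump core at all:
ulimit -c unlimited
```

After this, a crash of the fuse mount process should leave a core file under /var/crash that gdb can open against the debuginfo packages.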
(In reply to Krutika Dhananjay from comment #5)
> So there is no core dump and I can't tell much from just the logs.
> 
> From
> [root@dhcp37-127 ~]# cat /proc/sys/kernel/core_pattern
> |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %i
> 
> Seems like this should be set to a valid path for us to get the core dump.
> 
> Would be great if you can change this value to a meaningful path and
> recreate the issue.
> 
> -Krutika

I could reproduce the issue consistently outside of the RHV-RHGS setup, with 3 RHGS servers and 1 client:

1. Create the VM image files on the fuse-mounted gluster volume using the qemu-img command
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm1.img 10G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm2.img 7G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm3.img 5G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm4.img 4G

2. Delete the files from the mount
# rm -rf /mnt/testdata/*

The above step hits the crash almost consistently.

I will reinstall the required debug packages and provide the setup for debugging.
(In reply to SATHEESARAN from comment #6)
> (In reply to Krutika Dhananjay from comment #5)
> > So there is no core dump and I can't tell much from just the logs.
> > 
> > From
> > [root@dhcp37-127 ~]# cat /proc/sys/kernel/core_pattern
> > |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %i
> > 
> > Seems like this should be set to a valid path for us to get the core dump.
> > 
> > Would be great if you can change this value to a meaningful path and
> > recreate the issue.
> > 
> > -Krutika
> 
> I could reproduce the issue consistently outside of RHV-RHGS setup.
> With 3 RHGS servers and 1 client.
> 
> 1. Create VM image files on the fuse mounted gluster volume using qemu-img
> command
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm1.img 10G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm2.img 7G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm3.img 5G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm4.img 4G
> 
> 2. Delete the files from the mount
> # rm -rf /mnt/testdata/*
> 
> The above step hits the crash, close to consistent
> 
> I will reinstall the required debug packages and will provide the setup for
> debugging

Is the mountpoint in step 1 different from the one used in step 2? In step 1, files are created under /mnt/test/, but the rm -rf is done from /mnt/testdata/.

-Krutika
(In reply to Krutika Dhananjay from comment #8)
> (In reply to SATHEESARAN from comment #6)
> > I could reproduce the issue consistently outside of RHV-RHGS setup.
> > With 3 RHGS servers and 1 client.
> > 
> > 1. Create VM image files on the fuse mounted gluster volume using qemu-img
> > command
> > # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm1.img 10G
> > # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm2.img 7G
> > # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm3.img 5G
> > # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm4.img 4G
> > 
> > 2. Delete the files from the mount
> > # rm -rf /mnt/testdata/*
> > 
> > The above step hits the crash, close to consistent
> > 
> > I will reinstall the required debug packages and will provide the setup for
> > debugging
> 
> Is the mountpoint in step 1 different from the one used in step 2? In step
> 1, files are created under /mnt/test/. But the rm -rf is done from
> /mnt/testdata/
> 
> -Krutika

I did it from the same mount, not different mounts.
I think I found more issues in the code:

1. 'No such file or directory' logs for shards that don't exist
2. Misleading logs of the kind "Failed to clean up shards of gfid %s" (msgid: 133021) shortly after logging the fact that deletion actually succeeded (msgid: 133022)

Still working on rc'ing the crash.

Meanwhile, sas, since you said you were able to recreate the problem of hosts going into an unresponsive state during deletion, could you capture a volume profile for a run that recreates this behavior?

-Krutika
I've sent https://review.gluster.org/c/glusterfs/+/21946 and https://review.gluster.org/c/glusterfs/+/21957 to fix the issues mentioned in comment #10. Oddly enough, the second patch also fixed the crash.

I need to analyze the code a bit more to see the correlation between the crash and this fix.

-Krutika
(In reply to Krutika Dhananjay from comment #11)
> I've sent https://review.gluster.org/c/glusterfs/+/21946 and
> https://review.gluster.org/c/glusterfs/+/21957 to fix the issues mentioned
> in comment #10.
> Oddly enough, the second patch also fixed the crash.
> 
> I need to analyze the code a bit more to see the correlation between the
> crash and this fix.
> 
> -Krutika

I found the problem.

The entrylk that is taken to perform deletion of shards atomically didn't have its lk-owner initialized. This means multiple background deletion tasks from the same client will end up in the critical section at the same time and attempt parallel deletion of shards, leading to the refcount on the base shard becoming 0 at some point, once all its related shards have been deleted. At that point the other task, by virtue of attempting the same operation again, will crash upon accessing the now-destroyed base inode.

The fix at https://review.gluster.org/c/glusterfs/+/21957 eliminates this race by never launching more than one cleanup task per client. But for the sake of completeness, I will still go ahead and change the patch to initialize the lk-owner.

-Krutika
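The race above can be illustrated with a toy model (this is not gluster code; EntryLock, the zero "owner" value, and the surrounding names are all illustrative). An entrylk is granted per lk-owner, so two requests that both carry the same uninitialized (zero) owner are treated as a single holder and both enter the critical section:

```python
import threading

class EntryLock:
    """Toy model of an entrylk: granted per lk-owner, so requests that
    carry the SAME owner are treated as one holder and all succeed."""
    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None   # lk-owner currently holding the lock
        self._holders = 0

    def lock(self, lk_owner):
        with self._cond:
            # Block only while a DIFFERENT owner holds the lock.
            while self._owner is not None and self._owner != lk_owner:
                self._cond.wait()
            self._owner = lk_owner
            self._holders += 1

    def unlock(self):
        with self._cond:
            self._holders -= 1
            if self._holders == 0:
                self._owner = None
                self._cond.notify_all()

UNINITIALIZED = 0  # both deletion tasks leave lk-owner at zero

lk = EntryLock()
lk.lock(UNINITIALIZED)  # background deletion task 1 enters
lk.lock(UNINITIALIZED)  # task 2 ALSO enters: same (zero) owner
print("tasks in critical section:", lk._holders)  # -> 2
# Both tasks now delete the same shards; when the base shard's refcount
# hits 0, the slower task dereferences a destroyed inode and crashes.
```

Initializing lk-owner to a distinct value per task makes the second `lock()` block until the first task has finished, which is the mutual exclusion the entrylk was meant to provide.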
Oh and nice catch, Sas! :) -Krutika
(In reply to Krutika Dhananjay from comment #10)
> I think I found more issues in the code -
> 1. 'No such file or directory' logs for shards that don't exist
> 2. Misleading logs of the kind "Failed to clean up shards of gfid %s"
> (msgid: 133021) shortly after logging the fact that deletion actually
> succeeded (msgid: 133022)
> 
> Still working on rc'ing the crash.
> 
> Meanwhile, sas, since you said you were able to recreate the problem of
> hosts going into unresponsive state during deletion, could you capture
> volume profile for a run that recreates this behavior?
> 
> -Krutika

Thanks Krutika. I have raised another bug - https://bugzilla.redhat.com/show_bug.cgi?id=1663367 - for the latency-related issue. I will get you the volume profile details on that bug.

QA ack is provided for this bug based on the following reasons:
1. This bug is a blocker, as there is a crash that is evident on parallel deletes
2. This may also affect the block store use case, as gluster block storage uses sharding too
3. This bug can be verified by RHHI-V QE with 3 days' effort and can be accommodated during the regression test cycle of RHHI-V
Patches posted downstream:
https://code.engineering.redhat.com/gerrit/#/c/160437/
https://code.engineering.redhat.com/gerrit/#/c/160436/
Tests to be done:

1. Deleting a large file (of the order of a TB)
2. Deleting multiple large files simultaneously from the same mount point
3. Deleting multiple large files simultaneously from different mount points
4. Starting deletion of a large file that is around 1TB in size, killing the gluster fuse client that is doing the deletion midway, then remounting and verifying that cleanup resumed from where it left off. This is the step that ensures any kind of outage doesn't leave ghost shards behind forever.

In all these runs, also monitor the logs to be sure there is no excessive logging. (Some warning logs from protocol/{client,server} might be inevitable.) But there should be no excessive logging from shard. Also make sure there is only one occurrence of MSGID 133022 for every gfid, and only one occurrence of MSGID 133021 as well. Multiple, sometimes conflicting, messages for the same gfid would indicate something is not right.

Please capture the volume profile in all of the cases. Profiling itself has little overhead, so it is safe to enable it before each run.

-Krutika
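The per-gfid MSGID check could be scripted along these lines. This is a hypothetical helper, not part of any existing tooling: the sample log lines below only approximate the real glusterfs client log format (real logs live under /var/log/glusterfs/), and it assumes each shard message carries a `gfid=` token.

```shell
# Illustrative sample log standing in for the real fuse mount log;
# the exact line format here is an assumption.
cat > sample-fuse-mount.log <<'EOF'
[2019-01-08 10:00:01] W [MSGID: 133021] 0-vol-shard: Failed to clean up shards of gfid=1111-aaaa
[2019-01-08 10:00:02] I [MSGID: 133022] 0-vol-shard: Deleted shards of gfid=1111-aaaa
[2019-01-08 10:00:03] I [MSGID: 133022] 0-vol-shard: Deleted shards of gfid=1111-aaaa
EOF

# Flag any gfid that logs the given MSGID more than once.
check_msgid() {  # usage: check_msgid <MSGID> <logfile>
  grep "MSGID: $1" "$2" \
    | grep -o 'gfid=[0-9a-f-]*' \
    | sort | uniq -c \
    | awk -v m="$1" '$1 > 1 { print "MSGID " m " repeated for " $2 }'
}

check_msgid 133021 sample-fuse-mount.log  # prints nothing: logged once
check_msgid 133022 sample-fuse-mount.log  # prints: MSGID 133022 repeated for gfid=1111-aaaa
```

Running the same two checks against the real log after each test run would surface the duplicate/conflicting messages described above without reading the log by hand.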
Tested with RHV 4.2.8 & RHGS 3.4.3 ( glusterfs-3.12.2-38.el7rhgs ) with the following tests:

Test 1:
1. Create an image of size more than 1TB (preallocated) and then delete it.
2. Repeat the test 20 times.

Test 2:
1. Create 4 images of size more than 1TB (preallocated).
2. Delete the images in parallel, from the same client and from different clients.
3. Repeat the test 20 times.

Test 3:
1. Create 4 images of size more than 1TB (preallocated).
2. Delete an image from the client and immediately power off the host.
3. Power on the host.

With all these scenarios, no issues were seen. But when creating the preallocated image, the fuse mount process crashed in some scenarios. Created a new bug - BZ 1668304 - to track that issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263