Bug 1662059
Field | Value
---|---
Summary | [RHV-RHGS] Fuse mount crashed while deleting a 1 TB image file from RHV
Product | [Red Hat Storage] Red Hat Gluster Storage
Component | sharding
Status | CLOSED ERRATA
Severity | high
Priority | high
Version | rhgs-3.4
Target Release | RHGS 3.4.z Batch Update 3
Fixed In Version | glusterfs-3.12.2-37
Hardware | x86_64
OS | Linux
Reporter | SATHEESARAN <sasundar>
Assignee | Krutika Dhananjay <kdhananj>
QA Contact | SATHEESARAN <sasundar>
CC | abhishku, rcyriac, rhs-bugs, sabose, sankarshan, sasundar, sheggodu, storage-qa-internal
Keywords | ZStream
Environment | RHV-RHGS Integration
Type | Bug
Clones (view as bug list) | 1662368, 1663208
Bug Depends On | 1662368, 1665803
Bug Blocks | 1663208
Last Closed | 2019-02-04 07:41:44 UTC
Description

SATHEESARAN 2018-12-25 17:54:03 UTC

Created attachment 1516688 [details]
Fuse mount log 1

Created attachment 1516689 [details]
Fuse mount log 2
So there is no core dump, and I can't tell much from just the logs.

From:

[root@dhcp37-127 ~]# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %i

it seems this should be set to a valid path for us to get the core dump. It would be great if you could change this value to a meaningful path and recreate the issue.

-Krutika

(In reply to Krutika Dhananjay from comment #5)
> So there is no core dump, and I can't tell much from just the logs.
>
> From:
> [root@dhcp37-127 ~]# cat /proc/sys/kernel/core_pattern
> |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t %e %i
>
> it seems this should be set to a valid path for us to get the core dump.
> It would be great if you could change this value to a meaningful path and
> recreate the issue.
>
> -Krutika

I could reproduce the issue consistently outside of the RHV-RHGS setup, with 3 RHGS servers and 1 client:

1. Create 5 VM image files on the fuse-mounted gluster volume using the qemu-img command:
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm1.img 10G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm2.img 7G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm3.img 5G
# qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm4.img 4G

2. Delete the files from the mount:
# rm -rf /mnt/testdata/*

The above step hits the crash almost every time.

I will reinstall the required debug packages and will provide the setup for debugging.
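For anyone re-running the reporter's steps, a minimal bash sketch of the reproduction loop from comment #6 is below. It is an illustration only: the single mount point, the uniform image size, and the post-deletion mount check are assumptions, not part of the original report (the reporter confirms in a later comment that one mount point was used for both steps).

```bash
#!/bin/bash
# Hypothetical reproduction sketch based on comment #6.
# Assumes the sharded gluster volume is already fuse-mounted at $MOUNT.
MOUNT=/mnt/test

# 1. Create preallocated qcow2 images on the fuse mount.
for i in 1 2 3 4; do
    qemu-img create -f qcow2 -o preallocation=full "$MOUNT/vm$i.img" 10G
done

# 2. Delete them all; this is the step reported to trigger the crash.
rm -f "$MOUNT"/vm*.img

# 3. If the fuse client crashed, the mount point becomes unusable
#    ("Transport endpoint is not connected"), so stat() on it fails.
if ! stat "$MOUNT" > /dev/null 2>&1; then
    echo "FUSE mount appears to have crashed"
fi
```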
(In reply to SATHEESARAN from comment #6)
> I could reproduce the issue consistently outside of the RHV-RHGS setup,
> with 3 RHGS servers and 1 client:
>
> 1. Create 5 VM image files on the fuse-mounted gluster volume using the
> qemu-img command:
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm1.img 10G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm2.img 7G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm3.img 5G
> # qemu-img create -f qcow2 -o preallocation=full /mnt/test/vm4.img 4G
>
> 2. Delete the files from the mount:
> # rm -rf /mnt/testdata/*
>
> The above step hits the crash almost every time.

Is the mount point in step 1 different from the one used in step 2? In step 1, the files are created under /mnt/test/, but the rm -rf is done from /mnt/testdata/.

-Krutika

(In reply to Krutika Dhananjay from comment #8)
> Is the mount point in step 1 different from the one used in step 2? In
> step 1, the files are created under /mnt/test/, but the rm -rf is done
> from /mnt/testdata/.

I did it from the same mount; there were no different mounts.

I think I found more issues in the code:

1. 'No such file or directory' logs for shards that don't exist.
2. Misleading logs of the kind "Failed to clean up shards of gfid %s" (msgid: 133021) shortly after logging the fact that deletion actually succeeded (msgid: 133022).

Still working on root-causing the crash.

Meanwhile, Sas, since you said you were able to recreate the problem of hosts going into an unresponsive state during deletion, could you capture a volume profile for a run that recreates this behavior?

-Krutika

I've sent https://review.gluster.org/c/glusterfs/+/21946 and https://review.gluster.org/c/glusterfs/+/21957 to fix the issues mentioned in comment #10. Oddly enough, the second patch also fixed the crash.

I need to analyze the code a bit more to see the correlation between the crash and this fix.

-Krutika

(In reply to Krutika Dhananjay from comment #11)
> I've sent https://review.gluster.org/c/glusterfs/+/21946 and
> https://review.gluster.org/c/glusterfs/+/21957 to fix the issues mentioned
> in comment #10. Oddly enough, the second patch also fixed the crash.
>
> I need to analyze the code a bit more to see the correlation between the
> crash and this fix.

I found the problem. The entrylk that is taken to perform deletion of shards atomically did not have its lk-owner initialized. As a result, multiple background deletion tasks from the same client can end up in the critical section at the same time and attempt parallel deletion of shards, until the refcount on the base shard drops to 0 once all of its related shards have been deleted. At that point the other task, by virtue of attempting the same operation again, crashes upon accessing the now-destroyed base inode.

The fix at https://review.gluster.org/c/glusterfs/+/21957 eliminates this race by never launching more than one cleanup task per client. For the sake of completeness, I will still go ahead and change the patch to initialize the lk-owner.

-Krutika

Oh, and nice catch, Sas! :)

-Krutika
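Comment #10 above asks for a volume profile of a run that reproduces the slowness. As a hedged aside, capturing one with the standard gluster CLI would look roughly like the sketch below; the volume name is a placeholder, not taken from this report.

```bash
#!/bin/bash
# Hypothetical volume-profile capture for the run requested in comment #10.
# Run on one of the gluster servers; VOLNAME is a placeholder.
VOLNAME=testvol

gluster volume profile "$VOLNAME" start   # enable per-brick FOP statistics
# ... reproduce the deletion workload from the client here ...
gluster volume profile "$VOLNAME" info > /tmp/profile-during-delete.txt
gluster volume profile "$VOLNAME" stop    # turn profiling back off
```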
(In reply to Krutika Dhananjay from comment #10)
> I think I found more issues in the code:
>
> 1. 'No such file or directory' logs for shards that don't exist.
> 2. Misleading logs of the kind "Failed to clean up shards of gfid %s"
> (msgid: 133021) shortly after logging the fact that deletion actually
> succeeded (msgid: 133022).
>
> Still working on root-causing the crash.
>
> Meanwhile, Sas, since you said you were able to recreate the problem of
> hosts going into an unresponsive state during deletion, could you capture
> a volume profile for a run that recreates this behavior?
>
> -Krutika

Thanks, Krutika. I have raised another bug - https://bugzilla.redhat.com/show_bug.cgi?id=1663367 - for the latency-related issue and will provide the volume profile details on that bug.

QA ack is provided for this bug for the following reasons:

1. This bug is a blocker, as the crash is evident on parallel deletion.
2. It may also affect the block store use case, as gluster block storage uses sharding too.
3. It can be verified by RHHI-V QE with 3 days of effort and can be accommodated in the RHHI-V regression test cycle.

Patches posted downstream:
https://code.engineering.redhat.com/gerrit/#/c/160437/
https://code.engineering.redhat.com/gerrit/#/c/160436/

Tests to be done:

1. Delete a large file (on the order of a TB).
2. Delete multiple large files simultaneously from the same mount point.
3. Delete multiple large files simultaneously from different mount points.
4. Start deletion of a large file that is around 1 TB in size, kill the gluster fuse client doing the deletion midway, remount, and verify that cleanup resumes from where it left off. This is the step that ensures any kind of outage doesn't leave ghost shards behind forever.

In all these runs, also monitor the logs to make sure there is no excessive logging. Some warning logs from protocol/{client,server} might be inevitable, but there should be no excessive logging from shard. Also make sure there is only one occurrence of MSGID 133022 for every gfid, and only one occurrence of MSGID 133021 as well. Multiple, sometimes conflicting messages for the same gfid would indicate something is not right.

Please capture a volume profile in all of these cases. Profiling itself has little overhead, so it is safe to enable it before each run.

-Krutika

Tested with RHV 4.2.8 and RHGS 3.4.3 (glusterfs-3.12.2-38.el7rhgs) with the following tests:

Test 1:
1. Create an image larger than 1 TB (preallocated), then delete it.
2. Repeat the test 20 times.

Test 2:
1. Create 4 images, each larger than 1 TB (preallocated).
2. Delete the images in parallel, from the same client or from different clients.
3. Repeat the test 20 times.

Test 3:
1. Create 4 images, each larger than 1 TB (preallocated).
2. Delete the images from the client and immediately power off the host.
3. Power the host back on.

With all of these scenarios, no issues were seen. However, while creating the preallocated images, the fuse mount process crashed in some scenarios; a new bug - BZ 1668304 - was created to track that issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263
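As a practical aid for the log check described in the verification notes above (one MSGID 133022 and one MSGID 133021 entry per gfid), the bash sketch below counts repeated gfids in the fuse mount log. The log path and the exact message format are assumptions; adjust them to the actual client log.

```bash
#!/bin/bash
# Hypothetical log check: each gfid should appear at most once per MSGID.
# The fuse mount log path is an assumption (on real deployments it is
# derived from the mount point).
LOG=/var/log/glusterfs/mnt-test.log

for msgid in 133021 133022; do
    echo "== MSGID $msgid: gfids logged more than once =="
    grep "MSGID: $msgid" "$LOG" \
        | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
        | sort | uniq -c \
        | awk '$1 > 1 {print "repeated:", $2, "("$1" times)"}'
done
```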