Bug 1694604
Summary: | gluster fuse mount crashed when deleting 2T image file from RHV Manager UI | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | SATHEESARAN <sasundar> |
Component: | rhhi | Assignee: | Sahina Bose <sabose> |
Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | rhhiv-1.6 | CC: | bkunal, godas, kdhananj, pasik, rhs-bugs, storage-qa-internal |
Target Milestone: | --- | ||
Target Release: | RHHI-V 1.7 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-6.0-6 | Doc Type: | Bug Fix |
Doc Text: |
When sharding is enabled on a volume, a single file allocation operation creates all shards in a single batch. If the number of shards involved in an operation was greater than the number of entries allowed in the lru-cache, the inode associated with the file operation was freed while it was still in use. This led to a crash in the mount process when deleting large files from volumes with sharding enabled, which caused all virtual machines that had mounted that storage to pause. This issue is no longer observed in these updated packages.
|
Story Points: | --- |
Clone Of: | 1694595 | Environment: | |
Last Closed: | 2020-02-13 15:57:20 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1694595, 1696136 | ||
Bug Blocks: |
Description
SATHEESARAN 2019-04-01 08:48:24 UTC
Found the issue. The devil is in the minor details, but I'll write it here anyway, more as a note to self, because it is very hard to keep the whole sequence and its fine-grained details in memory, and a very specific case leads to this issue.

Reproducer on a smaller scale -
=============================
1. Create a 1x3 volume.
2. Enable shard on it.
3. Set shard-block-size to 4MB.
4. Set shard-lru-limit to 150.
5. Turn off write-behind.
6. Start and fuse-mount the volume.
7. qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
8. Unlink the file.

(A scripted form of these steps is sketched at the end of this analysis.)

Why are these options configured this way? This bug is hit when the lru list is bigger than the deletion rate, and more than lru-list-size shards are created by an operation like fallocate, which fallocates all shards in parallel, as opposed to something like writev, where there is a guarantee that we will never need to resolve more than lru-limit shards for the same fop.

Why is this important? A single fallocate fop in shard resolves and creates all shards (in this case 256, which is greater than shard-lru-limit) in a single batch. Since the lru list can only hold 150 shards, after blocks 1..150 are resolved, shards 150-255 (skipping the base shard) will evict the first (255-150) shards even before they are fallocated. This means those (255-150) inodes get evicted from the lru list and inode_unlink()'d. (In other words, it is not enough to merely fill the lru list; some of the just-resolved participant shards, which are yet to be operated on, must also be forced out.)

Now shard sends fallocate on all 256 shards. (255-150) of these will be added to the fsync list only, and the remaining 150 are part of both the fsync and lru lists.

Then comes an unlink on the file. Shard deletion is initiated in the background, but it is done in batches of shard-deletion-rate shards at a time. The now-evicted shards also need to be resolved before shard can send UNLINKs on them. inode_resolve() fails on them because they were inode_unlink()'d as part of eviction. So a lookup is sent, and in the callback they are linked again, but inode_link() still finds the old inode and returns that (whose inode ctx is still valid). Unfortunately, shard_link_block_inode() ends up relying on local->loc.inode as the source of the base inode, which is NULL in background deletion. So when one of these evicted shards is added back to the list, the ref on the base shard is not taken, since the base shard pointer itself is NULL. That is a missing ref, which would still be okay as long as we knew not to unref it at the end. But that is not what happens: once these shards are deleted, the base shard gets unref'd in the cbk (an unwanted extra unref) for each such previously evicted shard, because the inode ctx of that old inode is still valid and contains a reference to the base inode pointer.

SIMPLE SUMMARY OF THE ISSUE:
============================
Here are the expectations -
1. When a shard inode is added to the lru list, we ref the base shard. When it is added to the fsync list, we take another ref. This is to keep the base shard alive precisely for deletion operations.
2. When a shard is evicted from the lru list, the base shard is unref'd once. When it is evicted from the fsync list, the base shard needs to be unref'd again.

Simple stuff so far; basically, undo in step 2 what you did in step 1.

In this bug, step 2 was executed correctly everywhere. But in one particular scenario, when the shard is added to the lru list, the base inode is not ref'd (because the pointer passed to the function is NULL). Yet when that shard's life cycle ends, the shard translator *somehow* gets access to the base shard pointer and unrefs it, thinking it needs to undo whatever was done at the beginning. In this way the number of refs keeps coming down even while I/O (unlinks) is happening on the shards, leading to more unrefs than refs and, at some point, to inode destruction and illegal memory access.
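For convenience, here is a rough, untested script form of the reproducer above. The volume name, hostnames, brick paths and mount point are placeholders; the option names (features.shard, features.shard-block-size, features.shard-lru-limit, performance.write-behind) follow the standard gluster volume set interface, with shard-lru-limit being the NO_DOC, test-only option described in this comment.

```sh
# Rough script form of the reproducer above. Hostnames, brick paths and the
# mount point are placeholders; adjust them to the test setup.
VOL=repro
gluster volume create $VOL replica 3 \
    host1:/bricks/b1 host2:/bricks/b1 host3:/bricks/b1

# Shard configuration used to hit the bug: 4MB shards and a small lru limit,
# so that a single 1G fallocate (256 shards) overflows the lru list.
gluster volume set $VOL features.shard on
gluster volume set $VOL features.shard-block-size 4MB
gluster volume set $VOL features.shard-lru-limit 150   # NO_DOC, test-only option
gluster volume set $VOL performance.write-behind off

gluster volume start $VOL
mount -t glusterfs host1:/$VOL /mnt

# Preallocate all shards in one batch, then delete the file.
qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
rm -f /mnt/vm1.img
```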
There is a way to work around the issue by setting shard-lru-limit to a very high value (a rough sketch of this tuning appears at the end of this report). But exposing this to users can have unintended consequences, such as high memory consumption; and if the value is set too low, it leads to frequent evictions and hence unnecessary lookups over the network. Besides, I introduced the option itself just to make testing easier; it is NO_DOC and not exposed to users, purely meant for testing purposes.

As for the fix, I still need to think about it. There are multiple ways to "patch" the issue, but I need to be sure the fix won't break anything that was already working, and there are many cases to consider before confirming that.

Tested with RHVH 4.3.5 based on RHEL 7.7, with glusterfs-6.0-7, in 2 test scenarios:
1. Created multiple preallocated raw images with an aggregate size exceeding 2TB and deleted them all together (concurrently).
2. Created multiple 2TB preallocated raw images and deleted them concurrently.

In both of the above scenarios, deletion of the VM images was smooth, no issues were seen, all hosts remained operational, and the DC was fully functional.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0508
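For completeness, a minimal sketch of the workaround mentioned in the analysis above, assuming the same hypothetical volume as in the reproducer script. This is not a recommended production setting: the option is NO_DOC and intended only for testing, a large value increases client-side memory consumption, and the value 16384 is only an illustrative placeholder.

```sh
# Hypothetical workaround sketch only: raise the shard lru limit well above the
# number of shards a single fallocate can touch, so freshly resolved shards are
# not evicted mid-operation. NO_DOC / test-only option; value is a placeholder.
VOL=repro
gluster volume set $VOL features.shard-lru-limit 16384
```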