Description of problem:
------------------------
When deleting a 2TB image file, the gluster FUSE mount process crashed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.4.4 (glusterfs-3.12.2-47.el7rhgs)

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create an image file of 2T from the RHV Manager UI
2. Delete the same image file after it is created successfully

Actual results:
---------------
FUSE mount crashed

Expected results:
-----------------
No FUSE mount crashes; everything should keep working.

--- Additional comment from SATHEESARAN on 2019-04-01 08:33:14 UTC ---

frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 2019-04-01 07:57:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fc72c186b9d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fc72c191114]
/lib64/libc.so.6(+0x36280)[0x7fc72a7c2280]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x9627)[0x7fc71f8ba627]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x9ef1)[0x7fc71f8baef1]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x3ae9c)[0x7fc71fb15e9c]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x9e8c)[0x7fc71fd88e8c]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0xb79b)[0x7fc71fd8a79b]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0xc226)[0x7fc71fd8b226]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x17cbc)[0x7fc72413fcbc]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fc72bf2ca00]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x26b)[0x7fc72bf2cd6b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fc72bf28ae3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7586)[0x7fc727043586]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9bca)[0x7fc727045bca]
/lib64/libglusterfs.so.0(+0x8a870)[0x7fc72c1e5870]
/lib64/libpthread.so.0(+0x7dd5)[0x7fc72afc2dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fc72a889ead]

--- Additional comment from SATHEESARAN on 2019-04-01 08:37:56 UTC ---

1. RHHI-V Information
----------------------
RHV 4.3.3
RHGS 3.4.4

2. Cluster Information
-----------------------
[root@rhsqa-grafton11 ~]# gluster pe s
Number of Peers: 2

Hostname: rhsqa-grafton10.lab.eng.blr.redhat.com
Uuid: 46807597-245c-4596-9be3-f7f127aa4aa2
State: Peer in Cluster (Connected)
Other names:
10.70.45.32

Hostname: rhsqa-grafton12.lab.eng.blr.redhat.com
Uuid: 8a3bc1a5-07c1-4e1c-aa37-75ab15f29877
State: Peer in Cluster (Connected)
Other names:
10.70.45.34

3. Volume information
-----------------------
Affected volume: data

[root@rhsqa-grafton11 ~]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 9d5a9d10-f192-49ed-a6f0-c912224869e8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
network.ping-timeout: 30
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

[root@rhsqa-grafton11 ~]# gluster volume status data
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsqa-grafton10.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23403
Brick rhsqa-grafton11.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23285
Brick rhsqa-grafton12.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23296
Self-heal Daemon on localhost               N/A       N/A        Y       16195
Self-heal Daemon on rhsqa-grafton12.lab.eng
.blr.redhat.com                             N/A       N/A        Y       52917
Self-heal Daemon on rhsqa-grafton10.lab.eng
.blr.redhat.com                             N/A       N/A        Y       43829

Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks
Found the issue. The devil is in the minor details, but I'll write it here anyway, more as a note to self, because it is very hard to keep the whole sequence and all the fine-grained details in memory, and a very, very specific case leads to this issue.

Reproducer on a smaller scale
=============================
1. Create a 1x3 volume.
2. Enable shard on it.
3. Set shard-block-size to 4MB.
4. Set shard-lru-limit to 150.
5. Turn off write-behind.
6. Start and FUSE-mount it.
7. qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
8. Unlink it.

Why are these options configured this way? This bug is hit when the lru list is bigger than the deletion rate and more than lru-limit shards are created, specifically by operations like fallocate that fallocate all shards in parallel, as opposed to something like writev, where there is a guarantee that I won't ever need to resolve more than lru-limit shards for the same fop.

Why is this important? A single fallocate fop in shard resolves and creates all shards (in this case 256 > shard-lru-limit) in a single batch. Since the lru list can only hold 150 shards, after blocks 1..150 are resolved, resolving shards 150-255 (skipping the base shard) will evict the first (255-150) shards even before they are fallocated. This means these (255-150) inodes get evicted from the lru list and inode_unlink()d. (In other words, it is not enough to just fill the lru list; we must also force eviction of some just-resolved participant shards that are yet to be operated on.)

Now shard sends fallocate on all 256 shards. (255-150) of these will be added to the fsync list only, and the remaining 150 are part of both the fsync and lru lists.

Then comes an unlink on the file. Shard deletion is initiated in the background, in batches of shard-deletion-rate shards at a time. The now-evicted shards need to be resolved again before shard can send UNLINKs on them, and inode_resolve() fails on them because they were inode_unlink()d as part of eviction.
So a lookup is sent, and in the callback they are linked again. But inode_link() still finds the old inode and returns that (whose inode ctx is still valid). Unfortunately, shard_link_block_inode() relies on local->loc.inode as the source of the base inode, and that is NULL in background deletion. So when one of these evicted shards is added back to the list, the ref on the base shard is not taken, since the base shard pointer itself is NULL. That is a missing ref, which would still be OK as long as we knew not to unref it at the end. But that is not what happens: once these shards are deleted, in the cbk the base shard gets unref'd (an unwanted extra unref) for each such previously evicted shard, because the inode ctx of the old inode returned by inode_link() is still valid and contains a reference to the base inode pointer.

SIMPLE SUMMARY OF THE ISSUE:
============================
Here are the expectations:

1. When a shard inode is added to the lru list, we ref the base shard. When it is added to the fsync list, we take another ref. This is to keep the base shard alive precisely for deletion operations.
2. When a shard is evicted from the lru list, the base shard is unref'd once. When it is evicted from the fsync list, the base shard is unref'd again.

Simple stuff so far: basically undo in step 2 what you did in step 1.

In this bug, step 2 was executed correctly everywhere. But in one particular scenario, when the shard is added to the lru list, the base inode is not ref'd (because the pointer passed to the function is NULL). Yet when that shard's life cycle ends, the shard translator *somehow* gets access to the base shard pointer and unrefs it, thinking it needs to undo whatever was done at the beginning. In this way the ref count keeps coming down while io (unlinks) is happening on the shards, leading to more unrefs than refs, and at some point to inode destruction and illegal memory access.
There is a way to work around the issue by setting shard-lru-limit to a very high value. But exposing it to users can have unintended consequences: too high a value leads to high memory consumption, while too low a value leads to frequent evictions and hence unnecessary lookups over the network. Besides, I introduced the option just to make testing easier; it is NO_DOC and not exposed to users, purely meant for testing purposes.

As for the fix, I still need to think about it. There are multiple ways to "patch" the issue, but I need to be sure the fix won't break anything that was already working, and there are many cases to consider before confirming that.
Tested with RHVH 4.3.5 based on RHEL 7.7, with glusterfs-6.0-7, in 2 test scenarios:

1. Created multiple preallocated raw images with an aggregate size exceeding 2TB and deleted them all together (concurrently)
2. Created multiple 2TB preallocated raw images and deleted them concurrently

In both of the above scenarios, deletion of the VM images was smooth with no issues seen; all hosts remained operational and the DC was fully functional.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0508