Bug 1694604
Summary: | gluster fuse mount crashed when deleting 2T image file from RHV Manager UI | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | SATHEESARAN <sasundar> |
Component: | rhhi | Assignee: | Sahina Bose <sabose> |
Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | rhhiv-1.6 | CC: | bkunal, godas, kdhananj, pasik, rhs-bugs, storage-qa-internal |
Target Milestone: | --- | ||
Target Release: | RHHI-V 1.7 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | glusterfs-6.0-6 | Doc Type: | Bug Fix |
Doc Text: |
When sharding is enabled on a volume, a single file allocation operation creates all shards in a single batch. If the number of shards involved in an operation was greater than the number of entries allowed in the lru-cache, the inode associated with the file operation was freed while it was still in use. This led to a crash in the mount process when deleting large files from volumes with sharding enabled, which caused all virtual machines that had mounted that storage to pause. This issue is no longer observed in these updated packages.
|
Story Points: | --- |
Clone Of: | 1694595 | Environment: | |
Last Closed: | 2020-02-13 15:57:20 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1694595, 1696136 | ||
Bug Blocks: |
Description
SATHEESARAN 2019-04-01 08:48:24 UTC
Found the issue. The devil is in the minor details, but I'll write it here anyway, more as a note to self, because it is very hard to keep the whole sequence and its fine-grained details in memory, and a very specific case leads to this issue.

Reproducer on a smaller scale -
=============================
1. Create a 1x3 volume.
2. Enable shard on it.
3. Set shard-block-size to 4MB.
4. Set shard-lru-limit to 150.
5. Turn off write-behind.
6. Start and fuse-mount the volume.
7. qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
8. Unlink the file.

(A scripted form of these steps is sketched at the end of this analysis.)

Why are these options configured this way? This bug is hit when the lru list is bigger than the deletion rate, and more than lru-list-size shards are created by an operation like fallocate, which fallocates all shards in parallel, as opposed to something like writev, where there is a guarantee that we will never need to resolve more than lru-limit shards for the same fop.

Why is this important? A single fallocate fop in shard resolves and creates all shards (in this case 256, which is greater than shard-lru-limit) in a single batch. Since the lru list can only hold 150 shards, after blocks 1..150 are resolved, shards 150-255 (skipping the base shard) will evict the first (255-150) shards even before they are fallocated. This means those (255-150) inodes get evicted from the lru list and inode_unlink()'d. (In other words, it is not enough to merely fill the lru list; some of the just-resolved participant shards, which are yet to be operated on, must also be forced out.)

Now shard sends fallocate on all 256 shards. (255-150) of these will be added to the fsync list only, and the remaining 150 are part of both the fsync and lru lists.

Then comes an unlink on the file. Shard deletion is initiated in the background, but it is done in batches of shard-deletion-rate shards at a time. The now-evicted shards also need to be resolved before shard can send UNLINKs on them. inode_resolve() fails on them because they were inode_unlink()'d as part of eviction. So a lookup is sent, and in the callback they are linked again, but inode_link() still finds the old inode and returns that (whose inode ctx is still valid). Unfortunately, shard_link_block_inode() ends up relying on local->loc.inode as the source of the base inode, which is NULL in background deletion. So when one of these evicted shards is added back to the list, the ref on the base shard is not taken, since the base shard pointer itself is NULL. That is a missing ref, which would still be okay as long as we knew not to unref it at the end. But that is not what happens: once these shards are deleted, the base shard gets unref'd in the cbk (an unwanted extra unref) for each such previously evicted shard, because the inode ctx of that old inode is still valid and contains a reference to the base inode pointer.

SIMPLE SUMMARY OF THE ISSUE:
============================
Here are the expectations -
1. When a shard inode is added to the lru list, we ref the base shard. When it is added to the fsync list, we take another ref. This is to keep the base shard alive precisely for deletion operations.
2. When a shard is evicted from the lru list, the base shard is unref'd once. When it is evicted from the fsync list, the base shard needs to be unref'd again.

Simple stuff so far; basically, undo in step 2 what you did in step 1.

In this bug, step 2 was executed correctly everywhere. But in one particular scenario, when the shard is added to the lru list, the base inode is not ref'd (because the pointer passed to the function is NULL). Yet when that shard's life cycle ends, the shard translator *somehow* gets access to the base shard pointer and unrefs it, thinking it needs to undo whatever was done at the beginning. In this way the number of refs keeps coming down even while I/O (unlinks) is happening on the shards, leading to more unrefs than refs and, at some point, to inode destruction and illegal memory access.
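For convenience, here is a rough, untested script form of the reproducer above. The volume name, hostnames, brick paths and mount point are placeholders; the option names (features.shard, features.shard-block-size, features.shard-lru-limit, performance.write-behind) follow the standard gluster volume set interface, with shard-lru-limit being the NO_DOC, test-only option described in this comment.

```sh
# Rough script form of the reproducer above. Hostnames, brick paths and the
# mount point are placeholders; adjust them to the test setup.
VOL=repro
gluster volume create $VOL replica 3 \
    host1:/bricks/b1 host2:/bricks/b1 host3:/bricks/b1

# Shard configuration used to hit the bug: 4MB shards and a small lru limit,
# so that a single 1G fallocate (256 shards) overflows the lru list.
gluster volume set $VOL features.shard on
gluster volume set $VOL features.shard-block-size 4MB
gluster volume set $VOL features.shard-lru-limit 150   # NO_DOC, test-only option
gluster volume set $VOL performance.write-behind off

gluster volume start $VOL
mount -t glusterfs host1:/$VOL /mnt

# Preallocate all shards in one batch, then delete the file.
qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
rm -f /mnt/vm1.img
```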
There is a way to work around the issue by setting shard-lru-limit to a very high value (a rough sketch of this tuning appears at the end of this report). But exposing this to users can have unintended consequences, such as high memory consumption; and if the value is set too low, it leads to frequent evictions and hence unnecessary lookups over the network. Besides, I introduced the option itself just to make testing easier; it is NO_DOC and not exposed to users, purely meant for testing purposes.

As for the fix, I still need to think about it. There are multiple ways to "patch" the issue, but I need to be sure the fix won't break anything that was already working, and there are many cases to consider before confirming that.

Tested with RHVH 4.3.5 based on RHEL 7.7, with glusterfs-6.0-7, in 2 test scenarios:
1. Created multiple preallocated raw images with an aggregate size exceeding 2TB and deleted them all together (concurrently).
2. Created multiple 2TB preallocated raw images and deleted them concurrently.

In both of the above scenarios, deletion of the VM images was smooth, no issues were seen, all hosts remained operational, and the DC was fully functional.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0508
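For completeness, a minimal sketch of the workaround mentioned in the analysis above, assuming the same hypothetical volume as in the reproducer script. This is not a recommended production setting: the option is NO_DOC and intended only for testing, a large value increases client-side memory consumption, and the value 16384 is only an illustrative placeholder.

```sh
# Hypothetical workaround sketch only: raise the shard lru limit well above the
# number of shards a single fallocate can touch, so freshly resolved shards are
# not evicted mid-operation. NO_DOC / test-only option; value is a placeholder.
VOL=repro
gluster volume set $VOL features.shard-lru-limit 16384
```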