Description of problem:

One of the hosts in the RHHI cluster is in non-operational status. Looking through the vdsm logs, the following error is seen:

2018-07-18 18:02:08,353+0530 WARN (itmap/1) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore (fileSD:845)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 834, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 107, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 560, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 11] Resource temporarily unavailable

There are no errors in the vmstore mount logs; however, the mount is hung - cannot access /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore

Version-Release number of selected component (if applicable):
glusterfs-debuginfo-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
python-gluster-3.8.4-54.12.el7rhgs.noarch
vdsm-gluster-4.20.33-1.el7ev.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
glusterfs-events-3.8.4-54.12.el7rhgs.x86_64
glusterfs-rdma-3.8.4-54.12.el7rhgs.x86_64
gluster-collectd-1.0.0-1.fc27.noarch

How reproducible:
This was seen on a running environment.

Steps to Reproduce:
NA
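For reference, one way to confirm from the shell that such a mount is hung rather than just slow (a sketch only; the path is the one from the report, and the 10-second timeout is an arbitrary choice):

# A hung fuse mount blocks any access to it; wrapping a listing in timeout(1)
# turns the indefinite hang into a detectable failure.
timeout 10 ls /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore \
  || echo "mount appears hung (no response within 10s)"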
Found the RCA.

For this bug to be hit, the LRU list in shard needs to become full. Its size is 16K. That means at least 16K shards should have been accessed from a glusterfs/RHHI mount.

(How do you know when you've hit that mark and can stop generating more data when you're testing? Take a statedump of the mount. Under the section "[features/shard.$VOLNAME-shard]" you should see the following line: "inode-count=16384")

Next, a vm should have been migrated from the host where it was created/used for a while to another host.

Next, delete the vm from the new host.

Access this vm from the old host. You should get "No such file or directory".
(I don't know if RHV accessed this file from the first host, triggering destruction ("forget") of the base inode of this vm. But it is also quite likely in this case that fuse forced the destruction of the inode upon seeing high memory pressure, and at some point this now-invalid pointer is accessed, leading to strange behavior - like a crash or a hang.)

Now perform some more io on one of the other vms that the first host might be managing. At some point, your client will either crash or hang.

======================================================
Here is a simpler way to hit the bug on your non-RHHI (even single node) setup:

1. Create a replica 3 volume and start it.
2. Enable shard on it, and set shard-block-size to 4MB.
3. Create 2 fuse mounts - $M1 and $M2.
4. From $M1, create a 65GB size file (use dd maybe).
(Why 64GB? To hit the lru limit, you need 16K shards. That's 16K * 4MB = 64GB. This is where setting shard-block-size to 4MB helps for the purpose of this test. With the default 64MB size, more time and space would be needed to recreate the issue, since a 1TB image would then have to be created to hit the bug.)
5. Read that file entirely from $M2 (use cat maybe).
6. Delete the file from $M1.
7. Stat the file from $M2. (Should fail with "No such file or directory".)
8. Now start dd on a second file from $M2.

The mount process associated with $M2 must crash soon.

-Krutika
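A command-level sketch of the simplified reproducer above (all names here - volume, bricks, servers, mount points, file names - are placeholders and not from the original report; the gluster options are the ones named in the steps):

# Steps 1-2: replica 3 volume with sharding enabled and a 4MB shard block size.
gluster volume create testvol replica 3 server{1,2,3}:/bricks/testvol
gluster volume set testvol features.shard on
gluster volume set testvol features.shard-block-size 4MB
gluster volume start testvol

# Step 3: two fuse mounts ($M1 and $M2).
mount -t glusterfs server1:/testvol /mnt/m1
mount -t glusterfs server1:/testvol /mnt/m2

# Step 4: a 64GB file from $M1 (16K shards of 4MB each, enough to fill the LRU list).
dd if=/dev/zero of=/mnt/m1/big.img bs=1M count=65536

# Step 5: read the whole file from $M2 so its shards pass through $M2's LRU list.
# (Optionally confirm the limit was hit: trigger a statedump of the $M2 client,
#  e.g. with SIGUSR1, and look for "inode-count=16384" under the shard section.)
cat /mnt/m2/big.img > /dev/null

# Steps 6-7: delete from $M1, then stat from $M2 (expected to fail with ENOENT).
rm -f /mnt/m1/big.img
stat /mnt/m2/big.img

# Step 8: fresh I/O from $M2 - with the bug present, this mount's client process
# crashes or hangs shortly afterwards.
dd if=/dev/zero of=/mnt/m2/second.img bs=1M count=4096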
(In reply to Krutika Dhananjay from comment #2)
> 4. From $M1, create a 65GB size file (use dd maybe).

Sorry, typo. This should be 64GB (although the bug is recreatable with a 65GB file too!)

-Krutika
https://review.gluster.org/#/q/topic:ref-1605056+(status:open+OR+status:merged)
(In reply to Krutika Dhananjay from comment #4)
> https://review.gluster.org/#/q/topic:ref-1605056+(status:open+OR+status:merged)

Upstream patches have been reviewed and merged.
Note to QE now that the bz is ON_QA:

To make testing of shard's LRU algorithm easier (it previously had a limit of 16K, which meant one had to create/access 1TB of data to hit the limit and find bugs therein), I've introduced a "features.shard-lru-limit" option to configure this to a lower number (the minimum is 20, but that's probably overkill even for testing purposes with vms). So I'm guessing something like 1K is a good number.

You can use this to test the stability of the code once the LRU limit is crossed, to ensure there are no stale inodes in the LRU list even after they're LRU'd out (the client process will crash/hang if there are any, due to illegal memory access). You can check whether the LRU limit has been reached through the "inode-count=" line in the statedump of the client process.

Also note that a fuse mount process picks this option up only once, at mount time. To keep the code simple, I did not implement reconfiguring the option through volume-set. If you set a value X and want to move to Y, you need to unmount the volume, set the option to Y, and mount again for Y to take effect.

Let me know if you need any other input/help.

-Krutika
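A rough sketch of how the new option might be exercised, based on the note above (the volume name, mount point and the 1024 value are placeholders; the statedump commands assume the usual SIGUSR1 mechanism and the default /var/run/gluster dump location, and the pgrep pattern may need adjusting to match the actual client process):

# Lower the shard LRU limit so it can be crossed with ~4GB of data
# (1024 shards * 4MB shard-block-size) instead of 1TB.
gluster volume set testvol features.shard-lru-limit 1024

# The option is only picked up at mount time, so remount any client that
# should use the new value.
umount /mnt/m1
mount -t glusterfs server1:/testvol /mnt/m1

# After generating/accessing enough shards to cross the limit, take a statedump
# of the fuse client and check the shard section for the inode count.
kill -USR1 "$(pgrep -f 'glusterfs.*/mnt/m1')"
grep inode-count /var/run/gluster/*dump*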
Tested with glusterfs-3.12.2-31.el7rhgs:

1. Create a replica 3 volume with shard size as 4MB and start it.
2. Create 2 fuse mounts.
3. From the first fuse mount, create a 64GB file.
4. Read that file entirely from the second mount.
5. Delete the file from the first mount.
6. Checking for the file from the second mount failed as expected.
7. Now start dd on a second file from the second mount.

With this testing, no issues were found.

Repeated the test in an RHV environment too, with a disk image of 100G (with shard-size as 4MB). No issues found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827