Description of problem:

One of the hosts in the RHHI cluster is in non-operational status. Looking through the vdsm logs, the following error is seen:

2018-07-18 18:02:08,353+0530 WARN (itmap/1) [storage.scanDomains] Could not collect metadata file for domain path /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore (fileSD:845)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 834, in collectMetaFiles
    metaFiles = oop.getProcessPool(client_name).glob.glob(mdPattern)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 107, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 560, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 451, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 11] Resource temporarily unavailable

There are no errors in the vmstore mount logs; however, the mount is hung - cannot access /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore

Version-Release number of selected component (if applicable):
glusterfs-debuginfo-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
python-gluster-3.8.4-54.12.el7rhgs.noarch
vdsm-gluster-4.20.33-1.el7ev.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
glusterfs-events-3.8.4-54.12.el7rhgs.x86_64
glusterfs-rdma-3.8.4-54.12.el7rhgs.x86_64
gluster-collectd-1.0.0-1.fc27.noarch

How reproducible:
This was seen on a running environment.

Steps to Reproduce:
NA
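For reference, one way to confirm from the shell that such a mount is hung rather than just slow (a sketch only; the path is the one from the report, and the 10-second timeout is an arbitrary choice):

# A hung fuse mount blocks any access to it; wrapping a listing in timeout(1)
# turns the indefinite hang into a detectable failure.
timeout 10 ls /rhev/data-center/mnt/glusterSD/rhsdev-grafton2.lab.eng.blr.redhat.com:_vmstore \
  || echo "mount appears hung (no response within 10s)"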
Found the RCA.

For this bug to be hit, the LRU list in shard needs to become full. Its size is 16K. That means at least 16K shards should have been accessed from a glusterfs/RHHI mount.

(How do you know when you've hit that mark and can stop generating more data when you're testing? Take a statedump of the mount. Under the section "[features/shard.$VOLNAME-shard]" you should see the following line: "inode-count=16384")

Next, a vm should have been migrated from the host where it was created/used for a while to another host.

Next, delete the vm from the new host.

Access this vm from the old host. You should get "No such file or directory".
(I don't know if RHV accessed this file from the first host, triggering destruction ("forget") of the base inode of this vm. But it is also quite likely in this case that fuse forced the destruction of the inode upon seeing high memory pressure, and at some point this now-invalid pointer is accessed, leading to strange behavior - like a crash or a hang.)

Now perform some more io on one of the other vms that the first host might be managing. At some point, your client will either crash or hang.

======================================================
Here is a simpler way to hit the bug on your non-RHHI (even single node) setup:

1. Create a replica 3 volume and start it.
2. Enable shard on it, and set shard-block-size to 4MB.
3. Create 2 fuse mounts - $M1 and $M2.
4. From $M1, create a 65GB size file (use dd maybe).
(Why 64GB? To hit the lru limit, you need 16K shards. That's 16K * 4MB = 64GB. This is where setting shard-block-size to 4MB helps for the purpose of this test. With the default 64MB size, more time and space would be needed to recreate the issue, since a 1TB image would then have to be created to hit the bug.)
5. Read that file entirely from $M2 (use cat maybe).
6. Delete the file from $M1.
7. Stat the file from $M2. (Should fail with "No such file or directory".)
8. Now start dd on a second file from $M2.

The mount process associated with $M2 must crash soon.

-Krutika
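A command-level sketch of the simplified reproducer above (all names here - volume, bricks, servers, mount points, file names - are placeholders and not from the original report; the gluster options are the ones named in the steps):

# Steps 1-2: replica 3 volume with sharding enabled and a 4MB shard block size.
gluster volume create testvol replica 3 server{1,2,3}:/bricks/testvol
gluster volume set testvol features.shard on
gluster volume set testvol features.shard-block-size 4MB
gluster volume start testvol

# Step 3: two fuse mounts ($M1 and $M2).
mount -t glusterfs server1:/testvol /mnt/m1
mount -t glusterfs server1:/testvol /mnt/m2

# Step 4: a 64GB file from $M1 (16K shards of 4MB each, enough to fill the LRU list).
dd if=/dev/zero of=/mnt/m1/big.img bs=1M count=65536

# Step 5: read the whole file from $M2 so its shards pass through $M2's LRU list.
# (Optionally confirm the limit was hit: trigger a statedump of the $M2 client,
#  e.g. with SIGUSR1, and look for "inode-count=16384" under the shard section.)
cat /mnt/m2/big.img > /dev/null

# Steps 6-7: delete from $M1, then stat from $M2 (expected to fail with ENOENT).
rm -f /mnt/m1/big.img
stat /mnt/m2/big.img

# Step 8: fresh I/O from $M2 - with the bug present, this mount's client process
# crashes or hangs shortly afterwards.
dd if=/dev/zero of=/mnt/m2/second.img bs=1M count=4096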
(In reply to Krutika Dhananjay from comment #2)
> 4. From $M1, create a 65GB size file (use dd maybe).

Sorry, typo. This should be 64GB (although the bug is recreatable with a 65GB file too!)

-Krutika
https://review.gluster.org/#/q/topic:ref-1605056+(status:open+OR+status:merged)
(In reply to Krutika Dhananjay from comment #4)
> https://review.gluster.org/#/q/topic:ref-1605056+(status:open+OR+status:merged)

Upstream patches have been reviewed and merged.
Note to QE now that the bz is ON_QA:

To make testing of shard's LRU algorithm easier (it previously had a limit of 16K, which meant one had to create/access 1TB of data to hit the limit and find bugs therein), I've introduced a "features.shard-lru-limit" option to configure this to a lower number (the minimum is 20, but that's probably overkill even for testing purposes with vms). So I'm guessing something like 1K is a good number.

You can use this to test the stability of the code once the LRU limit is crossed, to ensure there are no stale inodes in the LRU list even after they're LRU'd out (the client process will crash/hang if there are any, due to illegal memory access). You can check whether the LRU limit has been reached through the "inode-count=" line in the statedump of the client process.

Also note that a fuse mount process picks this option up only once, at mount time. To keep the code simple, I did not implement reconfiguring the option through volume-set. If you set a value X and want to move to Y, you need to unmount the volume, set the option to Y, and mount again for Y to take effect.

Let me know if you need any other input/help.

-Krutika
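A rough sketch of how the new option might be exercised, based on the note above (the volume name, mount point and the 1024 value are placeholders; the statedump commands assume the usual SIGUSR1 mechanism and the default /var/run/gluster dump location, and the pgrep pattern may need adjusting to match the actual client process):

# Lower the shard LRU limit so it can be crossed with ~4GB of data
# (1024 shards * 4MB shard-block-size) instead of 1TB.
gluster volume set testvol features.shard-lru-limit 1024

# The option is only picked up at mount time, so remount any client that
# should use the new value.
umount /mnt/m1
mount -t glusterfs server1:/testvol /mnt/m1

# After generating/accessing enough shards to cross the limit, take a statedump
# of the fuse client and check the shard section for the inode count.
kill -USR1 "$(pgrep -f 'glusterfs.*/mnt/m1')"
grep inode-count /var/run/gluster/*dump*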
Tested with glusterfs-3.12.2-31.el7rhgs:

1. Create a replica 3 volume with shard size as 4MB and start it.
2. Create 2 fuse mounts.
3. From the first fuse mount, create a 64GB file.
4. Read that file entirely from the second mount.
5. Delete the file from the first mount.
6. Checking for the file from the second mount failed as expected.
7. Now start dd on a second file from the second mount.

With this testing, no issues were found.

Repeated the test in an RHV environment too, with a disk image of 100G (with shard-size as 4MB). No issues found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3827