Bug 1787664

Summary:	Accessing select directories unmounts the filesystem.
Product:	[Community] GlusterFS	Reporter:	Calvin Dunigan <cdunigan>
Component:	distribute	Assignee:	bugs <bugs>
Status:	CLOSED NOTABUG	QA Contact:
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	6	CC:	bugs, nchilaka, pasik, rhs-bugs, sasundar, storage-qa-internal
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-17 17:13:02 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Calvin Dunigan 2020-01-03 22:12:08 UTC

Description of problem:
Listing select directories locks the process, generates hundreds of thousands of error messages and causes the filesystem to unmount.


Version-Release number of selected component (if applicable):
6.5

How reproducible:
Every time, on every client (4).

Steps to Reproduce:
1. ls ./attrs

Actual results:
The ls process goes into an uninterruptible wait (D) state.
The volume log is inundated with error messages.
The volume is unmounted.

Expected results:
A file listing.

Additional info:
This is a two-node cluster running in AWS on CentOS 7.7.1908 with gluster version 6.5. It has two volumes of two bricks each, and the bricks are on xfs filesystems. The files in the bricks appear to be healthy. Only one of the volumes is affected. There are tens of thousands of directories in the volume, but only one or two of them are affected.

Most frequent error message:
[2020-01-01 01:03:57.908580] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.5/xlator/protocol/client.so(+0x61412) [0x7f9c18d1a412] -->/usr/lib64/glusterfs/6.5/xlator/cluster/distribute.so(+0x45348) [0x7f9c18a45348] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f9c278771b4] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]

Most specific error message involving one of the select directories:
[2020-01-03 01:43:30.388644] W [MSGID: 109007] [dht-common.c:2616:dht_lookup_everywhere_cbk] 0-vdata-dht: multiple subvolumes (vdata-client-0 and vdata-client-0) have file /users/HCTA/840/Inbox/702/.stfs/attrs/702FO15vqdG20d1J5nSvGVKaf6SvjAwX.zip (preferably rename the file in the backend,and do a fresh lookup)
The message "W [MSGID: 109007] [dht-common.c:2616:dht_lookup_everywhere_cbk] 0-vdata-dht: multiple subvolumes (vdata-client-0 and vdata-client-0) have file /users/HCTA/840/Inbox/702/.stfs/attrs/702FO15vqdG20d1J5nSvGVKaf6SvjAwX.zip (preferably rename the file in the backend,and do a fresh lookup)" repeated 55772 times between [2020-01-03 01:43:30.388644] and [2020-01-03 01:45:30.402289]

Comment 2 SATHEESARAN 2020-01-07 10:24:21 UTC

This bug has been moved to the Gluster product, as it is not related to the downstream product (RHGS).

Please upload the relevant glusterd.log (/var/log/glusterfs/glusterd.log) and brick logs (/var/log/glusterfs/bricks/*).
If possible, please also send us the sosreports from these CentOS nodes.

Comment 3 Calvin Dunigan 2020-01-08 01:25:29 UTC

Unfortunately, we have a couple of restrictions. First, our customer is a Federal agency and requires governmental clearance to see most data. More unfortunately, when the problem first appeared, the logs grew so large that they were filling the root filesystem, and they were truncated to free up space. Also, due to the nature of the customer, an sosreport is not an option.

I do have some logs that cover a time when the problem was active.  I could "scrub" those of sensitive data and forward them if you think that would be helpful.

Finally, I have found a potential cure. It seems that the filesystem error only occurs the first time a file is accessed (the first time since the onset of the problem on Dec. 22). So I wrote a shell script that touched every file; for each file that crashed the filesystem, it killed the glusterfs process, remounted the volume, and continued. So far, the problem hasn't recurred for any given file. I have no way of knowing if this is a permanent cure (it's certainly not a fix) or if the problems will come back. I only mention it in the hope that it may provide some insight into the problem.
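
The loop described above can be sketched roughly as follows. This is not the original shell script, only a minimal Python illustration of the same idea. The names `touch_all`, `stat_command` and the `recover` callback are hypothetical; `recover` stands in for whatever kills the glusterfs client process and remounts the volume, and is left to the caller. The sketch also assumes that a plain lookup (`os.stat`) triggers the problem in the same way `touch` did, and that the list of paths is obtained without listing directories on the broken mount.

```python
import os
import subprocess
import sys
from typing import Callable, Iterable, List


def stat_command(path: str) -> List[str]:
    """Command line for a child process that performs one lookup on `path`."""
    return [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]


def touch_all(
    paths: Iterable[str],
    recover: Callable[[str], None],
    timeout_s: float = 30.0,
    make_command: Callable[[str], List[str]] = stat_command,
) -> List[str]:
    """Access each path in a child process and return the paths that hung.

    Each access runs in its own child so that a hang on the mount blocks
    the child rather than this loop. `recover` is a caller-supplied
    (hypothetical) hook that should kill the glusterfs client and remount.
    """
    hung = []
    for path in paths:
        child = subprocess.Popen(
            make_command(path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        try:
            child.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            hung.append(path)
            # Recover first: a child blocked on the mount may not react to
            # a signal until the client process behind the mount is gone.
            recover(path)
            child.kill()
            # This wait can still block if recover() did not unblock the child.
            child.wait()
    return hung
```

Running the access in a separate process with a timeout is a design choice of this sketch: it keeps the driving loop responsive when an access never returns, which is how the hang showed up here.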

Let me know if the logs that cover only a portion of the time that the issue was present will be of value.

Comment 4 Calvin Dunigan 2020-01-17 17:13:02 UTC

This problem was self-inflicted: we copied files directly to the bricks.
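
For anyone hitting the same thing, one possible way to look for such files is sketched below. It is only an illustration, not something that was run for this bug. It assumes that files created through a gluster mount carry a `trusted.gfid` extended attribute on the brick and that files copied straight onto the brick do not (a copy that preserved xattrs would not be caught). The function names are hypothetical, and reading `trusted.*` attributes on a real brick needs root; without it every file will look like it has no gfid.

```python
import errno
import os
from typing import Callable, Iterator, Optional

GFID_XATTR = "trusted.gfid"


def has_gfid(path: str, getxattr: Optional[Callable] = None) -> bool:
    """Return True if `path` carries the trusted.gfid extended attribute."""
    getxattr = getxattr or os.getxattr  # os.getxattr is Linux-only
    try:
        getxattr(path, GFID_XATTR)
        return True
    except OSError as exc:
        if exc.errno == errno.ENODATA:  # attribute not present
            return False
        raise


def files_without_gfid(
    brick_root: str, getxattr: Optional[Callable] = None
) -> Iterator[str]:
    """Yield regular files under a brick directory that lack trusted.gfid.

    The brick's internal .glusterfs directory is skipped.
    """
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # do not descend into it
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not has_gfid(path, getxattr):
                yield path
```

The `getxattr` parameter exists only so the check can be exercised without root or a real brick; on a brick it would be left at its default.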