Bug 1787664 - Accessing select directories unmounts the filesystem.
Summary: Accessing select directories unmounts the filesystem.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-03 22:12 UTC by Calvin Dunigan
Modified: 2020-01-21 16:44 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-17 17:13:02 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Calvin Dunigan 2020-01-03 22:12:08 UTC
Description of problem:
Listing select directories locks the process, generates hundreds of thousands of error messages and causes the filesystem to unmount.


Version-Release number of selected component (if applicable):
6.5

How reproducible:
Every time, on every client (4).

Steps to Reproduce:
1. ls ./attrs

Actual results:
The ls process goes into an uninterruptible wait (D) state.
The volume log is inundated with error messages.
The volume is unmounted.

Expected results:
A file listing.

Additional info:
This is a two-node cluster running in AWS on CentOS 7.7.1908, Gluster version 6.5, with two volumes consisting of two bricks each on an XFS filesystem. The files in the bricks appear to be healthy. Only one of the volumes is affected, and of the tens of thousands of directories in that volume, only one or two are affected.
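
For reference, a distribute volume with the layout described above would look roughly like the following in "gluster volume info" output (the hostnames and brick paths here are placeholders, not taken from this cluster; the volume name vdata is inferred from the log messages below):

Volume Name: vdata
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: node1:/data/brick1/vdata    (placeholder host/path)
Brick2: node2:/data/brick1/vdata    (placeholder host/path)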

Most frequent error message:
[2020-01-01 01:03:57.908580] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.5/xlator/protocol/client.so(+0x61412) [0x7f9c18d1a412] -->/usr/lib64/glusterfs/6.5/xlator/cluster/distribute.so(+0x45348) [0x7f9c18a45348] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7f9c278771b4] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]

Most specific error message involving one of the select directories:
[2020-01-03 01:43:30.388644] W [MSGID: 109007] [dht-common.c:2616:dht_lookup_everywhere_cbk] 0-vdata-dht: multiple subvolumes (vdata-client-0 and vdata-client-0) have file /users/HCTA/840/Inbox/702/.stfs/attrs/702FO15vqdG20d1J5nSvGVKaf6SvjAwX.zip (preferably rename the file in the backend,and do a fresh lookup)
The message "W [MSGID: 109007] [dht-common.c:2616:dht_lookup_everywhere_cbk] 0-vdata-dht: multiple subvolumes (vdata-client-0 and vdata-client-0) have file /users/HCTA/840/Inbox/702/.stfs/attrs/702FO15vqdG20d1J5nSvGVKaf6SvjAwX.zip (preferably rename the file in the backend,and do a fresh lookup)" repeated 55772 times between [2020-01-03 01:43:30.388644] and [2020-01-03 01:45:30.402289]

Comment 2 SATHEESARAN 2020-01-07 10:24:21 UTC
This bug has been moved to the GlusterFS product, as it is not related to the downstream product (RHGS).

Please upload the relevant glusterd log ( /var/log/glusterfs/glusterd.log ) and the brick logs ( /var/log/glusterfs/bricks/* ).
If possible, also provide sosreports from these CentOS nodes.
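
For example, something along these lines run on each node should capture what we need (assuming the default log locations above; adjust paths if your layout differs):

# bundle the glusterd log and all brick logs for this node
tar czf /tmp/gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log \
    /var/log/glusterfs/bricks/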

Comment 3 Calvin Dunigan 2020-01-08 01:25:29 UTC
Unfortunately we have a couple of restrictions. First, our customer is a Federal agency and requires governmental clearance to see most data. Worse, when the problem first appeared, the logs grew so large that they were filling the root filesystem and had to be truncated to free up space. Also, due to the nature of the customer, sosreports are not an option.

I do have some logs that cover a time when the problem was active.  I could "scrub" those of sensitive data and forward them if you think that would be helpful.
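
If it helps, the scrub pass would be something along these lines (the patterns and the log file name here are only placeholders for whatever actually needs redacting):

# strip user paths and hostnames before sharing the logs
sed -e 's|/users/[^ ]*|/users/REDACTED|g' \
    -e 's/our-internal-hostname/HOST-REDACTED/g' \
    mnt-vdata.log > mnt-vdata.scrubbed.log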

Finally, I have found a potential cure. It seems that the filesystem error only occurs the first time a file is accessed (the first time since the onset of the problem on Dec. 22). So I wrote a shell script that touched every file and, for each one that crashed the filesystem, killed the glusterfs process, remounted, and continued (a rough sketch follows below). So far, the problem hasn't recurred for any given file. I have no way of knowing whether this is a permanent cure (it's certainly not a fix) or whether the problem will come back. I only mention it in the hope that it may provide some insight into the problem.
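
A minimal sketch of that workaround, with the mount point, volume spec, and timeout as placeholders rather than the values from the real script:

#!/bin/bash
# Stat every file in a pre-generated list once. If the stat hangs (the bug
# leaves it in D state), kill the fuse client to unblock it, remount the
# volume, and continue with the next file.
MOUNT=/mnt/vdata        # placeholder: client-side mount point
VOLSPEC=server1:/vdata  # placeholder: server:/volume used to remount
LIST=file-list.txt      # generated up front, e.g.: find "$MOUNT" -type f > file-list.txt

while IFS= read -r f; do
    stat "$f" >/dev/null 2>&1 &
    pid=$!
    for _ in $(seq 1 30); do                  # give each file up to ~30 seconds
        kill -0 "$pid" 2>/dev/null || break
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        pkill -f "glusterfs.*$MOUNT"          # killing the client unwedges the stuck stat
        wait "$pid" 2>/dev/null
        umount -l "$MOUNT" 2>/dev/null
        mount -t glusterfs "$VOLSPEC" "$MOUNT"
    fi
done < "$LIST"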

Let me know whether logs that cover only a portion of the time the issue was present would be of value.

Comment 4 Calvin Dunigan 2020-01-17 17:13:02 UTC
This problem was self-inflicted: we had copied files directly to the bricks.
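
For anyone who hits the same symptoms: files must be created through a client mount so that GlusterFS can assign gfids and maintain its DHT layout; copying them straight into the brick directories on the servers produces the kind of "multiple subvolumes have file" errors shown above. For example (mount point is a placeholder):

# copy via the fuse mount, never directly into the brick directories on the servers
cp somefile.zip /mnt/vdata/some/dir/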

