Bug 1741783 - volume heal info shows nothing, while accessing the file from the mount point fails with "no such entry"
Summary: volume heal info shows nothing, while accessing the file from the mount point fails with "no such...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Karthik U S
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-16 06:22 UTC by zhou lin
Modified: 2020-03-12 12:57 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:57:57 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
mn-0 node glustershd log + services brick glusterfsd log (2.60 MB, application/zip)
2019-08-17 06:40 UTC, zhou lin
no flags Details
problem-reproduced-complete-log (13.15 MB, application/zip)
2019-08-18 14:11 UTC, zhou lin
no flags Details

Description zhou lin 2019-08-16 06:22:10 UTC
Description of problem:
gluster volume heal info shows nothing, but accessing a file from the mount point fails with "no such entry"
1> volume info output:
Volume Name: services
Type: Replicate
Volume ID: 1aedfcb2-d689-4cbc-b94c-2033112e9b3d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/services/brick
Brick2: mn-1.local:/mnt/bricks/services/brick
Brick3: dbm-0.local:/mnt/bricks/services/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
cluster.heal-timeout: 60
client.ssl: on
server.ssl: on
ssl.private-key: /var/opt/nokia/certs/glusterfs/glusterfs.key
ssl.own-cert: /var/opt/nokia/certs/glusterfs/glusterfs.pem
ssl.ca-list: /var/opt/nokia/certs/glusterfs/glusterfs.ca
performance.client-io-threads: off
cluster.server-quorum-ratio: 51%

2> ls shows the file mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
3> cat mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp from the mount point returns an error
4> this file has a trusted.gfid xattr on the brick
5> this file does not have a link file in the .glusterfs dir

[root@mn-0:/mnt/services/db/upgrade]
# ls -l
total 1
-rw-rwxr--+ 1 _nokrcpsysdbimredis _nokrcpsysdbimredis 15 Aug 16 05:21 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-rwxr--+ 1 _nokrcpsysdbimredis _nokrcpsysdbimredis 16 Aug 16 05:21 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-r--r--  0 _nokrcpsysdbimredis _nokrcpsysdbimredis  0 Aug 16 05:27 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
[root@mn-0:/mnt/services/db/upgrade]

[root@mn-0:/home/robot]
# cat /mnt/services/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
cat: /mnt/services/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp: No such file or directory

[root@mn-0:/home/robot]
# getfattr -m . -d -e hex /mnt/bricks/services/brick/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
getfattr: /mnt/bricks/services/brick/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp: No such file or directory
[root@mn-0:/home/robot]
# cd /mnt/bricks/services/brick/db/upgrade/
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# ls
mn-0__dbim-redis.service__database-nosql-cmredis.sync_state  mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
mn-1__dbim-redis.service__database-nosql-cmredis.sync_state
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
# file: mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
trusted.gfid=0x4d4e638503344672b453441d727b1f8e
trusted.gfid2path.4867fa1704c54e13=0x65383062636534322d333831622d343466632d623436622d3135653066323966353363362f6d6e2d315f5f6462696d2d72656469732e736572766963655f5f64617461626173652d6e6f73716c2d636d72656469732e73796e635f73746174652e746d70
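
For reference, the hex xattr values above can be decoded by hand. A minimal sketch, assuming it is run as root on the mn-0 brick host (BRICK and FILE are simply the paths from this report):

BRICK=/mnt/bricks/services/brick
FILE=$BRICK/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp

# trusted.gfid is a raw 16-byte uuid; reformat the hex into canonical uuid form.
hex=$(getfattr -n trusted.gfid -e hex "$FILE" 2>/dev/null |
      awk -F= '/^trusted.gfid=/{print substr($2,3)}')
gfid=$(echo "$hex" | sed -E 's/(.{8})(.{4})(.{4})(.{4})(.{12})/\1-\2-\3-\4-\5/')

# posix keeps a hard link named after the gfid under .glusterfs/<aa>/<bb>/;
# for this file that link is what is missing.
link="$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid"
stat "$link" 2>/dev/null || echo "gfid link missing: $link"

# trusted.gfid2path.* values are hex-encoded "<parent-gfid>/<basename>" strings.
getfattr -d -m trusted.gfid2path -e hex "$FILE" 2>/dev/null |
    awk -F= '/gfid2path/{print substr($2,3)}' | xxd -r -p; echo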

[root@mn-0:/mnt/bricks/services/brick]
# cd .glusterfs/indices/
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices]
# ls
dirty  entry-changes  xattrop
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices]
# cd xattrop/
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
xattrop-6bf3890f-1c5e-4530-8153-df943157ba24
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# cd ../../4d/4e
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/4d/4e]
# ls
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/4d/4e]
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

services brick glusterfsd log on mn-0 node:
[2019-08-16 02:28:48.709852] I [MSGID: 115029] [server-handshake.c:529:server_setvolume] 0-services-server: accepted client from mn-0-3410-2019/08/16-02:28:48:669598-services-client-0-0-0 (version: 3.12.15)
[2019-08-16 02:28:49.711976] E [MSGID: 113001] [posix.c:4940:posix_getxattr] 0-services-posix: getxattr failed on /mnt/bricks/services/brick/.glusterfs/4d/4e/4d4e6385-0334-4672-b453-441d727b1f8e: trusted.glusterfs.dht.linkto  [No such file or directory]
[2019-08-16 02:28:50.043311] I [addr.c:55:compare_addr_and_update] 0-/mnt/bricks/services/brick: allowed = "*", received addr = "192.168.1.13"
[2019-08-16 02:28:50.043335] I [login.c:45:gf_auth] 0-auth/login: connecting user name: example ee certificate
[2019-08-16 02:28:50.043349] I [MSGID: 115029] [server-handshake.c:529:server_setvolume] 0-services-server: accepted client from mn-1-2655-2019/08/16-02:28:37:931958-services-client-0-0-0 (version: 3.12.15)

glustershd log on mn-0:
# tail -f /var/log/glusterfs/glustershd.log
[2019-08-16 02:38:53.553298] I [MSGID: 114047] [client-handshake.c:1163:client_setvolume_cbk] 0-services-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2019-08-16 02:38:53.553311] I [MSGID: 0] [client-handshake.c:979:client_post_handshake] 0-services-client-2: No fds to open - notifying all parents child up
[2019-08-16 02:38:53.553379] I [MSGID: 0] [afr-common.c:5361:afr_notify] 0-services-replicate-0: receive GF_EVENT_CHILD_UP envent(5)
[2019-08-16 02:38:53.553470] I [MSGID: 114035] [client-handshake.c:123:client_set_lk_version_cbk] 0-services-client-2: Server lk version = 1
[2019-08-16 02:38:53.553523] I [MSGID: 0] [afr-common.c:5478:afr_notify] 0-services-replicate-0: notify event 23
[2019-08-16 02:38:53.553639] I [MSGID: 0] [client-handshake.c:65:client_notify_parents_child_up] 0-services-client-2: notify chid up success
[2019-08-16 02:39:55.008078] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-services-replicate-0: got entry: 4d4e6385-0334-4672-b453-441d727b1f8e from services-client-0
[2019-08-16 02:39:55.009200] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-services-replicate-0: entry: path /db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp, gfid: 4d4e6385-0334-4672-b453-441d727b1f8e
[2019-08-16 02:39:55.009698] W [MSGID: 114031] [client-rpc-fops.c:2867:client3_3_lookup_cbk] 0-services-client-0: remote operation failed. Path: <gfid:4d4e6385-0334-4672-b453-441d727b1f8e> (4d4e6385-0334-4672-b453-441d727b1f8e) [No such file or directory]
[2019-08-16 02:39:55.010391] I [MSGID: 108026] [afr-self-heald.c:446:afr_shd_index_heal] 0-services-replicate-0: ret = -2,purge gfid:4d4e6385-0334-4672-b453-441d727b1f8e

questions:
1> it seems glustershd is trying to purge this entry but fails. Could gluster handle this kind of missing-link-in-.glusterfs case when purging entries?
2> why is there no clue in the gluster volume heal info output? If something were shown there, we could add a script to handle this situation; as it is, the only thing we can do is scan the whole disk to find this error, which is not a good solution.
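
A rough sketch of such a brick scan (an illustration only, assuming it is run as root directly against a brick root, here called BRICK):

BRICK=/mnt/bricks/services/brick
# Skip the .glusterfs housekeeping tree and check every regular data file.
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -print0 |
while IFS= read -r -d '' f; do
    hex=$(getfattr -n trusted.gfid -e hex "$f" 2>/dev/null |
          awk -F= '/^trusted.gfid=/{print substr($2,3)}')
    [ -z "$hex" ] && { echo "no trusted.gfid xattr: $f"; continue; }
    gfid=$(echo "$hex" | sed -E 's/(.{8})(.{4})(.{4})(.{4})(.{12})/\1-\2-\3-\4-\5/')
    # A healthy file also has a hard link .glusterfs/<aa>/<bb>/<gfid> to the same inode.
    [ -e "$BRICK/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid" ] ||
        echo "missing gfid link: $f ($gfid)"
done

A cheaper first pass is "find $BRICK -path $BRICK/.glusterfs -prune -o -type f -links 1 -print", since a regular data file on a brick normally has at least two links (its name plus the gfid hard link).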

Comment 2 Karthik U S 2019-08-16 09:16:42 UTC
Please provide the following data:
- getfattr output of the file from all the bricks
- getfattr output of the file's parent from all bricks
- check whether the gfid link file is present in the .glusterfs directory on any of the bricks
- stat output of the file from all the bricks
- logs of these bricks from all the nodes

Comment 3 zhou lin 2019-08-17 06:40:03 UTC
Created attachment 1605201 [details]
mn-0 node glustershd log + services brick glusterfsd log

Comment 4 zhou lin 2019-08-17 06:45:58 UTC
1> mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp exists only on the mn-0 node; it does not exist in the other storage nodes' brick dirs at all, so getfattr and stat output is not available on the other 2 storage nodes.
2> the parent dir getfattr result is not available now, but I am sure that on all 3 storage nodes the /mnt/bricks/services/brick/.glusterfs/indices/xattrop dir contains nothing (other than xattrop-****, which I think we can ignore).
3> I have uploaded the mn-0 glustershd and services volume glusterfsd logs; please refer to the attachments.
thanks!

Comment 5 zhou lin 2019-08-18 14:03:57 UTC
I think this issue has appeared again in my env:
[root@mn-0:/home/robot]
# gluster v heal services info
Brick mn-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

Brick mn-1.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

Brick dbm-0.local:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

///////////////
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# ls -la
total 44
drwxrwxr-x+ 2 root                root                4096 Aug 18 17:04 .
drwxrwx---+ 7 root                root                4096 Aug 18 14:08 ..
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
-rw-rwxr--+ 2 _nokrcpsysdbimredis _nokrcpsysdbimredis   17 Aug 18 17:04 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-r--r--  1 _nokrcpsysdbimredis _nokrcpsysdbimredis    0 Aug 18 17:04 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp

[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0x6fb86de80d8443e8a678e11fbaa6586f
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# ls -la
total 8
drwx------  2 root root 4096 Aug 18 14:08 .
drwx------ 37 root root 4096 Aug 18 17:07 ..
lrwxrwxrwx  1 root root   56 Aug 18 14:08 6fb86de8-0d84-43e8-a678-e11fbaa6586f -> ../../0c/80/0c804e38-2566-49a5-ad98-c0bfeb5c1a1f/upgrade
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# stat .
  File: .
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: fc05h/64517d	Inode: 49          Links: 2
Access: (0700/drwx------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 21:53:07.725000000 +0800
Modify: 2019-08-18 14:08:39.912000000 +0800
Change: 2019-08-18 14:08:39.912000000 +0800
 Birth: 2019-08-18 14:08:39.912000000 +0800
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]

[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
# file: mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
trusted.gfid=0x5f2dc6c63fc441ba9800a560b483de13
trusted.gfid2path.c577d1d55bde06cc=0x36666238366465382d306438342d343365382d613637382d6531316662616136353836662f6d6e2d315f5f6462696d2d72656469732e736572766963655f5f64617461626173652d6e6f73716c2d636d72656469732e73796e635f73746174652e746d70

[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# stat mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
  File: mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
  Size: 0         	Blocks: 8          IO Block: 4096   regular empty file
Device: fc05h/64517d	Inode: 8309        Links: 1
Access: (0644/-rw-r--r--)  Uid: (  724/_nokrcpsysdbimredis)   Gid: (  724/_nokrcpsysdbimredis)
Access: 2019-08-18 21:36:23.832000000 +0800
Modify: 2019-08-18 17:04:03.320000000 +0800
Change: 2019-08-18 17:04:03.320000000 +0800
 Birth: 2019-08-18 17:04:03.320000000 +0800
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
xattrop-20417b19-c17e-4223-bb18-ba6855fd80fe
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices]
# cd entry-changes/
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/entry-changes]
# ls
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/entry-changes]
# cd  ../dirty/
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/dirty]
# ls
dirty-20417b19-c17e-4223-bb18-ba6855fd80fe
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/indices/dirty]

[root@mn-0:/mnt/bricks/services/brick/.glusterfs]
# cd 6f/b8/
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# ls
6fb86de8-0d84-43e8-a678-e11fbaa6586f
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# stat 6fb86de8-0d84-43e8-a678-e11fbaa6586f/
  File: 6fb86de8-0d84-43e8-a678-e11fbaa6586f/
  Size: 4096      	Blocks: 16         IO Block: 4096   directory
Device: fc05h/64517d	Inode: 47          Links: 2
Access: (0775/drwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 17:04:46.800000000 +0800
Modify: 2019-08-18 17:04:45.649000000 +0800
Change: 2019-08-18 17:04:45.649000000 +0800
 Birth: 2019-08-18 14:08:39.912000000 +0800
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]


no link file for mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp exists
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/5f]
# ls
03  13  14  1b  23  24  27  2f  33  37  40  41  46  47  54  64  6a  6f  73  7b  88  89  92  93  96  9c  a7  ac  be  c3  cd  cf  d3  d4  da  dd  e1  e8  eb  f7
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/5f]
# cd 2d
bash: cd: 2d: No such file or directory
[root@mn-0:/mnt/bricks/services/brick/.glusterfs/5f]


///////////////////////////////mn-1 node///////////////////////
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# ls -la
total 40
drwxrwxr-x+ 2 root                root                4096 Aug 18 17:04 .
drwxrwx---+ 7 root                root                4096 Aug 18 14:08 ..
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
-rw-rwxr--+ 2 _nokrcpsysdbimredis _nokrcpsysdbimredis   17 Aug 18 17:04 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state
[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.afr.services-client-2=0x000000000000000000000000
trusted.gfid=0x6fb86de80d8443e8a678e11fbaa6586f
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@mn-1:/mnt/bricks/services/brick/db/upgrade]
# stat .
  File: .
  Size: 4096      	Blocks: 16         IO Block: 4096   directory
Device: fc05h/64517d	Inode: 154         Links: 2
Access: (0775/drwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 17:04:45.651000000 +0800
Modify: 2019-08-18 17:04:45.182000000 +0800
Change: 2019-08-18 17:04:45.182000000 +0800
 Birth: 2019-08-18 14:08:41.112000000 +0800

[root@mn-1:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# ls -la
total 8
drwx------  2 root root 4096 Aug 18 14:08 .
drwx------ 36 root root 4096 Aug 18 17:07 ..
lrwxrwxrwx  1 root root   56 Aug 18 14:08 6fb86de8-0d84-43e8-a678-e11fbaa6586f -> ../../0c/80/0c804e38-2566-49a5-ad98-c0bfeb5c1a1f/upgrade

[root@mn-1:/mnt/bricks/services/brick/.glusterfs]
# stat 6f/b8/6fb86de8-0d84-43e8-a678-e11fbaa6586f/
  File: 6f/b8/6fb86de8-0d84-43e8-a678-e11fbaa6586f/
  Size: 4096      	Blocks: 16         IO Block: 4096   directory
Device: fc05h/64517d	Inode: 154         Links: 2
Access: (0775/drwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 17:04:45.651000000 +0800
Modify: 2019-08-18 17:04:45.182000000 +0800
Change: 2019-08-18 17:04:45.182000000 +0800
 Birth: 2019-08-18 14:08:41.112000000 +0800
[root@mn-1:/mnt/bricks/services/brick/.glusterfs]
# cd indices/xattrop/
[root@mn-1:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
xattrop-b035b38e-2cb0-4c76-8053-87bb1fd74d88
[root@mn-1:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# 

/////////////////////dbm-0/////////////////////////////////


[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# ls -la
total 40
drwxrwxr-x+ 2 root                root                4096 Aug 18 17:04 .
drwxrwx---+ 7 root                root                4096 Aug 18 14:08 ..
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state
-rw-rwxr--+ 3 _nokrcpsysdbimredis _nokrcpsysdbimredis   16 Aug 18 17:04 mn-0__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
-rw-rwxr--+ 2 _nokrcpsysdbimredis _nokrcpsysdbimredis   17 Aug 18 17:04 mn-1__dbim-redis.service__database-nosql-cmredis.sync_state
[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000700d302000008000700d402000010000700ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.afr.services-client-1=0x000000000000000000000000
trusted.gfid=0x6fb86de80d8443e8a678e11fbaa6586f
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# stat .
  File: .
  Size: 4096      	Blocks: 16         IO Block: 4096   directory
Device: fc05h/64517d	Inode: 47          Links: 2
Access: (0775/drwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 21:57:34.874000000 +0800
Modify: 2019-08-18 17:04:44.991000000 +0800
Change: 2019-08-18 17:04:44.991000000 +0800
 Birth: 2019-08-18 14:08:39.573000000 +0800
[root@dbm-0:/mnt/bricks/services/brick/db/upgrade]
# cd ../../.glusterfs/
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs]
# cd 6f/b8
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# ls
6fb86de8-0d84-43e8-a678-e11fbaa6586f
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# ls -la
total 8
drwx------  2 root root 4096 Aug 18 14:08 .
drwx------ 37 root root 4096 Aug 18 17:07 ..
lrwxrwxrwx  1 root root   56 Aug 18 14:08 6fb86de8-0d84-43e8-a678-e11fbaa6586f -> ../../0c/80/0c804e38-2566-49a5-ad98-c0bfeb5c1a1f/upgrade
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# stat 6fb86de8-0d84-43e8-a678-e11fbaa6586f
  File: 6fb86de8-0d84-43e8-a678-e11fbaa6586f -> ../../0c/80/0c804e38-2566-49a5-ad98-c0bfeb5c1a1f/upgrade
  Size: 56        	Blocks: 0          IO Block: 4096   symbolic link
Device: fc05h/64517d	Inode: 50          Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-08-18 14:08:40.125000000 +0800
Modify: 2019-08-18 14:08:39.573000000 +0800
Change: 2019-08-18 14:08:39.573000000 +0800
 Birth: 2019-08-18 14:08:39.573000000 +0800
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/6f/b8]
# cd ../../indices/xattrop/
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls -la
total 16
drw------- 2 root root 12288 Aug 18 17:07 .
drw------- 5 root root  4096 Aug 18 14:08 ..
---------- 1 root root     0 Aug 18 17:07 xattrop-f1a53ad2-2a4a-4ed3-b48c-d384c214ba8a
[root@dbm-0:/mnt/bricks/services/brick/.glusterfs/indices/xattrop]

Comment 6 zhou lin 2019-08-18 14:11:02 UTC
Created attachment 1605463 [details]
problem-reproduced-complete-log

Comment 7 zhou lin 2019-08-18 14:12:40 UTC
From the mn-0 glustershd log, the same phenomenon as before:
[2019-08-18 09:04:45.970566] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-log-replicate-0: entry: path <gfid:4fe696ca-8a9e-4bcf-a1dc-a57c94d2bd35>, gfid: 4fe696ca-8a9e-4bcf-a1dc-a57c94d2bd35
[2019-08-18 09:04:45.971751] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-services-replicate-0: got entry: 5f2dc6c6-3fc4-41ba-9800-a560b483de13 from services-client-0
[2019-08-18 09:04:45.974127] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-services-replicate-0: entry: path /db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp, gfid: 5f2dc6c6-3fc4-41ba-9800-a560b483de13
[2019-08-18 09:04:45.975839] W [MSGID: 114031] [client-rpc-fops.c:2867:client3_3_lookup_cbk] 0-services-client-0: remote operation failed. Path: <gfid:5f2dc6c6-3fc4-41ba-9800-a560b483de13> (5f2dc6c6-3fc4-41ba-9800-a560b483de13) [No such file or directory]
[2019-08-18 09:04:45.976644] I [MSGID: 108026] [afr-self-heald.c:446:afr_shd_index_heal] 0-services-replicate-0: ret = -2,purge gfid:5f2dc6c6-3fc4-41ba-9800-a560b483de13

Comment 8 zhou lin 2019-08-19 12:05:32 UTC
Updated the product from Red Hat Gluster Storage to GlusterFS.
The version in use is glusterfs 3.12.15; however, there is no such version for me to choose.

Comment 9 Karthik U S 2019-08-19 12:48:16 UTC
From the outputs provided, the file "mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp" is present only on brick "mn-0.local:/mnt/bricks/services/brick", without the gfid link under the .glusterfs dir. On the other two bricks the file itself is not present. Looking at the state of the file, one possible scenario I can think of that could lead to this is:
- File creation succeeded on 1st brick
- Gfid assignment succeeds but gfid-link creation fails on that brick
- On the 2nd & 3rd brick file creation itself will fail
Are there frequent disconnects of the bricks?
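
A quick way to check for such disconnects, assuming the default log locations (the brick log name below is inferred from the usual brick-path naming convention), is to grep the glustershd and brick logs around the file's timestamps, e.g.:

grep -iE 'disconnect' /var/log/glusterfs/glustershd.log \
    /var/log/glusterfs/bricks/mnt-bricks-services-brick.log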

I do not see the following message in the source code in version 3.12.15.
[2019-08-18 09:04:45.976644] I [MSGID: 108026] [afr-self-heald.c:446:afr_shd_index_heal] 0-services-replicate-0: ret = -2,purge gfid:5f2dc6c6-3fc4-41ba-9800-a560b483de13
Is this something added by you or anyone from your team? Is this a source install or rpm install?
Give the outputs of "gluster --version" and "rpm -qa | grep gluster".
This is an INFO log according to the log level in the message. So how did you come to the conclusion that entry purge failed? Was this file present on the brick even after this message was logged?
SHD caught this entry during heal, so most probably this file or its parent had some pending marker on them. Or was a full heal run on this volume?
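
For reference, these pending markers are the trusted.afr.<volume>-client-N xattrs on the file and on its parent directory on each brick; each value packs three counters (data, metadata, entry), and a non-zero counter means a heal is pending against the corresponding brick. A minimal sketch of checking them, assuming ssh access to the three nodes in this report:

for h in mn-0.local mn-1.local dbm-0.local; do
    echo "== $h =="
    # All-zero values (as in the outputs in comment 5) mean no pending marker.
    ssh "$h" getfattr -d -m trusted.afr. -e hex \
        /mnt/bricks/services/brick/db/upgrade
done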

Comment 10 zhou lin 2019-08-21 01:20:09 UTC
This issue should have happened during the system startup phase; it is probable that connects/disconnects happened between client and server.
Yes, the reason I say the purge fails is that the file mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp is still present in the brick dir.
I have tried a full heal, but it still does not work.
I do not quite understand "parent had some pending marker on them"; is there any method to confirm this?

Comment 11 zhou lin 2019-08-21 01:29:10 UTC
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
# glusterfs --version
glusterfs 3.12.15
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
[root@mn-0:/mnt/bricks/services/brick/db/upgrade]
#

Comment 12 zhou lin 2019-08-21 01:51:55 UTC
When I enabled trace-level logging for the services brick process, I found that posix_lookup returns failure. Is it because of the missing link file? Could we do something to restore the missing link here?


services-changetimerecorder
[2019-08-21 01:38:08.304603] T [MSGID: 0] [changetimerecorder.c:360:ctr_lookup] 0-stack-trace: stack-address: 0x7f01b000b740, winding from services-changetimerecorder to services-trash
[2019-08-21 01:38:08.304607] T [MSGID: 0] [defaults.c:2574:default_lookup] 0-stack-trace: stack-address: 0x7f01b000b740, winding from services-trash to services-posix
[2019-08-21 01:38:08.304652] D [MSGID: 0] [posix.c:370:posix_lookup] 0-stack-trace: stack-address: 0x7f01b000b740, services-posix returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304663] D [MSGID: 0] [changetimerecorder.c:309:ctr_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-changetimerecorder returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304669] D [MSGID: 0] [bit-rot-stub.c:2930:br_stub_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-bitrot-stub returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304675] D [MSGID: 0] [posix-acl.c:1017:posix_acl_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-access-control returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304681] D [MSGID: 0] [posix.c:2639:pl_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-locks returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304697] D [MSGID: 0] [upcall.c:796:up_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-upcall returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304703] D [MSGID: 0] [defaults.c:1266:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-io-threads returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304709] T [marker.c:2913:marker_lookup_cbk] 0-services-marker: lookup failed with No such file or directory
[2019-08-21 01:38:08.304713] D [MSGID: 0] [marker.c:2948:marker_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-marker returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304720] D [MSGID: 0] [index.c:2015:index_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-index returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304726] D [MSGID: 0] [io-stats.c:2216:io_stats_lookup_cbk] 0-stack-trace: stack-address: 0x7f01b000b740, services-io-stats returned -1 error: No such file or directory [No such file or directory]
[2019-08-21 01:38:08.304734] D [MSGID: 115050] [server-rpc-fops.c:185:server_lookup_cbk] 0-services-server: 1438388: LOOKUP /db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp (5f2dc6c6-3fc4-41ba-9800-a560b483de13), client: mn-0-3851-2019/08/20-16:35:22:730333-services-client-0-0-0, error-xlator: services-posix [No such file or directory]

Comment 13 Karthik U S 2019-08-21 11:04:09 UTC
I tried it locally and it neither removes the entry nor creates the link file and replicates the file onto the other bricks.
If the file does not have the gfid xattr set either, it assigns the gfid and creates the file on the other bricks.
In this case the file has the xattr but the gfid link is missing. So while doing a lookup it sees the gfid, and when the lookup is done on that gfid it fails because the gfid link is missing.
I'm checking the code to come up with the best way to handle this case.
For the time being you can either delete the file (if it is not required), since it is not present on a quorum of bricks, or you can delete the "trusted.gfid" xattr on the brick and then do a lookup on the file. That should assign a new gfid, create the gfid-link file, and replicate it to the other bricks.
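
Sketched with the paths from this report (to be run on mn-0 only, and only after confirming the file really is absent on the other two bricks), the two workarounds would look roughly like:

# Option 1: drop the stray copy from the brick (it is 0 bytes and has no replicas).
rm /mnt/bricks/services/brick/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp

# Option 2: clear the stale gfid on the brick, then trigger a named lookup from
# the fuse mount so a new gfid, its .glusterfs link and the replicas are created.
setfattr -x trusted.gfid /mnt/bricks/services/brick/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp
stat /mnt/services/db/upgrade/mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp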

Comment 14 zhou lin 2019-08-22 01:17:46 UTC
Thanks for your reply.
I have tried removing mn-1__dbim-redis.service__database-nosql-cmredis.sync_state.tmp from the mn-0 brick dir, since its size is 0. However, there is no clue on the glusterfs user side to find this error until that file is accessed, because after the shd heal the entry in xattrop is removed and the gluster volume heal info output becomes empty. Is there any method to find this problem so that some action can be taken to repair it manually (like removing the file from the brick or deleting the "trusted.gfid" xattr)? In our system some applications need to use that file; from their point of view this is a bug, and I just hope there is some way to detect this issue and take action, so the app will not get stuck on this file. Of course, it would be better if the glusterfs source code itself could repair that file.

Comment 15 Karthik U S 2019-08-26 06:17:42 UTC
IMO the app should not get stuck (unless your application keeps retrying to access this file) if we hit this problem, as we error out with ENOENT (correct me if that is not the case in your environment). This problem most probably happened due to the scenario explained in comment #9. In that case we do not have any heal-pending markers for this file. If we hit this issue, currently there is no way to detect it other than checking the file manually and deleting the file/xattr. I will try to handle this in the code itself so that no manual effort is required.

Since this happens in reboot/startup scenarios, how often are restarts done in your environment and how often are you seeing this problem?

Comment 16 zhou lin 2019-08-26 07:51:08 UTC
This issue does not appear often; it has happened only twice so far.
The test case is hard-rebooting all 3 storage nodes repeatedly, each storage node once per minute.

Comment 17 zhou lin 2019-08-26 09:11:51 UTC
It would be good news for glusterfs users if this problem could be solved in the gluster source code, since scanning the whole volume for such files would be very expensive! I am looking forward to the patch, thanks!

Comment 18 Karthik U S 2019-08-26 09:23:50 UTC
Will update here once the patch is ready.

Comment 19 Worker Ant 2020-03-12 12:57:57 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/976, and will be tracked there from now on. Visit the GitHub issue URL for further details.

