Bug 1748205

Summary: null gfid entries can not be healed
Product: [Community] GlusterFS
Reporter: zhou lin <zz.sh.cynthia>
Component: selfheal
Assignee: bugs <bugs>
Status: CLOSED UPSTREAM
QA Contact:
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1
CC: bugs, pasik
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-12 12:14:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description          Flags
sn-1 glusterfs log   none
sn-0 glusterfs log   none

Description zhou lin 2019-09-03 07:32:59 UTC
Description of problem:

Some entries cannot be healed because their gfid is empty.
Version-Release number of selected component (if applicable):

3.12.15
# gluster v info services
 
Volume Name: services
Type: Replicate
Volume ID: 32b6bb97-4d0a-4096-9cfa-4cf0385bed31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 169.254.0.31:/mnt/bricks/services/brick
Brick2: 169.254.0.49:/mnt/bricks/services/brick
Options Reconfigured:
performance.client-io-threads: off
server.allow-insecure: on
network.ping-timeout: 42
cluster.consistent-metadata: on
cluster.favorite-child-policy: mtime
cluster.server-quorum-type: none
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
How reproducible:


Steps to Reproduce:
1. Start I/O on one glusterfs client node.
2. Hard reboot all 3 storage nodes (sn-0 and sn-1 each host a brick; sn-2 is the quorum node).
3. The problem appears intermittently.

Actual results:


Expected results:


Additional info:

1> "/" keeps showing up in the output of "gluster v heal services info"; it seems glustershd cannot finish healing "/" of the services volume. When I check the glustershd log on the sn-0 node, the output quoted below appears repeatedly.
2> One entry, fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t, exists only in the "/mnt/bricks/services/brick" directory on the sn-1 node (it does not exist on sn-0), and it has no xattrs.
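A check like the one in point 2 can be scripted. The following is a minimal sketch, not part of GlusterFS: the helper name entries_without_gfid is hypothetical, it assumes Linux (os.getxattr) and root privileges (the trusted.* namespace is not readable otherwise), and it only looks at the top level of a brick directory.

```python
import errno
import os

GFID_XATTR = "trusted.gfid"


def entries_without_gfid(directory, getxattr=None):
    """Return the names in `directory` that carry no trusted.gfid xattr.

    `getxattr` is injectable so the logic can be exercised without root;
    by default the Linux-only os.getxattr is used.
    """
    if getxattr is None:
        getxattr = os.getxattr
    missing = []
    for name in sorted(os.listdir(directory)):
        if name == ".glusterfs":
            # Internal bookkeeping directory of the brick, not a user entry.
            continue
        path = os.path.join(directory, name)
        try:
            getxattr(path, GFID_XATTR, follow_symlinks=False)
        except OSError as exc:
            if exc.errno == errno.ENODATA:
                # The xattr is absent: the "No data available" case.
                missing.append(name)
            else:
                raise
    return missing


if __name__ == "__main__":
    # Brick path taken from the report; adjust for other volumes.
    print(entries_without_gfid("/mnt/bricks/services/brick"))
```

Run on the sn-1 brick described here, such a scan would be expected to list the fstest_..._symlink_00_t entry, since getfattr prints nothing for it.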


[question]:
1> I checked the glusterfs heal-related code and did not find much difference between glusterfs 3.12.15 (which we are using) and the latest version, 6.5. Is this a known issue? Do you think it also exists in the latest version?
2> In this case "/" on sn-0 accuses sn-1, and the sn-0 shd tries to remove this entry from sn-1 but fails. Is this the error, and is it the cause of this issue?



[glustershd log on sn-0]:
[2019-09-03 07:10:50.003265] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-services-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from services-client-0
[2019-09-03 07:10:50.003476] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-services-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.006066] I [MSGID: 108026] [afr-self-heal-entry.c:893:afr_selfheal_entry_do] 0-services-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.017819] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-services-replicate-0: expunging file 00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t (00000000-0000-0000-0000-000000000000) on services-client-1
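
To spot this pattern in a long glustershd log, the expunge messages can be filtered for the all-zero gfid. This is a rough sketch: the regular expression is modeled only on the single afr_selfheal_entry_delete line quoted above, and the function name null_gfid_expunges is made up for illustration.

```python
import re

NULL_GFID = "00000000-0000-0000-0000-000000000000"

# Modeled on the "expunging file <parent-gfid>/<name> (<gfid>) on <subvol>"
# message quoted above; other message variants are not covered.
EXPUNGE_RE = re.compile(
    r"expunging file (?P<parent>[0-9a-f-]{36})/(?P<name>.+) "
    r"\((?P<gfid>[0-9a-f-]{36})\) on (?P<subvol>\S+)"
)


def null_gfid_expunges(lines):
    """Return (name, subvolume) for each expunge attempt with a null gfid."""
    hits = []
    for line in lines:
        match = EXPUNGE_RE.search(line)
        if match and match.group("gfid") == NULL_GFID:
            hits.append((match.group("name"), match.group("subvol")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < glustershd.log
    for name, subvol in null_gfid_expunges(sys.stdin):
        print(f"{name} on {subvol}")
```

Repeated hits for the same name would match the behaviour described in this report, where the same entry is expunged again on every heal cycle.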


[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# gluster v heal services info
Brick 169.254.0.31:/mnt/bricks/services/brick
/ 
Status: Connected
Number of entries: 1

Brick 169.254.0.49:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# ls -l
total 92
drwxr-xr-x  9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------  2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---  2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 db
drwx------  5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
drwx------  8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+ 2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+ 2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--  2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--  2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x  3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------  2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root@SN-0(RCP-1234) /mnt/bricks/services/brick]

[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x00000000000000000000010a
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31
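
For readers less familiar with the trusted.afr.* values: they are commonly read as three big-endian 32-bit counters for pending data, metadata, and entry operations, in that order. The sketch below decodes a value under that assumption; decode_afr_changelog is a hypothetical helper, not a GlusterFS API.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a 12-byte trusted.afr.* value into its three pending counters.

    Assumes the usual layout: data, metadata, entry, each a big-endian
    unsigned 32-bit integer.
    """
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError(f"expected 12 bytes, got {len(raw)}")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


if __name__ == "__main__":
    # Value of trusted.afr.services-client-1 on the sn-0 brick root.
    print(decode_afr_changelog("0x00000000000000000000010a"))
```

Under this reading, the value 0x...010a on sn-0 means 266 pending entry operations recorded against services-client-1, with no pending data or metadata operations.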

[root@SN-0(RCP-1234) /mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
00000000-0000-0000-0000-000000000001  xattrop-7006a00e-edbc-4e0c-862b-0c58b2974487

/////////////////////////////////////////////////
[root@SN-1(RCP-1234) /root]
# cd /mnt/bricks/services/brick/
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# ls -la
total 108
drwxr-xr-x+  22 root                    root                    4096 Sep  3 14:56 .
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:00 ..
drwxr-xr-x    9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------    2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---    2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 db
drwx------    5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
-rw-r--r--    1 root                    root                       0 Sep  3 14:33 fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
drw-------  263 root                    root                    4096 Sep  2 14:23 .glusterfs
drwx------    8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+   2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+   2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--    2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--    2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x    3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------    2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t 
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# stat fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t 
  File: fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: fd71h/64881d	Inode: 8767        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-09-03 14:33:48.468224170 +0800
Modify: 2019-09-03 14:33:48.468224170 +0800
Change: 2019-09-03 14:33:48.468224170 +0800
 Birth: 2019-09-03 14:33:48.468224170 +0800
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31

[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
The sn-1 services brick process log contains the following errors:

[2019-09-03 07:20:51.018870] E [MSGID: 113002] [posix.c:362:posix_lookup] 0-services-posix: buf->ia_gfid is null for /mnt/bricks/services/brick/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t [No data available]
[2019-09-03 07:20:51.018910] W [MSGID: 115005] [server-resolve.c:70:resolve_gfid_entry_cbk] 0-services-server: 00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t: failed to resolve (No data available) [No data available]

Comment 1 zhou lin 2019-09-03 07:40:58 UTC
Created attachment 1611038
sn-1 glusterfs log

Comment 2 zhou lin 2019-09-03 07:42:45 UTC
Created attachment 1611039
sn-0 glusterfs log

Comment 3 Worker Ant 2020-03-12 12:14:44 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/848 and will be tracked there from now on. Visit the GitHub issue URL for further details.