Bug 1748205 - null gfid entries cannot be healed
Summary: null gfid entries cannot be healed
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: selfheal
Version: 4.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-03 07:32 UTC by zhou lin
Modified: 2020-03-12 12:14 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:14:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
sn-1 glusterfs log (3.91 MB, application/gzip)
2019-09-03 07:40 UTC, zhou lin
sn-0 glusterfs log (7.83 MB, application/gzip)
2019-09-03 07:42 UTC, zhou lin

Description zhou lin 2019-09-03 07:32:59 UTC
Description of problem:

Some entries cannot be healed because of an empty gfid.
Version-Release number of selected component (if applicable):

3.12.15
# gluster v info services
 
Volume Name: services
Type: Replicate
Volume ID: 32b6bb97-4d0a-4096-9cfa-4cf0385bed31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 169.254.0.31:/mnt/bricks/services/brick
Brick2: 169.254.0.49:/mnt/bricks/services/brick
Options Reconfigured:
performance.client-io-threads: off
server.allow-insecure: on
network.ping-timeout: 42
cluster.consistent-metadata: on
cluster.favorite-child-policy: mtime
cluster.server-quorum-type: none
transport.address-family: inet
nfs.disable: on
cluster.server-quorum-ratio: 51%
How reproducible:


Steps to Reproduce:
1. Start I/O on one GlusterFS client node.
2. Hard-reboot all three storage nodes (sn-0 and sn-1 have bricks; sn-2 is the quorum node).
3. The problem sometimes appears (a rough reproduction sketch follows below).
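
A rough reproduction sketch, assuming the volume is mounted at /mnt/services on the client, that the I/O load is dd-based, and that the hard reboot is simulated via sysrq-trigger (all three are assumptions, not details from the report):

# On the glusterfs client node: keep continuous I/O going on the mounted volume.
while true; do
  dd if=/dev/urandom of=/mnt/services/testfile bs=1M count=10 conv=fsync 2>/dev/null
  rm -f /mnt/services/testfile
done &

# On each storage node (sn-0, sn-1, sn-2): simulate a hard reboot.
echo b > /proc/sysrq-trigger

# After the nodes come back up, check whether "/" stays pending in heal info.
gluster volume heal services info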

Actual results:


Expected results:


Additional info:

1>"/" keeps showing up in command "gluster v heal services info",seems glustershd can not finish healing this "/" of services volume. when i check the glutershd log on sn-0 node, there are following output, repeatedly.
2>there is one entry fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t existing only on sn-1 node(not exist on sn-0 node) "/mnt/bricks/services/brick" directory, and the xattr of it is empty.


[Questions]:
1> I checked the heal-related GlusterFS code and did not find much difference between 3.12.15 (which we are using) and the latest version, 6.5. Is this a known issue? Do you think it also exists in the latest version?
2> In this case, "/" on sn-0 accuses sn-1, and the sn-0 shd tries to remove this entry from sn-1 but fails. Is this the error, and is it the cause of this issue? (A quick log check is sketched below.)
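
One way to check question 2 from the sn-1 side is to grep the brick log for the entry that the sn-0 shd keeps trying to expunge; the log path below follows the default brick log naming and is an assumption:

grep fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t \
    /var/log/glusterfs/bricks/mnt-bricks-services-brick.log | tail
# Repeated "failed to resolve (No data available)" lines (as in the sn-1 brick
# log excerpt later in this report) would mean the expunge attempt from sn-0
# keeps failing on the null-gfid entry.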



[glustershd log on sn-0]:
[2019-09-03 07:10:50.003265] I [MSGID: 108026] [afr-self-heald.c:432:afr_shd_index_heal] 0-services-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from services-client-0
[2019-09-03 07:10:50.003476] I [MSGID: 108026] [afr-self-heald.c:341:afr_shd_selfheal] 0-services-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.006066] I [MSGID: 108026] [afr-self-heal-entry.c:893:afr_selfheal_entry_do] 0-services-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2019-09-03 07:10:50.017819] W [MSGID: 108015] [afr-self-heal-entry.c:56:afr_selfheal_entry_delete] 0-services-replicate-0: expunging file 00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t (00000000-0000-0000-0000-000000000000) on services-client-1


[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# gluster v heal services info
Brick 169.254.0.31:/mnt/bricks/services/brick
/ 
Status: Connected
Number of entries: 1

Brick 169.254.0.49:/mnt/bricks/services/brick
Status: Connected
Number of entries: 0

[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# ls -l
total 92
drwxr-xr-x  9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------  2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---  2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x  3 root                    root                    4096 Sep  2 14:04 db
drwx------  5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
drwx------  8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+ 2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+ 2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--  2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--  2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x  3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------  2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x  4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x  2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root@SN-0(RCP-1234) /mnt/bricks/services/brick]

[root@SN-0(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-1=0x00000000000000000000010a
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31
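
For reference, a trusted.afr.<client> value packs three big-endian 32-bit pending counters: data, metadata and entry, in that order. The sn-0 value for services-client-1 above ends in 0x10a, i.e. 266 pending entry operations against sn-1, which matches sn-0 accusing sn-1 for "/". A minimal bash sketch to split the value (the variable name is mine; the counter layout is the usual AFR changelog format):

val=00000000000000000000010a   # trusted.afr.services-client-1 without the 0x prefix
printf 'data=%d metadata=%d entry=%d\n' \
    0x${val:0:8} 0x${val:8:8} 0x${val:16:8}
# prints: data=0 metadata=0 entry=266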

[root@SN-0(RCP-1234) /mnt/bricks/services/brick/.glusterfs/indices/xattrop]
# ls
00000000-0000-0000-0000-000000000001  xattrop-7006a00e-edbc-4e0c-862b-0c58b2974487

/////////////////////////////////////////////////
[root@SN-1(RCP-1234) /root]
# cd /mnt/bricks/services/brick/
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# ls -la
total 108
drwxr-xr-x+  22 root                    root                    4096 Sep  3 14:56 .
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:00 ..
drwxr-xr-x    9 _nokfssysalarmprocessor _nokfssysalarmprocessor 4096 Sep  2 14:04 AlarmFileSystem
drw-------    2 root                    root                    4096 Sep  2 14:03 backup
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 CLM
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 cmf
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:03 commandcalendar
drwxrwx---    2 root                    _nokrcpsysdcif          4096 Sep  2 14:06 commoncollector
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 coredumper
drwxr-xr-x    3 root                    root                    4096 Sep  2 14:04 db
drwx------    5 root                    root                    4096 Sep  2 14:04 EventCorrelationEngine
-rw-r--r--    1 root                    root                       0 Sep  3 14:33 fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
drw-------  263 root                    root                    4096 Sep  2 14:23 .glusterfs
drwx------    8 root                    root                    4096 Sep  2 14:15 hypertracer
drwxrwx---+   2 root                    root                    4096 Sep  2 14:02 LCM
drwxr-xr-x+   2 root                    root                    4096 Sep  2 14:01 LDAPUserInfo
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:01 lightcm
-rw-r--r--    2 root                    root                       0 Sep  2 14:01 LMN-0_recover_flag
-rw-r--r--    2 root                    root                       0 Sep  2 14:04 LMN-1_recover_flag
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:05 lockd
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:04 Log
drwxr-xr-x    3 _nokfssyspm9            _nokfssyspm9            4096 Sep  2 14:04 PM9
drw-------    2 root                    root                    4096 Sep  2 14:03 RCP_Backup
drwxr-xr-x    4 root                    root                    4096 Sep  2 14:04 RCPPTEngine
drwxr-xr-x    2 root                    root                    4096 Sep  2 14:01 TestDBDump
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t 
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# stat fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t 
  File: fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: fd71h/64881d	Inode: 8767        Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-09-03 14:33:48.468224170 +0800
Modify: 2019-09-03 14:33:48.468224170 +0800
Change: 2019-09-03 14:33:48.468224170 +0800
 Birth: 2019-09-03 14:33:48.468224170 +0800
[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
# getfattr -m . -d -e hex .
# file: .
system.posix_acl_access=0x0200000001000700ffffffff04000500ffffffff08000500f103000010000500ffffffff20000500ffffffff
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.services-client-0=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0x32b6bb974d0a40969cfa4cf0385bed31

[root@SN-1(RCP-1234) /mnt/bricks/services/brick]
In the sn-1 services brick process log, the following errors are printed:

[2019-09-03 07:20:51.018870] E [MSGID: 113002] [posix.c:362:posix_lookup] 0-services-posix: buf->ia_gfid is null for /mnt/bricks/services/brick/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t [No data available]
[2019-09-03 07:20:51.018910] W [MSGID: 115005] [server-resolve.c:70:resolve_gfid_entry_cbk] 0-services-server: 00000000-0000-0000-0000-000000000001/fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t: failed to resolve (No data available) [No data available]
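
Not from the report itself, but a commonly used manual cleanup for this situation: once the brick-only entry is confirmed to have no trusted.gfid xattr and no .glusterfs hard link (the stat output above shows Links: 1), remove it directly from the sn-1 brick and re-trigger the heal. A hedged sketch:

# On sn-1: confirm the entry really has no gfid xattr and only one hard link.
cd /mnt/bricks/services/brick
getfattr -n trusted.gfid fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
stat -c '%h' fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t

# If both checks confirm a stale null-gfid leftover, remove it from the brick
# and let glustershd redo the entry heal on "/".
rm fstest_6491509c4500d56f6fc4a621efc970bd___symlink_00_t
gluster volume heal services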

Comment 1 zhou lin 2019-09-03 07:40:58 UTC
Created attachment 1611038 [details]
sn-1 glusterfs log

Comment 2 zhou lin 2019-09-03 07:42:45 UTC
Created attachment 1611039 [details]
sn-0 glusterfs log

Comment 3 Worker Ant 2020-03-12 12:14:44 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/848 and will be tracked there from now on. Visit the GitHub issue URL for further details.

