Description of problem:

Some directories and files can't be seen by the clients. They are there, they can be entered, but they don't show up in ls. If moved, they appear. If moved back to the old name, they disappear again.

On the server:

> The directory exists on all bricks and contains some files.
$ cd /$path_to_brick/data/dir1/
$ ls
dir2
$ find /$path_to_brick/data/dir1/
/$path_to_brick/data/dir1/dir2
/$path_to_brick/data/dir1/dir2/file1
/$path_to_brick/data/dir1/dir2/file2

On all clients:

> The directory seems empty.
$ cd /$path_to_mount/dir1/
$ ls
$

> We can list the files in the invisible directory.
$ ls dir2
file1  file2

> We can move the directory to make it reappear.
$ mv dir2 dir2_new
$ ls
dir2_new

> But after moving it back, it disappears again.
$ mv dir2_new dir2
$ ls
$

> We can't remove the affected directory.
$ cd; rm -rf /$path_to_mount/dir1/
rm: cannot remove ‘/$path_to_mount/dir1/dir2’: Directory not empty
rm: cannot remove ‘/$path_to_mount/dir1/dir3’: Directory not empty

Some files in the affected dir show the same behaviour.

What we tried so far: setting "performance.readdir-ahead" to off didn't change the behaviour.

Version-Release number of selected component (if applicable):

$ gluster --version
glusterfs 3.12.7
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

$ dpkg -l | grep glusterfs-server
ii  glusterfs-server  3.12.7-1  amd64  clustered file-system (server package)

$ cat /etc/apt/sources.list.d/gluster.list
deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/3.12/LATEST/Debian/stretch/amd64/apt stretch main

How reproducible:

Seems to be random. We weren't able to break a directory on purpose. An affected directory didn't return to normal.

Additional info:

Gluster is installed on Debian Stretch from https://download.gluster.org. It was updated from Gluster 3.8 a while ago, and one day before the first observed occurrence it was updated from 3.12.6 to 3.12.7. A clean reboot of all GlusterFS servers happened during this update. According to the user who owns the affected directory, it was created after the update. The volume is mounted via FUSE.
$ gluster volume info $volume

Volume Name: $volume
Type: Distribute
Volume ID: 0d210c70-e44f-46f1-862c-ef260514c9f1
Status: Started
Snapshot Count: 0
Number of Bricks: 23
Transport-type: tcp
Bricks:
Brick1: gluster02:/srv/glusterfs/bricks/DATA201/data
Brick2: gluster02:/srv/glusterfs/bricks/DATA202/data
Brick3: gluster02:/srv/glusterfs/bricks/DATA203/data
Brick4: gluster02:/srv/glusterfs/bricks/DATA204/data
Brick5: gluster02:/srv/glusterfs/bricks/DATA205/data
Brick6: gluster02:/srv/glusterfs/bricks/DATA206/data
Brick7: gluster02:/srv/glusterfs/bricks/DATA207/data
Brick8: gluster02:/srv/glusterfs/bricks/DATA208/data
Brick9: gluster01:/srv/glusterfs/bricks/DATA110/data
Brick10: gluster01:/srv/glusterfs/bricks/DATA111/data
Brick11: gluster01:/srv/glusterfs/bricks/DATA112/data
Brick12: gluster01:/srv/glusterfs/bricks/DATA113/data
Brick13: gluster01:/srv/glusterfs/bricks/DATA114/data
Brick14: gluster02:/srv/glusterfs/bricks/DATA209/data
Brick15: gluster01:/srv/glusterfs/bricks/DATA101/data
Brick16: gluster01:/srv/glusterfs/bricks/DATA102/data
Brick17: gluster01:/srv/glusterfs/bricks/DATA103/data
Brick18: gluster01:/srv/glusterfs/bricks/DATA104/data
Brick19: gluster01:/srv/glusterfs/bricks/DATA105/data
Brick20: gluster01:/srv/glusterfs/bricks/DATA106/data
Brick21: gluster01:/srv/glusterfs/bricks/DATA107/data
Brick22: gluster01:/srv/glusterfs/bricks/DATA108/data
Brick23: gluster01:/srv/glusterfs/bricks/DATA109/data
Options Reconfigured:
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
diagnostics.brick-log-level: ERROR
performance.readdir-ahead: on
auth.allow: $OUR_IP_SPACE
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on

Nothing interesting showed up in the brick logs, just the normal boot messages as always.
(In reply to g.amedick from comment #0)
> Description of problem:
>
> Some directories and files can't be seen by the clients. They are there,
> they can be entered but they don't show up in ls. If moved, they appear. If
> moved back to the old name, they disappear again.
>
> On the server:
>
> > The directory exists on all bricks and contains some files.
> $ cd /$path_to_brick/data/dir1/
> $ ls
> dir2
> $ find /$path_to_brick/data/dir1/
> /$path_to_brick/data/dir1/dir2
> /$path_to_brick/data/dir1/dir2/file1
> /$path_to_brick/data/dir1/dir2/file2

I see that there are 23 bricks, but only 3 are listed here. Is it possible to figure out:
1. whether dir2 exists on all 23 bricks
2. the output of getfattr -e hex -m. -d dir1 on all bricks

It looks to be an issue of dir2 missing on some bricks.
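Something like this on each server should collect that in one go (a rough sketch; the brick paths are taken from the volume info above, and $dir stands for the path of dir1 below the brick root):

# run on gluster01 and gluster02:
for b in /srv/glusterfs/bricks/DATA*/data; do
    echo "== $b =="
    getfattr -e hex -m. -d "$b/$dir"
done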
Created attachment 1417986 [details]
Requested output: list of getfattr for all directories
Hi,

1. I can see all three affected folders on every brick.
2. We have so far found three affected dirs (dir2, dir3, dir4). I've attached the info.

I found another curious thing while verifying that all directories exist on all bricks: as said, we have "hidden" files, too. We can rename them without getting an error, but they don't reappear. They do get renamed on the brick. I also noticed that a copy of the file, with the same name, was created on another brick. Neither of them shows up on the client side.
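We also wondered whether the copy on the other brick is real data or just a DHT pointer file. In case it helps, this is roughly how we would check (paths are placeholders; the xattr name and the expected mode are our assumption):

# on the brick holding the unexpected copy; a pointer file is typically empty with mode ---------T
$ ls -l /$path_to_brick/data/$path/$renamed_file
$ getfattr -n trusted.glusterfs.dht.linkto -e text /$path_to_brick/data/$path/$renamed_file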
Hi,

I'm not sure if this is relevant to this case, but as a colleague of the reporter I found another strange thing on the same volume.

(on gluster02:)
# gluster volume rebalance $volumename status
        Node   Rebalanced-files        size      scanned    failures     skipped      status   run time in h:m:s
   ---------        -----------   ---------  -----------  ----------  ----------  ----------  ------------------
   localhost              11421   16384.0PB      2629168           0      113066   completed            17:16:58
   gluster01              18936     770.3MB      2710359           0      203851   completed            17:16:58
volume rebalance: $volumename: success

This rebalance ran weeks ago and we didn't notice any error. 16384.0PB is way too much and looks like an error. We haven't started any rebalance since the reboot (and the first notification of the bug) due to concerns about data safety. Do you think running a new rebalance might help (and hopefully fix this issue), or would you advise against it?
Hi,

a fourth affected directory appeared, and there is something interesting about it. In the first three directories there were ~10 files, about 2/3 of them "hidden", and they were pretty much scattered over the bricks. The new affected directory has about 2,000 "hidden" files, and they are all located on 4 bricks on the same server (gluster02). On all other bricks the folder is empty. We don't know whether that's a coincidence or not.

Is there any additional information we could provide that might be helpful? Any opinions on doing a rebalance?
Hi,

the affected dirs seem to heal by themselves after 2-4 days, but new ones keep popping up regularly. We increased the log level from error to warning, and entries like the following show up on pretty much every brick (paths of user directories replaced with $dir_…; we haven't seen files affected by the bug, but we don't know whether all hidden-directory appearances are reported to us, or whether we increased the log level before or after the last reported directory broke):

[2018-04-16 12:34:33.643937] W [MSGID: 120020] [quota.c:2755:quota_rename_continue] $volume-quota: quota context not set in inode (gfid:f44e77bc-a54c-4cb5-9f70-c581ed270f2d), considering file size as zero while enforcing quota on new ancestry
[2018-04-16 12:44:33.979176] W [MSGID: 113103] [posix.c:282:posix_lookup] $volume-posix: Found stale gfid handle /srv/glusterfs/bricks/DATA111/data/.glusterfs/93/fa/93fa35bb-22fe-40dc-9415-f08186ab1c93, removing it. [Stale file handle]
[2018-04-17 09:05:44.438907] A [MSGID: 120004] [quota.c:4998:quota_log_usage] $volume-quota: Usage is above soft limit: 187.4TB used by /$dir_1
[2018-04-17 18:34:52.084247] W [MSGID: 113096] [posix-handle.c:770:posix_handle_hard] $volume-posix: link /srv/glusterfs/bricks/DATA111/data/$file_1 -> /srv/glusterfs/bricks/DATA111/data/.glusterfs/10/21/1021cecf-08dc-48ab-a44e-6542cc8e75acfailed [File exists]
[2018-04-17 18:34:52.084325] E [MSGID: 113020] [posix.c:3162:posix_create] $volume-posix: setting gfid on /srv/glusterfs/bricks/DATA111/data/$file_1 failed
[2018-04-17 20:57:10.613860] W [MSGID: 113001] [posix.c:4421:posix_get_ancestry_non_directory] $volume-posix: listxattr failed on/srv/glusterfs/bricks/DATA111/data/.glusterfs/8d/7d/8d7dc368-b229-4d41-921c-546627a03248 [No such file or directory]
[2018-04-17 20:57:10.614719] W [marker-quota.c:33:mq_loc_copy] 0-marker: src loc is not valid
[2018-04-17 20:57:10.614818] E [marker-quota.c:1488:mq_initiate_quota_task] $volume-marker: loc copy failed
The message "W [MSGID: 113001] [posix.c:4421:posix_get_ancestry_non_directory] $volume-posix: listxattr failed on/srv/glusterfs/bricks/DATA111/data/.glusterfs/8d/7d/8d7dc368-b229-4d41-921c-546627a03248 [No such file or directory]" repeated 1300 times between [2018-04-17 20:57:10.613860] and [2018-04-17 20:57:11.536419]
[2018-04-17 21:34:42.809053] W [MSGID: 113096] [posix-handle.c:770:posix_handle_hard] $volume-posix: link /srv/glusterfs/bricks/DATA111/data/$file_2 -> /srv/glusterfs/bricks/DATA111/data/.glusterfs/21/db/21db009b-aa53-4d26-afb1-8f1667574530failed [File exists]
[2018-04-17 21:34:42.809115] E [MSGID: 113020] [posix.c:3162:posix_create] $volume-posix: setting gfid on /srv/glusterfs/bricks/DATA111/data/$file_2 failed
[2018-04-17 21:34:42.809944] E [MSGID: 113018] [posix.c:552:posix_setattr] $volume-posix: setattr (lstat) on /srv/glusterfs/bricks/DATA111/data/.glusterfs/21/db/21db009b-aa53-4d26-afb1-8f1667574530 failed [No such file or directory]
[2018-04-17 21:34:42.811179] E [MSGID: 113001] [posix.c:4874:posix_getxattr] $volume-posix: getxattr failed on /srv/glusterfs/bricks/DATA111/data/.glusterfs/21/db/21db009b-aa53-4d26-afb1-8f1667574530: trusted.glusterfs.dht.linkto [No such file or directory]
[2018-04-18 09:08:02.273714] A [MSGID: 120004] [quota.c:4998:quota_log_usage] $volume-quota: Usage is above soft limit: 188.8TB used by /dir_1
[2018-04-18 10:50:54.072890] A [MSGID: 120004] [quota.c:4998:quota_log_usage] $volume-quota: Usage is above soft limit: 4.0TB used by /dir_2
[2018-04-18 10:50:54.073972] A [MSGID: 120004] [quota.c:4998:quota_log_usage] $volume-quota: Usage is above soft limit: 4.0TB used by /dir_2
[2018-04-18 11:20:12.880347] W [MSGID: 120020] [quota.c:2755:quota_rename_continue] $volume-quota: quota context not set in inode (gfid:367de5fb-c7c3-4bde-a8fa-a3a2cafc6abc), considering file size as zero while enforcing quota on new ancestry
[2018-04-18 11:20:16.865349] W [MSGID: 120020] [quota.c:2755:quota_rename_continue] $volume-quota: quota context not set in inode (gfid:cd8877f1-c5d5-47bd-8a60-c3224c13e724), considering file size as zero while enforcing quota on new ancestry
[2018-04-18 11:20:17.510650] W [MSGID: 120020] [quota.c:2755:quota_rename_continue] $volume-quota: quota context not set in inode (gfid:4eb88aac-8e95-4eac-8d48-1615a135efcd), considering file size as zero while enforcing quota on new ancestry

We are unsure whether this is related or relevant.
Please provide the following:

1. The FUSE mount options being used
2. A tcpdump: from a fresh tmp FUSE mount, perform an ls on a directory with invisible files/dirs while capturing with
   tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22
3. The output of gluster volume status
4. The FUSE mount logs
5. The actual location of the invisible files on the bricks (ls -l on the brick) and the xattrs and stat info for the files and parent directory (see the sketch below).
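For item 5, something along these lines on each brick that holds an invisible entry should capture what is needed (paths are placeholders):

# on the brick containing the invisible file/dir:
$ ls -l /$path_to_brick/data/$parent_dir
$ stat /$path_to_brick/data/$parent_dir
$ getfattr -d -m - -e hex /$path_to_brick/data/$parent_dir
$ stat /$path_to_brick/data/$parent_dir/$invisible_entry
$ getfattr -d -m - -e hex /$path_to_brick/data/$parent_dir/$invisible_entry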
Hi,

as I said, the directories seem to heal over time. We currently don't know of a "hidden" folder. There are "hidden" files though; I'll proceed with those and hope it helps. I created a new mount at /mnt on a test compute node.

1.
$ mount | grep /mnt
gluster01.FQDN:/$vol1 on /mnt type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

$ systemctl status mnt.mount
● mnt.mount
   Loaded: loaded (/etc/systemd/system/mnt.mount; disabled)
   Active: active (mounted) since Tue 2018-04-24 16:52:06 CEST; 2 days ago
    Where: /mnt
     What: gluster01.FQDN:/$vol1
  Process: 3104 ExecMount=/bin/mount -n gluster01.FQDN:/$vol1 /mnt -t glusterfs -o defaults,_netdev,backupvolfile-server=gluster02.FQDN (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/mnt.mount
           └─3173 /usr/sbin/glusterfs --volfile-server=gluster01.FQDN --volfile-server=gluster02.FQDN --volfile-id=/$vol1 /mnt

2. see attachment.

3.
$ gluster volume status
Status of volume: $vol1
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/DATA201/data    49152     0          Y       4064
Brick gluster02:/srv/glusterfs/bricks/DATA202/data    49153     0          Y       4072
Brick gluster02:/srv/glusterfs/bricks/DATA203/data    49154     0          Y       4080
Brick gluster02:/srv/glusterfs/bricks/DATA204/data    49155     0          Y       4090
Brick gluster02:/srv/glusterfs/bricks/DATA205/data    49156     0          Y       4098
Brick gluster02:/srv/glusterfs/bricks/DATA206/data    49157     0          Y       4107
Brick gluster02:/srv/glusterfs/bricks/DATA207/data    49158     0          Y       4116
Brick gluster02:/srv/glusterfs/bricks/DATA208/data    49159     0          Y       4125
Brick gluster01:/srv/glusterfs/bricks/DATA110/data    49152     0          Y       4418
Brick gluster01:/srv/glusterfs/bricks/DATA111/data    49153     0          Y       4426
Brick gluster01:/srv/glusterfs/bricks/DATA112/data    49154     0          Y       4434
Brick gluster01:/srv/glusterfs/bricks/DATA113/data    49155     0          Y       4444
Brick gluster01:/srv/glusterfs/bricks/DATA114/data    49156     0          Y       4452
Brick gluster02:/srv/glusterfs/bricks/DATA209/data    49160     0          Y       4134
Brick gluster01:/srv/glusterfs/bricks/DATA101/data    49157     0          Y       4461
Brick gluster01:/srv/glusterfs/bricks/DATA102/data    49158     0          Y       4470
Brick gluster01:/srv/glusterfs/bricks/DATA103/data    49159     0          Y       4479
Brick gluster01:/srv/glusterfs/bricks/DATA104/data    49160     0          Y       4488
Brick gluster01:/srv/glusterfs/bricks/DATA105/data    49161     0          Y       4498
Brick gluster01:/srv/glusterfs/bricks/DATA106/data    49162     0          Y       4507
Brick gluster01:/srv/glusterfs/bricks/DATA107/data    49163     0          Y       4516
Brick gluster01:/srv/glusterfs/bricks/DATA108/data    49164     0          Y       4525
Brick gluster01:/srv/glusterfs/bricks/DATA109/data    49165     0          Y       4533
Quota Daemon on localhost                             N/A       N/A        Y       4041
Quota Daemon on gluster03.FQDN                        N/A       N/A        Y       701
Quota Daemon on gluster04.FQDN                        N/A       N/A        Y       810
Quota Daemon on gluster05.FQDN                        N/A       N/A        Y       3011
Quota Daemon on gluster01                             N/A       N/A        Y       4393

Task Status of Volume $vol1
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 326d0a79-98e7-4e7a-9ae1-6fc5e33663ae
Status               : failed

Status of volume: $vol2
Gluster process                                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/SRV_CLOUD_201/data    49161     0          Y       4143

Task Status of Volume $vol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: $vol3
Gluster process                                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/SRV_HOME_201/data     49162     0          Y       4152
Quota Daemon on localhost                                   N/A       N/A        Y       4041
Quota Daemon on gluster04.FQDN                              N/A       N/A        Y       810
Quota Daemon on gluster03.FQDN                              N/A       N/A        Y       701
Quota Daemon on gluster01                                   N/A       N/A        Y       4393
Quota Daemon on gluster05.FQDN                              N/A       N/A        Y       3011

Task Status of Volume $vol3
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: $vol4
Gluster process                                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/SRV_SLURM_201/data    49163     0          Y       4161

Task Status of Volume $vol4
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: $vol5
Gluster process                                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster05.FQDN:/srv/glusterfs/bricks/TEST001/data     N/A       N/A        N       N/A

Task Status of Volume $vol5
------------------------------------------------------------------------------
There are no active volume tasks

Volume TEST_DISPERSED is not started

Volumes 1, 2, 3 and 4 are productive. We haven't received any reports about errors on 2-4. They are small though and don't have a high load, so we don't know whether they are affected or not.

4. see attachments

5. The volume is part of a cluster that does genome analysis. I'm afraid I can't publish the complete path, it contains sensitive information. But it only contains alphanumeric characters, "." and "_".

1st file:
root@gluster02:~# getfattr -d -m - /srv/glusterfs/bricks/DATA202/data/$PATH/sd.bin
# file: srv/glusterfs/bricks/DATA202/data/$PATH/sd.bin
trusted.gfid=0s85TbqbmpQoG/3BV5LbJwxg==
trusted.gfid2path.6ccfa9a95c18c513="3847d58a-0225-4be2-8ba6-a7fcaf16dcf2/sd.bin"
trusted.glusterfs.quota.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2.contri.1=0sAAAAAAAAfgAAAAAAAAAAAQ==
trusted.pgfid.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2=0sAAAAAQ==

root@gluster02:~# stat /srv/glusterfs/bricks/DATA202/data/$PATH/sd.bin
  File: /srv/glusterfs/bricks/DATA202/data/$PATH/sd.bin
  Size: 32058       Blocks: 72         IO Block: 4096   regular file
Device: fe11h/65041d    Inode: 34550135635  Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 1029/   $user)   Gid: ( 1039/$group)
Access: 2018-04-24 16:53:47.688932475 +0200
Modify: 2018-03-27 09:11:01.000000000 +0200
Change: 2018-04-24 13:32:26.357256496 +0200
 Birth: -

2nd file:
root@gluster02:~# getfattr -d -m - /srv/glusterfs/bricks/DATA202/data/$PATH/pairtable.bin
# file: srv/glusterfs/bricks/DATA202/data/$PATH/pairtable.bin
trusted.gfid=0sGGS421fzQpquDiz3KTaO1g==
trusted.gfid2path.5b44f1b5ab80e888="3847d58a-0225-4be2-8ba6-a7fcaf16dcf2/pairtable.bin"
trusted.glusterfs.quota.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2.contri.1=0sAAAAAAAABgAAAAAAAAAAAQ==
trusted.pgfid.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2=0sAAAAAQ==

root@gluster02:~# stat /srv/glusterfs/bricks/DATA202/data/$PATH/pairtable.bin
  File: /srv/glusterfs/bricks/DATA202/data/$PATH/pairtable.bin
  Size: 1054        Blocks: 16         IO Block: 4096   regular file
Device: fe11h/65041d    Inode: 34550135634  Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 1029/   $user)   Gid: ( 1039/$group)
Access: 2018-04-24 13:29:51.615393077 +0200
Modify: 2018-03-27 09:11:04.000000000 +0200
Change: 2018-04-24 13:32:26.357256496 +0200
 Birth: -

3rd file:
root@gluster02:~# getfattr -d -m - /srv/glusterfs/bricks/DATA201/data/$PATH/seqdata.bin
# file: srv/glusterfs/bricks/DATA201/data/$PATH/seqdata.bin
trusted.gfid=0soL+uP9hOTWyo3Z3+cLOa6w==
trusted.gfid2path.91ad63dbe24d5d40="3847d58a-0225-4be2-8ba6-a7fcaf16dcf2/seqdata.bin"
trusted.glusterfs.quota.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2.contri.1=0sAAAAAAAJ+AAAAAAAAAAAAQ==
trusted.pgfid.3847d58a-0225-4be2-8ba6-a7fcaf16dcf2=0sAAAAAQ==

root@gluster02:~# stat /srv/glusterfs/bricks/DATA201/data/$PATH/seqdata.bin
  File: /srv/glusterfs/bricks/DATA201/data/$PATH/seqdata.bin
  Size: 653142      Blocks: 1288       IO Block: 4096   regular file
Device: fe10h/65040d    Inode: 34385264557  Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 1029/   $user)   Gid: ( 1039/$group)
Access: 2018-04-24 16:53:29.588711695 +0200
Modify: 2018-03-27 09:11:03.000000000 +0200
Change: 2018-04-24 13:32:26.357256496 +0200
 Birth: -

There's another thing that happened. We started the rebalance and, as you can see, it failed on gluster02. This is the part of the log where it failed:

[2018-04-24 18:55:08.253990] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-$vol1-client-7: server $IP_gluster02:49159 has not responded in the last 42 seconds, disconnecting.
[2018-04-24 18:55:08.254210] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-$vol1-client-7: disconnected from $vol1-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-24 18:55:08.254260] W [MSGID: 109073] [dht-common.c:9315:dht_notify] 0-$vol1-dht: Received CHILD_DOWN. Exiting
[2018-04-24 18:55:08.254283] I [MSGID: 109029] [dht-rebalance.c:5283:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-04-24 18:55:08.254620] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13e)[0x7f5ca82e3b6e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f5ca80aa111] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5ca80aa23e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f5ca80ab8d1] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x288)[0x7f5ca80ac3f8] ))))) 0-$vol1-client-7: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2018-04-24 18:54:25.704755 (xid=0x4150a5)
[2018-04-24 18:55:08.254651] W [MSGID: 114031] [client-rpc-fops.c:2922:client3_3_readv_cbk] 0-$vol1-client-7: remote operation failed [Transport endpoint is not connected]
[2018-04-24 18:55:08.254740] E [MSGID: 109023] [dht-rebalance.c:1820:dht_migrate_file] 0-$vol1-dht: Migrate file failed: /$PATH1/file1: failed to migrate data
[2018-04-24 18:55:08.254807] W [MSGID: 114061] [client-common.c:704:client_pre_fstat] 0-$vol1-client-7: (7d4a7dd7-db43-428f-9618-add08088d7bb) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-04-24 18:55:08.254836] E [MSGID: 109023] [dht-rebalance.c:1459:__dht_migration_cleanup_src_file] 0-$vol1-dht: Migrate file cleanup failed: failed to fstat file /$PATH1/file1 on $vol1-client-7 [File descriptor in bad state]
[2018-04-24 18:55:08.254853] W [MSGID: 109023] [dht-rebalance.c:2275:dht_migrate_file] 0-$vol1-dht: /$PATH1/file1: failed to cleanup source file on $vol1-client-7
[2018-04-24 18:55:08.254870] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x13e)[0x7f5ca82e3b6e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1d1)[0x7f5ca80aa111] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f5ca80aa23e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x91)[0x7f5ca80ab8d1] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x288)[0x7f5ca80ac3f8] ))))) 0-$vol1-client-7: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2018-04-24 18:54:26.249541 (xid=0x4150a6)
[2018-04-24 18:55:08.254898] W [rpc-clnt-ping.c:223:rpc_clnt_ping_cbk] 0-$vol1-client-7: socket disconnected
[2018-04-24 18:55:14.862395] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-$vol1-client-7: remote operation failed [Transport endpoint is not connected]
[2018-04-24 18:55:14.862493] W [MSGID: 109023] [dht-rebalance.c:2300:dht_migrate_file] 0-$vol1-dht: /$PATH1/file1: failed to unlock file on $vol1-client-7 [Transport endpoint is not connected]
[2018-04-24 18:55:14.862585] E [MSGID: 109023] [dht-rebalance.c:2790:gf_defrag_migrate_single_file] 0-$vol1-dht: migrate-data failed for /$PATH1/file1 [Transport endpoint is not connected]
[2018-04-24 18:55:14.862626] W [dht-rebalance.c:3397:gf_defrag_process_dir] 0-$vol1-dht: Found error from gf_defrag_get_entry
[2018-04-24 18:55:14.863078] E [MSGID: 109111] [dht-rebalance.c:3914:gf_defrag_fix_layout] 0-$vol1-dht: gf_defrag_process_dir failed for directory: /$PATH2
[2018-04-24 18:55:16.492243] W [MSGID: 114061] [client-common.c:1197:client_pre_readdirp] 0-$vol1-client-7: (12c968a9-4d43-4746-9c16-2e3671b87dd7) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-04-24 18:55:18.256351] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-$vol1-client-7: changing port to 49159 (from 0)
[2018-04-24 18:55:18.256828] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-$vol1-client-7: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-04-24 18:55:18.257718] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-$vol1-client-7: Connected to $vol1-client-7, attached to remote volume '/srv/glusterfs/bricks/DATA208/data'.
[2018-04-24 18:55:18.257753] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-$vol1-client-7: Server and Client lk-version numbers are not same, reopening the fds
[2018-04-24 18:55:18.257771] I [MSGID: 114042] [client-handshake.c:1047:client_post_handshake] 0-$vol1-client-7: 9 fds open - Delaying child_up until they are re-opened
[2018-04-24 18:55:18.258105] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:00000000-0000-0000-0000-000000000001> succeeded (fd = 0)
[2018-04-24 18:55:18.258152] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:30abdc78-e85b-43fb-aac1-df9be4facf8e> succeeded (fd = 1)
[2018-04-24 18:55:18.258193] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:87541f65-770a-4cc3-89ab-19f6d0e98aa5> succeeded (fd = 4)
[2018-04-24 18:55:18.258222] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:9faf9889-0419-4e1e-ade1-2929a8575ce2> succeeded (fd = 5)
[2018-04-24 18:55:18.258248] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:14d6f9bc-5756-444d-86b7-a55d64753ca7> succeeded (fd = 6)
[2018-04-24 18:55:18.258272] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:ba3dbaf6-5774-416d-956c-483ddb514f42> succeeded (fd = 2)
[2018-04-24 18:55:18.258328] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:12c968a9-4d43-4746-9c16-2e3671b87dd7> succeeded (fd = 3)
[2018-04-24 18:55:18.258442] I [MSGID: 114060] [client-handshake.c:817:client3_3_reopendir_cbk] 0-$vol1-client-7: reopendir on <gfid:14d6f9bc-5756-444d-86b7-a55d64753ca7> succeeded (fd = 6)
[2018-04-24 18:55:18.258541] I [MSGID: 114041] [client-handshake.c:678:client_child_up_reopen_done] 0-$vol1-client-7: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2018-04-24 18:55:18.258659] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-$vol1-client-7: Server lk version = 1
[2018-04-24 18:55:20.089490] I [MSGID: 109081] [dht-common.c:4379:dht_setxattr] 0-$vol1-dht: fixing the layout of /dir1/dir2/dir3/dir4/dir5
[2018-04-24 18:56:02.823510] I [dht-rebalance.c:3223:gf_defrag_process_dir] 0-$vol1-dht: migrate data called on /dir1/dir2/dir3/dir4/dir5
[2018-04-24 18:56:02.854207] W [dht-rebalance.c:3397:gf_defrag_process_dir] 0-$vol1-dht: Found error from gf_defrag_get_entry
[2018-04-24 18:56:02.854759] E [MSGID: 109111] [dht-rebalance.c:3914:gf_defrag_fix_layout] 0-$vol1-dht: gf_defrag_process_dir failed for directory: /dir1/dir2/dir3/dir4/dir5
[2018-04-24 18:56:02.855041] E [MSGID: 109016] [dht-rebalance.c:3851:gf_defrag_fix_layout] 0-$vol1-dht: Fix layout failed for /dir1/dir2/dir3/dir4
[2018-04-24 18:56:02.855225] E [MSGID: 109016] [dht-rebalance.c:3851:gf_defrag_fix_layout] 0-$vol1-dht: Fix layout failed for /dir1/dir2/dir3
[2018-04-24 18:56:02.855565] E [MSGID: 109016] [dht-rebalance.c:3851:gf_defrag_fix_layout] 0-$vol1-dht: Fix layout failed for /dir1/dir2
[2018-04-24 18:56:02.855760] E [MSGID: 109016] [dht-rebalance.c:3851:gf_defrag_fix_layout] 0-$vol1-dht: Fix layout failed for /dir1
[2018-04-24 18:59:19.254438] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$vol1-dht: completed migration of /$PATH3/file3 from subvolume $vol1-client-0 to $vol1-client-1
[2018-04-24 18:59:19.256074] I [MSGID: 109028] [dht-rebalance.c:5097:gf_defrag_status_get] 0-$vol1-dht: Rebalance is failed. Time taken is 120378.00 secs
[2018-04-24 18:59:19.256119] I [MSGID: 109028] [dht-rebalance.c:5101:gf_defrag_status_get] 0-$vol1-dht: Files migrated: 434664, size: 21939317004280, lookups: 935269, failures: 8, skipped: 166223
[2018-04-24 18:59:19.256371] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f5ca755b494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x5644710ead45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x5644710eaba4] ) 0-: received signum (15), shutting down

client-7 appears to be responsible for the brick DATA208 on gluster02. Its log contains these lines around that time:

[2018-04-24 17:19:33.929281] E [MSGID: 113001] [posix.c:5983:_posix_handle_xattr_keyvalue_pair] 0-$vol1-posix: setxattr failed on /srv/glusterfs/bricks/DATA208/data/.glusterfs/ac/24/ac246c06-bd39-4799-bdbe-7fba9beb4fb7 while doing xattrop: key=trusted.glusterfs.quota.33157353-a842-48ac-8f84-e0cc55a59eae.contri.1 [No such file or directory]
[2018-04-24 18:50:33.091715] E [MSGID: 113001] [posix.c:5983:_posix_handle_xattr_keyvalue_pair] 0-$vol1-posix: setxattr failed on /srv/glusterfs/bricks/DATA208/data/.glusterfs/33/c4/33c4ca2e-63cd-4ab6-b56d-b95bb085b9b3 while doing xattrop: key=trusted.glusterfs.quota.8bc4b5a3-0792-429b-878a-7bcfba5d8360.contri.1 [No such file or directory]
[2018-04-24 18:56:02.744587] W [socket.c:593:__socket_rwv] 0-tcp.$vol1-server: writev on $IP_gluster02:49057 failed (Broken pipe)
[2018-04-24 18:56:02.744742] W [inodelk.c:499:pl_inodelk_log_cleanup] 0-$vol1-server: releasing lock on 7d4a7dd7-db43-428f-9618-add08088d7bb held by {client=0x7f38600ba190, pid=-3 lk-owner=fdffffff}
[2018-04-25 04:38:35.718259] A [MSGID: 120004] [quota.c:4998:quota_log_usage] 0-$vol1-quota: Usage is above soft limit: 199.7TB used by /$some_dir

I think the brick somehow lost its connection to something. We are not sure what port 49057 was used for, though; we can't find anything about it in the logs and it's currently not in use according to netstat.

We're also starting to see errors like this in the brick logs (on all bricks):

[2018-04-26 11:43:42.244821] W [marker-quota.c:33:mq_loc_copy] 0-marker: src loc is not valid
[2018-04-26 11:43:42.244854] E [marker-quota.c:1488:mq_initiate_quota_task] 0-$vol1-marker: loc copy failed
[2018-04-26 11:43:34.752298] W [MSGID: 113001] [posix.c:4430:posix_get_ancestry_non_directory] 0-$vol1-posix: listxattr failed on/srv/glusterfs/bricks/DATA208/data/.glusterfs/75/79/757961cd-4348-41fa-93cb-2a681f87af96 [No such file or directory]
[2018-04-26 11:43:42.245003] W [MSGID: 113001] [posix.c:4430:posix_get_ancestry_non_directory] 0-$vol1-posix: listxattr failed on/srv/glusterfs/bricks/DATA208/data/.glusterfs/e2/8a/e28a69cc-e23a-43ab-998c-f41ef77212b5 [No such file or directory]
Created attachment 1427665 [details]
The requested tcpdump
Created attachment 1427666 [details]
Mount log of the volume
Hi,

Please capture the tcpdump using:

tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22

I need to be able to examine the packets.
There is nothing in the mount logs to indicate a problem.

The rebalance process failed because it could not contact $IP_gluster02:49159:

[2018-04-24 18:55:08.253990] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-$vol1-client-7: server $IP_gluster02:49159 has not responded in the last 42 seconds, disconnecting.
[2018-04-24 18:55:08.254210] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-$vol1-client-7: disconnected from $vol1-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2018-04-24 18:55:08.254260] W [MSGID: 109073] [dht-common.c:9315:dht_notify] 0-$vol1-dht: Received CHILD_DOWN. Exiting

This is the expected rebalance behaviour. You need to check if there is any reason why this brick is not reachable. Does it happen if you try the rebalance again?
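If you do retry it, a minimal sketch to first confirm the brick is reachable and then restart the rebalance (volume and host names are placeholders taken from your earlier output; adjust as needed):

$ gluster volume status $vol1 | grep DATA208
$ nc -zv gluster02 49159          # assumes nc is installed; any TCP connect test will do
$ gluster volume rebalance $vol1 start
$ gluster volume rebalance $vol1 status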
We restarted the rebalance. It'll take a while though (estimated time 50 hrs). We'll report the outcome.

The bricks actually are virtual discs provided by a big storage system. The storage reports no errors (including no loss of connectivity or hard drive failure). We didn't touch the brick process at all (actually, we weren't even present, it was late in the evening); it recovered on its own. Port 49159 on gluster02 belongs to brick DATA208. The port was open when we came to work the next day and the brick was up and running. The glusterd log showed nothing about having lost a brick, just the failed rebalance:

[2018-04-24 18:59:19.256333] I [MSGID: 106172] [glusterd-handshake.c:1014:__server_event_notify] 0-glusterd: received defrag status updated
[2018-04-24 18:59:19.263291] W [socket.c:593:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-0d210c70-e44f-46f1-862c-ef260514c9f1.sock failed (No data available)
[2018-04-24 18:59:19.266258] I [MSGID: 106007] [glusterd-rebalance.c:158:__glusterd_defrag_notify] 0-management: Rebalance process for volume $vol1 has disconnected.

That's the complete log of that day.

For some reason, DATA208 tried to connect to port 49057:

[2018-04-24 18:56:02.744587] W [socket.c:593:__socket_rwv] 0-tcp.$vol1-server: writev on $IP_gluster02:49057 failed (Broken pipe)

We are unsure why. There's nothing listening there:

$ netstat -tulpen | grep 49057
$ netstat -tulpen | grep gluster
tcp   0   0 0.0.0.0:49152   0.0.0.0:*   LISTEN   0   24130   4064/glusterfsd
tcp   0   0 0.0.0.0:49153   0.0.0.0:*   LISTEN   0   18881   4072/glusterfsd
tcp   0   0 0.0.0.0:49154   0.0.0.0:*   LISTEN   0   19775   4080/glusterfsd
tcp   0   0 0.0.0.0:49155   0.0.0.0:*   LISTEN   0   26969   4090/glusterfsd
tcp   0   0 0.0.0.0:49156   0.0.0.0:*   LISTEN   0   45238   4098/glusterfsd
tcp   0   0 0.0.0.0:49157   0.0.0.0:*   LISTEN   0   46649   4107/glusterfsd
tcp   0   0 0.0.0.0:49158   0.0.0.0:*   LISTEN   0   1440    4116/glusterfsd
tcp   0   0 0.0.0.0:49159   0.0.0.0:*   LISTEN   0   18417   4125/glusterfsd
tcp   0   0 0.0.0.0:24007   0.0.0.0:*   LISTEN   0   15592   3873/glusterd
tcp   0   0 0.0.0.0:49160   0.0.0.0:*   LISTEN   0   19785   4134/glusterfsd
tcp   0   0 0.0.0.0:49161   0.0.0.0:*   LISTEN   0   36104   4143/glusterfsd
tcp   0   0 0.0.0.0:49162   0.0.0.0:*   LISTEN   0   72783   4152/glusterfsd
tcp   0   0 0.0.0.0:49163   0.0.0.0:*   LISTEN   0   38236   4161/glusterfsd

We don't know why the rebalance failed. It's the first time something like this has happened, and we don't understand the brick log.

We need to discuss uploading the pcap file with our supervisor, since it contains our IPs. Is there a way to give it to you without making it public?

There's something else that happened today: a user reported that she wanted to create a symlink with an absolute path to some file. There was no error message (in fact, the mount log reported Success), but the symlink led to nowhere. The volume is usually mounted as /data; on all compute nodes using the /data mount, creating a symlink to this file didn't work. The new mount I created at /mnt, however, could create the symlink, even though its systemd mount unit is literally a copy except for the mount point. A server with both mount points (/data and /mnt) could create the symlink on the /mnt mount point but not on /data. Relative paths, however, work fine.
It looks like this:

$ ls -lah
lrwxrwxrwx 1 root itsc_test_proj2 120 Apr 30 15:25 test1.gz -> /mnt/$PATH/$file.gz
lrwxrwxrwx 1 root itsc_test_proj2 121 Apr 30 15:47 test2.gz ->
lrwxrwxrwx 1 root itsc_test_proj2 120 Apr 30 15:48 test3.gz -> /mnt/$PATH/$file.gz
lrwxrwxrwx 1 root itsc_test_proj2 118 Apr 30 16:05 test4.gz -> ../$PATH/$file.gz
lrwxrwxrwx 1 root itsc_test_proj2 119 Apr 30 16:06 test5.gz ->
lrwxrwxrwx 1 root itsc_test_proj2 121 Apr 30 16:08 test6.gz ->
lrwxrwxrwx 1 root itsc_test_proj2 121 Apr 30 16:08 test7.gz ->
lrwxrwxrwx 1 root itsc_test_proj2 120 Apr 30 15:48 test8.gz -> /mnt/$PATH/$file.gz

Creation of the symlinks:
test1.gz & test3.gz via "cd /mnt; ln -s /mnt/$PATH/$file.gz test$x.gz"
test2.gz, test5.gz & test6.gz via "cd /data; ln -s /data/$PATH/$file.gz test$x.gz"
test4.gz via "cd /data; ln -s ../$PATH/$file.gz test$x.gz"
test7.gz via "cd /mnt; ln -s /data/$PATH/$file.gz test$x.gz"
test8.gz via "cd /data; ln -s /mnt/$PATH/$file.gz test$x.gz"

This was reproducible. We know that the /mnt mount point is not completely fine either, since the hidden files we used to create the logs were hidden there, too. Still, the mounts behave differently: symlinks with an absolute path pointing to /data aren't created correctly.

Following the strange symlinks with zcat produces an error:

$ zcat test7.gz | head
gzip: test7.gz is a directory -- ignored

All links, including the one with a relative path pointing to /data, can be used as usual.
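For completeness, this is roughly how one could compare what is stored on the brick with what the client resolves (paths are placeholders; just a sketch):

# on the client:
$ readlink /data/$PATH/test7.gz
# on the brick that holds the symlink:
$ readlink /$path_to_brick/data/$PATH/test7.gz
$ getfattr -d -m - -e hex /$path_to_brick/data/$PATH/test7.gz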
> We need to discuss uploading the pcap-file with our supervisor, since it
> contains our IP's. Is there a way to give it to you without making it
> public?

Would it be possible to share it using Google drive?
(In reply to Nithya Balachandran from comment #14)
> > We need to discuss uploading the pcap-file with our supervisor, since it
> > contains our IP's. Is there a way to give it to you without making it
> > public?
>
> Would it be possible to share it using Google drive?

I sent you an e-mail with the file.
(In reply to Nithya Balachandran from comment #12)
> This is the expected rebalance behaviour. You need to check if there is any
> reason why this brick is not reachable. Does it happen if you try the
> rebalance again?

The second try finished without problems.

> [2018-05-02 15:50:21.572882] I [MSGID: 109028] [dht-rebalance.c:5097:gf_defrag_status_get] 0-$VOLUME-dht: Rebalance is completed. Time taken is 203329.00 secs
> [2018-05-02 15:50:21.572915] I [MSGID: 109028] [dht-rebalance.c:5101:gf_defrag_status_get] 0-$VOLUME-dht: Files migrated: 653637, size: 53966747687482, lookups: 3303019, failures: 0, skipped: 262197
Hi,

the rebalance finished on the second server. No brick loss this time, but we got two failures. It looks like it tried to move a directory:

[2018-05-03 16:04:21.886030] I [dht-rebalance.c:1570:dht_migrate_file] 0-$vol1-dht: /$path/$dir: attempting to move from $vol1-client-2 to $vol1-client-1
[2018-05-03 16:04:21.886133] I [dht-rebalance.c:1570:dht_migrate_file] 0-$vol1-dht: /$path/$dir: attempting to move from $vol1-client-2 to $vol1-client-1
[2018-05-03 16:04:21.904481] W [MSGID: 109023] [dht-rebalance.c:626:__is_file_migratable] 0-$vol1-dht: Migrate file failed:/$path/$dir: migrate-file called on directory
[2018-05-03 16:04:21.904558] W [MSGID: 109023] [dht-rebalance.c:626:__is_file_migratable] 0-$vol1-dht: Migrate file failed:/$path/$dir: migrate-file called on directory
[2018-05-03 16:04:21.904711] E [MSGID: 109023] [dht-rebalance.c:2790:gf_defrag_migrate_single_file] 0-$vol1-dht: migrate-data failed for /$path/$dir [Is a directory]
[2018-05-03 16:04:21.904715] E [MSGID: 109023] [dht-rebalance.c:2790:gf_defrag_migrate_single_file] 0-$vol1-dht: migrate-data failed for /$path/$dir [Is a directory]

The strange symlinks fixed themselves within a few hours. Newly created ones still look broken. When I checked on the bricks, the links were broken there too; readlink also shows an empty link target on both the brick and the mounted volume.
Hi,

we did a kernel downgrade back to 4.9.0-5 on all Gluster servers a few days ago. This is the kernel that was running before the errors were reported for the first time. This is what we've noticed so far:

The symlink anomaly is still reproducible. Error messages are still present during the rebalance (started after all servers booted with 4.9.0-5), but there are far fewer of them: one every few seconds or minutes, instead of multiple per second as before the downgrade. For example (volume name replaced with $VOLUME):

May 23 05:34:00 gluster02 srv-glusterfs-bricks-DATA209-data[4145]: [2018-05-23 03:34:00.055879] E [MSGID: 113001] [posix.c:5983:_posix_handle_xattr_keyvalue_pair] 0-$VOLUME_DATA-posix: setxattr failed on /srv/glusterfs/bricks/DATA209/data/.glusterfs/1d/5e/1d5e9895-97c5-4099-869e-3d86cdc7486e while doing xattrop: key=trusted.glusterfs.quota.ff8583fd-1452-4939-85c9-7b402ea85968.contri.1 [No such file or directory]
REVIEW: https://review.gluster.org/20387 (storage/posix: Add warning logs on failure) posted (#1) for review on master by N Balachandran
On analysing the tcpdumps sent, I noticed the following:

Readdirp returns entries without iatt information from 3 bricks of the volume (the ones that contained the "hidden" files). This is the case for all entries on those bricks, and for both files and directories. dht_readdirp_cbk finds an invalid iatt type and strips out the entries, which is why they are not listed even though they exist on the bricks.

From the code, a possible cause is the call to posix_handle_path() failing in posix_readdirp_fill(). posix_readdirp_fill will not fill the iatt information in this case, but the entries are still returned to the client.
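One possible check on an affected brick (assuming the standard .glusterfs layout; paths and the gfid are placeholders) is to verify that the gfid handle of the directory being listed still exists, since posix_handle_path resolves entry paths through it:

# on a brick that holds "hidden" entries, for the directory whose listing is incomplete:
$ getfattr -n trusted.gfid -e hex /$path_to_brick/data/dir1
# take the gfid aabbccdd-... from above; for a directory the handle should be a symlink:
$ ls -l /$path_to_brick/data/.glusterfs/aa/bb/aabbccdd-....-....-....-............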
COMMIT: https://review.gluster.org/20387 committed in master by "Raghavendra Bhat" <raghavendra> with a commit message- storage/posix: Add warning logs on failure posix_readdirp_fill will fail to update the iatt information if posix_handle_path fails. There is currently no log message to indicate this making debugging difficult. Change-Id: I6bce360ea7d1696501637433f80e02794fe1368f updates: bz#1564071 Signed-off-by: N Balachandran <nbalacha>
There are several messages of the following type seen in the brick logs:

Found stale gfid handle /dir1/pool1/brick1/myvol/.glusterfs/86/13/8613f6e0-3171-4191-8b42-d8c8063ffbce, removing it. [Stale file handle]

This behaviour was introduced in https://review.gluster.org/#/c/19070/. It needs to be modified in order to skip directories.
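Until that is fixed, a rough bash sketch for spotting directories on a brick whose gfid handle has gone missing or is dangling (assuming the standard .glusterfs layout; the brick path is a placeholder):

# run at the brick root; directory gfid handles are stored as symlinks under .glusterfs
cd /$path_to_brick/data
find . -path ./.glusterfs -prune -o -type d -print | while read -r d; do
    hexval=$(getfattr -n trusted.gfid -e hex "$d" 2>/dev/null | awk -F= '/trusted.gfid/{print substr($2,3)}')
    [ -n "$hexval" ] || continue
    g="${hexval:0:8}-${hexval:8:4}-${hexval:12:4}-${hexval:16:4}-${hexval:20:12}"
    h=".glusterfs/${g:0:2}/${g:2:2}/$g"
    [ -e "$h" ] || echo "missing or dangling handle: $d ($g)"
done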
REVIEW: https://review.gluster.org/20509 (posix: check before removing stale symlink) posted (#2) for review on master by Ravishankar N
COMMIT: https://review.gluster.org/20509 committed in master by "jiffin tony Thottan" <jthottan> with a commit message- posix: check before removing stale symlink BZ 1564071 complains of directories with missing gfid symlinks and corresponding "Found stale gfid handle" messages in the logs. Hence add a check to see if the symlink points to an actual directory before removing it. Note: Removing stale symlinks was added via commit 3e9a9c029fac359477fb26d9cc7803749ba038b2 Change-Id: I5d91fab8e5f3a621a9ecad4a1f9c898a3c2d346a Updates: bz#1564071 Signed-off-by: Ravishankar N <ravishankar>
I'm changing the version to mainline as the patches sent on master used this BZ ID (by mistake). Will clone it to other release versions to send back ports. Also, Nithya tells me there might be further patches, so moving the state to POST.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/