Bug 1707866
Summary: | Thousands of duplicate files in glusterfs mountpoint directory listing | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Sergey <sergemp>
Component: | core | Assignee: | bugs <bugs>
Status: | CLOSED UPSTREAM | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.1 | CC: | bugs, moagrawa, pasik
Target Milestone: | --- | Keywords: | Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-03-12 12:34:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Sergey
2019-05-08 15:11:44 UTC
(In reply to Sergey from comment #0)

> I have something impossible: same filenames are listed multiple times:

Based on the information provided for zabbix.pm, the files are listed twice because 2 separate copies of the files exist on different bricks.

> # ls -la /mnt/VOLNAME/
> ...
> -rwxrwxr-x 1 root root 3486 Jan 28 2016 check_connections.pl
> -rwxr-xr-x 1 root root 153 Dec 7 2014 sigtest.sh
> -rwxr-xr-x 1 root root 153 Dec 7 2014 sigtest.sh
> -rwxr-xr-x 1 root root 3466 Jan 5 2015 zabbix.pm
> -rwxr-xr-x 1 root root 3466 Jan 5 2015 zabbix.pm
>
> There're about 38981 duplicate files like that.
>
> The volume itself is a 3 x 2-replica:
>
> # gluster volume info VOLNAME
> Volume Name: VOLNAME
> Type: Distributed-Replicate
> Volume ID: 41f9096f-0d5f-4ea9-b369-89294cf1be99
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: gfserver1:/srv/BRICK
> Brick2: gfserver2:/srv/BRICK
> Brick3: gfserver3:/srv/BRICK
> Brick4: gfserver4:/srv/BRICK
> Brick5: gfserver5:/srv/BRICK
> Brick6: gfserver6:/srv/BRICK
> Options Reconfigured:
> transport.address-family: inet
> nfs.disable: on
> cluster.self-heal-daemon: enable
> config.transport: tcp
>
> The "duplicated" file on individual bricks:
>
> [gfserver1]# ls -la /srv/BRICK/zabbix.pm
> ---------T 2 root root 0 Apr 23 2018 /srv/BRICK/zabbix.pm
>
> [gfserver2]# ls -la /srv/BRICK/zabbix.pm
> ---------T 2 root root 0 Apr 23 2018 /srv/BRICK/zabbix.pm

These 2 are linkto files and they are pointing to the data files on gfserver3 and gfserver4.

> [gfserver3]# ls -la /srv/BRICK/zabbix.pm
> -rwxr-xr-x 2 root root 3466 Jan 5 2015 /srv/BRICK/zabbix.pm
>
> [gfserver4]# ls -la /srv/BRICK/zabbix.pm
> -rwxr-xr-x 2 root root 3466 Jan 5 2015 /srv/BRICK/zabbix.pm
>
> [gfserver5]# ls -la /srv/BRICK/zabbix.pm
> -rwxr-xr-x 2 root root 3466 Jan 5 2015 /srv/BRICK/zabbix.pm
>
> [gfserver6]# ls -la /srv/BRICK/zabbix.pm
> -rwxr-xr-x. 2 root root 3466 Jan 5 2015 /srv/BRICK/zabbix.pm

These are the problematic files. I do not know why or how they ended up existing on these bricks as well.

> Attributes:
>
> [gfserver1]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> trusted.afr.VOLNAME-client-1=0x000000000000000000000000
> trusted.afr.VOLNAME-client-4=0x000000000000000000000000
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
> trusted.glusterfs.dht.linkto=0x6678666565642d7265706c69636174652d3100
>
> [gfserver2]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
> trusted.gfid2path.3b27d24cad4dceef=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f7a61626269782e706d
> trusted.glusterfs.dht.linkto=0x6678666565642d7265706c69636174652d3100
>
> [gfserver3]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> trusted.afr.VOLNAME-client-2=0x000000000000000000000000
> trusted.afr.VOLNAME-client-3=0x000000000000000000000000
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
>
> [gfserver4]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
> trusted.gfid2path.3b27d24cad4dceef=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f7a61626269782e706d
>
> [gfserver5]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> trusted.bit-rot.version=0x03000000000000005c4f813c000bc71b
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
>
> [gfserver6]# getfattr -m . -d -e hex /srv/BRICK/zabbix.pm
> # file: srv/BRICK/zabbix.pm
> security.selinux=0x73797374656d5f753a6f626a6563745f723a7661725f743a733000
> trusted.bit-rot.version=0x02000000000000005add0ffc000eb66a
> trusted.gfid=0x422a7ccf018242b58e162a65266326c3
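For reference, the trusted.glusterfs.dht.linkto value shown above is just the name of the DHT subvolume that holds the real data, stored as a hex-encoded string. On one of the bricks carrying the zero-byte entry it can be printed as text, which should yield something like the following (VOLNAME-replicate-1 being the gfserver3/gfserver4 replica pair):

    [gfserver1]# getfattr -n trusted.glusterfs.dht.linkto -e text /srv/BRICK/zabbix.pm
    # file: srv/BRICK/zabbix.pm
    trusted.glusterfs.dht.linkto="VOLNAME-replicate-1"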
> Not sure why exactly it happened... Maybe because some nodes were suddenly
> upgraded from centos6's gluster ~3.7 to centos7's 4.1, and some files
> happened to be on nodes that they're not supposed to be on.
>
> Currently all the nodes are online:
>
> # gluster pool list
> UUID                                  Hostname   State
> aac9e1a5-018f-4d27-9d77-804f0f1b2f13  gfserver5  Connected
> 98b22070-b579-4a91-86e3-482cfcc9c8cf  gfserver3  Connected
> 7a9841a1-c63c-49f2-8d6d-a90ae2ff4e04  gfserver4  Connected
> 955f5551-8b42-476c-9eaa-feab35b71041  gfserver6  Connected
> 7343d655-3527-4bcf-9d13-55386ccb5f9c  gfserver1  Connected
> f9c79a56-830d-4056-b437-a669a1942626  gfserver2  Connected
> 45a72ab3-b91e-4076-9cf2-687669647217  localhost  Connected
>
> and have glusterfs-3.12.14-1.el6.x86_64 (Centos 6) and
> glusterfs-4.1.7-1.el7.x86_64 (Centos 7) installed.
>
> Expected result
> ---------------
>
> This looks like a layout issue, so:
>
> gluster volume rebalance VOLNAME fix-layout start
>
> should fix it, right?

No, fix-layout only changes the layout, and this is not a layout problem. This is a problem with duplicate files on the bricks.

> Actual result
> -------------
>
> I tried:
> gluster volume rebalance VOLNAME fix-layout start
> gluster volume rebalance VOLNAME start
> gluster volume rebalance VOLNAME start force
> gluster volume heal VOLNAME full
> Those took 5 to 40 minutes to complete, but the duplicates are still there.

Can you send the rebalance logs for this volume from all the nodes?
How many clients do you have accessing the volume?
Are the duplicate files seen only in the root of the volume or in subdirs as well?

Hi Sergey,

Do you have any updates on this?

Thanks,
Mohit Agrawal

This bug is moved to https://github.com/gluster/glusterfs/issues/907, and will be tracked there from now on. Visit the GitHub issue URL for further details.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
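Following up on the questions above (in particular whether the duplicates are limited to the volume root): one way to see which bricks hold real data for a given name is to list every copy of it per brick with its mode and size, since DHT linkto pointers are the sticky-bit (mode 1000), zero-byte entries. A rough sketch, assuming GNU find and the /srv/BRICK brick root used in this report:

    # run on each brick; .glusterfs only holds gfid hardlinks, so skip it
    [gfserverN]# find /srv/BRICK -name .glusterfs -prune -o -type f -name zabbix.pm -printf '%m %s %p\n'

A sticky-bit, zero-byte hit (here on gfserver1/gfserver2) is only a pointer; a name whose full-size copies turn up on more than one replica pair (here gfserver3/gfserver4 and gfserver5/gfserver6) is the kind that gets listed twice on the mount. Swapping zabbix.pm for names under a subdirectory answers the root-versus-subdirs question.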