+++ This bug was initially created as a clone of Bug #1517260 +++

Hi,
I have a volume with replica 3 distributed across 3 servers. The 3 servers have the same version (3.12.3) and the same OS (Ubuntu 16.04.1), each server has three bricks, and I used this command to create the volume:

gluster volume create volume1 replica 3 transport tcp ubuntu1:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2 ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work1/test-storage1 ubuntu2:/work/work2/test-storage2 ubuntu2:/work/work3/test-storage3 ubuntu3:/work/work1/test-storage1 ubuntu3:/work/work2/test-storage2 ubuntu3:/work/work3/test-storage3

Each brick has a size of 22TB, so I understand that if I mount the volume I should have a partition of 66TB. The problem is that once the partition is mounted, if I check the size, it appears to have only 22TB.

I don't know what's wrong, can anyone help me?

Thanks!!

--- Additional comment from Raghavendra G on 2017-11-27 07:54:39 EST ---

Since the issue looks to be in the aggregation of sizes of the DHT subvolumes, changing the component to distribute.

--- Additional comment from Chad Cropper on 2017-11-28 15:50:40 EST ---

I seem to have this bug as well. I just upgraded from 3.11.3 to 3.12.3.

fuse mount from/on node1   32T   16T    15T   53%
node1 brick1               20T   18T   2.1T   90%
      brick2               12T   90M    11T    1%
node2 brick1               20T   15T   5.6T   72%
      brick2               12T   90M    11T    1%

--- Additional comment from Chad Cropper on 2017-11-30 14:52:31 EST ---

I removed my 2 new bricks and resized the LVM/ext4 filesystem instead. The aggregation of the total size is correct.

--- Additional comment from Nithya Balachandran on 2017-12-05 04:12:30 EST ---

(In reply to david from comment #0)
> Hi,
> I have a volume with replica 3 distributed across 3 servers. The 3 servers
> have the same version (3.12.3) and the same OS (Ubuntu 16.04.1), each server
> has three bricks, and I used this command to create the volume:
>
> gluster volume create volume1 replica 3 transport tcp
> ubuntu1:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2
> ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work1/test-storage1
> ubuntu2:/work/work2/test-storage2 ubuntu2:/work/work3/test-storage3
> ubuntu3:/work/work1/test-storage1 ubuntu3:/work/work2/test-storage2
> ubuntu3:/work/work3/test-storage3
>
> Each brick has a size of 22TB, so I understand that if I mount the volume I
> should have a partition of 66TB. The problem is that once the partition is
> mounted, if I check the size, it appears to have only 22TB.
>
> I don't know what's wrong, can anyone help me?
>
> Thanks!!

On a different note - this is not the best way to create a replica 3 volume. All your replicas are on the same host, so if any one host goes down you lose access to all the data on that replica set.

A better way is:

gluster volume create volume1 replica 3 transport tcp ubuntu1:/work/work1/test-storage1 ubuntu2:/work/work1/test-storage1 ubuntu3:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2 ubuntu2:/work/work2/test-storage2 ubuntu3:/work/work2/test-storage2 ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work3/test-storage3 ubuntu3:/work/work3/test-storage3

--- Additional comment from Nithya Balachandran on 2017-12-05 04:13:42 EST ---

Can you please mount the volume, check the size, and send us the mount log?
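For reference, the information requested above could be gathered roughly as follows; the mount point /mnt/volume1 is only an example, and the log file name follows the hyphenated-mount-path convention mentioned later in this report:

    mount -t glusterfs ubuntu1:/volume1 /mnt/volume1
    df -h /mnt/volume1
    # Expected total here: 3 distribute subvolumes x 22TB = ~66TB
    cat /var/log/glusterfs/mnt-volume1.log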
--- Additional comment from david on 2017-12-05 09:19:36 EST ---

(In reply to Nithya Balachandran from comment #4)
> (In reply to david from comment #0)
> > Hi,
> > I have a volume with replica 3 distributed across 3 servers. The 3 servers
> > have the same version (3.12.3) and the same OS (Ubuntu 16.04.1), each
> > server has three bricks, and I used this command to create the volume:
> >
> > gluster volume create volume1 replica 3 transport tcp
> > ubuntu1:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2
> > ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work1/test-storage1
> > ubuntu2:/work/work2/test-storage2 ubuntu2:/work/work3/test-storage3
> > ubuntu3:/work/work1/test-storage1 ubuntu3:/work/work2/test-storage2
> > ubuntu3:/work/work3/test-storage3
> >
> > Each brick has a size of 22TB, so I understand that if I mount the volume
> > I should have a partition of 66TB. The problem is that once the partition
> > is mounted, if I check the size, it appears to have only 22TB.
> >
> > I don't know what's wrong, can anyone help me?
> >
> > Thanks!!
>
> On a different note - this is not the best way to create a replica 3 volume.
> All your replicas are on the same host, so if any one host goes down you
> lose access to all the data on that replica set.
>
> A better way is:
>
> gluster volume create volume1 replica 3 transport tcp
> ubuntu1:/work/work1/test-storage1 ubuntu2:/work/work1/test-storage1
> ubuntu3:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2
> ubuntu2:/work/work2/test-storage2 ubuntu3:/work/work2/test-storage2
> ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work3/test-storage3
> ubuntu3:/work/work3/test-storage3

Hi @Nithya!

This is the same configuration that I have, isn't it? Just in a different order.

Thanks!

--- Additional comment from Nithya Balachandran on 2017-12-05 10:16:12 EST ---

(In reply to david from comment #6)
> (In reply to Nithya Balachandran from comment #4)
> > (In reply to david from comment #0)
> > > Hi,
> > > I have a volume with replica 3 distributed across 3 servers. The 3
> > > servers have the same version (3.12.3) and the same OS (Ubuntu 16.04.1),
> > > each server has three bricks, and I used this command to create the
> > > volume:
> > >
> > > gluster volume create volume1 replica 3 transport tcp
> > > ubuntu1:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2
> > > ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work1/test-storage1
> > > ubuntu2:/work/work2/test-storage2 ubuntu2:/work/work3/test-storage3
> > > ubuntu3:/work/work1/test-storage1 ubuntu3:/work/work2/test-storage2
> > > ubuntu3:/work/work3/test-storage3
> > >
> > > Each brick has a size of 22TB, so I understand that if I mount the
> > > volume I should have a partition of 66TB. The problem is that once the
> > > partition is mounted, if I check the size, it appears to have only 22TB.
> > >
> > > I don't know what's wrong, can anyone help me?
> > >
> > > Thanks!!
> >
> > On a different note - this is not the best way to create a replica 3
> > volume. All your replicas are on the same host, so if any one host goes
> > down you lose access to all the data on that replica set.
> >
> > A better way is:
> >
> > gluster volume create volume1 replica 3 transport tcp
> > ubuntu1:/work/work1/test-storage1 ubuntu2:/work/work1/test-storage1
> > ubuntu3:/work/work1/test-storage1 ubuntu1:/work/work2/test-storage2
> > ubuntu2:/work/work2/test-storage2 ubuntu3:/work/work2/test-storage2
> > ubuntu1:/work/work3/test-storage3 ubuntu2:/work/work3/test-storage3
> > ubuntu3:/work/work3/test-storage3
>
> Hi @Nithya!
>
> This is the same configuration that I have, isn't it? Just in a different order.
>

Yes, but the ordering is what determines which bricks form a replica set (i.e. contain copies of the same file). Ideally, you want to create your volume so that each brick of a replica set is on a different node; that way, if one node goes down, the other 2 bricks on the other 2 nodes can still serve the data.

In the volume create command, with a replica value of 'n', each group of n bricks forms a replica set. The first n bricks form one set, then the next n, and so on. So in your case, the first 3 bricks passed to the create command form a replica set, then the next 3, and so on. As the first 3 bricks are all on the same node, if ubuntu1 goes down, all the files on that set will be inaccessible.

You can confirm this by checking the bricks on ubuntu1 - you should see the same files on all 3 bricks.

> Thanks!

--- Additional comment from david on 2017-12-05 12:19:49 EST ---

Yes, sorry @Nithya. I created it the way you said; I just wrote it wrong here.
Thanks

--- Additional comment from Nithya Balachandran on 2017-12-05 21:43:09 EST ---

(In reply to david from comment #8)
> Yes, sorry @Nithya. I created it the way you said; I just wrote it wrong here.
> Thanks

Good to hear. :)

Coming back to the original issue you reported, can you send us:
1. gluster volume info <volname>
2. The mount log - mount the volume, check the size, and send us the mount log

Thanks,
Nithya

--- Additional comment from david on 2017-12-07 11:06:21 EST ---

1. Output of gluster volume info:

Volume Name: volume1
Type: Distributed-Replicate
Volume ID: 66153ffa-dfd7-4c1f-966e-093862605b40
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: ubuntu1:/work/work1/test-storage1
Brick2: ubuntu2:/work/work1/test-storage1
Brick3: ubuntu3:/work/work1/test-storage1
Brick4: ubuntu1:/work/work2/test-storage2
Brick5: ubuntu2:/work/work2/test-storage2
Brick6: ubuntu3:/work/work2/test-storage2
Brick7: ubuntu1:/work/work3/test-storage3
Brick8: ubuntu2:/work/work3/test-storage3
Brick9: ubuntu3:/work/work3/test-storage3
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.io-thread-count: 48
performance.cache-size: 256MB
performance.write-behind-window-size: 8MB
performance.cache-max-file-size: 2MB
performance.read-ahead: off
client.event-threads: 4
server.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
cluster.readdir-optimize: on

2. I couldn't see a specific error after mounting the volume, but in the syslog there are a lot of errors like:

Dec 7 16:37:04 ubuntu1 kernel: [1308096.853903] EXT4-fs error (device sdb1): htree_dirblock_to_tree:986: inode #110821791: block 1773144535: comm glusteriotwr20: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999631, rec_len=1, name_len=0

Thanks!

--- Additional comment from Nithya Balachandran on 2018-01-12 04:27:32 EST ---

Hi David,

Sorry for taking so long to get back. Can you please retry the operation and send us the client mount log?
(It should be at /var/log/glusterfs/<hyphenated path of mount point>.log.)

--- Additional comment from Nithya Balachandran on 2018-01-31 01:41:20 EST ---

Hi David,

Do you still see the problem? We may have found the reason. Can you send me the contents of the /var/lib/glusterd/volume1 directory on any one of the server nodes?

Regards,
Nithya

--- Additional comment from david on 2018-01-31 03:18:28 EST ---

Hi @Nithya,

ls /var/lib/glusterd/vols/volume1

bricks
cksum
volume1.ubuntu1.work-work1-test-storage1.vol
volume1.ubuntu1.work-work2-test-storage2.vol
volume1.ubuntu1.work-work3-test-storage3.vol
volume1.ubuntu2.work-work1-test-storage1.vol
volume1.ubuntu2.work-work2-test-storage2.vol
volume1.ubuntu2.work-work3-test-storage3.vol
volume1.ubuntu3.work-work1-test-storage1.vol
volume1.ubuntu3.work-work2-test-storage2.vol
volume1.ubuntu3.work-work3-test-storage3.vol
volume1-rebalance.vol
volume1.tcp-fuse.vol
info
node_state.info
quota.cksum
quota.conf
rebalance
run
snapd.info
trusted-volume1.tcp-fuse.vol

Thanks!

--- Additional comment from Nithya Balachandran on 2018-01-31 03:25:31 EST ---

Apologies, I should have made myself clearer. I need to see the contents of the files in that directory. Specifically:

volume1.ubuntu1.work-work1-test-storage1.vol
volume1.ubuntu1.work-work2-test-storage2.vol
volume1.ubuntu1.work-work3-test-storage3.vol
volume1.ubuntu2.work-work1-test-storage1.vol
volume1.ubuntu2.work-work2-test-storage2.vol
volume1.ubuntu2.work-work3-test-storage3.vol
volume1.ubuntu3.work-work1-test-storage1.vol
volume1.ubuntu3.work-work2-test-storage2.vol
volume1.ubuntu3.work-work3-test-storage3.vol

Can you also please send the /var/lib/glusterd/glusterd.info file from all 3 nodes?

--- Additional comment from david on 2018-01-31 03:34:47 EST ---

Of course.

glusterd.info on ubuntu1:
UUID=a3597de1-d634-4e4f-80c4-186180071298
operating-version=31000

glusterd.info on ubuntu2:
UUID=74c61527-88ed-4c0c-8a9f-3aafb1f60c3c
operating-version=31000

glusterd.info on ubuntu3:
UUID=e5b2240a-9f81-49c9-87f3-0bc4e9390661
operating-version=31000

The other files are quite big; for example, this is the content of volume1.ubuntu1.work-work1-test-storage1.vol:

volume volume1-posix
    type storage/posix
    option shared-brick-count 3
    option volume-id 66153ffa-dfd7-4c1f-966e-093862605b40
    option directory /work/work1/test-storage1
end-volume

volume volume1-trash
    type features/trash
    option trash-internal-op off
    option brick-path /work/work1/test-storage1
    option trash-dir .trashcan
    subvolumes volume1-posix
end-volume

volume volume1-changetimerecorder
    type features/changetimerecorder
    option sql-db-wal-autocheckpoint 25000
    option sql-db-cachesize 12500
    option ctr-record-metadata-heat off
    option record-counters off
    option ctr-enabled off
    option record-entry on
    option ctr_lookupheal_inode_timeout 300
    option ctr_lookupheal_link_timeout 300
    option ctr_link_consistency off
    option record-exit off
    option db-path /work/work1/test-storage1/.glusterfs/
    option db-name test-storage1.db
    option hot-brick off
    option db-type sqlite3
    subvolumes volume1-trash
end-volume

volume volume1-changelog
    type features/changelog
    option changelog-barrier-timeout 120
    option changelog-dir /work/work1/test-storage1/.glusterfs/changelogs
    option changelog-brick /work/work1/test-storage1
    subvolumes volume1-changetimerecorder
end-volume

volume volume1-bitrot-stub
    type features/bitrot-stub
    option bitrot disable
    option export /work/work1/test-storage1
    subvolumes volume1-changelog
end-volume
volume volume1-access-control
    type features/access-control
    subvolumes volume1-bitrot-stub
end-volume

volume volume1-locks
    type features/locks
    subvolumes volume1-access-control
end-volume

volume volume1-worm
    type features/worm
    option worm-file-level off
    option worm off
    subvolumes volume1-locks
end-volume

volume volume1-read-only
    type features/read-only
    option read-only off
    subvolumes volume1-worm
end-volume

volume volume1-leases
    type features/leases
    option leases off
    subvolumes volume1-read-only
end-volume

volume volume1-upcall
    type features/upcall
    option cache-invalidation off
    subvolumes volume1-leases
end-volume

volume volume1-io-threads
    type performance/io-threads
    option thread-count 48
    subvolumes volume1-upcall
end-volume

volume volume1-selinux
    type features/selinux
    option selinux on
    subvolumes volume1-io-threads
end-volume

volume volume1-marker
    type features/marker
    option inode-quota off
    option quota off
    option gsync-force-xtime off
    option xtime off
    option quota-version 0
    option timestamp-file /var/lib/glusterd/vols/volume1/marker.tstamp
    option volume-uuid 66153ffa-dfd7-4c1f-966e-093862605b40
    subvolumes volume1-selinux
end-volume

volume volume1-barrier
    type features/barrier
    option barrier-timeout 120
    option barrier disable
    subvolumes volume1-marker
end-volume

volume volume1-index
    type features/index
    option xattrop-pending-watchlist trusted.afr.volume1-
    option xattrop-dirty-watchlist trusted.afr.dirty
    option index-base /work/work1/test-storage1/.glusterfs/indices
    subvolumes volume1-barrier
end-volume

volume volume1-quota
    type features/quota
    option deem-statfs off
    option server-quota off
    option volume-uuid volume1
    subvolumes volume1-index
end-volume

volume volume1-io-stats
    type debug/io-stats
    option count-fop-hits off
    option latency-measurement off
    option log-level INFO
    option unique-id /work/work1/test-storage1
    subvolumes volume1-quota
end-volume

volume /work/work1/test-storage1
    type performance/decompounder
    subvolumes volume1-io-stats
end-volume

volume volume1-server
    type protocol/server
    option transport.listen-backlog 10
    option transport.socket.keepalive-count 9
    option transport.socket.keepalive-interval 2
    option transport.socket.keepalive-time 20
    option transport.tcp-user-timeout 0
    option event-threads 4
    option transport.socket.keepalive 1
    option auth.addr./work/work1/test-storage1.allow *
    option auth-path /work/work1/test-storage1
    option auth.login.37801a5b-f8e9-487d-9189-e8ad0f0855fa.password 4170e1ae-f028-449e-b699-06ddc61a3853
    option auth.login./work/work1/test-storage1.allow 37801a5b-f8e9-487d-9189-e8ad0f0855fa
    option transport.address-family inet
    option transport-type tcp
    subvolumes /work/work1/test-storage1
end-volume

Do you want the others?

Thanks!

--- Additional comment from Nithya Balachandran on 2018-01-31 03:55:17 EST ---

Thanks David. I wanted to see the value of the shared-brick-count. Can you confirm that it is 3 for all of the volume1.ubuntu*.work-work*-test-storage*.vol files listed above?

volume volume1-posix
    type storage/posix
    option shared-brick-count 3    <--- *This*
    option volume-id 66153ffa-dfd7-4c1f-966e-093862605b40
    option directory /work/work1/test-storage1
end-volume

The shared-brick-count is a count of how many bricks share a filesystem, and the disk space is divided accordingly. For instance, if I have a filesystem mounted at /data with 100GB of space and I create 3 subdirectories inside it (/data/brick1, /data/brick2, /data/brick3) to use as bricks, I really only have 100GB, so df should also only return 100GB. (Patch https://review.gluster.org/#/c/17618/ introduced this change; in earlier releases it would have returned 300GB, which is incorrect.)
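Spelled out with the numbers from this report (assuming the intended shared-brick-count of 1, since each brick here is on its own 22TB partition):

    per-brick filesystem size            = 22TB
    bricks sharing that filesystem       = 1  -> shared-brick-count 1
    size reported per brick              = 22TB / 1 = 22TB
    distribute subvolumes (replica sets) = 9 bricks / replica 3 = 3
    expected volume size                 = 3 x 22TB = 66TB

With shared-brick-count wrongly set to 3 on the ubuntu1 bricks, each of those bricks would report only a third of its partition (roughly 7.3TB), which is consistent with the ~22TB total seen at the mount point.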
So, if all the bricks of volume1 on each node are on different filesystems, this is a bug. Otherwise, it is working as expected.

--- Additional comment from david on 2018-01-31 04:32:48 EST ---

Sure,

grep "shared-brick-count" *
volume1.ubuntu1.work-work1-test-storage1.vol:    option shared-brick-count 3
volume1.ubuntu1.work-work2-test-storage2.vol:    option shared-brick-count 3
volume1.ubuntu1.work-work3-test-storage3.vol:    option shared-brick-count 3
volume1.ubuntu2.work-work1-test-storage1.vol:    option shared-brick-count 0
volume1.ubuntu2.work-work2-test-storage2.vol:    option shared-brick-count 0
volume1.ubuntu2.work-work3-test-storage3.vol:    option shared-brick-count 0
volume1.ubuntu3.work-work1-test-storage1.vol:    option shared-brick-count 0
volume1.ubuntu3.work-work2-test-storage2.vol:    option shared-brick-count 0
volume1.ubuntu3.work-work3-test-storage3.vol:    option shared-brick-count 0

Thanks Nithya

--- Additional comment from Nithya Balachandran on 2018-01-31 10:23:37 EST ---

Hi David,

Sorry for the back and forth on this. Can you try the following on any one node?

gluster v set volume1 cluster.min-free-inodes 6%

And then check whether the df output and the *.vol files above remain the same?

--- Additional comment from Nithya Balachandran on 2018-02-01 00:04:13 EST ---

Hi David,

We don't know why this is happening yet, but we have a workaround until we fix it.

On every node in the cluster (ubuntu1, ubuntu2 and ubuntu3):
1. Copy the shared-brick-count.sh file to /usr/lib*/glusterfs/3.12.3/filter/. (You might need to create the filter directory in this path.)
2. Give the file execute permissions.

For example, on my system (the version is different):

[root@rhgsserver1 dir1]# cd /usr/lib/glusterfs/3.12.5/
[root@rhgsserver1 3.12.5]# ll
total 4.0K
drwxr-xr-x.  2 root root   64 Feb  1 08:56 auth
drwxr-xr-x.  2 root root   34 Feb  1 09:12 filter
drwxr-xr-x.  2 root root   66 Feb  1 08:55 rpc-transport
drwxr-xr-x. 13 root root 4.0K Feb  1 08:57 xlator
[root@rhgsserver1 3.12.5]# cd filter
[root@rhgsserver1 filter]# pwd
/usr/lib/glusterfs/3.12.5/filter
[root@rhgsserver1 filter]# ll
total 4
-rwxr-xr-x. 1 root root 95 Feb  1 09:12 shared-brick-count.sh

On any one node, run:

gluster v set volume1 cluster.min-free-inodes 6%

This should regenerate the .vol files and set the value of option shared-brick-count to 1. Please check the .vol files to confirm this. You do not need to restart the volume.

Let me know if this solves the problem.

--- Additional comment from Nithya Balachandran on 2018-02-01 00:04 EST ---

--- Additional comment from Nithya Balachandran on 2018-02-01 00:05:34 EST ---

See http://docs.gluster.org/en/latest/Administrator%20Guide/GlusterFS%20Filter/ for more details.
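The shared-brick-count.sh script itself is not pasted in this report (it was shared as an attachment), but based on the filter mechanism described in the linked documentation - glusterd passes the path of each regenerated volfile to every executable in the filter directory - a minimal version would look something like this (illustrative sketch, not necessarily the exact attached file):

    #!/bin/bash
    # Workaround filter: force shared-brick-count to 1 in the volfile whose
    # path glusterd passes as the first argument.
    sed -i 's/option shared-brick-count [0-9]*/option shared-brick-count 1/g' "$1"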
--- Additional comment from david on 2018-02-01 04:04:19 EST ---

Hi Nithya! This fixed my problem!

Now I can see the 66T:

ds2-nl2:/volume1  fuse.glusterfs  66T  9.6T  53T  16%  /volume1

And this is the output for the *.vol files:

grep "shared-brick-count" *
volume1.ubuntu1.work-work1-test-storage1.vol:    option shared-brick-count 1
volume1.ubuntu1.work-work2-test-storage2.vol:    option shared-brick-count 1
volume1.ubuntu1.work-work3-test-storage3.vol:    option shared-brick-count 1
volume1.ubuntu2.work-work1-test-storage1.vol:    option shared-brick-count 1
volume1.ubuntu2.work-work2-test-storage2.vol:    option shared-brick-count 1
volume1.ubuntu2.work-work3-test-storage3.vol:    option shared-brick-count 1
volume1.ubuntu3.work-work1-test-storage1.vol:    option shared-brick-count 1
volume1.ubuntu3.work-work2-test-storage2.vol:    option shared-brick-count 1
volume1.ubuntu3.work-work3-test-storage3.vol:    option shared-brick-count 1

Thank you very much!

--- Additional comment from Amar Tumballi on 2018-02-02 03:34:38 EST ---

David, glad that Nithya's script helped you get back to a proper state.

I wanted to know whether, on the backend of ubuntu{1,2,3}, the bricks /work/work{1,2,3} are all on different partitions. Can you share the output of 'df -h /work/work{1,2,3} && stat /work/work{1,2,3}'? That should help me resolve the issue faster.

Thanks,

--- Additional comment from david on 2018-02-02 04:10:42 EST ---

Yes, they are different partitions:

df -h | grep work
/dev/sdc1        22T  3.2T   18T  16% /work/work2
/dev/sdd1        22T  3.2T   18T  16% /work/work3
/dev/sdb1        22T  3.5T   18T  17% /work/work1

stat /work/work1/
  File: '/work/work1/'
  Size: 4096       Blocks: 8          IO Block: 4096   directory
Device: 811h/2065d Inode: 2           Links: 6
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-07-19 08:31:20.932837957 +0200
Modify: 2018-01-31 12:00:43.291297677 +0100
Change: 2018-01-31 12:00:43.291297677 +0100
 Birth: -

stat /work/work2
  File: '/work/work2'
  Size: 4096       Blocks: 8          IO Block: 4096   directory
Device: 821h/2081d Inode: 2           Links: 6
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-07-19 08:31:55.444838468 +0200
Modify: 2018-01-30 09:14:04.692589116 +0100
Change: 2018-01-30 09:14:04.692589116 +0100
 Birth: -

stat /work/work3
  File: '/work/work3'
  Size: 4096       Blocks: 8          IO Block: 4096   directory
Device: 831h/2097d Inode: 2           Links: 7
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-07-19 08:31:51.936838416 +0200
Modify: 2018-02-02 09:59:44.940879714 +0100
Change: 2018-02-02 09:59:44.940879714 +0100
 Birth: -

Thanks!

--- Additional comment from Worker Ant on 2018-02-02 04:41:33 EST ---

REVIEW: https://review.gluster.org/19464 (glusterd: shared-brick-count should consider st_dev details) posted (#1) for review on master by Amar Tumballi

--- Additional comment from Amar Tumballi on 2018-02-02 04:43:55 EST ---

Hi David,

One last question to validate my theory: what is the filesystem of your backend bricks (i.e., /work/work{1,2,3})?

--- Additional comment from david on 2018-02-02 04:52:47 EST ---

The filesystem is ext4 for all the partitions.

Regards

--- Additional comment from Worker Ant on 2018-02-03 23:39:13 EST ---

REVIEW: https://review.gluster.org/19484 (glusterd/store: handle the case of fsid being set to 0) posted (#1) for review on master by Amar Tumballi

--- Additional comment from Amar Tumballi on 2018-02-04 05:54:05 EST ---

This holds good for master too (hence the change in version).
Upstream patch: https://review.gluster.org/#/c/19484/
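Judging from the titles of the two patches above (considering st_dev details, and handling an fsid of 0), the idea behind the fix is that bricks should only be counted as sharing capacity when they actually live on the same filesystem. A rough shell illustration of that grouping on one node - not the glusterd code itself, with brick paths taken from this report:

    # For each local brick, print the device number of its filesystem, then
    # count how many bricks share each device; that count is what
    # shared-brick-count should reflect. With three separate partitions
    # (sdb1, sdc1, sdd1) the answer is 1 per brick.
    for b in /work/work1/test-storage1 /work/work2/test-storage2 /work/work3/test-storage3; do
        stat -c '%d %n' "$b"
    done | awk '{count[$1]++} END {for (dev in count) print "st_dev " dev ": " count[dev] " brick(s) on this filesystem"}'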
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607