Bug 1747844 - Rebalance doesn't work correctly if performance.parallel-readdir on and with some other specific options set [NEEDINFO]
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 4.1
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
Depends On:
Reported: 2019-09-02 03:48 UTC by Howard
Modified: 2019-11-11 11:37 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Last Closed: 2019-11-11 11:37:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
nbalacha: needinfo? (Howard.Chen)

Attachments
Detailed step, volume info, option and log (23.76 KB, application/x-7z-compressed)
2019-09-02 03:48 UTC, Howard

Description Howard 2019-09-02 03:48:15 UTC
Created attachment 1610643
Detailed step, volume info, option and log

Description of problem:
Rebalance is incomplete when the volume option performance.parallel-readdir is on:
some directories are not synced to the new bricks even though the rebalance status reports completed.

Version-Release number of selected component (if applicable):

How reproducible:
If a volume is configured with the options listed in the attachment, this bug can be reproduced 100% of the time.

Steps to Reproduce:
1. Create a distribute volume (with 5 bricks).

2. Set the specific options (listed in the attachment).

3. Create directories and files:
mkdir /mnt/volume_01/dir_1
mkdir /mnt/volume_01/dir_1/dir_2
mkdir /mnt/volume_01/dir_1/dir_2/dir_3
mkdir /mnt/volume_01/dir_1/dir_2/dir_3/dir_4
touch  /mnt/volume_01/dir_1/dir_2/file{1..100}
touch  /mnt/volume_01/dir_1/dir_2/dir_3/file{101..200}
touch  /mnt/volume_01/dir_1/dir_2/dir_3/dir_4/a{201..300}
4. Add 5 bricks to the volume (add-brick).
5. Rebalance the volume.
6. Check that the rebalance status is completed (check with gluster v status).
7. Check the directories and files on every brick.
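
The per-brick check in the last step can be scripted. The following is a minimal sketch, assuming it is run directly on the brick host and that the brick roots are the ones shown by gluster v info (/mnt/brick01/bk through /mnt/brick10/bk). check_bricks is a hypothetical helper name, not part of the gluster CLI.

```shell
#!/bin/sh
# Hypothetical helper: report, for each brick root, which of the
# expected directories are missing. Run on the brick host itself.
# Usage: check_bricks "<space-separated relative dirs>" brick_root...
check_bricks() {
    dirs=$1
    shift
    rc=0
    for brick in "$@"; do
        for d in $dirs; do
            if [ ! -d "$brick/$d" ]; then
                echo "MISSING $brick/$d"
                rc=1
            fi
        done
    done
    # Non-zero exit status if any directory was missing on any brick.
    return $rc
}

# Example with the layout from this report (brick paths taken from "gluster v info"):
# check_bricks "dir_1 dir_1/dir_2 dir_1/dir_2/dir_3 dir_1/dir_2/dir_3/dir_4" \
#     /mnt/brick0[1-9]/bk /mnt/brick10/bk
```

Because a distribute volume is expected to have every directory on every brick, any MISSING line for a newly added brick after a completed rebalance would match the behaviour described in this report.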

Actual results:
Only dir_1 and dir_2 are synced to the 5 newly added bricks
(dir_3 and dir_4 are not synced).

Expected results:
All four directories should be synced to the 5 newly added bricks.

Additional info:
Detailed steps, volume info, options and logs are in the attachment.
[root@K1 glusterfs]# gluster v get volume_01 all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.lookup-optimize                 on
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                off
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.rebal-throttle                  normal
cluster.lock-migration                  off
cluster.force-migration                 off
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.switch-pattern                  (null)
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      8
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      on
disperse.eager-lock                     on
disperse.other-eager-lock               on
disperse.eager-lock-timeout             1
disperse.other-eager-lock-timeout       1
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.consistent-metadata             no
cluster.heal-wait-queue-length          128
cluster.favorite-child-policy           none
cluster.full-lock                       yes
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         on
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              on
diagnostics.brick-log-level             ERROR
diagnostics.client-log-level            ERROR
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
diagnostics.stats-dump-interval         0
diagnostics.fop-sample-interval         0
diagnostics.stats-dump-format           json
diagnostics.fop-sample-buf-size         65535
diagnostics.stats-dnscache-ttl-sec      86400
performance.cache-max-file-size         0
performance.cache-min-file-size         0
performance.cache-refresh-timeout       1
performance.cache-size                  32MB
performance.io-thread-count             64
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.iot-watchdog-secs           (null)
performance.iot-pass-through            false
performance.io-cache-pass-through       false
performance.cache-size                  128MB
performance.qr-cache-timeout            1
performance.cache-invalidation          true
performance.flush-behind                on
performance.nfs.flush-behind            off
performance.write-behind-window-size    1MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.aggregate-size              128KB
performance.lazy-open                   yes
performance.read-after-open             no
performance.open-behind-pass-through    false
performance.read-ahead-page-count       4
performance.read-ahead-pass-through     false
performance.readdir-ahead-pass-through  false
performance.md-cache-pass-through       false
performance.md-cache-timeout            1
performance.cache-swift-metadata        true
performance.cache-samba-metadata        false
performance.cache-capability-xattrs     true
performance.cache-ima-xattrs            true
performance.md-cache-statfs             off
performance.nl-cache-pass-through       false
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    42
network.tcp-window-size                 (null)
network.remote-dio                      disable
client.event-threads                    8
client.tcp-user-timeout                 0
client.keepalive-time                   20
client.keepalive-interval               2
client.keepalive-count                  9
network.tcp-window-size                 (null)
network.inode-lru-limit                 16384
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     1
server.allow-insecure                   on
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            64
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
server.dynamic-auth                     on
client.send-gids                        on
server.gid-timeout                      300
server.own-thread                       (null)
server.event-threads                    8
server.tcp-user-timeout                 0
server.keepalive-time                   20
server.keepalive-interval               2
server.keepalive-count                  9
transport.listen-backlog                1024
ssl.own-cert                            (null)
ssl.private-key                         (null)
ssl.ca-list                             (null)
ssl.crl-path                            (null)
ssl.certificate-depth                   (null)
ssl.cipher-list                         (null)
ssl.dh-param                            (null)
ssl.ec-curve                            (null)
transport.address-family                inet
performance.write-behind                off
performance.read-ahead                  off
performance.readdir-ahead               on
performance.io-cache                    off
performance.quick-read                  off
performance.open-behind                 off
performance.nl-cache                    on
performance.stat-prefetch               on
performance.client-io-threads           on
performance.nfs.write-behind            off
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
performance.cache-invalidation          true
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
features.tag-namespaces                 off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          off
features.inode-quota                    off
features.bitrot                         disable
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.enable-ino32                        no
nfs.mem-factor                          15
nfs.export-dirs                         on
nfs.export-volumes                      on
nfs.addr-namelookup                     off
nfs.dynamic-volumes                     off
nfs.register-with-portmap               on
nfs.outstanding-rpc-limit               16
nfs.port                                2049
nfs.rpc-auth-unix                       on
nfs.rpc-auth-null                       on
nfs.rpc-auth-allow                      all
nfs.rpc-auth-reject                     none
nfs.ports-insecure                      off
nfs.trusted-sync                        off
nfs.trusted-write                       off
nfs.volume-access                       read-write
nfs.disable                             off
nfs.nlm                                 on
nfs.acl                                 on
nfs.mount-udp                           off
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd                           /sbin/rpc.statd
nfs.server-aux-gids                     off
nfs.drc                                 off
nfs.drc-size                            0x20000
nfs.read-size                           (1 * 1048576ULL)
nfs.write-size                          (1 * 1048576ULL)
nfs.readdir-size                        (1 * 1048576ULL)
nfs.rdirplus                            on
nfs.event-threads                       1
nfs.exports-auth-enable                 off
nfs.auth-refresh-interval-sec           30
nfs.auth-cache-ttl-sec                  30
features.read-only                      off
features.worm                           off
features.worm-file-level                disable
features.worm-files-deletable           on
features.default-retention-period       2147483647
features.retention-mode                 enterprise
features.auto-commit-period             7200
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     off
storage.gfid2path                       on
storage.gfid2path-separator             :
storage.reserve                         1
storage.health-check-timeout            10
storage.fips-mode-rchecksum             off
storage.force-create-mode               0000
storage.force-directory-mode            0000
storage.create-mask                     0777
storage.create-directory-mask           0777
storage.max-hardlinks                   100
storage.ctime                           off
config.gfproxyd                         off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 {{ brick.path }}/.glusterfs/changelogs
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
changelog.capture-del-path              off
features.barrier                        disable
features.barrier-timeout                120
features.trash                          off
features.trash-dir                      .trashcan
features.trash-eliminate-path           (null)
features.trash-max-filesize             5MB
features.trash-internal-op              off
cluster.enable-shared-storage           disable
locks.trace                             off
locks.mandatory-locking                 off
cluster.disperse-self-heal-daemon       enable
cluster.quorum-reads                    no
client.bind-insecure                    (null)
features.timeout                        45
features.failover-hosts                 (null)
features.shard                          off
features.shard-block-size               64MB
features.scrub-throttle                 lazy
features.scrub-freq                     biweekly
features.scrub                          false
features.expiry-time                    120
features.cache-invalidation             on
features.cache-invalidation-timeout     600
features.leases                         off
features.lease-lock-recall-timeout      60
disperse.background-heals               8
disperse.heal-wait-qlength              128
cluster.heal-timeout                    600
dht.force-readdirp                      on
disperse.read-policy                    gfid-hash
cluster.shd-max-threads                 1
cluster.shd-wait-qlength                1024
cluster.locking-scheme                  full
cluster.granular-entry-heal             no
features.locks-revocation-secs          0
features.locks-revocation-clear-all     false
features.locks-revocation-max-blocked   0
features.locks-monkey-unlocking         false
features.locks-notify-contention        no
features.locks-notify-contention-delay  5
disperse.shd-max-threads                1
disperse.shd-wait-qlength               1024
disperse.cpu-extensions                 auto
disperse.self-heal-window-size          1
cluster.use-compound-fops               off
performance.parallel-readdir            on
performance.rda-request-size            131072
performance.rda-low-wmark               4096
performance.rda-high-wmark              128KB
performance.rda-cache-limit             40MB
performance.nl-cache-positive-entry     false
performance.nl-cache-limit              10MB
performance.nl-cache-timeout            60
cluster.brick-multiplex                 off
cluster.max-bricks-per-process          0
disperse.optimistic-change-log          on
disperse.stripe-cache                   4
cluster.halo-enabled                    False
cluster.halo-shd-max-latency            99999
cluster.halo-nfsd-max-latency           5
cluster.halo-max-latency                5
cluster.halo-max-replicas               99999
cluster.halo-min-replicas               2
debug.delay-gen                         off
delay-gen.delay-percentage              10%
delay-gen.delay-duration                100000
disperse.parallel-writes                on
features.sdfs                           off
features.cloudsync                      off
features.utime                          off

[root@K1 glusterfs]# gluster v info
Volume Name: volume_01
Type: Distribute
Volume ID: 140b35e4-c095-457f-8f15-0095a10ad83d
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Brick1: testk1:/mnt/brick01/bk
Brick2: testk1:/mnt/brick02/bk
Brick3: testk1:/mnt/brick03/bk
Brick4: testk1:/mnt/brick04/bk
Brick5: testk1:/mnt/brick05/bk
Brick6: testk1:/mnt/brick06/bk
Brick7: testk1:/mnt/brick07/bk
Brick8: testk1:/mnt/brick08/bk
Brick9: testk1:/mnt/brick09/bk
Brick10: testk1:/mnt/brick10/bk
Options Reconfigured:
performance.rda-cache-limit: 40MB
performance.parallel-readdir: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.auto-commit-period: 7200
features.retention-mode: enterprise
features.default-retention-period: 2147483647
features.worm-file-level: disable
nfs.auth-cache-ttl-sec: 30
nfs.auth-refresh-interval-sec: 30
nfs.exports-auth-enable: off
performance.nfs.write-behind: off
performance.nl-cache: on
performance.open-behind: off
performance.quick-read: off
performance.io-cache: off
performance.read-ahead: off
performance.write-behind: off
server.event-threads: 8
client.event-threads: 8
performance.nfs.flush-behind: off
performance.cache-invalidation: true
performance.io-thread-count: 64
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: off
[root@K1 glusterfs]# gluster v status
Status of volume: volume_01
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick testk1:/mnt/brick01/bk                49152     0          Y       3223 
Brick testk1:/mnt/brick02/bk                49153     0          Y       3253 
Brick testk1:/mnt/brick03/bk                49154     0          Y       3283 
Brick testk1:/mnt/brick04/bk                49155     0          Y       3313 
Brick testk1:/mnt/brick05/bk                49156     0          Y       3343 
Brick testk1:/mnt/brick06/bk                49157     0          Y       3570 
Brick testk1:/mnt/brick07/bk                49158     0          Y       3600 
Brick testk1:/mnt/brick08/bk                49159     0          Y       3630 
Brick testk1:/mnt/brick09/bk                49160     0          Y       3660 
Brick testk1:/mnt/brick10/bk                49161     0          Y       3690 
NFS Server on localhost                     2049      0          Y       3842 
Task Status of Volume volume_01
Task                 : Rebalance           
ID                   : 5afe22d8-9906-4a76-93f3-40b8c699cb34
Status               : completed           
[root@K1 glusterfs]# gluster --version
glusterfs 4.1.8
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Comment 1 Nithya Balachandran 2019-09-04 05:01:35 UTC
I'll take a look and get back to you.

Comment 2 Nithya Balachandran 2019-10-16 12:19:22 UTC

Apologies for the delay but I finally managed to spend some time on this. Here is what I have so far:

Release 4 is EOL so I tried with release-5.

I used a FUSE mount, not NFS, and could not reproduce the issue with rebalance: the contents of all directories were migrated to the new bricks.
I did, however, see an issue where I could not list the directories from the FUSE mount immediately after they were created. This issue was not seen with parallel-readdir off.

[root@rhgs313-7 ~]# glusterd; gluster v create test{1..5} ; gluster v set test readdir-ahead on; gluster v set test parallel-readdir on; gluster v start test;
volume create: test: success: please start the volume to access data
volume set: success
volume set: success
volume start: test: success
[root@rhgs313-7 ~]# mount -t glusterfs -s /mnt/fuse1
[root@rhgs313-7 ~]# cd /mnt/fuse1/; mkdir dir_1; mkdir dir_1/dir_2; mkdir dir_1/dir_2/dir_3; mkdir dir_1/dir_2/dir_3/dir_4
[root@rhgs313-7 fuse1]# ll
total 0

On further analysis, this happened because the stat information for the directories received in dht_readdirp_cbk was invalid, which causes dht to strip those entries out of the listing. This was fixed by https://review.gluster.org/#/c/glusterfs/+/21811/ and is available from release-6 onwards.
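
One way to look for this symptom from the outside is to compare what the mount lists with what actually exists on the bricks. The following is a minimal sketch, assuming it runs on a host that can see both the mounted directory and the brick directories. missing_from_mount is a hypothetical helper; it only compares entry names and does not inspect the stat data that dht_readdirp_cbk checks.

```shell
#!/bin/sh
# Hypothetical helper: print entries that exist in at least one brick
# directory but are absent from the listing of the mounted directory.
# Usage: missing_from_mount <mounted_dir> <brick_dir>...
missing_from_mount() {
    mnt=$1
    shift
    seen=$(mktemp) || return 1
    onbricks=$(mktemp) || return 1
    # Names as seen through the mount.
    ls -A "$mnt" | sort -u > "$seen"
    # Union of names on the bricks; .glusterfs is internal to each brick root.
    for b in "$@"; do
        if [ -d "$b" ]; then
            ls -A "$b"
        fi
    done | sed '/^\.glusterfs$/d' | sort -u > "$onbricks"
    # Lines that appear only in the brick listing are hidden from the mount.
    comm -13 "$seen" "$onbricks"
    rm -f "$seen" "$onbricks"
}

# Example (mount path from the transcript; brick directories are placeholders):
# missing_from_mount /mnt/fuse1 <brick_dir_1> <brick_dir_2> ...
```

Any name printed here exists on a brick but is not returned by a listing through the mount, which is the behaviour seen in the transcript where ll showed nothing right after the mkdir calls.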

It is possible that the same issue occurred on your volume, so rebalance never processed these dirs. As the log level has been set to ERROR, there are no messages in the rebalance log which can be used to figure out what happened.

Please do the following:
1. Enable info level logging for client-log-level, reproduce the issue and send me the rebalance log.
2. Upgrade to release 6.x and see if you can still see the issue.
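
The first request could be carried out roughly as follows. This is a sketch, assuming the volume name volume_01 from the report and the usual rebalance log location /var/log/glusterfs/<volname>-rebalance.log (an assumption; the path depends on how GlusterFS was installed). The GLUSTER variable and the collect_rebalance_log name are hypothetical, added here only so the commands can be dry-run with echo.

```shell
#!/bin/sh
# Sketch: raise client-side logging to INFO and rerun rebalance.
# GLUSTER is a hypothetical override (e.g. GLUSTER=echo for a dry run).
GLUSTER=${GLUSTER:-gluster}

collect_rebalance_log() {
    vol=$1
    # The report shows diagnostics.client-log-level set to ERROR.
    $GLUSTER volume set "$vol" diagnostics.client-log-level INFO || return 1
    # Rerun the rebalance after repeating the add-brick reproduction steps.
    $GLUSTER volume rebalance "$vol" start || return 1
    $GLUSTER volume rebalance "$vol" status || return 1
    # Assumed default log location; adjust for your installation.
    echo "attach: /var/log/glusterfs/${vol}-rebalance.log"
}

# collect_rebalance_log volume_01
```

Once the rebalance status shows completed, the log file named on the last line is the one to attach to this bug.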

Comment 3 Nithya Balachandran 2019-11-04 05:26:23 UTC
(In reply to Nithya Balachandran from comment #2)
> Hi,
> Apologies for the delay but I finally managed to spend some time on this.
> Here is what I have so far:
> Release 4 is EOL so I tried with release-5.

Apologies - 4 is not EOL yet. I retried the test above with the latest release-4.1 code and could not reproduce the rebalance problem.
Please send the logs requested earlier and I will look into it.

Comment 4 Nithya Balachandran 2019-11-11 11:37:21 UTC
I'm closing this with WorksForMe. Please reopen if you still see this in the latest releases.
