Bug 1747844

Summary:

Rebalance doesn't work correctly when performance.parallel-readdir is on and some other specific options are set

Product:

[Community] GlusterFS

Reporter:

Howard <Howard.Chen>

Component:

distribute

Assignee:

Nithya Balachandran <nbalacha>

Status:

CLOSED WORKSFORME

QA Contact:

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

4.1

CC:

bugs, nbalacha, pasik

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Last Closed:

2019-11-11 11:37:21 UTC

Type:

Bug

Attachments:

Description	Flags
Detailed step, volume info, option and log	none

Description Howard 2019-09-02 03:48:15 UTC

Created attachment 1610643 [details]
Detailed step, volume info, option and log

Description of problem:
Rebalance is incomplete when the volume option performance.parallel-readdir is on: directories are not synced even after the rebalance command reports its status as completed.

Version-Release number of selected component (if applicable):
Release:4.1.8

How reproducible:
If a volume is configured as in the attachment (option list), this bug can be reproduced 100% of the time.

Steps to Reproduce:
1. Create a distribute volume (with 5 bricks).

2. Set some specific options (listed in the attachment).

3. Create directories and files, for example:
mkdir /mnt/volume_01/dir_1
mkdir /mnt/volume_01/dir_1/dir_2
mkdir /mnt/volume_01/dir_1/dir_2/dir_3
mkdir /mnt/volume_01/dir_1/dir_2/dir_3/dir_4
touch /mnt/volume_01/dir_1/dir_2/file{1..100}
touch /mnt/volume_01/dir_1/dir_2/dir_3/file{101..200}
touch /mnt/volume_01/dir_1/dir_2/dir_3/dir_4/a{201..300}
4. Add 5 bricks to the volume (add-brick).
5. Rebalance the volume.
6. Check that the rebalance status is completed (use gluster v status).
7. Check the directories and files on every brick.
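
The steps above can be sketched as a shell script. This is a minimal sketch under stated assumptions, not the reporter's actual commands: the host name testk1 and the brick paths come from the volume info in this report, while the helper names (run, bricks, setup_volume, populate, expand_and_rebalance) and the DRY_RUN switch are hypothetical. Only readdir-ahead and parallel-readdir are set here; the full option list is in the attachment. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Reproduction sketch. DRY_RUN and all function names are illustrative.

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Print host:/mnt/brickNN/bk for bricks first..last, space separated.
bricks() {
    local host=$1 first=$2 last=$3 i
    for i in $(seq "$first" "$last"); do
        printf '%s:/mnt/brick%02d/bk ' "$host" "$i"
    done
}

# Steps 1-2: five-brick distribute volume with parallel-readdir enabled.
setup_volume() {
    local vol=$1 host=$2
    # Word splitting of the brick list is intentional.
    run gluster volume create "$vol" $(bricks "$host" 1 5)
    run gluster volume set "$vol" performance.readdir-ahead on
    run gluster volume set "$vol" performance.parallel-readdir on
    run gluster volume start "$vol"
}

# Step 3: nested directories and files, created through the mount point.
# mkdir -p stands in for the four separate mkdir calls in the report.
populate() {
    local mnt=$1
    run mkdir -p "$mnt/dir_1/dir_2/dir_3/dir_4"
    run bash -c "touch '$mnt'/dir_1/dir_2/file{1..100}"
    run bash -c "touch '$mnt'/dir_1/dir_2/dir_3/file{101..200}"
    run bash -c "touch '$mnt'/dir_1/dir_2/dir_3/dir_4/a{201..300}"
}

# Steps 4-6: add five more bricks, rebalance, and query the status.
expand_and_rebalance() {
    local vol=$1 host=$2
    run gluster volume add-brick "$vol" $(bricks "$host" 6 10)
    run gluster volume rebalance "$vol" start
    run gluster volume rebalance "$vol" status
}
```

A run would call setup_volume volume_01 testk1, mount the volume at /mnt/volume_01 (the report does not say which mount type was used), then call populate /mnt/volume_01 and expand_and_rebalance volume_01 testk1. Setting DRY_RUN=0 executes the commands instead of printing them.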

Actual results:
Only dir_1 and dir_2 are synced to the 5 newly added bricks
(dir_3 and dir_4 are not synced).

Expected results:
All four directories should be synced to the 5 newly added bricks.
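
The per-brick check in step 7 can be sketched as follows. On a distribute volume every directory is expected to exist on every brick, so the sketch reports any of the four test directories that is absent from a brick root. The function name missing_dirs is hypothetical, and the brick roots are passed in by the caller.

```shell
#!/usr/bin/env bash
# Report which of the four test directories are missing on each brick root.
# missing_dirs is an illustrative helper, not part of GlusterFS.
missing_dirs() {
    local brick d
    for brick in "$@"; do
        for d in dir_1 dir_1/dir_2 dir_1/dir_2/dir_3 dir_1/dir_2/dir_3/dir_4; do
            # Print one line per directory that the brick does not have.
            [ -d "$brick/$d" ] || echo "$brick: missing $d"
        done
    done
}
```

For the volume in this report, the new bricks could be checked with missing_dirs /mnt/brick{06..10}/bk; no output means all four directories are present on every brick given.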

Additional info:
The detailed steps, volume info, options and logs are in the attachment.
[root@K1 glusterfs]# gluster v get volume_01 all
Option                                  Value
------                                  -----
cluster.lookup-unhashed                 on
cluster.lookup-optimize                 on
cluster.min-free-disk                   10%
cluster.min-free-inodes                 5%
cluster.rebalance-stats                 off
cluster.subvols-per-directory           (null)
cluster.readdir-optimize                off
cluster.rsync-hash-regex                (null)
cluster.extra-hash-regex                (null)
cluster.dht-xattr-name                  trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid    off
cluster.rebal-throttle                  normal
cluster.lock-migration                  off
cluster.force-migration                 off
cluster.local-volume-name               (null)
cluster.weighted-rebalance              on
cluster.switch-pattern                  (null)
cluster.entry-change-log                on
cluster.read-subvolume                  (null)
cluster.read-subvolume-index            -1
cluster.read-hash-mode                  1
cluster.background-self-heal-count      8
cluster.metadata-self-heal              on
cluster.data-self-heal                  on
cluster.entry-self-heal                 on
cluster.self-heal-daemon                on
cluster.heal-timeout                    600
cluster.self-heal-window-size           1
cluster.data-change-log                 on
cluster.metadata-change-log             on
cluster.data-self-heal-algorithm        (null)
cluster.eager-lock                      on
disperse.eager-lock                     on
disperse.other-eager-lock               on
disperse.eager-lock-timeout             1
disperse.other-eager-lock-timeout       1
cluster.quorum-type                     none
cluster.quorum-count                    (null)
cluster.choose-local                    true
cluster.self-heal-readdir-size          1KB
cluster.post-op-delay-secs              1
cluster.ensure-durability               on
cluster.consistent-metadata             no
cluster.heal-wait-queue-length          128
cluster.favorite-child-policy           none
cluster.full-lock                       yes
cluster.stripe-block-size               128KB
cluster.stripe-coalesce                 true
diagnostics.latency-measurement         on
diagnostics.dump-fd-stats               off
diagnostics.count-fop-hits              on
diagnostics.brick-log-level             ERROR
diagnostics.client-log-level            ERROR
diagnostics.brick-sys-log-level         CRITICAL
diagnostics.client-sys-log-level        CRITICAL
diagnostics.brick-logger                (null)
diagnostics.client-logger               (null)
diagnostics.brick-log-format            (null)
diagnostics.client-log-format           (null)
diagnostics.brick-log-buf-size          5
diagnostics.client-log-buf-size         5
diagnostics.brick-log-flush-timeout     120
diagnostics.client-log-flush-timeout    120
diagnostics.stats-dump-interval         0
diagnostics.fop-sample-interval         0
diagnostics.stats-dump-format           json
diagnostics.fop-sample-buf-size         65535
diagnostics.stats-dnscache-ttl-sec      86400
performance.cache-max-file-size         0
performance.cache-min-file-size         0
performance.cache-refresh-timeout       1
performance.cache-priority
performance.cache-size                  32MB
performance.io-thread-count             64
performance.high-prio-threads           16
performance.normal-prio-threads         16
performance.low-prio-threads            16
performance.least-prio-threads          1
performance.enable-least-priority       on
performance.iot-watchdog-secs           (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through            false
performance.io-cache-pass-through       false
performance.cache-size                  128MB
performance.qr-cache-timeout            1
performance.cache-invalidation          true
performance.flush-behind                on
performance.nfs.flush-behind            off
performance.write-behind-window-size    1MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct             off
performance.nfs.strict-o-direct         off
performance.strict-write-ordering       off
performance.nfs.strict-write-ordering   off
performance.write-behind-trickling-writes on
performance.aggregate-size              128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open                   yes
performance.read-after-open             no
performance.open-behind-pass-through    false
performance.read-ahead-page-count       4
performance.read-ahead-pass-through     false
performance.readdir-ahead-pass-through  false
performance.md-cache-pass-through       false
performance.md-cache-timeout            1
performance.cache-swift-metadata        true
performance.cache-samba-metadata        false
performance.cache-capability-xattrs     true
performance.cache-ima-xattrs            true
performance.md-cache-statfs             off
performance.xattr-cache-list
performance.nl-cache-pass-through       false
features.encryption                     off
encryption.master-key                   (null)
encryption.data-key-size                256
encryption.block-size                   4096
network.frame-timeout                   1800
network.ping-timeout                    42
network.tcp-window-size                 (null)
network.remote-dio                      disable
client.event-threads                    8
client.tcp-user-timeout                 0
client.keepalive-time                   20
client.keepalive-interval               2
client.keepalive-count                  9
network.tcp-window-size                 (null)
network.inode-lru-limit                 16384
auth.allow                              *
auth.reject                             (null)
transport.keepalive                     1
server.allow-insecure                   on
server.root-squash                      off
server.anonuid                          65534
server.anongid                          65534
server.statedump-path                   /var/run/gluster
server.outstanding-rpc-limit            64
server.ssl                              (null)
auth.ssl-allow                          *
server.manage-gids                      off
server.dynamic-auth                     on
client.send-gids                        on
server.gid-timeout                      300
server.own-thread                       (null)
server.event-threads                    8
server.tcp-user-timeout                 0
server.keepalive-time                   20
server.keepalive-interval               2
server.keepalive-count                  9
transport.listen-backlog                1024
ssl.own-cert                            (null)
ssl.private-key                         (null)
ssl.ca-list                             (null)
ssl.crl-path                            (null)
ssl.certificate-depth                   (null)
ssl.cipher-list                         (null)
ssl.dh-param                            (null)
ssl.ec-curve                            (null)
transport.address-family                inet
performance.write-behind                off
performance.read-ahead                  off
performance.readdir-ahead               on
performance.io-cache                    off
performance.quick-read                  off
performance.open-behind                 off
performance.nl-cache                    on
performance.stat-prefetch               on
performance.client-io-threads           on
performance.nfs.write-behind            off
performance.nfs.read-ahead              off
performance.nfs.io-cache                off
performance.nfs.quick-read              off
performance.nfs.stat-prefetch           off
performance.nfs.io-threads              off
performance.force-readdirp              true
performance.cache-invalidation          true
features.uss                            off
features.snapshot-directory             .snaps
features.show-snapshot-directory        off
features.tag-namespaces                 off
network.compression                     off
network.compression.window-size         -15
network.compression.mem-level           8
network.compression.min-size            0
network.compression.compression-level   -1
network.compression.debug               false
features.default-soft-limit             80%
features.soft-timeout                   60
features.hard-timeout                   5
features.alert-time                     86400
features.quota-deem-statfs              off
geo-replication.indexing                off
geo-replication.indexing                off
geo-replication.ignore-pid-check        off
geo-replication.ignore-pid-check        off
features.quota                          off
features.inode-quota                    off
features.bitrot                         disable
debug.trace                             off
debug.log-history                       no
debug.log-file                          no
debug.exclude-ops                       (null)
debug.include-ops                       (null)
debug.error-gen                         off
debug.error-failure                     (null)
debug.error-number                      (null)
debug.random-failure                    off
debug.error-fops                        (null)
nfs.enable-ino32                        no
nfs.mem-factor                          15
nfs.export-dirs                         on
nfs.export-volumes                      on
nfs.addr-namelookup                     off
nfs.dynamic-volumes                     off
nfs.register-with-portmap               on
nfs.outstanding-rpc-limit               16
nfs.port                                2049
nfs.rpc-auth-unix                       on
nfs.rpc-auth-null                       on
nfs.rpc-auth-allow                      all
nfs.rpc-auth-reject                     none
nfs.ports-insecure                      off
nfs.trusted-sync                        off
nfs.trusted-write                       off
nfs.volume-access                       read-write
nfs.export-dir
nfs.disable                             off
nfs.nlm                                 on
nfs.acl                                 on
nfs.mount-udp                           off
nfs.mount-rmtab                         /var/lib/glusterd/nfs/rmtab
nfs.rpc-statd                           /sbin/rpc.statd
nfs.server-aux-gids                     off
nfs.drc                                 off
nfs.drc-size                            0x20000
nfs.read-size                           (1 * 1048576ULL)
nfs.write-size                          (1 * 1048576ULL)
nfs.readdir-size                        (1 * 1048576ULL)
nfs.rdirplus                            on
nfs.event-threads                       1
nfs.exports-auth-enable                 off
nfs.auth-refresh-interval-sec           30
nfs.auth-cache-ttl-sec                  30
features.read-only                      off
features.worm                           off
features.worm-file-level                disable
features.worm-files-deletable           on
features.default-retention-period       2147483647
features.retention-mode                 enterprise
features.auto-commit-period             7200
storage.linux-aio                       off
storage.batch-fsync-mode                reverse-fsync
storage.batch-fsync-delay-usec          0
storage.owner-uid                       -1
storage.owner-gid                       -1
storage.node-uuid-pathinfo              off
storage.health-check-interval           30
storage.build-pgfid                     off
storage.gfid2path                       on
storage.gfid2path-separator             :
storage.reserve                         1
storage.health-check-timeout            10
storage.fips-mode-rchecksum             off
storage.force-create-mode               0000
storage.force-directory-mode            0000
storage.create-mask                     0777
storage.create-directory-mask           0777
storage.max-hardlinks                   100
storage.ctime                           off
config.gfproxyd                         off
cluster.server-quorum-type              off
cluster.server-quorum-ratio             0
changelog.changelog                     off
changelog.changelog-dir                 {{ brick.path }}/.glusterfs/changelogs
changelog.encoding                      ascii
changelog.rollover-time                 15
changelog.fsync-interval                5
changelog.changelog-barrier-timeout     120
changelog.capture-del-path              off
features.barrier                        disable
features.barrier-timeout                120
features.trash                          off
features.trash-dir                      .trashcan
features.trash-eliminate-path           (null)
features.trash-max-filesize             5MB
features.trash-internal-op              off
cluster.enable-shared-storage           disable
locks.trace                             off
locks.mandatory-locking                 off
cluster.disperse-self-heal-daemon       enable
cluster.quorum-reads                    no
client.bind-insecure                    (null)
features.timeout                        45
features.failover-hosts                 (null)
features.shard                          off
features.shard-block-size               64MB
features.scrub-throttle                 lazy
features.scrub-freq                     biweekly
features.scrub                          false
features.expiry-time                    120
features.cache-invalidation             on
features.cache-invalidation-timeout     600
features.leases                         off
features.lease-lock-recall-timeout      60
disperse.background-heals               8
disperse.heal-wait-qlength              128
cluster.heal-timeout                    600
dht.force-readdirp                      on
disperse.read-policy                    gfid-hash
cluster.shd-max-threads                 1
cluster.shd-wait-qlength                1024
cluster.locking-scheme                  full
cluster.granular-entry-heal             no
features.locks-revocation-secs          0
features.locks-revocation-clear-all     false
features.locks-revocation-max-blocked   0
features.locks-monkey-unlocking         false
features.locks-notify-contention        no
features.locks-notify-contention-delay  5
disperse.shd-max-threads                1
disperse.shd-wait-qlength               1024
disperse.cpu-extensions                 auto
disperse.self-heal-window-size          1
cluster.use-compound-fops               off
performance.parallel-readdir            on
performance.rda-request-size            131072
performance.rda-low-wmark               4096
performance.rda-high-wmark              128KB
performance.rda-cache-limit             40MB
performance.nl-cache-positive-entry     false
performance.nl-cache-limit              10MB
performance.nl-cache-timeout            60
cluster.brick-multiplex                 off
cluster.max-bricks-per-process          0
disperse.optimistic-change-log          on
disperse.stripe-cache                   4
cluster.halo-enabled                    False
cluster.halo-shd-max-latency            99999
cluster.halo-nfsd-max-latency           5
cluster.halo-max-latency                5
cluster.halo-max-replicas               99999
cluster.halo-min-replicas               2
debug.delay-gen                         off
delay-gen.delay-percentage              10%
delay-gen.delay-duration                100000
delay-gen.enable
disperse.parallel-writes                on
features.sdfs                           off
features.cloudsync                      off
features.utime                          off

[root@K1 glusterfs]# gluster v info
 
Volume Name: volume_01
Type: Distribute
Volume ID: 140b35e4-c095-457f-8f15-0095a10ad83d
Status: Started
Snapshot Count: 0
Number of Bricks: 10
Transport-type: tcp
Bricks:
Brick1: testk1:/mnt/brick01/bk
Brick2: testk1:/mnt/brick02/bk
Brick3: testk1:/mnt/brick03/bk
Brick4: testk1:/mnt/brick04/bk
Brick5: testk1:/mnt/brick05/bk
Brick6: testk1:/mnt/brick06/bk
Brick7: testk1:/mnt/brick07/bk
Brick8: testk1:/mnt/brick08/bk
Brick9: testk1:/mnt/brick09/bk
Brick10: testk1:/mnt/brick10/bk
Options Reconfigured:
performance.rda-cache-limit: 40MB
performance.parallel-readdir: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.auto-commit-period: 7200
features.retention-mode: enterprise
features.default-retention-period: 2147483647
features.worm-file-level: disable
nfs.auth-cache-ttl-sec: 30
nfs.auth-refresh-interval-sec: 30
nfs.exports-auth-enable: off
performance.nfs.write-behind: off
performance.nl-cache: on
performance.open-behind: off
performance.quick-read: off
performance.io-cache: off
performance.read-ahead: off
performance.write-behind: off
server.event-threads: 8
client.event-threads: 8
performance.nfs.flush-behind: off
performance.cache-invalidation: true
performance.io-thread-count: 64
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: off
[root@K1 glusterfs]# gluster v status
Status of volume: volume_01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick testk1:/mnt/brick01/bk                49152     0          Y       3223 
Brick testk1:/mnt/brick02/bk                49153     0          Y       3253 
Brick testk1:/mnt/brick03/bk                49154     0          Y       3283 
Brick testk1:/mnt/brick04/bk                49155     0          Y       3313 
Brick testk1:/mnt/brick05/bk                49156     0          Y       3343 
Brick testk1:/mnt/brick06/bk                49157     0          Y       3570 
Brick testk1:/mnt/brick07/bk                49158     0          Y       3600 
Brick testk1:/mnt/brick08/bk                49159     0          Y       3630 
Brick testk1:/mnt/brick09/bk                49160     0          Y       3660 
Brick testk1:/mnt/brick10/bk                49161     0          Y       3690 
NFS Server on localhost                     2049      0          Y       3842 
 
Task Status of Volume volume_01
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 5afe22d8-9906-4a76-93f3-40b8c699cb34
Status               : completed           
 
[root@K1 glusterfs]# gluster --version
glusterfs 4.1.8
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Comment 1 Nithya Balachandran 2019-09-04 05:01:35 UTC

I'll take a look and get back to you.

Comment 2 Nithya Balachandran 2019-10-16 12:19:22 UTC

Hi,

Apologies for the delay but I finally managed to spend some time on this. Here is what I have so far:

Release 4 is EOL so I tried with release-5.

I used FUSE, not NFS, and could not reproduce the issue with rebalance: the contents of all directories were migrated to the new bricks.
I did, however, see an issue where I could not list the directories from the FUSE mount immediately after they were created. This issue was not seen with parallel-readdir off.

[root@rhgs313-7 ~]# glusterd; gluster v create test 192.168.122.7:/bricks/brick1/t-{1..5} ; gluster v set test readdir-ahead on; gluster v set test parallel-readdir on; gluster v start test;
volume create: test: success: please start the volume to access data
volume set: success
volume set: success
volume start: test: success
[root@rhgs313-7 ~]# mount -t glusterfs -s 192.168.122.7:/test /mnt/fuse1
[root@rhgs313-7 ~]# cd /mnt/fuse1/; mkdir dir_1; mkdir dir_1/dir_2; mkdir dir_1/dir_2/dir_3; mkdir dir_1/dir_2/dir_3/dir_4
[root@rhgs313-7 fuse1]# ll
total 0

On further analysis, this happened because the stat information for the directories received in dht_readdirp_cbk was invalid, which causes dht to strip those entries out of the listing. This was fixed by https://review.gluster.org/#/c/glusterfs/+/21811/ and is available from release-6 onwards.
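
One way to spot this symptom on a mount is to look for names that can still be looked up directly but are absent from the listing of their parent directory. The following is a minimal sketch, assuming named lookups keep working (as the nested mkdir calls in the transcript above suggest); the function name unlisted_entries is hypothetical.

```shell
#!/usr/bin/env bash
# Print every given name that exists under the parent directory when looked
# up by name but does not appear in the parent's directory listing.
# unlisted_entries is an illustrative helper, not part of GlusterFS.
unlisted_entries() {
    local parent=$1 name
    shift
    for name in "$@"; do
        # -e resolves the name directly; ls -A goes through readdir.
        if [ -e "$parent/$name" ] && ! ls -A "$parent" | grep -qxF -- "$name"; then
            echo "$name"
        fi
    done
}
```

For the transcript above, unlisted_entries /mnt/fuse1 dir_1 would be expected to print dir_1 while the listing is empty, and nothing once the entry shows up in the listing.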

It is possible that the same issue occurred on your volume, so rebalance never processed these directories. As the log level has been set to ERROR, there are no messages in the rebalance log that can be used to figure out what happened.

Please do the following:
1. Enable info level logging for client-log-level, reproduce the issue and send me the rebalance log.
2. Upgrade to release 6.x and see if you can still see the issue.
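
The first request can be sketched as below. This is a sketch under assumptions: the log path follows the usual /var/log/glusterfs/<volname>-rebalance.log layout, which may differ on a given installation, and the helper names and the DRY_RUN switch are hypothetical. By default the set command is only printed.

```shell
#!/usr/bin/env bash
# Print a command; execute it only when DRY_RUN=0 (illustrative switch).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Raise the client log level to INFO for the given volume.
enable_info_logging() {
    run gluster volume set "$1" diagnostics.client-log-level INFO
}

# Print the assumed location of the rebalance log for a volume.
# The optional second argument overrides the log directory.
rebalance_log_path() {
    echo "${2:-/var/log/glusterfs}/$1-rebalance.log"
}
```

For the reporter's volume this would be enable_info_logging volume_01, followed by reproducing the issue and attaching the file printed by rebalance_log_path volume_01.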

Comment 3 Nithya Balachandran 2019-11-04 05:26:23 UTC

(In reply to Nithya Balachandran from comment #2)
> Hi,
> 
> Apologies for the delay but I finally managed to spend some time on this.
> Here is what I have so far:
> 
> Release 4 is EOL so I tried with release-5.

Apologies - 4 is not EOL yet. I retried the test above with the latest release-4.1 code and could not reproduce the rebalance problem.
Please send the logs requested earlier and I will look into it.

Comment 4 Nithya Balachandran 2019-11-11 11:37:21 UTC

I'm closing this with WorksForMe. Please reopen if you still see this in the latest releases.

Comment 5 Red Hat Bugzilla 2023-09-14 05:42:42 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days