Bug 1285695

Summary: nfs-ganesha+data tiering: dbench test process in "D" state with vers=4
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Saurabh <saujain>
Component: nfs-ganesha
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: unspecified
Priority: unspecified
Version: rhgs-3.1
CC: jthottan, kkeithle, ndevos, nlevinki, rkavunga, skoduri
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2018-11-15 05:02:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Saurabh 2015-11-26 09:51:36 UTC
Description of problem:
I have a data-tiering volume exported through nfs-ganesha and mounted with vers=4, and started running fs-sanity on it. dbench, one of the tools in the fs-sanity suite, is now hung, with its worker processes in "D" (uninterruptible sleep) state.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-7.el7rhgs.x86_64
nfs-ganesha-2.2.0-11.el7rhgs.x86_64

How reproducible:
Seen for the first time.

Actual results:
ganesha-gfapi.log reports the following errors:

[2015-11-26 17:46:35.213205] E [glfs-handleops.c:1809:glfs_h_poll_cache_invalidation] (-->/usr/lib64/ganesha/libfsalgluster.so(GLUSTERFSAL_UP_Thread+0xc3) [0x7fe45a801ec3] -->/lib64/libgfapi.so.0(glfs_h_poll_upcall+0x190) [0x7fe45a3e9ce0] -->/lib64/libgfapi.so.0(+0x18a4c) [0x7fe45a3e9a4c] ) 0-glfs_h_poll_cache_invalidation: invalid argument: object [Invalid argument]
[2015-11-26 17:46:36.213664] E [dht-helper.c:598:dht_subvol_get_hashed] (-->/lib64/libglusterfs.so.0(default_lookup+0x5b) [0x7fe45a13074b] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_lookup+0xa81) [0x7fe44e280871] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_subvol_get_hashed+0x1c8) [0x7fe44e24a7b8] ) 0-vol2-tier-dht: invalid argument: loc->parent [Invalid argument]
[2015-11-26 17:46:36.213869] E [dht-helper.c:598:dht_subvol_get_hashed] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_lookup+0xf8c) [0x7fe44e280d7c] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_lookup+0xa81) [0x7fe44e280871] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_subvol_get_hashed+0x1c8) [0x7fe44e24a7b8] ) 0-vol2-cold-dht: invalid argument: loc->parent [Invalid argument]
[2015-11-26 17:46:36.214608] E [dht-helper.c:598:dht_subvol_get_hashed] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_lookup+0xf8c) [0x7fe44e280d7c] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_lookup+0xa81) [0x7fe44e280871] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/distribute.so(dht_subvol_get_hashed+0x1c8) [0x7fe44e24a7b8] ) 0-vol2-hot-dht: invalid argument: loc->parent [Invalid argument]
The message "W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]" repeated 2 times between [2015-11-26 17:46:34.209148] and [2015-11-26 17:46:36.214616]
[2015-11-26 17:46:36.215323] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2015-11-26 17:46:36.215487] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2015-11-26 17:46:36.215629] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-3: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2015-11-26 17:46:36.215742] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-4: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
The message "W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-6: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]" repeated 2 times between [2015-11-26 17:46:34.209935] and [2015-11-26 17:46:36.215813]
[2015-11-26 17:46:36.215892] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-vol2-client-5: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2015-11-26 17:46:36.216153] E [glfs-handleops.c:1809:glfs_h_poll_cache_invalidation] (-->/usr/lib64/ganesha/libfsalgluster.so(GLUSTERFSAL_UP_Thread+0xc3) [0x7fe45a801ec3] -->/lib64/libgfapi.so.0(glfs_h_poll_upcall+0x190) [0x7fe45a3e9ce0] -->/lib64/libgfapi.so.0(+0x18a4c) [0x7fe45a3e9a4c] ) 0-glfs_h_poll_cache_invalidation: invalid argument: object [Invalid argument]


On the client side, the dbench worker processes are stuck in "D" state:

# ps -auxww | grep dbench
root     24378  0.0  0.0 113120  1400 pts/1    S+   03:12   0:00 /bin/bash /opt/qa/tools/system_light/scripts/dbench/dbench.sh
root     24379  0.0  0.0 113124  1400 pts/1    S+   03:12   0:00 /bin/bash /opt/qa/tools/system_light/scripts/dbench/dbench_run.sh
root     24381  0.0  0.0   6420   620 pts/1    S+   03:12   0:05 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24382  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24383  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24384  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24385  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24386  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24387  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24388  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24389  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24390  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
root     24391  0.0  0.0   7444  1376 pts/1    D+   03:12   0:00 dbench -t 300 -c /opt/qa/tools/client.txt -s -S 10
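To narrow a listing like the one above down to only the stuck tasks, here is a minimal sketch, assuming a Linux client with /proc mounted. It reads the State: field of each /proc/<pid>/status and prints the PID and name of every task in "D" state. The function name list_d_state and its optional proc-root argument are hypothetical, added only for illustration and testing.

```shell
#!/bin/sh
# Hypothetical helper: list tasks in uninterruptible sleep ("D" state)
# by reading /proc/<pid>/status directly, so it does not depend on ps.
# Optional first argument: proc root (defaults to /proc); it exists only
# so the function can be pointed at a test fixture directory.
list_d_state() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # Skip unmatched globs and tasks that exited while we iterate.
        [ -r "$status" ] || continue
        # "State:\tD (disk sleep)" -> $2 is the one-letter state code.
        # /^Pid:/ is anchored, so PPid: and TracerPid: are not matched.
        awk '
            /^Name:/  { name = $2 }
            /^State:/ { state = $2 }
            /^Pid:/   { pid = $2 }
            END { if (state == "D") print pid, name }
        ' "$status" 2>/dev/null
    done
}

# On the hung client, the dbench workers (PIDs 24382-24391 in the
# listing above) would be expected to show up here.
list_d_state
```

As root, `cat /proc/<pid>/stack` for each listed PID may additionally show the kernel stack the task is blocked in, which is often more telling for a hung NFS mount than the process state alone.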

# gluster volume info vol2
 
Volume Name: vol2
Type: Tier
Volume ID: 769c043e-7764-4f04-9b4f-5d83a0b4d5a2
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.46.60:/rhs/brick4/d2r2-vol2
Brick2: 10.70.46.63:/rhs/brick4/d2r1-vol2
Brick3: 10.70.46.64:/rhs/brick4/d1r2-vol2
Brick4: 10.70.46.59:/rhs/brick4/d1r1-vol2
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick5: 10.70.46.59:/rhs/brick3/d1r1-vol2
Brick6: 10.70.46.64:/rhs/brick3/d1r2-vol2
Brick7: 10.70.46.63:/rhs/brick3/d2r1-vol2
Brick8: 10.70.46.60:/rhs/brick3/d2r2-vol2
Options Reconfigured:
cluster.write-freq-threshold: 0
cluster.read-freq-threshold: 0
ganesha.enable: on
features.cache-invalidation: on
cluster.tier-mode: cache
features.ctr-enabled: on
nfs.disable: on
performance.readdir-ahead: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable


Expected results:
The dbench run should complete successfully.

Additional info:

Comment 4 Shashank Raj 2016-04-30 18:14:32 UTC
Verified this bug with glusterfs-3.7.9-2 and nfs-ganesha-2.3.1-4; the issue is not reproducible.

Ran the dbench tool separately on both v3 and v4 ganesha mounts of a tier volume with the configuration below, and it passes without any issues.

[root@dhcp37-180 ~]# gluster volume info tiervolume
 
Volume Name: tiervolume
Type: Tier
Volume ID: 45fd73f7-e8ed-43da-b9c6-79ae042cef12
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: 10.70.37.174:/bricks/brick3/b3
Brick2: 10.70.37.127:/bricks/brick3/b3
Brick3: 10.70.37.158:/bricks/brick3/b3
Brick4: 10.70.37.180:/bricks/brick3/b3
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick5: 10.70.37.180:/bricks/brick0/b0
Brick6: 10.70.37.158:/bricks/brick0/b0
Brick7: 10.70.37.127:/bricks/brick0/b0
Brick8: 10.70.37.174:/bricks/brick0/b0
Brick9: 10.70.37.180:/bricks/brick1/b1
Brick10: 10.70.37.158:/bricks/brick1/b1
Brick11: 10.70.37.127:/bricks/brick1/b1
Brick12: 10.70.37.174:/bricks/brick1/b1
Brick13: 10.70.37.180:/bricks/brick2/b2
Brick14: 10.70.37.158:/bricks/brick2/b2
Brick15: 10.70.37.127:/bricks/brick2/b2
Brick16: 10.70.37.174:/bricks/brick2/b2
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
cluster.tier-mode: cache
features.ctr-enabled: on
nfs.disable: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: enable

Comment 6 Shashank Raj 2016-07-28 10:12:57 UTC
As per comment 4, this has already been verified on a tiered volume, and no issues were seen across multiple runs of the dbench tool. This can be closed.