Bug 1302208

Summary: [Tiering]: IO's hung from all clients on a tiered volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: tier
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED NOTABUG
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: byarlaga, dlambrig, nbalacha, rgowdapp, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-02 15:09:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description krishnaram Karthick 2016-01-27 07:37:57 UTC
Description of problem:
On a 16-node setup, I/O is hung on all fuse mounts after running I/O overnight.

IO pattern (a rough sketch of the load follows below):
1) dd from multiple clients to multiple subfolders
2) Linux untar
3) continuous 'ls' on all folders in the volume
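
A rough sketch of this load pattern (assuming a hypothetical fuse mount at /mnt/krkvol on each client and a placeholder kernel tarball path; not the exact reproducer script):

# run on each fuse client against its own subdirectory
MNT=/mnt/krkvol
CLIENT=$(hostname -s)
mkdir -p $MNT/$CLIENT/dd $MNT/$CLIENT/untar

# 1) dd writes into per-client subfolders
for i in $(seq 1 50); do
    dd if=/dev/urandom of=$MNT/$CLIENT/dd/file.$i bs=1M count=512
done &

# 2) Linux untar (placeholder tarball path)
tar -xf /root/linux.tar.xz -C $MNT/$CLIENT/untar &

# 3) continuous ls on all folders in the volume
while true; do ls -lR $MNT > /dev/null; sleep 10; done &

wait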

One of the bricks on the hot tier appears to be hung; running 'ls' directly on the backend brick also hangs.

Even if one of the bricks is flaky, I/O on the entire volume shouldn't be affected.

dmesg from the node where brick process is hung:

[132721.696081] INFO: task glusterfsd:30458 blocked for more than 120 seconds.
[132721.698739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[132721.699963] glusterfsd      D 0000000000000000     0 30458      1 0x00000080
[132721.700093]  ffff88013eb9fad8 0000000000000086 ffff88007720dc00 ffff88013eb9ffd8
[132721.700102]  ffff88013eb9ffd8 ffff88013eb9ffd8 ffff88007720dc00 ffff88040f88e000
[132721.700105]  ffff88026d4037e8 ffff88040f88e1c0 00000000000209cc 0000000000000000
[132721.700113] Call Trace:
[132721.700178]  [<ffffffff8163a879>] schedule+0x29/0x70
[132721.700721]  [<ffffffffa026867d>] xlog_grant_head_wait+0x9d/0x180 [xfs]
[132721.700748]  [<ffffffffa02687fe>] xlog_grant_head_check+0x9e/0x110 [xfs]
[132721.700783]  [<ffffffffa026c192>] xfs_log_reserve+0xc2/0x1b0 [xfs]
[132721.700920]  [<ffffffffa0266ae5>] xfs_trans_reserve+0x1b5/0x1f0 [xfs]
[132721.700942]  [<ffffffffa0258426>] xfs_vn_update_time+0x56/0x190 [xfs]
[132721.701049]  [<ffffffff811f98f5>] update_time+0x25/0xd0
[132721.701067]  [<ffffffff812a68de>] ? process_measurement+0x8e/0x250
[132721.701071]  [<ffffffff811f9ba0>] file_update_time+0xa0/0xf0
[132721.701094]  [<ffffffffa024f93d>] xfs_file_aio_write_checks+0x11d/0x180 [xfs]
[132721.701112]  [<ffffffffa024fa33>] xfs_file_buffered_aio_write+0x93/0x260 [xfs]
[132721.701135]  [<ffffffffa024fcd0>] xfs_file_aio_write+0xd0/0x150 [xfs]
[132721.701155]  [<ffffffff811ddcbd>] do_sync_write+0x8d/0xd0
[132721.701159]  [<ffffffff811de4dd>] vfs_write+0xbd/0x1e0
[132721.701166]  [<ffffffff811eeaad>] ? putname+0x3d/0x60
[132721.701171]  [<ffffffff811def7f>] SyS_write+0x7f/0xe0
[132721.701183]  [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b


The sosreport and statedump will be attached shortly. See the additional info section for details on the volume used for the test.
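
For reference, these reports are typically collected as follows (a sketch of the standard commands; brick statedumps land under /var/run/gluster by default):

# on each storage node, generate a support report
sosreport --batch

# server-side (brick) statedumps for the volume
gluster volume statedump krkvol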

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-17.el7rhgs.x86_64

How reproducible:
Yet to determine

Steps to Reproduce:
Run overnight I/O on a tiered volume with the I/O pattern mentioned above.

Actual results:
I/O hung on all fuse mounts.

Expected results:
No I/O failures or hangs.

Additional info:
[root@dhcp37-120 ~]# gluster v info krkvol

Volume Name: krkvol  
Type: Tier
Volume ID: 520766ab-c67c-4398-9011-cf6fadc8d4b0
Status: Started
Number of Bricks: 36 
Transport-type: tcp  
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 6 x 2 = 12
Brick1: 10.70.35.163:/rhs/brick6/krkvol
Brick2: 10.70.35.173:/rhs/brick6/krkvol
Brick3: 10.70.35.232:/rhs/brick6/krkvol
Brick4: 10.70.35.176:/rhs/brick6/krkvol
Brick5: 10.70.35.231:/rhs/brick6/krkvol
Brick6: 10.70.35.89:/rhs/brick6/krkvol
Brick7: 10.70.37.195:/rhs/brick6/krkvol
Brick8: 10.70.37.202:/rhs/brick6/krkvol
Brick9: 10.70.37.120:/rhs/brick6/krkvol
Brick10: 10.70.37.60:/rhs/brick6/krkvol
Brick11: 10.70.37.69:/rhs/brick6/krkvol
Brick12: 10.70.37.101:/rhs/brick6/krkvol
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick13: 10.70.35.176:/rhs/brick5/krkvol
Brick14: 10.70.35.232:/rhs/brick5/krkvol
Brick15: 10.70.35.173:/rhs/brick5/krkvol
Brick16: 10.70.35.163:/rhs/brick5/krkvol
Brick17: 10.70.37.101:/rhs/brick5/krkvol
Brick18: 10.70.37.69:/rhs/brick5/krkvol
Brick19: 10.70.37.60:/rhs/brick5/krkvol
Brick20: 10.70.37.120:/rhs/brick5/krkvol
Brick21: 10.70.37.202:/rhs/brick4/krkvol
Brick22: 10.70.37.195:/rhs/brick4/krkvol
Brick23: 10.70.35.155:/rhs/brick4/krkvol
Brick24: 10.70.35.222:/rhs/brick4/krkvol
Brick25: 10.70.35.108:/rhs/brick4/krkvol
Brick26: 10.70.35.44:/rhs/brick4/krkvol
Brick27: 10.70.35.89:/rhs/brick4/krkvol
Brick28: 10.70.35.231:/rhs/brick4/krkvol
Brick29: 10.70.35.176:/rhs/brick4/krkvol
Brick30: 10.70.35.232:/rhs/brick4/krkvol
Brick31: 10.70.35.173:/rhs/brick4/krkvol
Brick32: 10.70.35.163:/rhs/brick4/krkvol
Brick33: 10.70.37.101:/rhs/brick4/krkvol
Brick34: 10.70.37.69:/rhs/brick4/krkvol
Brick35: 10.70.37.60:/rhs/brick4/krkvol
Brick36: 10.70.37.120:/rhs/brick4/krkvol
Options Reconfigured:
cluster.watermark-hi: 80
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: off
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
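
For reference, the tier-related options shown above would normally be applied with 'gluster volume set' (a sketch reconstructed from the values listed, not necessarily the exact commands that were run):

gluster volume set krkvol cluster.tier-mode cache
gluster volume set krkvol cluster.watermark-hi 80
gluster volume set krkvol features.ctr-enabled on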

[root@dhcp37-120 ~]# gluster v status krkvol | more
Status of volume: krkvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.35.163:/rhs/brick6/krkvol       49155     0          Y       9393 
Brick 10.70.35.173:/rhs/brick6/krkvol       49155     0          Y       8991 
Brick 10.70.35.232:/rhs/brick6/krkvol       49155     0          Y       8470 
Brick 10.70.35.176:/rhs/brick6/krkvol       49155     0          Y       8894 
Brick 10.70.35.231:/rhs/brick6/krkvol       49157     0          Y       8707 
Brick 10.70.35.89:/rhs/brick6/krkvol        49155     0          Y       30443
Brick 10.70.37.195:/rhs/brick6/krkvol       49155     0          Y       2198 
Brick 10.70.37.202:/rhs/brick6/krkvol       49155     0          Y       2620 
Brick 10.70.37.120:/rhs/brick6/krkvol       49155     0          Y       8798 
Brick 10.70.37.60:/rhs/brick6/krkvol        49155     0          Y       10643
Brick 10.70.37.69:/rhs/brick6/krkvol        49155     0          Y       8727 
Brick 10.70.37.101:/rhs/brick6/krkvol       49155     0          Y       10809
Cold Bricks:
Brick 10.70.35.176:/rhs/brick5/krkvol       49153     0          Y       26780
Brick 10.70.35.232:/rhs/brick5/krkvol       49153     0          Y       26492
Brick 10.70.35.173:/rhs/brick5/krkvol       49153     0          Y       26785
Brick 10.70.35.163:/rhs/brick5/krkvol       49153     0          Y       26784
Brick 10.70.37.101:/rhs/brick5/krkvol       49153     0          Y       26870
Brick 10.70.37.69:/rhs/brick5/krkvol        49153     0          Y       26772
Brick 10.70.37.60:/rhs/brick5/krkvol        49153     0          Y       26873
Brick 10.70.37.120:/rhs/brick5/krkvol       49153     0          Y       26682
Brick 10.70.37.202:/rhs/brick4/krkvol       49154     0          Y       20760
Brick 10.70.37.195:/rhs/brick4/krkvol       49154     0          Y       20374
Brick 10.70.35.155:/rhs/brick4/krkvol       49154     0          Y       17315
Brick 10.70.35.222:/rhs/brick4/krkvol       49154     0          Y       17468
Brick 10.70.35.108:/rhs/brick4/krkvol       49154     0          Y       8450 
Brick 10.70.35.44:/rhs/brick4/krkvol        49154     0          Y       16067
Brick 10.70.35.89:/rhs/brick4/krkvol        49154     0          Y       16048
Brick 10.70.35.231:/rhs/brick4/krkvol       49156     0          Y       26813
Brick 10.70.35.176:/rhs/brick4/krkvol       49154     0          Y       26799
Brick 10.70.35.232:/rhs/brick4/krkvol       49154     0          Y       26511
Brick 10.70.35.173:/rhs/brick4/krkvol       49154     0          Y       26804
Brick 10.70.35.163:/rhs/brick4/krkvol       49154     0          Y       26803
Brick 10.70.37.101:/rhs/brick4/krkvol       49154     0          Y       26889
Brick 10.70.37.69:/rhs/brick4/krkvol        49154     0          Y       26791
Brick 10.70.37.60:/rhs/brick4/krkvol        49154     0          Y       26892
Brick 10.70.37.120:/rhs/brick4/krkvol       49154     0          Y       26701
NFS Server on localhost                     2049      0          Y       8818
Self-heal Daemon on localhost               N/A       N/A        Y       8826
Quota Daemon on localhost                   N/A       N/A        Y       8834
NFS Server on 10.70.37.69                   2049      0          Y       8747
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       8755
Quota Daemon on 10.70.37.69                 N/A       N/A        Y       8763
NFS Server on 10.70.37.60                   2049      0          Y       10663
Self-heal Daemon on 10.70.37.60             N/A       N/A        Y       10671
Quota Daemon on 10.70.37.60                 N/A       N/A        Y       10679
NFS Server on 10.70.37.195                  2049      0          Y       2218 
Self-heal Daemon on 10.70.37.195            N/A       N/A        Y       2226 
Quota Daemon on 10.70.37.195                N/A       N/A        Y       2234 
NFS Server on 10.70.37.101                  2049      0          Y       10837
Self-heal Daemon on 10.70.37.101            N/A       N/A        Y       10845
Quota Daemon on 10.70.37.101                N/A       N/A        Y       10853
NFS Server on dhcp37-202.lab.eng.blr.redhat
.com                                        2049      0          Y       2640 
Self-heal Daemon on dhcp37-202.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2648 
Quota Daemon on dhcp37-202.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       2656 
NFS Server on 10.70.35.232                  2049      0          Y       8490 
Self-heal Daemon on 10.70.35.232            N/A       N/A        Y       8498 
Quota Daemon on 10.70.35.232                N/A       N/A        Y       8506 
NFS Server on 10.70.35.173                  2049      0          Y       9011 
Self-heal Daemon on 10.70.35.173            N/A       N/A        Y       9019 
Quota Daemon on 10.70.35.173                N/A       N/A        Y       9027 
NFS Server on 10.70.35.222                  2049      0          Y       31731
Self-heal Daemon on 10.70.35.222            N/A       N/A        Y       31739
Quota Daemon on 10.70.35.222                N/A       N/A        Y       31747
NFS Server on 10.70.35.89                   2049      0          Y       30463
Self-heal Daemon on 10.70.35.89             N/A       N/A        Y       30471
Quota Daemon on 10.70.35.89                 N/A       N/A        Y       30479
NFS Server on 10.70.35.231                  2049      0          Y       8727 
Self-heal Daemon on 10.70.35.231            N/A       N/A        Y       8735 
Quota Daemon on 10.70.35.231                N/A       N/A        Y       8743 
NFS Server on 10.70.35.108                  2049      0          Y       22694
Self-heal Daemon on 10.70.35.108            N/A       N/A        Y       22702
Quota Daemon on 10.70.35.108                N/A       N/A        Y       22710
NFS Server on 10.70.35.155                  2049      0          Y       31498
Self-heal Daemon on 10.70.35.155            N/A       N/A        Y       31506
Quota Daemon on 10.70.35.155                N/A       N/A        Y       31514
NFS Server on 10.70.35.176                  2049      0          Y       8914
Self-heal Daemon on 10.70.35.176            N/A       N/A        Y       8930
Quota Daemon on 10.70.35.176                N/A       N/A        Y       8938
NFS Server on 10.70.35.44                   2049      0          Y       30350
Self-heal Daemon on 10.70.35.44             N/A       N/A        Y       30358
Quota Daemon on 10.70.35.44                 N/A       N/A        Y       30366
NFS Server on 10.70.35.163                  2049      0          Y       9413
Self-heal Daemon on 10.70.35.163            N/A       N/A        Y       9421
Quota Daemon on 10.70.35.163                N/A       N/A        Y       9429

Task Status of Volume krkvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 7d4cc30c-c603-4375-8e49-4957bd40bd2c
Status               : in progress

Comment 2 krishnaram Karthick 2016-01-27 09:18:09 UTC
sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1302208/

Comment 3 Raghavendra G 2016-01-29 07:32:38 UTC
Hi Karthick,

Is it possible to get statedumps of clients?

regards,
Raghavendra.
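
(For reference: client-side statedumps for fuse mounts are usually triggered by sending SIGUSR1 to the glusterfs client process; a sketch, assuming the default dump directory /var/run/gluster on the client and that the mount process command line contains the volume name:)

# on each fuse client, dump the state of the mount process
kill -USR1 $(pgrep -f 'glusterfs.*krkvol')
ls -l /var/run/gluster/*.dump.*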

Comment 4 Bhaskarakiran 2016-02-11 09:01:50 UTC
Based on discussion:

This is a known issue with LVM (the pool getting full) and not something gluster can work around.
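
(For context, whether a brick's LVM thin pool has filled up can be confirmed with 'lvs'; a sketch, assuming thin-provisioned brick LVs as is typical for RHGS bricks:)

# on the affected node, check thin pool data/metadata usage
lvs -a -o lv_name,pool_lv,data_percent,metadata_percent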

Comment 6 Dan Lambright 2016-06-02 15:09:12 UTC
Closing based on comment 4 and discussion with QE.