Description of problem:
On a 16-node setup, I/Os are hung on all fuse mounts after running I/O overnight.

IO pattern:
1) dd from multiple clients to multiple subfolders
2) linux untar
3) continuous ls on all folders in the vol

One of the bricks on the hot tier seems to be hung; running 'ls' directly on the backend brick also hangs. Even if one brick is flaky, IO on the complete vol shouldn't be affected.

dmesg from the node where the brick process is hung:

[132721.696081] INFO: task glusterfsd:30458 blocked for more than 120 seconds.
[132721.698739] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[132721.699963] glusterfsd      D 0000000000000000     0 30458      1 0x00000080
[132721.700093]  ffff88013eb9fad8 0000000000000086 ffff88007720dc00 ffff88013eb9ffd8
[132721.700102]  ffff88013eb9ffd8 ffff88013eb9ffd8 ffff88007720dc00 ffff88040f88e000
[132721.700105]  ffff88026d4037e8 ffff88040f88e1c0 00000000000209cc 0000000000000000
[132721.700113] Call Trace:
[132721.700178]  [<ffffffff8163a879>] schedule+0x29/0x70
[132721.700721]  [<ffffffffa026867d>] xlog_grant_head_wait+0x9d/0x180 [xfs]
[132721.700748]  [<ffffffffa02687fe>] xlog_grant_head_check+0x9e/0x110 [xfs]
[132721.700783]  [<ffffffffa026c192>] xfs_log_reserve+0xc2/0x1b0 [xfs]
[132721.700920]  [<ffffffffa0266ae5>] xfs_trans_reserve+0x1b5/0x1f0 [xfs]
[132721.700942]  [<ffffffffa0258426>] xfs_vn_update_time+0x56/0x190 [xfs]
[132721.701049]  [<ffffffff811f98f5>] update_time+0x25/0xd0
[132721.701067]  [<ffffffff812a68de>] ? process_measurement+0x8e/0x250
[132721.701071]  [<ffffffff811f9ba0>] file_update_time+0xa0/0xf0
[132721.701094]  [<ffffffffa024f93d>] xfs_file_aio_write_checks+0x11d/0x180 [xfs]
[132721.701112]  [<ffffffffa024fa33>] xfs_file_buffered_aio_write+0x93/0x260 [xfs]
[132721.701135]  [<ffffffffa024fcd0>] xfs_file_aio_write+0xd0/0x150 [xfs]
[132721.701155]  [<ffffffff811ddcbd>] do_sync_write+0x8d/0xd0
[132721.701159]  [<ffffffff811de4dd>] vfs_write+0xbd/0x1e0
[132721.701166]  [<ffffffff811eeaad>] ? putname+0x3d/0x60
[132721.701171]  [<ffffffff811def7f>] SyS_write+0x7f/0xe0
[132721.701183]  [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b

sosreport and statedump shall be attached shortly. See additional info for details of the volume used for the test.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-17.el7rhgs.x86_64

How reproducible:
Yet to be determined

Steps to Reproduce:
Run overnight IO with the IO pattern mentioned above (a workload sketch follows below) on a tiered vol.

Actual results:
IO hung on all fuse mounts.

Expected results:
No IO failures or hangs.
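For reference, a minimal sketch of the kind of client-side load described above (the mount point /mnt/krkvol, the per-client subfolder naming, and the tarball path are hypothetical; the actual test harness may differ):

    # run from each fuse-mounted client against its own subfolder
    MNT=/mnt/krkvol
    HOST=$(hostname -s)
    mkdir -p "$MNT/$HOST"

    # 1) dd writes into the client's subfolder
    for i in $(seq 1 100); do
        dd if=/dev/zero of="$MNT/$HOST/file_$i" bs=1M count=100
    done &

    # 2) linux kernel untar into the same subfolder (tarball path is an assumption)
    tar -xf /tmp/linux-4.4.tar.xz -C "$MNT/$HOST" &

    # 3) continuous ls on all folders in the volume
    while true; do
        ls -lR "$MNT" > /dev/null
        sleep 10
    done &

    wait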
Additional info:

[root@dhcp37-120 ~]# gluster v info krkvol

Volume Name: krkvol
Type: Tier
Volume ID: 520766ab-c67c-4398-9011-cf6fadc8d4b0
Status: Started
Number of Bricks: 36
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 6 x 2 = 12
Brick1: 10.70.35.163:/rhs/brick6/krkvol
Brick2: 10.70.35.173:/rhs/brick6/krkvol
Brick3: 10.70.35.232:/rhs/brick6/krkvol
Brick4: 10.70.35.176:/rhs/brick6/krkvol
Brick5: 10.70.35.231:/rhs/brick6/krkvol
Brick6: 10.70.35.89:/rhs/brick6/krkvol
Brick7: 10.70.37.195:/rhs/brick6/krkvol
Brick8: 10.70.37.202:/rhs/brick6/krkvol
Brick9: 10.70.37.120:/rhs/brick6/krkvol
Brick10: 10.70.37.60:/rhs/brick6/krkvol
Brick11: 10.70.37.69:/rhs/brick6/krkvol
Brick12: 10.70.37.101:/rhs/brick6/krkvol
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (8 + 4) = 24
Brick13: 10.70.35.176:/rhs/brick5/krkvol
Brick14: 10.70.35.232:/rhs/brick5/krkvol
Brick15: 10.70.35.173:/rhs/brick5/krkvol
Brick16: 10.70.35.163:/rhs/brick5/krkvol
Brick17: 10.70.37.101:/rhs/brick5/krkvol
Brick18: 10.70.37.69:/rhs/brick5/krkvol
Brick19: 10.70.37.60:/rhs/brick5/krkvol
Brick20: 10.70.37.120:/rhs/brick5/krkvol
Brick21: 10.70.37.202:/rhs/brick4/krkvol
Brick22: 10.70.37.195:/rhs/brick4/krkvol
Brick23: 10.70.35.155:/rhs/brick4/krkvol
Brick24: 10.70.35.222:/rhs/brick4/krkvol
Brick25: 10.70.35.108:/rhs/brick4/krkvol
Brick26: 10.70.35.44:/rhs/brick4/krkvol
Brick27: 10.70.35.89:/rhs/brick4/krkvol
Brick28: 10.70.35.231:/rhs/brick4/krkvol
Brick29: 10.70.35.176:/rhs/brick4/krkvol
Brick30: 10.70.35.232:/rhs/brick4/krkvol
Brick31: 10.70.35.173:/rhs/brick4/krkvol
Brick32: 10.70.35.163:/rhs/brick4/krkvol
Brick33: 10.70.37.101:/rhs/brick4/krkvol
Brick34: 10.70.37.69:/rhs/brick4/krkvol
Brick35: 10.70.37.60:/rhs/brick4/krkvol
Brick36: 10.70.37.120:/rhs/brick4/krkvol
Options Reconfigured:
cluster.watermark-hi: 80
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: off
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

[root@dhcp37-120 ~]# gluster v status krkvol | more
Status of volume: krkvol
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.35.163:/rhs/brick6/krkvol                 49155     0          Y       9393
Brick 10.70.35.173:/rhs/brick6/krkvol                 49155     0          Y       8991
Brick 10.70.35.232:/rhs/brick6/krkvol                 49155     0          Y       8470
Brick 10.70.35.176:/rhs/brick6/krkvol                 49155     0          Y       8894
Brick 10.70.35.231:/rhs/brick6/krkvol                 49157     0          Y       8707
Brick 10.70.35.89:/rhs/brick6/krkvol                  49155     0          Y       30443
Brick 10.70.37.195:/rhs/brick6/krkvol                 49155     0          Y       2198
Brick 10.70.37.202:/rhs/brick6/krkvol                 49155     0          Y       2620
Brick 10.70.37.120:/rhs/brick6/krkvol                 49155     0          Y       8798
Brick 10.70.37.60:/rhs/brick6/krkvol                  49155     0          Y       10643
Brick 10.70.37.69:/rhs/brick6/krkvol                  49155     0          Y       8727
Brick 10.70.37.101:/rhs/brick6/krkvol                 49155     0          Y       10809
Cold Bricks:
Brick 10.70.35.176:/rhs/brick5/krkvol                 49153     0          Y       26780
Brick 10.70.35.232:/rhs/brick5/krkvol                 49153     0          Y       26492
Brick 10.70.35.173:/rhs/brick5/krkvol                 49153     0          Y       26785
Brick 10.70.35.163:/rhs/brick5/krkvol                 49153     0          Y       26784
Brick 10.70.37.101:/rhs/brick5/krkvol                 49153     0          Y       26870
Brick 10.70.37.69:/rhs/brick5/krkvol                  49153     0          Y       26772
Brick 10.70.37.60:/rhs/brick5/krkvol                  49153     0          Y       26873
Brick 10.70.37.120:/rhs/brick5/krkvol                 49153     0          Y       26682
Brick 10.70.37.202:/rhs/brick4/krkvol                 49154     0          Y       20760
Brick 10.70.37.195:/rhs/brick4/krkvol                 49154     0          Y       20374
Brick 10.70.35.155:/rhs/brick4/krkvol                 49154     0          Y       17315
Brick 10.70.35.222:/rhs/brick4/krkvol                 49154     0          Y       17468
Brick 10.70.35.108:/rhs/brick4/krkvol                 49154     0          Y       8450
Brick 10.70.35.44:/rhs/brick4/krkvol                  49154     0          Y       16067
Brick 10.70.35.89:/rhs/brick4/krkvol                  49154     0          Y       16048
Brick 10.70.35.231:/rhs/brick4/krkvol                 49156     0          Y       26813
Brick 10.70.35.176:/rhs/brick4/krkvol                 49154     0          Y       26799
Brick 10.70.35.232:/rhs/brick4/krkvol                 49154     0          Y       26511
Brick 10.70.35.173:/rhs/brick4/krkvol                 49154     0          Y       26804
Brick 10.70.35.163:/rhs/brick4/krkvol                 49154     0          Y       26803
Brick 10.70.37.101:/rhs/brick4/krkvol                 49154     0          Y       26889
Brick 10.70.37.69:/rhs/brick4/krkvol                  49154     0          Y       26791
Brick 10.70.37.60:/rhs/brick4/krkvol                  49154     0          Y       26892
Brick 10.70.37.120:/rhs/brick4/krkvol                 49154     0          Y       26701
NFS Server on localhost                               2049      0          Y       8818
Self-heal Daemon on localhost                         N/A       N/A        Y       8826
Quota Daemon on localhost                             N/A       N/A        Y       8834
NFS Server on 10.70.37.69                             2049      0          Y       8747
Self-heal Daemon on 10.70.37.69                       N/A       N/A        Y       8755
Quota Daemon on 10.70.37.69                           N/A       N/A        Y       8763
NFS Server on 10.70.37.60                             2049      0          Y       10663
Self-heal Daemon on 10.70.37.60                       N/A       N/A        Y       10671
Quota Daemon on 10.70.37.60                           N/A       N/A        Y       10679
NFS Server on 10.70.37.195                            2049      0          Y       2218
Self-heal Daemon on 10.70.37.195                      N/A       N/A        Y       2226
Quota Daemon on 10.70.37.195                          N/A       N/A        Y       2234
NFS Server on 10.70.37.101                            2049      0          Y       10837
Self-heal Daemon on 10.70.37.101                      N/A       N/A        Y       10845
Quota Daemon on 10.70.37.101                          N/A       N/A        Y       10853
NFS Server on dhcp37-202.lab.eng.blr.redhat.com       2049      0          Y       2640
Self-heal Daemon on dhcp37-202.lab.eng.blr.redhat.com N/A       N/A        Y       2648
Quota Daemon on dhcp37-202.lab.eng.blr.redhat.com     N/A       N/A        Y       2656
NFS Server on 10.70.35.232                            2049      0          Y       8490
Self-heal Daemon on 10.70.35.232                      N/A       N/A        Y       8498
Quota Daemon on 10.70.35.232                          N/A       N/A        Y       8506
NFS Server on 10.70.35.173                            2049      0          Y       9011
Self-heal Daemon on 10.70.35.173                      N/A       N/A        Y       9019
Quota Daemon on 10.70.35.173                          N/A       N/A        Y       9027
NFS Server on 10.70.35.222                            2049      0          Y       31731
Self-heal Daemon on 10.70.35.222                      N/A       N/A        Y       31739
Quota Daemon on 10.70.35.222                          N/A       N/A        Y       31747
NFS Server on 10.70.35.89                             2049      0          Y       30463
Self-heal Daemon on 10.70.35.89                       N/A       N/A        Y       30471
Quota Daemon on 10.70.35.89                           N/A       N/A        Y       30479
NFS Server on 10.70.35.231                            2049      0          Y       8727
Self-heal Daemon on 10.70.35.231                      N/A       N/A        Y       8735
Quota Daemon on 10.70.35.231                          N/A       N/A        Y       8743
NFS Server on 10.70.35.108                            2049      0          Y       22694
Self-heal Daemon on 10.70.35.108                      N/A       N/A        Y       22702
Quota Daemon on 10.70.35.108                          N/A       N/A        Y       22710
NFS Server on 10.70.35.155                            2049      0          Y       31498
Self-heal Daemon on 10.70.35.155                      N/A       N/A        Y       31506
Quota Daemon on 10.70.35.155                          N/A       N/A        Y       31514
NFS Server on 10.70.35.176                            2049      0          Y       8914
Self-heal Daemon on 10.70.35.176                      N/A       N/A        Y       8930
Quota Daemon on 10.70.35.176                          N/A       N/A        Y       8938
NFS Server on 10.70.35.44                             2049      0          Y       30350
Self-heal Daemon on 10.70.35.44                       N/A       N/A        Y       30358
Quota Daemon on 10.70.35.44                           N/A       N/A        Y       30366
NFS Server on 10.70.35.163                            2049      0          Y       9413
Self-heal Daemon on 10.70.35.163                      N/A       N/A        Y       9421
Quota Daemon on 10.70.35.163                          N/A       N/A        Y       9429

Task Status of Volume krkvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 7d4cc30c-c603-4375-8e49-4957bd40bd2c
Status               : in progress
sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1302208/
Hi Karthick,

Is it possible to get statedumps of clients?

regards,
Raghavendra.
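In case it helps, a minimal sketch of one way to capture a client (fuse mount) statedump, by sending SIGUSR1 to the glusterfs client process; the dump directory shown is the usual default and may differ on these clients:

    # on each client with a fuse mount of krkvol
    CLIENT_PID=$(pgrep -f 'glusterfs.*krkvol' | head -n1)

    # SIGUSR1 asks the gluster process to write out its state
    kill -USR1 "$CLIENT_PID"

    # dumps are normally written under /var/run/gluster as glusterdump.<pid>.dump.*
    ls -l /var/run/gluster/glusterdump."$CLIENT_PID".dump.* 2>/dev/null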
Based on discussion: this is a known issue with LVM (the pool getting full) and not something gluster can work around.
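For reference, the call trace in the description (glusterfsd stuck in xlog_grant_head_wait/xfs_log_reserve) is consistent with the brick's underlying storage stalling writes when the pool fills up. A quick way to check LVM thin pool usage on the affected brick node would be something like the sketch below (volume group and LV names will depend on the brick's actual layout):

    # show data and metadata usage of thin pools on the brick node
    lvs -o vg_name,lv_name,lv_attr,data_percent,metadata_percent

    # free space in the pool's volume group, in case the pool can be extended
    vgs -o vg_name,vg_size,vg_free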
Closing based on comment 4 and discussion with QE.