Description of problem: Mapped 10 LUNs to a RHEL initiator. Started dd to all the 10 LUNS on a 4 GW setup. Seeing a kernel hang on one of the GWs(catalina) only. Logs below. We are seeing login failures on all the GWs: 5255.122834] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:ceph-igw [ 5255.123787] iSCSI Login negotiation failed. Version-Release number of selected component (if applicable): 3.10.0-714.el7.test.x86_64 #1 SMP Mon Sep 25 01:00:30 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux System still in the same state. Additional info: 5521.099080] INFO: task wb_throttle:13677 blocked for more than 120 seconds. [ 5521.099107] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5521.099131] wb_throttle D ffff88203a635ee0 0 13677 1 0x00000080 [ 5521.099134] ffff8820273cfc00 0000000000000086 ffff8820273cffd8 ffff8820273cffd8 [ 5521.099136] ffff8820273cffd8 0000000000016cc0 ffff880169478fd0 ffff88103e696cc0 [ 5521.099137] 0000000000000000 7fffffffffffffff ffff88107ff8f2e8 ffffffff816b0c70 [ 5521.099139] Call Trace: [ 5521.099146] [<ffffffff816b0c70>] ? bit_wait+0x50/0x50 [ 5521.099148] [<ffffffff816b29d9>] schedule+0x29/0x70 [ 5521.099149] [<ffffffff816b0659>] schedule_timeout+0x239/0x2c0 [ 5521.099153] [<ffffffff81183701>] ? find_get_pages_tag+0xe1/0x1a0 [ 5521.099156] [<ffffffff810e9c1c>] ? ktime_get_ts64+0x4c/0xf0 [ 5521.099158] [<ffffffff816b0c70>] ? bit_wait+0x50/0x50 [ 5521.099159] [<ffffffff816b21cd>] io_schedule_timeout+0xad/0x130 [ 5521.099161] [<ffffffff816b2268>] io_schedule+0x18/0x20 [ 5521.099162] [<ffffffff816b0c81>] bit_wait_io+0x11/0x50 [ 5521.099164] [<ffffffff816b07a5>] __wait_on_bit+0x65/0x90 [ 5521.099166] [<ffffffff811824c1>] wait_on_page_bit+0x81/0xa0 [ 5521.099169] [<ffffffff810b1d10>] ? wake_bit_function+0x40/0x40 [ 5521.099170] [<ffffffff811825f1>] __filemap_fdatawait_range+0x111/0x190 [ 5521.099172] [<ffffffff81182684>] filemap_fdatawait_range+0x14/0x30 [ 5521.099174] [<ffffffff81184606>] filemap_write_and_wait_range+0x56/0x90 [ 5521.099208] [<ffffffffc0367e36>] xfs_file_fsync+0x66/0x1e0 [xfs] [ 5521.099211] [<ffffffff81234c05>] do_fsync+0x65/0xa0 [ 5521.099212] [<ffffffff81234f03>] SyS_fdatasync+0x13/0x20 [ 5521.099214] [<ffffffff816be5c9>] system_call_fastpath+0x16/0x1b 5521.099216] INFO: task filestore_sync:13678 blocked for more than 120 seconds. [ 5521.099241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5521.099269] filestore_sync D ffff88203a631fa0 0 13678 1 0x00000080 [ 5521.099271] ffff882027b8fbf0 0000000000000086 ffff882027b8ffd8 ffff882027b8ffd8 [ 5521.099272] ffff882027b8ffd8 0000000000016cc0 ffff8810e9f72f70 ffff88203ded6cc0 [ 5521.099274] 0000000000000000 7fffffffffffffff ffff88107ff8f2e8 ffffffff816b0c70 [ 5521.099275] Call Trace: [ 5521.099277] [<ffffffff816b0c70>] ? bit_wait+0x50/0x50 [ 5521.099279] [<ffffffff816b29d9>] schedule+0x29/0x70 [ 5521.099281] [<ffffffff816b0659>] schedule_timeout+0x239/0x2c0 [ 5521.099282] [<ffffffff810e9c1c>] ? ktime_get_ts64+0x4c/0xf0 [ 5521.099285] [<ffffffff8113dbbf>] ? delayacct_end+0x8f/0xb0 [ 5521.099286] [<ffffffff810e9c1c>] ? ktime_get_ts64+0x4c/0xf0 [ 5521.099288] [<ffffffff816b0c70>] ? bit_wait+0x50/0x50 [ 5521.099290] [<ffffffff816b21cd>] io_schedule_timeout+0xad/0x130 [ 5521.099291] [<ffffffff816b2268>] io_schedule+0x18/0x20 [ 5521.099293] [<ffffffff816b0c81>] bit_wait_io+0x11/0x50 [ 5521.099295] [<ffffffff816b07a5>] __wait_on_bit+0x65/0x90 [ 5521.099296] [<ffffffff811824c1>] wait_on_page_bit+0x81/0xa0 [ 5521.099298] [<ffffffff810b1d10>] ? wake_bit_function+0x40/0x40 [ 5521.099300] [<ffffffff811825f1>] __filemap_fdatawait_range+0x111/0x190 [ 5521.099302] [<ffffffff81185427>] filemap_fdatawait_keep_errors+0x27/0x30 [ 5521.099304] [<ffffffff8122f46d>] sync_inodes_sb+0x16d/0x1f0 [ 5521.099305] [<ffffffff81234cd3>] sync_filesystem+0x63/0xb0 [ 5521.099307] [<ffffffff81234e84>] SyS_syncfs+0x54/0xa0 [ 5521.099308] [<ffffffff816be5c9>] system_call_fastpath+0x16/0x1b [ 5533.389313] tcmu: command timed out waiting for ring space. 5558.442398] iSCSI Login negotiation failed. [ 5563.019294] sd 0:0:3:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 5563.019310] sd 0:0:3:0: [sdc] tag#1 Sense Key : Medium Error [current] [ 5563.019316] sd 0:0:3:0: [sdc] tag#1 Add. Sense: Write error - auto reallocation failed [ 5563.019323] sd 0:0:3:0: [sdc] tag#1 CDB: Write(10) 2a 00 23 e3 cd 88 00 02 00 00 [ 5563.019328] blk_update_request: I/O error, dev sdc, sector 602131851 [ 5565.493721] sd 0:0:3:0: [sdc] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 5565.493726] sd 0:0:3:0: [sdc] tag#11 Sense Key : Medium Error [current] [ 5565.493728] sd 0:0:3:0: [sdc] tag#11 Add. Sense: Write error - auto reallocation failed [ 5565.493730] sd 0:0:3:0: [sdc] tag#11 CDB: Write(10) 2a 00 23 e3 cb 88 00 02 00 00 [ 5565.493732] blk_update_request: I/O error, dev sdc, sector 602131336 [ 5570.364858] sd 0:0:3:0: [sdc] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 5570.364871] sd 0:0:3:0: [sdc] tag#22 Sense Key : Medium Error [current] [ 5570.364876] sd 0:0:3:0: [sdc] tag#22 Add. Sense: Write error - auto reallocation failed [ 5570.364882] sd 0:0:3:0: [sdc] tag#22 CDB: Write(10) 2a 00 23 e3 c9 88 00 02 00 00 [ 5570.364886] blk_update_request: I/O error, dev sdc, sector 602131009 [ 5585.607515] sd 0:0:3:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 5585.607530] sd 0:0:3:0: [sdc] tag#0 Sense Key : Medium Error [current] [ 5585.607537] sd 0:0:3:0: [sdc] tag#0 Add. Sense: Write error - auto reallocation failed [ 5585.607543] sd 0:0:3:0: [sdc] tag#0 CDB: Write(10) 2a 00 23 e3 d3 88 00 02 00 00 [ 5585.607548] blk_update_request: I/O error, dev sdc, sector 602133690 [ 5585.607609] Buffer I/O error on dev dm-3, logical block 73951665, lost async page write [ 5585.607636] Buffer I/O error on dev dm-3, logical block 73951666, lost async page write [ 5585.607667] Buffer I/O error on dev dm-3, logical block 73951667, lost async page write [ 5585.607724] Buffer I/O error on dev dm-3, logical block 73951668, lost async page write
It looks like we are working as expected. It looks like /dev/sdc is starting to go bad. sdc is used by OSDs, so when the iscsi target tries to do IO, it will hang and timeout. This leads to the target not allowing relogins until we can recover (clean up IO, reopen the image, etc). The hung task warnings are from IOs to the bad disk and are expected.
Tejas, Do you want to keep this one open or can I close it?
Mike, We can close it, since its expected.