Bug 852532 - tgtd becomes unresponsive under very heavy load
Status: CLOSED DUPLICATE of bug 1186090
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: scsi-target-utils
Version: 6.3
Hardware: x86_64 Linux
Priority: high   Severity: high
Target Milestone: rc
Assigned To: Andy Grover
QA Contact: Bruno Goncalves
Duplicates: 1136405
Blocks: 994246
 
Reported: 2012-08-28 16:07 EDT by Thomas Cameron
Modified: 2015-11-05 11:33 EST
CC: 6 users

Doc Type: Bug Fix
Last Closed: 2015-11-02 12:07:53 EST
Type: Bug


Attachments
sosreport immediately after rebooting the target (4.62 MB, application/x-xz)
2012-08-28 16:11 EDT, Thomas Cameron
sosreport of the affected hypervisor/initiator (3.66 MB, application/x-xz)
2012-08-28 16:16 EDT, Thomas Cameron
strace I captured after tgtd had become unresponsive. (1.26 MB, application/octet-stream)
2012-09-12 11:35 EDT, Thomas Cameron

Description Thomas Cameron 2012-08-28 16:07:56 EDT
Description of problem:
I have a three-node cluster using the RHEL High Availability and Resilient Storage add-ons.

Target machine:

The three cluster nodes connect to a RHEL 6.3 machine that acts as an iSCSI target. The target has six 1 TB drives in a RAID 6 array and exports logical volumes via targets.conf:

default-driver iscsi

<target iqn.2011-06.net.l2aviation:filer.lun_exchange>
  backing-store /dev/vg_storage/lv_lun_exchange
</target>

<target iqn.2011-06.net.l2aviation:filer.lun_srv3>
  backing-store /dev/vg_storage/lv_lun_srv3
</target>

<target iqn.2011-06.net.l2aviation:filer.lun_srv2>
  backing-store /dev/vg_storage/lv_lun_srv2
</target>

<target iqn.2011-06.net.l2aviation:filer.lun_portal>
  backing-store /dev/vg_storage/lv_lun_portal
</target>

<target iqn.2011-06.net.l2aviation:filer.lun_spamcatcher>
  backing-store /dev/vg_storage/lv_lun_spamcatcher
</target>

<target iqn.2011-06.net.l2aviation:filer.lun_xch>
  backing-store /dev/mapper/vg_storage-lv_lun_exch
</target>
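
For reference, a configuration like this normally lives in /etc/tgt/targets.conf; a minimal sketch of how it can be re-read and inspected with the stock scsi-target-utils tooling on RHEL 6:

# Re-read /etc/tgt/targets.conf without restarting tgtd
tgt-admin --update ALL

# List every target with its LUNs, backing stores, and connected initiators
tgtadm --lld iscsi --mode target --op show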

Initiator machines:

I am using multipath on the initiators. The multipath.conf looks like this:

defaults {
	find_multipaths yes
	user_friendly_names yes
}

blacklist {
}
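
For reference, the resulting multipath topology on each initiator can be checked with the standard device-mapper-multipath commands (a minimal sketch):

# Re-read /etc/multipath.conf and rebuild the maps
service multipathd reload

# Show each mpath device and the state of its paths
multipath -ll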

The iscsid.conf looks like this:

iscsid.startup = /etc/rc.d/init.d/iscsid force-start
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
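
With noop_out_interval/noop_out_timeout at 5 seconds and replacement_timeout at 15, a target that stops answering is torn down quickly on the initiator side, which is consistent with the abort/conn_close messages below. A sketch of checking or raising those values on a node record with iscsiadm (the portal address is a placeholder):

# Show the node record, including the effective timeouts
iscsiadm -m node -T iqn.2011-06.net.l2aviation:filer.lun_exchange -p <portal-ip>

# Example: raise the replacement timeout for that record (applies at the next login)
iscsiadm -m node -T iqn.2011-06.net.l2aviation:filer.lun_exchange -p <portal-ip> \
    --op update -n node.session.timeo.replacement_timeout -v 120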

Problem:

The clustered resources on the initiators are Microsoft Windows virtual machines, one of which is a Microsoft Exchange server. Recently, under very heavy load, the Exchange server has been generating a great deal of I/O. The tgtd process goes into a D state and starts throwing messages:

Aug 16 17:37:01 filer tgtd: abort_task_set(1139) found 67 0
Aug 16 17:37:01 filer tgtd: abort_cmd(1115) found 67 6
Aug 16 17:37:03 filer tgtd: abort_task_set(1139) found 2000005e 0
Aug 16 17:37:03 filer tgtd: abort_cmd(1115) found 2000005e 6
Aug 16 17:37:17 filer tgtd: conn_close(101) connection closed, 0x25e5a08 3
Aug 16 17:37:17 filer tgtd: conn_close(107) sesson 0x25f76f0 1
Aug 16 17:37:19 filer tgtd: conn_close(101) connection closed, 0x26bd2f8 3
Aug 16 17:37:19 filer tgtd: conn_close(107) sesson 0x26bd5c0 1

Aug 16 17:37:24 filer kernel: INFO: task tgtd:2286 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000005 0 2286 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2302 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000000 0 2302 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2304 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000000 0 2304 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2345 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000001 0 2345 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2349 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000001 0 2349 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2350 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000001 0 2350 1 0x00000080
Aug 16 17:37:24 filer kernel: INFO: task tgtd:2352 blocked for more than 120 seconds.
Aug 16 17:37:24 filer kernel: tgtd D 0000000000000002 0 2352 1 0x00000080

Aug 16 17:37:24 filer tgtd: abort_task_set(1139) found 10000020 0
Aug 16 17:37:24 filer tgtd: abort_cmd(1115) found 10000020 6
Aug 16 17:37:39 filer tgtd: abort_task_set(1139) found 1000003d 0
Aug 16 17:37:39 filer tgtd: abort_cmd(1115) found 1000003d 6
Aug 16 17:37:39 filer tgtd: conn_close(101) connection closed, 0x26ba9a8 4
Aug 16 17:37:39 filer tgtd: conn_close(107) sesson 0x26bac70 1
Aug 16 17:37:54 filer tgtd: conn_close(101) connection closed, 0x26a9898 3
Aug 16 17:37:54 filer tgtd: conn_close(107) sesson 0x26bae20 1

This in dmesg:

INFO: task tgtd:2850 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
tgtd          D 0000000000000000     0  2850      1 0x00000080
 ffff880217ad9c48 0000000000000086 ffff880217ad9be8 ffffffff8125721d
 ffff880217ad9bb8 ffffffff81012bd9 ffff880217ad9bf8 0000000000000286
 ffff880216dd9af8 ffff880217ad9fd8 000000000000fb88 ffff880216dd9af8
Call Trace:
 [<ffffffff8125721d>] ? submit_bio+0x8d/0x120
 [<ffffffff81012bd9>] ? read_tsc+0x9/0x20
 [<ffffffff8109cd49>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811141f0>] ? sync_page+0x0/0x50
 [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
 [<ffffffff8111422d>] sync_page+0x3d/0x50
 [<ffffffff814fe95a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811141c7>] __lock_page+0x67/0x70
 [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
 [<ffffffff81114307>] ? unlock_page+0x27/0x30
 [<ffffffff811299b2>] write_cache_pages+0x392/0x4a0
 [<ffffffff811141f0>] ? sync_page+0x0/0x50
 [<ffffffff81128440>] ? __writepage+0x0/0x40
 [<ffffffff81129ae4>] generic_writepages+0x24/0x30
 [<ffffffff81129b11>] do_writepages+0x21/0x40
 [<ffffffff81114a2b>] __filemap_fdatawrite_range+0x5b/0x60
 [<ffffffff811a9ef9>] do_sync_mapping_range+0xa9/0xb0
 [<ffffffff811a9ffb>] sys_sync_file_range+0xfb/0x110
 [<ffffffff810a6e71>] ? sys_futex+0x1/0x170
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task flush-253:5:4330 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-253:5   D 0000000000000000     0  4330      2 0x00000080
 ffff8801de8215d0 0000000000000046 0000000000000000 ffffffff810533f3
 ffff8801de821570 ffff8802175adfc0 ffff880215085a28 ffff8802158e0da8
 ffff8801f89f5af8 ffff8801de821fd8 000000000000fb88 ffff8801f89f5af8
Call Trace:
 [<ffffffff810533f3>] ? __wake_up+0x53/0x70
 [<ffffffffa02050e5>] get_active_stripe+0x235/0x810 [raid456]
 [<ffffffff81116a93>] ? mempool_alloc+0x63/0x140
 [<ffffffff81060250>] ? default_wake_function+0x0/0x20
 [<ffffffff810923be>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0209435>] make_request+0x1b5/0x760 [raid456]
 [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff813f6395>] md_make_request+0xd5/0x210
 [<ffffffff810538b6>] ? enqueue_task+0x66/0x80
 [<ffffffff81256ebe>] generic_make_request+0x25e/0x530
 [<ffffffff81116985>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff8125721d>] submit_bio+0x8d/0x120
 [<ffffffff811ace46>] submit_bh+0xf6/0x150
 [<ffffffff811aebb0>] __block_write_full_page+0x1e0/0x3b0
 [<ffffffff811ae4f0>] ? end_buffer_async_write+0x0/0x190
 [<ffffffff811b3470>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811b3470>] ? blkdev_get_block+0x0/0x70
 [<ffffffff811af7f0>] block_write_full_page_endio+0xe0/0x120
 [<ffffffff81113d00>] ? find_get_pages_tag+0x40/0x120
 [<ffffffff811af845>] block_write_full_page+0x15/0x20
 [<ffffffff811b4458>] blkdev_writepage+0x18/0x20
 [<ffffffff81128457>] __writepage+0x17/0x40
 [<ffffffff811297e9>] write_cache_pages+0x1c9/0x4a0
 [<ffffffff81054c30>] ? __dequeue_entity+0x30/0x50
 [<ffffffff81128440>] ? __writepage+0x0/0x40
 [<ffffffff81129ae4>] generic_writepages+0x24/0x30
 [<ffffffff81129b11>] do_writepages+0x21/0x40
 [<ffffffff811a513d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811a554e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a56ab>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a5a4b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814fd960>] ? thread_return+0x4e/0x76e
 [<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a5d39>] wb_do_writeback+0x199/0x240
 [<ffffffff811a5e43>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811387f6>] bdi_start_fn+0x86/0x100
 [<ffffffff81138770>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81091d66>] kthread+0x96/0xa0
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffff81091cd0>] ? kthread+0x0/0xa0
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

When this happened recently, the Exchange database became corrupted. When I went to repair the database, the VM ran out of memory and started swapping. Almost immediately, the I/O load went so high that tgtd locked up again. Essentially, this means I cannot repair the Exchange database: every time it happens, the iSCSI target dies and the database winds up more and more corrupted.

Version-Release number of selected component (if applicable):
Seen with scsi-target-utils-1.0.14-4.el6.x86_64 and also with 1.0.24-2.el6

I'm attaching sosreports from the three hypervisors (initiators) and the target.
Comment 2 Thomas Cameron 2012-08-28 16:11:04 EDT
Created attachment 607689 [details]
sosreport immediately after rebooting the target

This is the sosreport I got after rebooting the target (named "filer").
Comment 3 Thomas Cameron 2012-08-28 16:16:12 EDT
Created attachment 607690 [details]
sosreport of the affected hypervisor/initiator

This is an sosreport taken after rebooting the initiator.
Comment 4 Thomas Cameron 2012-09-12 11:35:31 EDT
Created attachment 612156 [details]
strace I captured after tgtd had become unresponsive.

When tgtd spun out of control on September 9th, I ran strace against it for a few minutes before I rebooted the server. This is the dump of that strace.
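
For reference, a capture like this can be taken by attaching strace to the running daemon; a rough sketch (not necessarily the exact invocation used here, and <tgtd-pid> is a placeholder):

# Follow child threads, timestamp each syscall, and write the trace to a file
strace -f -tt -p <tgtd-pid> -o /tmp/tgtd.strace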
Comment 5 RHEL Product and Program Management 2013-10-14 00:49:24 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 6 Jon 2014-02-17 16:22:52 EST
Another instance was reported when running a synthetic DB load test:

Feb 12 19:26:50 <hostname> tgtd: abort_task_set(1139) found 3000000b 0
Feb 12 19:26:50 <hostname> tgtd: abort_cmd(1115) found 3000000b 6
Feb 12 19:27:06 <hostname> tgtd: conn_close(101) connection closed, 0x175dd98 3
Feb 12 19:27:06 <hostname> tgtd: conn_close(107) sesson 0x175e090 1
Feb 12 19:29:12 <hostname> tgtd: abort_task_set(1139) found 4000006e 0
Feb 12 19:29:12 <hostname> tgtd: abort_cmd(1115) found 4000006e 6
Feb 12 19:29:28 <hostname> tgtd: conn_close(101) connection closed, 0x1762da8 3
Feb 12 19:29:28 <hostname> tgtd: conn_close(107) sesson 0x17670c0 1
Feb 12 19:31:35 <hostname> tgtd: abort_task_set(1139) found 50000065 0
Feb 12 19:31:35 <hostname> tgtd: abort_cmd(1115) found 50000065 6
Feb 12 19:31:51 <hostname> tgtd: conn_close(101) connection closed, 0x1767db8 3
Feb 12 19:31:51 <hostname> tgtd: conn_close(107) sesson 0x176c0f0 1
Feb 12 19:33:57 <hostname> tgtd: abort_task_set(1139) found 60000031 0
Feb 12 19:33:57 <hostname> tgtd: abort_cmd(1115) found 60000031 6
Feb 12 19:34:13 <hostname> tgtd: conn_close(101) connection closed, 0x176cde8 3
Feb 12 19:34:13 <hostname> tgtd: conn_close(107) sesson 0x1771120 1
Feb 12 19:34:27 <hostname> tgtd: tgt_event_modify(224) Cannot find event 16
Feb 12 19:34:27 <hostname> tgtd: iscsi_event_modify(413) tgt_event_modify failed
Feb 12 19:34:27 <hostname> tgtd: tgt_event_modify(224) Cannot find event 18
Feb 12 19:34:27 <hostname> tgtd: iscsi_event_modify(413) tgt_event_modify failed
Feb 12 19:34:27 <hostname> tgtd: tgt_event_modify(224) Cannot find event 17
Feb 12 19:34:27 <hostname> tgtd: iscsi_event_modify(413) tgt_event_modify failed
Feb 12 19:34:27 <hostname> tgtd: tgt_event_modify(224) Cannot find event 15
Feb 12 19:34:27 <hostname> tgtd: iscsi_event_modify(413) tgt_event_modify failed

And D-state processes for the flush to disk 253:3 (note: this is not the complete output, but the flush thread came out of D state at 19:34:26, and immediately after that the tgtd daemon went into tgt_event_modify; a rough sketch of the sampling loop follows the output below):

Wed Feb 12 19:31:36 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:31:46 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:31:56 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:32:06 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:32:16 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:32:26 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
Wed Feb 12 19:32:36 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:51 [flush-253:3]
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:32:56 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:33:06 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:33:16 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:33:26 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:33:36 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:33:56 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:34:06 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:34:16 CST 2014
root      7100  0.0  0.0      0     0 ?        D    Feb11   0:52 [flush-253:3]
Wed Feb 12 19:34:26 CST 2014
Wed Feb 12 19:34:36 CST 2014
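
For reference, output like the above can be produced with a simple polling loop; a rough sketch, assuming a 10-second sampling interval:

# Print a timestamp, then any D-state flush-253:3 entries, every 10 seconds
while true; do
    date
    ps auxww | awk '$8 ~ /^D/ && /flush-253:3/'
    sleep 10
done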
Comment 8 mdidomenico 2015-02-06 10:45:59 EST
I also see this behaviour in RHEL 6.6 x86_64, with the latest patch updates.

I have 200 targets defined, with two LUNs each. Under load I get the same messages popping out on the client and server; however, on the clients I start to see slowdowns in commands that touch the filesystem.
Comment 11 Andy Grover 2015-11-02 12:06:30 EST
*** Bug 1136405 has been marked as a duplicate of this bug. ***
Comment 12 Andy Grover 2015-11-02 12:07:53 EST

*** This bug has been marked as a duplicate of bug 1186090 ***
Comment 13 Tom Coughlan 2015-11-05 11:33:39 EST
Removing from the 68FileAndStorageRPL.
