Bug 1500757 - [iSCSI]: conn errors and IO errors seen when deep-scrub is in progress
Summary: [iSCSI]: conn errors and IO errors seen when deep-scrub is in progress
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: iSCSI
Version: 3.0
Hardware: Unspecified
OS: Linux
high
urgent
Target Milestone: rc
: 4.0
Assignee: Mike Christie
QA Contact: Tejas
Erin Donnelly
URL:
Whiteboard:
Keywords:
Depends On: 1480434
Blocks: 1494421
TreeView+ depends on / blocked
 
Reported: 2017-10-11 12:16 UTC by Tejas
Modified: 2019-02-27 20:42 UTC (History)
8 users (show)

(edit)
.An iSCSI initiator can send more than `max_data_area_mb` worth of data when a Ceph cluster is under heavy load causing a temporary performance drop

When a Ceph cluster is under heavy load, an iSCSI initiator might send more data than specified by the `max_data_area_mb` parameter. Once the `max_data_area_mb` limit has been reached, the `target_core_user` module returns queue full statuses for commands. The initiators might not fairly retry these commands and they can hit initiator side time outs and be failed in the multipath layer. The multipath layer will retry the commands on another path while other commands are still being executed on the original path. This causes a temporary performance drop, and in some extreme cases in Linux environment the `multipathd` daemon can terminate unexpectedly.

If the `multipathd` daemon crashes, restart it manually:

----
# systemctl restart multipathd
----
Clone Of:
(edit)
Last Closed: 2019-02-27 20:42:02 UTC


Attachments (Terms of Use)
disable qfull waiting (1.13 KB, patch)
2017-10-22 16:55 UTC, Mike Christie
no flags Details | Diff
kernel: make qfull timeout settable (4.48 KB, application/mbox)
2017-10-25 17:51 UTC, Mike Christie
no flags Details

Description Tejas 2017-10-11 12:16:46 UTC
Description of problem:

Scenario:
- Running IO from all initiators RHEL7.4 win2k16 ESX(rhel+win2k16)
- Start deep-scrub on 4 OSD out of 12 (1 OSD per node)
- Seeing IO errors on initiators
Saw a performance drop on the ceph cluster to less than half of the previous value.

Version-Release number of selected component (if applicable):
ceph version 12.2.1-10.el7cp
3.10.0-714.el7.test.x86_64


Additional info:
 on the GWs:
ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 564139
[ 6152.292228] perf: interrupt took too long (3160 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[11682.556903] perf: interrupt took too long (4016 > 3950), lowering kernel.perf_event_max_sample_rate to 49000
[80146.198268] ABORT_TASK: Found referenced iSCSI task_tag: 17790216
[80149.045982] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 17790216
[85396.220521] ABORT_TASK: Found referenced iSCSI task_tag: 18357204
[85396.220567] ABORT_TASK: Found referenced iSCSI task_tag: 18357202

[ 4297.532362] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 6570.493086] perf: interrupt took too long (3226 > 3130), lowering kernel.perf_event_max_sample_rate to 62000
[19474.651473] perf: interrupt took too long (4118 > 4032), lowering kernel.perf_event_max_sample_rate to 48000
[75013.028182] ABORT_TASK: Found referenced iSCSI task_tag: 9482429
[75015.875647] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 9482429
[75121.793415] tcmu: command timed out waiting for ring space.
[80136.917632] tcmu: command timed out waiting for ring space.
[84077.594239] tcmu: command timed out waiting for ring space.

Windows initiator had IOmeter running:
Seeing a huge error count on IOmeter.

RHEL initiator had a dd running, it went to process state 'D':
root     30737 12.5  0.0 108984  1404 pts/0    D    12:42  23:06 dd if=/dev/urandom of=/mnt/59/temp1 bs=1M count=2900000

Seeing conn errors, looks like that led to failovers:
99252.347515]  connection3:0: detected conn error (1020)
[99265.686408]  connection3:0: detected conn error (1020)
[99280.310155]  connection3:0: detected conn error (1020)
[99292.423208]  connection3:0: detected conn error (1020)
[99292.423331] sd 13:0:0:11: [sday] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[99292.423337] sd 13:0:0:11: [sday] CDB: Test Unit Ready 00 00 00 00 00 00
[99292.423380] sd 13:0:0:40: [sdfl] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[99292.423383] sd 13:0:0:40: [sdfl] CDB: Write(16) 8a 00 00 00 00 00 67 ad e0 c0 00 00 20 00 00 00
[99292.423386] blk_update_request: 78 callbacks suppressed
[99292.423389] blk_update_request: I/O error, dev sdfl, sector 1739448512
[99292.423686] device-mapper: multipath: Failing path 130:112.
[99292.425299] sd 11:0:0:40: alua: port group 01 state N non-preferred supports TolUsNA
[99327.855396]  connection1:0: detected conn error (1020)
[99327.855531] sd 11:0:0:2: [sdo] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[99327.855534] sd 11:0:0:2: [sdo] CDB: Test Unit Ready 00 00 00 00 00 00
[99341.237092]  connection1:0: detected conn error (1020)
[99341.241244] device-mapper: multipath: Reinstating path 130:112.
[99355.867906]  connection1:0: detected conn error (1020)
[99426.986029]  connection1:0: detected conn error (1020)
[99471.375500]  connection1:0: detected conn error (1020)
[99484.445433]  connection1:0: detected conn error (1020)
[99512.244108]  connection1:0: detected conn error (1020)
[99524.423663]  connection1:0: detected conn error (1020)
[99536.871730]  connection1:0: detected conn error (1020)
[99548.981327]  connection1:0: detected conn error (1020)
[99563.026404]  connection1:0: detected conn error (1020)
[99563.026529] sd 11:0:0:36: [sdeu] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[99563.026533] sd 11:0:0:36: [sdeu] CDB: Test Unit Ready 00 00 00 00 00 00
.
.
.
102600.714002]  connection3:0: detected conn error (1020)
[102615.093764]  connection3:0: detected conn error (1020)
[102615.093911] sd 13:0:0:40: [sdfl] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[102615.093916] sd 13:0:0:40: [sdfl] CDB: Write(16) 8a 00 00 00 00 00 6d 5b d8 c0 00 00 20 00 00 00
[102615.093918] blk_update_request: I/O error, dev sdfl, sector 1834735808
[102615.094149] sd 13:0:0:40: [sdfl] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[102615.094150] sd 13:0:0:40: [sdfl] CDB: Write(16) 8a 00 00 00 00 00 6d 5b f8 c0 00 00 20 00 00 00
[102615.094151] blk_update_request: I/O error, dev sdfl, sector 1834744000
[102615.094281] sd 13:0:0:40: [sdfl] FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[102615.094282] sd 13:0:0:40: [sdfl] CDB: Write(16) 8a 00 00 00 00 00 6d 5a d8 c0 00 00 20 00 00 00
[102615.094283] blk_update_request: I/O error, dev sdfl, sector 1834670272

Comment 3 Jason Dillaman 2017-10-11 13:21:47 UTC
@Tejas: did Windows report any (IO) errors when this was occurring? Given the size and performance of your cluster, I wouldn't be surprised that things were timing out on the iSCSI side when the OSDs were being thrashed.

Comment 4 Tejas 2017-10-11 13:24:57 UTC
Yes I could see a few thousand IO errors on IOmeter.
I need to check the event log in more detail.

ESX reported quite a few errors in vmkernel.log

Comment 5 Jason Dillaman 2017-10-11 15:52:50 UTC
@Tejas: couple quick things:

1) It looks like ceph-ansible didn't configure any librbd logging so I added the following those hosts' ceph.conf file:

[client]
	log file = /var/log/ceph/$name.$pid.log
        admin socket = /var/run/ceph/$name.$pid.asok

I think that we should open a BZ against ceph-ansible to ensure this is properly configured for gateway roles.

2) Your multipath.conf on buckeye was incorrect. It should be the following:

devices {
        device {
                vendor                 "LIO-ORG"
                hardware_handler       "1 alua"
                path_grouping_policy   "failover"
                path_selector          "queue-length 0"
                failback               60
                path_checker           tur
                prio                   alua
                prio_args              exclusive_pref_bit
                fast_io_fail_tmo       25
                no_path_retry          queue
        }
}

Comment 6 Mike Christie 2017-10-11 17:22:59 UTC
Hey Tejas,

I just wanted to confirm the initial failure time of the test for the bz. The messages in the initial bz description look like they start around Oct 11 13:*.

However, earlier at Oct 10 18:16* there is an error which should never happen where a memory allocation failed.  If they are separate issues, do you by any chance know what test was run at around 18:16 on Oct 10th?

Also, are you using dmesg to dump the log messages you pasted here? If so it might be easier in the future to use "dmesg -T" because you get the human readable output and if the system is rebooted/crashes we can later figure out when the message occurred.

Comment 9 Mike Christie 2017-10-18 05:26:41 UTC
Tejas,

It looks like the linux initiator errors are caused by a lio/tcmu bug. I will have a new with a fix for this by tomorrow for you.

Comment 10 Mike Christie 2017-10-22 16:51:38 UTC
For the linux initiator connection errors, the qfull changes will fix the problem.

I have not figured out the reason for the kmalloc failures yet.

Here is the bz for the kernel changes

https://bugzilla.redhat.com/show_bug.cgi?id=1480434

Comment 11 Mike Christie 2017-10-22 16:55 UTC
Created attachment 1341832 [details]
disable qfull waiting

This disables the qfull timeout for downstream. The patch is downstream only.

For upstream, we have a internal queue but in RHEL based kernels the iscsi target recvs and tcmu waits for ring space from the same context. This leads to lun access blocking on one bad device and/or iscsi requests like pings timing out like we see in the log in this bz.

Comment 12 Mike Christie 2017-10-22 17:22:30 UTC
Another patch for ceph-iscsi-config that will be needed is:

https://github.com/ceph/ceph-iscsi-config/pull/31

The patch in comment #11 will prevent initiator side timeouts when the ring is temporarily full.

The patch in this comment will prevent us from hitting qfull retries too often for smaller/medium sized workloads or at least allow the user to control how much memory they want to devote to preventing qfull retries.

Comment 19 Mike Christie 2017-10-25 17:51 UTC
Created attachment 1343359 [details]
kernel: make qfull timeout settable

Yeah, here is the patch I am going to post to rh-kernel. It works like the latest upstream posting where 0 means do not wait.


Note You need to log in before you can comment on or make changes to this bug.