Bug 1571129 - VMware reports "Access Lost" in the monitor tab even though the Ceph cluster side shows no "slow request" messages
Summary: VMware reports "Access Lost" in the monitor tab even though the Ceph cluster side shows no "slow request" messages
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: iSCSI
Version: 3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 3.*
Assignee: Mike Christie
QA Contact: Tejas
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-24 07:21 UTC by liuwei
Modified: 2019-02-16 22:03 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-16 22:03:31 UTC
Target Upstream Version:


Attachments
Systemtap script based on kernel-3.10.0-862.el7 source (939 bytes, text/plain)
2018-05-01 00:55 UTC, Brad Hubbard

Description liuwei 2018-04-24 07:21:25 UTC
Description of problem:

Even though there are no "slow request" messages on the Ceph cluster side, the VMware side still shows "Access Lost" entries in the monitor tab.

We are testing with a single path and a single iSCSI gateway node.

 
For example:

In the VMware vmkernel.log, we found the following messages:



  2018-04-22T22:26:09.748Z cpu5:66599)NMP: nmp_ThrottleLogForDevice:3617: Cmd 0x28 (0x439541441500, 0) to dev "naa.600140551aee71664944242a4503557d" on path "vmhba64:C1:T0:L32" Failed: H:0x0 D:0x28 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:NONE
  2018-04-22T22:26:09.748Z cpu5:66599)ScsiDeviceIO: 2927: Cmd(0x439541441500) 0x28, CmdSN 0x5631 from world 0 to dev "naa.600140551aee71664944242a4503557d" failed H:0x0 D:0x28 P:0x0 Invalid sense data: 0x76 0x6d 0x68.
  2018-04-22T22:26:09.772Z cpu0:65662)HBX: 283: 'CEPH-LUN32': HB at offset 3276800 - Reclaimed heartbeat [Timeout]:
  2018-04-22T22:26:09.772Z cpu0:65662)  [HB state abcdef02 offset 3276800 gen 225 stampUS 38117571952 uuid 5adc77ba-8cc0f028-bcdd-2c55d3d66700 jrnl <FB 41943053> drv 24.81 lockImpl 4 ip 10.2.226.137]
  2018-04-22T22:26:09.777Z cpu5:66599)ScsiDeviceIO: 2927: Cmd(0x439d46e1c740) 0x28, CmdSN 0xf3 from world 70149 to dev "naa.600140551aee71664944242a4503557d" failed H:0x0 D:0x28 P:0x0 Invalid sense data: 0x0 0x0 0x0.

In the gateway's messages log, I found many ABORT_TASK entries. For example:

Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399889
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399889
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399893
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399893
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399894
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399894
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399895
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399895
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399896
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399896
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399897
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399897
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399898
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399898
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399899
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399899
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399900
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399900
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399909
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399909
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399910
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399910
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399911
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399911
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399912
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399912
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399913
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399913
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399914
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399914
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399915
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399915
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399916
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399916
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399925



Version-Release number of selected component (if applicable):


How reproducible:

100% reproducible


Steps to Reproduce:
1.
2.
3.

Actual results:

The path loses access for an unknown reason.

Expected results:

Identify why the path loses access.

Additional info:

[collab-shell] https://gitlab.cee.redhat.com/gss-tools/collab-shell

    # ssh your_kerb@collab-shell.usersys.redhat.com
    # cd /cases/02083677

Comment 5 Mike Christie 2018-04-24 15:33:25 UTC
Did you only look for the "slow request" message in the logs? Did you get any of the other debugging info we talked about? I did not see any perf dump type of info in /cases/02083677.
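
For reference, one common way to gather this kind of perf data is via the Ceph admin sockets on the cluster nodes, for example:

    # ceph health detail
    # ceph daemon osd.0 perf dump    # "osd.0" is a placeholder for whichever daemon is of interest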

Comment 9 Mike Christie 2018-04-24 16:17:29 UTC
On a side note (and one that may actually be making the problem worse), you are getting these errors:

failed H:0x0 D:0x28

because the runner ring is too small. If you want to avoid them, on the iSCSI target systems open

/etc/ceph/iscsi-gateway.cfg

and add:

[target]
max_data_area_mb = N


Here, N should be big enough to handle all the commands the initiators are going to send; it is only 8 MB by default. N would normally be something like:


MaxIoSizeKB * min(64, ESX device queue depth) * connected initiators

You can get the IO size limit from:

esxcli system settings advanced list -o /ISCSI/MaxIoSizeKB
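
As a purely illustrative example of that sizing (the numbers below are assumptions, not values from this case): with ISCSI.MaxIoSizeKB = 512, an ESX device queue depth of 64 or more, and 2 connected initiators, the formula gives

    512 KB * min(64, 64) * 2 = 65536 KB = 64 MB, i.e. max_data_area_mb = 64

which is well above the 8 MB default.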


This assumes that the Ceph cluster can handle that many IOs. So if, as we were thinking in that email thread, we might be over-queueing, then increasing this limit might cause more problems.

Just an FYI, for when we get that fixed.

Comment 10 Jiaju Zhang 2018-04-24 16:24:05 UTC
Just a quick question, how can we get the value of esx device queue depth?

Comment 11 Jiaju Zhang 2018-04-24 16:33:17 UTC
(In reply to Jiaju Zhang from comment #10)
> Just a quick question, how can we get the value of esx device queue depth?

Ignore that, I found it.
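
For reference (and not necessarily the method used here), the per-device maximum queue depth can usually be read on the ESXi host with:

    # esxcli storage core device list

which reports a "Device Max Queue Depth" value for each device.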

Comment 12 Mike Christie 2018-04-24 16:54:51 UTC
I am not going to change the component yet, but I think this might be a kernel bug. It is just too suspicious that, a lot of the time, right when we get an ABORT the command completes within the same second, so we are then able to complete the ABORT.

Note that there do seem to be other errors that probably indicate slowdowns in the cluster, like when we see:

tcmu_rbd_handle_timedout_cmd

or when we see the ABORT timeout like here:

WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba64:CH:1 T:0 L:32 : Task mgmt "Abort Task" with itt=0x2423b22 (refITT=0x2423a85) timed out.

In those cases, commands look like they are stuck in other layers, such as Ceph.


Can you guys:


1. Answer the question about whether you got the perf data.

2. Confirm that the clocks on the ESX and target boxes are in sync. They seem like they might be off, because there are times when the ESX log is noisy due to aborts but the target logs are not showing anything.

3. What is the value of /VMFS3/UseATSForHBOnVMFS5 on each machine? (Example query commands for items 3 and 4 are shown after this list.)


4. What is the value of scsi.LogAborts?

5. Did you increase the vmkernel.log size or are you guys cutting it off now?
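
For reference, questions 3 and 4 can usually be checked on the ESXi host with something like the following; the exact option paths here are inferred from the setting names above, so treat them as assumptions:

    # esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
    # esxcli system settings advanced list -o /Scsi/LogAborts    # path inferred from the "scsi.LogAborts" name above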

Comment 13 Mike Christie 2018-04-24 17:19:49 UTC
(In reply to Mike Christie from comment #9)
> /etc/ceph/iscsi-gateway.cfg
> 
> and add:
> 
> [target]
> max_data_area_mb = N
> 

Oh yeah, to make this take hold you need to do:

systemctl stop rbd-target-gw
systemctl start rbd-target-gw

Comment 14 Jason Dillaman 2018-04-24 17:23:21 UTC
(In reply to Mike Christie from comment #13)
> (In reply to Mike Christie from comment #9)
> > /etc/ceph/iscsi-gateway.cfg
> > 
> > and add:
> > 
> > [target]
> > max_data_area_mb = N
> > 
> 
> Oh yeah, to make this take hold you need to do:
> 
> systemctl stop rbd-target-gw
> systemctl start rbd-target-gw

Correct me if I'm wrong here, but I don't think ceph-iscsi-config supports the 'max_data_area_mb' override. That's included as part of BZ 1562388.

Comment 15 Mike Christie 2018-04-24 18:42:25 UTC
Upstream doesn't support it, but we added the patch, 0016-ceph-iscsi-config-add-max_data_area-setting.patch, thinking that was how we were going to do it upstream.

We later decided to make rtslib's constructor more generic so we would not have to update rtslib for every new setting, and Paul requested that we do per device settings. So we are now doing what Venky is doing in that bz.

Comment 16 Mike Christie 2018-04-24 21:53:47 UTC
liuwei or Jiaju Zhang,

Have you guys used SystemTap before? For this case, where we are seeing ABORTs complete immediately, like here:


Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 33399889
Apr 22 22:26:27 osd1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33399889


I basically just want the transport_state, state_active, cmd_kref.refcount, and scsi_status values of the se_cmd when core_tmr_abort_task calls __target_check_io_state and when it calls transport_wait_for_tasks.

I think the tcmu queue/ring is backed up because commands are taking longer than they should on the Ceph cluster. This leads to commands getting failed with TASK SET FULL and retried immediately in a sort of busy loop. An abort is eventually sent, and it completes quickly because the command is being retried so quickly. We can confirm this by dumping the se_cmd fields when that function is run.
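
For reference, one quick way to confirm that function is visible as a probe point, and to see which local variables (such as the se_cmd fields) are available there, is something like the following, assuming systemtap and the matching kernel debuginfo are installed on the gateway:

    # stap -L 'module("target_core_mod").function("core_tmr_abort_task")'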

Comment 17 Brad Hubbard 2018-05-01 00:55:27 UTC
Created attachment 1428985 [details]
Systemtap script based on kernel-3.10.0-862.el7 source

Comment 18 Brad Hubbard 2018-05-01 01:01:15 UTC
I've attached a stap script that should work based on the kernel-3.10.0-862.el7 source, but I do not have a test system and cannot justify the time or knowledge acquisition required to set one up, given that this is not a core RADOS issue and I am here by invitation only :) If you need to run this on a different kernel, you'll need to verify the line numbers for the function call sites. The script probably needs some tweaking to get working properly, and I would be happy to do the testing if someone can provide the environment. From a purely SystemTap perspective, the easiest way to set up the system to run the script is with the stap-prep command.
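
For reference, running a script like the attached one on the gateway usually looks something like the following; the script filename is just a placeholder for the attachment:

    # yum install systemtap
    # stap-prep    # pulls in the matching kernel debuginfo/devel packages
    # stap -v ./abort_task_trace.stp    # placeholder name for the attached script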

