Bug 843763

Summary: SG_IO ioctl gets stuck in kernel
Product: Red Hat Enterprise Linux 6
Reporter: Linux engineering teams - Veritas <linux26port>
Component: kernel
Assignee: Ewan D. Milne <emilne>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 6.3
CC: linux26port, mpatocka, msnitzer, mukesh_bafna, ram_pandiri, revers, venkat.boddu, vinay.sequeira
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-08 21:03:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Linux engineering teams - Veritas 2012-07-27 09:38:44 UTC
Description of problem:

During port disable operations, an SG_IO ioctl sent to a device gets stuck in the kernel and never returns, causing a system hang.

Version-Release number of selected component (if applicable):

We have hit the same issue on the RHEL releases below:

RHEL 6.1
RHEL 6.2
RHEL 6.3

How reproducible:

The issue reproduces frequently.

Steps to Reproduce:
1. Have a thread issuing SG_IO ioctls to the devices.
2. Disable and enable ports in a loop.
3. Occasionally an SG_IO ioctl gets stuck.
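
For reference, step 1 can be approximated from userspace with the sketch below. It is only an illustration: in the reported case the ioctl is issued from kernel context by the vxdmp module, and the helper name issue_tur is hypothetical. The command is assumed to be a 6-byte TEST UNIT READY with no data transfer and a caller-supplied timeout in milliseconds.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/*
 * Hypothetical helper (not from the reporter's code): send one
 * TEST UNIT READY through SG_IO on an already opened device.
 * Returns -1 if the ioctl itself failed (errno is set),
 *          0 if the command completed with all status fields clear,
 *          1 if the command completed but reported an error status.
 */
int issue_tur(int fd, unsigned int timeout_ms)
{
    unsigned char cdb[6] = {0};    /* opcode 0x00 = TEST UNIT READY */
    unsigned char sense[32] = {0}; /* sense buffer for error details */
    sg_io_hdr_t hdr;

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';              /* required by the sg v3 interface */
    hdr.dxfer_direction = SG_DXFER_NONE; /* no data phase */
    hdr.cmd_len = sizeof(cdb);
    hdr.mx_sb_len = sizeof(sense);
    hdr.cmdp = cdb;
    hdr.sbp = sense;
    hdr.timeout = timeout_ms;            /* unit is milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0)
        return -1;

    return (hdr.status == 0 && hdr.host_status == 0 &&
            hdr.driver_status == 0) ? 0 : 1;
}
```

A test thread would call issue_tur(fd, 20000) in a loop against each path while ports are toggled. With a 20000 ms timeout the call is expected to return within roughly that time even when the path is down.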
  
We captured a crash dump when the issue was hit and can upload it if required. Any workaround suggestions to avoid the issue in the meantime are welcome.

Actual results:

SG_IO gets stuck in the kernel.

Expected results:

SG_IO should return within the timeout set in the request.

Additional info:

Below is the stack of the thread stuck in the kernel waiting for the SG_IO ioctl to complete.

PID: 31971  TASK: ffff88037d5f2b00  CPU: 0   COMMAND: "vol_kmsg_receiv"
 #0 [ffff8803e5f2b280] schedule at ffffffff814db2e9
 #1 [ffff8803e5f2b348] schedule_timeout at ffffffff814dc065
 #2 [ffff8803e5f2b3f8] wait_for_common at ffffffff814dbce3
 #3 [ffff8803e5f2b488] wait_for_completion at ffffffff814dbdfd
 #4 [ffff8803e5f2b498] blk_execute_rq at ffffffff8124e79c
 #5 [ffff8803e5f2b548] sg_io at ffffffff812528e2
 #6 [ffff8803e5f2b618] scsi_cmd_ioctl at ffffffff81253321
 #7 [ffff8803e5f2b718] sd_ioctl at ffffffffa02320aa [sd_mod]
 #8 [ffff8803e5f2b768] __blkdev_driver_ioctl at ffffffff81250077
 #9 [ffff8803e5f2b7a8] blkdev_ioctl at ffffffff812504ed
#10 [ffff8803e5f2b7f8] block_ioctl at ffffffff811a91ac
#11 [ffff8803e5f2b808] dmp_ioctl_by_bdev at ffffffffa08723db [vxdmp]
#12 [ffff8803e5f2b858] dmp_kernel_scsi_ioctl at ffffffffa08724fc [vxdmp]
#13 [ffff8803e5f2b8a8] dmp_dev_ioctl at ffffffffa087264b [vxdmp]
#14 [ffff8803e5f2b8c8] do_passthru_ioctl at ffffffffa089b55b [vxdmp]
#15 [ffff8803e5f2b9e8] dmp_tur_temp_pgr at ffffffffa08adc2d [vxdmp]
#16 [ffff8803e5f2ba68] dmp_set_cur_pri at ffffffffa08a1a78 [vxdmp]
#17 [ffff8803e5f2bb38] dmp_setiopath at ffffffffa0898c41 [vxdmp]
#18 [ffff8803e5f2bba8] dmp_info_ioctl at ffffffffa0899206 [vxdmp]
#19 [ffff8803e5f2bc08] gendmpioctl at ffffffffa087cfff [vxdmp]
#20 [ffff8803e5f2bc28] dmpioctl at ffffffffa087e955 [vxdmp]
#21 [ffff8803e5f2bc48] vol_dmp_ktok_ioctl at ffffffffa095632e [vxio]
#22 [ffff8803e5f2bc98] dmp_set_paths at ffffffffa092f3c0 [vxio]
#23 [ffff8803e5f2bcb8] vx_dmp_config_ioctl at ffffffffa09305f5 [vxio]
#24 [ffff8803e5f2bd48] cvm_dmp_failover_path at ffffffffa092506e [vxio]
#25 [ffff8803e5f2bdb8] vol_kmsg_cluster_request at ffffffffa0920e0b [vxio]
#26 [ffff8803e5f2be08] vol_kmsg_request_receive at ffffffffa0901cdd [vxio]
#27 [ffff8803e5f2be78] vol_kmsg_receiver at ffffffffa09066f9 [vxio]
#28 [ffff8803e5f2bf48] kernel_thread at ffffffff8100c1ca

The timeout was properly set to 20 s in the sg_io_hdr_t of the stuck request.

crash> sg_io_hdr_t ffff880404b59800
struct sg_io_hdr_t {
  interface_id = 83,
  dxfer_direction = -1,
  cmd_len = 6 '\006',
  mx_sb_len = 32 ' ',
  iovec_count = 0,
  dxfer_len = 0,
  dxferp = 0x0,
  cmdp = 0xffff880404b59908 "",
  sbp = 0xffff880404b598c8,
  timeout = 20000,   <<<<<<<<<<<<<<< timeout in milliseconds = 20s
  flags = 0,
  pack_id = 0,
  usr_ptr = 0x0,
  status = 0 '\000',
  masked_status = 0 '\000',
  msg_status = 0 '\000',
  sb_len_wr = 0 '\000',
  host_status = 0,
  driver_status = 0,
  resid = 0,
  duration = 0,
  info = 0
}
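
To make the raw values in the dump easier to read, the sketch below decodes the fields that matter here: interface_id 83 is the ASCII code for 'S', dxfer_direction -1 is SG_DXFER_NONE, and timeout is given in milliseconds. The helper names dxfer_name and summarize_hdr are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <scsi/sg.h>

/* Map the dxfer_direction value to its symbolic name from <scsi/sg.h>. */
const char *dxfer_name(int dir)
{
    switch (dir) {
    case SG_DXFER_NONE:        return "SG_DXFER_NONE";
    case SG_DXFER_TO_DEV:      return "SG_DXFER_TO_DEV";
    case SG_DXFER_FROM_DEV:    return "SG_DXFER_FROM_DEV";
    case SG_DXFER_TO_FROM_DEV: return "SG_DXFER_TO_FROM_DEV";
    default:                   return "unknown";
    }
}

/* Hypothetical helper: write a one-line summary of the header into buf.
 * Returns the snprintf result (length the full summary needs). */
int summarize_hdr(const sg_io_hdr_t *h, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "interface_id='%c' dir=%s cmd_len=%u timeout=%us",
                    h->interface_id,             /* 83 prints as 'S' */
                    dxfer_name(h->dxfer_direction),
                    (unsigned int)h->cmd_len,
                    h->timeout / 1000);          /* milliseconds to seconds */
}
```

Applied to the values above, the summary reads interface_id='S' dir=SG_DXFER_NONE cmd_len=6 timeout=20s. A 6-byte command with no data phase is consistent with the TEST UNIT READY suggested by the dmp_tur_temp_pgr frame in the stack, although the dump does not state the opcode explicitly.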



We have the crash dump saved and can upload it if required. We hit this issue quite frequently during port disable/enable operations on RHEL 6.1, RHEL 6.2, and RHEL 6.3.

This hangs the whole system, so we would like to know of any available workaround. Let us know if you need any more information.

-- mukesh bafna, Symantec

Comment 2 Tom Coughlan 2012-09-26 19:39:52 UTC
It appears the I/O is not timing out. Are there any errors on the console leading up to this hang? Is there any multipath code in the stack? Is the kernel tainted? (If so, can you reproduce this on a non-tainted kernel?) 

Please provide an sosreport for a system where this problem occurs. Based on that, and the answers to the questions above, we will decide whether we need the crash dump.

Comment 5 mukesh bafna 2012-09-27 13:18:01 UTC
This issue was later raised through the Red Hat support channel:

https://access.redhat.com/support/cases/00694718

The issue is currently resolved with patches provided by Red Hat support. We are waiting to hear which release will include the fix.

Comment 6 Ewan D. Milne 2013-03-06 15:07:15 UTC
Please upload the crash dump mentioned in the problem description, along
with /var/log/messages file.  Will try to pick some info out of the dump.

Comment 7 mukesh bafna 2013-03-07 07:27:27 UTC
In case they are required, the crash dumps and logs have been uploaded through the support case raised with Red Hat:
https://access.redhat.com/support/cases/00694718

We have been notified that the fix for this problem has been released in the RHEL-6.3 GA update. We have not yet allocated resources to verify it, but we are working on that.

Comment 8 Ewan D. Milne 2013-05-08 15:03:48 UTC
Please reply whether this problem still exists in RHEL6.3 and/or RHEL6.4.

Comment 9 mukesh bafna 2013-05-20 12:08:34 UTC
We are seeing the same issue on RHEL 5.8 as well.

Comment 10 Ewan D. Milne 2013-09-24 19:31:58 UTC
Awaiting response to comment # 8 regarding whether this is still an issue
in RHEL6.3 and/or RHEL6.4, because of what was reported in comment # 7.

Comment 12 Ewan D. Milne 2013-11-08 21:03:59 UTC
Closing this bug due to lack of response from the reporter.

Re-open it if this is still an issue.