Bug 844715 - Hung task in kernel, also SSH hangs when primary path to storage is lost
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-07-31 09:29 EDT by Karandeep Chahal
Modified: 2012-08-03 16:04 EDT (History)
10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-08-03 16:04:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Messages from the Initiator (625.22 KB, application/x-gzip)
2012-07-31 09:29 EDT, Karandeep Chahal

Description Karandeep Chahal 2012-07-31 09:29:14 EDT
Created attachment 601523
Messages from the Initiator

Description of problem:

If the path checker is set to tur, I sometimes get hung-task messages in /var/log/messages on failover. Please see the attached messages for more information. I think the problem goes away if I change the path checker to directio, but I am not completely certain.
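
For reference, the checker is selected in /etc/multipath.conf; a minimal fragment along these lines is what I mean (the polling_interval value here is just an example, not my exact config):

```
defaults {
    # checker in use when the hangs occur; change to "directio" to test
    # the alternate checker
    path_checker     tur
    polling_interval 5
}
```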

Jul 30 09:45:55 ashe kernel: INFO: task simpled:6952 blocked for more than 120 seconds.
Jul 30 09:45:55 ashe kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 30 09:45:55 ashe kernel: simpled       D 0000000000000007     0  6952   6940 0x00000080
Jul 30 09:45:55 ashe kernel: ffff8804077a5a98 0000000000000082 0000000000000000 ffffffffa00041fc
Jul 30 09:45:55 ashe kernel: ffff8804079d85b8 ffff88040f993c00 0000000000000001 000000000000000c
Jul 30 09:45:55 ashe kernel: ffff88041f9a3ab8 ffff8804077a5fd8 000000000000fb88 ffff88041f9a3ab8
Jul 30 09:45:55 ashe kernel: Call Trace:
Jul 30 09:45:55 ashe kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Jul 30 09:45:55 ashe kernel: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b676e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Jul 30 09:45:55 ashe kernel: [<ffffffff811b6c5e>] __blockdev_direct_IO+0x5e/0xd0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4377>] blkdev_direct_IO+0x57/0x60
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff81114e62>] generic_file_direct_write+0xc2/0x190
Jul 30 09:45:55 ashe kernel: [<ffffffff81116675>] __generic_file_aio_write+0x345/0x480
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4e00>] ? blkdev_open+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3b0c>] blkdev_aio_write+0x3c/0xa0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117ae9a>] do_sync_write+0xfa/0x140
Jul 30 09:45:55 ashe kernel: [<ffffffff8118c2f0>] ? do_filp_open+0x780/0xd60
Jul 30 09:45:55 ashe kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jul 30 09:45:55 ashe kernel: [<ffffffff81213266>] ? security_file_permission+0x16/0x20
Jul 30 09:45:55 ashe kernel: [<ffffffff8117b198>] vfs_write+0xb8/0x1a0
Jul 30 09:45:55 ashe kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117bbb1>] sys_write+0x51/0x90
Jul 30 09:45:55 ashe kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jul 30 09:45:55 ashe kernel: INFO: task simpled:6953 blocked for more than 120 seconds.



Version-Release number of selected component (if applicable):
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64


How reproducible:
This is _not_ 100% reproducible, but it happens often: roughly 50% of the time.

Run I/O to several mpath devices with multiple threads, e.g. sgp_dd thr=8 if=/dev/zero of=/dev/dm-x bpt=2048 &

Then pull the primary path cable. I was using IB SRP; I expect this would happen with FC/iSCSI as well.

Steps to Reproduce:
1. Run IO
2. Pull primary path cable (or reboot the primary storage controller)
3. Check dmesg
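
Roughly, my reproduction looks like this (device names are illustrative; match them to your `multipath -ll` output):

```
# fire off background writers against several multipath devices
for dev in /dev/mapper/mpatha /dev/mapper/mpathb; do
    sgp_dd thr=8 if=/dev/zero of=$dev bpt=2048 &
done
# ...pull the primary path cable, wait a couple of minutes, then:
dmesg | grep -i 'blocked for more than'
```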
  
Actual results:
Hung tasks, sometimes SSH hangs.

Expected results:
No hangs.

Additional info:
[root@ashe ~]# uname -a
Linux ashe 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@ashe ~]# 

[root@ashe ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m

[root@ashe ~]#
Comment 1 Karandeep Chahal 2012-07-31 09:31:42 EDT
Please let me know if any other information is needed or if you would like me to run some more tests.
Comment 3 Ben Marzinski 2012-08-02 11:42:05 EDT
FC drives have a fast_io_fail_tmo that sets how long the SCSI layer will wait before passing failures up. iSCSI has a node.conn[0].timeo.noop_out_timeout parameter that does the same thing. If these aren't set correctly, the SCSI layer may wait too long before failing the I/O back, and you can see these messages. Do you know if IB has something similar? Could you also please verify that directio doesn't suffer from this problem?
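
For example, on FC the timeouts can be set in /etc/multipath.conf, and on iSCSI in /etc/iscsi/iscsid.conf (the specific values below are illustrative, not recommendations for your setup):

```
# /etc/multipath.conf (FC)
defaults {
    fast_io_fail_tmo 5      # seconds before the SCSI layer fails I/O back
    dev_loss_tmo     600    # seconds before the path devices are removed
}

# /etc/iscsi/iscsid.conf (iSCSI)
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
```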
Comment 4 Karandeep Chahal 2012-08-02 11:45:27 EDT
Hi Ben

Directio also hangs. IB has srp_dev_loss_tmo, which defaults to 60 seconds. I have tried many settings, from 5 seconds to 60 seconds, but none of them seems to fix this.
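
For the record, I was adjusting the timeout roughly like this (the parameter and sysfs path are as on my OFED install and may differ on stock RHEL):

```
# check the current value
cat /sys/module/ib_srp/parameters/srp_dev_loss_tmo
# reload the module with a shorter timeout
modprobe -r ib_srp && modprobe ib_srp srp_dev_loss_tmo=5
```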

Thanks for looking at this.
Karan
Comment 5 Karandeep Chahal 2012-08-02 22:37:59 EDT
Upon investigation I found that the problem lies with OFED. I have uninstalled OFED and am now using the stock RHEL IB packages; the problem has disappeared. For the record, I was using OFED-1.5.4.

Thanks again.
