Bug 844715 - Hung task in kernel, also SSH hangs when primary path to storage is lost
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Ben Marzinski
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-07-31 09:29 EDT by Karandeep Chahal
Modified: 2012-08-03 16:04 EDT (History)
10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-08-03 16:04:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Messages from the Initiator (625.22 KB, application/x-gzip)
2012-07-31 09:29 EDT, Karandeep Chahal

Description Karandeep Chahal 2012-07-31 09:29:14 EDT
Created attachment 601523
Messages from the Initiator

Description of problem:

If the path checker is set to tur, I sometimes get hung-task messages in /var/log/messages on failover. Please see the attached messages for more information. I think the problem goes away if I change the path checker to directio, but I am not completely certain.
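
For reference, the checker is selected in /etc/multipath.conf; a minimal fragment along these lines is what I mean (the polling_interval value here is just an example, not my exact config):

```
defaults {
    # checker in use when the hangs occur; change to "directio" to test
    # the alternate checker
    path_checker     tur
    polling_interval 5
}
```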

Jul 30 09:45:55 ashe kernel: INFO: task simpled:6952 blocked for more than 120 seconds.
Jul 30 09:45:55 ashe kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 30 09:45:55 ashe kernel: simpled       D 0000000000000007     0  6952   6940 0x00000080
Jul 30 09:45:55 ashe kernel: ffff8804077a5a98 0000000000000082 0000000000000000 ffffffffa00041fc
Jul 30 09:45:55 ashe kernel: ffff8804079d85b8 ffff88040f993c00 0000000000000001 000000000000000c
Jul 30 09:45:55 ashe kernel: ffff88041f9a3ab8 ffff8804077a5fd8 000000000000fb88 ffff88041f9a3ab8
Jul 30 09:45:55 ashe kernel: Call Trace:
Jul 30 09:45:55 ashe kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Jul 30 09:45:55 ashe kernel: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b676e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Jul 30 09:45:55 ashe kernel: [<ffffffff811b6c5e>] __blockdev_direct_IO+0x5e/0xd0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4377>] blkdev_direct_IO+0x57/0x60
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff81114e62>] generic_file_direct_write+0xc2/0x190
Jul 30 09:45:55 ashe kernel: [<ffffffff81116675>] __generic_file_aio_write+0x345/0x480
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4e00>] ? blkdev_open+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3b0c>] blkdev_aio_write+0x3c/0xa0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117ae9a>] do_sync_write+0xfa/0x140
Jul 30 09:45:55 ashe kernel: [<ffffffff8118c2f0>] ? do_filp_open+0x780/0xd60
Jul 30 09:45:55 ashe kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jul 30 09:45:55 ashe kernel: [<ffffffff81213266>] ? security_file_permission+0x16/0x20
Jul 30 09:45:55 ashe kernel: [<ffffffff8117b198>] vfs_write+0xb8/0x1a0
Jul 30 09:45:55 ashe kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117bbb1>] sys_write+0x51/0x90
Jul 30 09:45:55 ashe kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jul 30 09:45:55 ashe kernel: INFO: task simpled:6953 blocked for more than 120 seconds.



Version-Release number of selected component (if applicable):
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64


How reproducible:
This is _not_ 100% reproducible, but it happens often: roughly 50% of the time.

Run I/O to several mpath devices with multiple threads, e.g. sgp_dd thr=8 if=/dev/zero of=/dev/dm-x bpt=2048 &

Then pull the primary path cable. I was using IB SRP; I expect this would happen with FC/iSCSI as well.

Steps to Reproduce:
1. Run IO
2. Pull primary path cable (or reboot the primary storage controller)
3. Check dmesg
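
Roughly, my reproduction looks like this (device names are illustrative; match them to your `multipath -ll` output):

```
# fire off background writers against several multipath devices
for dev in /dev/mapper/mpatha /dev/mapper/mpathb; do
    sgp_dd thr=8 if=/dev/zero of=$dev bpt=2048 &
done
# ...pull the primary path cable, wait a couple of minutes, then:
dmesg | grep -i 'blocked for more than'
```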
  
Actual results:
Hung tasks, sometimes SSH hangs.

Expected results:
No hangs.

Additional info:
[root@ashe ~]# uname -a
Linux ashe 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@ashe ~]# 

[root@ashe ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m

[root@ashe ~]#
Comment 1 Karandeep Chahal 2012-07-31 09:31:42 EDT
Please let me know if any other information is needed or if you would like me to run some more tests.
Comment 3 Ben Marzinski 2012-08-02 11:42:05 EDT
FC drives have a fast_io_fail_tmo that sets how long the SCSI layer will wait before passing failures up. iSCSI has a node.conn[0].timeo.noop_out_timeout parameter that does the same thing. If these aren't set correctly, the SCSI layer may wait too long before failing the I/O back, and you can see these messages. Do you know if IB has something similar? Could you also please verify that directio doesn't suffer from this problem?
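
For example, on FC the timeouts can be set in /etc/multipath.conf, and on iSCSI in /etc/iscsi/iscsid.conf (the specific values below are illustrative, not recommendations for your setup):

```
# /etc/multipath.conf (FC)
defaults {
    fast_io_fail_tmo 5      # seconds before the SCSI layer fails I/O back
    dev_loss_tmo     600    # seconds before the path devices are removed
}

# /etc/iscsi/iscsid.conf (iSCSI)
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
```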
Comment 4 Karandeep Chahal 2012-08-02 11:45:27 EDT
Hi Ben

Directio also hangs. IB has srp_dev_loss_tmo, which defaults to 60 seconds. I have tried many settings, from 5 seconds to 60 seconds, but none of them seems to fix this.
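
For the record, I was adjusting the timeout roughly like this (the parameter and sysfs path are as on my OFED install and may differ on stock RHEL):

```
# check the current value
cat /sys/module/ib_srp/parameters/srp_dev_loss_tmo
# reload the module with a shorter timeout
modprobe -r ib_srp && modprobe ib_srp srp_dev_loss_tmo=5
```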

Thanks for looking at this.
Karan
Comment 5 Karandeep Chahal 2012-08-02 22:37:59 EDT
Upon investigation I found that the problem lies with OFED. I have uninstalled OFED and am now using the stock RHEL IB packages; the problem has disappeared. For the record, I was using OFED-1.5.4.

Thanks again.
