FC devices have a fast_io_fail_tmo that sets how long the SCSI layer will wait before passing failures back up. iSCSI has a node.conn[0].timeo.noop_out_timeout parameter that does the same thing. If these aren't set correctly, the SCSI layer may wait too long before failing back I/O, and you can see these messages. Do you know if IB has something similar? Could you please also verify that directio doesn't suffer from this problem?
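For reference, the two knobs mentioned above can be tuned along these lines (a sketch; the rport glob and the 5-second values are illustrative, and the iSCSI changes only take effect on the next session login):

```shell
# FC: on a downed remote port, fail queued I/O back to the multipath layer
# after 5s instead of holding it for the full dev_loss_tmo
for rport in /sys/class/fc_remote_ports/rport-*; do
    [ -w "$rport/fast_io_fail_tmo" ] && echo 5 > "$rport/fast_io_fail_tmo"
done

# iSCSI: send a NOP-Out ping every 5s and fail the connection if no reply
# arrives within 5s (both values are in seconds)
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 5
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 5
```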
Hi Ben
Directio also hangs. IB has srp_dev_loss_tmo, which defaults to 60 seconds. I have tried many settings, from 5 seconds up to 60 seconds, but it does not seem to fix this.
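For the record, this is how I've been adjusting it (a sketch; srp_dev_loss_tmo is the OFED ib_srp module parameter name quoted above — confirm it with `modinfo ib_srp` on your build, since the stock kernel module may expose a different knob):

```shell
# Persist a shorter SRP device-loss timeout across module reloads
cat > /etc/modprobe.d/ib_srp.conf <<'EOF'
options ib_srp srp_dev_loss_tmo=15
EOF

# After reloading ib_srp, verify the live value
cat /sys/module/ib_srp/parameters/srp_dev_loss_tmo
```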
Thanks for looking at this.
Karan
Upon investigation I found out that the problem lies with OFED. I have uninstalled OFED and I am now using stock RHEL IB packages. The problem has disappeared. For the record, I was using OFED-1.5.4.
Thanks again.
Created attachment 601523 [details]
Messages from the Initiator

Description of problem:
If the path checker is set to tur, I sometimes get hung-task messages in /var/log/messages on failover. Please see the attached messages for more info. I think it goes away if I change the path checker to directio, but I am not completely certain.

Jul 30 09:45:55 ashe kernel: INFO: task simpled:6952 blocked for more than 120 seconds.
Jul 30 09:45:55 ashe kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 30 09:45:55 ashe kernel: simpled D 0000000000000007 0 6952 6940 0x00000080
Jul 30 09:45:55 ashe kernel: ffff8804077a5a98 0000000000000082 0000000000000000 ffffffffa00041fc
Jul 30 09:45:55 ashe kernel: ffff8804079d85b8 ffff88040f993c00 0000000000000001 000000000000000c
Jul 30 09:45:55 ashe kernel: ffff88041f9a3ab8 ffff8804077a5fd8 000000000000fb88 ffff88041f9a3ab8
Jul 30 09:45:55 ashe kernel: Call Trace:
Jul 30 09:45:55 ashe kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Jul 30 09:45:55 ashe kernel: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b676e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Jul 30 09:45:55 ashe kernel: [<ffffffff811b6c5e>] __blockdev_direct_IO+0x5e/0xd0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4377>] blkdev_direct_IO+0x57/0x60
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3510>] ? blkdev_get_blocks+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff81114e62>] generic_file_direct_write+0xc2/0x190
Jul 30 09:45:55 ashe kernel: [<ffffffff81116675>] __generic_file_aio_write+0x345/0x480
Jul 30 09:45:55 ashe kernel: [<ffffffff811b4e00>] ? blkdev_open+0x0/0xc0
Jul 30 09:45:55 ashe kernel: [<ffffffff811b3b0c>] blkdev_aio_write+0x3c/0xa0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117ae9a>] do_sync_write+0xfa/0x140
Jul 30 09:45:55 ashe kernel: [<ffffffff8118c2f0>] ? do_filp_open+0x780/0xd60
Jul 30 09:45:55 ashe kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40
Jul 30 09:45:55 ashe kernel: [<ffffffff81213266>] ? security_file_permission+0x16/0x20
Jul 30 09:45:55 ashe kernel: [<ffffffff8117b198>] vfs_write+0xb8/0x1a0
Jul 30 09:45:55 ashe kernel: [<ffffffff810d6b12>] ? audit_syscall_entry+0x272/0x2a0
Jul 30 09:45:55 ashe kernel: [<ffffffff8117bbb1>] sys_write+0x51/0x90
Jul 30 09:45:55 ashe kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jul 30 09:45:55 ashe kernel: INFO: task simpled:6953 blocked for more than 120 seconds.

Version-Release number of selected component (if applicable):
device-mapper-multipath-libs-0.4.9-56.el6_3.1.x86_64
device-mapper-multipath-0.4.9-56.el6_3.1.x86_64

How reproducible:
Not 100% reproducible, but it happens often (roughly 50% of the time). Run I/O to several mpath devices with multiple threads, e.g.
sgp_dd thr=8 if=/dev/zero of=/dev/dm-x bpt=2048 &
Then pull the primary path cable. I was using IB SRP; I am sure this would happen with FC/iSCSI as well.

Steps to Reproduce:
1. Run I/O
2. Pull the primary path cable (or reboot the primary storage controller)
3. Check dmesg

Actual results:
Hung tasks; sometimes SSH hangs.

Expected results:
No hangs.

Additional info:
[root@ashe ~]# uname -a
Linux ashe 2.6.32-279.2.1.el6.x86_64 #1 SMP Thu Jul 5 21:08:58 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@ashe ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m
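The reproduction above can be scripted roughly like this (a sketch; the dm-x device names are examples for my setup, and sgp_dd comes from the sg3_utils package):

```shell
# Drive multi-threaded direct I/O at several multipath devices at once
for dm in /dev/dm-2 /dev/dm-3 /dev/dm-4; do
    sgp_dd thr=8 if=/dev/zero of="$dm" bpt=2048 &
done

# ... now pull the primary-path cable, or reboot the primary controller ...

# After ~120 seconds, any stuck writers show up as hung-task warnings:
dmesg | grep 'blocked for more than'
```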