Bug 590069 - Multipathd hang and IO timeout after controller resets
Status: CLOSED DUPLICATE of bug 591608
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: device-mapper-multipath
Version: 6.1
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: rc
Assigned To: LVM and device-mapper development team
QA Contact: Red Hat Kernel QE team
 
Reported: 2010-05-07 12:12 EDT by Sean Stewart
Modified: 2010-06-03 10:19 EDT
CC: 17 users

Doc Type: Bug Fix
Last Closed: 2010-06-03 10:19:03 EDT

Attachments
Logs from one of the hosts during the test run (306.15 KB, application/octet-stream)
2010-05-07 12:12 EDT, Sean Stewart
Description Sean Stewart 2010-05-07 12:12:38 EDT
Created attachment 412393
Logs from one of the hosts during the test run

Description of problem:
While running IO to 64 volumes mapped to each host, I started a script that reboots the controllers at a 10-minute interval. Shortly after the second reboot, a stack trace appeared in the messages log indicating that multipathd had hung. The IO tool also appears to have hung, which caused IO to time out after 10 minutes. The problem is unlikely to be the IO tool itself, because I have been using the same version throughout our RHEL6 testing and did not see this problem until recently.

Equipment:
2 hosts x 2 arrays
Host 1 has Emulex HBAs
Host 2 has QLogic HBAs
Both hosts are running RHEL6 snapshot 2.

Version-Release number of selected component (if applicable):
0.4.9-18

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL6 on the hosts, map 64 volumes to each
2. Start IO and an automated reboot script
3. Wait for IO to time out after the second controller reboot
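The automated reboot script from the steps above can be sketched roughly as follows. Here `reset_controller` is a hypothetical stand-in for the array vendor's CLI command that resets one controller; it is not part of the bug report.

```shell
#!/bin/sh
# Hedged sketch of the automated controller-reboot loop used in the test.
# reset_controller is a hypothetical placeholder for the vendor CLI.
reset_controller() {
    echo "reset controller $1"
    # the real script would invoke the vendor reset tool here
}

ctrl=A
for cycle in 1 2; do          # the hang appeared after the second reset
    reset_controller "$ctrl"
    if [ "$ctrl" = A ]; then ctrl=B; else ctrl=A; fi
    # sleep 600               # the 10-minute interval used in the real run
done
```

In the actual run the loop continued on a 10-minute cadence; the sketch stops after two cycles because that is when the hang was observed.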
  
Actual results:
After the second reboot, multipathd and LinuxSmash report that they are hung, and each outputs a stack trace. Ten minutes later, the IO times out. It appears that the path checker failed the paths, but failover did not take place.

Expected results:
10 seconds after the reboot, the host should detect the path loss. The multipath checker should mark the paths as failed, and the host should issue a MODE SELECT to transfer ownership of the volumes to the alternate controller. After the controller reboot completes, the volumes should fail back.
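The expected path-state transitions can be checked from the host by inspecting `multipath -ll` output. A minimal sketch follows; the map name, WWID placeholder, and path layout in `SAMPLE` are assumed for illustration, not taken from the bug report.

```shell
#!/bin/sh
# Hedged sketch: count active vs. failed paths from multipath -ll output.
# SAMPLE approximates the listing for one map; in practice, pipe the real
# command output (e.g. multipath -ll mpatha) through the same pipeline.
SAMPLE='mpatha (WWID) dm-2 LSI,INF-01-00
size=10G features=0 hwhandler=1 rdac wp=rw
|-+- policy=round-robin 0 prio=6 status=active
| `- 3:0:0:1 sdb 8:16 active ready  running
`-+- policy=round-robin 0 prio=0 status=enabled
  `- 4:0:0:1 sdc 8:32 failed faulty running'

failed=$(printf '%s\n' "$SAMPLE" | grep -c 'failed')
active=$(printf '%s\n' "$SAMPLE" | grep -c ' active ready')
echo "failed=$failed active=$active"
```

Polling this in a loop during a controller reset would show whether the checker marks the paths failed and whether they return after failback, which is the behavior described above.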

Additional info:
Here is the stack trace:

May  6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task multipathd:6190 blocked for more than 120 seconds.
May  6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 18:35:11 dhcp-135-15-75-65 kernel: multipathd    D ffffffffffffffff     0  6190      1 0x00000080
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 0000000000000086 ffff880419941200 0000000000000001
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 ffffffffa000482a 0000000000000000 00000000c4fbd4a6
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fd4806b8 ffff8803fb085fd8 000000000000fcb8 ffff8803fd4806b8
May  6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000482a>] ? validate_hardware_logical_block_alignment+0x7a/0x140 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c8522>] __mutex_lock_slowpath+0xe2/0x160
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c841b>] mutex_lock+0x2b/0x50
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0000b3f>] dm_swap_table+0xdf/0x250 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0003fa2>] ? suspend_targets+0x42/0x80 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000806f>] dev_suspend+0x11f/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008993>] ctl_ioctl+0x1a3/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112f6ac>] ? handle_pte_fault+0x3cc/0x990
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008a43>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811769d2>] vfs_ioctl+0x22/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81176b74>] do_vfs_ioctl+0x84/0x580
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112fe5e>] ? handle_mm_fault+0x1ee/0x2b0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811770f1>] sys_ioctl+0x81/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
May  6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task LinuxSmash:9597 blocked for more than 120 seconds.
May  6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 18:35:11 dhcp-135-15-75-65 kernel: LinuxSmash    D ffff880418359c28     0  9597   7752 0x00020080
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359b48 0000000000000082 0000000000000021 ffff880405c8a2f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359ad8 ffffffff8122d712 ffff88040d77a948 0000000000000021
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880412255b38 ffff880418359fd8 000000000000fcb8 ffff880412255b38
May  6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8122d712>] ? __elv_add_request+0x72/0xd0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7dc5>] schedule_timeout+0x225/0x2f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8123699f>] ? generic_make_request+0x1af/0x4f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81109fc5>] ? mempool_alloc_slab+0x15/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7a1d>] wait_for_common+0x11d/0x190
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81058d60>] ? default_wake_function+0x0/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81236d62>] ? submit_bio+0x82/0x110
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7b4d>] wait_for_completion+0x1d/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81239ef6>] blkdev_issue_flush+0xa6/0xe0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811990bf>] blkdev_fsync+0x3f/0x60
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f441>] vfs_fsync_range+0xa1/0xe0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f4cb>] generic_write_sync+0x4b/0x50
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81197c8e>] blkdev_aio_write+0x7e/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81163eba>] do_sync_write+0xfa/0x140
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810a1260>] ? do_futex+0x100/0xae0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8108dc30>] ? autoremove_wake_function+0x0/0x40
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f4751>] ? cap_file_permission+0x1/0x10
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013a8e>] ? common_interrupt+0xe/0x13
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f2e06>] ? security_file_permission+0x16/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811641b8>] vfs_write+0xb8/0x1a0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81164cb2>] sys_pwrite64+0x82/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8104939c>] sys32_pwrite+0x1c/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810487ef>] sysenter_dispatch+0x7/0x2e
May  6 18:35:26 dhcp-135-15-75-65 kernel: .ready
Comment 2 RHEL Product and Program Management 2010-05-07 13:38:39 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Ben Marzinski 2010-05-24 17:48:09 EDT
What kind of hardware are you using for your storage?
Comment 4 Sean Stewart 2010-05-25 10:50:05 EDT
The hardware is LSI 8Gbps fibre channel storage arrays.
Comment 5 Abdel Sadek 2010-06-02 17:52:35 EDT
Ben,
This issue looks similar to bug 591608. Do you agree? If so, can you mark this one as a duplicate of it?
Comment 6 Ben Marzinski 2010-06-03 10:19:03 EDT
Yes, this looks like the same bug.

*** This bug has been marked as a duplicate of bug 591608 ***
