Bug 590069

Summary: Multipathd hang and IO timeout after controller resets

Product: Red Hat Enterprise Linux 6
Component: device-mapper-multipath
Version: 6.1
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: low
Target Milestone: rc
Reporter: Sean Stewart <Sean.Stewart>
Assignee: LVM and device-mapper development team <lvm-team>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: abdel.sadek, agk, babu.moger, bmarzins, christophe.varoqui, dl-iop-bugzilla, dwysocha, egoggin, heinzm, junichi.nomura, kueda, lmb, mbroz, prockai, Sean.Stewart, tranlan, vijay.chauhan
Doc Type: Bug Fix
Last Closed: 2010-06-03 14:19:03 UTC

Attachments:
Logs from one of the hosts during the test run

Description Sean Stewart 2010-05-07 16:12:38 UTC
Created attachment 412393 [details]
Logs from one of the hosts during the test run

Description of problem:
I was running IO to 64 volumes mapped to each host and started a script to reboot the controllers at a 10-minute interval. Shortly after the second reboot, a stack trace appeared in the messages log reporting that multipathd had hung. The IO tool also appears to have hung, which caused the IO to time out after 10 minutes. The problem is unlikely to be the IO tool itself, because I have been using the same version throughout our RHEL6 testing and did not hit this problem until recently.

Equipment:
2 hosts x 2 arrays
Host 1 has Emulex HBAs
Host 2 has QLogic HBAs
Both hosts are running RHEL6 snapshot 2.

Version-Release number of selected component (if applicable):
0.4.9-18

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL6 on the hosts and map 64 volumes to each.
2. Start IO and an automated controller reboot script (a sketch of the reboot loop follows this list).
3. Wait for IO to time out after the second controller reboot.
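
For illustration only, here is a minimal Python sketch of the kind of reboot loop described above. REBOOT_CMD is a hypothetical placeholder: the actual controller reset is array-specific (issued through the array's management interface) and is not part of this report. The IO load is assumed to be started separately; the 10-minute interval and the hung-task check against /var/log/messages follow the procedure described here.

#!/usr/bin/env python
# Sketch of the reproduction loop (assumptions noted inline).
import subprocess
import time

REBOOT_INTERVAL = 10 * 60    # reboot a controller every 10 minutes
REBOOT_CMD = ["/usr/local/bin/reboot-controller"]    # hypothetical helper, array-specific
MESSAGES = "/var/log/messages"

def hung_task_reported():
    """Return True once the kernel has logged a hung-task warning."""
    with open(MESSAGES) as log:
        return any("blocked for more than 120 seconds" in line for line in log)

def main():
    cycle = 0
    while not hung_task_reported():
        time.sleep(REBOOT_INTERVAL)
        controller = "A" if cycle % 2 == 0 else "B"
        subprocess.call(REBOOT_CMD + [controller])    # alternate the controllers
        cycle += 1
    print("hung task reported; see /var/log/messages for the stack trace")

if __name__ == "__main__":
    main()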
  
Actual results:
After the second controller reboot, the kernel reports that multipathd and LinuxSmash are hung and prints a stack trace for each. Ten minutes later, the IO times out. It looks like the path checker marked the paths as failed, but failover did not take place.

Expected results:
Within about 10 seconds of the reboot, the host should detect the path loss. The multipath path checker should mark the affected paths as failed, and the host should issue a MODE SELECT to transfer ownership of the volumes to the alternate controller. After the controller reboot completes, the volumes should fail back.
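
Purely as an illustration (not part of the original report), a rough Python sketch of how this expected timing could be checked from the host by polling `multipath -ll`: failed paths should appear within roughly 10 seconds of the controller reset, and no paths should remain failed once the controller comes back. The "failed" string match is an assumption about the `multipath -ll` output format on this release, not a tested parser.

#!/usr/bin/env python
# Rough host-side check of the expected failover/failback timing.
import subprocess
import time

def failed_path_count():
    """Count path lines in `multipath -ll` output that are marked failed."""
    out = subprocess.Popen(["multipath", "-ll"],
                           stdout=subprocess.PIPE).communicate()[0]
    return sum(1 for line in out.splitlines() if b"failed" in line)

def seconds_until(predicate, timeout, poll=2):
    """Poll predicate() until it is true; return elapsed seconds, or None on timeout."""
    start = time.time()
    while time.time() - start < timeout:
        if predicate():
            return round(time.time() - start, 1)
        time.sleep(poll)
    return None

if __name__ == "__main__":
    # Expected: paths on the reset controller are marked failed within ~10 seconds...
    print("time to first failed path: %s s" % seconds_until(lambda: failed_path_count() > 0, 60))
    # ...and all paths return to active once the controller reboot completes.
    print("time to failback: %s s" % seconds_until(lambda: failed_path_count() == 0, 900))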

Additional info:
Here is the stack trace:

May  6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task multipathd:6190 blocked for more than 120 seconds.
May  6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 18:35:11 dhcp-135-15-75-65 kernel: multipathd    D ffffffffffffffff     0  6190      1 0x00000080
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 0000000000000086 ffff880419941200 0000000000000001
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 ffffffffa000482a 0000000000000000 00000000c4fbd4a6
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fd4806b8 ffff8803fb085fd8 000000000000fcb8 ffff8803fd4806b8
May  6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000482a>] ? validate_hardware_logical_block_alignment+0x7a/0x140 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c8522>] __mutex_lock_slowpath+0xe2/0x160
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c841b>] mutex_lock+0x2b/0x50
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0000b3f>] dm_swap_table+0xdf/0x250 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0003fa2>] ? suspend_targets+0x42/0x80 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000806f>] dev_suspend+0x11f/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008993>] ctl_ioctl+0x1a3/0x240 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112f6ac>] ? handle_pte_fault+0x3cc/0x990
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008a43>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811769d2>] vfs_ioctl+0x22/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81176b74>] do_vfs_ioctl+0x84/0x580
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112fe5e>] ? handle_mm_fault+0x1ee/0x2b0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811770f1>] sys_ioctl+0x81/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
May  6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task LinuxSmash:9597 blocked for more than 120 seconds.
May  6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 18:35:11 dhcp-135-15-75-65 kernel: LinuxSmash    D ffff880418359c28     0  9597   7752 0x00020080
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359b48 0000000000000082 0000000000000021 ffff880405c8a2f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359ad8 ffffffff8122d712 ffff88040d77a948 0000000000000021
May  6 18:35:11 dhcp-135-15-75-65 kernel: ffff880412255b38 ffff880418359fd8 000000000000fcb8 ffff880412255b38
May  6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8122d712>] ? __elv_add_request+0x72/0xd0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7dc5>] schedule_timeout+0x225/0x2f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8123699f>] ? generic_make_request+0x1af/0x4f0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81109fc5>] ? mempool_alloc_slab+0x15/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7a1d>] wait_for_common+0x11d/0x190
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81058d60>] ? default_wake_function+0x0/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81236d62>] ? submit_bio+0x82/0x110
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7b4d>] wait_for_completion+0x1d/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81239ef6>] blkdev_issue_flush+0xa6/0xe0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811990bf>] blkdev_fsync+0x3f/0x60
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f441>] vfs_fsync_range+0xa1/0xe0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f4cb>] generic_write_sync+0x4b/0x50
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81197c8e>] blkdev_aio_write+0x7e/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81163eba>] do_sync_write+0xfa/0x140
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810a1260>] ? do_futex+0x100/0xae0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8108dc30>] ? autoremove_wake_function+0x0/0x40
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f4751>] ? cap_file_permission+0x1/0x10
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013a8e>] ? common_interrupt+0xe/0x13
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f2e06>] ? security_file_permission+0x16/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811641b8>] vfs_write+0xb8/0x1a0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81164cb2>] sys_pwrite64+0x82/0xa0
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8104939c>] sys32_pwrite+0x1c/0x20
May  6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810487ef>] sysenter_dispatch+0x7/0x2e
May  6 18:35:26 dhcp-135-15-75-65 kernel: .ready

Comment 2 RHEL Program Management 2010-05-07 17:38:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering for potential inclusion in a Red
Hat Enterprise Linux major release. This request is not yet committed for
inclusion.

Comment 3 Ben Marzinski 2010-05-24 21:48:09 UTC
What kind of hardware are you using for your storage?

Comment 4 Sean Stewart 2010-05-25 14:50:05 UTC
The hardware is LSI 8Gbps fibre channel storage arrays.

Comment 5 Abdel Sadek 2010-06-02 21:52:35 UTC
Ben:
This issue looks similar to bug 591608. Do you agree? If so, can you mark this one as a duplicate of it?

Comment 6 Ben Marzinski 2010-06-03 14:19:03 UTC
Yes, this looks like the same bug.

*** This bug has been marked as a duplicate of bug 591608 ***