Created attachment 412393[details]
Logs from one of the hosts during the test run
Description of problem:
Running IO to 64 volumes mapped to each host. I started a script to reboot the controllers on a 10 minute interval. Shortly after the second reboot, a stack trace was reported in the messages, saying that the multipathd had hung. The IO tool also seems to have hung, and this caused IO to time out after 10 minutes. The problem cannot be with the IO tool, because I have been using the same version with our RHEL6 testing and did not have this problem until recently.
Equipment:
2 hosts x 2 arrays
Host 1 has Emulex HBAs
Host 2 has QLogic HBAs
Both hosts are running RHEL6 snapshot 2.
Version-Release number of selected component (if applicable):
0.4.9-18
How reproducible:
Always
Steps to Reproduce:
1. Install RHEL6 on the hosts, map 64 volumes to each
2. Start IO and an automated reboot script
3. Wait for IO to time out after the second controller reboot
Actual results:
After the second reboot, Multipathd and LinuxSmash report that they are hung and output a stack trace. 10 minutes later, the IO times out. It looks like the path checker failed the paths, but failover did not take place.
Expected results:
10 seconds after the reboot, the host should detect the path loss. Multipath checker should mark the paths failed, and the host should issue a mode select to give ownership of the volumes to the alternate controller. After the controller reboot is complete, the volumes should fail back.
Additional info:
Here is the stack trace:
May 6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task multipathd:6190 blocked for more than 120 seconds.
May 6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 6 18:35:11 dhcp-135-15-75-65 kernel: multipathd D ffffffffffffffff 0 6190 1 0x00000080
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 0000000000000086 ffff880419941200 0000000000000001
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 ffffffffa000482a 0000000000000000 00000000c4fbd4a6
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fd4806b8 ffff8803fb085fd8 000000000000fcb8 ffff8803fd4806b8
May 6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000482a>] ? validate_hardware_logical_block_alignment+0x7a/0x140 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c8522>] __mutex_lock_slowpath+0xe2/0x160
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c841b>] mutex_lock+0x2b/0x50
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0000b3f>] dm_swap_table+0xdf/0x250 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0003fa2>] ? suspend_targets+0x42/0x80 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000806f>] dev_suspend+0x11f/0x240 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008993>] ctl_ioctl+0x1a3/0x240 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112f6ac>] ? handle_pte_fault+0x3cc/0x990
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008a43>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811769d2>] vfs_ioctl+0x22/0xa0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81176b74>] do_vfs_ioctl+0x84/0x580
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112fe5e>] ? handle_mm_fault+0x1ee/0x2b0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811770f1>] sys_ioctl+0x81/0xa0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
May 6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task LinuxSmash:9597 blocked for more than 120 seconds.
May 6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 6 18:35:11 dhcp-135-15-75-65 kernel: LinuxSmash D ffff880418359c28 0 9597 7752 0x00020080
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359b48 0000000000000082 0000000000000021 ffff880405c8a2f0
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359ad8 ffffffff8122d712 ffff88040d77a948 0000000000000021
May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880412255b38 ffff880418359fd8 000000000000fcb8 ffff880412255b38
May 6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace:
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8122d712>] ? __elv_add_request+0x72/0xd0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7dc5>] schedule_timeout+0x225/0x2f0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8123699f>] ? generic_make_request+0x1af/0x4f0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81109fc5>] ? mempool_alloc_slab+0x15/0x20
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7a1d>] wait_for_common+0x11d/0x190
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81058d60>] ? default_wake_function+0x0/0x20
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81236d62>] ? submit_bio+0x82/0x110
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7b4d>] wait_for_completion+0x1d/0x20
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81239ef6>] blkdev_issue_flush+0xa6/0xe0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811990bf>] blkdev_fsync+0x3f/0x60
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f441>] vfs_fsync_range+0xa1/0xe0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f4cb>] generic_write_sync+0x4b/0x50
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81197c8e>] blkdev_aio_write+0x7e/0xa0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81163eba>] do_sync_write+0xfa/0x140
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810a1260>] ? do_futex+0x100/0xae0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8108dc30>] ? autoremove_wake_function+0x0/0x40
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f4751>] ? cap_file_permission+0x1/0x10
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013a8e>] ? common_interrupt+0xe/0x13
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f2e06>] ? security_file_permission+0x16/0x20
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811641b8>] vfs_write+0xb8/0x1a0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81164cb2>] sys_pwrite64+0x82/0xa0
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8104939c>] sys32_pwrite+0x1c/0x20
May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810487ef>] sysenter_dispatch+0x7/0x2e
May 6 18:35:26 dhcp-135-15-75-65 kernel: .ready
Comment 2RHEL Program Management
2010-05-07 17:38:39 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release. This request is not yet committed for
inclusion.
Created attachment 412393 [details] Logs from one of the hosts during the test run Description of problem: Running IO to 64 volumes mapped to each host. I started a script to reboot the controllers on a 10 minute interval. Shortly after the second reboot, a stack trace was reported in the messages, saying that the multipathd had hung. The IO tool also seems to have hung, and this caused IO to time out after 10 minutes. The problem cannot be with the IO tool, because I have been using the same version with our RHEL6 testing and did not have this problem until recently. Equipment: 2 hosts x 2 arrays Host 1 has Emulex HBAs Host 2 has QLogic HBAs Both hosts are running RHEL6 snapshot 2. Version-Release number of selected component (if applicable): 0.4.9-18 How reproducible: Always Steps to Reproduce: 1. Install RHEL6 on the hosts, map 64 volumes to each 2. Start IO and an automated reboot script 3. Wait for IO to time out after the second controller reboot Actual results: After the second reboot, Multipathd and LinuxSmash report that they are hung and output a stack trace. 10 minutes later, the IO times out. It looks like the path checker failed the paths, but failover did not take place. Expected results: 10 seconds after the reboot, the host should detect the path loss. Multipath checker should mark the paths failed, and the host should issue a mode select to give ownership of the volumes to the alternate controller. After the controller reboot is complete, the volumes should fail back. Additional info: Here is the stack trace: May 6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task multipathd:6190 blocked for more than 120 seconds. May 6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 6 18:35:11 dhcp-135-15-75-65 kernel: multipathd D ffffffffffffffff 0 6190 1 0x00000080 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 0000000000000086 ffff880419941200 0000000000000001 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fb085c78 ffffffffa000482a 0000000000000000 00000000c4fbd4a6 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff8803fd4806b8 ffff8803fb085fd8 000000000000fcb8 ffff8803fd4806b8 May 6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace: May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000482a>] ? validate_hardware_logical_block_alignment+0x7a/0x140 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c8522>] __mutex_lock_slowpath+0xe2/0x160 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c841b>] mutex_lock+0x2b/0x50 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0000b3f>] dm_swap_table+0xdf/0x250 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0003fa2>] ? suspend_targets+0x42/0x80 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa000806f>] dev_suspend+0x11f/0x240 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0007f50>] ? dev_suspend+0x0/0x240 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008993>] ctl_ioctl+0x1a3/0x240 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112f6ac>] ? handle_pte_fault+0x3cc/0x990 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffffa0008a43>] dm_ctl_ioctl+0x13/0x20 [dm_mod] May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811769d2>] vfs_ioctl+0x22/0xa0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81176b74>] do_vfs_ioctl+0x84/0x580 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8112fe5e>] ? handle_mm_fault+0x1ee/0x2b0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811770f1>] sys_ioctl+0x81/0xa0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013132>] system_call_fastpath+0x16/0x1b May 6 18:35:11 dhcp-135-15-75-65 kernel: INFO: task LinuxSmash:9597 blocked for more than 120 seconds. May 6 18:35:11 dhcp-135-15-75-65 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 6 18:35:11 dhcp-135-15-75-65 kernel: LinuxSmash D ffff880418359c28 0 9597 7752 0x00020080 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359b48 0000000000000082 0000000000000021 ffff880405c8a2f0 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880418359ad8 ffffffff8122d712 ffff88040d77a948 0000000000000021 May 6 18:35:11 dhcp-135-15-75-65 kernel: ffff880412255b38 ffff880418359fd8 000000000000fcb8 ffff880412255b38 May 6 18:35:11 dhcp-135-15-75-65 kernel: Call Trace: May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8122d712>] ? __elv_add_request+0x72/0xd0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7dc5>] schedule_timeout+0x225/0x2f0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8123699f>] ? generic_make_request+0x1af/0x4f0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81109fc5>] ? mempool_alloc_slab+0x15/0x20 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7a1d>] wait_for_common+0x11d/0x190 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81058d60>] ? default_wake_function+0x0/0x20 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81236d62>] ? submit_bio+0x82/0x110 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff814c7b4d>] wait_for_completion+0x1d/0x20 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81239ef6>] blkdev_issue_flush+0xa6/0xe0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811990bf>] blkdev_fsync+0x3f/0x60 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f441>] vfs_fsync_range+0xa1/0xe0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8118f4cb>] generic_write_sync+0x4b/0x50 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81197c8e>] blkdev_aio_write+0x7e/0xa0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81163eba>] do_sync_write+0xfa/0x140 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810a1260>] ? do_futex+0x100/0xae0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8108dc30>] ? autoremove_wake_function+0x0/0x40 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f4751>] ? cap_file_permission+0x1/0x10 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81013a8e>] ? common_interrupt+0xe/0x13 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811f2e06>] ? security_file_permission+0x16/0x20 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff811641b8>] vfs_write+0xb8/0x1a0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff81164cb2>] sys_pwrite64+0x82/0xa0 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff8104939c>] sys32_pwrite+0x1c/0x20 May 6 18:35:11 dhcp-135-15-75-65 kernel: [<ffffffff810487ef>] sysenter_dispatch+0x7/0x2e May 6 18:35:26 dhcp-135-15-75-65 kernel: .ready