Red Hat Bugzilla – Bug 1009700
mirror device failure scenarios lead to deadlocks
Last modified: 2013-11-21 18:28:41 EST
Description of problem:

This may be another version of bug 637936.

Scenario kill_both_logs_2_legs_2_logs: Kill both logs of synced 2 leg redundant log mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_both_logs_2legs_2logs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdh1 /dev/sdf1
* log devices:        /dev/sdc1 /dev/sdg1
* no MDA devices:
* failpv(s):          /dev/sdc1 /dev/sdg1
* failnode(s):        virt-024.cluster-qe.lab.eng.brq.redhat.com
* additional snap:    /dev/sdh1
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
virt-024.cluster-qe.lab.eng.brq.redhat.com: lvcreate --mirrorlog mirrored -m 1 -n syncd_both_logs_2legs_2logs_1 -L 500M helter_skelter /dev/sdh1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-150 /dev/sdg1:0-150

Creating a snapshot volume of each of the mirrors
WARNING: Snapshots of mirrors can deadlock under rare device failures.
WARNING: Consider using the raid1 mirror type to avoid this.
WARNING: See global/mirror_segtype_default in lvm.conf.
Current mirror/raid device structure(s):
  LV                                             Attr       LSize   Cpy%Sync Devices
  hs_snap1                                       swi-a-s--- 252.00m          /dev/sdh1(125)
  syncd_both_logs_2legs_2logs_1                  owi-a-m--- 500.00m 37.60    syncd_both_logs_2legs_2logs_1_mimage_0(0),syncd_both_logs_2legs_2logs_1_mimage_1(0)
  [syncd_both_logs_2legs_2logs_1_mimage_0]       Iwi-aom--- 500.00m          /dev/sdh1(0)
  [syncd_both_logs_2legs_2logs_1_mimage_1]       Iwi-aom--- 500.00m          /dev/sdf1(0)
  [syncd_both_logs_2legs_2logs_1_mlog]           mwi-aom---   4.00m 100.00   syncd_both_logs_2legs_2logs_1_mlog_mimage_0(0),syncd_both_logs_2legs_2logs_1_mlog_mimage_1(0)
  [syncd_both_logs_2legs_2logs_1_mlog_mimage_0]  iwi-aom---   4.00m          /dev/sdc1(0)
  [syncd_both_logs_2legs_2logs_1_mlog_mimage_1]  iwi-aom---   4.00m          /dev/sdg1(0)

PV=/dev/sdg1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 1.4
PV=/dev/sdc1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 1.4
PV=/dev/sdg1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 1.4
PV=/dev/sdc1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 1.4

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-024.cluster-qe.lab.eng.brq.redhat.com...
Writing verification files (checkit) to mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----

Disabling device sdc on virt-024.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdg on virt-024.cluster-qe.lab.eng.brq.redhat.com

[deadlock]

[root@virt-024 /]# dmsetup ls
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_1       (253:6)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_1  (253:3)
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_0       (253:5)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_0  (253:2)
helter_skelter-hs_snap1-cow                                 (253:10)
helter_skelter-syncd_both_logs_2legs_2logs_1                (253:7)
helter_skelter-hs_snap1                                     (253:8)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog           (253:4)
helter_skelter-syncd_both_logs_2legs_2logs_1-real           (253:9)

kernel: INFO: task dmeventd:10613 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dmeventd      D 0000000000000000     0 10613      1 0x00000080
kernel: ffff880015cbda38 0000000000000086 ffff88001feb5337 0000000100000001
kernel: ffff880015cbd9f0 0000000081065dbe ffff88001feb5338 ffffffff811337c0
kernel: ffff880019f5bab8 ffff880015cbdfd8 000000000000fbc8 ffff880019f5bab8
kernel: Call Trace:
kernel: [<ffffffff811337c0>] ? __writepage+0x0/0x40
kernel: [<ffffffff8109b72e>] ? prepare_to_wait+0x4e/0x80
kernel: [<ffffffffa03d5ddd>] log_wait_commit+0x10d/0x170 [jbd]
kernel: [<ffffffff8109b400>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffffa04012d6>] ext3_sync_fs+0x76/0xb0 [ext3]
kernel: [<ffffffff811eff8e>] sync_quota_sb+0x5e/0x130
kernel: [<ffffffff811bac8a>] __sync_filesystem+0x7a/0x90
kernel: [<ffffffff811bae9b>] sync_filesystem+0x4b/0x70
kernel: [<ffffffff811c615e>] freeze_bdev+0xfe/0x280
kernel: [<ffffffffa0002787>] dm_suspend+0x97/0x290 [dm_mod]
kernel: [<ffffffffa000780c>] ? __find_device_hash_cell+0xac/0x170 [dm_mod]
kernel: [<ffffffffa0008bb0>] ? dev_suspend+0x0/0x250 [dm_mod]
kernel: [<ffffffffa0008c26>] dev_suspend+0x76/0x250 [dm_mod]
kernel: [<ffffffffa0008bb0>] ? dev_suspend+0x0/0x250 [dm_mod]
kernel: [<ffffffffa0009a84>] ctl_ioctl+0x214/0x450 [dm_mod]
kernel: [<ffffffffa0009cd3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
kernel: [<ffffffff8119daa2>] vfs_ioctl+0x22/0xa0
kernel: [<ffffffff8119dc44>] do_vfs_ioctl+0x84/0x580
kernel: [<ffffffff8119e1c1>] sys_ioctl+0x81/0xa0
kernel: [<ffffffff810e1f7e>] ? __audit_syscall_exit+0x25e/0x290
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

kernel: INFO: task kmirrord:10595 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kmirrord      D 0000000000000000     0 10595      2 0x00000080
kernel: ffff88001a095bc0 0000000000000046 ffff8800ffffffff 000008e01eb0d423
kernel: ffff8800398a8948 ffff88003d81ac90 000000000026c4ea ffffffffadc5f788
kernel: ffff88003a57baf8 ffff88001a095fd8 000000000000fbc8 ffff88003a57baf8
kernel: Call Trace:
kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
kernel: [<ffffffffa000a315>] sync_io+0x95/0x110 [dm_mod]
kernel: [<ffffffffa00026d0>] ? dm_unplug_all+0x50/0x70 [dm_mod]
kernel: [<ffffffff811221c5>] ? mempool_kmalloc+0x15/0x20
kernel: [<ffffffff811223a3>] ? mempool_alloc+0x63/0x140
kernel: [<ffffffffa000a4c5>] ? dm_io+0xc5/0x1c0 [dm_mod]
kernel: [<ffffffffa0009d50>] ? bvec_get_page+0x0/0x30 [dm_mod]
kernel: [<ffffffffa000a5b7>] dm_io+0x1b7/0x1c0 [dm_mod]
kernel: [<ffffffffa0009e20>] ? vm_get_page+0x0/0x70 [dm_mod]
kernel: [<ffffffffa0009d90>] ? vm_next_page+0x0/0x30 [dm_mod]
kernel: [<ffffffffa001fc61>] disk_flush+0x91/0x170 [dm_log]
kernel: [<ffffffffa0026740>] ? dm_rh_inc+0xc0/0xd0 [dm_region_hash]
kernel: [<ffffffffa00260d3>] dm_rh_flush+0x13/0x20 [dm_region_hash]
kernel: [<ffffffffa002f15f>] do_mirror+0x27f/0x6e0 [dm_mirror]
kernel: [<ffffffffa002eee0>] ? do_mirror+0x0/0x6e0 [dm_mirror]
kernel: [<ffffffff81094e80>] worker_thread+0x170/0x2a0
kernel: [<ffffffff8109b400>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff81094d10>] ? worker_thread+0x0/0x2a0
kernel: [<ffffffff8109b056>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109afc0>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

Version-Release number of selected component (if applicable):
2.6.32-419.el6.x86_64
lvm2-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
lvm2-libs-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
lvm2-cluster-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
udev-147-2.48.el6 BUILT: Fri Aug 9 13:09:50 CEST 2013
device-mapper-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-libs-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-event-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-event-libs-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
cmirror-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
This is reproducible.

lvm[3779]: Monitoring snapshot helter_skelter-hs_snap1
lvm[3779]: Monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1-real for events.
lvm[3779]: Mirror log status: 1 of 2 images failed.
lvm[3779]: Trying to up-convert to 2 images, 2 logs.
lvm[3779]: /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
lvm[3779]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
lvm[3779]: Monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog for events.
kernel: device-mapper: raid1: Unable to read primary mirror during recovery
kernel: device-mapper: raid1: Primary mirror (253:2) failed while out-of-sync: Reads may fail.
lvm[3779]: Trying to up-convert to 2 images, 1 logs.
lvm[3779]: intermediate VG write failed.
lvm[3779]: Trying to up-convert to 2 images, 0 logs.
lvm[3779]: intermediate VG write failed.
lvm[3779]: WARNING: Failed to replace 2 of 2 logs in volume syncd_both_logs_2legs_2logs_1
lvm[3779]: Repair of mirrored device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog finished successfully.
lvm[3779]: helter_skelter-syncd_both_logs_2legs_2logs_1-real is now in-sync.
lvm[3779]: Primary mirror device 253:2 sync failed.
lvm[3779]: No longer monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog for events.
lvm[3779]: No longer monitoring snapshot helter_skelter-hs_snap1
lvm[3779]: No longer monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1-real for events.
qarshd[13150]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_both_logs_2legs_2logs_1/ddfile count=10 bs=4M
qarshd[13152]: Running cmdline: sync
kernel: INFO: task kmirrord:12944 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kmirrord      D 0000000000000000     0 12944      2 0x00000080
kernel: ffff88003d707bc0 0000000000000046 ffff8800ffffffff 000018b03b040975
kernel: 0000000000000000 ffff88003d2c2ec0 00000000003c30d0 ffffffffadc5d87b
kernel: ffff88003d173ab8 ffff88003d707fd8 000000000000fbc8 ffff88003d173ab8
kernel: Call Trace:
kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
kernel: [<ffffffffa000a315>] sync_io+0x95/0x110 [dm_mod]
kernel: [<ffffffff811221c5>] ? mempool_kmalloc+0x15/0x20
kernel: [<ffffffff811223a3>] ? mempool_alloc+0x63/0x140
kernel: [<ffffffffa000a5b7>] dm_io+0x1b7/0x1c0 [dm_mod]
I've hit what appears to be this same issue with two other helter_skelter scenarios.

================================================================================
Iteration 0.15 started at Thu Sep 26 20:38:52 CDT 2013
================================================================================
Scenario kill_multiple_legs_synced_4_legs: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_4legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sdg1 /dev/sdh1
* log devices:        /dev/sdb1
* no MDA devices:
* failpv(s):          /dev/sdc1 /dev/sdh1
* failnode(s):        virt-020.cluster-qe.lab.eng.brq.redhat.com
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

[...]

Disabling device sdc on virt-020.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdh on virt-020.cluster-qe.lab.eng.brq.redhat.com

Getting recovery check start time from /var/log/messages: Sep 27 03:40
Attempting I/O to cause mirror down conversion(s) on virt-020.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.233749 s, 179 MB/s

LOG:
Sep 27 03:40:24 virt-020 qarshd[23925]: Running cmdline: echo offline > /sys/block/sdc/device/state &
Sep 27 03:40:26 virt-020 qarshd[23928]: Running cmdline: echo offline > /sys/block/sdh/device/state &
[...]
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Mirror status: 1 of 4 images failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 4 images, 1 logs.
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to write intermediate VG helter_skelter metadata for mirror conversion.
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to insert resync layer
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to insert resync layer
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 3 images, 1 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 2 images, 1 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 2 images, 0 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: WARNING: Failed to replace 2 of 4 images in volume syncd_multiple_legs_4legs_1
Sep 27 03:40:26 virt-020 lvm[3807]: WARNING: Failed to replace 1 of 1 logs in volume syncd_multiple_legs_4legs_1
Sep 27 03:40:26 virt-020 lvm[3807]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1 finished successfully.
[...]
Sep 27 03:40:36 virt-020 qarshd[23951]: Running cmdline: sync
Sep 27 03:43:15 virt-020 kernel: INFO: task flush-253:7:23884 blocked for more than 120 seconds.
Sep 27 03:43:15 virt-020 kernel: Not tainted 2.6.32-419.el6.x86_64 #1
Sep 27 03:43:15 virt-020 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 27 03:43:15 virt-020 kernel: flush-253:7   D 0000000000000000     0 23884      2 0x00000080
Sep 27 03:43:15 virt-020 kernel: ffff88003e1cd840 0000000000000046 ffff8800ffffffff 00024f4f9010a407
Sep 27 03:43:15 virt-020 kernel: ffff88003e1cd7b0 ffff88003de0e3d0 0000000001a603f4 ffffffffadbb70bd
Sep 27 03:43:15 virt-020 kernel: ffff880020a89af8 ffff88003e1cdfd8 000000000000fbc8 ffff880020a89af8
Sep 27 03:43:15 virt-020 kernel: Call Trace:
Sep 27 03:43:15 virt-020 kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0a0>] ? sync_buffer+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0e0>] sync_buffer+0x40/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81528b6a>] __wait_on_bit_lock+0x5a/0xc0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0a0>] ? sync_buffer+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81528c48>] out_of_line_wait_on_bit_lock+0x78/0x90
Sep 27 03:43:15 virt-020 kernel: [<ffffffff8109b480>] ? wake_bit_function+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf410>] ? end_buffer_async_write+0x0/0x190
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf286>] __lock_buffer+0x36/0x40
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c0585>] __block_write_full_page+0x305/0x330
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf410>] ? end_buffer_async_write+0x0/0x190
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c0690>] block_write_full_page_endio+0xe0/0x120
Sep 27 03:43:15 virt-020 kernel: [<ffffffffa03aeb30>] ? buffer_unmapped+0x0/0x20 [ext3]
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c06e5>] block_write_full_page+0x15/0x20
Sep 27 03:43:15 virt-020 kernel: [<ffffffffa03af6dd>] ext3_ordered_writepage+0x1ed/0x240 [ext3]
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811337d7>] __writepage+0x17/0x40
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81134a9d>] write_cache_pages+0x1fd/0x4c0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811337c0>] ? __writepage+0x0/0x40
Another example:

Scenario kill_primary_synced_3_legs: Kill primary leg of synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_3legs_1 syncd_primary_3legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdf1 /dev/sdd1 /dev/sdh1
* log devices:        /dev/sdg1
* no MDA devices:
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   remove
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n syncd_primary_3legs_1 -L 500M helter_skelter /dev/sdf1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150
taft-01: lvcreate -m 2 -n syncd_primary_3legs_2 -L 500M helter_skelter /dev/sdf1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150

[...]

Disabling device sdf on taft-01

LOG:
Sep 26 22:40:16 taft-01 qarshd[26774]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Sep 26 22:40:17 taft-01 lvm[3372]: Primary mirror device 253:4 has failed (D).
Sep 26 22:40:17 taft-01 lvm[3372]: Device failure in helter_skelter-syncd_primary_3legs_1.
Sep 26 22:40:17 taft-01 kernel: device-mapper: raid1: Mirror read failed from 253:9. Trying alternative device.
[...]
Sep 26 22:40:22 taft-01 kernel: sd 3:0:0:5: rejecting I/O to offline device
Sep 26 22:40:22 taft-01 kernel: sd 3:0:0:5: rejecting I/O to offline device
Sep 26 22:43:47 taft-01 kernel: INFO: task dmeventd:26626 blocked for more than 120 seconds.
Sep 26 22:43:47 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 22:43:47 taft-01 kernel: dmeventd      D 0000000000000002     0 26626      1 0x00000080
Sep 26 22:43:47 taft-01 kernel: ffff88021b57ba98 0000000000000086 000000000003ffff ffffea0006c29fa0
Sep 26 22:43:47 taft-01 kernel: ffff8801ef0e4cd8 ffff88021b57bc08 ffffffff811bfa40 0000000000000286
Sep 26 22:43:47 taft-01 kernel: ffff88021824fab8 ffff88021b57bfd8 000000000000fbc8 ffff88021824fab8
Sep 26 22:43:47 taft-01 kernel: Call Trace:
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfa40>] ? blkdev_get_block+0x0/0x20
Sep 26 22:43:47 taft-01 kernel: [<ffffffff810a4421>] ? ktime_get_ts+0xb1/0xf0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111c8c0>] ? sync_page+0x0/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81520b53>] io_schedule+0x73/0xc0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111c8fd>] sync_page+0x3d/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8152161f>] __wait_on_bit+0x5f/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111cb33>] wait_on_page_bit+0x73/0x80
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81098620>] ? wake_bit_function+0x0/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81131bd5>] ? pagevec_lookup_tag+0x25/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111cf5b>] wait_on_page_writeback_range+0xfb/0x190
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81130c44>] ? generic_writepages+0x24/0x30
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81130c71>] ? do_writepages+0x21/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111d0ab>] ? __filemap_fdatawrite_range+0x5b/0x60
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111d128>] filemap_write_and_wait_range+0x78/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111e4b8>] generic_file_aio_read+0x418/0x700
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81229351>] ? avc_has_perm+0x71/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8122b1b2>] ? selinux_inode_permission+0x72/0xb0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfe81>] blkdev_aio_read+0x51/0x80
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81184ada>] do_sync_read+0xfa/0x140
Sep 26 22:43:47 taft-01 kernel: [<ffffffff810985a0>] ? autoremove_wake_function+0x0/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfddc>] ? block_ioctl+0x3c/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81199382>] ? vfs_ioctl+0x22/0xa0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8122ed7b>] ? selinux_file_permission+0xfb/0x150
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81221c56>] ? security_file_permission+0x16/0x20
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811853c5>] vfs_read+0xb5/0x1a0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81185501>] sys_read+0x51/0x90
Just to be clear, snapshots of mirrors (which are known to deadlock) are not required to cause this issue.

================================================================================
Iteration 1.6 started at Tue Oct 1 19:49:14 CDT 2013
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdg1 /dev/sdd1
* log devices:        /dev/sdc1
* no MDA devices:
* failpv(s):          /dev/sdg1
* failnode(s):        taft-01
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-150
taft-01: lvcreate -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-150

Current mirror/raid device structure(s):
  LV                                 Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1              mwi-a-m--- 500.00m 18.40    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0]   Iwi-aom--- 500.00m          /dev/sdg1(0)
  [syncd_primary_2legs_1_mimage_1]   Iwi-aom--- 500.00m          /dev/sdd1(0)
  [syncd_primary_2legs_1_mlog]       lwi-aom---   4.00m          /dev/sdc1(0)
  syncd_primary_2legs_2              mwi-a-m--- 500.00m 3.20     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0]   Iwi-aom--- 500.00m          /dev/sdg1(125)
  [syncd_primary_2legs_2_mimage_1]   Iwi-aom--- 500.00m          /dev/sdd1(125)
  [syncd_primary_2legs_2_mlog]       lwi-aom---   4.00m          /dev/sdc1(1)

Waiting until all mirror|raid volumes become fully syncd...
0/2 mirror(s) are fully synced: ( 77.70% 62.30% )
2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-01...

PV=/dev/sdg1
  syncd_primary_2legs_1_mimage_0: 5.1
  syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sdg1
  syncd_primary_2legs_1_mimage_0: 5.1
  syncd_primary_2legs_2_mimage_0: 5.1

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----

Disabling device sdg on taft-01

[DEADLOCK]

qarshd[19223]: Running cmdline: echo offline > /sys/block/sdg/device/state &
lvm[3265]: Primary mirror device 253:8 has failed (D).
lvm[3265]: Device failure in helter_skelter-syncd_primary_2legs_2.
kernel: device-mapper: raid1: Mirror read failed from 253:4. Trying alternative device.
[...]
kernel: sd 3:0:0:6: rejecting I/O to offline device
kernel: INFO: task dmeventd:19107 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dmeventd      D 0000000000000003     0 19107      1 0x00000080
kernel: ffff880218feda98 0000000000000086 000000000003ffff ffffea0007183b18
kernel: ffff880214ae6318 ffff880218fedc08 ffffffff811bfa40 0000000000000286
kernel: ffff880218cc6638 ffff880218fedfd8 000000000000fbc8 ffff880218cc6638
kernel: Call Trace:
kernel: [<ffffffff811bfa40>] ? blkdev_get_block+0x0/0x20
kernel: [<ffffffff810a4421>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff8111c8c0>] ? sync_page+0x0/0x50
kernel: [<ffffffff81520b53>] io_schedule+0x73/0xc0
kernel: [<ffffffff8111c8fd>] sync_page+0x3d/0x50
kernel: [<ffffffff8152161f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffff8111cb33>] wait_on_page_bit+0x73/0x80
kernel: [<ffffffff81098620>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffff81131bd5>] ? pagevec_lookup_tag+0x25/0x40
kernel: [<ffffffff8111cf5b>] wait_on_page_writeback_range+0xfb/0x190
kernel: [<ffffffff81130c44>] ? generic_writepages+0x24/0x30
kernel: [<ffffffff81130c71>] ? do_writepages+0x21/0x40
kernel: [<ffffffff8111d0ab>] ? __filemap_fdatawrite_range+0x5b/0x60
kernel: [<ffffffff8111d128>] filemap_write_and_wait_range+0x78/0x90
kernel: [<ffffffff8111e4b8>] generic_file_aio_read+0x418/0x700
kernel: [<ffffffff81229351>] ? avc_has_perm+0x71/0x90
kernel: [<ffffffff8122b1b2>] ? selinux_inode_permission+0x72/0xb0
kernel: [<ffffffff811bfe81>] blkdev_aio_read+0x51/0x80
kernel: [<ffffffff81184ada>] do_sync_read+0xfa/0x140
kernel: [<ffffffff810985a0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff811bfddc>] ? block_ioctl+0x3c/0x40
kernel: [<ffffffff81199382>] ? vfs_ioctl+0x22/0xa0
kernel: [<ffffffff8122ed7b>] ? selinux_file_permission+0xfb/0x150
kernel: [<ffffffff81221c56>] ? security_file_permission+0x16/0x20
kernel: [<ffffffff811853c5>] vfs_read+0xb5/0x1a0
kernel: [<ffffffff81185501>] sys_read+0x51/0x90
kernel: [<ffffffff810deefe>] ? __audit_syscall_exit+0x25e/0x290
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
So, we /don't/ need: 1) a mirrored log 2) snapshots to produce this issue. (right?)
Can it be reproduced with 1 mirror, or is more than one required (or one of the above conditions)? IOW, what is the simplest form in which we can reproduce the bug?
Having trouble reproducing. I've seen it once with a borrowed QA machine and haven't reproduced it since turning on debugging. I'm working to get my machines moved over to kvm so I can test on them as well.
Moved all my VMs over to kvm. Tests fail to run there also.
This message seems to precede the issues you are seeing:

lvm[3779]: intermediate VG write failed.

I don't see it in every one of your above reports, but I don't know whether all the logs are included in each of them. Clearly, failing to do a vg_write is going to cause problems. I'm not sure why it is failing, though.
It appears that 2+ mirrors are required to force this issue to happen. In the cases I've been seeing, the devices are failed and should affect all mirrors in the same way. (I am running 'kill_both_logs_2_legs_2_logs', which kills both devices of a mirrored log on both mirrors at once.) However, the kernel does not detect the failure of a device until it has been written to. I am seeing the following status at the beginning of the repair command:

lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies helter_skelter/syncd_both_logs_2legs_2logs_1

helter_skelter-syncd_both_logs_2legs_2logs_2_mlog: 0 8192 mirror 2 253:9 253:10 8/8 1 AA 1 core
helter_skelter-syncd_both_logs_2legs_2logs_2_mlog_mimage_1: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mimage_1: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mlog_mimage_0: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mimage_0: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_1: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_0: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2: 0 1024000 mirror 2 253:12 253:13 1000/1000 1 AA 3 disk 253:11 A
helter_skelter-syncd_both_logs_2legs_2logs_1: 0 1024000 mirror 2 253:6 253:7 1000/1000 1 AA 3 disk 253:5 D
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 DD 1 core

You can see that the mirrored log has failed and been noticed in helter_skelter-syncd_both_logs_2legs_2logs_1, but not in helter_skelter-syncd_both_logs_2legs_2logs_2. I think the repair is attempting to read the second mirror (because the kernel has not yet noticed the failure) and getting stuck. The traces seem to confirm this.
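For reference, the per-device health characters in that status output can be picked out mechanically. Below is a rough sketch of how one might parse the dm-mirror status lines shown above ('A' = alive, 'D' = a failure the kernel has already detected). This is a simplified illustration, not the parser LVM itself uses; it assumes the fixed "1 &lt;health&gt;" field layout and a 'disk' or 'core' log as in this output.

```python
def parse_mirror_status(line):
    """Parse one 'dmsetup status' line; return (name, leg_health, log_health).

    For non-mirror targets (e.g. linear) both health fields are None.
    log_health is None for a 'core' log, which carries no device health char.
    """
    name, rest = line.split(":", 1)      # device names here contain no colon
    fields = rest.split()
    if fields[2] != "mirror":
        return name, None, None
    nmirrors = int(fields[3])
    # fields[4 : 4+nmirrors] are the leg devices (major:minor), followed by
    # "<in_sync>/<total>", then "1", then one health char per leg
    leg_health = fields[4 + nmirrors + 2]
    log_args = fields[4 + nmirrors + 3:]   # e.g. ['3', 'disk', '253:5', 'D']
    log_health = log_args[-1] if log_args[1] == "disk" else None
    return name, leg_health, log_health
```

Run against the status above, this flags syncd_both_logs_2legs_2logs_1 (disk log health 'D') and its _mlog ('DD' legs) as detected failures, while syncd_both_logs_2legs_2logs_2 still reports all 'A' even though its log devices are equally dead.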
If I replace the second mirror with an error target, the trace from the repair command reports an I/O error reading the second mirror. The fix will involve additional efforts to avoid reading mirrors. (I was pretty certain all cases were covered... perhaps there has been a regression?)
Somehow (I'm not sure yet) the code is making it past 'device_is_usable' and hitting the filter check - which is where it is hanging:

291         if (!device_is_usable(dev)) {
292                 log_debug_devs("%s: Skipping unusable device", dev_name(dev));
293                 return 0;
294         }
295         return pf->real->passes_filter(pf->real, dev);   // HANG!

backtrace:
#0  0x0000003af140e75d in read () from /lib64/libpthread.so.0
#1  0x00007fd779097427 in _io (where=0x7fd778ef7060, buffer=0x7fd75c066400 "", should_write=0) at device/dev-io.c:97
#2  0x00007fd7790978a2 in _aligned_io (where=0x7fd778ef70e0, buffer=0x7fd778ef712c "", should_write=0) at device/dev-io.c:204
#3  0x00007fd779098e51 in dev_read (dev=0x7fd75c06f9c0, offset=0, len=4, buffer=0x7fd778ef712c) at device/dev-io.c:671
#4  0x00007fd779099320 in _dev_has_md_magic (dev=0x7fd75c06f9c0, sb_offset=0) at device/dev-md.c:37
#5  0x00007fd7790994a9 in dev_is_md (dev=0x7fd75c06f9c0, sb=0x0) at device/dev-md.c:113
#6  0x00007fd7790a232c in _ignore_md (f=0x7fd75c00be40, dev=0x7fd75c06f9c0) at filters/filter-md.c:29
#7  0x00007fd77909facf in _and_p (f=0x7fd75c00be80, dev=0x7fd75c06f9c0) at filters/filter-composite.c:24
#8  0x00007fd7790a0bff in _lookup_p (f=0x7fd75c020800, dev=0x7fd75c06f9c0) at filters/filter-persistent.c:295
#9  0x00007fd779097191 in dev_iter_get (iter=0x7fd75c038670) at device/dev-cache.c:1011
#10 0x00007fd7790873c9 in lvmcache_label_scan (cmd=0x208efa0, full_scan=0) at cache/lvmcache.c:691
bisecting:

e9f83147d5da3120604c67354b4d6f027ac10ed0  GOOD
05bf4b8cc3e3d15e79c9964e496dbe396beacb3c  GOOD
6d2604f02651158223ac31fe58e0638b6171171c  GOOD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  GOOD
bd75844024ad4a2e66c7743c50e189a00c233284  GOOD
dd796d6a9485287973e61c427ee45694965203a0  BAD
9658032cba4f6a2350bb35d5133a436bad84fec2  BAD
a94e28ae019b94aa48dbaf1db83cc30bfdee0f4f  BAD
85b9c12e92873a87b7c02bac20287beeea82ef2a  BAD
861a3b2f1940cd3667cc353c762ff3d1ef75c37b  GOOD
30432bd6048ce9c52eefeba9e99d1b0805c6c59b  FINAL BAD **

but that doesn't seem to make sense...
Retesting (with 20 test iterations instead of 4):

861a3b2f1940cd3667cc353c762ff3d1ef75c37b  BAD
bd75844024ad4a2e66c7743c50e189a00c233284  BAD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  BAD
6d2604f02651158223ac31fe58e0638b6171171c  GOOD

Restarting bisect:

6d2604f02651158223ac31fe58e0638b6171171c  GOOD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  BAD
003f08c1642f060e57c55e950d4690d470451602  GOOD
f1dc4d3d81456d506b2c56fb2e8b12106cbd9e16  BAD
bb457adbb6d625a5f942faf8498bb5ee87645ec3  BAD
68c2d352ec2746ab278856388c6203c604f6248a  BAD
f50e8fb9b9030c86f3a65f61a4b7f43089a26c60  GOOD
b15278c3dca14d7ee09a2ebcca4b91cbdf8428af  GOOD
c95f17ea64e3faee3d36be95544c907b5e89c073  FINAL BAD **

This looks right.

commit c95f17ea64e3faee3d36be95544c907b5e89c073
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Wed Aug 7 17:42:26 2013 -0500

    Mirror: Fix issue preventing PV creation on mirror LVs

    Commit b248ba0a396d7fc9a459eea02cfdc70b33ce3441 attempted to prevent mirror
    devices which had a failed device in their mirrored log from being
    usable/readable by LVM.  This was to protect against circular dependancies
    where one LVM command could be blocked trying to read one of these affected
    mirrors while the LVM command to fix/unblock that mirror was stuck behind
    the currently running command.

    The above commit went wrong when it used 'device_is_usable()' to recurse on
    the mirrored log device to check if it was suspended or blocked.  The
    'device_is_usable' function also contains a check for reserved names - like
    *_mlog, etc.  This last check always triggered when checking a mirror's log
    simply because of the name, not because it was suspended or blocked - a
    false positive.

    The solution is to create a new function like 'device_is_usable', but
    without the check for reserved names.  Using this new function
    (device_is_suspended_or_blocked), we can check the status of a mirror's
    log device properly.
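For what it's worth, the nonsensical result of the first bisect is exactly what a timing-dependent reproducer does to bisection: a single bad commit that happens to test GOOD sends the search into the wrong half. A toy model of both halves of the problem (the 0.3 per-iteration hit rate is a made-up number, not a measurement from this bug):

```python
def bisect_first_bad(commits, is_bad):
    # classic bisect: find the first bad commit, assuming verdicts are
    # reliable and monotone (all GOOD commits precede all BAD ones)
    lo, hi = 0, len(commits) - 1       # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid + 1
    return commits[hi]

# If the deadlock reproduces with probability p per test iteration, a truly
# BAD commit is misjudged GOOD with probability (1 - p) ** iterations.
p = 0.3                        # hypothetical per-iteration hit rate
miss_with_4 = (1 - p) ** 4     # roughly 1 miss in 4 bisect test runs
miss_with_20 = (1 - p) ** 20   # well under 1 in 1000
```

With reliable verdicts the bisect converges on the true first-bad commit; with one verdict in four wrong, a 10-step bisect is more likely than not to finish on the wrong commit, which is why re-running the suspect range with 20 iterations (as above) was the right move.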
I believe the above commit is correct, but it revealed a bug that was previously not seen because it enabled an otherwise unusable feature.

Two things I'm fairly certain of:
1) There must be more than one mirror that suffers device failure.
2) The mirrors must have mirrored logs.
Ok, now I'm virtually certain this is timing related. Here's the scenario:

- We have two mirrors with mirrored logs, with sub-LVs placed on the same devices.

- Killing both log devices will require replacement of the logs of both mirrors.

- When 'lvconvert --repair' is run, it is run in response to an event received from the kernel on ONE of the mirror logs. At this point, I/O has not yet been run on the second mirror to cause the kernel to notice the failure there as well. So, a 'dmsetup status' will reveal that ONLY ONE mirror and its mirrored log have failed devices.

- The 'lvconvert --repair' is careful to avoid reading from suspended DM devices and blocked mirrors (see lib/activate/dev_manager.c:_ignore_blocked_mirror_devices()), as this would cause the command to block indefinitely and lock out any other LVM commands. However, the second mirror is not suspended, and the failed devices that will cause it to block I/O haven't been detected yet! In fact (and this is crucial to why this bug is difficult to reproduce outside the STS testsuite), a read of the second mirror will succeed, because it doesn't require updating the log, which is where the device failures have occurred.

- Ongoing I/O to the second mirror finally causes the log failure to be detected in the kernel. Specifically, a write operation is necessary for the failure of the mirrored log to be noticed. Now, all I/O (even reads) to the entire mirror blocks, waiting for the mirrored log to be repaired.

All the pieces are now in place to cause the blockage. The 'lvconvert --repair' has gotten to the point where it thinks it is ok to check the 2nd mirror for an LVM label. Another process has caused the 2nd mirror to become blocked in the meantime. The resulting label read will cause the command, and any subsequent LVM commands, to hang.
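The window can be sketched as a toy state machine. This is only an illustration of the ordering described above, not real device-mapper or LVM code; every name in it is made up:

```python
class ToyMirror:
    """Models one mirror with a mirrored log whose log PVs may be dead."""

    def __init__(self):
        self.log_devices_dead = False   # the log PVs have actually failed
        self.failure_detected = False   # kernel has noticed (needs a log write)

    def write(self):
        # a write must update the log, so it surfaces the failure; from
        # then on ALL I/O (even reads) blocks pending repair
        if self.log_devices_dead:
            self.failure_detected = True
            return "blocked"
        return "ok"

    def read(self):
        # reads don't touch the log, so they keep succeeding until a
        # write has surfaced the failure
        return "blocked" if self.failure_detected else "ok"


# The window: both mirrors' log devices are dead, but mirror 2 has not
# been written to yet, so the kernel still reports it healthy.
m2 = ToyMirror()
m2.log_devices_dead = True

# repair's safety check can only see failures the kernel has detected
repair_thinks_m2_is_safe = not m2.failure_detected   # True!

m2.write()               # ongoing I/O lands first and blocks the mirror
label_read = m2.read()   # "blocked" -- this is the hung label scan
```

In this model, as in the bug, the repair command's check is not wrong per se; it is racing against the kernel's lazy failure detection, and the write from unrelated I/O flips the mirror into the blocking state after the check but before the label read.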
Frankly, I find it amazing that the timing of the STS test operations is such that it can hit this bug reliably, but this is exactly the conclusion that matches when looking at the logs, tracing through gdb, and comparing results when instrumenting the above by hand. I'm not quite sure how many other operations this could affect. I suppose most LVM commands that come across an undetected mirror failure could result in a hang. It seems like a likely possibility, and I'm not sure mirrors with mirrored logs are a requirement in all cases.
I can think of two solutions and a long-shot for this problem:
1) [easy but blunt]: Disallow mirrors from being PVs. Customers have filed bugs in the past after discovering that this is not allowed. I'm not sure how necessary this ability is, but this would immediately disallow any kind of VG-level stacking on mirrors - cluster or otherwise.
2) [harder but more precise]: Implement another filter, which should come before the other filters, that would check whether a mirror is stacked on a PV which has failed. I'm not sure how feasible this is. The ordering of devices being read for labels would be important: attempting to read a mirror with undetected failed devices before the failed PVs themselves would obviously make this impossible.
*) [long-shot]: I *think* this could affect mirrors that do not use mirrored logs. If that's not true, then we could disallow only mirrors with mirrored logs from being used as PVs. The only time I have ever heard of a customer wanting to stack VGs on mirrors is in a cluster (they want RAID10-like features in a cluster). Mirrors with mirrored logs are not available in a cluster, so those users won't know the difference. RAID1/10 is available in single machine mode, so those users shouldn't care either.
This would probably be the best solution, but it requires that the problem is isolated to mirrors with mirrored logs OR it requires us to live with the fact that we've shrunk the problem but not eliminated it.
Yes, I can make it happen without using 'mirrored' logs; the timing is tighter, though. So, any mirror with an undetected error can be problematic for any LVM command if the command reaches the point of label reading and some outside process then causes the mirror to become blocked by performing a write. The long-shot solution in comment 14 will not fix the problem entirely; it will simply make it harder to hit.
Discussion of potential workarounds/fixes ongoing.
I've implemented the code to disallow label-reading on mirrors with mirrored logs. It does not solve the problem. It is still somewhat simple to reproduce the problem with the QA test suite. We must decide whether to disallow mirrors as PVs until we implement asynchronous label reading. Because of the problems it can cause, I suggest we disable PVs on mirror LVs - perhaps at a minimum during repair.
Another solution that I'm testing at the moment is to introduce a new global variable (like _ignore_suspended_devices) called '_avoid_reading_mirrors'. It is set when cmd->handles_missing_pvs is set. No mirror devices encountered while this global is set will be read for labels. 'handles_missing_pvs' is set when performing repair-type operations. I'm not yet sure if this is enough to prevent initial label reads from hitting mirrors. I may have to add a configuration file option that controls whether or not to read mirrors for PVs. That may actually be a good solution: we would set the config option to /not/ read mirrors. Those users who complain could change the config file, but only after seeing a description that warns them why it is risky to allow reading mirrors and what their alternatives might be.
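For reference, the option that ultimately implemented this idea is devices/ignore_lvm_mirrors (see the d5896f0 commit later in this bug). A minimal lvm.conf fragment would look like this — the comments are mine, paraphrasing the behavior described here:

```
devices {
    # 1 = do not scan mirror LVs for PV labels. This disallows stacking
    #     PVs/VGs on mirror LVs, but avoids blocking on a mirror whose
    #     failed devices have not yet been detected by the kernel.
    # 0 = allow label scanning of mirror LVs (risk of hangs on failure).
    ignore_lvm_mirrors = 1
}
```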
(In reply to Jonathan Earl Brassow from comment #18)
> Another solution that I'm testing at the moment is to introduce a new global
> variable (like _ignore_suspended_devices) called '_avoid_reading_mirrors'.
> It is set when cmd->handles_missing_pvs is set. No mirror devices
> encountered while this global is set will be read for labels.
> 'handles_missing_pvs' is set when performing repair type operations. I'm
> not yet sure if this is enough to prevent initial label reads from hitting
> mirrors. I may have to make a configuration file option that adjusts
> whether or not to read mirrors for PVs. That may actually be a good
> solution. We would set the config option to /not/ read mirrors. Those
> users who complain could change the config file, but not after seeing a
> description that warns them of why it is bad to allow reading mirrors and
> what their alternatives might be.
This solution seems to work to prevent this bug. I can no longer reproduce the issue using the STS tests. I can create a VG on top of a mirrored LV and create and use LVs within that layered VG. However, it is impossible to '[vg|lv]change' or '[vg|lv]remove' the LVs in the layered VG, because those commands set 'handles_missing_pvs'. I wonder why that is... This could be worked around by setting 'avoid_reading_mirrors' in a better place.
Full backtrace of where the problem is occurring:
#0  0x0000003af140e75d in read () from /lib64/libpthread.so.0
#1  0x00007f9ea114f5ef in _io (where=0x7f9ea1126160, buffer=0x7f9e9c057800 "", should_write=0) at device/dev-io.c:97
#2  0x00007f9ea114fa6a in _aligned_io (where=0x7f9ea11261e0, buffer=0x7f9ea112622c "", should_write=0) at device/dev-io.c:204
#3  0x00007f9ea1151019 in dev_read (dev=0x7f9e9c06f4c8, offset=0, len=4, buffer=0x7f9ea112622c) at device/dev-io.c:671
#4  0x00007f9ea11514e8 in _dev_has_md_magic (dev=0x7f9e9c06f4c8, sb_offset=0) at device/dev-md.c:37
#5  0x00007f9ea1151671 in dev_is_md (dev=0x7f9e9c06f4c8, sb=0x0) at device/dev-md.c:113
#6  0x00007f9ea115a4f4 in _ignore_md (f=0x7f9e9c020810, dev=0x7f9e9c06f4c8) at filters/filter-md.c:29
#7  0x00007f9ea1157c97 in _and_p (f=0x7f9e9c020890, dev=0x7f9e9c06f4c8) at filters/filter-composite.c:24
#8  0x00007f9ea1158dc7 in _lookup_p (f=0x7f9e9c020700, dev=0x7f9e9c06f4c8) at filters/filter-persistent.c:295
#9  0x00007f9ea114f359 in dev_iter_get (iter=0x7f9e9c056d00) at device/dev-cache.c:1011
#10 0x00007f9ea113f591 in lvmcache_label_scan (cmd=0x1d80fa0, full_scan=0) at cache/lvmcache.c:691
#11 0x00007f9ea1192ba1 in _vg_read (cmd=0x1d80fa0, vgname=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, warnings=1, consistent=0x7f9ea1126584, precommitted=0) at metadata/metadata.c:3032
#12 0x00007f9ea1193e9a in vg_read_internal (cmd=0x1d80fa0, vgname=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, warnings=1, consistent=0x7f9ea1126584) at metadata/metadata.c:3437
#13 0x00007f9ea11958af in _vg_lock_and_read (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, lock_flags=36, status_flags=514, misc_flags=1048576) at metadata/metadata.c:4139
#14 0x00007f9ea1195c78 in vg_read (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, flags=1048576) at metadata/metadata.c:4246
#15 0x00007f9ea1195cb9 in vg_read_for_update (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, flags=0) at metadata/metadata.c:4257
#16 0x00007f9ea11e0477 in _get_lvconvert_vg (cmd=0x1d80fa0, name=0x7f9e9c055bc2 "helter_skelter", uuid=0x0) at lvconvert.c:563
#17 0x00007f9ea11e7a35 in get_vg_lock_and_logical_volume (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", lv_name=0x7f9e9c005bee "syncd_secondary_log_2legs_1") at lvconvert.c:2701
#18 0x00007f9ea11e7c74 in lvconvert_single (cmd=0x1d80fa0, lp=0x7f9ea1126790) at lvconvert.c:2739
#19 0x00007f9ea11e825b in lvconvert (cmd=0x1d80fa0, argc=1, argv=0x7f9ea1126938) at lvconvert.c:2848
#20 0x00007f9ea11f2300 in lvm_run_command (cmd=0x1d80fa0, argc=1, argv=0x7f9ea1126938) at lvmcmdline.c:1179
#21 0x00007f9ea120beb8 in lvm2_run (handle=0x1d80fa0,
#22 0x00007f9ea1480e5b in dmeventd_lvm2_run (cmdline=0x7f9ea1126b90 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies helter_skelter/syncd_secondary_log_2legs_1") at dmeventd_lvm.c:146
#23 0x00007f9ea1688ed6 in _remove_failed_devices (device=0x7f9e9c001330 "helter_skelter-syncd_secondary_log_2legs_1") at dmeventd_mirror.c:144
#24 0x00007f9ea168905d in process_event (dmt=0x7f9e9c0011e0, event=DM_EVENT_DEVICE_ERROR, unused=0x13a6ea8) at dmeventd_mirror.c:188
#25 0x0000000000403d5f in _do_process_event (thread=0x13a6e30, task=0x7f9e9c0011e0) at dmeventd.c:721
#26 0x000000000040409b in _monitor_thread (arg=0x13a6e30) at dmeventd.c:845
#27 0x0000003af14079d1 in start_thread () from /lib64/libpthread.so.0
#28 0x0000003af10e8b3d in clone () from /lib64/libc.so.6
Bug 855398 would resurface if we choose to disable PVs on mirror LVs. We may wish to add a config option that toggles the ability to use mirror LVs as PVs. We can skip them by default, or we can strongly encourage users to disable PVs on mirrors. The problem is that if we choose not to disable it by default, this bug has the potential to affect all mirror users. If we disable it by default, then only users who stack VGs on mirrors will be affected. The second group is much smaller, but bug 855398 does show there are some people out there who do this.
History of the capability to put PVs on mirrors:
RHEL 6.0) cluster VGs: yes; single machine: yes
* Bug 1009700 (i.e. this bug) is present, but not yet discovered.
RHEL 6.3) cluster VGs: NO; single machine: yes
- Commit 7a965168193968fd1a34c766f0b435e4f08ef4ac caused clvmd to always set _ignore_suspended_devices. The code at the time would always avoid reading mirror LVs if this variable was set, so stacking no longer worked in a cluster. Bug 855398 was filed in response.
- The fix for bug 855398 was the realization that avoiding mirrors whenever _ignore_suspended_devices was set solved the problem of reading blocked mirrors by simply ignoring mirrors altogether. Instead, mirrors needed to be checked for failed devices and avoided only when necessary. This was done through the mirror status string via 'ignore_blocked_mirror_devices()'.
* Bug 1009700 (i.e. this bug) is present, but not yet discovered.
RHEL 6.4) cluster VGs: yes; single machine: yes
- The changes made to fix bug 855398 were correct, but they resulted in a new bug being discovered - bug 1009700. This bug had always been present, but it hadn't been seen until this point because mirror LVs were not being read when _ignore_suspended_devices was set.
It should be noted that avoiding mirrors while '_ignore_suspended_devices' is set does not eliminate bug 1009700 (i.e. this bug). It simply means that the commands that had set the variable (like commands that perform repairs) would not hit it. Other commands like pvs, lvs, vgs, lvcreate, vgcreate, lvchange, vgchange, etc. are still able to block on mirrors with failures. Once blocked, those commands prevent any subsequent repair from proceeding.
Workaround committed upstream:
commit d5896f0afd28096c3dc04a9c7e9627b387f145fb
Author: Jonathan Brassow <jbrassow@redhat.com>
Date: Tue Oct 22 19:14:33 2013 -0500

Mirror: Fix hangs and lock-ups caused by attempting label reads of mirrors

There is a problem with the way mirrors have been designed to handle failures that is resulting in stuck LVM processes and hung I/O. When mirrors encounter a write failure, they block I/O and notify userspace to reconfigure the mirror to remove failed devices. This process is open to a couple of races:
1) Any LVM process other than the one that is meant to deal with the mirror failure can attempt to read the mirror, fail, and block other LVM commands (including the repair command) from proceeding, due to holding a lock on the volume group.
2) If there are multiple mirrors that suffer a failure in the same volume group, a repair can block while attempting to read the LVM label from one mirror while trying to repair the other.
Mitigation of these races has been attempted by disallowing label reading of mirrors that are either suspended or are indicated as blocking by the kernel. While this has considerably narrowed the window of opportunity for hitting the above problems, it hasn't closed it completely. This is because it is still possible to start an LVM command, read the status of the mirror as healthy, and then perform the read for the label at the moment after the failure is discovered by the kernel.
I can see two solutions to this problem:
1) Allow users to configure whether mirrors can be candidates for LVM labels (i.e. whether PVs can be created on mirror LVs). If the user chooses to allow label scanning of mirror LVs, it will be at the expense of a possible hang in I/O or LVM processes.
2) Instrument a way to allow asynchronous label reading - allowing blocked label reads to be ignored while continuing to process the LVM command.
This action would allow LVM commands to continue even though they would have otherwise blocked trying to read a mirror. They could then release their lock and allow a repair command to commence. In the case of race #2 above, the repair command already in progress could continue and repair the failed mirror. This patch implements solution #1. If solution #2 is developed later on, the configuration option created in #1 can be negated, allowing mirrors to be scanned for labels by default once again.
+#define DEFAULT_IGNORE_LVM_MIRRORS 1
Flipped to "0" for the 6.5 defaults, as agreed...
prajnoha, you may also need to switch 'ignore_lvm_mirrors' in the example config file...
QA, please be advised that you must change the configuration file setting to 'ignore_lvm_mirrors = 1' in order to verify that this bug has been fixed. Users will be made aware of the necessity to change the setting via the RHEL 6.5 release notes.
Tested with ignore_lvm_mirrors = 0, which caused the deadlock in the 3rd iteration. Changed the setting to ignore_lvm_mirrors = 1, and five iterations of the above-mentioned test (kill_both_logs_2_legs_2_logs) passed without issues. Marking VERIFIED with lvm2-2.02.100-7.el6.
This issue still exists in the latest rpms. I've reproduced it a few times now, even with 'ignore_lvm_mirrors = 1'. I'll leave this bug marked verified and clone this bug for 6.6, since it does appear to be harder to hit now with this fix.

# Set to 1 to disallow stacking and thereby avoid a possible deadlock.
ignore_lvm_mirrors = 1

2.6.32-425.el6.x86_64
lvm2-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-libs-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-cluster-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
udev-147-2.51.el6 BUILT: Thu Oct 17 13:14:34 CEST 2013
device-mapper-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-libs-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-libs-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-persistent-data-0.2.8-2.el6 BUILT: Mon Oct 21 16:14:25 CEST 2013
cmirror-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013

================================================================================
Iteration 0.10 started at Wed Oct 30 19:22:16 CDT 2013
================================================================================
Scenario kill_multiple_legs_synced_4_legs: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names: syncd_multiple_legs_4legs_1
* sync: 1
* striped: 0
* leg devices: /dev/sdb1 /dev/sdd1 /dev/sdf1 /dev/sde1
* log devices: /dev/sdc1
* no MDA devices:
* failpv(s): /dev/sdb1 /dev/sde1
* failnode(s): virt-002.cluster-qe.lab.eng.brq.redhat.com
* lvmetad: 0
* leg fault policy: allocate
* log fault policy: allocate
******************************************************

Creating mirror(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com...
virt-002.cluster-qe.lab.eng.brq.redhat.com: lvcreate -m 3 -n syncd_multiple_legs_4legs_1 -L 500M helter_skelter /dev/sdb1:0-1000 /dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdc1:0-150 Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_multiple_legs_4legs_1 mwi-a-m--- 500.00m 8.80 syncd_multiple_legs_4legs_1_mimage_0(0),syncd_multiple_legs_4legs_1_mimage_1(0),syncd_multiple_legs_4legs_1_mimage_2(0),syncd_multiple_legs_4legs_1_mimage_3(0) [syncd_multiple_legs_4legs_1_mimage_0] Iwi-aom--- 500.00m /dev/sdb1(0) [syncd_multiple_legs_4legs_1_mimage_1] Iwi-aom--- 500.00m /dev/sdd1(0) [syncd_multiple_legs_4legs_1_mimage_2] Iwi-aom--- 500.00m /dev/sdf1(0) [syncd_multiple_legs_4legs_1_mimage_3] Iwi-aom--- 500.00m /dev/sde1(0) [syncd_multiple_legs_4legs_1_mlog] lwi-aom--- 4.00m /dev/sdc1(0) Waiting until all mirror|raid volumes become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) Creating ext on top of mirror(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com... mke2fs 1.41.12 (17-May-2010) Mounting mirrored ext filesystems on virt-002.cluster-qe.lab.eng.brq.redhat.com... PV=/dev/sde1 syncd_multiple_legs_4legs_1_mimage_3: 3.1 PV=/dev/sdb1 syncd_multiple_legs_4legs_1_mimage_0: 4.1 PV=/dev/sde1 syncd_multiple_legs_4legs_1_mimage_3: 3.1 PV=/dev/sdb1 syncd_multiple_legs_4legs_1_mimage_0: 4.1 Writing verification files (checkit) to mirror(s) on... ---- virt-002.cluster-qe.lab.eng.brq.redhat.com ---- Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure Verifying files (checkit) on mirror(s) on... 
---- virt-002.cluster-qe.lab.eng.brq.redhat.com ---- Disabling device sdb on virt-002.cluster-qe.lab.eng.brq.redhat.com Disabling device sde on virt-002.cluster-qe.lab.eng.brq.redhat.com Getting recovery check start time from /var/log/messages: Oct 31 01:23 Attempting I/O to cause mirror down conversion(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.443132 s, 94.7 MB/s [DEADLOCK] lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. qarshd[10931]: Running cmdline: echo offline > /sys/block/sde/device/state & lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: Mirror status: 1 of 4 images failed. lvm[3678]: Trying to up-convert to 4 images, 1 logs. lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 for events. lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: WARNING: Failed to replace 2 of 4 images in volume syncd_multiple_legs_4legs_1 lvm[3678]: WARNING: Failed to replace 1 of 1 logs in volume syncd_multiple_legs_4legs_1 [...] lvm[3678]: /dev/sde1: read failed after 0 of 1024 at 4096: Input/output error lvm[3678]: WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 110 lvm[3678]: Automatic metadata correction failed lvm[3678]: Volume group "helter_skelter" not found lvm[3678]: ABORTING: Can't reread VG for helter_skelter/syncd_multiple_legs_4legs_1 lvm[3678]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1 failed. lvm[3678]: Failed to remove faulty devices in helter_skelter-syncd_multiple_legs_4legs_1. lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1 is now in-sync. lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 is now in-sync. lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. 
lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: Secondary mirror device 253:6 has failed (D). lvm[3678]: Device failure in helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4. lvm[3678]: Names including "_mimage" are reserved. Please choose a different LV name. lvm[3678]: Run `lvconvert --help' for more information. lvm[3678]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 failed. lvm[3678]: Failed to remove faulty devices in helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4. qarshd[11032]: Running cmdline: pvs -a lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. kernel: sd 4:0:0:1: rejecting I/O to offline device [...] qarshd[11036]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_multiple_legs_4legs_1/ddfile count=10 bs=4M lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1 is now in-sync. qarshd[11038]: Running cmdline: sync kernel: INFO: task flush-253:7:10895 blocked for more than 120 seconds. kernel: Not tainted 2.6.32-425.el6.x86_64 #1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: flush-253:7 D 0000000000000000 0 10895 2 0x00000080 kernel: ffff8800248a1840 0000000000000046 ffff8800ffffffff 0001a5093836ed53 kernel: ffff8800248a17b0 ffff88003dfd71a0 00000000029d00ac ffffffffad8ebae9 kernel: ffff88003d54bab8 ffff8800248a1fd8 000000000000fbc8 ffff88003d54bab8 kernel: Call Trace: kernel: [<ffffffff810a7351>] ? ktime_get_ts+0xb1/0xf0 kernel: [<ffffffff811bf3d0>] ? sync_buffer+0x0/0x50 kernel: [<ffffffff81528a63>] io_schedule+0x73/0xc0 kernel: [<ffffffff811bf410>] sync_buffer+0x40/0x50 kernel: [<ffffffff815292fa>] __wait_on_bit_lock+0x5a/0xc0 kernel: [<ffffffff811bf3d0>] ? sync_buffer+0x0/0x50 kernel: [<ffffffff815293d8>] out_of_line_wait_on_bit_lock+0x78/0x90 kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50 kernel: [<ffffffff811bf740>] ? 
end_buffer_async_write+0x0/0x190 kernel: [<ffffffff811bf5b6>] __lock_buffer+0x36/0x40 kernel: [<ffffffff811c08b5>] __block_write_full_page+0x305/0x330 kernel: [<ffffffff811bf740>] ? end_buffer_async_write+0x0/0x190 kernel: [<ffffffff811c09c0>] block_write_full_page_endio+0xe0/0x120 kernel: [<ffffffffa03aeb30>] ? buffer_unmapped+0x0/0x20 [ext3] kernel: [<ffffffff811c0a15>] block_write_full_page+0x15/0x20 kernel: [<ffffffffa03af6dd>] ext3_ordered_writepage+0x1ed/0x240 [ext3] kernel: [<ffffffff81133967>] __writepage+0x17/0x40 kernel: [<ffffffff81134c2d>] write_cache_pages+0x1fd/0x4c0 kernel: [<ffffffff81133950>] ? __writepage+0x0/0x40 kernel: [<ffffffff81134f14>] generic_writepages+0x24/0x30 kernel: [<ffffffff81134f55>] do_writepages+0x35/0x40 kernel: [<ffffffff811b52ad>] writeback_single_inode+0xdd/0x290 kernel: [<ffffffff811b56ad>] writeback_sb_inodes+0xbd/0x170 kernel: [<ffffffff811b580b>] writeback_inodes_wb+0xab/0x1b0 kernel: [<ffffffff811b5c03>] wb_writeback+0x2f3/0x410 kernel: [<ffffffff815282d0>] ? thread_return+0x4e/0x76e
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1704.html