Red Hat Bugzilla – Bug 1009700
mirror device failure scenarios lead to deadlocks
Last modified: 2013-11-21 18:28:41 EST
Description of problem:

This may be another version of bug 637936.

Scenario kill_both_logs_2_legs_2_logs: Kill both logs of synced 2 leg redundant log mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_both_logs_2legs_2logs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdh1 /dev/sdf1
* log devices:        /dev/sdc1 /dev/sdg1
* no MDA devices:
* failpv(s):          /dev/sdc1 /dev/sdg1
* failnode(s):        virt-024.cluster-qe.lab.eng.brq.redhat.com
* additional snap:    /dev/sdh1
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
virt-024.cluster-qe.lab.eng.brq.redhat.com: lvcreate --mirrorlog mirrored -m 1 -n syncd_both_logs_2legs_2logs_1 -L 500M helter_skelter /dev/sdh1:0-1000 /dev/sdf1:0-1000 /dev/sdc1:0-150 /dev/sdg1:0-150

Creating a snapshot volume of each of the mirrors
WARNING: Snapshots of mirrors can deadlock under rare device failures.
WARNING: Consider using the raid1 mirror type to avoid this.
WARNING: See global/mirror_segtype_default in lvm.conf.
Current mirror/raid device structure(s):
  LV                                             Attr       LSize   Cpy%Sync Devices
  hs_snap1                                       swi-a-s--- 252.00m          /dev/sdh1(125)
  syncd_both_logs_2legs_2logs_1                  owi-a-m--- 500.00m 37.60    syncd_both_logs_2legs_2logs_1_mimage_0(0),syncd_both_logs_2legs_2logs_1_mimage_1(0)
  [syncd_both_logs_2legs_2logs_1_mimage_0]       Iwi-aom--- 500.00m          /dev/sdh1(0)
  [syncd_both_logs_2legs_2logs_1_mimage_1]       Iwi-aom--- 500.00m          /dev/sdf1(0)
  [syncd_both_logs_2legs_2logs_1_mlog]           mwi-aom---   4.00m 100.00   syncd_both_logs_2legs_2logs_1_mlog_mimage_0(0),syncd_both_logs_2legs_2logs_1_mlog_mimage_1(0)
  [syncd_both_logs_2legs_2logs_1_mlog_mimage_0]  iwi-aom---   4.00m          /dev/sdc1(0)
  [syncd_both_logs_2legs_2logs_1_mlog_mimage_1]  iwi-aom---   4.00m          /dev/sdg1(0)

PV=/dev/sdg1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 1.4
PV=/dev/sdc1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 1.4
PV=/dev/sdg1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 1.4
PV=/dev/sdc1
  syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 1.4

Waiting until all mirror|raid volumes become fully syncd...
1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on virt-024.cluster-qe.lab.eng.brq.redhat.com...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on virt-024.cluster-qe.lab.eng.brq.redhat.com...
Writing verification files (checkit) to mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- virt-024.cluster-qe.lab.eng.brq.redhat.com ----

Disabling device sdc on virt-024.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdg on virt-024.cluster-qe.lab.eng.brq.redhat.com

[deadlock]

[root@virt-024 /]# dmsetup ls
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_1       (253:6)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_1  (253:3)
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_0       (253:5)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_0  (253:2)
helter_skelter-hs_snap1-cow                                 (253:10)
helter_skelter-syncd_both_logs_2legs_2logs_1                (253:7)
helter_skelter-hs_snap1                                     (253:8)
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog           (253:4)
helter_skelter-syncd_both_logs_2legs_2logs_1-real           (253:9)

kernel: INFO: task dmeventd:10613 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dmeventd      D 0000000000000000     0 10613      1 0x00000080
kernel: ffff880015cbda38 0000000000000086 ffff88001feb5337 0000000100000001
kernel: ffff880015cbd9f0 0000000081065dbe ffff88001feb5338 ffffffff811337c0
kernel: ffff880019f5bab8 ffff880015cbdfd8 000000000000fbc8 ffff880019f5bab8
kernel: Call Trace:
kernel: [<ffffffff811337c0>] ? __writepage+0x0/0x40
kernel: [<ffffffff8109b72e>] ? prepare_to_wait+0x4e/0x80
kernel: [<ffffffffa03d5ddd>] log_wait_commit+0x10d/0x170 [jbd]
kernel: [<ffffffff8109b400>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffffa04012d6>] ext3_sync_fs+0x76/0xb0 [ext3]
kernel: [<ffffffff811eff8e>] sync_quota_sb+0x5e/0x130
kernel: [<ffffffff811bac8a>] __sync_filesystem+0x7a/0x90
kernel: [<ffffffff811bae9b>] sync_filesystem+0x4b/0x70
kernel: [<ffffffff811c615e>] freeze_bdev+0xfe/0x280
kernel: [<ffffffffa0002787>] dm_suspend+0x97/0x290 [dm_mod]
kernel: [<ffffffffa000780c>] ? __find_device_hash_cell+0xac/0x170 [dm_mod]
kernel: [<ffffffffa0008bb0>] ? dev_suspend+0x0/0x250 [dm_mod]
kernel: [<ffffffffa0008c26>] dev_suspend+0x76/0x250 [dm_mod]
kernel: [<ffffffffa0008bb0>] ? dev_suspend+0x0/0x250 [dm_mod]
kernel: [<ffffffffa0009a84>] ctl_ioctl+0x214/0x450 [dm_mod]
kernel: [<ffffffffa0009cd3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
kernel: [<ffffffff8119daa2>] vfs_ioctl+0x22/0xa0
kernel: [<ffffffff8119dc44>] do_vfs_ioctl+0x84/0x580
kernel: [<ffffffff8119e1c1>] sys_ioctl+0x81/0xa0
kernel: [<ffffffff810e1f7e>] ? __audit_syscall_exit+0x25e/0x290
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

kernel: INFO: task kmirrord:10595 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kmirrord      D 0000000000000000     0 10595      2 0x00000080
kernel: ffff88001a095bc0 0000000000000046 ffff8800ffffffff 000008e01eb0d423
kernel: ffff8800398a8948 ffff88003d81ac90 000000000026c4ea ffffffffadc5f788
kernel: ffff88003a57baf8 ffff88001a095fd8 000000000000fbc8 ffff88003a57baf8
kernel: Call Trace:
kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
kernel: [<ffffffffa000a315>] sync_io+0x95/0x110 [dm_mod]
kernel: [<ffffffffa00026d0>] ? dm_unplug_all+0x50/0x70 [dm_mod]
kernel: [<ffffffff811221c5>] ? mempool_kmalloc+0x15/0x20
kernel: [<ffffffff811223a3>] ? mempool_alloc+0x63/0x140
kernel: [<ffffffffa000a4c5>] ? dm_io+0xc5/0x1c0 [dm_mod]
kernel: [<ffffffffa0009d50>] ? bvec_get_page+0x0/0x30 [dm_mod]
kernel: [<ffffffffa000a5b7>] dm_io+0x1b7/0x1c0 [dm_mod]
kernel: [<ffffffffa0009e20>] ? vm_get_page+0x0/0x70 [dm_mod]
kernel: [<ffffffffa0009d90>] ? vm_next_page+0x0/0x30 [dm_mod]
kernel: [<ffffffffa001fc61>] disk_flush+0x91/0x170 [dm_log]
kernel: [<ffffffffa0026740>] ? dm_rh_inc+0xc0/0xd0 [dm_region_hash]
kernel: [<ffffffffa00260d3>] dm_rh_flush+0x13/0x20 [dm_region_hash]
kernel: [<ffffffffa002f15f>] do_mirror+0x27f/0x6e0 [dm_mirror]
kernel: [<ffffffffa002eee0>] ? do_mirror+0x0/0x6e0 [dm_mirror]
kernel: [<ffffffff81094e80>] worker_thread+0x170/0x2a0
kernel: [<ffffffff8109b400>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff81094d10>] ? worker_thread+0x0/0x2a0
kernel: [<ffffffff8109b056>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109afc0>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

Version-Release number of selected component (if applicable):
2.6.32-419.el6.x86_64
lvm2-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
lvm2-libs-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
lvm2-cluster-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
udev-147-2.48.el6 BUILT: Fri Aug 9 13:09:50 CEST 2013
device-mapper-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-libs-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-event-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
device-mapper-event-libs-1.02.79-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
cmirror-2.02.100-3.el6 BUILT: Thu Sep 12 15:59:03 CEST 2013
This is reproducible.

lvm[3779]: Monitoring snapshot helter_skelter-hs_snap1
lvm[3779]: Monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1-real for events.
lvm[3779]: Mirror log status: 1 of 2 images failed.
lvm[3779]: Trying to up-convert to 2 images, 2 logs.
lvm[3779]: /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
lvm[3779]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
lvm[3779]: Monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog for events.
kernel: device-mapper: raid1: Unable to read primary mirror during recovery
kernel: device-mapper: raid1: Primary mirror (253:2) failed while out-of-sync: Reads may fail.
lvm[3779]: Trying to up-convert to 2 images, 1 logs.
lvm[3779]: intermediate VG write failed.
lvm[3779]: Trying to up-convert to 2 images, 0 logs.
lvm[3779]: intermediate VG write failed.
lvm[3779]: WARNING: Failed to replace 2 of 2 logs in volume syncd_both_logs_2legs_2logs_1
lvm[3779]: Repair of mirrored device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog finished successfully.
lvm[3779]: helter_skelter-syncd_both_logs_2legs_2logs_1-real is now in-sync.
lvm[3779]: Primary mirror device 253:2 sync failed.
lvm[3779]: No longer monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1_mlog for events.
lvm[3779]: No longer monitoring snapshot helter_skelter-hs_snap1
lvm[3779]: No longer monitoring mirror device helter_skelter-syncd_both_logs_2legs_2logs_1-real for events.
qarshd[13150]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_both_logs_2legs_2logs_1/ddfile count=10 bs=4M
qarshd[13152]: Running cmdline: sync
kernel: INFO: task kmirrord:12944 blocked for more than 120 seconds.
kernel: Not tainted 2.6.32-419.el6.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kmirrord      D 0000000000000000     0 12944      2 0x00000080
kernel: ffff88003d707bc0 0000000000000046 ffff8800ffffffff 000018b03b040975
kernel: 0000000000000000 ffff88003d2c2ec0 00000000003c30d0 ffffffffadc5d87b
kernel: ffff88003d173ab8 ffff88003d707fd8 000000000000fbc8 ffff88003d173ab8
kernel: Call Trace:
kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
kernel: [<ffffffffa000a315>] sync_io+0x95/0x110 [dm_mod]
kernel: [<ffffffff811221c5>] ? mempool_kmalloc+0x15/0x20
kernel: [<ffffffff811223a3>] ? mempool_alloc+0x63/0x140
kernel: [<ffffffffa000a5b7>] dm_io+0x1b7/0x1c0 [dm_mod]
I've hit what appears to be this same issue with two other helter_skelter scenarios.

================================================================================
Iteration 0.15 started at Thu Sep 26 20:38:52 CDT 2013
================================================================================
Scenario kill_multiple_legs_synced_4_legs: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_4legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdc1 /dev/sdd1 /dev/sdg1 /dev/sdh1
* log devices:        /dev/sdb1
* no MDA devices:
* failpv(s):          /dev/sdc1 /dev/sdh1
* failnode(s):        virt-020.cluster-qe.lab.eng.brq.redhat.com
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

[...]

Disabling device sdc on virt-020.cluster-qe.lab.eng.brq.redhat.com
Disabling device sdh on virt-020.cluster-qe.lab.eng.brq.redhat.com

Getting recovery check start time from /var/log/messages: Sep 27 03:40
Attempting I/O to cause mirror down conversion(s) on virt-020.cluster-qe.lab.eng.brq.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.233749 s, 179 MB/s

LOG:
Sep 27 03:40:24 virt-020 qarshd[23925]: Running cmdline: echo offline > /sys/block/sdc/device/state &
Sep 27 03:40:26 virt-020 qarshd[23928]: Running cmdline: echo offline > /sys/block/sdh/device/state &
[...]
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Mirror status: 1 of 4 images failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 4 images, 1 logs.
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to write intermediate VG helter_skelter metadata for mirror conversion.
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to insert resync layer
Sep 27 03:40:26 virt-020 lvm[3807]: Failed to insert resync layer
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 3 images, 1 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 2 images, 1 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: Trying to up-convert to 2 images, 0 logs.
Sep 27 03:40:26 virt-020 kernel: sd 9:0:0:1: rejecting I/O to offline device
Sep 27 03:40:26 virt-020 lvm[3807]: intermediate VG write failed.
Sep 27 03:40:26 virt-020 lvm[3807]: WARNING: Failed to replace 2 of 4 images in volume syncd_multiple_legs_4legs_1
Sep 27 03:40:26 virt-020 lvm[3807]: WARNING: Failed to replace 1 of 1 logs in volume syncd_multiple_legs_4legs_1
Sep 27 03:40:26 virt-020 lvm[3807]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1 finished successfully.
[...]
Sep 27 03:40:36 virt-020 qarshd[23951]: Running cmdline: sync
Sep 27 03:43:15 virt-020 kernel: INFO: task flush-253:7:23884 blocked for more than 120 seconds.
Sep 27 03:43:15 virt-020 kernel: Not tainted 2.6.32-419.el6.x86_64 #1
Sep 27 03:43:15 virt-020 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 27 03:43:15 virt-020 kernel: flush-253:7   D 0000000000000000     0 23884      2 0x00000080
Sep 27 03:43:15 virt-020 kernel: ffff88003e1cd840 0000000000000046 ffff8800ffffffff 00024f4f9010a407
Sep 27 03:43:15 virt-020 kernel: ffff88003e1cd7b0 ffff88003de0e3d0 0000000001a603f4 ffffffffadbb70bd
Sep 27 03:43:15 virt-020 kernel: ffff880020a89af8 ffff88003e1cdfd8 000000000000fbc8 ffff880020a89af8
Sep 27 03:43:15 virt-020 kernel: Call Trace:
Sep 27 03:43:15 virt-020 kernel: [<ffffffff810a7281>] ? ktime_get_ts+0xb1/0xf0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0a0>] ? sync_buffer+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff815282d3>] io_schedule+0x73/0xc0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0e0>] sync_buffer+0x40/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81528b6a>] __wait_on_bit_lock+0x5a/0xc0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf0a0>] ? sync_buffer+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81528c48>] out_of_line_wait_on_bit_lock+0x78/0x90
Sep 27 03:43:15 virt-020 kernel: [<ffffffff8109b480>] ? wake_bit_function+0x0/0x50
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf410>] ? end_buffer_async_write+0x0/0x190
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf286>] __lock_buffer+0x36/0x40
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c0585>] __block_write_full_page+0x305/0x330
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811bf410>] ? end_buffer_async_write+0x0/0x190
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c0690>] block_write_full_page_endio+0xe0/0x120
Sep 27 03:43:15 virt-020 kernel: [<ffffffffa03aeb30>] ? buffer_unmapped+0x0/0x20 [ext3]
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811c06e5>] block_write_full_page+0x15/0x20
Sep 27 03:43:15 virt-020 kernel: [<ffffffffa03af6dd>] ext3_ordered_writepage+0x1ed/0x240 [ext3]
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811337d7>] __writepage+0x17/0x40
Sep 27 03:43:15 virt-020 kernel: [<ffffffff81134a9d>] write_cache_pages+0x1fd/0x4c0
Sep 27 03:43:15 virt-020 kernel: [<ffffffff811337c0>] ? __writepage+0x0/0x40
Another example:

Scenario kill_primary_synced_3_legs: Kill primary leg of synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_3legs_1 syncd_primary_3legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdf1 /dev/sdd1 /dev/sdh1
* log devices:        /dev/sdg1
* no MDA devices:
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01
* lvmetad:            0
* leg fault policy:   remove
* log fault policy:   remove
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 2 -n syncd_primary_3legs_1 -L 500M helter_skelter /dev/sdf1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150
taft-01: lvcreate -m 2 -n syncd_primary_3legs_2 -L 500M helter_skelter /dev/sdf1:0-1000 /dev/sdd1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150

[...]

Disabling device sdf on taft-01

LOG:
Sep 26 22:40:16 taft-01 qarshd[26774]: Running cmdline: echo offline > /sys/block/sdf/device/state &
Sep 26 22:40:17 taft-01 lvm[3372]: Primary mirror device 253:4 has failed (D).
Sep 26 22:40:17 taft-01 lvm[3372]: Device failure in helter_skelter-syncd_primary_3legs_1.
Sep 26 22:40:17 taft-01 kernel: device-mapper: raid1: Mirror read failed from 253:9. Trying alternative device.
[...]
Sep 26 22:40:22 taft-01 kernel: sd 3:0:0:5: rejecting I/O to offline device
Sep 26 22:40:22 taft-01 kernel: sd 3:0:0:5: rejecting I/O to offline device
Sep 26 22:43:47 taft-01 kernel: INFO: task dmeventd:26626 blocked for more than 120 seconds.
Sep 26 22:43:47 taft-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 22:43:47 taft-01 kernel: dmeventd      D 0000000000000002     0 26626      1 0x00000080
Sep 26 22:43:47 taft-01 kernel: ffff88021b57ba98 0000000000000086 000000000003ffff ffffea0006c29fa0
Sep 26 22:43:47 taft-01 kernel: ffff8801ef0e4cd8 ffff88021b57bc08 ffffffff811bfa40 0000000000000286
Sep 26 22:43:47 taft-01 kernel: ffff88021824fab8 ffff88021b57bfd8 000000000000fbc8 ffff88021824fab8
Sep 26 22:43:47 taft-01 kernel: Call Trace:
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfa40>] ? blkdev_get_block+0x0/0x20
Sep 26 22:43:47 taft-01 kernel: [<ffffffff810a4421>] ? ktime_get_ts+0xb1/0xf0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111c8c0>] ? sync_page+0x0/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81520b53>] io_schedule+0x73/0xc0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111c8fd>] sync_page+0x3d/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8152161f>] __wait_on_bit+0x5f/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111cb33>] wait_on_page_bit+0x73/0x80
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81098620>] ? wake_bit_function+0x0/0x50
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81131bd5>] ? pagevec_lookup_tag+0x25/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111cf5b>] wait_on_page_writeback_range+0xfb/0x190
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81130c44>] ? generic_writepages+0x24/0x30
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81130c71>] ? do_writepages+0x21/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111d0ab>] ? __filemap_fdatawrite_range+0x5b/0x60
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111d128>] filemap_write_and_wait_range+0x78/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8111e4b8>] generic_file_aio_read+0x418/0x700
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81229351>] ? avc_has_perm+0x71/0x90
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8122b1b2>] ? selinux_inode_permission+0x72/0xb0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfe81>] blkdev_aio_read+0x51/0x80
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81184ada>] do_sync_read+0xfa/0x140
Sep 26 22:43:47 taft-01 kernel: [<ffffffff810985a0>] ? autoremove_wake_function+0x0/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811bfddc>] ? block_ioctl+0x3c/0x40
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81199382>] ? vfs_ioctl+0x22/0xa0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff8122ed7b>] ? selinux_file_permission+0xfb/0x150
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81221c56>] ? security_file_permission+0x16/0x20
Sep 26 22:43:47 taft-01 kernel: [<ffffffff811853c5>] vfs_read+0xb5/0x1a0
Sep 26 22:43:47 taft-01 kernel: [<ffffffff81185501>] sys_read+0x51/0x90
Just to be clear, snapshots of mirrors (which are known to deadlock) are not required to cause this issue.

================================================================================
Iteration 1.6 started at Tue Oct 1 19:49:14 CDT 2013
================================================================================
Scenario kill_primary_synced_2_legs: Kill primary leg of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_primary_2legs_1 syncd_primary_2legs_2
* sync:               1
* striped:            0
* leg devices:        /dev/sdg1 /dev/sdd1
* log devices:        /dev/sdc1
* no MDA devices:
* failpv(s):          /dev/sdg1
* failnode(s):        taft-01
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 1 -n syncd_primary_2legs_1 -L 500M helter_skelter /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-150
taft-01: lvcreate -m 1 -n syncd_primary_2legs_2 -L 500M helter_skelter /dev/sdg1:0-1000 /dev/sdd1:0-1000 /dev/sdc1:0-150

Current mirror/raid device structure(s):
  LV                                 Attr       LSize   Cpy%Sync Devices
  syncd_primary_2legs_1              mwi-a-m--- 500.00m 18.40    syncd_primary_2legs_1_mimage_0(0),syncd_primary_2legs_1_mimage_1(0)
  [syncd_primary_2legs_1_mimage_0]   Iwi-aom--- 500.00m          /dev/sdg1(0)
  [syncd_primary_2legs_1_mimage_1]   Iwi-aom--- 500.00m          /dev/sdd1(0)
  [syncd_primary_2legs_1_mlog]       lwi-aom---   4.00m          /dev/sdc1(0)
  syncd_primary_2legs_2              mwi-a-m--- 500.00m 3.20     syncd_primary_2legs_2_mimage_0(0),syncd_primary_2legs_2_mimage_1(0)
  [syncd_primary_2legs_2_mimage_0]   Iwi-aom--- 500.00m          /dev/sdg1(125)
  [syncd_primary_2legs_2_mimage_1]   Iwi-aom--- 500.00m          /dev/sdd1(125)
  [syncd_primary_2legs_2_mlog]       lwi-aom---   4.00m          /dev/sdc1(1)

Waiting until all mirror|raid volumes become fully syncd...
0/2 mirror(s) are fully synced: ( 77.70% 62.30% )
2/2 mirror(s) are fully synced: ( 100.00% 100.00% )

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.12 (17-May-2010)
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-01...

PV=/dev/sdg1
  syncd_primary_2legs_1_mimage_0: 5.1
  syncd_primary_2legs_2_mimage_0: 5.1
PV=/dev/sdg1
  syncd_primary_2legs_1_mimage_0: 5.1
  syncd_primary_2legs_2_mimage_0: 5.1

Writing verification files (checkit) to mirror(s) on...
---- taft-01 ----
Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
---- taft-01 ----

Disabling device sdg on taft-01

[DEADLOCK]

qarshd[19223]: Running cmdline: echo offline > /sys/block/sdg/device/state &
lvm[3265]: Primary mirror device 253:8 has failed (D).
lvm[3265]: Device failure in helter_skelter-syncd_primary_2legs_2.
kernel: device-mapper: raid1: Mirror read failed from 253:4. Trying alternative device.
[...]
kernel: sd 3:0:0:6: rejecting I/O to offline device
kernel: INFO: task dmeventd:19107 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dmeventd      D 0000000000000003     0 19107      1 0x00000080
kernel: ffff880218feda98 0000000000000086 000000000003ffff ffffea0007183b18
kernel: ffff880214ae6318 ffff880218fedc08 ffffffff811bfa40 0000000000000286
kernel: ffff880218cc6638 ffff880218fedfd8 000000000000fbc8 ffff880218cc6638
kernel: Call Trace:
kernel: [<ffffffff811bfa40>] ? blkdev_get_block+0x0/0x20
kernel: [<ffffffff810a4421>] ? ktime_get_ts+0xb1/0xf0
kernel: [<ffffffff8111c8c0>] ? sync_page+0x0/0x50
kernel: [<ffffffff81520b53>] io_schedule+0x73/0xc0
kernel: [<ffffffff8111c8fd>] sync_page+0x3d/0x50
kernel: [<ffffffff8152161f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffff8111cb33>] wait_on_page_bit+0x73/0x80
kernel: [<ffffffff81098620>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffff81131bd5>] ? pagevec_lookup_tag+0x25/0x40
kernel: [<ffffffff8111cf5b>] wait_on_page_writeback_range+0xfb/0x190
kernel: [<ffffffff81130c44>] ? generic_writepages+0x24/0x30
kernel: [<ffffffff81130c71>] ? do_writepages+0x21/0x40
kernel: [<ffffffff8111d0ab>] ? __filemap_fdatawrite_range+0x5b/0x60
kernel: [<ffffffff8111d128>] filemap_write_and_wait_range+0x78/0x90
kernel: [<ffffffff8111e4b8>] generic_file_aio_read+0x418/0x700
kernel: [<ffffffff81229351>] ? avc_has_perm+0x71/0x90
kernel: [<ffffffff8122b1b2>] ? selinux_inode_permission+0x72/0xb0
kernel: [<ffffffff811bfe81>] blkdev_aio_read+0x51/0x80
kernel: [<ffffffff81184ada>] do_sync_read+0xfa/0x140
kernel: [<ffffffff810985a0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff811bfddc>] ? block_ioctl+0x3c/0x40
kernel: [<ffffffff81199382>] ? vfs_ioctl+0x22/0xa0
kernel: [<ffffffff8122ed7b>] ? selinux_file_permission+0xfb/0x150
kernel: [<ffffffff81221c56>] ? security_file_permission+0x16/0x20
kernel: [<ffffffff811853c5>] vfs_read+0xb5/0x1a0
kernel: [<ffffffff81185501>] sys_read+0x51/0x90
kernel: [<ffffffff810deefe>] ? __audit_syscall_exit+0x25e/0x290
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
So, we /don't/ need: 1) a mirrored log 2) snapshots to produce this issue. (right?)
Can it be reproduced with 1 mirror, or is more than one required (or one of the above conditions)? IOW, what is the simplest form in which we can reproduce the bug?
Having trouble reproducing. I've seen it once with a borrowed QA machine and haven't reproduced it since turning on debugging. I'm working to get my machines moved over to kvm so I can test on them as well.
Moved all my VMs over to kvm. Tests fail to run there also.
This message seems to precede the issues you are seeing:

lvm[3779]: intermediate VG write failed.

I don't see it in every one of your above reports, but I don't know whether all the logs are included in each of them. Clearly, failing to do a vg_write is going to cause problems. I'm not sure why it is failing, though.
It appears that 2+ mirrors are required to force this issue to happen. In the cases I've been seeing, the devices are failed and should affect all mirrors in the same way. (I am running 'kill_both_logs_2_legs_2_logs', which kills both devices of a mirrored log on both mirrors at once.) However, the kernel does not detect the failure of a device until it has been written to. I am seeing the following status at the beginning of the repair command:

lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies helter_skelter/syncd_both_logs_2legs_2logs_1

helter_skelter-syncd_both_logs_2legs_2logs_2_mlog: 0 8192 mirror 2 253:9 253:10 8/8 1 AA 1 core
helter_skelter-syncd_both_logs_2legs_2logs_2_mlog_mimage_1: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mimage_1: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mlog_mimage_0: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2_mimage_0: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_1: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_1: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mimage_0: 0 1024000 linear
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog_mimage_0: 0 8192 linear
helter_skelter-syncd_both_logs_2legs_2logs_2: 0 1024000 mirror 2 253:12 253:13 1000/1000 1 AA 3 disk 253:11 A
helter_skelter-syncd_both_logs_2legs_2logs_1: 0 1024000 mirror 2 253:6 253:7 1000/1000 1 AA 3 disk 253:5 D
helter_skelter-syncd_both_logs_2legs_2logs_1_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 DD 1 core

You can see that the mirrored log has failed and been noticed in helter_skelter-syncd_both_logs_2legs_2logs_1, but not in helter_skelter-syncd_both_logs_2legs_2logs_2. I think the repair is attempting to read the second mirror (because the kernel has not yet noticed the failure) and getting stuck. The traces seem to confirm this.
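For reference, the per-device health characters in that status output can be picked out mechanically. Below is a rough sketch of how one might parse the dm-mirror status lines shown above ('A' = alive, 'D' = a failure the kernel has already detected). This is a simplified illustration, not the parser LVM itself uses; it assumes the fixed "1 &lt;health&gt;" field layout and a 'disk' or 'core' log as in this output.

```python
def parse_mirror_status(line):
    """Parse one 'dmsetup status' line; return (name, leg_health, log_health).

    For non-mirror targets (e.g. linear) both health fields are None.
    log_health is None for a 'core' log, which carries no device health char.
    """
    name, rest = line.split(":", 1)      # device names here contain no colon
    fields = rest.split()
    if fields[2] != "mirror":
        return name, None, None
    nmirrors = int(fields[3])
    # fields[4 : 4+nmirrors] are the leg devices (major:minor), followed by
    # "<in_sync>/<total>", then "1", then one health char per leg
    leg_health = fields[4 + nmirrors + 2]
    log_args = fields[4 + nmirrors + 3:]   # e.g. ['3', 'disk', '253:5', 'D']
    log_health = log_args[-1] if log_args[1] == "disk" else None
    return name, leg_health, log_health
```

Run against the status above, this flags syncd_both_logs_2legs_2logs_1 (disk log health 'D') and its _mlog ('DD' legs) as detected failures, while syncd_both_logs_2legs_2logs_2 still reports all 'A' even though its log devices are equally dead.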
If I replace the second mirror with an error target, the trace from the repair command reports an I/O error reading the second mirror. The fix will involve additional efforts to avoid reading mirrors. (I was pretty certain all cases were covered... perhaps there has been a regression?)
Somehow (I'm not sure yet) the code is making it past 'device_is_usable' and hitting the filter check - which is where it is hanging:

291         if (!device_is_usable(dev)) {
292                 log_debug_devs("%s: Skipping unusable device", dev_name(dev));
293                 return 0;
294         }
295         return pf->real->passes_filter(pf->real, dev);   // HANG!

backtrace:
#0  0x0000003af140e75d in read () from /lib64/libpthread.so.0
#1  0x00007fd779097427 in _io (where=0x7fd778ef7060, buffer=0x7fd75c066400 "", should_write=0) at device/dev-io.c:97
#2  0x00007fd7790978a2 in _aligned_io (where=0x7fd778ef70e0, buffer=0x7fd778ef712c "", should_write=0) at device/dev-io.c:204
#3  0x00007fd779098e51 in dev_read (dev=0x7fd75c06f9c0, offset=0, len=4, buffer=0x7fd778ef712c) at device/dev-io.c:671
#4  0x00007fd779099320 in _dev_has_md_magic (dev=0x7fd75c06f9c0, sb_offset=0) at device/dev-md.c:37
#5  0x00007fd7790994a9 in dev_is_md (dev=0x7fd75c06f9c0, sb=0x0) at device/dev-md.c:113
#6  0x00007fd7790a232c in _ignore_md (f=0x7fd75c00be40, dev=0x7fd75c06f9c0) at filters/filter-md.c:29
#7  0x00007fd77909facf in _and_p (f=0x7fd75c00be80, dev=0x7fd75c06f9c0) at filters/filter-composite.c:24
#8  0x00007fd7790a0bff in _lookup_p (f=0x7fd75c020800, dev=0x7fd75c06f9c0) at filters/filter-persistent.c:295
#9  0x00007fd779097191 in dev_iter_get (iter=0x7fd75c038670) at device/dev-cache.c:1011
#10 0x00007fd7790873c9 in lvmcache_label_scan (cmd=0x208efa0, full_scan=0) at cache/lvmcache.c:691
bisecting:

e9f83147d5da3120604c67354b4d6f027ac10ed0  GOOD
05bf4b8cc3e3d15e79c9964e496dbe396beacb3c  GOOD
6d2604f02651158223ac31fe58e0638b6171171c  GOOD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  GOOD
bd75844024ad4a2e66c7743c50e189a00c233284  GOOD
dd796d6a9485287973e61c427ee45694965203a0  BAD
9658032cba4f6a2350bb35d5133a436bad84fec2  BAD
a94e28ae019b94aa48dbaf1db83cc30bfdee0f4f  BAD
85b9c12e92873a87b7c02bac20287beeea82ef2a  BAD
861a3b2f1940cd3667cc353c762ff3d1ef75c37b  GOOD
30432bd6048ce9c52eefeba9e99d1b0805c6c59b  FINAL BAD **

but that doesn't seem to make sense...
Retesting (with 20 test iterations instead of 4):

861a3b2f1940cd3667cc353c762ff3d1ef75c37b  BAD
bd75844024ad4a2e66c7743c50e189a00c233284  BAD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  BAD
6d2604f02651158223ac31fe58e0638b6171171c  GOOD

Restarting bisect:

6d2604f02651158223ac31fe58e0638b6171171c  GOOD
c36dcc17282cf9c5f745cc627f9ce3825b4ee5e2  BAD
003f08c1642f060e57c55e950d4690d470451602  GOOD
f1dc4d3d81456d506b2c56fb2e8b12106cbd9e16  BAD
bb457adbb6d625a5f942faf8498bb5ee87645ec3  BAD
68c2d352ec2746ab278856388c6203c604f6248a  BAD
f50e8fb9b9030c86f3a65f61a4b7f43089a26c60  GOOD
b15278c3dca14d7ee09a2ebcca4b91cbdf8428af  GOOD
c95f17ea64e3faee3d36be95544c907b5e89c073  FINAL BAD **

This looks right.

commit c95f17ea64e3faee3d36be95544c907b5e89c073
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Wed Aug 7 17:42:26 2013 -0500

    Mirror: Fix issue preventing PV creation on mirror LVs

    Commit b248ba0a396d7fc9a459eea02cfdc70b33ce3441 attempted to prevent mirror
    devices which had a failed device in their mirrored log from being
    usable/readable by LVM.  This was to protect against circular dependancies
    where one LVM command could be blocked trying to read one of these affected
    mirrors while the LVM command to fix/unblock that mirror was stuck behind
    the currently running command.

    The above commit went wrong when it used 'device_is_usable()' to recurse on
    the mirrored log device to check if it was suspended or blocked.  The
    'device_is_usable' function also contains a check for reserved names - like
    *_mlog, etc.  This last check always triggered when checking a mirror's log
    simply because of the name, not because it was suspended or blocked - a
    false positive.

    The solution is to create a new function like 'device_is_usable', but
    without the check for reserved names.  Using this new function
    (device_is_suspended_or_blocked), we can check the status of a mirror's
    log device properly.
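For what it's worth, the nonsensical result of the first bisect is exactly what a timing-dependent reproducer does to bisection: a single bad commit that happens to test GOOD sends the search into the wrong half. A toy model of both halves of the problem (the 0.3 per-iteration hit rate is a made-up number, not a measurement from this bug):

```python
def bisect_first_bad(commits, is_bad):
    # classic bisect: find the first bad commit, assuming verdicts are
    # reliable and monotone (all GOOD commits precede all BAD ones)
    lo, hi = 0, len(commits) - 1       # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid + 1
    return commits[hi]

# If the deadlock reproduces with probability p per test iteration, a truly
# BAD commit is misjudged GOOD with probability (1 - p) ** iterations.
p = 0.3                        # hypothetical per-iteration hit rate
miss_with_4 = (1 - p) ** 4     # roughly 1 miss in 4 bisect test runs
miss_with_20 = (1 - p) ** 20   # well under 1 in 1000
```

With reliable verdicts the bisect converges on the true first-bad commit; with one verdict in four wrong, a 10-step bisect is more likely than not to finish on the wrong commit, which is why re-running the suspect range with 20 iterations (as above) was the right move.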
I believe the above commit is correct, but it revealed a bug that was previously not seen because it enabled an otherwise unusable feature.

Two things I'm fairly certain of:
1) There must be more than one mirror that suffers device failure.
2) The mirrors must have mirrored logs.
Ok, now I'm virtually certain this is timing related. Here's the scenario:

- We have two mirrors with mirrored logs, with sub-LVs placed on the same devices.

- Killing both log devices will require replacement of the logs of both mirrors.

- When 'lvconvert --repair' is run, it is run in response to an event received from the kernel on ONE of the mirror logs. At this point, I/O has not yet been run on the second mirror to cause the kernel to notice the failure there as well. So, a 'dmsetup status' will reveal that ONLY ONE mirror and its mirrored log have failed devices.

- The 'lvconvert --repair' is careful to avoid reading from suspended DM devices and blocked mirrors (see lib/activate/dev_manager.c:_ignore_blocked_mirror_devices()), as this would cause the command to block indefinitely and lock out any other LVM commands. However, the second mirror is not suspended, and the failed devices that will cause it to block I/O haven't been detected yet! In fact (and this is crucial to why this bug is difficult to reproduce outside the STS testsuite), a read of the second mirror will succeed, because it doesn't require updating the log, which is where the device failures have occurred.

- Ongoing I/O to the second mirror finally causes the log failure to be detected in the kernel. Specifically, a write operation is necessary for the failure of the mirrored log to be noticed. Now, all I/O (even reads) to the entire mirror blocks, waiting for the mirrored log to be repaired.

All the pieces are now in place to cause the blockage. The 'lvconvert --repair' has gotten to the point where it thinks it is ok to check the 2nd mirror for an LVM label. Another process has caused the 2nd mirror to become blocked in the meantime. The resulting label read will cause the command, and any subsequent LVM commands, to hang.
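The window can be sketched as a toy state machine. This is only an illustration of the ordering described above, not real device-mapper or LVM code; every name in it is made up:

```python
class ToyMirror:
    """Models one mirror with a mirrored log whose log PVs may be dead."""

    def __init__(self):
        self.log_devices_dead = False   # the log PVs have actually failed
        self.failure_detected = False   # kernel has noticed (needs a log write)

    def write(self):
        # a write must update the log, so it surfaces the failure; from
        # then on ALL I/O (even reads) blocks pending repair
        if self.log_devices_dead:
            self.failure_detected = True
            return "blocked"
        return "ok"

    def read(self):
        # reads don't touch the log, so they keep succeeding until a
        # write has surfaced the failure
        return "blocked" if self.failure_detected else "ok"


# The window: both mirrors' log devices are dead, but mirror 2 has not
# been written to yet, so the kernel still reports it healthy.
m2 = ToyMirror()
m2.log_devices_dead = True

# repair's safety check can only see failures the kernel has detected
repair_thinks_m2_is_safe = not m2.failure_detected   # True!

m2.write()               # ongoing I/O lands first and blocks the mirror
label_read = m2.read()   # "blocked" -- this is the hung label scan
```

In this model, as in the bug, the repair command's check is not wrong per se; it is racing against the kernel's lazy failure detection, and the write from unrelated I/O flips the mirror into the blocking state after the check but before the label read.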
Frankly, I find it amazing that the timing of the STS test operations is such that it can hit this bug reliably, but this is exactly the conclusion that matches when looking at the logs, tracing through gdb, and comparing results when instrumenting the above by hand. I'm not quite sure how many other operations this could affect. I suppose most LVM commands that come across an undetected mirror failure could result in a hang. It seems like a likely possibility, and I'm not sure mirrors with mirrored logs are a requirement in all cases.
I can think of two solutions and a long-shot for this problem:
1) [easy but blunt]: Disallow mirrors from being PVs. Customers have filed bugs in the past after discovering that this is not allowed. I'm not sure how necessary this ability is, but this would immediately disallow any kind of VG-level stacking on mirrors - cluster or otherwise.
2) [harder but more precise]: Implement another filter, which should come before the other filters, that would check whether a mirror is stacked on a PV which has failed. I'm not sure how feasible this is. The ordering of devices being read for labels would be important: attempting to read a mirror with undetected failed devices before the failed PVs themselves would obviously make this impossible.
*) [long-shot]: I *think* this could affect mirrors that do not use mirrored logs. If that's not true, then we could disallow only mirrors with mirrored logs from being used as PVs. The only time I have ever heard of a customer wanting to stack VGs on mirrors is in a cluster (they want RAID10-like features in a cluster). Mirrors with mirrored logs are not available in a cluster, so those users won't know the difference. RAID1/10 is available in single machine mode, so those users shouldn't care either.
This would probably be the best solution, but it requires that the problem is isolated to mirrors with mirrored logs OR it requires us to live with the fact that we've shrunk the problem but not eliminated it.
Yes, I can make it happen without using 'mirrored' logs; the timing is tighter, though. So, any mirror with an undetected error can be problematic for any LVM command if the command reaches the point of label reading and some outside process then causes the mirror to become blocked by performing a write. The long-shot solution in comment 14 will not fix the problem entirely; it will simply make it harder to hit.
Discussion of potential workarounds/fixes ongoing.
I've implemented the code to disallow label-reading on mirrors with mirrored logs. It does not solve the problem. It is still somewhat simple to reproduce the problem with the QA test suite. We must decide whether to disallow mirrors as PVs until we implement asynchronous label reading. Because of the problems it can cause, I suggest we disable PVs on mirror LVs - perhaps at a minimum during repair.
Another solution that I'm testing at the moment is to introduce a new global variable (like _ignore_suspended_devices) called '_avoid_reading_mirrors'. It is set when cmd->handles_missing_pvs is set. No mirror devices encountered while this global is set will be read for labels. 'handles_missing_pvs' is set when performing repair-type operations. I'm not yet sure if this is enough to prevent initial label reads from hitting mirrors. I may have to add a configuration file option that controls whether or not to read mirrors for PVs. That may actually be a good solution: we would set the config option to /not/ read mirrors. Those users who complain could change the config file, but only after seeing a description that warns them why it is risky to allow reading mirrors and what their alternatives might be.
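For reference, the option that ultimately implemented this idea is devices/ignore_lvm_mirrors (see the d5896f0 commit later in this bug). A minimal lvm.conf fragment would look like this — the comments are mine, paraphrasing the behavior described here:

```
devices {
    # 1 = do not scan mirror LVs for PV labels. This disallows stacking
    #     PVs/VGs on mirror LVs, but avoids blocking on a mirror whose
    #     failed devices have not yet been detected by the kernel.
    # 0 = allow label scanning of mirror LVs (risk of hangs on failure).
    ignore_lvm_mirrors = 1
}
```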
(In reply to Jonathan Earl Brassow from comment #18)
> Another solution that I'm testing at the moment is to introduce a new global
> variable (like _ignore_suspended_devices) called '_avoid_reading_mirrors'.
> It is set when cmd->handles_missing_pvs is set. No mirror devices
> encountered while this global is set will be read for labels.
> 'handles_missing_pvs' is set when performing repair type operations. I'm
> not yet sure if this is enough to prevent initial label reads from hitting
> mirrors. I may have to make a configuration file option that adjusts
> whether or not to read mirrors for PVs. That may actually be a good
> solution. We would set the config option to /not/ read mirrors. Those
> users who complain could change the config file, but not after seeing a
> description that warns them of why it is bad to allow reading mirrors and
> what their alternatives might be.
This solution seems to work to prevent this bug. I can no longer reproduce the issue using the STS tests. I can create a VG on top of a mirrored LV and create and use LVs within that layered VG. However, it is impossible to '[vg|lv]change' or '[vg|lv]remove' the LVs in the layered VG, because those commands set 'handles_missing_pvs'. I wonder why that is... This could be worked around by setting 'avoid_reading_mirrors' in a better place.
Full backtrace of where the problem is occurring:
#0  0x0000003af140e75d in read () from /lib64/libpthread.so.0
#1  0x00007f9ea114f5ef in _io (where=0x7f9ea1126160, buffer=0x7f9e9c057800 "", should_write=0) at device/dev-io.c:97
#2  0x00007f9ea114fa6a in _aligned_io (where=0x7f9ea11261e0, buffer=0x7f9ea112622c "", should_write=0) at device/dev-io.c:204
#3  0x00007f9ea1151019 in dev_read (dev=0x7f9e9c06f4c8, offset=0, len=4, buffer=0x7f9ea112622c) at device/dev-io.c:671
#4  0x00007f9ea11514e8 in _dev_has_md_magic (dev=0x7f9e9c06f4c8, sb_offset=0) at device/dev-md.c:37
#5  0x00007f9ea1151671 in dev_is_md (dev=0x7f9e9c06f4c8, sb=0x0) at device/dev-md.c:113
#6  0x00007f9ea115a4f4 in _ignore_md (f=0x7f9e9c020810, dev=0x7f9e9c06f4c8) at filters/filter-md.c:29
#7  0x00007f9ea1157c97 in _and_p (f=0x7f9e9c020890, dev=0x7f9e9c06f4c8) at filters/filter-composite.c:24
#8  0x00007f9ea1158dc7 in _lookup_p (f=0x7f9e9c020700, dev=0x7f9e9c06f4c8) at filters/filter-persistent.c:295
#9  0x00007f9ea114f359 in dev_iter_get (iter=0x7f9e9c056d00) at device/dev-cache.c:1011
#10 0x00007f9ea113f591 in lvmcache_label_scan (cmd=0x1d80fa0, full_scan=0) at cache/lvmcache.c:691
#11 0x00007f9ea1192ba1 in _vg_read (cmd=0x1d80fa0, vgname=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, warnings=1, consistent=0x7f9ea1126584, precommitted=0) at metadata/metadata.c:3032
#12 0x00007f9ea1193e9a in vg_read_internal (cmd=0x1d80fa0, vgname=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, warnings=1, consistent=0x7f9ea1126584) at metadata/metadata.c:3437
#13 0x00007f9ea11958af in _vg_lock_and_read (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, lock_flags=36, status_flags=514, misc_flags=1048576) at metadata/metadata.c:4139
#14 0x00007f9ea1195c78 in vg_read (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, flags=1048576) at metadata/metadata.c:4246
#15 0x00007f9ea1195cb9 in vg_read_for_update (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", vgid=0x0, flags=0) at metadata/metadata.c:4257
#16 0x00007f9ea11e0477 in _get_lvconvert_vg (cmd=0x1d80fa0, name=0x7f9e9c055bc2 "helter_skelter", uuid=0x0) at lvconvert.c:563
#17 0x00007f9ea11e7a35 in get_vg_lock_and_logical_volume (cmd=0x1d80fa0, vg_name=0x7f9e9c055bc2 "helter_skelter", lv_name=0x7f9e9c005bee "syncd_secondary_log_2legs_1") at lvconvert.c:2701
#18 0x00007f9ea11e7c74 in lvconvert_single (cmd=0x1d80fa0, lp=0x7f9ea1126790) at lvconvert.c:2739
#19 0x00007f9ea11e825b in lvconvert (cmd=0x1d80fa0, argc=1, argv=0x7f9ea1126938) at lvconvert.c:2848
#20 0x00007f9ea11f2300 in lvm_run_command (cmd=0x1d80fa0, argc=1, argv=0x7f9ea1126938) at lvmcmdline.c:1179
#21 0x00007f9ea120beb8 in lvm2_run (handle=0x1d80fa0,
#22 0x00007f9ea1480e5b in dmeventd_lvm2_run (cmdline=0x7f9ea1126b90 "lvconvert --config devices{ignore_suspended_devices=1} --repair --use-policies helter_skelter/syncd_secondary_log_2legs_1") at dmeventd_lvm.c:146
#23 0x00007f9ea1688ed6 in _remove_failed_devices (device=0x7f9e9c001330 "helter_skelter-syncd_secondary_log_2legs_1") at dmeventd_mirror.c:144
#24 0x00007f9ea168905d in process_event (dmt=0x7f9e9c0011e0, event=DM_EVENT_DEVICE_ERROR, unused=0x13a6ea8) at dmeventd_mirror.c:188
#25 0x0000000000403d5f in _do_process_event (thread=0x13a6e30, task=0x7f9e9c0011e0) at dmeventd.c:721
#26 0x000000000040409b in _monitor_thread (arg=0x13a6e30) at dmeventd.c:845
#27 0x0000003af14079d1 in start_thread () from /lib64/libpthread.so.0
#28 0x0000003af10e8b3d in clone () from /lib64/libc.so.6
Bug 855398 would resurface if we choose to disable PVs on mirror LVs. We may wish to add a config option that toggles the ability to use mirror LVs as PVs. We can skip them by default, or we can strongly encourage users to disable PVs on mirrors. The problem is that if we choose not to disable it by default, this bug has the potential to affect all mirror users. If we disable it by default, then only users who stack VGs on mirrors will be affected. The second group is much smaller, but bug 855398 does show there are some people out there who do this.
History of the capability to put PVs on mirrors:
RHEL 6.0) cluster VGs: yes; single machine: yes
* Bug 1009700 (i.e. this bug) is present, but not yet discovered.
RHEL 6.3) cluster VGs: NO; single machine: yes
- Commit 7a965168193968fd1a34c766f0b435e4f08ef4ac caused clvmd to always set _ignore_suspended_devices. The code at the time would always avoid reading mirror LVs if this variable was set, so stacking no longer worked in a cluster. Bug 855398 was filed in response.
- The fix for bug 855398 was the realization that avoiding mirrors whenever _ignore_suspended_devices was set solved the problem of reading blocked mirrors by simply ignoring mirrors altogether. Instead, mirrors needed to be checked for failed devices and avoided only when necessary. This was done through the mirror status string via 'ignore_blocked_mirror_devices()'.
* Bug 1009700 (i.e. this bug) is present, but not yet discovered.
RHEL 6.4) cluster VGs: yes; single machine: yes
- The changes made to fix bug 855398 were correct, but they resulted in a new bug being discovered - bug 1009700. This bug had always been present, but it hadn't been seen until this point because mirror LVs were not being read when _ignore_suspended_devices was set.
It should be noted that avoiding mirrors while '_ignore_suspended_devices' is set does not eliminate bug 1009700 (i.e. this bug). It simply means that the commands that had set the variable (like commands that perform repairs) would not hit it. Other commands like pvs, lvs, vgs, lvcreate, vgcreate, lvchange, vgchange, etc. are still able to block on mirrors with failures. Once blocked, those commands prevent any subsequent repair from proceeding.
Workaround committed upstream:
commit d5896f0afd28096c3dc04a9c7e9627b387f145fb
Author: Jonathan Brassow <jbrassow@redhat.com>
Date: Tue Oct 22 19:14:33 2013 -0500

Mirror: Fix hangs and lock-ups caused by attempting label reads of mirrors

There is a problem with the way mirrors have been designed to handle failures that is resulting in stuck LVM processes and hung I/O. When mirrors encounter a write failure, they block I/O and notify userspace to reconfigure the mirror to remove failed devices. This process is open to a couple of races:
1) Any LVM process other than the one that is meant to deal with the mirror failure can attempt to read the mirror, fail, and block other LVM commands (including the repair command) from proceeding, due to holding a lock on the volume group.
2) If there are multiple mirrors that suffer a failure in the same volume group, a repair can block while attempting to read the LVM label from one mirror while trying to repair the other.
Mitigation of these races has been attempted by disallowing label reading of mirrors that are either suspended or are indicated as blocking by the kernel. While this has considerably narrowed the window of opportunity for hitting the above problems, it hasn't closed it completely. This is because it is still possible to start an LVM command, read the status of the mirror as healthy, and then perform the read for the label at the moment after the failure is discovered by the kernel.
I can see two solutions to this problem:
1) Allow users to configure whether mirrors can be candidates for LVM labels (i.e. whether PVs can be created on mirror LVs). If the user chooses to allow label scanning of mirror LVs, it will be at the expense of a possible hang in I/O or LVM processes.
2) Instrument a way to allow asynchronous label reading - allowing blocked label reads to be ignored while continuing to process the LVM command.
This action would allow LVM commands to continue even though they would have otherwise blocked trying to read a mirror. They could then release their lock and allow a repair command to commence. In the case of race #2 above, the repair command already in progress could continue and repair the failed mirror. This patch implements solution #1. If solution #2 is developed later on, the configuration option created in #1 can be negated, allowing mirrors to be scanned for labels by default once again.
+#define DEFAULT_IGNORE_LVM_MIRRORS 1
Flipped to "0" for the 6.5 defaults, as agreed...
prajnoha, you may also need to switch 'ignore_lvm_mirrors' in the example config file...
QA, please be advised that you must change the configuration file setting to 'ignore_lvm_mirrors = 1' in order to verify that this bug has been fixed. Users will be made aware of the necessity to change the setting via the RHEL 6.5 release notes.
Tested with ignore_lvm_mirrors = 0, which caused the deadlock in the 3rd iteration. Changed the setting to ignore_lvm_mirrors = 1, and five iterations of the above-mentioned test (kill_both_logs_2_legs_2_logs) passed without issues. Marking VERIFIED with lvm2-2.02.100-7.el6.
This issue still exists in the latest rpms. I've reproduced it a few times now, even with 'ignore_lvm_mirrors = 1'. I'll leave this bug marked verified and clone this bug for 6.6, since it does appear to be harder to hit now with this fix.

# Set to 1 to disallow stacking and thereby avoid a possible deadlock.
ignore_lvm_mirrors = 1

2.6.32-425.el6.x86_64
lvm2-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-libs-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
lvm2-cluster-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
udev-147-2.51.el6 BUILT: Thu Oct 17 13:14:34 CEST 2013
device-mapper-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-libs-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-event-libs-1.02.79-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013
device-mapper-persistent-data-0.2.8-2.el6 BUILT: Mon Oct 21 16:14:25 CEST 2013
cmirror-2.02.100-8.el6 BUILT: Wed Oct 30 09:10:56 CET 2013

================================================================================
Iteration 0.10 started at Wed Oct 30 19:22:16 CDT 2013
================================================================================
Scenario kill_multiple_legs_synced_4_legs: Kill multiple legs of synced 4 leg mirror(s)

********* Mirror hash info for this scenario *********
* names: syncd_multiple_legs_4legs_1
* sync: 1
* striped: 0
* leg devices: /dev/sdb1 /dev/sdd1 /dev/sdf1 /dev/sde1
* log devices: /dev/sdc1
* no MDA devices:
* failpv(s): /dev/sdb1 /dev/sde1
* failnode(s): virt-002.cluster-qe.lab.eng.brq.redhat.com
* lvmetad: 0
* leg fault policy: allocate
* log fault policy: allocate
******************************************************

Creating mirror(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com...
virt-002.cluster-qe.lab.eng.brq.redhat.com: lvcreate -m 3 -n syncd_multiple_legs_4legs_1 -L 500M helter_skelter /dev/sdb1:0-1000 /dev/sdd1:0-1000 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdc1:0-150 Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_multiple_legs_4legs_1 mwi-a-m--- 500.00m 8.80 syncd_multiple_legs_4legs_1_mimage_0(0),syncd_multiple_legs_4legs_1_mimage_1(0),syncd_multiple_legs_4legs_1_mimage_2(0),syncd_multiple_legs_4legs_1_mimage_3(0) [syncd_multiple_legs_4legs_1_mimage_0] Iwi-aom--- 500.00m /dev/sdb1(0) [syncd_multiple_legs_4legs_1_mimage_1] Iwi-aom--- 500.00m /dev/sdd1(0) [syncd_multiple_legs_4legs_1_mimage_2] Iwi-aom--- 500.00m /dev/sdf1(0) [syncd_multiple_legs_4legs_1_mimage_3] Iwi-aom--- 500.00m /dev/sde1(0) [syncd_multiple_legs_4legs_1_mlog] lwi-aom--- 4.00m /dev/sdc1(0) Waiting until all mirror|raid volumes become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) Creating ext on top of mirror(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com... mke2fs 1.41.12 (17-May-2010) Mounting mirrored ext filesystems on virt-002.cluster-qe.lab.eng.brq.redhat.com... PV=/dev/sde1 syncd_multiple_legs_4legs_1_mimage_3: 3.1 PV=/dev/sdb1 syncd_multiple_legs_4legs_1_mimage_0: 4.1 PV=/dev/sde1 syncd_multiple_legs_4legs_1_mimage_3: 3.1 PV=/dev/sdb1 syncd_multiple_legs_4legs_1_mimage_0: 4.1 Writing verification files (checkit) to mirror(s) on... ---- virt-002.cluster-qe.lab.eng.brq.redhat.com ---- Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure Verifying files (checkit) on mirror(s) on... 
---- virt-002.cluster-qe.lab.eng.brq.redhat.com ---- Disabling device sdb on virt-002.cluster-qe.lab.eng.brq.redhat.com Disabling device sde on virt-002.cluster-qe.lab.eng.brq.redhat.com Getting recovery check start time from /var/log/messages: Oct 31 01:23 Attempting I/O to cause mirror down conversion(s) on virt-002.cluster-qe.lab.eng.brq.redhat.com 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.443132 s, 94.7 MB/s [DEADLOCK] lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. qarshd[10931]: Running cmdline: echo offline > /sys/block/sde/device/state & lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: Mirror status: 1 of 4 images failed. lvm[3678]: Trying to up-convert to 4 images, 1 logs. lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 for events. lvm[3678]: Monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: WARNING: Failed to replace 2 of 4 images in volume syncd_multiple_legs_4legs_1 lvm[3678]: WARNING: Failed to replace 1 of 1 logs in volume syncd_multiple_legs_4legs_1 [...] lvm[3678]: /dev/sde1: read failed after 0 of 1024 at 4096: Input/output error lvm[3678]: WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 110 lvm[3678]: Automatic metadata correction failed lvm[3678]: Volume group "helter_skelter" not found lvm[3678]: ABORTING: Can't reread VG for helter_skelter/syncd_multiple_legs_4legs_1 lvm[3678]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1 failed. lvm[3678]: Failed to remove faulty devices in helter_skelter-syncd_multiple_legs_4legs_1. lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1 is now in-sync. lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 is now in-sync. lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. 
lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. lvm[3678]: Secondary mirror device 253:6 has failed (D). lvm[3678]: Device failure in helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4. lvm[3678]: Names including "_mimage" are reserved. Please choose a different LV name. lvm[3678]: Run `lvconvert --help' for more information. lvm[3678]: Repair of mirrored device helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4 failed. lvm[3678]: Failed to remove faulty devices in helter_skelter-syncd_multiple_legs_4legs_1_mimagetmp_4. qarshd[11032]: Running cmdline: pvs -a lvm[3678]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_4legs_1 for events. kernel: sd 4:0:0:1: rejecting I/O to offline device [...] qarshd[11036]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_multiple_legs_4legs_1/ddfile count=10 bs=4M lvm[3678]: helter_skelter-syncd_multiple_legs_4legs_1 is now in-sync. qarshd[11038]: Running cmdline: sync kernel: INFO: task flush-253:7:10895 blocked for more than 120 seconds. kernel: Not tainted 2.6.32-425.el6.x86_64 #1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: flush-253:7 D 0000000000000000 0 10895 2 0x00000080 kernel: ffff8800248a1840 0000000000000046 ffff8800ffffffff 0001a5093836ed53 kernel: ffff8800248a17b0 ffff88003dfd71a0 00000000029d00ac ffffffffad8ebae9 kernel: ffff88003d54bab8 ffff8800248a1fd8 000000000000fbc8 ffff88003d54bab8 kernel: Call Trace: kernel: [<ffffffff810a7351>] ? ktime_get_ts+0xb1/0xf0 kernel: [<ffffffff811bf3d0>] ? sync_buffer+0x0/0x50 kernel: [<ffffffff81528a63>] io_schedule+0x73/0xc0 kernel: [<ffffffff811bf410>] sync_buffer+0x40/0x50 kernel: [<ffffffff815292fa>] __wait_on_bit_lock+0x5a/0xc0 kernel: [<ffffffff811bf3d0>] ? sync_buffer+0x0/0x50 kernel: [<ffffffff815293d8>] out_of_line_wait_on_bit_lock+0x78/0x90 kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50 kernel: [<ffffffff811bf740>] ? 
end_buffer_async_write+0x0/0x190 kernel: [<ffffffff811bf5b6>] __lock_buffer+0x36/0x40 kernel: [<ffffffff811c08b5>] __block_write_full_page+0x305/0x330 kernel: [<ffffffff811bf740>] ? end_buffer_async_write+0x0/0x190 kernel: [<ffffffff811c09c0>] block_write_full_page_endio+0xe0/0x120 kernel: [<ffffffffa03aeb30>] ? buffer_unmapped+0x0/0x20 [ext3] kernel: [<ffffffff811c0a15>] block_write_full_page+0x15/0x20 kernel: [<ffffffffa03af6dd>] ext3_ordered_writepage+0x1ed/0x240 [ext3] kernel: [<ffffffff81133967>] __writepage+0x17/0x40 kernel: [<ffffffff81134c2d>] write_cache_pages+0x1fd/0x4c0 kernel: [<ffffffff81133950>] ? __writepage+0x0/0x40 kernel: [<ffffffff81134f14>] generic_writepages+0x24/0x30 kernel: [<ffffffff81134f55>] do_writepages+0x35/0x40 kernel: [<ffffffff811b52ad>] writeback_single_inode+0xdd/0x290 kernel: [<ffffffff811b56ad>] writeback_sb_inodes+0xbd/0x170 kernel: [<ffffffff811b580b>] writeback_inodes_wb+0xab/0x1b0 kernel: [<ffffffff811b5c03>] wb_writeback+0x2f3/0x410 kernel: [<ffffffff815282d0>] ? thread_return+0x4e/0x76e
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1704.html