1127231 – dmeventd hanging while handling lost leg in RAID1 LV

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1127231 - dmeventd hanging while handling lost leg in RAID1 LV

Summary: dmeventd hanging while handling lost leg in RAID1 LV

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jes Sorensen
QA Contact:	XiaoNi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-08-06 12:37 UTC by Marian Csontos
Modified:	2014-10-14 06:20 UTC (History)
CC List:	14 users (show)
Fixed In Version:	kernel-2.6.32-501.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-10-14 06:20:50 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
details from the (virtual) machine (246.75 KB, text/plain) 2014-08-06 12:37 UTC, Marian Csontos	no flags	Details
Another log file showing the problem (8.77 MB, text/plain) 2014-08-29 16:36 UTC, Jonathan Earl Brassow	no flags	Details
Another log file showing the problem (17.93 MB, text/plain) 2014-08-29 16:42 UTC, Jonathan Earl Brassow	no flags	Details
Another log file showing the problem (13.41 MB, text/plain) 2014-08-30 16:13 UTC, Jonathan Earl Brassow	no flags	Details
Another log file showing the problem (7.89 MB, text/plain) 2014-08-30 16:19 UTC, Jonathan Earl Brassow	no flags	Details
Problem showing "frozen" sync action from rhel kernel 6c942da (325.26 KB, text/plain) 2014-08-31 18:21 UTC, Jonathan Earl Brassow	no flags	Details
Latest repo with commit 6c942da (6.54 MB, text/plain) 2014-09-02 13:46 UTC, Jonathan Earl Brassow	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2014:1392	0	normal	SHIPPED_LIVE	Important: kernel security, bug fix, and enhancement update	2014-10-14 01:28:44 UTC

Description Marian Csontos 2014-08-06 12:37:54 UTC

Created attachment 924469 [details]
details from the (virtual) machine

Description of problem:
dmeventd may hang while handling lost leg in (out-of-sync) RAID1 LV.
lvmetad was not running.
From the output in messages file it looks like dmeventd tried to replace the leg twice. Only one thread 

I was unable to get core dump (signals blocked?)

Version-Release number of selected component (if applicable):
lvm2-2.02.110-0.31.el6.x86_64 (basically the same as latest RHEL6.6 package)
kernel-2.6.32-494.el6.x86_64
udev-147-2.57.el6.x86_64

How reproducible:
Not very often.

Steps to Reproduce:
2. offline leg of non-synced RAID1 LV

Actual results:
dmeventd hangs

Expected results:
no hang

Additional info:

Running:

    ...
    lvcreate --type raid1 -m 3 -n non_synced_random_raid1_3legs_1 -L 3G black_bird /dev/sda1:0-2400 /dev/sdf1:0-2400 /dev/sdb1:0-2400 /dev/sdc1:0-2400
    # run IO task on the device here...
    echo offline > /sys/block/sdb/device/state
    pvs -a

dmevent stack:

    dmeventd      D 0000000000000000     0 23733      1 0x00000080
    ffff8800224119b8 0000000000000082 ffff880022411928 0000000000000286
    ffff880022411938 ffffffff8109eefc ffff88003bf39000 ffff880022411ae8                                                                   
    0000000000000001 ffff880022411ab8 ffff880025a665f8 ffff880022411fd8
    Call Trace:
    [<ffffffff8109eefc>] ? remove_wait_queue+0x3c/0x50
    [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
    [<ffffffff8152a0a5>] schedule_timeout+0x215/0x2e0
    [<ffffffff8106d165>] ? enqueue_entity+0x125/0x450
    [<ffffffff8105dea4>] ? check_preempt_wakeup+0x1a4/0x260
    [<ffffffff8106d58b>] ? enqueue_task_fair+0xfb/0x100
    [<ffffffff81529d23>] wait_for_common+0x123/0x180
    [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
    [<ffffffff81529e3d>] wait_for_completion+0x1d/0x20
    [<ffffffff8109e6db>] kthread_stop+0x4b/0xd0
    [<ffffffff814151f4>] md_unregister_thread+0x54/0xa0
    [<ffffffff8141995d>] md_reap_sync_thread+0x1d/0x150
    [<ffffffff81419ac0>] __md_stop_writes+0x30/0x90
    [<ffffffff8141a455>] md_stop_writes+0x25/0x40
    [<ffffffffa03b8236>] raid_presuspend+0x16/0x20 [dm_raid]
    [<ffffffffa00043a2>] suspend_targets+0x42/0x80 [dm_mod]
    [<ffffffffa00043f5>] dm_table_presuspend_targets+0x15/0x20 [dm_mod]
    [<ffffffffa0001343>] dm_suspend+0x63/0x1b0 [dm_mod]
    [<ffffffffa0008e80>] ? dev_suspend+0x0/0x260 [dm_mod]
    [<ffffffffa0008ef6>] dev_suspend+0x76/0x260 [dm_mod]
    [<ffffffffa0008e80>] ? dev_suspend+0x0/0x260 [dm_mod]
    [<ffffffffa0009e14>] ctl_ioctl+0x214/0x450 [dm_mod]
    [<ffffffff8106d58b>] ? enqueue_task_fair+0xfb/0x100
    [<ffffffffa000a063>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
    [<ffffffff811a3502>] vfs_ioctl+0x22/0xa0
    [<ffffffff811a36a4>] do_vfs_ioctl+0x84/0x580
    [<ffffffff8118f791>] ? __fput+0x1a1/0x210
    [<ffffffff811a3c21>] sys_ioctl+0x81/0xa0
    [<ffffffff810e5aee>] ? __audit_syscall_exit+0x25e/0x290
    [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

More in the attached file...

Comment 2 Corey Marthaler 2014-08-06 16:25:52 UTC

Just hit this as well...


================================================================================
Iteration 2.2 started at Mon Aug  4 21:49:13 CDT 2014
================================================================================
Scenario kill_primary_non_synced_raid1_3legs: Kill primary leg of NON synced 3 leg raid1 volume(s)

********* RAID hash info for this scenario *********
* names:              non_synced_primary_raid1_3legs_1 non_synced_primary_raid1_3legs_2 non_synced_primary_raid1_3legs_3
* sync:               0
* type:               raid1
* -m |-i value:       3
* leg devices:        /dev/sdg1 /dev/sdh1 /dev/sdc1 /dev/sdd1
* spanned legs:        0
* failpv(s):          /dev/sdg1
* failnode(s):        host-049.virt.lab.msp.redhat.com
* lvmetad:            0
* raid fault policy:  warn
******************************************************

Creating raids(s) on host-049.virt.lab.msp.redhat.com...
host-049.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 3 -n non_synced_primary_raid1_3legs_1 -L 3G black_bird /dev/sdg1:0-2400 /dev/sdh1:0-2400 /dev/sdc1:0-2400 /dev/sdd1:0-2400
host-049.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 3 -n non_synced_primary_raid1_3legs_2 -L 3G black_bird /dev/sdg1:0-2400 /dev/sdh1:0-2400 /dev/sdc1:0-2400 /dev/sdd1:0-2400
host-049.virt.lab.msp.redhat.com: lvcreate --type raid1 -m 3 -n non_synced_primary_raid1_3legs_3 -L 3G black_bird /dev/sdg1:0-2400 /dev/sdh1:0-2400 /dev/sdc1:0-2400 /dev/sdd1:0-2400

[...]

Current sync percent just before failure
        ( 28.26% 22.63% 18.09% )
Disabling device sdg on host-049.virt.lab.msp.redhat.com

Aug  4 21:50:15 host-049 qarshd[6884]: Running cmdline: pvs -a
Aug  4 21:50:46 host-049 kernel: md: mdX: resync done.
Aug  4 21:52:52 host-049 kernel: INFO: task dmeventd:6600 blocked for more than 120 seconds.
Aug  4 21:52:53 host-049 kernel:      Not tainted 2.6.32-494.el6.x86_64 #1
Aug  4 21:52:53 host-049 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  4 21:52:53 host-049 kernel: dmeventd      D 0000000000000000     0  6600      1 0x00000080
Aug  4 21:52:53 host-049 kernel: ffff880013833a18 0000000000000086 ffff8800138339a8 ffffffff81041e98
Aug  4 21:52:53 host-049 kernel: ffff8800ffffffff 000027b6866d857b ffff8800138339e8 ffff8800398ac360
Aug  4 21:52:53 host-049 kernel: 000000000176eee2 ffffffffac1ffa23 ffff88002aac1ab8 ffff880013833fd8
Aug  4 21:52:53 host-049 kernel: Call Trace:
Aug  4 21:52:53 host-049 kernel: [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
Aug  4 21:52:53 host-049 kernel: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0
Aug  4 21:52:53 host-049 kernel: [<ffffffff81529993>] io_schedule+0x73/0xc0
Aug  4 21:52:53 host-049 kernel: [<ffffffff811ce56d>] __blockdev_direct_IO_newtrunc+0xb7d/0x1270
Aug  4 21:52:53 host-049 kernel: [<ffffffff811c9ed0>] ? blkdev_get_block+0x0/0x20
Aug  4 21:52:53 host-049 kernel: [<ffffffff811cecd7>] __blockdev_direct_IO+0x77/0xe0
Aug  4 21:52:53 host-049 kernel: [<ffffffff811c9ed0>] ? blkdev_get_block+0x0/0x20
Aug  4 21:52:53 host-049 kernel: [<ffffffff811caf57>] blkdev_direct_IO+0x57/0x60
Aug  4 21:52:53 host-049 kernel: [<ffffffff811c9ed0>] ? blkdev_get_block+0x0/0x20
Aug  4 21:52:53 host-049 kernel: [<ffffffff811261db>] generic_file_aio_read+0x6bb/0x700
Aug  4 21:52:53 host-049 kernel: [<ffffffff81234841>] ? avc_has_perm+0x71/0x90
Aug  4 21:52:53 host-049 kernel: [<ffffffff812366f2>] ? selinux_inode_permission+0x72/0xb0
Aug  4 21:52:53 host-049 kernel: [<ffffffff811ca311>] blkdev_aio_read+0x51/0x80
Aug  4 21:52:53 host-049 kernel: [<ffffffff8118dcaa>] do_sync_read+0xfa/0x140
Aug  4 21:52:53 host-049 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Aug  4 21:52:53 host-049 kernel: [<ffffffff811ca26c>] ? block_ioctl+0x3c/0x40
Aug  4 21:52:53 host-049 kernel: [<ffffffff811a3502>] ? vfs_ioctl+0x22/0xa0
Aug  4 21:52:53 host-049 kernel: [<ffffffff8123a24b>] ? selinux_file_permission+0xfb/0x150
Aug  4 21:52:53 host-049 kernel: [<ffffffff8122d0a6>] ? security_file_permission+0x16/0x20
Aug  4 21:52:53 host-049 kernel: [<ffffffff8118e665>] vfs_read+0xb5/0x1a0
Aug  4 21:52:53 host-049 kernel: [<ffffffff8118e7a1>] sys_read+0x51/0x90
Aug  4 21:52:53 host-049 kernel: [<ffffffff810e5aee>] ? __audit_syscall_exit+0x25e/0x290
Aug  4 21:52:53 host-049 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment 5 Jonathan Earl Brassow 2014-08-22 00:45:45 UTC

What is dmeventd blocking on?  Are there any suspended DM devices it might be trying to read?

Comment 6 Marian Csontos 2014-08-22 07:25:50 UTC

See the attachment - there is everything I collected (I missed inactive table)

Comment 7 Jonathan Earl Brassow 2014-08-22 15:58:41 UTC

When running STS tests, I can usually figure out what test is being run, but it takes me a while.  Could you save me time by always listing the following:

black_bird -e kill_random_non_synced_raid1_3legs

Then I just need to add my specific options, but the test and scenario is given (as it appears on the command line).

Comment 8 Jonathan Earl Brassow 2014-08-22 16:09:15 UTC

One possibily interesting thing in the attachment was:
    black_bird-non_synced_random_raid1_3legs_1: 0 6291456 raid raid1 4 AADA 0/6291456 frozen 0

Comment 9 Jonathan Earl Brassow 2014-08-23 04:06:05 UTC

The original post doesn't seem to contain snapshots.  All of the cases I've repo'ed so far have.

Comment 10 Jonathan Earl Brassow 2014-08-23 04:50:34 UTC

These messages are very strange... and there are many of them:
Aug 22 16:52:08 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:08 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:18 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:18 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:28 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:28 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call
Aug 22 16:52:38 host-075 lvm[3646]: device-mapper: waitevent ioctl on  failed: Interrupted system call

I wish it would print out something for "ioctl on _?_ failed:".  I don't know if I've ever seen these error messages before.

Comment 11 Jonathan Earl Brassow 2014-08-23 05:58:48 UTC

I will test this with a 6.5 kernel.  If it works there, the problem can be bisected.

Comment 12 Jonathan Earl Brassow 2014-08-23 06:00:44 UTC

> I wish it would print out something for "ioctl on _?_ failed:".  I don't
> know if I've ever seen these error messages before.

That message has only been recorded once according to bugzilla in bug 624148.

Comment 13 Jonathan Earl Brassow 2014-08-23 06:06:50 UTC

The following know issue is likely not related to this bug, but it helps me remember some reasons for hanging I/O: bug 889368.

Comment 14 Jonathan Earl Brassow 2014-08-23 06:29:13 UTC

lots of wait_barrier changes between 6.5 kernel and 6.6 kernel in md/raid1.c, which seems to be where all the problems are coming from:

INFO: task mdX_resync:11334 blocked for more than 120 seconds.                  
      Not tainted 2.6.32-497.el6.x86_64 #1                                      
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.       
mdX_resync    D 0000000000000000     0 11334      2 0x00000080                  
 ffff8801fada5be0 0000000000000046 ffff88021a3bcc00 0000000300000001            
 ffff880218b92200 ffff880218668f50 0000000000000282 0000000000000003            
 0000000000000001 0000000000000282 ffff88021bfb45f8 ffff8801fada5fd8            
Call Trace:                                                                     
 [<ffffffff8140e439>] ? md_wakeup_thread+0x39/0x70                              
 [<ffffffffa02b81dd>] sync_request+0x26d/0xb90 [raid1]                          
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20                          
 [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90                              
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40                       
 [<ffffffff8105bd23>] ? __wake_up+0x53/0x70                                     
 [<ffffffff81414a47>] md_do_sync+0x6c7/0xd20                                    
 [<ffffffff814154f5>] md_thread+0x115/0x150                                     
 [<ffffffff814153e0>] ? md_thread+0x0/0x150                                     
 [<ffffffff8109e66e>] kthread+0x9e/0xc0                                         
 [<ffffffff8100c20a>] child_rip+0xa/0x20                                        
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0                                        
 [<ffffffff8100c200>] ? child_rip+0x0/0x20                                      


and...


INFO: task kjournald:10732 blocked for more than 120 seconds.                   
      Not tainted 2.6.32-497.el6.x86_64 #1                                      
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.       
kjournald     D 0000000000000008     0 10732      2 0x00000080                  
 ffff8801d71238b0 0000000000000046 0000000000000000 0000000300000001            
 ffff880218b92200 ffff880218668f50 000003783fd7424b 0000000000000003            
 0000000000000001 000000010035a163 ffff880219135af8 ffff8801d7123fd8            
Call Trace:                                                                     
 [<ffffffffa02b463f>] wait_barrier+0xcf/0x280 [raid1]                           
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40                       
 [<ffffffffa02b71bf>] make_request+0x8f/0xe40 [raid1]                           
 [<ffffffff81126af5>] ? mempool_alloc_slab+0x15/0x20                            
 [<ffffffff81126c93>] ? mempool_alloc+0x63/0x140                                
 [<ffffffff811c9d6b>] ? bio_alloc_bioset+0x5b/0xf0                              
 [<ffffffffa02cf018>] raid_map+0x18/0x20 [dm_raid]

Comment 15 Jonathan Earl Brassow 2014-08-23 12:31:42 UTC

has this happened with anything other than raid1?

Comment 16 Jonathan Earl Brassow 2014-08-24 20:41:28 UTC

1) Did this happen in RHEL6.5?

2) Does it happen on RHEL6.5 with a stock kernel and 6.6 LVM rpm?

3) Does it happen on RHEL6.6 with a 6.5 LVM rpm?

4) Does it happen with the upstream kernel?  When was the last time it worked?


I am currently opperating under the assumption that the raid1 changes to wait_barrier were no good and that the kernel can be bisected to find the problem.  Please answer the above questions if you can.

Comment 17 Jonathan Earl Brassow 2014-08-25 00:50:02 UTC

It works with rhel6.5 kernel and upstream/6.6 LVM rpms.  The problem is in the kernel.

Comment 18 Jonathan Earl Brassow 2014-08-25 01:04:41 UTC

I am certain enough that this is a kernel issue to make me switch components.  I believe it is still a blocker.

I have not been able to reproduce this issue in a simpler way, so choosing the subcomponent is a little trickier - either MD or DM.  Device-mapper provides a small wrapper layer to interface with the MD RAID personalities (linux/drivers/md/dm-raid.c).  There have been no changes to this device-mapper file since 6.5, but there may have been a change in MD that required a change in device-mapper that hasn't been done.  I think it is more likely that the problem has been introduced in MD (see comment 14).

Comment 20 Jonathan Earl Brassow 2014-08-26 01:17:36 UTC

Mikulas reports that (from comment 14) that the problem could be with the 2nd 'wait_event...' in 'raise_barrier'.  Changes were made to this code in the following commits:
- rhel6.6:  6c942dae97c883e67dee6364532a767d1343f2b9
- upstream: 79ef3a8aa1cb1523cc231c9a90a278333c21f761

Comment 21 Jonathan Earl Brassow 2014-08-26 04:41:31 UTC

I've been bisecting on
(rhel6.5) 85e68d4f4cc010d07515f2eb1975cfc52756a969, which contains the commit from comment 20.  I was expecting a failure, but haven't received it after 4 hours and am assuming it to be good.

My bisect has progress like this so far:
  git bisect start
  git bisect bad #a00460be15ff7fc5b16cb56bdba44443f3f1a8d2
  git bisect good 09fc14f7f34eadff9f79669c35994a87641add3f
  git bisect good 85e68d4f4cc010d07515f2eb1975cfc52756a969
-- now at a6d2657279c15e3959519135b3cf7f54ab7494f4 (will test through the night)

Comment 22 Jes Sorensen 2014-08-26 08:34:34 UTC

Jonathan,

You think this is a bug in upstream as well as in RHEL?

Cheers
Jes

Comment 23 Jonathan Earl Brassow 2014-08-26 13:42:37 UTC

...
git bisect good a6d2657279c15e3959519135b3cf7f54ab7494f4
- now at 1a302a73032e2bee84ca5e32fde81a4aab3b366a

Comment 24 Jonathan Earl Brassow 2014-08-26 13:45:48 UTC

(In reply to Jes Sorensen from comment #22)
> Jonathan,
> 
> You think this is a bug in upstream as well as in RHEL?
> 

Yes, I do.  I've tested with the latest upstream kernel (git head) and I hit the problem there too.

Unfortunately, the problem does not seem to be where I thought it was.  The commit that I thought caused the problem has been in the last 2 bisects that I've done and I haven't reproduced in those.  Still digging.  (Sometimes I can hit the problem in an hour or so, but failing to hit the bug means the test runs longer - up to 3-4 hrs.)

Comment 25 Jonathan Earl Brassow 2014-08-26 19:09:19 UTC

...
git bisect good 1a302a73032e2bee84ca5e32fde81a4aab3b366a
- now at 62ad98fbfa9ab01675df4994f1d4c5b181955bcd

Comment 26 Alasdair Kergon 2014-08-26 23:07:06 UTC

(In reply to Jonathan Earl Brassow from comment #10)
> These messages are very strange... and there are many of them:
> Aug 22 16:52:08 host-075 lvm[3646]: device-mapper: waitevent ioctl on 
> failed: Interrupted system call

It's blank because the name displayed is one that the kernel reports back to userspace.  Interrupted system call => kernel doesn't fill in the name of the device.  (We could probably fix up the message to at least report the major:minor that was passed into the kernel though a previous log message should already tell you this if you can increase the log level.)

Comment 27 Marian Csontos 2014-08-27 11:44:58 UTC

Jon, to speed it up, I have found a pattern I am checking right now:
- so far all failures happened in non_synced RAID1 3legs scenarios only (4/4)
- this happens only without snapshots (4/4).

So running only the two scenarios kill_random_non_synced_raid1_3legs and kill_primary_non_synced_raid1_3legs (perhaps in a for cycle to get better numbers) with tweaked get_add_snap_devices to always return empty list could significantly reduce run time.

Comment 28 Jonathan Earl Brassow 2014-08-27 13:01:09 UTC

... git bisect good 62ad98fbfa9ab01675df4994f1d4c5b181955bcd
- now at 21835c75f25653821b620f391f09ec92677592c9

Comment 29 Marian Csontos 2014-08-27 13:41:08 UTC

Bad news: when running only these two scenarios repeatedly I am not hitting that issue.

This suggest it is likely related to:
- background storage (which is thin-volumes in my case which may get slower as not everything is trimmed)
- or another component in the system (dmeventd is coming to my mind as a candidate.)

Comment 30 Jonathan Earl Brassow 2014-08-27 18:09:03 UTC

...
git bisect good 21835c75f25653821b620f391f09ec92677592c9
- now at 85a6c98641c01579991d8a00d3b28ba3f2e1d478

I'm now putting my money on the following commit as the culprit (I may short circuit my git bisect to directly check it against the one before:
commit 39575fd84d6793154e58b0bb84be5f7ef8dcca8f
Author: Jes Sorensen <Jes.Sorensen>
Date:   Tue Aug 5 08:48:07 2014 -0400

    [md] raid1, raid10: always abort recover on write error
    
    Message-id: <1407228487-23770-2-git-send-email-Jes.Sorensen>
    Patchwork-id: 87390
    O-Subject: [RHEL6.6 PATCH 1/1] md/raid1, raid10: always abort recover on wri
    Bugzilla: 1124351
    RH-Acked-by: Kyle McMartin <kmcmarti>
    RH-Acked-by: Nikolay Aleksandrov <nikolay>
    RH-Acked-by: Doug Ledford <dledford>
    
    From: NeilBrown <neilb>
    
    Currently we don't abort recovery on a write error if the write error
    to the recovering device was triggerd by normal IO (as opposed to
    recovery IO).
    
    This means that for one bitmap region, the recovery might write to the
    recovering device for a few sectors, then not bother for subsequent
    sectors (as it never writes to failed devices).  In this case
    the bitmap bit will be cleared, but it really shouldn't.
    
    The result is that if the recovering device fails and is then re-added
    (after fixing whatever hardware problem triggerred the failure),
    the second recovery won't redo the region it was in the middle of,
    so some of the device will not be recovered properly.
    
    If we abort the recovery, the region being processes will be cancelled
    (bit not cleared) and the whole region will be retried.
    
    As the bug can result in data corruption the patch is suitable for
    -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
    which will require care.
    
    Original-from: jiao hui <jiaohui.cn>
    Reported-and-tested-by: jiao hui <jiaohui.cn>
    Signed-off-by: NeilBrown <neilb>
    Cc: stable.org
    (cherry picked from commit 2446dba03f9dabe0b477a126cbeb377854785b47)
    Signed-off-by: Jes Sorensen <Jes.Sorensen>
    
    Signed-off-by: Rafael Aquini <aquini>

Comment 31 Jonathan Earl Brassow 2014-08-29 15:35:16 UTC

I'm getting frustrated.  It can take hours to reproduce the issue and there is no why of knowing when.  Testing the commit in comment 30 produces 4+ hours of failure free running (and going).

The bisect may have been a waste (due to not allowing some test to run to failure).

Comment 32 Jonathan Earl Brassow 2014-08-29 16:36:57 UTC

Created attachment 932745 [details]
Another log file showing the problem

here is the log file (with a 'echo t > /proc/sysrq-trigger').

This test failed on iteration 3.24 (7 hours after beginning the test).

Note the processes reported as "blocked".  The resync also stopped in the follow-up log file I will be posting, but there it is missing the "blocked" processes.

Comment 33 Jonathan Earl Brassow 2014-08-29 16:42:38 UTC

Created attachment 932747 [details]
Another log file showing the problem

This log file also contains a sysrq-trigger(t) and an example of a test that has failed after 3.5 hrs of testing.  This one is slightly different in that the resync has gotten stuck and not completed, but there are no other processes being blocked.

Comment 34 Jonathan Earl Brassow 2014-08-30 16:13:25 UTC

Created attachment 932918 [details]
Another log file showing the problem

Another failure after 9.75 hrs.

I've added a "report" feature that allows me to print out a couple key variables that are related to the barrier operations in MD's RAID1 code.  Interesting with this report is that resync has stopped for one of the RAID devices, but there is no resync process blocked.  Also, the barrier count is 0.

Comment 35 Jonathan Earl Brassow 2014-08-30 16:19:00 UTC

Created attachment 932930 [details]
Another log file showing the problem

After 6 hrs, another reproduction.

This instance is more like what I expect to see.  There is a resync thread and the raid1d thread that are blocked.  I've also printed a couple key variables (at the end of this log file) that show that the barrier is maxed out at 10.

Comment 36 Jonathan Earl Brassow 2014-08-31 18:00:13 UTC

I've tried testing the kernel around the following commit:
commit 6c942dae97c883e67dee6364532a767d1343f2b9
Author: Jes Sorensen <Jes.Sorensen>
Date:   Fri May 9 08:26:22 2014 -0400

    [md] raid1: Rewrite the implementation of iobarrier

I am testing this commit and the one before it (d724f8e8942cf571d74266b043105175f075f01a).  It is not even possible to run the tests on commit d724f8e8.  You need only create a RAID1 LV and attempt to mkfs.ext3 on it after it has synced and the machine will crash.  I don't know why the machine is crashing yet, but it does not crash in that way in subsequent commits.

I have also hit a strange problem while testing at commit 6c942dae.  The test fails right away, but the sync state is "frozen" - similar to the problem seen in the very first attachment.  I will post the associated log for that shortly.

I am a bit worried that there are so many different bugs in this series of commits and that some are fixed by subsequent commits that it may be possible to simply hit bugs on every variant and not actually find any information on the bug we are presently trying to solve.

Comment 37 Jonathan Earl Brassow 2014-08-31 18:10:32 UTC

(In reply to Jonathan Earl Brassow from comment #36)

> I am testing this commit and the one before it
> (d724f8e8942cf571d74266b043105175f075f01a).  It is not even possible to run
> the tests on commit d724f8e8.  You need only create a RAID1 LV and attempt
> to mkfs.ext3 on it after it has synced and the machine will crash.  I don't
> know why the machine is crashing yet, but it does not crash in that way in
> subsequent commits.

Sorry, this should have been RAID6 - you can't create a FS on a simple RAID6 LV from commit d724f8e.

Comment 38 Jonathan Earl Brassow 2014-08-31 18:21:32 UTC

Created attachment 933162 [details]
Problem showing "frozen" sync action from rhel kernel 6c942da



It seems the test is running a 'dmsetup suspend' after it finds that the resync is not proceeding as expected.  This action hangs indefinitely, but I believe the problem came before that.  The resync shouldn't be stalling in the first place.

Comment 39 Jes Sorensen 2014-09-01 09:20:36 UTC

Looking through the 7.1 drivers/md update, this one did raise a flag. We don't
have it in RHEL6.6:

commit 2844dc32ea67044b345221067207ce67ffe8da76
Author: hui jiao <simonjiaoh>
Date:   Thu Jun 5 11:34:24 2014 +0800

    md/raid5: deadlock between retry_aligned_read with barrier io
    
    A chunk aligned read increases counter active_aligned_reads and
    decreases it after sub-device handle it successfully. But when a read
    error occurs,  the read redispatched by raid5d, and the
    active_aligned_reads will not be decreased until we can grab a stripe
    head in retry_aligned_read. Now suppose, a barrier io comes, set
    conf->quiesce to 2, and wait until both active_stripes and
    active_aligned_reads are zero. The retried chunk aligned read gets
    stuck at get_active_stripe waiting until conf->quiesce becomes 0.
    Retry_aligned_read and barrier io are waiting each other now.
    One possible solution is that we ignore conf->quiesce, let the retried
    aligned read finish. I reproduced this deadlock and test this patch on
    centos6.0
    
    Signed-off-by: NeilBrown <neilb>

Comment 40 Jonathan Earl Brassow 2014-09-01 13:45:03 UTC

there may be something there too, but we are searching for a RAID1 bug at the moment.

Note that commit e86e6cf (the commit before the MD rebase for 6.6) works just fine.

Also, commit d724f8e (the commit before the iobarrier rewrite - aka 6c942da) operates fine when the test is modified to exclude non-raid1[0] tests.  If the other tests are included, this commit causes a panic with raid6; but that seems to be resolved by later commits.

Commit 6c942da is still the most likely culprit.

Comment 41 Jonathan Earl Brassow 2014-09-01 17:33:02 UTC

Commit 6c942da does not survive the tests due to the RAID6 issue mentioned.  I am also currently testing it with only RAID 1/10 tests.  I'm assuming at some point it will fail and we will know for sure at that time that it is the problem commit.

Comment 42 Jonathan Earl Brassow 2014-09-02 13:44:37 UTC

I've now tested commit 6c942da with only RAID1/10 and was able to hit the issue after 5 hours.

This confirms commit 6c942da as the problem (given that the previous commit d724f8e cannot repo the issue).

Comment 43 Jonathan Earl Brassow 2014-09-02 13:46:43 UTC

Created attachment 933777 [details]
Latest repo with commit 6c942da

This is the log for a failed test run on commit 6c942da - the first commit to cause the problem.

Comment 44 Jonathan Earl Brassow 2014-09-03 04:26:21 UTC

Pulling the following commits seems to fix the issue:
  8b592b6 [md] raid1: fix request counting bug in new 'barrier' code
  6c942da [md] raid1: Rewrite the implementation of iobarrier
HOWEVER, I'll run tests through the night to make sure.

I am also working with Neil Brown (MD upstream maintainer) to investigate the problem.  There is no solution to the issue yet.

Comment 45 Jes Sorensen 2014-09-03 09:17:24 UTC

OK, if these two make the problem go away, at least we should be safe for
7.1 as those two patches are included there.

Cheers,
Jes

Comment 46 Jonathan Earl Brassow 2014-09-03 13:38:24 UTC

Jes, sorry that's not what I meant.  I mean those two patches need to be removed from RHEL for the time being.

I've tested the latest RHEL6 kernel without these two patches for almost a day now - well more than necessary to exhibit the problem - without issue.  They should be removed until we know the problem has been fixed upstream.

Comment 47 Jes Sorensen 2014-09-03 13:56:07 UTC

I will post a revert of those two for 6.6 straight away.

Comment 48 Rafael Aquini 2014-09-08 14:47:02 UTC

Patch(es) available on kernel-2.6.32-501.el6

Comment 55 errata-xmlrpc 2014-10-14 06:20:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html

Note You need to log in before you can comment on or make changes to this bug.