Bug 1443999

Summary: Deadlock in reshape on single core machine
Product: Red Hat Enterprise Linux 7
Reporter: Alasdair Kergon <agk>
Component: kernel
Kernel sub component: RAID
Assignee: Heinz Mauelshagen <heinzm>
QA Contact: Zhang Yi <yizhan>
Status: CLOSED ERRATA
Severity: high
Priority: high
CC: agk, cmarthal, heinzm, jbrassow, lmiksik, mcsontos, mpatocka, msnitzer, neilb, ovasik, yizhan
Version: 7.4
Target Milestone: rc
Target Release: 7.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-3.10.0-683.el7
Last Closed: 2017-08-02 06:10:41 UTC
Type: Bug
Bug Blocks: 1442137
Attachments: verbose lvconvert attempt (see comment 3)

Description Alasdair Kergon 2017-04-20 12:02:24 UTC
Only on a single-core machine, the test suite hangs in lvconvert-raid-reshape-linear_to_striped.sh during a suspend with flush after the new table reload, at:

# Convert raid5_n adding stripes
lvconvert -y --stripes 4 LVMTEST13012vg/LV1


libdm-deptree.c:2803   Adding target to (252:6): 0 131072 raid raid5_n 7 128 region_size 1024 data_offset 2048 delta_disks 3 5 252:7 252:8 252:9 252:10 252:11 252:12 252:13 252:14 252:15 252:16
...
libdm-deptree.c:2803   Adding target to (252:6): 0 131072 raid raid5_n 3 128 region_size 1024 5 252:7 252:8 252:9 252:10 252:11 252:12 252:13 252:14 252:15 252:16

libdm-deptree.c:1419   Suspending LVMTEST13012vg-LV1 (252:6) with device flush


Apr 19 17:50:53 r-bot-rhel74 kernel: Call Trace:
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9c9c989>] schedule+0x29/0x70
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9c9a499>] schedule_timeout+0x239/0x2c0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96bf245>] ? check_preempt_curr+0x85/0xa0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96bf279>] ? ttwu_do_wakeup+0x19/0xd0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9c9cd3d>] wait_for_completion+0xfd/0x140
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96c27c0>] ? wake_up_state+0x20/0x20
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96aeaca>] kthread_stop+0x4a/0xe0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b0335b>] md_unregister_thread+0x4b/0x80
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b0a279>] md_reap_sync_thread+0x19/0x150
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b0a6ab>] __md_stop_writes+0x3b/0xb0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b0a741>] md_stop_writes+0x21/0x30
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc07ca976>] raid_presuspend+0x16/0x20 [dm_raid]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c6c7a>] dm_table_presuspend_targets+0x4a/0x60 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c1928>] __dm_suspend+0xd8/0x210 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c3fc0>] dm_suspend+0xc0/0xd0 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c97b4>] dev_suspend+0x194/0x250 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c9620>] ? table_load+0x390/0x390 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05c9fe5>] ctl_ioctl+0x1e5/0x500 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc05ca313>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb981249d>] do_vfs_ioctl+0x33d/0x540
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb98011b1>] ? __sb_end_write+0x31/0x60
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb97fe3ca>] ? vfs_write+0x17a/0x1e0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9812741>] SyS_ioctl+0xa1/0xc0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9ca8489>] system_call_fastpath+0x16/0x1b


Apr 19 17:50:53 r-bot-rhel74 kernel: Call Trace:
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9c9c989>] schedule+0x29/0x70
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc07ace12>] raid5_get_active_stripe+0x4d2/0x710 [raid456]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96bb494>] ? __wake_up+0x44/0x50
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96af8a0>] ? wake_up_atomic_t+0x30/0x30
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc07b1498>] reshape_request+0x528/0x980 [raid456]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffc07b1b6f>] raid5_sync_request+0x27f/0x400 [raid456]
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96bb494>] ? __wake_up+0x44/0x50
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b06fe4>] md_do_sync+0x9c4/0x1070
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b032c5>] md_thread+0x155/0x1a0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9b03170>] ? find_pers+0x80/0x80
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96ae85f>] kthread+0xcf/0xe0
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96ae790>] ? insert_kthread_work+0x40/0x40
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb9ca83d8>] ret_from_fork+0x58/0x90
Apr 19 17:50:53 r-bot-rhel74 kernel: [<ffffffffb96ae790>] ? insert_kthread_work+0x40/0x40

Comment 1 Jonathan Earl Brassow 2017-04-24 14:21:58 UTC
*** Bug 1441622 has been marked as a duplicate of this bug. ***

Comment 3 Corey Marthaler 2017-04-24 15:30:35 UTC
Created attachment 1273633 [details]
verbose lvconvert attempt

Comment 4 Heinz Mauelshagen 2017-05-19 16:24:26 UTC
The flaw seems to be in the md kernel code, which deadlocks on single core when reaping the synchronization thread, via the call chain:
raid_presuspend ->
md_stop_writes ->
md_reap_sync_thread ->
md_unregister_thread ->
kthread_stop (deadlocks, i.e. the sync thread never finishes)

Comment 5 Mikuláš Patočka 2017-05-27 22:08:46 UTC
I tested this bug.

The reported stack traces are symptoms of the bug, not its cause. The cause is a process mdX_raid5 stuck in an infinite loop, consuming 100% CPU.

The function handle_stripe is called for a stripe; these are the debugging prints:
handling stripe 96, state=0x1401 cnt=1, pd_idx=4, qd_idx=-1 , check:0, reconstruct:6
locked=1 uptodate=5 to_read=0 to_write=0 failed=3 failed_num=4,3

The device failed, so this branch is entered:
if (s.failed > conf->max_degraded || s.log_failed) { // s.failed == 3, conf->max_degraded == 1, s.log_failed == 0
the branch sets:
        sh->check_state = 0;
        sh->reconstruct_state = 0;

Next, we enter this branch (s.to_write is zero, so it does nothing):
if (!sh->reconstruct_state && !sh->check_state && !sh->log_io) {
        if (!r5c_is_writeback(conf->log)) {
                if (s.to_write)       //  s.to_write == 0

Next, we enter this branch:
if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) && !sh->reconstruct_state) {
        schedule_reconstruction(sh, &s, 1, 1);
        schedule_reconstruction ends with s->locked == 2, s->ops_request == (1 << STRIPE_OP_RECONSTRUCT)

Next, we enter this branch:
if (s.ops_request)
        raid_run_ops(sh, s.ops_request);
        raid_run_ops calls ops_run_reconstruct5
        ops_run_reconstruct5 calls async_xor with the callback ops_complete_reconstruct (the xor operation is actually synchronous and the callback is called immediately because the system doesn't have dedicated xor hardware)
        ops_complete_reconstruct is called with sh->sector==96, sh->state==0x1401, sh->reconstruct_state==3
        ops_complete_reconstruct calls set_bit(STRIPE_HANDLE, &sh->state); at the end, which just resubmits the stripe back to handle_stripe

handle_stripe is called with the same stripe again and goes exactly the same way as before. The process consumes 100% CPU in a loop, repeatedly recalculating parity for the same stripe.
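
In effect, the same stripe cycles through (all names from the trace above):

handle_stripe -> schedule_reconstruction -> raid_run_ops -> ops_run_reconstruct5
  -> ops_complete_reconstruct -> set_bit(STRIPE_HANDLE) -> handle_stripe -> ...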

I found out that this patch fixes it - if we clear the bit STRIPE_EXPANDING, we won't go to the branch that calls schedule_reconstruction and the infinite loop won't happen. However, the code is so messy that only the author can tell if this patch is correct or not.
--- rhel7.orig/drivers/md/raid5.c
+++ rhel7/drivers/md/raid5.c
@@ -4517,6 +4517,7 @@ static void handle_stripe(struct stripe_
                        handle_failed_stripe(conf, sh, &s, disks, &s.return_bi);
                if (s.syncing + s.replacing)
                        handle_failed_sync(conf, sh, &s);
+               clear_bit(STRIPE_EXPANDING, &sh->state);
        }

        /* Now we check to see if any write operations have recently

Generally, the code is of very poor quality and unverifiable. There's one big 390-line function, handle_stripe, that does every possible operation on a stripe - it has no straight control flow; the action the function takes is determined by a large number of bit flags. It is impossible to verify that it handles device failure correctly in all states and that there are no more infinite loops like this one.

Properly, handle_stripe should be broken into smaller functions, each doing just one thing, and it should use callbacks, not bit flags, to determine the next operation - then the sequence of operations on a stripe could be understood and verified.
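
For illustration, a hypothetical sketch of that callback style (invented names, not actual raid5 code): each step explicitly names its continuation, so the sequence of operations on a stripe is visible in the code rather than encoded in bit flags.

struct stripe_ctx;
typedef void (*stripe_step_fn)(struct stripe_ctx *ctx);

struct stripe_ctx {
	sector_t sector;		/* from <linux/types.h> */
	stripe_step_fn next;		/* continuation set by the current step */
};

static void step_done(struct stripe_ctx *ctx)
{
	ctx->next = NULL;		/* nothing re-queues this stripe */
}

static void step_write_parity(struct stripe_ctx *ctx)
{
	/* ...submit the parity write... */
	ctx->next = step_done;
}

static void step_reconstruct(struct stripe_ctx *ctx)
{
	/* ...compute parity... */
	ctx->next = step_write_parity;
}

/* Run a stripe to completion; every step must explicitly name a successor. */
static void run_stripe(struct stripe_ctx *ctx, stripe_step_fn first)
{
	ctx->next = first;
	while (ctx->next) {
		stripe_step_fn step = ctx->next;

		ctx->next = NULL;
		step(ctx);
	}
}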

Comment 6 Alasdair Kergon 2017-05-30 14:58:43 UTC
Neil - have you seen anything like this upstream?

Comment 8 Heinz Mauelshagen 2017-05-30 17:18:31 UTC
Added upstream lvm2 commit 3719f4bc5441cb5f29ad4beb91ccaa6b234ea8e1, which rejects changing the number of stripes on single core until an MD kernel fix is available.
Once we have that fix, a feature flag is mandatory to tell flawed and fixed MD kernels apart.

Comment 9 NeilBrown 2017-05-30 21:53:28 UTC
> Neil - have you seen anything like this upstream?

No.

> only the author can tell if this patch is correct or not.

You possibly have unfounded faith in the author....

I assume the context is multiple device failure for a RAID5 while it is being reshaped?  When I try to make that happen it does behave a little strangely.
md0_raid5 and md0_reshape are in 'S' rather than 'R' - haven't looked more deeply yet.

> +               clear_bit(STRIPE_EXPANDING, &sh->state);

This looks like a sensible patch.  As the array is now dead, we just need to stop trying to reshape and let it die in peace.
I suggest posting it to linux-raid.

Comment 10 Heinz Mauelshagen 2017-05-30 22:29:59 UTC
(In reply to NeilBrown from comment #9)
> > Neil - have you seen anything like this upstream?
> 
> No.
> 
> > only the author can tell if this patch is correct or not.
> 
> You possibly have unfounded faith in the author....
> 
> I assume the context is multiple device failure for a RAID5 while it is
> being reshaped?  When I try to make that happen it does behave a little
> strangely.
> md0_raid5 and md0_reshape are in 'S' rather than 'R' - haven't looked more
> deeply yet.
> 
> > +               clear_bit(STRIPE_EXPANDING, &sh->state);
> 
> This looks like a sensible patch.  As the array is now dead, we just need to
> stop trying to reshape and let it die in peace.
> I suggest posting it to linux-raid.

Neil,

the deadlock does not involve failing devices. They are all accessible when we
add/remove stripes.

I tried your suspend check move in md_check_recovery from "4.11.2: reshape raid5 -> raid6 atop bcache deadlocks at start on md_attr_store / raid5_make_request" and it fixes the deadlock reported in this bz.
More testing to confirm this follows...

Comment 11 Mikuláš Patočka 2017-05-31 01:03:22 UTC
As Heinz pointed out, there should be no device failure in the test, but the kernel goes down the failed device path ("if (s.failed > conf->max_degraded || s.log_failed)") on a single-core system, so there seems to be another bug that needs to be analyzed.

Comment 12 NeilBrown 2017-05-31 01:55:10 UTC
I saw:
 // s.failed == 3, 

and I just assumed....

What exactly is the test case trying to do?

Comment 13 Heinz Mauelshagen 2017-05-31 12:28:10 UTC
(In reply to NeilBrown from comment #12)
> I saw:
>  // s.failed == 3, 
> 
> and I just assumed....
> 
> What exactly is the test case trying to do?

There are various examples that deadlock on single-core machines.

A specific example is a 3-way striped raid5_ls (i.e. 4 stripes total) converted to 5 data stripes (i.e. 6 stripes total) using:

lvconvert --yes --stripes 5 RaidLV

Comment 14 Heinz Mauelshagen 2017-05-31 12:39:08 UTC
(In reply to Mikulas Patocka from comment #11)
> As Heinz pointed out, there should be no device failure in the test, but the
> kernel goes down the failed device path ("if (s.failed > conf->max_degraded
> || s.log_failed)") on single-core system, so there seems to be another bug
> that needs to be analyzed.

Mikulas,

mind that you're referring to the RHEL7 kernel-3.10.0-671.el7 while Neil refers to upstream. There may be more MD code to backport to the RHEL7 kernel that influences whether the device failure path is hit.

Comment 15 Mikuláš Patočka 2017-05-31 18:27:55 UTC
I don't know if this bug is present in the upstream kernel or not because in the upstream kernel (since 4.12-rc) the raid subsystem crashes on BUG in lib/percpu-refcount.c:192.

Comment 16 Mikuláš Patočka 2017-05-31 19:13:24 UTC
raid_ctr loads two devices at drivers/md/raid5.c:6824, with this stacktrace
[ 1921.532124]  [<ffffffff9fa8c05b>] dump_stack+0x19/0x1b
[ 1921.532132]  [<ffffffffc0d27c7f>] setup_conf+0x6bf/0xa60 [raid456]
[ 1921.532136]  [<ffffffffc0d2899c>] raid5_run+0x87c/0xbb0 [raid456]
[ 1921.532145]  [<ffffffffc0d00ded>] md_run+0x70d/0xa90 [md_mod]
[ 1921.532148]  [<ffffffffc0cd4202>] ? super_validate.part.26+0x2d2/0x800 [dm_raid]
[ 1921.532151]  [<ffffffffc0cd6075>] raid_ctr+0x755/0x17d0 [dm_raid]
[ 1921.532171]  [<ffffffffc0a4aa57>] dm_table_add_target+0x177/0x430 [dm_mod]
[ 1921.532176]  [<ffffffffc0a4e417>] table_load+0x157/0x390 [dm_mod]
[ 1921.532182]  [<ffffffffc0a4e2c0>] ? retrieve_status+0x1c0/0x1c0 [dm_mod]
[ 1921.532187]  [<ffffffffc0a4f015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
[ 1921.532192]  [<ffffffffc0a4f343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1921.532196]  [<ffffffff9f61416d>] do_vfs_ioctl+0x33d/0x540
[ 1921.532200]  [<ffffffff9f6b618f>] ? file_has_perm+0x9f/0xb0
[ 1921.532203]  [<ffffffff9f602e71>] ? __sb_end_write+0x31/0x60
[ 1921.532205]  [<ffffffff9f614411>] SyS_ioctl+0xa1/0xc0
[ 1921.532209]  [<ffffffff9fa9cb09>] system_call_fastpath+0x16/0x1b

then, raid_ctr ends, raid_preresume is called, and it goes to raid5_start_reshape; raid5_start_reshape starts the kernel thread md_do_sync:
mddev->sync_thread = md_register_thread(md_do_sync, mddev, "reshape");
[ 1921.771345]  [<ffffffff9fa8c05b>] dump_stack+0x19/0x1b
[ 1921.771353]  [<ffffffffc0d270c0>] raid5_start_reshape+0x330/0x430 [raid456]
[ 1921.771357]  [<ffffffffc0cd37b8>] raid_preresume+0x198/0x310 [dm_raid]
[ 1921.771377]  [<ffffffffc0a4bdd4>] dm_table_resume_targets+0x54/0xe0 [dm_mod]
[ 1921.771382]  [<ffffffffc0a490b9>] dm_resume+0xb9/0xe0 [dm_mod]
[ 1921.771387]  [<ffffffffc0a4e77b>] dev_suspend+0x12b/0x250 [dm_mod]
[ 1921.771392]  [<ffffffffc0a4e650>] ? table_load+0x390/0x390 [dm_mod]
[ 1921.771397]  [<ffffffffc0a4f015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
[ 1921.771403]  [<ffffffffc0a4f343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1921.771407]  [<ffffffff9f61416d>] do_vfs_ioctl+0x33d/0x540
[ 1921.771411]  [<ffffffff9f6b618f>] ? file_has_perm+0x9f/0xb0
[ 1921.771414]  [<ffffffff9f602e71>] ? __sb_end_write+0x31/0x60
[ 1921.771416]  [<ffffffff9f614411>] SyS_ioctl+0xa1/0xc0
[ 1921.771420]  [<ffffffff9fa9cb09>] system_call_fastpath+0x16/0x1b

Then md_do_sync runs and prints "md: reshape of RAID array mdX" to the log.

Then raid5d does the reshape work; it finds out in analyse_stripe that only 2 out of 5 devices are active (conf->disks[i].rdev is non-NULL for i == 0,1 and NULL for i == 2,3,4):
[ 1921.836627]  [<ffffffff9fa8c05b>] dump_stack+0x19/0x1b
[ 1921.836635]  [<ffffffffc0d26485>] analyse_stripe+0x4d5/0x900 [raid456]
[ 1921.836639]  [<ffffffffc0d2e50c>] handle_stripe+0x14c/0x2360 [raid456]
[ 1921.836642]  [<ffffffff9f4c4533>] ? try_to_wake_up+0x183/0x340
[ 1921.836646]  [<ffffffff9f58c846>] ? __alloc_pages_nodemask+0x176/0x420
[ 1921.836650]  [<ffffffffc0d30acd>] handle_active_stripes.isra.54+0x3ad/0x530 [raid456]
[ 1921.836653]  [<ffffffffc0d31148>] raid5d+0x4f8/0x7f0 [raid456]
[ 1921.836661]  [<ffffffffc0cf9135>] md_thread+0x155/0x1a0 [md_mod]
[ 1921.836664]  [<ffffffff9f4b18e0>] ? wake_up_atomic_t+0x30/0x30
[ 1921.836668]  [<ffffffffc0cf8fe0>] ? find_pers+0x80/0x80 [md_mod]
[ 1921.836670]  [<ffffffff9f4b095f>] kthread+0xcf/0xe0
[ 1921.836672]  [<ffffffff9f4b0890>] ? insert_kthread_work+0x40/0x40
[ 1921.836676]  [<ffffffff9fa9ca58>] ret_from_fork+0x58/0x90
[ 1921.836678]  [<ffffffff9f4b0890>] ? insert_kthread_work+0x40/0x40
So analyse_stripe concludes that the array has failed - it sees only 2 of the expected 5 devices active (this triggers the previous bug from comment #5).

Later, raid_ctr is called again; this time it assigns all 5 devices (at drivers/md/raid5.c:6824), but it is already too late and the array is considered failed:

[ 1921.928256]  [<ffffffff9fa8c05b>] dump_stack+0x19/0x1b
[ 1921.928264]  [<ffffffffc0d27c7f>] setup_conf+0x6bf/0xa60 [raid456]
[ 1921.928268]  [<ffffffffc0d2899c>] raid5_run+0x87c/0xbb0 [raid456]
[ 1921.928277]  [<ffffffffc0d00ded>] md_run+0x70d/0xa90 [md_mod]
[ 1921.928281]  [<ffffffffc0cd4202>] ? super_validate.part.26+0x2d2/0x800 [dm_raid]
[ 1921.928283]  [<ffffffffc0cd6075>] raid_ctr+0x755/0x17d0 [dm_raid]
[ 1921.928303]  [<ffffffffc0a4aa57>] dm_table_add_target+0x177/0x430 [dm_mod]
[ 1921.928309]  [<ffffffffc0a4e417>] table_load+0x157/0x390 [dm_mod]
[ 1921.928314]  [<ffffffffc0a4e2c0>] ? retrieve_status+0x1c0/0x1c0 [dm_mod]
[ 1921.928319]  [<ffffffffc0a4f015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
[ 1921.928325]  [<ffffffffc0a4f343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1921.928329]  [<ffffffff9f61416d>] do_vfs_ioctl+0x33d/0x540
[ 1921.928333]  [<ffffffff9f6b618f>] ? file_has_perm+0x9f/0xb0
[ 1921.928335]  [<ffffffff9f602e71>] ? __sb_end_write+0x31/0x60
[ 1921.928337]  [<ffffffff9f614411>] SyS_ioctl+0xa1/0xc0
[ 1921.928341]  [<ffffffff9fa9cb09>] system_call_fastpath+0x16/0x1b

So it seems like invalid sequencing by the userspace LVM tool - it loads the raid array with two devices, then calls resume, which starts the reshape operation, and later reloads the device with all five devices, but by then it is already too late.

On SMP systems, the second call to raid_ctr happens before raid5d starts doing the actual reshape, so the reshape sees all 5 devices and succeeds. On single-core systems, raid5d runs before userspace reloads the table the second time with all five devices, so raid5d sees only two devices (expecting five) and fails the whole array.

Comment 17 Heinz Mauelshagen 2017-05-31 21:27:19 UTC
(In reply to Mikulas Patocka from comment #15)
> I don't know if this bug is present in the upstream kernel or not because in
> the upstream kernel (since 4.12-rc) the raid subsystem crashes on BUG in
> lib/percpu-refcount.c:192.

It is present. Tested with 4.11, which doesn't have the percpu-refcount bug.

Comment 18 Heinz Mauelshagen 2017-05-31 21:37:21 UTC
(In reply to Mikulas Patocka from comment #16)

As mentioned in comment #10, adding Neil's patch allows the superblocks to be written and fixes the deadlock on single core.

The 2 raid_ctr calls are intentional: the first one passes in the reshape-related information (delta_disks N and the additional device tuples) but does not actually start the reshape, because it sets the array to frozen. This one needs to update the superblocks to keep that information persistent and retrievable by any follow-up raid_ctr call. The 2nd raid_ctr call does _not_ pass in delta_disks N and clears the frozen array state, thus actually starting the reshape. This 2nd call fails to retrieve proper superblock content without Neil's patch.

Comment 19 Mikuláš Patočka 2017-05-31 22:05:42 UTC
Yes, with this patch https://www.spinics.net/lists/raid/msg57911.html the premature reshape with only two active devices doesn't happen.

Comment 20 NeilBrown 2017-06-01 07:06:15 UTC
> the raid subsystem crashes on BUG in lib/percpu-refcount.c:192.

Oops, that was careless.  mddev->writes_pending is initialised in a function that dm-raid doesn't call.  Sorry.
I think I'll get ->run() to initialise it and ->free() to free it.
That way it doesn't need to be initialised for personalities that don't use it.

Comment 21 NeilBrown 2017-06-01 07:10:16 UTC
> As mentioned in comment #10 adding Neil's patch allows to write the superblocks and fixes the deadlock on single core.

That patch isn't the final version that will go upstream.  It helps when mddev_suspend() is called without ->reconfig_mutex held, but not if it is held - and md mostly calls mddev_suspend() with it held.  I should have the replacement patch out early next week.

I've been looking at the dm-raid code and see things like 
	if (!rs->md.suspended)
		mddev_suspend(&rs->md);

and

	if (mddev->suspended)
		mddev_resume(mddev);

These bother me.  The suspended field should only be for internal use.
Do you remember, Heinz, why you didn't just call mddev_suspend() when you need to suspend, and mddev_resume() when you don't need it to be suspended any more?

Comment 22 Heinz Mauelshagen 2017-06-01 09:46:58 UTC
(In reply to NeilBrown from comment #21)
> > As mentioned in comment #10 adding Neil's patch allows to write the superblocks and fixes the deadlock on single core.
> 
> That patch isn't the final version that will go upstream.  It helps when
> mddev_suspend() is called without ->reconfig_mutex held, but not if it is
> held - and md mostly calls mddev_suspend() with it held.  I should have the
> replacement patch out early next week.
> 
> I've been looking at the dm-raid code and see things like 
> 	if (!rs->md.suspended)
> 		mddev_suspend(&rs->md);
> 
> and
> 
> 	if (mddev->suspended)
> 		mddev_resume(mddev);
> 
> These bother me.  The suspended field should only be for internal use.
> Do you remember, Heinz, why you didn't just call mddev_suspend() when you
> need to suspend, and mddev_resume() when you don't need it to be suspended
> any more?

This was conservative to avoid unbalanced mddev_(suspend|resume) cycles.
I'll have a look at whether we can dodge the conditionals, which I assume we can.

Comment 23 Heinz Mauelshagen 2017-06-01 14:20:15 UTC
(In reply to NeilBrown from comment #20)
> > the raid subsystem crashes on BUG in lib/percpu-refcount.c:192.
> 
> Oops, that was careless.  mddev->writes_pending is initialised in a function
> that dm-raid doesn't call.  Sorry.
> I think I'll get ->run() to initialise it and ->free() to free it.
> That was it doesn't need to be initialised for personalities that don't use
> it.

Not following...

mddev->writes_pending is being initialized in md_run(), which the dm-raid constructor calls. Or did you mean the uninitialized request queue member q_usage_counter, which can't be initialized because md underneath dm doesn't manage a queue?

Comment 24 Heinz Mauelshagen 2017-06-01 14:36:11 UTC
(In reply to Heinz Mauelshagen from comment #22)
> (In reply to NeilBrown from comment #21)
> > > As mentioned in comment #10 adding Neil's patch allows to write the superblocks and fixes the deadlock on single core.
> > 
> > That patch isn't the final version that will go upstream.  It helps when
> > mddev_suspend() is called without ->reconfig_mutex held, but not if it is
> > held - and md mostly calls mddev_suspend() with it held.  I should have the
> > replacement patch out early next week.
> > 
> > I've been looking at the dm-raid code and see things like 
> > 	if (!rs->md.suspended)
> > 		mddev_suspend(&rs->md);
> > 
> > and
> > 
> > 	if (mddev->suspended)
> > 		mddev_resume(mddev);
> > 
> > These bother me.  The suspended field should only be for internal use.
> > Do you remeber, Heinz, why you didn't just call mddev_suspend() when you
> > need to suspend, and mddev_resume() when you don't need it to be suspended
> > any more?
> 
> This was conservative to avoid unbalanced mddev_(suspend|resume) cycles.
> I'll have a look if we can dodge the conditionals, which I assume we can.

Initial tests show it runs fine without the conditionals, but...

From dm, we allow secondary resumes on an already resumed raid set to attempt to restore any previously failed component devices. IOW: suspends and resumes aren't balanced, to allow for this functionality. Though any N-th (N > 1) resume will bail out after decrementing mddev->suspended, a follow-up suspend won't work on the same mddev because that int variable has turned negative.
We could change the logic in raid_resume() to handle the imbalance based on the existing dm-raid resume flag, avoiding access to mddev->suspended.
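
For illustration, a simplified, hypothetical model of the counting semantics described above (not the actual md code): an extra resume drives the counter to -1, so the next suspend only brings it back to 0 and never actually quiesces I/O.

struct mddev_model {
	int suspended;			/* stands in for mddev->suspended */
};

static void model_suspend(struct mddev_model *m)
{
	if (m->suspended++)
		return;			/* already suspended: just bump the count */
	/* ...quiesce I/O here... */
}

static void model_resume(struct mddev_model *m)
{
	if (--m->suspended)
		return;			/* still held - or already driven negative */
	/* ...restart I/O here... */
}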

Comment 25 Heinz Mauelshagen 2017-06-01 17:42:48 UTC
Pushing to 7.5 so we have proper md fixes to address the single-core deadlock.

Comment 26 Heinz Mauelshagen 2017-06-01 21:23:22 UTC
Neil,

I propose adding an md API/code version so we can associate features/API (and bugs) with it, as we do in dm (for instance, to disallow reshape on single core if the md version < x). For example, a new md_version() API returning major+minor+subminor, or these as members of the mddev structure (the latter costing memory).
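
Roughly, a hypothetical sketch of such an interface (nothing like this exists in md today; names are invented):

struct md_version {
	unsigned int major;
	unsigned int minor;
	unsigned int subminor;
};

/* Callers such as dm-raid could gate reshape support or bug workarounds on this. */
struct md_version md_version(void);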

Comment 27 Heinz Mauelshagen 2017-06-07 11:53:33 UTC
Neil submitted final patches to linux-raid which have been applied by Shaohua to be pushed in the 4.12 cycle:
http://marc.info/?l=linux-raid&m=149664539101400&w=2

Comment 28 Heinz Mauelshagen 2017-06-08 12:16:04 UTC
http://marc.info/?l=linux-raid&m=149664539101400&w=2
does not solve the single-core deadlock issue upstream!

https://www.spinics.net/lists/raid/msg57911.html, which Neil said is not the final patch, is still mandatory to avoid the deadlock.

Comment 34 NeilBrown 2017-06-14 02:30:59 UTC
(In reply to Heinz Mauelshagen from comment #26)
> I propose to have an md api/code version to associate features/API (bugs)
> with it as we do in dm (for instance to not allow reshape on single core if
> md version < x). Like a new md_version() API returning major+minor+subminor
> or these as members of the mddev structure (the latter causing memory
> overhead).

It's not really up to me, but I wouldn't take this approach.
If there was an API issue that couldn't be tested directly, I'd just look at the kernel version.  If some distro back-ports a fix (or a bug), it is up to them to make corresponding changes to mdadm.

Comment 35 Heinz Mauelshagen 2017-06-14 11:08:13 UTC
(In reply to NeilBrown from comment #34)
> (In reply to Heinz Mauelshagen from comment #26)
> > I propose to have an md api/code version to associate features/API (bugs)
> > with it as we do in dm (for instance to not allow reshape on single core if
> > md version < x). Like a new md_version() API returning major+minor+subminor
> > or these as members of the mddev structure (the latter causing memory
> > overhead).
> 
> It's not really up to me, but I wouldn't take this approach.
> If there was an API issue that couldn't be tested directly, I'd just look at
> the kernel version.  If some distro back-ports a fix (or a bug), it is up to
> them to make corresponding changes to mdadm.

Kernel version may do, right.

Meanwhile,

do you have an enhanced patch for
https://www.spinics.net/lists/raid/msg57911.html ?

Comment 37 NeilBrown 2017-06-15 00:55:59 UTC
I'd like to understand why that patch makes a difference first.

According to comment #18, there is some problem with a second ctr call not retrieving the correct metadata.
Maybe you are expecting the metadata to be written, and then reloaded, all while the array is suspended?  Why would you even reload the metadata - don't you have an internal copy?  I think I might be missing something.

If there is a place where you need the metadata to be written synchronously for correct operation, then you should write it synchronously.  Call md_update_sb, making sure you hold the mddev_lock().
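
Roughly, the pattern I mean (a minimal sketch only, assuming the md_update_sb() and mddev_lock()/mddev_unlock() interfaces from drivers/md; not dm-raid's actual code):

static int write_sb_sync(struct mddev *mddev)
{
	int ret = mddev_lock(mddev);	/* takes ->reconfig_mutex */

	if (ret)
		return ret;
	md_update_sb(mddev, 1);		/* write the superblocks synchronously */
	mddev_unlock(mddev);
	return 0;
}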

I've looked some more at your use of mddev_suspend/resume and the fact that you resume, and then suspend, inside rs_start_reshape() bothers me.
I presume you do that because ->start_reshape requires a non-suspended device.
That is true as it allocates memory, and so need writeout to be working on all devices.
I think it would be best to find a way to delay the ->start_reshape() call until raid_resume(), after the call the mddev_resume().  It should be perfectly safe to delay start_reshape like this.  All the changes that need to be atomic have already happened, and ->start_reshape() just starts the process of moving data around.

To answer comment #23 - you have already worked this out yourself, but ->writes_pending is initialised in md_alloc():
 
	if (percpu_ref_init(&mddev->writes_pending, no_op, 0, GFP_KERNEL) < 0)
		goto abort;

and dm-raid doesn't call that function.

Comment 40 Heinz Mauelshagen 2017-06-16 21:32:58 UTC
(In reply to NeilBrown from comment #37)
> I'd like to understand why that patch makes a difference first.

Ok.

> 
> According to comment #18, there is some problem with a second ctr call not
> retrieving the correct metadata.

Yes.

> Maybe you are expecting the metadata to be written, and then reloaded, all
> while the array is suspended?  Why would you even reload the metadata -
> don't you have an internal copy.  I think I might be missing something.

dm has an active and an inactive table slot to allow mapping switches on open
devices, and either one causes constructor calls.  For dm-raid we always read
the superblocks in from disk and process them.

Thus the active slot, holding the table for the current raid device, has to be
suspended, leaving up-to-date superblocks behind.  The raid_ctr() call for the inactive table reads the superblocks in and prepares them for the reshape (new_layout/delta_disks/...), setting recovery to frozen.
This does not start the reshape until preresume.

> 
> If there is a place where you need the metadata to be written synchronously
> for correct operation, then you should write it synchronously.  Call
> md_update_sb, making sure you hold the mddev_lock().

Doing that already.
Auditing if it's in the right place...

> 
> I've looked some more at your use of mddev_suspend/resume and the fact that
> you resume, and then suspend, inside rs_start_reshape() bothers me.
> I presume you do that because ->start_reshape requires a non-suspended
> device.

Correct.
As mentioned, it's done with the recovery set to frozen though.

> That is true as it allocates memory, and so need writeout to be working on
> all devices.
> I think it would be best to find a way to delay the ->start_reshape() call
> until raid_resume(), after the call the mddev_resume().  It should be
> perfectly safe to delay start_reshape like this.  All the changes that need
> to be atomic have already happened, and ->start_reshape() just starts the
> process of moving data around.

start_reshape can fail and we can't report that in raid_resume().
That's why it's done in raid_preresume(), which can return an error to userspace.

> 
> To answer comment #23 - you have already worked this out yourself, but
> ->writes_pending is initialised on md_alloc()
>  
> 	if (percpu_ref_init(&mddev->writes_pending, no_op, 0, GFP_KERNEL) < 0)
> 		goto abort;
> 
> and md_alloc() doesn't call that function.

Yes, thanks.

Comment 41 NeilBrown 2017-06-18 22:45:14 UTC
> Doing that already.

Ah yes, the rs_update_sbs() calls.  They do seem to do the right thing, but I don't have a clear picture of the sequencing so I cannot be sure.

> dm has an active and inactive table slot to allow for mapping switches
> on open devices

Ahh... so when rs_start_reshape() is called, is that md array always an "inactive table" - so no I/O could possibly want to go through it?  Any I/O to the DM device will go through the separate "active table"?

That means you don't need to worry about mem-alloc deadlocks like md does.
So I think there should be no need at all to resume/suspend in rs_start_reshape().

Does it cause any problem to remove the mddev_resume()/mddev_suspend() from rs_start_reshape()?

Comment 42 Rafael Aquini 2017-06-19 16:57:26 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 44 Jonathan Earl Brassow 2017-06-19 18:05:43 UTC
*** Bug 1442137 has been marked as a duplicate of this bug. ***

Comment 45 Rafael Aquini 2017-06-20 14:57:42 UTC
Patch(es) available on kernel-3.10.0-683.el7

Comment 48 errata-xmlrpc 2017-08-02 06:10:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842