Description of problem: I saw this while trying to verify the fix for bz 199766. I created a 300G mirror and then quickly started writing to the mirror from all four nodes in the cluster at the same time using dd (while the cmirror was being synced). This caused many kernel region request and region recovery messages to scroll onto the console.

Server:

[...]
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Current recoverer: 3
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Mark requester : 2
Nov 27 10:05:39 link-08 kernel: dm-cmirror: unable to get server (2) to mark region (180)
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Reason :: -16
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Attempt to mark a region 180, which is being recovered.
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Current recoverer: 3
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Mark requester : 2
Nov 27 10:05:39 link-08 kernel: dm-cmirror: unable to get server (2) to mark region (180)
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Reason :: -16
Nov 27 10:05:39 link-08 kernel: dm-cmirror: Attempt to mark a region 180, which is being recovered.
[...]

Clients:

[...]
Nov 27 15:09:21 link-07 kernel: dm-cmirror: unable to get server (2) to mark region (0)
Nov 27 15:09:21 link-07 kernel: dm-cmirror: Reason :: -16
Nov 27 15:09:21 link-07 kernel: dm-cmirror: unable to get server (2) to mark region (0)
Nov 27 15:09:21 link-07 kernel: dm-cmirror: Reason :: -16
[...]
Nov 27 09:53:46 link-02 kernel: dm-cmirror: Reason :: -16
Nov 27 09:53:46 link-02 kernel: dm-cmirror: unable to get server (2) to mark region (760)
Nov 27 09:53:46 link-02 kernel: dm-cmirror: Reason :: -16
Nov 27 09:53:46 link-02 kernel: dm-cmirror: unable to get server (2) to mark region (760)
Nov 27 09:53:46 link-02 kernel: dm-cmirror: Reason :: -16
Nov 27 09:53:46 link-02 kernel: dm-cmirror: unable to get server (2) to mark region (762)
Nov 27 09:53:46 link-02 kernel: dm-cmirror: Reason :: -16
Nov 27 09:53:46 link-02 kernel: dm-cmirror: unable to get server (2) to mark region (764)
Nov 27 09:53:46 link-02 kernel: dm-cmirror: Reason :: -16
[...]
Version-Release number of selected component (if applicable):

[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.15-3
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.15-3
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.12-3
[root@link-08 ~]# rpm -q cmirror
cmirror-1.0.1-0
[root@link-08 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-13.0
Possibly: 1) recovery complete
Is the fix for this in lvm2-2.02.17-1/lvm2-cluster-2.02.17-1? I'm still seeing these in the latest build.
This bug still exists. I basically did exactly what was done in the original report: created a cmirror and began writing to it before it was fully synced (in this case I used GFS). I then noticed that the copy percent had gotten stuck at 88%. This causes the filesystem to deadlock.

[root@link-08 ~]# lvs -a -o +devices
  LV                                    VG             Attr   LSize   Origin Snap% Move Log                             Copy% Devices
  fail_primary_synced_2_legs            helter_skelter mwi-ao 800.00M                  fail_primary_synced_2_legs_mlog 88.00 fail_primary_synced_2_legs_mimage_0(0),fail_primary_synced_2_legs_mimage_1(0)
  [fail_primary_synced_2_legs_mimage_0] helter_skelter iwi-ao 800.00M                                                        /dev/sde1(0)
  [fail_primary_synced_2_legs_mimage_1] helter_skelter iwi-ao 800.00M                                                        /dev/sdf1(0)
  [fail_primary_synced_2_legs_mlog]     helter_skelter lwi-ao   4.00M                                                        /dev/sdb1(0)

[...]
Apr  4 09:50:12 link-08 kernel: dm-cmirror: Attempt to mark a region 0/6t9xT1Dd which is being recovered.
Apr  4 09:50:12 link-08 kernel: dm-cmirror: Current recoverer: 1
Apr  4 09:50:12 link-08 kernel: dm-cmirror: Mark requester : 4
Apr  4 09:50:12 link-08 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
Apr  4 09:50:13 link-08 kernel: dm-cmirror: Attempt to mark a region 0/6t9xT1Dd which is being recovered.
Apr  4 09:50:13 link-08 kernel: dm-cmirror: Current recoverer: 1
Apr  4 09:50:13 link-08 kernel: dm-cmirror: Mark requester : 4
Apr  4 09:50:13 link-08 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
Apr  4 09:50:14 link-08 kernel: dm-cmirror: Attempt to mark a region 0/6t9xT1Dd which is being recovered.
Apr  4 09:50:14 link-08 kernel: dm-cmirror: Current recoverer: 1
Apr  4 09:50:14 link-08 kernel: dm-cmirror: Mark requester : 4
Apr  4 09:50:14 link-08 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
[...]

All the other nodes in the cluster say this:

Apr  4 14:13:12 link-07 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
These are the current running versions:

[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-50.ELsmp #1 SMP Tue Mar 6 18:04:58 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.21-3.el4
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.21-3.el4
[root@link-08 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-27.0
Hit another variation of this while attempting to umount a cmirror with GFS on top. In this case, the fs wasn't created until after the mirror was in sync, and the same goes for the I/O. Then a failure scenario was attempted; again, no I/O was done until the mirror was in sync.

[...]
Apr  4 12:01:20 link-02 kernel: dm-cmirror: Attempt to mark a region 0/rto3kYQN which is being recovered.
Apr  4 12:01:20 link-02 kernel: dm-cmirror: Current recoverer: 3
Apr  4 12:01:20 link-02 kernel: dm-cmirror: Mark requester : 1
Apr  4 12:01:20 link-02 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
Apr  4 12:01:21 link-02 kernel: dm-cmirror: Attempt to mark a region 0/rto3kYQN which is being recovered.
Apr  4 12:01:21 link-02 kernel: dm-cmirror: Current recoverer: 3
Apr  4 12:01:21 link-02 kernel: dm-cmirror: Mark requester : 1
Apr  4 12:01:21 link-02 kernel: dm-cmirror: Delaying mark to region 0, due to recovery
[...]

Here is the stuck umount attempt:
umount        D 0000000000000000     0  6369   6376                (NOTLB)
000001003196fa48 0000000000000006 0000010001013a80 0000000100000008
0000000100000000 ffffffffffffffff 00000000fa6bba34 00000000a023f7aa
000001002fea77f0 0000000000000d1a
Call Trace:<ffffffff8030cea7>{io_schedule+38} <ffffffff8015a8f0>{wait_on_page_bit+180}
       <ffffffff8015a796>{page_wake_function+0} <ffffffff8015a796>{page_wake_function+0}
       <ffffffff8015a977>{wait_on_page_writeback_range+104} <ffffffff8015a69c>{__filemap_fdatawrite_range+95}
       <ffffffffa02400dc>{:gfs:gfs_sync_buf+54} <ffffffffa024d8cb>{:gfs:inode_go_sync+87}
       <ffffffffa024b83c>{:gfs:gfs_glock_drop_th+203} <ffffffffa0249fe5>{:gfs:run_queue+314}
       <ffffffffa024ad8c>{:gfs:gfs_glock_dq+400} <ffffffffa024add4>{:gfs:gfs_glock_dq_uninit+9}
       <ffffffffa026c9a5>{:gfs:gfs_statfs_sync+359} <ffffffffa026c9f5>{:gfs:gfs_make_fs_ro+63}
       <ffffffffa0264100>{:gfs:gfs_put_super+630} <ffffffff8017f6cd>{generic_shutdown_super+198}
       <ffffffffa02618fe>{:gfs:gfs_kill_sb+41} <ffffffff8017f5ee>{deactivate_super+95}
       <ffffffff801952e2>{sys_umount+925} <ffffffff80182a67>{sys_newstat+17}
       <ffffffff80110d91>{error_exit+0} <ffffffff8011026a>{system_call+126}
Proposing blocker flag, as we're starting to hit this more often.
*** Bug 235252 has been marked as a duplicate of this bug. ***
I've pushed the check for the recovery/mark collision into the flush function, delaying the check until it is necessary. Testing has begun on the code, but the package has not yet been built.

assigned -> post
Bringing this out of POST because I missed part of the check-in, so it is incomplete. Corey is testing the complete version now. After a cursory check, I checked in the rest of the code and put this bug back into POST.

post -> assigned
assigned -> post
Built 4/5/07.

post -> modified
Have not seen this issue with the latest fix. Marking verified.
Fixed in current release (4.7).