Description of problem:
I had I/O going to 3 cmirrored gfs filesystems over the weekend while also doing up and down mirror conversions of those volumes. Those conversion commands started to fail due to locking errors, and the filesystems now also appear hung (a df command hangs). I then noticed on the console of one of the nodes in the cluster that the following cmirror request had been running for days, slowly going through each region of the mirror and failing each time.

This may be a similar bug to bug 225337.

[...]
Feb 2 22:04:21 link-04 kernel: dm-cmirror: Request to remove unrecorded region user (4/2100)
Feb 2 22:04:21 link-04 lvm[4578]: No longer monitoring mirror device vg-mirror1 for events
Feb 2 22:04:42 link-04 kernel: dm-cmirror: Clean-up required due to new server
Feb 2 22:04:42 link-04 kernel: dm-cmirror: - Wiping clear region list
Feb 2 22:04:42 link-04 kernel: dm-cmirror: - 0 clear region requests wiped
Feb 2 22:04:42 link-04 kernel: dm-cmirror: - Resending all mark region requests
Feb 2 22:04:42 link-04 kernel: dm-cmirror: Clean-up complete
Feb 2 22:04:42 link-04 kernel: dm-cmirror: Continuing request type, 8 (LRT_GET_SYNC_COUNT)
Feb 2 22:04:42 link-04 lvm[4578]: Monitoring mirror device vg-mirror1 for events
Feb 2 22:04:48 link-04 lvm[4578]: No longer monitoring mirror device vg-mirror2 for events
Feb 2 22:04:48 link-04 kernel: dm-cmirror: I'm the cluster mirror log server for CTwiYWOQ
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Disk Resume:: CTwiYWOQ (active)
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Live nodes        :: 4
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  In-Use Regions    :: 0
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Good IUR's        :: 0
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Bad IUR's         :: 0
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Sync count        :: 2441
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Disk Region count :: 10240
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Region count      :: 10240
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Marked regions::
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  2438
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  2442 - 10240
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Total = 7799
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Out-of-sync regions::
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  2438
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  2442 - 10240
Feb 2 22:04:48 link-04 kernel: dm-cmirror:  Total = 7799
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Write bits failed on mirror log device, 253:8
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Write bits failed on mirror log device, 253:8
Feb 2 22:04:48 link-04 kernel: dm-cmirror: Leaving while clear region requests remain.
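For reference, the up/down conversions were driven by lvconvert. A minimal sketch of that kind of loop (the LV name and leg counts here are assumptions, not the exact commands used in the test):

  # hypothetical up/down conversion loop against one of the mirrors
  lvconvert -m 0 vg/mirror1   # down-convert: 2-way mirror -> linear
  lvconvert -m 1 vg/mirror1   # up-convert: linear -> 2-way mirror (resync starts)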
Feb 2 22:04:53 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 416 of 7966006 (1%)
Feb 2 22:04:53 link-04 kernel: dm-cmirror: Last request:
Feb 2 22:04:53 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb 2 22:04:53 link-04 kernel: dm-cmirror:  - server  :: 57005
Feb 2 22:04:53 link-04 kernel: dm-cmirror:  - log uuid:: CTwiYWOQ (suspended)
Feb 2 22:04:53 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb 2 22:04:53 link-04 kernel: dm-cmirror:  - error   :: -6
Feb 2 22:05:01 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 448 of 7966088 (1%)
Feb 2 22:05:01 link-04 kernel: dm-cmirror: Last request:
Feb 2 22:05:01 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb 2 22:05:01 link-04 kernel: dm-cmirror:  - server  :: 57005
Feb 2 22:05:01 link-04 kernel: dm-cmirror:  - log uuid:: CTwiYWOQ (suspended)
Feb 2 22:05:01 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb 2 22:05:01 link-04 kernel: dm-cmirror:  - error   :: -6
[...]
Feb 5 08:51:57 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 846464 of 9658132 (9%)
Feb 5 08:51:57 link-04 kernel: dm-cmirror: Last request:
Feb 5 08:51:57 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb 5 08:51:57 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb 5 08:51:57 link-04 kernel: dm-cmirror:  - log uuid:: (active)
Feb 5 08:51:57 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb 5 08:51:57 link-04 kernel: dm-cmirror:  - error   :: -6
Feb 5 08:52:05 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 846496 of 9658196 (9%)
Feb 5 08:52:05 link-04 kernel: dm-cmirror: Last request:
Feb 5 08:52:05 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb 5 08:52:05 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb 5 08:52:05 link-04 kernel: dm-cmirror:  - log uuid:: (active)
Feb 5 08:52:05 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb 5 08:52:05 link-04 kernel: dm-cmirror:  - error   :: -6
Feb 5 08:52:13 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 846528 of 9658260 (9%)
Feb 5 08:52:13 link-04 kernel: dm-cmirror: Last request:
Feb 5 08:52:13 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb 5 08:52:13 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb 5 08:52:13 link-04 kernel: dm-cmirror:  - log uuid:: (active)
Feb 5 08:52:13 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb 5 08:52:13 link-04 kernel: dm-cmirror:  - error   :: -6

This process has now been going on for 3+ days, hitting an error on every request, and has only made it through 9% of the total mirror. An lvs command on link-04 hangs, but here is what another node in the cluster sees for the mirrors (the sync percents are all stuck).
[root@link-07 ~]# lvs -a -o +devices
  LV                 VG Attr   LSize Origin Snap% Move Log          Copy%  Devices
  mirror1            vg Mwi-so 5.00G                  mirror1_mlog  58.52  mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg iwi-so 5.00G                                       /dev/sdh1(1280)
  [mirror1_mimage_1] vg iwi-so 5.00G                                       /dev/sdd1(0)
  [mirror1_mlog]     vg lwi-so 4.00M                                       /dev/sda1(2893)
  mirror2            vg Mwi-so 5.00G                  mirror2_mlog 100.00  mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg iwi-so 5.00G                                       /dev/sdh1(2560)
  [mirror2_mimage_1] vg iwi-so 5.00G                                       /dev/sdb1(0)
  [mirror2_mlog]     vg lwi-so 4.00M                                       /dev/sda1(2894)
  mirror3            vg Mwi-so 5.00G                  mirror3_mlog  37.27  mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg iwi-so 5.00G                                       /dev/sdh1(3840)
  [mirror3_mimage_1] vg iwi-so 5.00G                                       /dev/sdc1(0)
  [mirror3_mlog]     vg lwi-so 4.00M                                       /dev/sda1(2892)

Version-Release number of selected component (if applicable):
lvm2-2.02.21-1.el4
lvm2-cluster-2.02.21-1.el4
cmirror-1.0.1-1
cmirror-kernel-smp-2.6.9-18.9
device-mapper-1.02.17-2.el4
2.6.9-45.ELsmp
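Since lvs hangs on link-04 itself, the "stuck" observation comes from polling another node. One way to watch for a stalled resync, assuming lvs still responds on that node (lv_name and copy_percent are standard lvm2 report fields):

  # poll copy percent once a minute; a healthy resync should keep climbing
  while true; do
      lvs --noheadings -o lv_name,copy_percent vg
      sleep 60
  done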
Just a note that this has been reproduced. The following has been going on now for 20+ hours.

Feb 12 13:00:31 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 437462 (1%)
Feb 12 13:00:39 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 64 of 437526 (1%)
Feb 12 13:00:47 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 96 of 437590 (1%)
[...]
Feb 13 08:52:14 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 285728 of 1008866 (29%)
Feb 13 08:52:22 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 285760 of 1008930 (29%)
Please try to reproduce this with the latest cmirror-kernel package (>= 2/21/2007).
Marking modified, as I believe this has been fixed in the process of fixing other bugs.
Hit this last night on link-04 with the latest cmirror build on the link cluster (link-02,4,7,8).
OK, we've proved the bug is still around. Now we need to either:

A) find the boundary conditions of the bug, or
B) find quicker/easier ways to reproduce the bug.

Obviously, B is the best. I need to know how many mirrors you had running on the system at the time, what was being done to them, whether they were 3-way mirrors or not, etc. Then I need to know if you can reproduce with just one mirror on the system, and/or with just 2-way mirrors... simplify. Again, the best thing you could do is find a simple, straightforward way to trigger the bug that makes it show up in a reasonable timeframe.
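Something like the following is the kind of stripped-down loop I mean (a sketch only; the VG/LV names, size, and raw dd writes are assumptions standing in for the GFS I/O):

  # single 2-way cmirror, background writes, continuous up/down conversions
  lvcreate -m 1 -L 5G -n mirror1 vg
  while true; do dd if=/dev/zero of=/dev/vg/mirror1 bs=1M count=1024; done &
  while true; do
      lvconvert -m 0 vg/mirror1   # down-convert to linear
      lvconvert -m 1 vg/mirror1   # up-convert; forces a fresh resync
  done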
I suspect that the server is getting stuck while trying to read a log device that has been suspended. I'm now forcing the resume to establish the log server before the resume completes, to try to mitigate this problem.
post -> modified
modified -> needinfo

Changes for bug 231230 included a partial reversal of some of the changes made for this bug. I've tried to be careful in my selection of which changes to revert, but this bug should be reinvestigated.
needinfo -> modified

I guess it was pointless for me to change that... QA will retest to validate it anyway.
I had I/O going to 3 gfs filesystems on top of cmirrors while doing up and down conversions for almost 24 hours and wasn't able to reproduce this bug. Marking verified.
Fixed in current release (4.7).