Description of problem:
This may just be a different version of bz 193704, but in this issue the mirror syncing also stopped/hung along with the write attempt.

[root@taft-04 ~]# lvcreate -L 750M -m 1 -n coreymirror mirror_1
  Rounding up size to full physical extent 752.00 MB
  Logical volume "coreymirror" created

May 31 11:16:02 taft-04 lvm[5195]: mirror_1-coreymirror is now in-sync
May 31 11:20:00 taft-04 kernel: device-mapper: A node has left the cluster.
May 31 11:20:10 taft-04 kernel: device-mapper: Cluster log server is shutting down.
May 31 11:21:00 taft-04 kernel: device-mapper: I'm the cluster log server for LVM-0I9prx81Ohk
May 31 11:21:00 taft-04 kernel: device-mapper: Disk Resume::
May 31 11:21:00 taft-04 kernel: device-mapper:   Live nodes        :: 1
May 31 11:21:00 taft-04 kernel: device-mapper:   In-Use Regions    :: 0
May 31 11:21:00 taft-04 kernel: device-mapper:   Good IUR's        :: 0
May 31 11:21:00 taft-04 kernel: device-mapper:   Bad IUR's         :: 0
May 31 11:21:00 taft-04 kernel: device-mapper:   Sync count        :: 0
May 31 11:21:00 taft-04 kernel: device-mapper:   Disk Region count :: 18446744073709551615
May 31 11:21:00 taft-04 kernel: device-mapper:   Region count      :: 1504
May 31 11:21:00 taft-04 kernel: device-mapper: NOTE: Mapping has changed.
May 31 11:21:00 taft-04 kernel: device-mapper: Marked regions::
May 31 11:21:00 taft-04 kernel: device-mapper:   0 - -1
May 31 11:21:00 taft-04 kernel: device-mapper:   Total = -1
May 31 11:21:00 taft-04 kernel: device-mapper: Out-of-sync regions::
May 31 11:21:00 taft-04 kernel: device-mapper:   0 - -1
May 31 11:21:00 taft-04 kernel: device-mapper:   Total = -1
May 31 11:21:01 taft-04 lvm[5311]: Monitoring mirror device, mirror_1-coreymirror for events

[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 39.36
[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 47.34
[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 49.47

Different node:
[root@taft-03 ~]# gfs_mkfs -j 4 -p lock_dlm -t TAFT_CLUSTER:mirror /dev/mirror_1/coreymirror -O
[HANG]

Caused the syncing to be stuck now as well:
[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 49.47
[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 49.47
[root@taft-04 ~]# lvs
  LV          VG       Attr   LSize   Origin Snap%  Move Log              Copy%
  coreymirror mirror_1 mwi-a- 752.00M                    coreymirror_mlog 49.47

May 31 10:08:04 taft-02 lvm[5280]: Monitoring mirror device, mirror_1-coreymirror for events
May 31 10:12:09 taft-02 kernel: device-mapper: A node has left the cluster.
May 31 10:12:19 taft-02 last message repeated 2 times
May 31 10:12:24 taft-02 kernel: device-mapper: Cluster log server is shutting down.
May 31 10:13:28 taft-02 lvm[5396]: Monitoring mirror device, mirror_1-coreymirror for events

Version-Release number of selected component (if applicable):
[root@taft-04 ~]# rpm -q cmirror-kernel-smp
cmirror-kernel-smp-2.6.9-4.2
Looks like I/O will hang even to an in-sync mirror:

May 31 11:34:26 taft-03 clvmd: Activating VGs: succeeded
device-mapper: unable to get server (2) to mark region (120)
device-mapper: Reason :: 1
May 31 11:35:52 taft-03 lvm[4538]: mirror_1-coreymirror is now in-sync
May 31 11:35:52 taft-03 kernel: device-mapper: unable to get server (2) to mark region (120)
May 31 11:35:52 taft-03 kernel: device-mapper: Reason :: 1
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 128
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 128
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 256
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 384
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 512
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 512
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 640
May 31 11:35:57 taft-03 kernel: device-mapper: clear_region_count :: 640
May 31 11:35:58 taft-03 kernel: device-mapper: clear_region_count :: 768
I created a GFS filesystem on a 900M cmirror last night and attempted to run simple I/O on it, and this morning it was all hung.
Switching this to RHEL4/kernel
Created attachment 130973 [details]
Potential Patch (waiting for a little feedback before posting)

The problem was in drivers/md/dm-raid1.c:do_writes, where I was adding writes that are blocked behind remote recovering regions ('requeue') to a bio list ('writes') that was only valid in that function. The result is that the memory is leaked and the request lost, stalling all I/O for that process as it waits for the write to complete.

The fix simply adds the bio to the main write queue via queue_bio, effectively causing the write to be deferred until the region is recovered.

The reason for the change to do_mirror is that the call to 'wake' in queue_bio is impotent here, since the thread is effectively telling itself to wake up. (That wake-up call is disregarded because the thread is already running.) The while loop therefore continues until do_writes no longer requeues bios due to remote recovery.

You may ask why it is OK to check 'ms->writes.head' without holding the spin lock. The reason is that do_mirror is the only thread that can clear the list, and any additions to the list we care about happen in do_writes (that is, before the check in the same thread).
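To make the deadlock and the fix easier to follow, here is a minimal user-space sketch of the pattern described above. The names (mirror_set, queue_bio, do_writes, do_mirror) are borrowed loosely from dm-raid1.c, but the structures, the blocked_passes knob, and all of the list handling are invented for illustration only; the real change is in the attached patch, and the actual kernel code involves spin locks, region lookups, and real bios that are stubbed out here.

/*
 * Illustrative sketch only: a write that is "blocked behind remote
 * recovery" is put back on the main write queue (queue_bio) instead
 * of being parked on a function-local list and lost, and the worker
 * loops until nothing gets requeued.
 */
#include <stdio.h>
#include <stdlib.h>

struct bio {
    int sector;            /* stand-in for the real bio payload        */
    int blocked_passes;    /* passes the simulated recovery blocks it  */
    struct bio *next;
};

struct bio_list {
    struct bio *head, *tail;
};

struct mirror_set {
    struct bio_list writes;   /* main write queue (ms->writes)         */
};

static void bio_list_add(struct bio_list *l, struct bio *b)
{
    b->next = NULL;
    if (l->tail)
        l->tail->next = b;
    else
        l->head = b;
    l->tail = b;
}

static struct bio *bio_list_pop(struct bio_list *l)
{
    struct bio *b = l->head;
    if (b) {
        l->head = b->next;
        if (!l->head)
            l->tail = NULL;
    }
    return b;
}

/* Model of queue_bio(): put the bio back on the main queue.  In the
 * kernel this also calls wake(), which is a no-op when the worker is
 * calling it on itself - hence the loop in do_mirror() below.         */
static void queue_bio(struct mirror_set *ms, struct bio *b)
{
    bio_list_add(&ms->writes, b);
}

/* Process everything currently queued; writes still blocked behind a
 * (simulated) remote recovering region are requeued on ms->writes
 * instead of being left on a local list that dies with the function.  */
static void do_writes(struct mirror_set *ms)
{
    struct bio_list todo = ms->writes;
    struct bio *b;

    ms->writes.head = ms->writes.tail = NULL;

    while ((b = bio_list_pop(&todo))) {
        if (b->blocked_passes > 0) {
            b->blocked_passes--;
            queue_bio(ms, b);          /* defer, don't drop            */
            continue;
        }
        printf("write completed: sector %d\n", b->sector);
        free(b);
    }
}

/* Worker loop: keep calling do_writes() until it requeues nothing,
 * mirroring the do_mirror() change described in the comment above.    */
static void do_mirror(struct mirror_set *ms)
{
    do {
        do_writes(ms);
    } while (ms->writes.head);
}

int main(void)
{
    struct mirror_set ms = { { NULL, NULL } };
    struct bio *b = malloc(sizeof(*b));

    b->sector = 120;
    b->blocked_passes = 3;   /* blocked for three passes, then done    */
    queue_bio(&ms, b);

    do_mirror(&ms);          /* completes instead of stalling forever  */
    return 0;
}

With the old behavior (append to the local 'writes' list), the blocked bio would simply vanish when do_writes returned, and the writer would wait forever for a completion that never comes; requeuing it on ms->writes lets it be retried once the region has recovered.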
Committed in stream U4 build 39.2. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html