Bug 193596
Summary: | node recovery can cause cluster mirror copying to halt | |
---|---|---|---
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal>
Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4 | CC: | agk, cfeist, dwysocha, mbroz
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2008-08-05 21:31:14 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Corey Marthaler
2006-05-30 20:46:20 UTC
Just a note that I have also seen this where the copy percent restarts and, instead of getting stuck at 0.00%, it gets stuck elsewhere, say at 5.08%.

    [root@taft-03 ~]# lvs
      LV         VG       Attr   LSize Origin Snap% Move Log             Copy%
      deanmirror mirror_1 mwi-a- 1.00G                   deanmirror_mlog 16.02

(after recovery)

    [root@taft-04 ~]# lvs
      LV         VG       Attr   LSize Origin Snap% Move Log             Copy%
      deanmirror mirror_1 mwi-a- 1.00G                   deanmirror_mlog  0.00
    [root@taft-04 ~]# lvs
      LV         VG       Attr   LSize Origin Snap% Move Log             Copy%
      deanmirror mirror_1 mwi-a- 1.00G                   deanmirror_mlog  5.08
    [root@taft-04 ~]# lvs
      LV         VG       Attr   LSize Origin Snap% Move Log             Copy%
      deanmirror mirror_1 mwi-a- 1.00G                   deanmirror_mlog  5.08   [STUCK]

It turns out that recovering/relocating the mirror server isn't required for this issue to appear. Any recovery of a node (server or not) can cause this. Also, it appears that mirrors under 10G have a better chance of making this happen.

How did you relocate the server? In any event, the following works to relocate the server and create the problem:

1. create the cluster mirror
2. find the log server (check the log), and run 'lvchange -aln <vg>/<lv>' on it
3. recovery stops
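A minimal shell sketch of those steps, assuming the mirror_1/deanmirror names from the lvs output above and the standard syslog location; the lvcreate invocation and the polling loop are illustrative, not the exact commands used here:

    # 1) create a small clustered mirror (smaller mirrors reportedly hit this more often)
    lvcreate -m 1 -L 1G -n deanmirror mirror_1

    # 2) find the node acting as the cluster mirror log server (run on each node)
    grep "cluster mirror log server" /var/log/messages

    # 3) on that node, deactivate the mirror locally so the log server goes away
    lvchange -aln mirror_1/deanmirror

    # on a surviving node, watch Copy% -- with the bug present it stops advancing
    while true; do lvs mirror_1/deanmirror; sleep 10; done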
Before we had the ability to detect whether or not another member node was recovering, I made sure that only the server would recover the mirror. However, this meant that other nodes would receive the signal not to recover. So, in this case, when the server failed/left, recovery would stop. Now that we have the ability to detect remote recovery, I have removed the stipulation that only the server can recover. This has fixed the problem for me.

Fix will be in the next rev after cmirror-kernel*-2.6.9-4.2.

Fix verified. However, recovery during cmirror creation still results in bz 195392.

May have spoken too soon. During a mirror creation, taft-01 hit bz 198563, which caused recovery to take place, and that in turn caused the mirror copying to halt at 96.06%.

    [root@taft-02 ~]# lvs
      LV           VG          Attr   LSize   Origin Snap% Move Log               Copy%
      mirror_1_340 mirror_1_34 mwi-a- 136.23G                   mirror_1_340_mlog 96.06
    [root@taft-02 ~]# dmsetup status
    mirror_1_34-mirror_1_340_mlog: 0 8192 linear
    mirror_1_34-mirror_1_340_mimage_1: 0 285696000 linear
    VolGroup00-LogVol01: 0 4063232 linear
    mirror_1_34-mirror_1_340_mimage_0: 0 285696000 linear
    VolGroup00-LogVol00: 0 138674176 linear
    mirror_1_34-mirror_1_340: 0 285696000 mirror 2 253:3 253:4 268017/279000 1 AA 3 clustered_disk 253:2 A

taft-02:

    Jul 12 06:13:08 taft-02 kernel: device-mapper: I'm the cluster mirror log server for BG1zl4bX
    Jul 12 06:13:08 taft-02 kernel: device-mapper: Disk Resume::
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Live nodes        :: 1
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   In-Use Regions    :: 0
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Good IUR's        :: 0
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Bad IUR's         :: 0
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Sync count        :: 254424
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Disk Region count :: 18446744073709551615
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Region count      :: 279000
    Jul 12 06:13:08 taft-02 kernel: device-mapper: NOTE: Mapping has changed.
    Jul 12 06:13:08 taft-02 kernel: device-mapper: Marked regions::
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   0 - 24575
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Total = 24576
    Jul 12 06:13:08 taft-02 dmeventd[5286]: Monitoring mirror device, mirror_1_34-mirror_1_340 for events
    Jul 12 06:13:08 taft-02 kernel: device-mapper: Out-of-sync regions::
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   0 - 24575
    Jul 12 06:13:08 taft-02 kernel: device-mapper:   Total = 24576
    Jul 12 06:13:26 taft-02 kernel: CMAN: node taft-01 has been removed from the cluster : No response to messages
    Jul 12 06:13:27 taft-02 fenced[4328]: fencing deferred to taft-03
    Jul 12 06:14:07 taft-02 kernel: device-mapper: A cluster mirror log member has failed.
    Jul 12 06:14:07 taft-02 kernel: device-mapper: Disk Resume::
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Live nodes        :: 2
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   In-Use Regions    :: 2
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Good IUR's        :: 2
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Bad IUR's         :: 0
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Sync count        :: 255053
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Disk Region count :: 279000
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Region count      :: 279000
    Jul 12 06:14:07 taft-02 kernel: device-mapper: Marked regions::
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   627 - 24575
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Total = 23949
    Jul 12 06:14:07 taft-02 kernel: device-mapper: Out-of-sync regions::
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   629 - 24575
    Jul 12 06:14:07 taft-02 kernel: device-mapper:   Total = 23947
    device-mapper: Error (-22) while processing request (type = 5)
    Jul 12 06:16:43 taft-02 kernel: device-mapper: Request to remove unrecorded region user (3/13590)
    Jul 12 06:16:43 taft-02 kernel: device-mapper: Error (-22) while processing request (type = 5)

taft-03:

    Jul 12 06:12:29 taft-03 dmeventd[4754]: Monitoring mirror device, mirror_1_34-mirror_1_340 for events
    Jul 12 06:12:42 taft-03 kernel: CMAN: node taft-01 has been removed from the cluster : No response to messages
    Jul 12 06:13:13 taft-03 fenced[4250]: taft-01 not a cluster member after 30 sec post_fail_delay
    Jul 12 06:13:13 taft-03 fenced[4250]: fencing node "taft-01"
    Jul 12 06:13:13 taft-03 fenced[4250]: fence "taft-01" success
    Jul 12 06:13:19 taft-03 kernel: device-mapper: A cluster mirror log member has failed.

taft-04:

    CMAN: removing node taft-01 from the cluster : No response to messages
    Jul 12 07:21:11 taft-04 kernel: CMAN: removing node taft-01 from the cluster : No response to messages
    Jul 12 07:21:12 taft-04 fenced[4166]: fencing deferred to taft-03
    Jul 12 07:21:57 taft-04 dmeventd[4659]: Monitoring mirror device, mirror_1_34-mirror_1_340 for events
    device-mapper: an error occured on the server while processing our request
    Jul 12 07:24:28 taft-04 kernel: device-mapper: an error occured on the server while processing our request

Version:

    [root@taft-02 ~]# rpm -q lvm2
    lvm2-2.02.06-4.0.RHEL4
    [root@taft-02 ~]# rpm -q lvm2-cluster
    lvm2-cluster-2.02.06-4.0.RHEL4
    [root@taft-02 ~]# rpm -q cmirror
    cmirror-1.0.1-0
    [root@taft-02 ~]# rpm -q cmirror-kernel
    cmirror-kernel-2.6.9-10.0
    [root@taft-02 ~]# rpm -q device-mapper
    device-mapper-1.02.07-3.0.RHEL4
    [root@taft-02 ~]# uname -ar
    Linux taft-02 2.6.9-39.1.ELsmp #1 SMP Fri Jun 16 16:47:43 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Just a note that once I brought the recovered node back into the cluster the syncing started again.

Devel ACK.
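A hedged sketch of that workaround (bring the recovered node back into the cluster and start clvmd again so the copy resumes); the init scripts named below are the standard RHEL4 cluster services, and mirror_1/deanmirror is the example mirror from earlier, so substitute the real names:

    # on the recovered/rebooted node: rejoin the cluster
    service ccsd start
    service cman start
    service fenced start

    # restart clustered LVM so the node re-registers with the mirror log
    service clvmd start

    # re-activate the mirror locally and confirm Copy% starts advancing again
    lvchange -aly mirror_1/deanmirror
    lvs mirror_1/deanmirror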
Just a note to re-iterate comment #7: I am still able to reproduce this, where the copy percent stops when a node is recovered and stays at that percentage until the node is brought back into the cluster and clvmd is started again.

I am no longer able to reproduce this. Can you recreate with the latest rpms >= 11/29/2006?

I have been unable to reproduce this issue with the latest version of cmirror-kernel, marking verified.

Closing as this has been fixed in the current (4.7) release.
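For reference, a small verification loop of the sort that could back the "marking verified" step above: poll the mirror's copy percentage after a node recovery and flag it if it stops advancing before reaching 100%. The copy_percent report field and the 10-minute stall window are assumptions for illustration, not taken from this report:

    #!/bin/bash
    # Poll a clustered mirror's Copy% and report whether it completes or stalls.
    LV=mirror_1/deanmirror        # substitute the mirror under test
    last="" ; stalled=0
    while true; do
        pct=$(lvs --noheadings -o copy_percent "$LV" | tr -d ' ')
        echo "$(date '+%T')  Copy% = $pct"
        [ "$pct" = "100.00" ] && { echo "sync complete"; exit 0; }
        if [ "$pct" = "$last" ]; then
            stalled=$((stalled + 1))
            # 60 polls x 10s = 10 minutes with no progress -> treat as stuck
            [ "$stalled" -ge 60 ] && { echo "copy percent stuck at $pct"; exit 1; }
        else
            stalled=0
        fi
        last=$pct
        sleep 10
    done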