Bug 237028
| Field | Value |
| --- | --- |
| Summary | cmirror recovery deadlock due to machine failure + primary leg failure of multiple mirrors |
| Product | [Retired] Red Hat Cluster Suite |
| Component | cmirror |
| Version | 4 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Jonathan Earl Brassow <jbrassow> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | agk, cfeist, dwysocha, jbrassow, mbroz, prockai |
| Doc Type | Bug Fix |
| Last Closed | 2008-08-05 21:41:38 UTC |
| Attachments | trace of cluster log server in kernel (attachment 153015) |
**Description** (Corey Marthaler, 2007-04-18 22:31:55 UTC)
All LVM commands are blocked, suggesting that another LVM process is already in flight. 'ps' confirms this and shows that one thread is blocked waiting on the kernel, which means the vgreduce is currently running. 'dmsetup status' is currently blocking because the log server cannot be contacted. The server cannot be contacted because it is not known, and it isn't known because elections are failing. Why are elections failing? Is there a catch-22?

```
D    Apr18   0:05 [cluster_log_ser]
```

The log server seems stuck on link-02, the likely server when this all began. Since elections go around to all machines, it seems reasonable that this machine is the plug in the election, causing all the other problems. Now, what is it stuck on?

Created attachment 153015 [details]: trace of cluster log server in kernel

The trace reveals it is stuck waiting for 'write_bits', likely from the 'flush' taking place on link-08.
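For illustration only, here is a minimal C sketch of how a circulating election like the one described above can be "plugged" by a single stuck member. All of the names here (`hold_election`, `send_election_msg`, the structs) are hypothetical stand-ins, not the actual dm-cmirror code:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real dm-cmirror structures. */
struct node { int id; struct node *next; };
struct log_ctx { struct node *members; int server_id; int server_valid; };

/* Blocking cluster send; assumed to return 0 on ack or -ETIMEDOUT. */
extern int send_election_msg(struct log_ctx *lc, struct node *n);

/*
 * The election message must pass through every cluster member before
 * a server can be chosen.  A member whose log daemon is stuck in 'D'
 * state never acknowledges, so every caller sees -ETIMEDOUT (errno
 * 110, hence the "-110" in the logs below) and retries: the catch-22
 * described above.
 */
static int hold_election(struct log_ctx *lc)
{
	struct node *n;

	for (n = lc->members; n != NULL; n = n->next) {
		int r = send_election_msg(lc, n);
		if (r < 0)
			return r; /* "Failed to receive election results" */
	}
	lc->server_id = lc->members->id; /* simplistic: first member wins */
	lc->server_valid = 1;
	return 0;
}
```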
'dmsetup info' on link-02 (the stuck server) shows:

```
Name:              vg3-cmirror3_mlog
State:             SUSPENDED
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-naVlSFduTdI43dyc04lU4NL2gRPIpesXG6ORhAO4ejRXAkonnHawdwAbe0n6hMq0
```

(The other mirrors are using cluster core logs, so this is the only one that could block like this.) It is likely that the server is attempting to flush while the log is suspended. This would certainly cause the hang-up. I'll test this in a minute by resuming the log device; if my theory is correct, this should free things up and allow things to go. First, I want to ensure that the flush on link-08 is indeed linked to the stalled server on link-02. link-08 reports that it _is_ flushing that particular log ("...e0n6hMq0").

As I had thought, resuming the disk log on link-02 has unplugged the clot.

```
** Before resume
Apr 19 06:11:23 link-08 kernel: dm-cmirror: Failed to receive election results from server: (e0n6hMq0,-110)
Apr 19 06:11:23 link-08 kernel: dm-cmirror: server_id=dead, server_valid=1, e0n6hMq0
Apr 19 06:11:23 link-08 kernel: dm-cmirror: trigger = LRT_FLUSH

** After resume
Apr 19 06:11:24 link-08 kernel: dm-cmirror: Clean-up required due to new server
Apr 19 06:11:24 link-08 kernel: dm-cmirror: - Resending all mark region requests
```

Here is the presuspend function in the cluster logging code:

```c
static int cluster_presuspend(struct dirty_log *log)
{
	/* FIXME: flush is work-around for bug 235040 */
	DMDEBUG("Performing flush to work around bug 235040");
	cluster_flush(log);
	DMDEBUG("Log flush complete");

	return 0;
}
```

This is almost certainly the flush that was happening from link-08. It could be that I need to pull out this work-around. That would result in some bits being left 'marked' in the log even though they are clean, which means a small amount of resyncing will always be necessary when a mirror comes back on-line. I'm going to make sure that this is really a problem before pulling the work-around.

Corey, can you recreate this with *.debug in syslog? If it is what I propose in comment #8, you will see the unique DEBUG prints.

I have reproduced this issue twice with two 2-way cmirrors, and once with one 2-way mirror. In the case of the two 2-way mirrors, after I shot a node plus failed a primary leg in each mirror, another machine always ends up "locking up" to the point that the other nodes fence it, and from there quorum is lost and clvmd is deadlocked. clvmd also sometimes ends up dying, possibly pointing to another bug:

```
Apr 20 10:13:14 link-08 lvm[13739]: EOF reading CLVMD
```

In the case of the one 2-way mirror, after I shot a node plus failed a primary leg, clvmd as well as all I/O to the cmirror ends up deadlocked.
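To make the suspend-versus-flush ordering concrete, here is a hedged sketch of the server side; `write_bits`, `server_handle_flush`, and the struct are assumed stand-ins for the real dm-cmirror paths, not the actual code:

```c
#include <errno.h>

/* Hypothetical stand-in for the real dm-cmirror server state. */
struct log_disk { int suspended; };

/*
 * Queues bitmap I/O to the on-disk log.  This is where the attached
 * trace showed the server thread stuck: I/O queued against a
 * SUSPENDED device never completes, so the caller sleeps in 'D'.
 */
extern int write_bits(struct log_disk *d);

/*
 * Server side of an LRT_FLUSH request (as sent by the presuspend
 * work-around above).  There is no "is my own log device suspended?"
 * check, so the flush blocks indefinitely; and because this node also
 * sits in the election path, every other node's requests then time
 * out with -110.  Resuming the log device lets write_bits() finish,
 * which matches the "unplugged the clot" observation above.
 */
static int server_handle_flush(struct log_disk *d)
{
	return write_bits(d); /* blocks for as long as d stays suspended */
}
```

(The log excerpts below show exactly this pattern: a flush or LRT_GET_SYNC_COUNT request followed by endless election timeouts.)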
```
Apr 20 11:26:28 link-08 kernel: dm-cmirror: LOG INFO:
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   uuid: LVM-tfUW1A1zzWkxldikE6dQLZop1Se6pcdUP3XLjK1CMYgfcvZNvCMxqQ7UCss1olfv
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   region_count: 16384
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   sync_count  : 0
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   sync_search : 0
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   in_sync     : YES
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   server_id   : 3
Apr 20 11:26:28 link-08 kernel: dm-cmirror:   server_valid: YES
Apr 20 11:26:35 link-08 kernel: dm-cmirror: LOG INFO:
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   uuid: LVM-tfUW1A1zzWkxldikE6dQLZop1Se6pcdUP3XLjK1CMYgfcvZNvCMxqQ7UCss1olfv
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   region_count: 16384
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   sync_count  : 0
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   sync_search : 0
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   in_sync     : YES
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   server_id   : 3
Apr 20 11:26:35 link-08 kernel: dm-cmirror:   server_valid: YES
Apr 20 11:27:29 link-08 kernel: CMAN: removing node link-07 from the cluster : Missed too many heartbeats
Apr 20 11:28:00 link-08 fenced[5648]: link-07 not a cluster member after 30 sec post_fail_delay
Apr 20 11:28:00 link-08 fenced[5648]: fencing node "link-07"
[...]
Apr 20 11:28:28 link-08 kernel: dm-cmirror: LOG INFO:
Apr 20 11:28:28 link-08 lvm[5941]: Setting log/prefix to
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   uuid: LVM-tfUW1A1zzWkxldikE6dQLZop1Se6pcdUP3XLjK1CMYgfcvZNvCMxqQ7UCss1olfv
Apr 20 11:28:28 link-08 lvm[5941]: Setting log/command_names to 0
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Apr 20 11:28:28 link-08 lvm[5941]: Setting global/test to 0
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   region_count: 16384
Apr 20 11:28:28 link-08 lvm[5941]: Setting log/overwrite to 0
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   sync_count  : 0
Apr 20 11:28:28 link-08 lvm[5941]: log/activation not found in config: defaulting to 0
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   sync_search : 0
Apr 20 11:28:28 link-08 lvm[5941]: Logging initialised at Fri Apr 20 11:28:28 2007
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   in_sync     : NO
Apr 20 11:28:28 link-08 lvm[5941]: Setting global/umask to 63
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   server_id   : 3
Apr 20 11:28:28 link-08 lvm[5941]: Setting devices/dir to /dev
Apr 20 11:28:28 link-08 kernel: dm-cmirror:   server_valid: YES
Apr 20 11:28:28 link-08 lvm[5941]: Setting global/proc to /proc
Apr 20 11:28:28 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=3: Trying to acquire journal lock...
Apr 20 11:28:28 link-08 lvm[5941]: Setting global/activation to 1
Apr 20 11:28:28 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=3: Looking at journal...
Apr 20 11:28:28 link-08 lvm[5941]: global/suffix not found in config: defaulting to 1
Apr 20 11:28:28 link-08 lvm[5941]: global/units not found in config: defaulting to h
Apr 20 11:28:28 link-08 lvm[5941]: Matcher built with 11 dfa states
Apr 20 11:28:28 link-08 lvm[5941]: Setting devices/ignore_suspended_devices to 1
Apr 20 11:28:28 link-08 lvm[5941]: Setting devices/cache to /etc/lvm/.cache
Apr 20 11:28:28 link-08 lvm[5941]: Setting devices/write_cache_state to 1
Apr 20 11:28:28 link-08 lvm[5941]: Initialised format: lvm1
Apr 20 11:28:28 link-08 lvm[5941]: Initialised format: pool
Apr 20 11:28:28 link-08 lvm[5941]: Initialised format: lvm2
Apr 20 11:28:28 link-08 lvm[5941]: global/format not found in config: defaulting to lvm2
Apr 20 11:28:28 link-08 lvm[5941]: Initialised segtype: striped
Apr 20 11:28:28 link-08 lvm[5941]: Initialised segtype: zero
Apr 20 11:28:28 link-08 lvm[5941]: Initialised segtype: error
Apr 20 11:28:28 link-08 lvm[5941]: Initialised segtype: snapshot
Apr 20 11:28:28 link-08 lvm[5941]: Initialised segtype: mirror
Apr 20 11:28:28 link-08 lvm[5941]: Processing: vgreduce --config devices{ignore_suspended_devices=1} --removemissing V1
Apr 20 11:28:28 link-08 lvm[5941]: O_DIRECT will be used
Apr 20 11:28:28 link-08 lvm[5941]: Setting global/locking_type to 3
Apr 20 11:28:28 link-08 lvm[5941]: Cluster locking selected.
Apr 20 11:28:28 link-08 lvm[5941]: Finding volume group "V1"
Apr 20 11:28:28 link-08 lvm[5941]: Locking V_V1 at 0x4
```

I appear to have reproduced this bug over the weekend, though I didn't kill any machine, only a secondary mirror leg while the mirror was syncing.

TEST OUTPUT:

```
Senario: Kill secondary leg of non synced 3 leg mirror
****** Mirror hash info for this scenario ******
* name:    fail_secondary_non_synced_3_legs
* sync:    0
* disklog: 1
* failpv:  /dev/sdf1
* legs:    3
* pvs:     /dev/sdb1 /dev/sdf1 /dev/sdg1 /dev/sdd1
************************************************
/dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
Creating mirror on link-07...
qarsh root@link-07 lvcreate -m 2 -n fail_secondary_non_synced_3_legs -L 800M helter_skelter /dev/sdb1:0-500 /dev/sdf1:0-500 /dev/sdg1:0-500 /dev/sdd1:0-50
mirror is only 12.00% synced right now
Creating gfs on top of mirror on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-02...
Mounting gfs on link-02...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-04...
Mounting gfs on link-04...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-07...
Mounting gfs on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-08...
Mounting gfs on link-08...
Writing files to gfs on...
```
```
link-02 checkit starting with:
CREATE
Num files:   100
Random Seed: 25368
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs
Working dir: /mnt/fail_secondary_non_synced_3_legs

link-04 checkit starting with:
CREATE
Num files:   100
Random Seed: 25102
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs
Working dir: /mnt/fail_secondary_non_synced_3_legs

link-07 checkit starting with:
CREATE
Num files:   100
Random Seed: 26060
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs
Working dir: /mnt/fail_secondary_non_synced_3_legs

link-08 checkit starting with:
CREATE
Num files:   100
Random Seed: 24559
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs
Working dir: /mnt/fail_secondary_non_synced_3_legs

Disabling device sdf on link-02
Disabling device sdf on link-04
Disabling device sdf on link-07
Disabling device sdf on link-08

Attempting I/O to cause mirror down conversion on link-07
10+0 records in
10+0 records out
Verifying the down conversion from mirror to linear
/dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
[HANG]
```

LINK-02:

```
Apr 20 13:47:31 link-02 qarshd[25379]: Running cmdline: echo offline > /sys/block/sdf/device/state
Apr 20 13:47:31 link-02 qarshd[25379]: That's enough
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Resync work completed: 890
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 891
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 891/MLVHW0ji/3
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 892
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 892/MLVHW0ji/3
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 893
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 893/MLVHW0ji/3
Apr 20 13:47:31 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 894
Apr 20 13:47:32 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 894/MLVHW0ji/3
[...]
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1089/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1090
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1090/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1091
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1091/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1092
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Creating MLVHW0ji (2)
Apr 20 13:47:42 link-02 kernel: dm-cmirror: server_id=dead, server_valid=0, MLVHW0ji
Apr 20 13:47:42 link-02 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 13:47:42 link-02 lvm[7587]: No longer monitoring mirror device helter_skelter-fail_secondary_non_synced_3_legs for events
Apr 20 13:47:42 link-02 lvm[7587]: Unlocking memory
Apr 20 13:47:42 link-02 lvm[7587]: memlock_count dec to 0
Apr 20 13:47:42 link-02 lvm[7587]: Dumping persistent device cache to /etc/lvm/.cache
Apr 20 13:47:42 link-02 lvm[7587]: Locking /etc/lvm/.cache (F_WRLCK, 1)
Apr 20 13:47:42 link-02 lvm[7587]: Unlocking fd 8
Apr 20 13:47:42 link-02 lvm[7587]: Wiping internal VG cache
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Performing flush to work around bug 235040
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1092/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1093
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1093/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1094
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1094/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1095
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1095/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1096
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1096/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1097
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1097/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1098
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1098/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1099
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1099/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1100
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Setting recovering region out-of-sync: 1100/MLVHW0ji/3
Apr 20 13:47:42 link-02 kernel: dm-cmirror: Assigning recovery work to 3/MLVHW0ji: 1101
Apr 20 13:48:02 link-02 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 13:48:02 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 13:48:02 link-02 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 13:48:22 link-02 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 13:48:22 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 13:48:22 link-02 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
```

LINK-04:

```
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 886/MLVHW0ji
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 887/MLVHW0ji
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 888/MLVHW0ji
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 889/MLVHW0ji
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 890/MLVHW0ji
Apr 20 14:16:06 link-04 qarshd[25113]: Talking to peer 10.15.80.47:60377
Apr 20 14:16:06 link-04 qarshd[25113]: Running cmdline: echo offline > /sys/block/sdf/device/state
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 891/MLVHW0ji
Apr 20 14:16:06 link-04 qarshd[25113]: That's enough
Apr 20 14:16:06 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:06 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:06 link-04 kernel: device-mapper: Write error during recovery (error = 0x1)
Apr 20 14:16:06 link-04 kernel: device-mapper: recovery failed on region 891
Apr 20 14:16:06 link-04 kernel: dm-cmirror: Client received resync work: 892/MLVHW0ji
Apr 20 14:16:06 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:06 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:06 link-04 kernel: device-mapper: Write error during recovery (error = 0x1)
Apr 20 14:16:06 link-04 kernel: device-mapper: recovery failed on region 892
[...]
Apr 20 14:16:17 link-04 kernel: dm-cmirror: Client received resync work: 1101/MLVHW0ji
Apr 20 14:16:17 link-04 kernel: dm-cmirror: Creating MLVHW0ji (2)
Apr 20 14:16:17 link-04 kernel: dm-cmirror: server_id=dead, server_valid=0, MLVHW0ji
Apr 20 14:16:17 link-04 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 14:16:17 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:17 link-04 lvm[7956]: No longer monitoring mirror device helter_skelter-fail_secondary_non_synced_3_legs for events
Apr 20 14:16:17 link-04 lvm[7956]: Unlocking memory
Apr 20 14:16:17 link-04 lvm[7956]: memlock_count dec to 0
Apr 20 14:16:17 link-04 lvm[7956]: Dumping persistent device cache to /etc/lvm/.cache
Apr 20 14:16:17 link-04 lvm[7956]: Locking /etc/lvm/.cache (F_WRLCK, 1)
Apr 20 14:16:17 link-04 lvm[7956]: Unlocking fd 8
Apr 20 14:16:17 link-04 lvm[7956]: Wiping internal VG cache
Apr 20 14:16:17 link-04 kernel: scsi1 (0:6): rejecting I/O to offline device
Apr 20 14:16:17 link-04 kernel: device-mapper: Write error during recovery (error = 0x1)
Apr 20 14:16:17 link-04 kernel: device-mapper: recovery failed on region 1101
Apr 20 14:16:37 link-04 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 14:16:37 link-04 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 14:16:37 link-04 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 14:16:57 link-04 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 14:16:57 link-04 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 14:16:57 link-04 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
```

LINK-07:

```
Apr 20 19:30:26 link-07 qarshd[26066]: Running cmdline: echo offline > /sys/block/sdf/device/state
Apr 20 19:30:26 link-07 qarshd[26066]: That's enough
Apr 20 19:30:31 link-07 qarshd[26073]: Talking to peer 10.15.80.47:42708
Apr 20 19:30:31 link-07 qarshd[26073]: Running cmdline: dd if=/dev/zero of=/mnt/fail_secondary_non_synced_3_legs/ddfile count=10 bs=4M
Apr 20 19:30:31 link-07 qarshd[26073]: That's enough
Apr 20 19:30:31 link-07 qarshd[26075]: Talking to peer 10.15.80.47:42709
Apr 20 19:30:31 link-07 qarshd[26075]: Running cmdline: sync
Apr 20 19:30:34 link-07 qarshd[26075]: That's enough
Apr 20 19:30:36 link-07 kernel: scsi2 (0:6): rejecting I/O to offline device
Apr 20 19:30:36 link-07 kernel: dm-cmirror: Creating MLVHW0ji (2)
Apr 20 19:30:36 link-07 kernel: dm-cmirror: server_id=dead, server_valid=0, MLVHW0ji
Apr 20 19:30:36 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 19:30:36 link-07 lvm[8709]: No longer monitoring mirror device helter_skelter-fail_secondary_non_synced_3_legs for events
Apr 20 19:30:36 link-07 lvm[8709]: Unlocking memory
Apr 20 19:30:36 link-07 lvm[8709]: memlock_count dec to 0
Apr 20 19:30:36 link-07 lvm[8709]: Dumping persistent device cache to /etc/lvm/.cache
Apr 20 19:30:36 link-07 lvm[8709]: Locking /etc/lvm/.cache (F_WRLCK, 1)
Apr 20 19:30:36 link-07 lvm[8709]: Opened /etc/lvm/.cache RO
Apr 20 19:30:36 link-07 lvm[8709]: Loaded persistent filter cache from /etc/lvm/.cache
Apr 20 19:30:36 link-07 lvm[8709]: Unlocking fd 10
Apr 20 19:30:36 link-07 lvm[8709]: Closed /etc/lvm/.cache
Apr 20 19:30:36 link-07 lvm[8709]: Wiping internal VG cache
Apr 20 19:30:36 link-07 kernel: dm-cmirror: Performing flush to work around bug 235040
Apr 20 19:30:36 link-07 kernel: dm-cmirror: Log flush complete
Apr 20 19:30:40 link-07 qarshd[26088]: Talking to peer 10.15.80.47:42714
Apr 20 19:30:40 link-07 qarshd[26088]: Running cmdline: lvs
Apr 20 19:30:40 link-07 kernel: scsi2 (0:6): rejecting I/O to offline device
Apr 20 19:30:43 link-07 qarshd[26088]: Nothing to do
Apr 20 19:30:55 link-07 last message repeated 4 times
Apr 20 19:30:56 link-07 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 19:30:56 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 19:30:56 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 19:30:58 link-07 qarshd[26088]: Nothing to do
Apr 20 19:31:16 link-07 last message repeated 6 times
Apr 20 19:31:16 link-07 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 19:31:16 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 19:31:16 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 19:31:19 link-07 qarshd[26088]: Nothing to do
Apr 20 19:31:34 link-07 last message repeated 5 times
Apr 20 19:31:36 link-07 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 19:31:36 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 19:31:36 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
```

LINK-08:

```
Apr 20 14:23:12 link-08 kernel: dm-cmirror: Client received resync work: 621/MLVHW0ji
Apr 20 14:23:12 link-08 kernel: dm-cmirror: Client received resync work: 622/MLVHW0ji
Apr 20 14:23:12 link-08 kernel: dm-cmirror: Client received resync work: 623/MLVHW0ji
Apr 20 14:23:13 link-08 qarshd[24558]: Talking to peer 10.15.80.47:37776
Apr 20 14:23:13 link-08 qarshd[24558]: Running cmdline: /usr/tests/sts/bin/checkit -w /mnt/fail_secondary_non_synced_3_legs -f /tmp/checkit_fail_secondary_non_synced_3_legs -n 100
Apr 20 14:23:15 link-08 qarshd[24558]: That's enough
Apr 20 14:23:20 link-08 qarshd[24565]: Talking to peer 10.15.80.47:37791
Apr 20 14:23:20 link-08 qarshd[24565]: Running cmdline: echo offline > /sys/block/sdf/device/state
Apr 20 14:23:20 link-08 qarshd[24565]: That's enough
Apr 20 14:23:25 link-08 kernel: scsi2 (0:6): rejecting I/O to offline device
Apr 20 14:23:25 link-08 last message repeated 4 times
Apr 20 14:23:25 link-08 lvm[7805]: Mirror device, 253:4, has failed.
Apr 20 14:23:25 link-08 lvm[7805]: Device failure in helter_skelter-fail_secondary_non_synced_3_legs
Apr 20 14:23:25 link-08 lvm[7805]: Parsing: vgreduce --config devices{ignore_suspended_devices=1} --removemissing helter_skelter
Apr 20 14:23:25 link-08 lvm[7805]: Reloading config files
Apr 20 14:23:25 link-08 lvm[7805]: Wiping internal VG cache
Apr 20 14:23:25 link-08 lvm[7805]: Loading config file: /etc/lvm/lvm.conf
Apr 20 14:23:25 link-08 lvm[7805]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Apr 20 14:23:25 link-08 lvm[7805]: Opened /etc/lvm/lvm.conf RO
Apr 20 14:23:25 link-08 lvm[7805]: Closed /etc/lvm/lvm.conf
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/syslog to 1
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/level to 0
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/verbose to 0
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/indent to 1
Apr 20 14:23:25 link-08 kernel: dm-cmirror: LOG INFO:
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/prefix to
Apr 20 14:23:25 link-08 kernel: dm-cmirror:   uuid: LVM-xVnnXA2JtbxslpR6f47gh6Y0CQhRhWuCsrMJWxn24sUX9aTpuLv0A7wSMLVHW0ji
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/command_names to 0
Apr 20 14:23:25 link-08 kernel: dm-cmirror:   uuid_ref    : 1
Apr 20 14:23:25 link-08 lvm[7805]: Setting global/test to 0
Apr 20 14:23:25 link-08 kernel: dm-cmirror:   region_count: 1600
Apr 20 14:23:25 link-08 lvm[7805]: Setting log/overwrite to 0
Apr 20 14:23:25 link-08 kernel: dm-cmirror:   sync_count  : 0
Apr 20 14:23:25 link-08 lvm[7805]: log/activation not found in config: defaulting to 0
Apr 20 14:23:25 link-08 kernel: dm-cmirror:   sync_search : 0
Apr 20 14:23:25 link-08 lvm[7805]: Logging initialised at Fri Apr 20 14:23:25 2007
[...]
Apr 20 14:23:29 link-08 lvm[7805]: Couldn't find device with uuid 'xKcsaA-A8zQ-4WfC-D5Po-SPvJ-F6K2-I7V3lU'.
Apr 20 14:23:29 link-08 lvm[7805]: description not found in config: defaulting to
Apr 20 14:23:29 link-08 lvm[7805]: Read helter_skelter metadata (436) from /dev/sdg1 at 141312 size 3043
Apr 20 14:23:29 link-08 lvm[7805]: Closed /dev/sdg1
Apr 20 14:23:29 link-08 lvm[7805]: Cached VG helter_skelter had incorrect PV list
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdd1 0: 0 1: fail_secondary_non_synced_3_legs_mlog(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdd1 1: 1 34727: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sda1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdg1 0: 0 200: fail_secondary_non_synced_3_legs_mimage_2(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdg1 1: 200 34528: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdc1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sde1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdb1 0: 0 200: fail_secondary_non_synced_3_legs_mimage_0(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdb1 1: 200 34528: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: unknown device 0: 0 200: fail_secondary_non_synced_3_legs_mimage_1(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: unknown device 1: 200 34528: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: Archiving volume group "helter_skelter" metadata (seqno 436).
Apr 20 14:23:29 link-08 lvm[7805]: helter_skelter/fail_secondary_non_synced_3_legs_mimage_1 has missing extents: removing (including dependencies)
Apr 20 14:23:29 link-08 lvm[7805]: Getting device info for helter_skelter-fail_secondary_non_synced_3_legs_mimage_1
Apr 20 14:23:29 link-08 lvm[7805]: dm info LVM-xVnnXA2JtbxslpR6f47gh6Y0CQhRhWuCZ1SPEGektHmOD4cvBQHA0wC4DswjYg6t NF [16384]
Apr 20 14:23:29 link-08 lvm[7805]: Removing PV with UUID xKcsaA-A8zQ-4WfC-D5Po-SPvJ-F6K2-I7V3lU from VG helter_skelter
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdd1 0: 0 1: fail_secondary_non_synced_3_legs_mlog(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdd1 1: 1 34727: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sda1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdg1 0: 0 200: fail_secondary_non_synced_3_legs_mimage_2(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdg1 1: 200 34528: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdc1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sde1 0: 0 34728: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdb1 0: 0 200: fail_secondary_non_synced_3_legs_mimage_0(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdb1 1: 200 34528: NULL(0:0)
Apr 20 14:23:29 link-08 lvm[7805]: WARNING: dev_open(/dev/sda1) called while suspended
Apr 20 14:23:29 link-08 lvm[7805]: Opened /dev/sda1 RW O_DIRECT
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sda1: block size is 512 bytes
Apr 20 14:23:29 link-08 lvm[7805]: Writing helter_skelter metadata to /dev/sda1 at 183808 len 2836
Apr 20 14:23:29 link-08 lvm[7805]: WARNING: dev_open(/dev/sdb1) called while suspended
Apr 20 14:23:29 link-08 lvm[7805]: Opened /dev/sdb1 RW O_DIRECT
Apr 20 14:23:29 link-08 lvm[7805]: /dev/sdb1: block size is 512 bytes
[...]
Apr 20 14:26:30 link-08 lvm[7805]: Initialised segtype: mirror
Apr 20 14:26:30 link-08 lvm[7805]: Completed: vgreduce --config devices{ignore_suspended_devices=1} --removemissing helter_skelter
Apr 20 14:26:30 link-08 lvm[7805]: Failed to remove faulty devices in helter_skelter-fail_secondary_non_synced_3_legs
Apr 20 14:26:35 link-08 kernel: dm-cmirror: Error while listening for server response: -110
Apr 20 14:27:20 link-08 last message repeated 3 times
Apr 20 14:28:35 link-08 last message repeated 5 times
Apr 20 14:29:50 link-08 last message repeated 5 times
Apr 20 14:31:05 link-08 last message repeated 5 times
Apr 20 14:31:50 link-08 last message repeated 3 times
Apr 20 14:31:50 link-08 kernel: dm-cmirror: Clustered mirror retried requests :: 64 of 328378 (1%)
Apr 20 14:31:50 link-08 kernel: dm-cmirror: Last request:
Apr 20 14:31:50 link-08 kernel: dm-cmirror: - my_id   :: 4
Apr 20 14:31:50 link-08 kernel: dm-cmirror: - server  :: 4
Apr 20 14:31:50 link-08 kernel: dm-cmirror: - log uuid:: MLVHW0ji (active)
Apr 20 14:31:50 link-08 kernel: dm-cmirror: - request :: LRT_GET_SYNC_COUNT
Apr 20 14:31:50 link-08 kernel: dm-cmirror: - error   :: -110
Apr 20 14:31:50 link-08 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
```
```
Apr 20 14:31:50 link-08 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 14:31:50 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 14:32:10 link-08 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 14:32:10 link-08 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 14:32:10 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Apr 20 14:32:30 link-08 kernel: dm-cmirror: Failed to receive election results from server: (MLVHW0ji,-110)
Apr 20 14:32:30 link-08 kernel: dm-cmirror: server_id=dead, server_valid=1, MLVHW0ji
Apr 20 14:32:30 link-08 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
```

(Status changes: assigned -> post; post -> modified.)

Marking this fixed, as I haven't seen the error messages listed in this bug during failure scenarios anymore (2.6.9-56.ELsmp / cmirror-kernel-2.6.9-33.2).

Fixed in current release (4.7).