Bug 1628529
Summary: | upconvert attempt after secondary mirror leg failure causes clvmd to hang and gfs to withdraw
---|---
Product: | Red Hat Enterprise Linux 7
Reporter: | Roman Bednář <rbednar>
Component: | lvm2
Assignee: | Zdenek Kabelac <zkabelac>
lvm2 sub component: | Clustered Mirror / cmirrord
QA Contact: | cluster-qe <cluster-qe>
Status: | CLOSED ERRATA
Docs Contact: | Marek Suchánek <msuchane>
Severity: | high
Priority: | high
CC: | agk, cluster-qe, cmarthal, heinzm, jbrassow, mcsontos, msnitzer, msuchane, ncroxon, prajnoha, rbednar, rhandlin, salmy, teigland, toneata, zkabelac
Version: | 7.6
Keywords: | Regression, ZStream
Target Milestone: | rc
Target Release: | ---
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: | lvm2-2.02.185-1.el7
Doc Type: | If docs needed, set a value
Doc Text: | Previously, clustered LVM caused the GFS file system to withdraw when the secondary mirror leg failed with auto allocation. As a consequence, all commands using the lvm utility became unresponsive. With this update, the problem has been fixed.
Story Points: | ---
Clone Of: |
: | 1638084 (view as bug list)
Environment: |
Last Closed: | 2019-08-06 13:10:41 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1638084
Attachments: |
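The fix is reported above as landing in lvm2-2.02.185-1.el7. As a quick sanity check on a node that might be affected, the installed build can be compared against that version; a minimal sketch (the package and command names are standard, nothing here is taken from the test runs below):

```
# Compare the installed lvm2 build against the Fixed In Version above
# (lvm2-2.02.185-1.el7).
rpm -q lvm2
lvm version    # also reports the LVM, library and driver versions
```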
Description
Roman Bednář
2018-09-13 11:18:35 UTC
It's not at all clear what's wrong, or that there's anything wrong, from an lvm perspective. Could you please highlight a specific lvm command that is failing, and provide the debugging output from it. Or, the specific output you expect to see from a command and the different output you're seeing instead (along with the debugging output from the command you think is printing incorrect results.) Also, I don't understand the description of what's being tested. Can you add lvmdump output for each of the involved nodes, please? Is there any kernel log on either node showing some issues wrt the mirror repair? First a few data points. 1. This exact test case (fail second mirror image with auto allocation) passes in exclusive activation lvmlockd mode. 2. A test case failing the primary leg (also with auto allocation) instead of the secondary mirror leg passes in clvmd mode. 3. It appears in this scenario, the failing secondary leg, and subsequent auto allocation of a new image causes gfs to withdraw and lvm cmds to hang/deadlock. ### A few notes inline about what the test is checking and expects: mckinley-01: pvcreate /dev/mapper/mpatha1 /dev/mapper/mpathb1 /dev/mapper/mpathc1 /dev/mapper/mpathd1 /dev/mapper/mpathe1 /dev/mapper/mpathf1 /dev/mapper/mpathg1 mckinley-01: vgcreate helter_skelter /dev/mapper/mpatha1 /dev/mapper/mpathb1 /dev/mapper/mpathc1 /dev/mapper/mpathd1 /dev/mapper/mpathe1 /dev/mapper/mpathf1 /dev/mapper/mpathg1 Enabling mirror allocate fault policies on: mckinley-01 mckinley-02 mckinley-03 ================================================================================ Iteration 0.1 started at Wed Sep 19 14:39:47 CDT 2018 ================================================================================ Scenario kill_secondary_synced_3_legs: Kill secondary leg of synced 3 leg mirror(s) ********* Mirror hash info for this scenario ********* * names: syncd_secondary_3legs_1 * sync: 1 * striped: 0 * leg devices: /dev/mapper/mpathd1 /dev/mapper/mpathc1 /dev/mapper/mpathf1 * log devices: /dev/mapper/mpathe1 * no MDA devices: * failpv(s): /dev/mapper/mpathc1 * failnode(s): mckinley-01 mckinley-02 mckinley-03 * lvmetad: 0 * leg fault policy: allocate * log fault policy: allocate ****************************************************** Creating mirror(s) on mckinley-03... mckinley-03: lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/mapper/mpathd1:0-2400 /dev/mapper/mpathc1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathe1:0-150 WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_3legs_1 at offset 65536. Wipe it? [y/n]: [n] Aborted wiping of gfs2. 1 existing signature left on the device. Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_secondary_3legs_1 mwi-a-m--- 500.00m 10.40 syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0) [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m /dev/mapper/mpathd1(0) [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m /dev/mapper/mpathc1(0) [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m /dev/mapper/mpathf1(0) [syncd_secondary_3legs_1_mlog] lwi-aom--- 4.00m /dev/mapper/mpathe1(0) Waiting until all mirror|raid volumes become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) Sleeping 15 sec Sleeping 15 sec Sleeping 15 sec Creating gfs2 on top of mirror(s) on mckinley-01... 
mkfs.gfs2 -o align=0 -J 32M -j 3 -p lock_dlm -t MCKINLEY:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O Warning: device is not properly aligned. This may harm performance. Mounting mirrored gfs2 filesystems on mckinley-01... Mounting mirrored gfs2 filesystems on mckinley-02... Mounting mirrored gfs2 filesystems on mckinley-03... PV=/dev/mapper/mpathc1 syncd_secondary_3legs_1_mimage_1: 4.1 PV=/dev/mapper/mpathc1 syncd_secondary_3legs_1_mimage_1: 4.1 Writing verification files (checkit) to mirror(s) on... ---- mckinley-01 ---- ---- mckinley-02 ---- ---- mckinley-03 ---- Sleeping 15 seconds to get some outsanding I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- mckinley-01 ---- ---- mckinley-02 ---- ---- mckinley-03 ---- ### Failing the underlying devices associated w/ the secondary image /dev/mapper/mpathc1 on each of the three nodes in the cluster Failing path sdd on mckinley-01 Failing path sdl on mckinley-01 Failing path sdt on mckinley-01 Failing path sdab on mckinley-01 Failing path sde on mckinley-02 Failing path sdm on mckinley-02 Failing path sdu on mckinley-02 Failing path sdac on mckinley-02 Failing path sde on mckinley-03 Failing path sdm on mckinley-03 Failing path sdu on mckinley-03 Failing path sdac on mckinley-03 Attempting I/O to cause mirror down conversion(s) on mckinley-01 dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.152766 s, 275 MB/s Verifying current sanity of lvm after the failure ### Already, the failed image /dev/mapper/mpathc1 has been removed, healthy /dev/mapper/mpathf1 moved to the secondary position, and /dev/mapper/mpatha1 added as the new third image. I'm not sure what the syncd_secondary_3legs_1_mimagetmp_3 device is all about though now? Current mirror/raid device structure(s): Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx. 
LV Attr LSize Cpy%Sync Devices syncd_secondary_3legs_1 cwi-aom--- 500.00m 100.00 syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0) [syncd_secondary_3legs_1_mimage_0] iwi-aom--- 500.00m /dev/mapper/mpathd1(0) [syncd_secondary_3legs_1_mimage_2] iwi-aom--- 500.00m /dev/mapper/mpathf1(0) [syncd_secondary_3legs_1_mimage_3] iwi-aom--- 500.00m /dev/mapper/mpatha1(0) [syncd_secondary_3legs_1_mimagetmp_3] mwi-aom--- 500.00m 100.00 syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0) [syncd_secondary_3legs_1_mlog] lwi-aom--- 4.00m /dev/mapper/mpathe1(0) ### The test is happy with the device placements, basically that the failed device/image is gone, and a new device/image in its place Verifying FAILED device /dev/mapper/mpathc1 is *NOT* in the volume(s) olog: 1 Verifying LOG device(s) /dev/mapper/mpathe1 *ARE* in the mirror(s) Verifying LEG device /dev/mapper/mpathd1 *IS* in the volume(s) Verifying LEG device /dev/mapper/mpathf1 *IS* in the volume(s) verify the dm devices associated with /dev/mapper/mpathc1 have been removed as expected Checking REMOVAL of syncd_secondary_3legs_1_mimage_1 on: mckinley-01 mckinley-02 mckinley-03 verify the newly allocated dm devices were added as a result of the failures Checking EXISTENCE of syncd_secondary_3legs_1_mimage_3 on: mckinley-01 mckinley-02 mckinley-03 Verify that the mirror image order remains the same after the down conversion EXPECTED LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathf1 unknown ACTUAL LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathf1 /dev/mapper/mpatha1 ### Now, when the test goes to verify the filesystem integrity, it finds the data is gone! ### These mean the fs is no longer accessable (either force unmounted, or now corrupt) Verifying files (checkit) on mirror(s) on... ---- mckinley-01 ---- ---- mckinley-02 ---- Can not open wlrqyfaxxtlgdthwxklbphbh: Input/output error checkit write verify failed [root@mckinley-01 ~]# ls /mnt/syncd_secondary_3legs_1 ls: cannot access /mnt/syncd_secondary_3legs_1: Input/output error ### This never returns, lvm cmds are now deadlocked [root@mckinley-01 ~]# lvs -a -o +devices [DEADLOCK] Sep 19 14:41:55 mckinley-01 lvm[3062]: Device failure in helter_skelter-syncd_secondary_3legs_1. Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Disabling lvmetad cache for repair command. Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Not using lvmetad because of repair. Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 331679 Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 331823 Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 334359 Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 334583 Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 330663 Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0 Sep 19 14:41:55 mckinley-01 lvm[3062]: Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx. Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_3legs_1_mimage_1 while checking used and assumed devices. Sep 19 14:41:55 mckinley-01 qarshd[12914]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M Sep 19 14:41:55 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48406). 
Sep 19 14:41:55 mckinley-01 qarshd[12966]: Talking to peer ::ffff:10.15.80.218:48406 (IPv6) Sep 19 14:41:55 mckinley-01 multipathd: dm-21: remove map (uevent) Sep 19 14:41:55 mckinley-01 multipathd: dm-21: devmap not registered, can't remove Sep 19 14:41:55 mckinley-01 multipathd: dm-21: remove map (uevent) Sep 19 14:41:55 mckinley-01 multipathd: dm-24: remove map (uevent) Sep 19 14:41:55 mckinley-01 multipathd: dm-24: devmap not registered, can't remove Sep 19 14:41:55 mckinley-01 multipathd: dm-24: remove map (uevent) Sep 19 14:41:55 mckinley-01 qarshd[12966]: Running cmdline: sync Sep 19 14:41:56 mckinley-01 multipathd: sdl: unusable path Sep 19 14:41:56 mckinley-01 multipathd: sdt: unusable path Sep 19 14:41:56 mckinley-01 multipathd: sdab: unusable path Sep 19 14:41:58 mckinley-01 lvm[3062]: Mirror status: 1 of 3 images failed. Sep 19 14:41:58 mckinley-01 lvm[3062]: WARNING: Trying to up-convert to 3 images, 1 logs. Sep 19 14:41:59 mckinley-01 systemd: Started LVM2 poll daemon. Sep 19 14:41:59 mckinley-01 lvm[3062]: helter_skelter/syncd_secondary_3legs_1: Converted: 4.00% Sep 19 14:41:59 mckinley-01 lvmpolld: W: #011LVPOLL: PID 13076: STDERR: ' Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx.' Sep 19 14:42:00 mckinley-01 multipathd: sdd: unusable path Sep 19 14:42:01 mckinley-01 multipathd: sdl: unusable path Sep 19 14:42:01 mckinley-01 multipathd: sdt: unusable path Sep 19 14:42:01 mckinley-01 multipathd: sdab: unusable path Sep 19 14:42:03 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48450). Sep 19 14:42:03 mckinley-01 qarshd[13091]: Talking to peer ::ffff:10.15.80.218:48450 (IPv6) Sep 19 14:42:03 mckinley-01 qarshd[13091]: Running cmdline: dmsetup ls | grep syncd_secondary_3legs_1_mimage_1 Sep 19 14:42:04 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48456). Sep 19 14:42:04 mckinley-01 qarshd[13097]: Talking to peer ::ffff:10.15.80.218:48456 (IPv6) Sep 19 14:42:04 mckinley-01 qarshd[13097]: Running cmdline: dmsetup ls | grep syncd_secondary_3legs_1_mimage_3 Sep 19 14:42:05 mckinley-01 multipathd: sdd: unusable path Sep 19 14:42:06 mckinley-01 multipathd: sdl: unusable path Sep 19 14:42:06 mckinley-01 multipathd: sdt: unusable path Sep 19 14:42:06 mckinley-01 multipathd: sdab: unusable path Sep 19 14:42:10 mckinley-01 multipathd: sdd: unusable path Sep 19 14:42:11 mckinley-01 multipathd: sdl: unusable path Sep 19 14:42:11 mckinley-01 multipathd: sdt: unusable path Sep 19 14:42:11 mckinley-01 multipathd: sdab: unusable path Sep 19 14:42:11 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48918). 
Sep 19 14:42:11 mckinley-01 qarshd[13134]: Talking to peer ::ffff:10.15.80.218:48918 (IPv6) Sep 19 14:42:11 mckinley-01 qarshd[13134]: Running cmdline: /usr/tests/sts-rhel7.6/bin/checkit -w /mnt/syncd_secondary_3legs_1/checkit -f /tmp/checkit_syncd_secondary_3legs_1 -v Sep 19 14:42:14 mckinley-01 lvm[3062]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00% Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: 25 callbacks suppressed Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0 Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40 Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0 Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40 Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0 Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40 Sep 19 14:42:14 mckinley-01 kernel: device-mapper: raid1: Primary mirror (253:21) failed while out-of-sync: Reads may fail. Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0 Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: error -5: withdrawing the file system to prevent further damage. Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0 Sep 19 14:42:14 mckinley-01 kernel: VFS:Filesystem freeze failed Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: about to withdraw this file system Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: telling LM to unmount Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: recover generation 5 done Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: withdrawn Sep 19 14:42:14 mckinley-01 kernel: CPU: 7 PID: 12704 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1 Sep 19 14:42:14 mckinley-01 kernel: Hardware name: Dell Inc. PowerEdge R820/0RN9TC, BIOS 2.0.20 01/16/2014 Sep 19 14:42:14 mckinley-01 kernel: Call Trace: Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffaa361ca8>] dump_stack+0x19/0x1b Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08ff206>] gfs2_lm_withdraw+0x146/0x180 [gfs2] Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08e3e15>] gfs2_logd+0x315/0x340 [gfs2] Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc2d60>] ? wake_up_atomic_t+0x30/0x30 Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08e3b00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2] Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1c91>] kthread+0xd1/0xe0 Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1bc0>] ? insert_kthread_work+0x40/0x40 Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffaa374c37>] ret_from_fork_nospec_begin+0x21/0x21 Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1bc0>] ? 
insert_kthread_work+0x40/0x40 Sep 19 14:42:15 mckinley-01 multipathd: sdd: unusable path Sep 19 14:42:16 mckinley-01 multipathd: sdl: unusable path Sep 19 14:42:16 mckinley-01 multipathd: sdt: unusable path ### In my reproduction, pcs seems to think the cluster is still fine: [root@mckinley-01 ~]# pcs status Cluster name: MCKINLEY Stack: corosync Current DC: mckinley-02 (version 1.1.19-7.el7-c3c624ea3d) - partition with quorum Last updated: Wed Sep 19 14:55:21 2018 Last change: Wed Sep 19 13:35:51 2018 by root via cibadmin on mckinley-01 3 nodes configured 7 resources configured Online: [ mckinley-01 mckinley-02 mckinley-03 ] Full list of resources: mckinley-apc (stonith:fence_apc): Started mckinley-01 Clone Set: dlm-clone [dlm] Started: [ mckinley-01 mckinley-02 mckinley-03 ] Clone Set: clvmd-clone [clvmd] Started: [ mckinley-01 mckinley-02 mckinley-03 ] This seems eerily familiar to closed bugs 1039218 and 1038870 and may having something to do with prior run cmirror device failure allocation scenarios not fully cleaning up after itself. However, if that's the case, what changed for us to have to deal w/ this again after 4+ years... "lvm[3062]" -- what process is this? Is it one of the automatically run lvconvert --repair commands that operates so secretively? If so we need to know exactly what that thing is doing. Any kind of automatically run repair command needs to leave very obvious and verbose logs about exactly what it's doing. Will it leave a full debug log of itself by setting lvm.conf log level and file? Also, I'm surprised to see no trace of cmirror anywhere yet. Shouldn't that be leaving some evidence of itself somewhere? Another data point, this test scenario passes in single machine mode w/o clvmd/cmirror. That syncd_secondary_3legs_1_mimagetmp_3 image is present in this passing case, so I don't think this is related to the above mentioned old bugs 1039218 and 1038870 after all then. Current mirror/raid device structure(s): LV Attr LSize Cpy%Sync Devices syncd_secondary_3legs_1 mwi-a-m--- 500.00m 7.20 syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0) [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m /dev/sde1(0) [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m /dev/sdi1(0) [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m /dev/sdm1(0) [syncd_secondary_3legs_1_mlog] lwi-aom--- 4.00m /dev/sdb1(0) Waiting until all mirror|raid volumes become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) Sleeping 15 sec Creating xfs on top of mirror(s) on hayes-02... Mounting mirrored xfs filesystems on hayes-02... PV=/dev/sdi1 syncd_secondary_3legs_1_mimage_1: 4.1 PV=/dev/sdi1 syncd_secondary_3legs_1_mimage_1: 4.1 Writing verification files (checkit) to mirror(s) on... ---- hayes-02 ---- Sleeping 15 seconds to get some outsanding I/O locks before the failure Verifying files (checkit) on mirror(s) on... ---- hayes-02 ---- Disabling device sdi on hayes-02rescan device... Attempting I/O to cause mirror down conversion(s) on hayes-02 dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M 10+0 records in 10+0 records out 41943040 bytes (42 MB) copied, 0.0233912 s, 1.8 GB/s Verifying current sanity of lvm after the failure Current mirror/raid device structure(s): WARNING: Not using lvmetad because a repair command was run. /dev/sdi: open failed: No such device or address /dev/sdi1: open failed: No such device or address Couldn't find device with uuid YbtcAp-Z0tv-qBWL-GY2e-60LI-Yb3f-m1evQY. 
LV Attr LSize Cpy%Sync Devices syncd_secondary_3legs_1 cwi-aom--- 500.00m 100.00 syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0) [syncd_secondary_3legs_1_mimage_0] iwi-aom--- 500.00m /dev/sde1(0) [syncd_secondary_3legs_1_mimage_2] iwi-aom--- 500.00m /dev/sdm1(0) [syncd_secondary_3legs_1_mimage_3] iwi-aom--- 500.00m /dev/sdo1(0) [syncd_secondary_3legs_1_mimagetmp_3] mwi-aom--- 500.00m 100.00 syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0) [syncd_secondary_3legs_1_mlog] lwi-aom--- 4.00m /dev/sdb1(0) Sep 19 16:06:33 hayes-02 lvm[10402]: Mirror status: 1 of 3 images failed. Sep 19 16:06:33 hayes-02 lvm[10402]: WARNING: Trying to up-convert to 3 images, 1 logs. Sep 19 16:06:33 hayes-02 systemd: Started LVM2 poll daemon. Sep 19 16:06:33 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 0.80% Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' WARNING: Not using lvmetad because a repair command was run.' Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' /dev/sdi: open failed: No such device or address' Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' /dev/sdi1: open failed: No such device or address' Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' Couldn't find device with uuid YbtcAp-Z0tv-qBWL-GY2e-60LI-Yb3f-m1evQY.' Sep 19 16:06:39 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45410). Sep 19 16:06:39 hayes-02 qarshd[23013]: Talking to peer ::ffff:10.15.80.218:45410 (IPv6) Sep 19 16:06:39 hayes-02 qarshd[23013]: Running cmdline: lvs > /dev/null 2>&1 Sep 19 16:06:39 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45412). Sep 19 16:06:39 hayes-02 qarshd[23023]: Talking to peer ::ffff:10.15.80.218:45412 (IPv6) [...] Sep 19 16:06:48 hayes-02 qarshd[23087]: Running cmdline: /usr/tests/sts-rhel7.6/bin/checkit -w /mnt/syncd_secondary_3legs_1/checkit -f /tmp/checkit_syncd_secondary_3legs_1 -v Sep 19 16:06:48 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00% Sep 19 16:06:48 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' /dev/sdi: open failed: No such device or address' Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45448). Sep 19 16:06:49 hayes-02 qarshd[23106]: Talking to peer ::ffff:10.15.80.218:45448 (IPv6) Sep 19 16:06:49 hayes-02 qarshd[23106]: Running cmdline: dmstats list --histogram Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45450). Sep 19 16:06:49 hayes-02 qarshd[23110]: Talking to peer ::ffff:10.15.80.218:45450 (IPv6) Sep 19 16:06:49 hayes-02 qarshd[23110]: Running cmdline: dmstats list --group Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45452). Sep 19 16:06:49 hayes-02 qarshd[23114]: Talking to peer ::ffff:10.15.80.218:45452 (IPv6) Sep 19 16:06:50 hayes-02 qarshd[23114]: Running cmdline: dmstats delete /dev/helter_skelter/syncd_secondary_3legs_1 --allregions Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45454). Sep 19 16:06:50 hayes-02 qarshd[23118]: Talking to peer ::ffff:10.15.80.218:45454 (IPv6) Sep 19 16:06:50 hayes-02 qarshd[23118]: Running cmdline: echo running > /sys/block/sdi/device/state Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45456). 
Sep 19 16:06:50 hayes-02 qarshd[23122]: Talking to peer ::ffff:10.15.80.218:45456 (IPv6) Sep 19 16:06:50 hayes-02 qarshd[23122]: Running cmdline: pvscan --cache /dev/sdi1 Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45458). Sep 19 16:06:50 hayes-02 qarshd[23127]: Talking to peer ::ffff:10.15.80.218:45458 (IPv6) Sep 19 16:06:50 hayes-02 qarshd[23127]: Running cmdline: vgs Sep 19 16:06:56 hayes-02 lvm[10402]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 16:06:56 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' /dev/sdi1: open failed: No such device or address' Sep 19 16:06:56 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: ' WARNING: This metadata update is NOT backed up.' Sep 19 16:06:56 hayes-02 lvm[10402]: helter_skelter-syncd_secondary_3legs_1 is now in-sync. Sep 19 16:06:56 hayes-02 multipathd: dm-2: remove map (uevent) Sep 19 16:06:56 hayes-02 multipathd: dm-2: devmap not registered, can't remove Sep 19 16:06:56 hayes-02 multipathd: dm-2: remove map (uevent) Sep 19 16:06:56 hayes-02 lvm[10402]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 16:07:04 hayes-02 lvm[10402]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 16:07:04 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00% Everything goes south after the upconvert is attempted. Either by the user/test when in "remove" leg fault policy, or by lvm/dmevent automatically when in "allocate" leg fault policy. Created attachment 1484952 [details]
log of test run with "remove" policy
Created attachment 1484953 [details]
verbose lvconvert upconvert
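The comments above ask what process "lvm[3062]" is and whether the automatically run repair can be made to leave a proper record of itself by setting the lvm.conf log level and file. A minimal sketch of the lvm.conf pieces involved, assuming otherwise default settings (the log file path and the policy values are illustrative, not taken from the test systems):

```
activation {
    # "allocate" makes dmeventd react to a leg failure by running a repair
    # that tries to allocate a replacement image; "remove" only drops the
    # failed leg and leaves any up-convert to be run by hand.
    mirror_image_fault_policy = "allocate"
    mirror_log_fault_policy = "allocate"
}

log {
    # Persist verbose output from lvm commands, including repair commands
    # spawned in the background by dmeventd.
    file = "/var/log/lvm2-debug.log"
    level = 7
}
```

With mirror_image_fault_policy set to "allocate", the background repair performs the up-convert seen in the logs above automatically; with "remove" the failed image is only dropped, and the up-convert has to be issued manually, as in the test run below.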
# State just before running a manual upconvert (long after the failure, the automatic down-convert, the failed device being turned back on, and the PV being vgextend'ed back into the VG). This is presumably also what lvm does automatically when the "allocate" policy is in effect.
[root@mckinley-01 ~]# lvs -a -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
syncd_secondary_3legs_1 helter_skelter mwi-aom--- 500.00m [syncd_secondary_3legs_1_mlog] 100.00 syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)
[syncd_secondary_3legs_1_mimage_0] helter_skelter iwi-aom--- 500.00m /dev/mapper/mpathg1(0)
[syncd_secondary_3legs_1_mimage_2] helter_skelter iwi-aom--- 500.00m /dev/mapper/mpathf1(0)
[syncd_secondary_3legs_1_mlog] helter_skelter lwi-aom--- 4.00m /dev/mapper/mpathc1(0)
lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 helter_skelter/syncd_secondary_3legs_1 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpatha1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathc1:0-150
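For context, a hedged sketch of the whole manual sequence around this upconvert, pieced together from commands that appear elsewhere in this report (the device, VG and LV names are examples mixed from different runs here, not one literal transcript):

```
# 1. Bring the previously failed path back (the test does this via sysfs):
echo running > /sys/block/sdi/device/state

# 2. Have LVM/lvmetad re-read the returned PV:
pvscan --cache /dev/sdi1

# 3. Return the PV to the volume group:
vgextend helter_skelter /dev/mapper/mpathf1

# 4. Up-convert back to 3 images (shown above with -vvvv and explicit
#    PV:extent ranges); this is the step after which clvmd hangs and GFS2
#    withdraws in the clustered case:
lvconvert --yes --type mirror --mirrorlog disk -m 2 \
    helter_skelter/syncd_secondary_3legs_1
```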
After this ever
This is what is in the log as a direct result of the above lvconvert cmd being run: Sep 19 17:16:16 mckinley-01 dmeventd[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0 Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: error -5: withdrawing the file system to prevent further damage. Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: about to withdraw this file system Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: telling LM to unmount Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: withdrawn Sep 19 17:16:17 mckinley-01 corosync[2480]: [CPG ] *** 0x55aaf14c3180 can't mcast to group state:0, error:12 Sep 19 17:16:17 mckinley-01 kernel: CPU: 5 PID: 5856 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1 Sep 19 17:16:17 mckinley-01 kernel: Hardware name: Dell Inc. PowerEdge R820/0RN9TC, BIOS 2.0.20 01/16/2014 Sep 19 17:16:17 mckinley-01 kernel: Call Trace: Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffad161ca8>] dump_stack+0x19/0x1b Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc0797206>] gfs2_lm_withdraw+0x146/0x180 [gfs2] Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc077be15>] gfs2_logd+0x315/0x340 [gfs2] Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac2d60>] ? wake_up_atomic_t+0x30/0x30 Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc077bb00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2] Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1c91>] kthread+0xd1/0xe0 Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1bc0>] ? insert_kthread_work+0x40/0x40 Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffad174c37>] ret_from_fork_nospec_begin+0x21/0x21 Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1bc0>] ? insert_kthread_work+0x40/0x40 Sep 19 17:16:20 mckinley-01 lvm[2952]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events. Sep 19 17:16:20 mckinley-01 lvm[2952]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 17:16:20 mckinley-01 systemd: Started LVM2 poll daemon. Sep 19 17:16:24 mckinley-01 lvm[2952]: helter_skelter-syncd_secondary_3legs_1 is now in-sync. Sep 19 17:16:35 mckinley-01 lvm[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events. Sep 19 17:16:36 mckinley-01 dmeventd[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 19 17:16:37 mckinley-01 qarshd[5940]: Sending child 5941 signal 2 Sep 19 17:16:43 mckinley-01 qarshd[5940]: Sending child 5941 signal 15 Sep 19 17:17:12 mckinley-01 systemd-logind: New session 3 of user root. Sep 19 17:17:12 mckinley-01 systemd: Started Session 3 of user root. Sep 19 17:17:35 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out Sep 19 17:17:50 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: ' Error locking on node 1: Command timed out' Sep 19 17:18:14 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:55888). Sep 19 17:18:14 mckinley-01 qarshd[8035]: Talking to peer ::ffff:10.15.80.218:55888 (IPv6) Sep 19 17:18:14 mckinley-01 qarshd[8035]: Sending file /tmp/lvconvert Sep 19 17:18:19 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service stop-sigterm timed out. Killing. 
Sep 19 17:18:50 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out Sep 19 17:19:19 mckinley-01 systemd: Starting Cleanup of Temporary Directories... Sep 19 17:19:19 mckinley-01 systemd: Started Cleanup of Temporary Directories. Sep 19 17:19:49 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service still around after SIGKILL. Ignoring. Sep 19 17:19:50 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out Sep 19 17:20:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: ' Failed to lock logical volume helter_skelter/syncd_secondary_3legs_1.' Sep 19 17:20:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: ' Error locking on node 1: Command timed out' Sep 19 17:20:02 mckinley-01 kernel: INFO: task clvmd:2870 blocked for more than 120 seconds. Sep 19 17:20:02 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 19 17:20:02 mckinley-01 kernel: clvmd D ffff9c873eaba080 0 2870 1 0x00000080 Sep 19 17:20:02 mckinley-01 kernel: Call Trace: Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc44285>] sb_wait_write+0xb5/0xd0 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacac2d60>] ? wake_up_atomic_t+0x30/0x30 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc444b8>] freeze_super+0x68/0x130 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc812b5>] freeze_bdev+0x75/0xd0 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00a8b45>] __dm_suspend+0x105/0x240 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00ab27d>] dm_suspend+0xad/0xc0 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b1053>] dev_suspend+0x1c3/0x260 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b20ee>] ctl_ioctl+0x24e/0x550 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b0e90>] ? table_load+0x390/0x390 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b23fe>] dm_ctl_ioctl+0xe/0x20 [dm_mod] Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc56230>] do_vfs_ioctl+0x3a0/0x5a0 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacce1f0b>] ? unmerge_queues+0x1b/0x70 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacce02ce>] ? ipcget+0x14e/0x1e0 Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc564d1>] SyS_ioctl+0xa1/0xc0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d15>] ? system_call_after_swapgs+0xa2/0x146 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174ddb>] system_call_fastpath+0x22/0x27 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146 Sep 19 17:20:03 mckinley-01 kernel: INFO: task xdoio:5944 blocked for more than 120 seconds. Sep 19 17:20:03 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 19 17:20:03 mckinley-01 kernel: xdoio D ffff9c977fb05140 0 5944 1 0x00000084 Sep 19 17:20:03 mckinley-01 kernel: Call Trace: Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc785dd>] ? __brelse+0x3d/0x50 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b70>] ? out_of_line_wait_on_atomic_t+0x100/0x100 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b81>] bit_wait+0x11/0x50 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad1656f7>] __wait_on_bit+0x67/0x90 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b70>] ? 
out_of_line_wait_on_atomic_t+0x100/0x100 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165861>] out_of_line_wait_on_bit+0x81/0xb0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacac2e20>] ? wake_bit_function+0x40/0x40 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0776c3e>] gfs2_glock_wait+0x3e/0x80 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0778720>] gfs2_glock_nq+0x250/0x420 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0796796>] gfs2_trans_begin+0xf6/0x1b0 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc079523f>] gfs2_dirty_inode+0x1bf/0x280 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc079526b>] ? gfs2_dirty_inode+0x1eb/0x280 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc6fedd>] __mark_inode_dirty+0x16d/0x270 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc5dfd9>] update_time+0x89/0xd0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffaca7956f>] ? gup_pud_range+0x13f/0x2a0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc5e0c0>] file_update_time+0xa0/0xf0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacbb8668>] __generic_file_aio_write+0x198/0x400 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacbb8929>] generic_file_aio_write+0x59/0xa0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0781130>] gfs2_file_aio_write+0xd0/0x160 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc406db>] do_sync_readv_writev+0x7b/0xd0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc4231e>] do_readv_writev+0xce/0x260 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0781060>] ? gfs2_release+0x50/0x50 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc40580>] ? do_sync_read+0xe0/0xe0 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc408a4>] ? generic_file_llseek+0x24/0x30 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc078156d>] ? gfs2_llseek+0x5d/0x120 [gfs2] Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc42545>] vfs_writev+0x35/0x60 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc426ff>] SyS_writev+0x7f/0x110 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174ddb>] system_call_fastpath+0x22/0x27 Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146 Sep 19 17:21:00 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out Sep 19 17:21:19 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service stop-final-sigterm timed out. Killing. Sep 19 17:22:00 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out Sep 19 17:22:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: ' Attempt to drop cached metadata failed after reverted update for VG helter_skelter.' Sep 19 17:22:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: ' Error locking on node 1: Command timed out' Sep 19 17:22:03 mckinley-01 kernel: INFO: task clvmd:2870 blocked for more than 120 seconds. Sep 19 17:22:03 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 19 17:22:03 mckinley-01 kernel: clvmd D ffff9c873eaba080 0 2870 1 0x00000080 Sep 19 17:22:03 mckinley-01 kernel: Call Trace: Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70 Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc44285>] sb_wait_write+0xb5/0xd0 Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacac2d60>] ? 
wake_up_atomic_t+0x30/0x30 Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc444b8>] freeze_super+0x68/0x130 Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc812b5>] freeze_bdev+0x75/0xd0 Can we have "dmsetup ls --tree; dmsetup table; dmsetup status" output from before and after the "lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 helter_skelter/syncd_secondary_3legs_1 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpatha1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathc1:0-150", please ? The lvconvert in comment 11 is blocked because clvmd is not responding. In comment 12 clvmd is blocked in D state suspending the device under gfs. The suspend is blocked waiting for gfs to freeze. The gfs freeze is blocked because gfs is in the middle of withdraw due to io errors and/or the xdoio process that's in the middle of writing. I don't think clvmd is supposed to be suspending devices. We should get debug output from clvmd to see what it might be doing wrong. So looking at the 'flush' issue part of problem here - there should be basically no flush in the particular case. The flush could have been introduced with this commit 672b8c196b91e4cf6567223ce44dfb902281b791 - fixing the splitmirror preloading. So it looks like the necessity for flushing is not properly estimated in this case. I'll try to provide some patch - detected where the flush must not happen in this preload case (mirror leg with 'D') and ensure the preload is going without it. Not really sure it's the core issue - but it's definitely a bug - and can block the cluster probably. Created attachment 1485256 [details] requested dm information FWIW, here's the dm info requested in comment #13. When things go bad here are errors in several layers (cluster, kernel, lvm, gfs) all at about the same time. What's not yet clear is whether there is one initial issue that triggers everything else. Sep 20 15:09:29 harding-02 qarshd[11771]: Running cmdline: vgextend helter_skelter /dev/mapper/mpathf1 Sep 20 15:11:11 harding-02 dmeventd[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 20 15:11:11 harding-02 kernel: device-mapper: raid1: Unable to read primary mirror during recovery Sep 20 15:11:11 harding-02 kernel: device-mapper: raid1: Primary mirror (253:21) failed while out-of-sync: Reads may fail. Sep 20 15:11:12 harding-02 kernel: GFS2: fsid=HARDING:gfs1.0: recover generation 3 done Sep 20 15:11:15 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events. Sep 20 15:11:15 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 20 15:11:15 harding-02 systemd: Started LVM2 poll daemon. Sep 20 15:11:17 harding-02 lvm[5990]: Primary mirror device 253:21 sync failed. Sep 20 15:11:17 harding-02 lvm[5990]: helter_skelter-syncd_secondary_3legs_1 is now in-sync. Sep 20 15:11:30 harding-02 lvm[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events. Sep 20 15:11:30 harding-02 dmeventd[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. 
Sep 20 15:11:30 harding-02 cmirrord[7446]: Log entry already exists: LVM-4MhnZKHIY4FOEGVEtM2hszEiDk3YRGYsEi3SAeDOrLBJcV8FRuoZxnC4HYg03EtH Sep 20 15:11:30 harding-02 cmirrord[7446]: clog_resume: Failed to create cluster CPG Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log presuspend failed Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log presuspend failed Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log postsuspend failed Sep 20 15:12:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out Sep 20 15:13:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out Sep 20 15:14:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out Sep 20 15:14:30 harding-02 lvmpolld[12142]: LVMPOLLD: LVM2 cmd is unresponsive too long (PID 12145) (no output for 180 seconds) Sep 20 15:14:45 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 20 15:10:23 harding-03 crmd[11466]: notice: State transition S_IDLE -> S_POLICY_ENGINE Sep 20 15:10:23 harding-03 pengine[11465]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-578.bz2 Sep 20 15:10:23 harding-03 crmd[11466]: notice: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-578.bz2): Complete Sep 20 15:10:23 harding-03 crmd[11466]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE Sep 20 15:11:11 harding-03 dmeventd[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: Error -5 writing to journal, jid=1 Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: error -5: withdrawing the file system to prevent further damage. Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: about to withdraw this file system Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: telling LM to unmount Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: withdrawn Sep 20 15:11:12 harding-03 kernel: CPU: 7 PID: 16487 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1 Sep 20 15:11:12 harding-03 kernel: Hardware name: Dell Inc. PowerEdge R820/0YWR73, BIOS 1.5.0 03/08/2013 Sep 20 15:11:12 harding-03 kernel: Call Trace: Sep 20 15:11:12 harding-03 kernel: [<ffffffffa8361ca8>] dump_stack+0x19/0x1b Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b3f206>] gfs2_lm_withdraw+0x146/0x180 [gfs2] Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b23e15>] gfs2_logd+0x315/0x340 [gfs2] Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc2d60>] ? wake_up_atomic_t+0x30/0x30 Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b23b00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2] Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1c91>] kthread+0xd1/0xe0 Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1bc0>] ? insert_kthread_work+0x40/0x40 Sep 20 15:11:12 harding-03 kernel: [<ffffffffa8374c37>] ret_from_fork_nospec_begin+0x21/0x21 Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1bc0>] ? insert_kthread_work+0x40/0x40 Sep 20 15:11:15 harding-03 lvm[11964]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events. Sep 20 15:11:15 harding-03 lvm[11964]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events. 
Sep 20 15:11:30 harding-03 lvm[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 20 15:11:30 harding-03 dmeventd[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:14:52 harding-03 kernel: INFO: task clvmd:14014 blocked for more than 120 seconds.

Talking with Corey, it sounds like there's an lvm feature that automatically adds a mirror leg back into a mirrored LV after it has lost a leg. It appears that this action is what causes everything to fall apart, but what is doing this action, how is it triggered, has it left any evidence of itself in the logs, and how can we get a handle on it to see exactly what it's doing?

FWIW: the test passes single-host under load to an xfs filesystem for both the "remove" and "allocate" mirror image policies.

(In reply to David Teigland from comment #18)
> Talking with Corey, it sounds like there's an lvm feature that automatically
> adds a mirror leg back into a mirrored LV after it has lost a leg. It
> appears that this action is what causes everything to fall apart, but what
> is doing this action, how is it triggered, has it left any evidence of
> itself in the logs, and how can we get a handle on it to see exactly what
> it's doing?

Dave, when monitored, and depending on whether the configuration option activation/mirror_image_fault_policy is set to "remove" or "allocate", "mirror" legs are either automatically removed or re-allocation is attempted. We have to think through the implications of comment #15. Is the respective change the one leading to this regression, caused by bogus flushing? Looking forward to Zdenek's patch...

It looks like the problem is related to a hidden "lvconvert --repair" command that is being run by dmeventd, but is basically leaving no clear record of itself or what it's doing or what might be wrong. (A repair like this should be leaving a lot of data behind by default.) To really debug this we should have the full -vvvv output from the lvconvert --repair command and clvmd, but it's not clear how to collect either of those in this environment. I'll try to cobble together all these pieces on my own machines so I can run everything directly with full debugging. (So many separate, complex moving parts that all have to work together correctly... the compounding complexity of it all looks incredibly fragile: failing and replacing disks, dm-mirror, cmirrord, dmeventd, lvconvert --repair in the background, lvmpolld, clvmd, corosync, all somehow working together and simultaneously on three different nodes.)

I reproduced this manually on my own test machines so we should now have all the debugging that's available. There's a lot of data from clvmd, cmirrord and the kernel on all the nodes. There's a lot that looks bad, but it's all foreign to me so I can't say what could be an initial cause vs effects. The basic sequence is:
- mirror LV with three legs active on three nodes
- kill the device under the middle leg on all three nodes
- dmeventd triggers recovery to remove that leg from the mirror
- later, manually add another leg back to the LV; this is where everything falls apart

In my test, the device I killed is brought back online and used to add the new leg again. It's probably worth checking if adding a leg using a different device (not the one that failed) would have the same problem.
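A minimal sketch of how the missing debugging could be captured, based on the standard clvmd(8) and lvconvert(8) options rather than on anything recorded in this report (the LV name is the test's and is purely illustrative):

```
# Ask the running clvmd daemons to emit debug logging to syslog
# (-d2 = log to syslog; -C applies the change cluster-wide to daemons
# that are already running):
clvmd -d2 -C

# Re-run the repair by hand with full verbosity instead of waiting for the
# hidden dmeventd-triggered invocation; --use-policies applies the
# lvm.conf fault policies, roughly mirroring what dmeventd would do:
lvconvert --repair --use-policies -vvvv \
    helter_skelter/syncd_secondary_3legs_1 2> /tmp/lvconvert-repair.vvvv
```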
The problems start when this command is run from node1 to add a new mirror leg: lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 dcl/lvt the lvconvert command hangs here waiting for clvmd: #locking/cluster_locking.c:511 Locking LV 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp PW (LV|NONBLOCK|CLUSTER) (0x9c) and eventually times out: #locking/cluster_locking.c:397 Error locking on node 2: Command timed out clvmd on node1: CLVMD[e10ff700]: Sep 21 18:27:59 do_lock_lv: resource '5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp', cmd = 0x9e LCK_LV _RESUME (UNLOCK|LV|NONBLOCK|CLUSTER_VG), flags = 0x4 ( DMEVENTD_MONITOR ), critical_section = 1 CLVMD[e1232740]: Sep 21 18:27:59 1 got message from nodeid 1 for 0. len 84 #cache/lvmcache.c:203 lvmcache: no saved_vg for vgid "5F4GKTODs380YAx7e9GfesBNqaMkqR5P" #cache/lvmcache.c:506 lvmcache: no saved vg latest 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp #activate/activate.c:2493 Resuming LVID 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp did not find saved VG #activate/activate.c:2501 Resuming LVID 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp reading VG ... #activate/dev_manager.c:2869 Adding new LV dcl/lvt to dtree ... #libdm-deptree.c:2696 Loading table for dcl-lvt_mlog (253:3). #libdm-deptree.c:2641 Adding target to (253:3): 0 8192 linear 8:80 2048 #ioctl/libdm-iface.c:1857 dm table (253:3) [ opencount flush ] [16384] (*1) #libdm-deptree.c:2729 Suppressed dcl-lvt_mlog (253:3) identical table reload. #libdm-deptree.c:2696 Loading table for dcl-lvt_mimage_0 (253:4). #libdm-deptree.c:2641 Adding target to (253:4): 0 1024000 linear 8:17 2048 #ioctl/libdm-iface.c:1857 dm table (253:4) [ opencount flush ] [16384] (*1) #libdm-deptree.c:2729 Suppressed dcl-lvt_mimage_0 (253:4) identical table reload. #libdm-deptree.c:2696 Loading table for dcl-lvt_mimage_2 (253:6). #libdm-deptree.c:2641 Adding target to (253:6): 0 1024000 linear 8:49 2048 #ioctl/libdm-iface.c:1857 dm table (253:6) [ opencount flush ] [16384] (*1) #libdm-deptree.c:2729 Suppressed dcl-lvt_mimage_2 (253:6) identical table reload. #libdm-deptree.c:2696 Loading table for dcl-lvt (253:7). #libdm-deptree.c:2641 Adding target to (253:7): 0 1024000 mirror userspace 4 LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PzkTcLQzehPhzSB1hHZfQU8nZp4ZZ7JWN clustered-disk 253:3 4096 2 253:4 0 253:6 0 1 handle_errors #ioctl/libdm-iface.c:1857 dm table (253:7) [ opencount flush ] [16384] (*1) #ioctl/libdm-iface.c:1998 Internal error: Performing unsafe table load while 6 device(s) are known to be suspended: (253:7) #ioctl/libdm-iface.c:1857 dm reload (253:7) [ noopencount flush ] [16384] (*1) #libdm-deptree.c:1302 Resuming dcl-lvt_mimage_3 (253:8). #libdm-common.c:2325 Udev cookie 0xd4d020c (semid 425985) incremented to 3 #libdm-common.c:2575 Udev cookie 0xd4d020c (semid 425985) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK (0x2e) #ioctl/libdm-iface.c:1857 dm resume (253:8) [ noopencount flush ] [16384] (*1) #libdm-common.c:1484 dcl-lvt_mimage_3: Stacking NODE_ADD (253,8) 0:6 0660 [trust_udev] #libdm-common.c:1494 dcl-lvt_mimage_3: Stacking NODE_READ_AHEAD 0 (flags=0) #libdm-common.c:288 Suspended device counter reduced to 5 #libdm-deptree.c:1302 Resuming dcl-lvt_mimagetmp_3 (253:5). 
#libdm-common.c:2325 Udev cookie 0xd4d020c (semid 425985) incremented to 4 #libdm-common.c:2575 Udev cookie 0xd4d020c (semid 425985) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK (0x2e) #ioctl/libdm-iface.c:1857 dm resume (253:5) [ noopencount flush ] [16384] (*1) CLVMD[e1232740]: Sep 21 18:29:00 Request to client 0x55ef9349ef30 timed-out (send: 1537572479, now: 1537572540) CLVMD[e1232740]: Sep 21 18:29:00 (0x55ef9349ef30) Request timed-out. padding On nodes 1 and 3, cmirrord is sitting still, but on node 2 it's spinning, repeatedly printing data like: Data ready on 4 Issuing callback on local/4 [r6iV5sUp] Request from kernel received: [DM_ULOG_IN_SYNC/144913] [r6iV5sUp] Region is not in-sync: 0 kernel messages: Sep 21 18:21:59 null-01 dmeventd[62194]: device-mapper: waitevent ioctl on LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call Sep 21 18:21:59 null-01 dmeventd[62194]: Unlocking memory Sep 21 18:21:59 null-01 dmeventd[62194]: No longer monitoring mirror device dcl-lvt for events. Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Read failure on mirror device 253:5. Trying alternative device. Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Mirror read failed. Sep 21 18:22:59 null-01 systemd-udevd: worker [24271] /devices/virtual/block/dm-7 is taking a long time Sep 21 18:24:59 null-01 systemd-udevd: worker [24271] /devices/virtual/block/dm-7 timeout; kill it Sep 21 18:24:59 null-01 systemd-udevd: seq 3179 '/devices/virtual/block/dm-7' killed Sep 21 18:27:57 null-01 kernel: INFO: task systemd-udevd:24271 blocked for more than 120 seconds. ... Sep 21 18:28:15 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3438] - retrying Sep 21 18:28:30 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3439] - retrying Sep 21 18:28:45 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3440] - retrying Sep 21 18:29:00 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3441] - retrying Sep 21 18:29:15 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3442] - retrying Sep 21 18:29:30 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3443] - retrying Sep 21 18:29:45 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3444] - retrying Sep 21 07:03:01 null-02 dmeventd[16703]: device-mapper: waitevent ioctl on LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call Sep 21 07:03:01 null-02 dmeventd[16703]: Unlocking memory Sep 21 07:03:01 null-02 dmeventd[16703]: Device cache incomplete - not writing to /etc/lvm/cache/.cache Sep 21 07:03:01 null-02 dmeventd[16703]: No longer monitoring mirror device dcl-lvt for events. Sep 21 07:03:01 null-02 kernel: [1653475.201560] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail. Sep 21 07:03:01 null-02 kernel: [1653475.211831] device-mapper: raid1: Read failure on mirror device 253:6. Trying alternative device. Sep 21 07:03:01 null-02 kernel: [1653475.222182] device-mapper: raid1: Mirror read failed from (null). Trying alternative device. 
Sep 21 07:03:01 null-02 kernel: [1653475.231984] device-mapper: raid1: Mirror read failed from (null). Trying alternative device. (kernel spins printing these, repeating many thousands of times, it may be udev trying to read?) Sep 21 07:08:44 null-02 kernel: [1653817.926216] device-mapper: raid1: Mirror read failed from (null). Trying alternative device. Sep 21 07:08:50 null-02 kernel: [1653823.450292] INFO: task systemd-udevd:32267 blocked for more than 120 seconds. Sep 21 07:08:59 null-02 kernel: [1653833.018474] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144916] - retrying Sep 21 07:09:14 null-02 kernel: [1653848.027227] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144917] - retrying Sep 21 07:09:29 null-02 kernel: [1653863.035980] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144918] - retrying Sep 21 07:09:44 null-02 kernel: [1653878.044731] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144919] - retrying Sep 21 07:09:59 null-02 kernel: [1653893.053482] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144920] - retrying Sep 21 07:10:14 null-02 kernel: [1653908.061235] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144921] - retrying Sep 21 07:10:29 null-02 kernel: [1653923.068986] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144922] - retrying Sep 21 07:10:44 null-02 kernel: [1653938.076738] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144923] - retrying Sep 21 07:08:15 null-03 dmeventd[31812]: device-mapper: waitevent ioctl on LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call Sep 21 07:08:15 null-03 dmeventd[31812]: Unlocking memory Sep 21 07:08:15 null-03 dmeventd[31812]: No longer monitoring mirror device dcl-lvt for events. Sep 21 07:08:15 null-03 kernel: [1653447.996238] device-mapper: raid1: Unable to read primary mirror during recovery Sep 21 07:08:15 null-03 kernel: [1653448.004306] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. Sep 21 07:08:15 null-03 kernel: [1653448.014076] device-mapper: raid1: Mirror read failed. Sep 21 07:08:15 null-03 kernel: [1653448.020532] device-mapper: raid1: Mirror read failed. Sep 21 07:08:15 null-03 kernel: [1653448.026364] buffer_io_error: 2 callbacks suppressed Sep 21 07:08:15 null-03 kernel: [1653448.032017] Buffer I/O error on dev dm-7, logical block 127984, async page read Sep 21 07:14:30 null-03 kernel: [1653823.389915] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1096] - retrying Sep 21 07:14:45 null-03 kernel: [1653838.397113] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1097] - retrying Sep 21 07:15:00 null-03 kernel: [1653853.404633] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1098] - retrying Sep 21 07:15:15 null-03 kernel: [1653868.412175] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1099] - retrying (continues forever) Sep 21 07:53:47 null-03 kernel: [1656179.569617] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1253] - retrying Changed the test so that the new leg is added from a new device (not the device that had previously failed.) This produced a kernel panic on the second node: [ 3351.179294] device-mapper: raid1: Primary mirror (253:4) failed while out-of-sync: Reads may fail. [ 3351.189841] device-mapper: raid1: Read failure on mirror device 253:4. 
Trying alternative device. [ 3351.200476] general protection fault: 0000 [#1] SMP [ 3351.206772] Modules linked in: gfs2 dm_log_userspace dlm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi amd64_edac_mod edac_mce_amd edac_cord [ 3351.283044] CPU: 11 PID: 66 Comm: kworker/11:0 Not tainted 3.10.0-663.el7.x86_64 #1 [ 3351.292200] Hardware name: Dell Inc. PowerEdge R415/08WNM9, BIOS 1.9.3 04/26/2012 [ 3351.301192] Workqueue: kmirrord do_mirror [dm_mirror] [ 3351.307764] task: ffff9347dd29edd0 ti: ffff9347dd2d0000 task.ti: ffff9347dd2d0000 [ 3351.316780] RIP: 0010:[<ffffffffa452901d>] [<ffffffffa452901d>] strnlen+0xd/0x40 [ 3351.325817] RSP: 0018:ffff9347dd2d3af0 EFLAGS: 00010086 [ 3351.332667] RAX: ffffffffa4b02b8e RBX: ffffffffa4f7b570 RCX: fffffffffffffffe [ 3351.341347] RDX: 54495257225b2049 RSI: ffffffffffffffff RDI: 54495257225b2049 [ 3351.350018] RBP: ffff9347dd2d3af0 R08: 000000000000ffff R09: 000000000000ffff [ 3351.358690] R10: 0000000000000000 R11: 0000000000000000 R12: 54495257225b2049 [ 3351.367348] R13: ffffffffa4f7b920 R14: 00000000ffffffff R15: 0000000000000000 [ 3351.376015] FS: 00007f4eabaa3700(0000) GS:ffff93495e740000(0000) knlGS:0000000000000000 [ 3351.385646] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 3351.392935] CR2: 00007f4fbe739490 CR3: 000000041562a000 CR4: 00000000000407e0 [ 3351.401611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3351.410278] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 3351.418926] Stack: [ 3351.422420] ffff9347dd2d3b28 ffffffffa452b50b ffffffffa4f7b570 ffffffffa4f7b920 [ 3351.431369] ffff9347dd2d3c40 ffffffffc004d49a ffffffffc004d49a ffff9347dd2d3b98 [ 3351.440317] ffffffffa452cc11 dead000000000200 000000008a09800c 0000000000000030 [ 3351.449272] Call Trace: [ 3351.453234] [<ffffffffa452b50b>] string.isra.7+0x3b/0xf0 [ 3351.460167] [<ffffffffa452cc11>] vsnprintf+0x201/0x6a0 [ 3351.466932] [<ffffffffa452d0bd>] vscnprintf+0xd/0x30 [ 3351.473516] [<ffffffffa428a49b>] vprintk_emit+0x11b/0x510 [ 3351.480546] [<ffffffffa428aab9>] vprintk_default+0x29/0x40 [ 3351.487615] [<ffffffffa4899d0f>] printk+0x5d/0x74 [ 3351.493870] [<ffffffffc004bd70>] mirror_end_io+0xa0/0x1d0 [dm_mirror] [ 3351.501840] [<ffffffffc0019beb>] clone_endio+0x4b/0x100 [dm_mod] [ 3351.509336] [<ffffffffa443b814>] bio_endio+0x64/0xa0 [ 3351.515778] [<ffffffffc004adf7>] do_reads+0x1a7/0x200 [dm_mirror] [ 3351.523341] [<ffffffffc03cf496>] ? userspace_get_sync_count+0x56/0xa0 [dm_log_userspace] [ 3351.532910] [<ffffffffc004c292>] do_mirror+0x102/0x230 [dm_mirror] [ 3351.540573] [<ffffffffa42a874a>] process_one_work+0x17a/0x440 [ 3351.547794] [<ffffffffa42a97e6>] worker_thread+0x126/0x3c0 [ 3351.554759] [<ffffffffa42a96c0>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 3351.562691] [<ffffffffa42b093f>] kthread+0xcf/0xe0 [ 3351.568977] [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40 [ 3351.576493] [<ffffffffa48b0d18>] ret_from_fork+0x58/0x90 [ 3351.583290] [<ffffffffa42b0870>] ? 
insert_kthread_work+0x40/0x40 [ 3351.590768] Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 2a [ 3351.613186] RIP [<ffffffffa452901d>] strnlen+0xd/0x40 [ 3351.619713] RSP <ffff9347dd2d3af0> [ 3351.629783] ---[ end trace b8c67e0a02f0ed9c ]--- [ 3351.636543] Kernel panic - not syncing: Fatal exception [ 3351.643062] Kernel Offset: 0x23200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 3351.655946] ------------[ cut here ]------------ [ 3351.661808] WARNING: CPU: 11 PID: 66 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x5f/0x70 [ 3351.672375] Modules linked in: gfs2 dm_log_userspace dlm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi amd64_edac_mod edac_mce_amd edac_cord [ 3351.748220] CPU: 11 PID: 66 Comm: kworker/11:0 Tainted: G D ------------ 3.10.0-663.el7.x86_64 #1 [ 3351.759743] Hardware name: Dell Inc. PowerEdge R415/08WNM9, BIOS 1.9.3 04/26/2012 [ 3351.768681] Workqueue: kmirrord do_mirror [dm_mirror] [ 3351.775196] 0000000000000000 000000008a09800c ffff93495e743d90 ffffffffa489fc54 [ 3351.784101] ffff93495e743dd0 ffffffffa4287928 0000007f5e743dd0 0000000000000000 [ 3351.793002] ffff93495e756cc0 00000001002e8e0a 000000000000000b ffff934757616cc0 [ 3351.801900] Call Trace: [ 3351.805797] <IRQ> [<ffffffffa489fc54>] dump_stack+0x19/0x1b [ 3351.813046] [<ffffffffa4287928>] __warn+0xd8/0x100 [ 3351.819392] [<ffffffffa4287a6d>] warn_slowpath_null+0x1d/0x20 [ 3351.826682] [<ffffffffa4250aef>] native_smp_send_reschedule+0x5f/0x70 [ 3351.834671] [<ffffffffa42d2cc7>] trigger_load_balance+0x177/0x1e0 [ 3351.842302] [<ffffffffa42c2d84>] scheduler_tick+0x104/0x150 [ 3351.849397] [<ffffffffa42f2e90>] ? tick_sched_do_timer+0x50/0x50 [ 3351.856924] [<ffffffffa429a2e5>] update_process_times+0x65/0x80 [ 3351.864368] [<ffffffffa42f2c55>] tick_sched_handle.isra.13+0x25/0x60 [ 3351.872255] [<ffffffffa42f2ed3>] tick_sched_timer+0x43/0x70 [ 3351.879356] [<ffffffffa42b4a94>] __hrtimer_run_queues+0xd4/0x260 [ 3351.886900] [<ffffffffa42b502f>] hrtimer_interrupt+0xaf/0x1d0 [ 3351.894185] [<ffffffffa4253815>] local_apic_timer_interrupt+0x35/0x60 [ 3351.902165] [<ffffffffa48b34bd>] smp_apic_timer_interrupt+0x3d/0x50 [ 3351.909971] [<ffffffffa48b1a1d>] apic_timer_interrupt+0x6d/0x80 [ 3351.917426] <EOI> [<ffffffffa4899c6e>] ? panic+0x1c9/0x20d [ 3351.924560] [<ffffffffa48a9095>] oops_end+0xc5/0xe0 [ 3351.930971] [<ffffffffa422e97b>] die+0x4b/0x70 [ 3351.936931] [<ffffffffa48a89fe>] do_general_protection+0x12e/0x1c0 [ 3351.944630] [<ffffffffa48a82a8>] general_protection+0x28/0x30 [ 3351.951896] [<ffffffffa452901d>] ? strnlen+0xd/0x40 [ 3351.958287] [<ffffffffa452b50b>] string.isra.7+0x3b/0xf0 [ 3351.965111] [<ffffffffa452cc11>] vsnprintf+0x201/0x6a0 [ 3351.971758] [<ffffffffa452d0bd>] vscnprintf+0xd/0x30 [ 3351.978225] [<ffffffffa428a49b>] vprintk_emit+0x11b/0x510 [ 3351.985123] [<ffffffffa428aab9>] vprintk_default+0x29/0x40 [ 3351.992072] [<ffffffffa4899d0f>] printk+0x5d/0x74 [ 3351.998211] [<ffffffffc004bd70>] mirror_end_io+0xa0/0x1d0 [dm_mirror] [ 3352.006078] [<ffffffffc0019beb>] clone_endio+0x4b/0x100 [dm_mod] [ 3352.013507] [<ffffffffa443b814>] bio_endio+0x64/0xa0 [ 3352.019895] [<ffffffffc004adf7>] do_reads+0x1a7/0x200 [dm_mirror] [ 3352.027412] [<ffffffffc03cf496>] ? 
userspace_get_sync_count+0x56/0xa0 [dm_log_userspace] [ 3352.036919] [<ffffffffc004c292>] do_mirror+0x102/0x230 [dm_mirror] [ 3352.044490] [<ffffffffa42a874a>] process_one_work+0x17a/0x440 [ 3352.051603] [<ffffffffa42a97e6>] worker_thread+0x126/0x3c0 [ 3352.058421] [<ffffffffa42a96c0>] ? manage_workers.isra.24+0x2a0/0x2a0 [ 3352.066165] [<ffffffffa42b093f>] kthread+0xcf/0xe0 [ 3352.072219] [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40 [ 3352.079459] [<ffffffffa48b0d18>] ret_from_fork+0x58/0x90 [ 3352.085966] [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40 [ 3352.093133] ---[ end trace b8c67e0a02f0ed9d ]--- cmirrord log captured just before the panic: Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_CTR/879] Sep 21 09:24:13 null-02 cmirrord[1992]: Checking /dev/mapper for device 253:2 Sep 21 09:24:13 null-02 cmirrord[1992]: .: NO Sep 21 09:24:13 null-02 cmirrord[1992]: ..: NO Sep 21 09:24:13 null-02 cmirrord[1992]: cvg-clv_mimagetmp_3: NO Sep 21 09:24:13 null-02 cmirrord[1992]: cvg-clv: NO Sep 21 09:24:13 null-02 cmirrord[1992]: cvg-clv_mimage_2: NO Sep 21 09:24:13 null-02 cmirrord[1992]: cvg-clv_mimage_0: NO Sep 21 09:24:13 null-02 cmirrord[1992]: cvg-clv_mlog: YES Sep 21 09:24:13 null-02 cmirrord[1992]: Clustered log disk is /dev/mapper/cvg-clv_mlog Sep 21 09:24:13 null-02 cmirrord[1992]: Disk log ready Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Cluster log created Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_GET_REGION_SIZE/880] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/881] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_CTR/882] Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Cluster log created Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_REGION_SIZE/883] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/884] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_PRESUSPEND/885] Sep 21 09:24:13 null-02 cmirrord[1992]: WARNING: log still marked as 'touched' during suspend Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_POSTSUSPEND/886] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 
09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] clog_postsuspend: leaving CPG Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Finalizing leave... Sep 21 09:24:13 null-02 cmirrord[1992]: Freeing up cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: used_pfds = 1, free_pfds = 2 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] clog_postsuspend: finalizing Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_RESUME/887] Sep 21 09:24:13 null-02 cmirrord[1992]: New handle: 8408462744751177728 Sep 21 09:24:13 null-02 cmirrord[1992]: New name: LVM-4Zjg4QhQ7IwysAmEYDWwKfLe2uTjYrwzO9v3WtkD3KcdVX30Gt523KwhNuvivmCd Sep 21 09:24:13 null-02 cmirrord[1992]: Adding cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: used_pfds = 2, free_pfds = 1 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Log not valid yet, storing request Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] loading sync_bits (sync_count = 0): Sep 21 09:24:13 null-02 cmirrord[1992]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] loading clean_bits: Sep 21 09:24:13 null-02 cmirrord[1992]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Processing delayed request: DM_ULOG_RESUME Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Non-master resume: bits pre-loaded Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/888] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Request from kernel received: [DM_ULOG_DTR/889] Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Cluster log removed Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] storing clean_bits: Sep 21 09:24:13 null-02 cmirrord[1992]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] storing sync_bits (sync_count = 0): Sep 21 09:24:13 null-02 cmirrord[1992]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 - 31] 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Checkpoint prepared for node 1: Sep 21 09:24:13 null-02 cmirrord[1992]: bitmap_size = 32 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/890] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_PRESUSPEND/891] Sep 21 09:24:13 null-02 cmirrord[1992]: WARNING: log still marked as 'touched' during suspend Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/892] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_IN_SYNC/893] Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Region is not in-sync: 249 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 kernel: [ 3351.179294] device-mapper: raid1: Primary mirror (253:4) failed while out-of-sync: Reads may fail. Sep 21 09:24:13 null-02 kernel: [ 3351.189841] device-mapper: raid1: Read failure on mirror device 253:4. Trying alternative device. 
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/894] Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4 Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Request from kernel received: [DM_ULOG_IN_SYNC/895] Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Region is not in-sync: 249 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6 Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6 Retried with: reverted commit 672b8c196b91e4cf6567223ce44dfb902281b791 used a new disk to restore the leg instead of the previously failed disk ran the following command on null-01 to restore the leg, specifying only one (new) PV: lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 cvg/clv /dev/sdf [root@null-01 ~]# lvs -a cvg -o+devices /dev/sdc: open failed: No such device or address LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices clv cvg mwi-a-m--- 500.00m [clv_mlog] 100.00 clv_mimage_0(0),clv_mimage_2(0),clv_mimage_3(0) [clv_mimage_0] cvg iwi-aom--- 500.00m /dev/sdb(0) [clv_mimage_2] cvg iwi-aom--- 500.00m /dev/sdd(0) [clv_mimage_3] cvg iwi-aom--- 500.00m /dev/sdf(0) [clv_mlog] cvg lwi-aom--- 4.00m /dev/sde(0) So, some of the problems we've seen did not appear (lvconvert didn't hang, no panic). Given the kernel errors in the logs, I suspect that if gfs was mounted it still would have read bad data and panicked/withdrawn from io errors. I've copied out some parts of the logs from each node beginning with the command that was restoring the mirror leg. Some of these messages still look quite concerning, and we should have someone who is familiar with them evaluate how bad they are, i.e. whether they are expected during recovery or whether they should not be appearing at all. Sep 24 14:53:19 null-01 dmeventd[1831]: Repair of mirrored device cvg-clv finished successfully. Sep 24 14:53:19 null-01 dmeventd[1831]: cvg-clv is now in-sync. Sep 24 14:55:14 null-01 root: vgreduce removemissing Sep 24 14:59:11 null-01 root: lvconvert m2 from sdf Sep 24 14:59:11 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 14:59:11 null-01 dmeventd[1831]: Unlocking memory Sep 24 14:59:11 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events. ... Sep 24 14:59:15 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv_mimagetmp_3 for events. Sep 24 14:59:15 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events. Sep 24 14:59:30 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call Sep 24 14:59:30 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events. 
Sep 24 14:59:30 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 14:59:30 null-01 dmeventd[1831]: Unlocking memory Sep 24 14:59:30 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events. Sep 24 14:59:30 null-01 cmirrord[1810]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW Sep 24 14:59:30 null-01 cmirrord[1810]: clog_resume: Failed to create cluster CPG Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log presuspend failed Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log presuspend failed Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log postsuspend failed .... Sep 24 14:59:30 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events. Sep 24 14:59:30 null-01 dmeventd[1831]: cvg-clv is now in-sync. Sep 24 14:59:31 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 14:59:31 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events. ... Sep 24 14:59:34 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events. Sep 24 03:40:15 null-02 root: lvconvert m2 from sdf run on null-01 Sep 24 03:40:15 null-02 cmirrord[1201]: Data ready on 6 Sep 24 03:40:15 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 24 03:40:15 null-02 cmirrord[1201]: Data ready on 6 Sep 24 03:40:15 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 24 03:40:16 null-02 cmirrord[1201]: Data ready on 4 Sep 24 03:40:16 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 03:40:16 null-02 dmeventd[1228]: Unlocking memory Sep 24 03:40:16 null-02 dmeventd[1228]: Device cache incomplete - not writing to /etc/lvm/cache/.cache Sep 24 03:40:16 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv for events. Sep 24 03:40:16 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 24 03:40:16 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_CTR/100507] Sep 24 03:40:16 null-02 cmirrord[1201]: Checking /dev/mapper for device 253:2 Sep 24 03:40:16 null-02 cmirrord[1201]: .: NO Sep 24 03:40:16 null-02 cmirrord[1201]: ..: NO Sep 24 03:40:16 null-02 cmirrord[1201]: cvg-clv_mimagetmp_3: NO Sep 24 03:40:16 null-02 cmirrord[1201]: cvg-clv: NO Sep 24 03:40:16 null-02 cmirrord[1201]: cvg-clv_mimage_2: NO Sep 24 03:40:16 null-02 cmirrord[1201]: cvg-clv_mimage_0: NO Sep 24 03:40:16 null-02 cmirrord[1201]: cvg-clv_mlog: YES Sep 24 03:40:16 null-02 cmirrord[1201]: Clustered log disk is /dev/mapper/cvg-clv_mlog Sep 24 03:40:16 null-02 cmirrord[1201]: Disk log ready Sep 24 03:40:16 null-02 cmirrord[1201]: [77KBKJNW] Cluster log created ... Sep 24 03:40:29 null-02 dmeventd[1228]: cvg-clv is now in-sync. Sep 24 03:40:35 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call Sep 24 03:40:35 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events. 
Sep 24 03:40:35 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 03:40:35 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv for events. Sep 24 03:40:35 null-02 cmirrord[1201]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW Sep 24 03:40:35 null-02 cmirrord[1201]: clog_resume: Failed to create cluster CPG Sep 24 03:40:35 null-02 kernel: [ 1703.100972] device-mapper: raid1: log presuspend failed Sep 24 03:40:35 null-02 kernel: [ 1703.108117] device-mapper: raid1: log presuspend failed Sep 24 03:40:35 null-02 kernel: [ 1703.114601] device-mapper: raid1: log postsuspend failed Sep 24 03:39:33 null-03 dmeventd[1126]: cvg-clv is now in-sync. Sep 24 03:45:33 null-03 root: lvconvert m2 from sdf run on null-01 Sep 24 03:45:33 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 03:45:33 null-03 dmeventd[1126]: Unlocking memory Sep 24 03:45:33 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv for events. Sep 24 03:45:33 null-03 kernel: [ 1685.367551] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. Sep 24 03:45:33 null-03 kernel: [ 1685.377763] device-mapper: raid1: Read failure on mirror device 253:5. Trying alternative device. Sep 24 03:45:33 null-03 kernel: [ 1685.387977] device-mapper: raid1: Unable to read primary mirror during recovery Sep 24 03:45:33 null-03 kernel: [ 1685.396486] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. Sep 24 03:45:33 null-03 kernel: [ 1685.407294] device-mapper: raid1: Mirror read failed. Sep 24 03:45:36 null-03 kernel: [ 1688.506127] device-mapper: raid1: Mirror read failed. Sep 24 03:45:36 null-03 kernel: [ 1688.512437] Buffer I/O error on dev dm-7, logical block 127984, async page read Sep 24 03:45:36 null-03 kernel: [ 1688.529859] device-mapper: raid1: Mirror read failed. Sep 24 03:45:36 null-03 kernel: [ 1688.536808] device-mapper: raid1: Mirror read failed. ep 24 03:45:36 null-03 dmeventd[1126]: Monitoring mirror device cvg-clv_mimagetmp_3 for events. Sep 24 03:45:36 null-03 kernel: [ 1688.543037] Buffer I/O error on dev dm-7, logical block 127984, async page read Sep 24 03:45:36 null-03 dmeventd[1126]: Monitoring mirror device cvg-clv for events. Sep 24 03:45:52 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call Sep 24 03:45:52 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events. Sep 24 03:45:52 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call Sep 24 03:45:52 null-03 dmeventd[1126]: Unlocking memory Sep 24 03:45:52 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv for events. 
Sep 24 03:45:52 null-03 cmirrord[1099]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW Sep 24 03:45:52 null-03 cmirrord[1099]: clog_resume: Failed to create cluster CPG Sep 24 03:45:52 null-03 kernel: [ 1704.096658] device-mapper: raid1: log presuspend failed Sep 24 03:45:52 null-03 kernel: [ 1704.107810] device-mapper: raid1: log presuspend failed Sep 24 03:45:52 null-03 kernel: [ 1704.114334] device-mapper: raid1: log postsuspend failed Created attachment 1486443 [details]
working lvconvert
-vvvv output from the lvconvert command that completed in the last comment.
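For reference, a rough sketch of the repair sequence that the attached -vvvv output corresponds to, reconstructed from the "vgreduce removemissing" and "lvconvert m2 from sdf" markers logged above (the VG/LV/device names are the ones on this test cluster, and the exact step ordering is an assumption):

# Drop the failed PV from the VG so new allocation uses the remaining devices.
vgreduce --removemissing cvg
# Re-add the missing leg from a new device; this is the command whose -vvvv trace is attached.
lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 cvg/clv /dev/sdf
# Confirm the new mimage_* leg was allocated and the mirror reports 100% sync.
lvs -a -o+devices cvg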
Again, no evidence of any problem after recovering for the failed leg. The problems appeared only when I added another leg: [root@null-01 ~]# lvs -a -o+devices cvg /dev/sdd: open failed: No such device or address LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices clv cvg cwi-aom--- 500.00m 57.60 [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0) [clv_mimage_0] cvg iwi-aom--- 500.00m /dev/sdb(0) [clv_mimage_3] cvg iwi-aom--- 500.00m /dev/sdf(0) [clv_mimage_4] cvg Iwi-aom--- 500.00m /dev/sdg(0) [clv_mimagetmp_4] cvg mwi-aom--- 500.00m [clv_mlog] 100.00 clv_mimage_0(0),clv_mimage_3(0) [clv_mlog] cvg lwi-aom--- 4.00m /dev/sde(0) [root@null-02 ~]# lvs -a -o+devices cvg /dev/sdd: open failed: No such device or address LV VG Attr LSize Log Cpy%Sync Convert Devices clv cvg cwi-aom--- 500.00m 57.60 [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0) [clv_mimage_0] cvg iwi-aom--- 500.00m /dev/sdb(0) [clv_mimage_3] cvg iwi-aom--- 500.00m /dev/sdf(0) [clv_mimage_4] cvg Iwi-aom--- 500.00m /dev/sdg(0) [clv_mimagetmp_4] cvg mwi-aom--- 500.00m [clv_mlog] 100.00 clv_mimage_0(0),clv_mimage_3(0) [clv_mlog] cvg lwi-aom--- 4.00m /dev/sde(0) [root@null-03 ~]# lvs -a -o+devices cvg /dev/sdd: open failed: No such device or address LV VG Attr LSize Log Cpy%Sync Convert Devices clv cvg cwi-aom--- 500.00m 57.60 [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0) [clv_mimage_0] cvg iwi-aom--- 500.00m /dev/sdb(0) [clv_mimage_3] cvg iwi-aom--- 500.00m /dev/sdf(0) [clv_mimage_4] cvg Iwi-aom--- 500.00m /dev/sdg(0) [clv_mimagetmp_4] cvg mwi-aom--- 500.00m [clv_mlog] 100.00 clv_mimage_0(0),clv_mimage_3(0) [clv_mlog] cvg lwi-aom--- 4.00m /dev/sde(0) [root@null-01 ~]# dmsetup ls --tree rhel_null--01-home (253:2) └─ (8:2) cvg-clv (253:7) ├─cvg-clv_mimage_4 (253:6) │ └─ (8:96) └─cvg-clv_mimagetmp_4 (253:5) ├─cvg-clv_mimage_3 (253:8) │ └─ (8:80) ├─cvg-clv_mimage_0 (253:4) │ └─ (8:16) └─cvg-clv_mlog (253:3) └─ (8:64) rhel_null--01-swap (253:1) └─ (8:2) rhel_null--01-root (253:0) └─ (8:2) [root@null-02 ~]# dmsetup ls --tree rhel_null--02-swap (253:0) └─ (8:2) rhel_null--02-root (253:1) └─ (8:2) cvg-clv (253:6) ├─cvg-clv_mimage_4 (253:5) │ └─ (8:96) └─cvg-clv_mimagetmp_4 (253:4) ├─cvg-clv_mimage_3 (253:7) │ └─ (8:80) ├─cvg-clv_mimage_0 (253:3) │ └─ (8:16) └─cvg-clv_mlog (253:2) └─ (8:64) [root@null-03 ~]# dmsetup ls --tree rhel_null--03-home (253:2) └─ (8:2) rhel_null--03-swap (253:0) └─ (8:2) rhel_null--03-root (253:1) └─ (8:2) cvg-clv (253:7) ├─cvg-clv_mimage_4 (253:6) │ └─ (8:96) └─cvg-clv_mimagetmp_4 (253:5) ├─cvg-clv_mimage_3 (253:8) │ └─ (8:80) ├─cvg-clv_mimage_0 (253:4) │ └─ (8:16) └─cvg-clv_mlog (253:3) └─ (8:64) In status, note that null-02 says "AA" and the others say "SA". 
[root@null-01 ~]# dmsetup status cvg-clv_mimage_4: 0 1024000 linear cvg-clv_mlog: 0 8192 linear cvg-clv_mimage_3: 0 1024000 linear cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:4 253:8 250/250 1 AA 3 clustered-disk 253:3 A cvg-clv_mimage_0: 0 1024000 linear rhel_null--01-home: 0 853286912 linear cvg-clv: 0 1024000 mirror 2 253:5 253:6 144/250 1 SA 1 clustered-core rhel_null--01-swap: 0 16515072 linear rhel_null--01-root: 0 104857600 linear [root@null-02 ~]# dmsetup status rhel_null--02-swap: 0 16416768 linear rhel_null--02-root: 0 104857600 linear cvg-clv_mimage_4: 0 1024000 linear cvg-clv_mlog: 0 8192 linear cvg-clv_mimage_3: 0 1024000 linear cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:3 253:7 250/250 1 AA 3 clustered-disk 253:3 A cvg-clv_mimage_0: 0 1024000 linear cvg-clv: 0 1024000 mirror 2 253:4 253:5 144/250 1 AA 1 clustered-core [root@null-03 ~]# dmsetup status rhel_null--03-home: 0 854466560 linear rhel_null--03-swap: 0 16416768 linear rhel_null--03-root: 0 104857600 linear cvg-clv_mimage_4: 0 1024000 linear cvg-clv_mlog: 0 8192 linear cvg-clv_mimage_3: 0 1024000 linear cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:4 253:8 250/250 1 AA 3 clustered-disk 253:3 A cvg-clv_mimage_0: 0 1024000 linear cvg-clv: 0 1024000 mirror 2 253:5 253:6 144/250 1 SA 1 clustered-core null-02 is where we see this, so it indicates the "AA" could be related: [88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log More kernel log messages for comment 30. null-02 with status "AA" does not have the dm kernel messages that the other two nodes with status "SA" show. null-01: [88525.369285] device-mapper: raid1: Unable to read primary mirror during recovery [88525.376599] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. [88525.386834] device-mapper: raid1: Mirror read failed. [88525.393633] device-mapper: raid1: Mirror read failed. [88525.398680] Buffer I/O error on dev dm-7, logical block 127984, async page read [88528.681717] device-mapper: raid1: Mirror read failed. [88528.686768] Buffer I/O error on dev dm-7, logical block 127984, async page read null-02: [88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log null-03: [88508.205581] device-mapper: raid1: Unable to read primary mirror during recovery [88508.210503] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. [88508.210505] device-mapper: raid1: Read failure on mirror device 253:5. Trying alternative device. [88508.210576] device-mapper: raid1: Mirror read failed. [88508.210668] device-mapper: raid1: Mirror read failed. [88508.210671] Buffer I/O error on dev dm-7, logical block 127984, async page read [88508.254580] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail. [88511.706583] device-mapper: raid1: Mirror read failed. [88511.713702] device-mapper: raid1: Mirror read failed. [88511.719832] Buffer I/O error on dev dm-7, logical block 127984, async page read null-02 is the node where cmirrord was active. Here are the cmirrord logs surrounding the GFS2 io error message. 
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 144 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 144 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233270] Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233271] Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233272] Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IS_REMOTE_RECOVERING/233273] Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 102 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 102 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_PRESUSPEND/233274] Sep 25 03:47:22 null-02 cmirrord[1201]: WARNING: log still marked as 'touched' during suspend Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits: Sep 25 03:47:22 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:22 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0): Sep 25 03:47:22 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:22 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 1: Sep 25 03:47:22 null-02 
cmirrord[1201]: bitmap_size = 32 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233275] Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... [not remote recovering]: 41 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... [not remote recovering]: 144 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 kernel: [88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233276] Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IS_REMOTE_RECOVERING/233277] Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... 
[not remote recovering]: 102 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233278] Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 102 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233279] Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 102 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_POSTSUSPEND/233280] Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233281] Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] clog_postsuspend: leaving CPG Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: Finalizing leave... Sep 25 03:47:23 null-02 cmirrord[1201]: Freeing up cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: used_pfds = 1, free_pfds = 2 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] clog_postsuspend: finalizing Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_RESUME/233282] Sep 25 03:47:23 null-02 cmirrord[1201]: New handle: 8782769015748558848 Sep 25 03:47:23 null-02 cmirrord[1201]: New name: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW Sep 25 03:47:23 null-02 cmirrord[1201]: Adding cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: used_pfds = 2, free_pfds = 1 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Master resume: reading disk log Sep 25 03:47:23 null-02 cmirrord[1201]: Disk log initialized Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing clean_bits: Sep 25 03:47:23 null-02 cmirrord[1201]: [ 0 - 15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 - 31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing sync_bits (sync_count = 250): Sep 25 03:47:23 null-02 cmirrord[1201]: [ 0 - 15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 - 31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Checkpoint prepared for node 3: Sep 25 03:47:23 null-02 cmirrord[1201]: bitmap_size = 32 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing clean_bits: Sep 25 03:47:23 null-02 cmirrord[1201]: [ 0 - 15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF Sep 25 03:47:23 null-02 
cmirrord[1201]: [ 16 - 31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing sync_bits (sync_count = 250): Sep 25 03:47:23 null-02 cmirrord[1201]: [ 0 - 15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 - 31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03 Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Checkpoint prepared for node 1: Sep 25 03:47:23 null-02 cmirrord[1201]: bitmap_size = 32 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_RESUME/233283] Sep 25 03:47:26 null-02 cmirrord[1201]: New handle: 8450407966594564097 Sep 25 03:47:26 null-02 cmirrord[1201]: New name: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV Sep 25 03:47:26 null-02 cmirrord[1201]: Adding cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: used_pfds = 3, free_pfds = 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233284] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Master resume. Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits: Sep 25 03:47:26 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0): Sep 25 03:47:26 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 3: Sep 25 03:47:26 null-02 cmirrord[1201]: bitmap_size = 32 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233285] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits: Sep 25 03:47:26 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0): Sep 25 03:47:26 null-02 cmirrord[1201]: [ 0 - 15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 - 31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 1: Sep 25 
03:47:26 null-02 cmirrord[1201]: bitmap_size = 32 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233286] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 6 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233287] Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233288] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233289] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233290] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233291] Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 249 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233292] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233293] Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 193 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233294] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: 
[DM_ULOG_IN_SYNC/233295] Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 193 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233296] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233297] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233298] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233299] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233300] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233301] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 249 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233302] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 193 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233303] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 193 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 41 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 41 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: 
Issuing callback on cluster/7 Sep 25 03:47:26 null-02 dmeventd[1228]: devices/ignore_lvm_mirrors not found in config: defaulting to 1 Sep 25 03:47:26 null-02 dmeventd[1228]: devices/cache not found in config: defaulting to /etc/lvm/cache/.cache Sep 25 03:47:26 null-02 dmeventd[1228]: Loaded persistent filter cache from /etc/lvm/cache/.cache Sep 25 03:47:26 null-02 dmeventd[1228]: Failed to load existing device cache from /etc/lvm/cache/.cache Sep 25 03:47:26 null-02 dmeventd[1228]: Locking memory Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233304] Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Request from kernel received: [DM_ULOG_IN_SYNC/233305] Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 249 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233306] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 249 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233307] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233308] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is currently remote recovering: 144 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233309] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233310] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 Sep 25 03:47:26 
null-02 cmirrord[1201]: [77KBKJNW] Request from kernel received: [DM_ULOG_IN_SYNC/233311] Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0 Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4 Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4 (In reply to David Teigland from comment #32) > null-02 is the node where cmirrord was active. Here are the cmirrord logs > surrounding the GFS2 io error message. > > > Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote > recovery: 144 > Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority > list: 144 > Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6 I don't see anything alarming in comment 32. (Note that cmirrord should be active on all nodes.) (In reply to David Teigland from comment #31) > More kernel log messages for comment 30. null-02 with status "AA" does not > have the dm kernel messages that the other two nodes with status "SA" show. > > > null-01: > [88525.369285] device-mapper: raid1: Unable to read primary mirror during > recovery Yes, the 'S' in status output means exactly is printed in this error message. For some reason, the primary mirror could not be read. The primary mirror in this case is itself a mirror: └─cvg-clv_mimagetmp_4 (253:5) <------------ here. ├─cvg-clv_mimage_3 (253:8) │ └─ (8:80) ├─cvg-clv_mimage_0 (253:4) │ └─ (8:16) └─cvg-clv_mlog (253:3) └─ (8:64) Reproducible without the device failure. Steps to reproduce: # Create 3-way cluster mirror [root@bp-01 ~]# lvcreate --type mirror -L 1G -n lv -m 2 vg # Find center mimage device (/dev/sdc1) [root@bp-01 ~]# devices LV Attr Cpy%Sync Devices home -wi-ao---- /dev/sda2(2016) root -wi-ao---- /dev/sda2(106177) swap -wi-ao---- /dev/sda2(0) lv mwi-a-m--- 100.00 lv_mimage_0(0),lv_mimage_1(0),lv_mimage_2(0) [lv_mimage_0] iwi-aom--- /dev/sdb1(0) [lv_mimage_1] iwi-aom--- /dev/sdc1(0) [lv_mimage_2] iwi-aom--- /dev/sdd1(0) [lv_mlog] lwi-aom--- /dev/sdi1(0) # Remove center mimage device [root@bp-01 ~]# lvconvert -m1 vg/lv /dev/sdc1 Logical volume vg/lv converted. # Observe that mimage_1 (/dev/sdc1) has been removed [root@bp-01 ~]# devices LV Attr Cpy%Sync Devices home -wi-ao---- /dev/sda2(2016) root -wi-ao---- /dev/sda2(106177) swap -wi-ao---- /dev/sda2(0) lv mwi-a-m--- 100.00 lv_mimage_0(0),lv_mimage_2(0) [lv_mimage_0] iwi-aom--- /dev/sdb1(0) [lv_mimage_2] iwi-aom--- /dev/sdd1(0) [lv_mlog] lwi-aom--- /dev/sdi1(0) # Select a new device (/dev/sde1) to upconvert with [root@bp-01 ~]# lvconvert -m 2 vg/lv /dev/sde1 Logical volume vg/lv being converted. vg/lv: Converted: 2.73% vg/lv: Converted: 100.00% # See that the logs contain messages that they should not bp-01 (where the command was run): [525708.909079] device-mapper: raid1: Unable to read primary mirror during recovery [525708.916474] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fa. [525708.925639] device-mapper: raid1: Mirror read failed. bp-02: [525716.599401] device-mapper: raid1: log presuspend failed [525716.613783] device-mapper: raid1: log presuspend failed [525716.619219] device-mapper: raid1: log postsuspend failed bp-03: [525690.181011] device-mapper: raid1: log presuspend failed [525690.198245] device-mapper: raid1: log presuspend failed [525690.203626] device-mapper: raid1: log postsuspend failed After rebooting I didn't remove any devices or any legs, I just added a new mirror leg and triggered the error. 
[root@null-01 ~]# lvs -a -o+devices cvg
  LV             VG  Attr       LSize   Log        Cpy%Sync Devices
  clv            cvg mwi-a-m--- 500.00m [clv_mlog] 100.00   clv_mimage_0(0),clv_mimage_3(0),clv_mimage_4(0)
  [clv_mimage_0] cvg iwi-aom--- 500.00m                     /dev/sdb(0)
  [clv_mimage_3] cvg iwi-aom--- 500.00m                     /dev/sdf(0)
  [clv_mimage_4] cvg iwi-aom--- 500.00m                     /dev/sdg(0)
  [clv_mlog]     cvg lwi-aom--- 4.00m                       /dev/sde(0)

[root@null-01 ~]# lvconvert -m 3 cvg/clv /dev/sdd
  Logical volume cvg/clv being converted.
  /run/lvm/lvmpolld.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmpolld. Proceeding with polling without using lvmpolld.
  WARNING: Check global/use_lvmpolld in lvm.conf or the lvmpolld daemon state.
  cvg/clv: Converted: 1.60%
  cvg/clv: Converted: 100.00%
  Logical volume cvg/clv converted.

[root@null-01 ~]# dmsetup status
cvg-clv_mimage_5: 0 1024000 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
rhel_null--01-home: 0 853286912 linear
cvg-clv: 0 1024000 mirror 4 253:4 253:5 253:7 253:9 250/250 1 AAAA 3 clustered-disk 253:3 A
rhel_null--01-swap: 0 16515072 linear
rhel_null--01-root: 0 104857600 linear

[root@null-02 ~]# dmsetup status
rhel_null--02-swap: 0 16416768 linear
rhel_null--02-root: 0 104857600 linear
cvg-clv_mimage_5: 0 1024000 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
cvg-clv: 0 1024000 mirror 4 253:3 253:4 253:6 253:8 250/250 1 AAAA 3 clustered-disk 253:3 A

[root@null-03 ~]# dmsetup status
rhel_null--03-home: 0 854466560 linear
rhel_null--03-swap: 0 16416768 linear
cvg-clv_mimage_5: 0 1024000 linear
rhel_null--03-root: 0 104857600 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
cvg-clv: 0 1024000 mirror 4 253:4 253:5 253:7 253:9 250/250 1 AAAA 3 clustered-disk 253:3 A

null-01:
[ 3778.178020] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 3794.007824] device-mapper: raid1: log presuspend failed
[ 3794.014788] device-mapper: raid1: log presuspend failed
[ 3794.020111] device-mapper: raid1: log postsuspend failed
[ 3980.771125] device-mapper: raid1: log presuspend failed
[ 3980.777218] device-mapper: raid1: log presuspend failed
[ 3980.782570] device-mapper: raid1: log postsuspend failed

null-02:
[ 4878.089705] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 4893.891383] device-mapper: raid1: log presuspend failed
[ 4893.898711] device-mapper: raid1: log presuspend failed
[ 4893.905391] device-mapper: raid1: log postsuspend failed
[ 5080.630072] device-mapper: raid1: log presuspend failed
[ 5080.637326] device-mapper: raid1: log presuspend failed
[ 5080.644137] device-mapper: raid1: log postsuspend failed

null-03:
[ 3766.366141] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 3782.170933] device-mapper: raid1: log presuspend failed
[ 3782.186368] device-mapper: raid1: log presuspend failed
[ 3782.193035] device-mapper: raid1: log postsuspend failed
[ 3950.081949] device-mapper: raid1: Unable to read primary mirror during recovery
[ 3950.086238] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail.
[ 3950.086239] device-mapper: raid1: Read failure on mirror device 253:6. Trying alternative device.
[ 3950.086312] device-mapper: raid1: Mirror read failed.
[ 3950.086407] device-mapper: raid1: Mirror read failed.
[ 3950.086410] Buffer I/O error on dev dm-8, logical block 127984, async page read
[ 3950.132341] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail.
[ 3953.272811] device-mapper: raid1: Mirror read failed.
[ 3953.279836] device-mapper: raid1: Mirror read failed.
[ 3953.286215] Buffer I/O error on dev dm-8, logical block 127984, async page read
[ 3968.913718] device-mapper: raid1: log presuspend failed
[ 3968.923439] device-mapper: raid1: log presuspend failed
[ 3968.930068] device-mapper: raid1: log postsuspend failed

In my own testing, I found that the problems appear to come from commit fade45b1d14c90eac7c6cc41019ef946659a8fb5.

To cleanly revert fade45b1 from the stable branch, first revert 5f648406b0f2e322851fce36b062a298cb3d6736, which fixes an unrelated regression from fade45b1 (see bug 1594916).

When I revert both of these on the stable branch, I no longer see the problem.

(In reply to David Teigland from comment #37)
> In my own testing, I found that the problems appear to come from commit
> fade45b1d14c90eac7c6cc41019ef946659a8fb5
>
> To cleanly revert fade45b1 from the stable branch, first revert
> 5f648406b0f2e322851fce36b062a298cb3d6736 which fixes an unrelated
> regression from fade45b1 (see bug 1594916).
>
> When I revert both of these on the stable branch I no longer see the problem.

I'll try that as soon as I can. Yesterday, I ran into a lot of trouble trying to bisect: many checkins did not compile or work. The commit I know to work is 877c2f2ffb3813e3236a55f07958009633dc8558 (good, Mon Apr 30 09:34:12 2018). However, I've been running into bcache issues and have been unable to purge the VG and try other commits... will keep at it.

(In reply to David Teigland from comment #37)
> In my own testing, I found that the problems appear to come from commit
> fade45b1d14c90eac7c6cc41019ef946659a8fb5
>
> To cleanly revert fade45b1 from the stable branch, first revert
> 5f648406b0f2e322851fce36b062a298cb3d6736 which fixes an unrelated
> regression from fade45b1 (see bug 1594916).
>
> When I revert both of these on the stable branch I no longer see the problem.

I've landed on the exact same conclusion with 'git bisect' (after getting through the bcache issues). I can also confirm that reverting those two commits fixes the problem for me.

Set blocker+. RE: Regression.

Created attachment 1487280 [details]
lvconvert output
Verbose output of lvconvert successfully adding a leg to a cluster mirror, using the stable branch with the two identified commits reverted.
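A note on reproducing the reverted build used for the verification above: the following is a minimal sketch only, assuming an upstream lvm2 git checkout. The repository URL, branch placeholder, and configure options are illustrative assumptions, not a record of the exact build used for these attachments; the two commit hashes are the ones identified in comment 37.

# hedged sketch: build lvm2 with the two identified commits reverted
git clone https://sourceware.org/git/lvm2.git && cd lvm2
git checkout <stable-2.02-branch>      # placeholder for the stable branch referenced above
# revert the bug 1594916 follow-up first so the second revert applies cleanly
git revert 5f648406b0f2e322851fce36b062a298cb3d6736
# then revert the commit identified as introducing this regression
git revert fade45b1d14c90eac7c6cc41019ef946659a8fb5
./configure --with-cluster=internal --with-clvmd=corosync   # options are illustrative
make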
Created attachment 1487285 [details]
lvconvert output
lvconvert successfully adding a leg to a cluster mirror with the two commits reverted.
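The clvmd traces attached below were captured per node during the reverted-build lvconvert. As a rough sketch of one way such output can be collected (a sketch only, based on the options documented in clvmd(8); the exact collection method used for these attachments is not recorded here, and the lvconvert command shown is illustrative):

# hedged sketch: enable clvmd debug output across the cluster, re-run the
# up-convert, then pull the per-node messages out of syslog
clvmd -d2 -C                      # -d2 sends debug output to syslog, -C applies it to all running clvmds
lvconvert -m 2 vg/lv /dev/sde1    # the up-convert from the reproduction steps above
grep clvmd /var/log/messages > /tmp/clvmd-$(hostname).log   # RHEL 7 default syslog location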
Created attachment 1487286 [details]
clvmd output node 1 for lvconvert in comment 42

Created attachment 1487287 [details]
clvmd output node 2 for lvconvert in comment 42

Created attachment 1487288 [details]
clvmd output node 3 for lvconvert in comment 42

Unfortunately, we need more time for analysis. We are fairly certain we know the fix, but we have seen the possibility of another regression with it. We are going to have to push for a 0-day.

We can get through this issue with a release note and a 0-day fix.
1. This issue only affects cluster mirror users.
2. It can only manifest if a certain device in the mirror fails or is removed.
3. It can only happen to 3+ way mirrors.
4. It only happens when up-converting a mirror to a mirror.

Hello. We need PMApproved for the zero-day.

Hmm, after a period of tracing and hunting numerous bugs in my local fake cluster (which I had thought was a quite precise emulation of cluster behavior), and looking through the debug traces on Corey's cluster nodes, there is one major issue: the SUSPEND seems never to be issued.

So my current findings: the probable major issue is that there is always a PRELOAD operation followed by a direct resume (ACTIVATE), which is posted with 'flushing', while the normal behavior was always a separate suspend without flushing in this case. I'm not quite sure what has changed in suspend processing in the cluster. A single node seems to go through the suspend->resume cycle as expected.

Other issues noticed:
- The dm multipath DM status is also using flushing; that should be eliminated (one missing dm_task_no_flush() call).
- Clvmd is doing scans while already holding the VG lock (thus, in the case of a missing PV, several new rounds of metadata reprocessing happen); this is related to the whole new scanning.
- It seems the scan is able to issue a 'read-ahead' operation on a device that is later ignored by some filter.
- Comment 22 shows an internal error, which could be related to the missing suspend.

ATM I'm mostly puzzled by this logged output from clvmd. Using JUST a linear LV and creating its snapshot (so no relation to the mirror at all), I can see this:

Loading table for helter_skelter-lvol2 (253:22).
Adding target to (253:22): 0 24576 snapshot 253:20 253:21 P 8
dm table (253:22) [ opencount flush ] [16384] (*1)
dm reload (253:22) [ noopencount flush ] [16384] (*1)
Table size changed from 0 to 24576 for helter_skelter-lvol2 (253:22).
Requiring flush for LV helter_skelter/lvol1.
Leaving critical section (resumed).
dm info LVM-p12xDsu6sa5K2mm8XD3MTZeZTa6ByIwb0XWlQHvGyfn0TyuEPFhpeeejWCmj8zwb [ opencount flush ] [16384] (*1)
helter_skelter-lvol2: device not registered.
Monitoring helter_skelter/lvol2 with libdevmapper-event-lvm2snapshot.so.

I'm looking at how it is possible that inside the 'suspend' code we are suddenly reporting 'Leaving critical section (resumed).'; I've asked Marian for a -O0 build.

For comparison, here is the expected code path on my Rawhide build:

libdm-deptree.c:1944 Creating @PREFIX@vg-snap
ioctl/libdm-iface.c:1857 dm create @PREFIX@vg-snap LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym91e2C99e50ZMorEbkUcdPlPhgx0LfO3f [ noopencount flush ] [16384] (*1)
libdm-deptree.c:2696 Loading table for @PREFIX@vg-snap (253:9).
libdm-deptree.c:2641 Adding target to (253:9): 0 20480 snapshot 253:7 253:8 P 8
ioctl/libdm-iface.c:1857 dm table (253:9) [ opencount flush ] [16384] (*1)
ioctl/libdm-iface.c:1857 dm reload (253:9) [ noopencount flush ] [16384] (*1)
libdm-deptree.c:2746 Table size changed from 0 to 20480 for @PREFIX@vg-snap (253:9).
activate/activate.c:2345 Requiring flush for LV @PREFIX@vg/LV1.
mm/memlock.c:623 Entering critical section (suspending).
mm/memlock.c:583 Lock: Memlock counters: prioritized:1 locked:1 critical:1 daemon:1 suspended:0
activate/dev_manager.c:3225 Creating SUSPEND_WITH_LOCKFS tree for @PREFIX@vg/LV1.
activate/dev_manager.c:779 Getting device info for @PREFIX@vg-LV1 [LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O].
ioctl/libdm-iface.c:1857 dm info LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O [ opencount flush enablechecks ] [16384] (*1)
ioctl/libdm-iface.c:1857 dm deps (253:6) [ opencount flush ] [16384] (*1)
ioctl/libdm-iface.c:1857 dm deps (253:4) [ opencount flush ] [16384] (*1)
activate/dev_manager.c:779 Getting device info for @PREFIX@vg-LV1-real [LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O-real].

So the missing critical-section log from comment 52 is caused by the disabled 'activation' logging, which I had completely forgotten to watch for, since on my machine it is always set to 1... continuing to explore the differences. (A hedged lvm.conf sketch for enabling this logging follows the patch description below.)

Created attachment 1492603 [details]
Fixing clustered mirror patch
Patch fixes unwanted table reloading during lvconvert operation in cluster.
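Following up on the note above about the disabled 'activation' logging: the comment attributes the missing 'Entering critical section (suspending).' / 'Leaving critical section (resumed).' lines to that setting being off, and whether they appear in a trace depends on the log settings in lvm.conf. A minimal sketch of the relevant lvm.conf excerpt, with illustrative values rather than the ones used on the test clusters (see lvm.conf(5) for details):

# /etc/lvm/lvm.conf excerpt (sketch): capture verbose activation messages in a file
log {
    verbose = 0
    syslog = 1
    file = "/tmp/lvm2-debug.log"   # illustrative path
    level = 7                      # most verbose level for the log file
    activation = 1                 # include device-mapper activation messages
}

Note that the stock lvm.conf warns against leaving activation logging enabled in low-memory situations, so it is best treated as a temporary debugging setting.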
Hi Steve, do we need this cloned? I cannot push the changes.

Upstreamed a slightly modified version of the attached patch:
https://www.redhat.com/archives/lvm-devel/2019-May/msg00017.html

There is still a known issue in cluster mirrors that needs to be resolved: bug 1711427 (also a blocker). It would be best if this bug was tested with that fix as well (it should be available soon).

This scenario is once again passing with the latest rpms. Marking verified.

3.10.0-1057.el7.x86_64

lvm2-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-libs-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-cluster-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-lockd-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-python-boom-0.9-18.el7    BUILT: Fri Jun 21 04:18:58 CDT 2019
cmirror-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-persistent-data-0.8.5-1.el7    BUILT: Mon Jun 10 03:58:20 CDT 2019

********* Mirror hash info for this scenario *********
* names:              syncd_secondary_3legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/mapper/mpathd1 /dev/mapper/mpathc1 /dev/mapper/mpathg1
* log devices:        /dev/mapper/mpathb1
* no MDA devices:
* failpv(s):          /dev/mapper/mpathc1
* failnode(s):        harding-02 harding-03
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on harding-02...
harding-02: lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/mapper/mpathd1:0-2400 /dev/mapper/mpathc1:0-2400 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpathb1:0-150
WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_3legs_1 at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.

Current mirror/raid device structure(s):
  LV                                 Attr       LSize   Cpy%Sync Devices
  syncd_secondary_3legs_1            mwi-a-m--- 500.00m 6.40     syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0)
  [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/mapper/mpathd1(0)
  [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/mapper/mpathc1(0)
  [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m          /dev/mapper/mpathg1(0)
  [syncd_secondary_3legs_1_mlog]     lwi-aom--- 4.00m            /dev/mapper/mpathb1(0)

Waiting until all mirror|raid volumes become fully syncd...
  1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Sleeping 15 sec

Creating gfs2 on top of mirror(s) on harding-02...
mkfs.gfs2 -o align=0 -J 32M -j 2 -p lock_dlm -t HARDING:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O
Warning: device is not properly aligned. This may harm performance.
Mounting mirrored gfs2 filesystems on harding-02...
Mounting mirrored gfs2 filesystems on harding-03...

PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1
PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1

Writing verification files (checkit) to mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Failing path sdn on harding-02
Failing path sdv on harding-02
Failing path sdn on harding-03
Failing path sdv on harding-03
Failing path sdad on harding-03
Failing path sdf on harding-03

Attempting I/O to cause mirror down conversion(s) on harding-02
dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.059059 s, 710 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  LV                                    Attr       LSize   Cpy%Sync Devices
  syncd_secondary_3legs_1               cwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0)
  [syncd_secondary_3legs_1_mimage_0]    Iwi-aom--- 500.00m          /dev/mapper/mpathd1(0)
  [syncd_secondary_3legs_1_mimage_2]    Iwi-aom--- 500.00m          /dev/mapper/mpathg1(0)
  [syncd_secondary_3legs_1_mimage_3]    iwi-aom--- 500.00m          /dev/mapper/mpatha1(0)
  [syncd_secondary_3legs_1_mimagetmp_3] mwi---m--- 500.00m          syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)
  [syncd_secondary_3legs_1_mlog]        lwi-aom--- 4.00m            /dev/mapper/mpathb1(0)

Verifying FAILED device /dev/mapper/mpathc1 is *NOT* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
olog: 1
Verifying LOG device(s) /dev/mapper/mpathb1 *ARE* in the mirror(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
Verifying LEG device /dev/mapper/mpathd1 *IS* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
Verifying LEG device /dev/mapper/mpathg1 *IS* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.

verify the dm devices associated with /dev/mapper/mpathc1 have been removed as expected
Checking REMOVAL of syncd_secondary_3legs_1_mimage_1 on: harding-02 harding-03

verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_secondary_3legs_1_mimage_3 on: harding-02 harding-03

Verify that the mirror image order remains the same after the down conversion
EXPECTED LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathg1 unknown
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
ACTUAL LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathg1 /dev/mapper/mpatha1

Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Enabling path sdn on harding-02
Enabling path sdv on harding-02
Enabling path sdn on harding-03
Enabling path sdv on harding-03
Enabling path sdad on harding-03
Enabling path sdf on harding-03

-------------------------------------------------------------------------------
Force a vgreduce to either clean up the corrupt additional LV or lack of PV MDA
-------------------------------------------------------------------------------
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 81 vs previous 89.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 81 vs previous 89.
  WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 89

Recreating PVs /dev/mapper/mpathc1 and then extending back into helter_skelter
harding-02 pvcreate /dev/mapper/mpathc1
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 91
  WARNING: Removing PV /dev/mapper/mpathc1 (dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb) that no longer belongs to VG helter_skelter
harding-02 vgextend helter_skelter /dev/mapper/mpathc1

Verify that each of the raid repairs finished successfully
Checking for leftover '-missing_0_0' or 'unknown devices'
Checking for PVs marked as missing (a-m)...
Checking for PVs marked as missing (a-m)...

Waiting until all mirror|raid volumes become fully syncd...
  1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Sleeping 15 sec

Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Stopping the io load (collie/xdoio) on mirror(s)
Unmounting gfs and removing mnt point on harding-02...
Unmounting gfs and removing mnt point on harding-03...
Deactivating and removing mirror(s)
Checking for left over mirror dm devices on
        ...harding-02
        ...harding-03

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2253