Bug 1628529 - upconvert attempt after secondary mirror leg failure causes clvmd to hang and gfs to withdraw
Summary: upconvert attempt after secondary mirror leg failure causes clvmd to hang and...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Zdenek Kabelac
QA Contact: cluster-qe@redhat.com
Docs Contact: Marek Suchánek
URL:
Whiteboard:
Depends On:
Blocks: 1638084
 
Reported: 2018-09-13 11:18 UTC by Roman Bednář
Modified: 2021-09-03 12:50 UTC
CC List: 16 users

Fixed In Version: lvm2-2.02.185-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Previously, when a secondary mirror leg failed while the mirror fault policy was set to automatically allocate a replacement, clustered LVM caused the GFS file system to withdraw. As a consequence, all commands using the lvm utility became unresponsive. With this update, the problem has been fixed.
Clone Of:
: 1638084 (view as bug list)
Environment:
Last Closed: 2019-08-06 13:10:41 UTC
Target Upstream Version:
Embargoed:


Attachments
vvvv_logs (9.62 KB, application/zip)
2018-09-13 11:18 UTC, Roman Bednář
no flags Details
log of test run with "remove" policy (31.11 KB, text/plain)
2018-09-19 21:56 UTC, Corey Marthaler
no flags Details
verbose lvconvert upconvert (191.74 KB, text/plain)
2018-09-19 22:22 UTC, Corey Marthaler
no flags Details
requested dm information (29.23 KB, text/plain)
2018-09-20 18:21 UTC, Corey Marthaler
no flags Details
working lvconvert (251.71 KB, text/plain)
2018-09-24 15:30 UTC, David Teigland
no flags Details
lvconvert output (122.80 KB, text/plain)
2018-09-26 14:26 UTC, David Teigland
no flags Details
lvconvert output (120.17 KB, text/plain)
2018-09-26 14:45 UTC, David Teigland
no flags Details
clvmd output node 1 (215.82 KB, text/plain)
2018-09-26 14:46 UTC, David Teigland
no flags Details
clvmd output node 2 (46.15 KB, text/plain)
2018-09-26 14:47 UTC, David Teigland
no flags Details
clvmd output node 3 (46.15 KB, text/plain)
2018-09-26 14:47 UTC, David Teigland
no flags Details
Fixing clustered mirror patch (6.00 KB, text/plain)
2018-10-10 15:10 UTC, Zdenek Kabelac
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2253 0 None None None 2019-08-06 13:11:12 UTC

Description Roman Bednář 2018-09-13 11:18:35 UTC
Created attachment 1483035 [details]
vvvv_logs

Scenario: a leg failure triggers replacement using the "allocate" fault policy, after which the test checks for data corruption on GFS2 but gets an I/O error. See the scenario log below for details.

After that, any lvm command seems to hang (verbose vgs/lvs output attached) unless the cluster is stopped (clustered VGs are then left out).

I tried rebooting the node and rejoining the cluster, but clvmd times out and the node gets fenced.

Sep 13 12:39:39 virt-379 pengine[14778]: warning: Processing failed op stop for clvmd:0 on virt-378: unknown error (1)
Sep 13 12:39:39 virt-379 pengine[14778]: warning: Processing failed op stop for clvmd:0 on virt-378: unknown error (1)
Sep 13 12:39:39 virt-379 pengine[14778]: warning: Cluster node virt-378 will be fenced: clvmd:0 failed there
Sep 13 12:39:39 virt-379 pengine[14778]: warning: Forcing clvmd-clone away from virt-378 after 1000000 failures (max=1000000)
Sep 13 12:39:39 virt-379 pengine[14778]: warning: Forcing clvmd-clone away from virt-378 after 1000000 failures (max=1000000)
Sep 13 12:39:39 virt-379 pengine[14778]: warning: Scheduling Node virt-378 for STONITH

The "unknown error" on clvmd does not really help much, but that's about all I could find. It's a reliable reproducer in case further logs are requested.
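
For reference, a condensed sketch of the reproduction steps, pieced together from the scenario log below (device, VG, and mount-point names are the ones used in this run; disabling the device via sysfs mirrors the "echo running" re-enable the harness uses later):

  lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/sdc1:0-2400 /dev/sdh1:0-2400 /dev/sdd1:0-2400 /dev/sdj1:0-150
  mkfs.gfs2 -o align=0 -J 32M -j 2 -p lock_dlm -t STSRHTS29282:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O
  mount /dev/helter_skelter/syncd_secondary_3legs_1 /mnt/syncd_secondary_3legs_1    # on both nodes
  # fail the secondary leg (sdh) and force I/O so dmeventd triggers the "allocate" repair
  echo offline > /sys/block/sdh/device/state
  dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
  # after the automatic up-convert, gfs2 withdraws and lvm commands hang:
  lvs -a -o +devices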

Regression - test passing with 7.5:

https://beaker.cluster-qe.lab.eng.brq.redhat.com/bkr/jobs/86987
http://petan-01.cluster-qe.lab.eng.brq.redhat.com/beaker/logs/tasks/734+/734150/TESTOUT.log


===================================
# pcs status
Cluster name: STSRHTS29282
Stack: corosync
Current DC: virt-379 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Thu Sep 13 12:22:00 2018
Last change: Wed Sep 12 14:17:59 2018 by root via cibadmin on virt-378

2 nodes configured
6 resources configured

Online: [ virt-378 virt-379 ]

Full list of resources:

 fence-virt-378	(stonith:fence_xvm):	Started virt-379
 fence-virt-379	(stonith:fence_xvm):	Started virt-378
 Clone Set: dlm-clone [dlm]
     Started: [ virt-378 virt-379 ]
 Clone Set: clvmd-clone [clvmd]
     clvmd	(ocf::heartbeat:clvm):	FAILED virt-378
     Started: [ virt-379 ]

Failed Actions:
* clvmd_start_0 on virt-378 'unknown error' (1): call=28, status=Timed Out, exitreason='',
    last-rc-change='Thu Sep 13 12:19:46 2018', queued=0ms, exec=90007ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

===================================

 Scenario kill_secondary_synced_3_legs: Kill secondary leg of synced 3 leg mirror(s)
 ********* Mirror hash info for this scenario *********
 * names:              syncd_secondary_3legs_1
 * sync:               1
 * striped:            0
 * leg devices:        /dev/sdc1 /dev/sdh1 /dev/sdd1
 * log devices:        /dev/sdj1
 * no MDA devices:
 * failpv(s):          /dev/sdh1
 * failnode(s):        virt-378 virt-379
 * lvmetad:            0
 * leg fault policy:   allocate
 * log fault policy:   allocate
 ******************************************************

 Creating mirror(s) on virt-378...
 virt-378: lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/sdc1:0-2400 /dev/sdh1:0-2400 /dev/sdd1:0-2400 /dev/sdj1:0-150

 Current mirror/raid device structure(s):
   LV                                 Attr       LSize   Cpy%Sync Devices
    syncd_secondary_3legs_1            mwi-a-m--- 500.00m 5.60     syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0)
    [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sdc1(0)
    [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdh1(0)
    [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m          /dev/sdd1(0)
    [syncd_secondary_3legs_1_mlog]     lwi-aom---   4.00m          /dev/sdj1(0)
    root                               -wi-ao----  <6.20g          /dev/vda2(205)
    swap                               -wi-ao---- 820.00m          /dev/vda2(0)



 Waiting until all mirror|raid volumes become fully syncd...
    1/1 mirror(s) are fully synced: ( 100.00% )
 Sleeping 15 sec
 Sleeping 15 sec

 Creating gfs2 on top of mirror(s) on virt-378...
 mkfs.gfs2 -o align=0 -J 32M -j 2 -p lock_dlm -t STSRHTS29282:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O
 Mounting mirrored gfs2 filesystems on virt-378...
 Mounting mirrored gfs2 filesystems on virt-379...

 PV=/dev/sdh1
 	syncd_secondary_3legs_1_mimage_1: 4.1
 PV=/dev/sdh1
 	syncd_secondary_3legs_1_mimage_1: 4.1
 Writing verification files (checkit) to mirror(s) on...
 	---- virt-378 ----

 	---- virt-379 ----


 <start name="virt-378_syncd_secondary_3legs_1"  pid="15437" time="Wed Sep 12 14:19:58 2018 +0200" type="cmd" />
 <start name="virt-379_syncd_secondary_3legs_1"  pid="15438" time="Wed Sep 12 14:19:58 2018 +0200" type="cmd" />
 Sleeping 15 seconds to get some outsanding I/O locks before the failure
 Verifying files (checkit) on mirror(s) on...
 	---- virt-378 ----

 	---- virt-379 ----

 Name             GrpID RgID ObjType RgStart RgSize  #Areas ArSize  ProgID
 virt-378_load        0  0-1 group    96.65m 980.00k      1 980.00k dmstats

 Disabling device sdh on virt-378rescan device...

 Disabling device sdg on virt-379rescan device...


 Attempting I/O to cause mirror down conversion(s) on virt-378
 dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
 10+0 records in
 10+0 records out
 41943040 bytes (42 MB) copied, 0.0890541 s, 471 MB/s
 Verifying current sanity of lvm after the failure

 Current mirror/raid device structure(s):
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
   LV                                    Attr       LSize   Cpy%Sync Devices
    syncd_secondary_3legs_1               cwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0)
    [syncd_secondary_3legs_1_mimage_0]    iwi-aom--- 500.00m          /dev/sdc1(0)
    [syncd_secondary_3legs_1_mimage_2]    iwi-aom--- 500.00m          /dev/sdd1(0)
    [syncd_secondary_3legs_1_mimage_3]    iwi-aom--- 500.00m          /dev/sde1(0)
    [syncd_secondary_3legs_1_mimagetmp_3] mwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)
    [syncd_secondary_3legs_1_mlog]        lwi-aom---   4.00m          /dev/sdj1(0)
    root                                  -wi-ao----  <6.20g          /dev/vda2(205)
    swap                                  -wi-ao---- 820.00m          /dev/vda2(0)


 Verifying FAILED device /dev/sdh1 is *NOT* in the volume(s)
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
 olog: 1
 Verifying LOG device(s) /dev/sdj1 *ARE* in the mirror(s)
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
 Verifying LEG device /dev/sdc1 *IS* in the volume(s)
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
 Verifying LEG device /dev/sdd1 *IS* in the volume(s)
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
 verify the dm devices associated with /dev/sdh1 have been removed as expected
 Checking REMOVAL of syncd_secondary_3legs_1_mimage_1 on:  virt-378 virt-379
 verify the newly allocated dm devices were added as a result of the failures
 Checking EXISTENCE of syncd_secondary_3legs_1_mimage_3 on:  virt-378 virt-379

 Verify that the mirror image order remains the same after the down conversion
 EXPECTED LEG ORDER: /dev/sdc1 /dev/sdd1 unknown
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
   /dev/sdh: open failed: No such device or address
   /dev/sdh1: open failed: No such device or address
   Couldn't find device with uuid vCLnuy-Fb5g-EFRo-6mSx-SSYY-he4N-VcTNxG.
 ACTUAL LEG ORDER: /dev/sdc1 /dev/sdd1 /dev/sde1
 Verifying files (checkit) on mirror(s) on...
 	---- virt-378 ----

 	---- virt-379 ----
 Can not open xlvhfsasohqtnhgxnucijmcbxgkgxlhxwdffyixnrojfvqdlnh: Input/output error
 checkit write verify failed
 FI_engine: recover() method failed
 <stop name="virt-379_syncd_secondary_3legs_1"  pid="15438" time="Wed Sep 12 14:20:46 2018 +0200" type="cmd" duration="48" signal="2" />




corosync-2.4.3-2.el7_5.1.x86_64
kernel-3.10.0-862.11.6.el7.x86_64
lvm2-2.02.180-8.el7.x86_64
pacemaker-1.1.18-11.el7_5.3.x86_64

Comment 3 David Teigland 2018-09-19 14:09:31 UTC
It's not at all clear what's wrong, or that there's anything wrong, from an lvm perspective.  Could you please highlight a specific lvm command that is failing, and provide the debugging output from it.  Or, the specific output you expect to see from a command and the different output you're seeing instead (along with the debugging output from the command you think is printing incorrect results.)  Also, I don't understand the description of what's being tested.

Comment 4 Heinz Mauelshagen 2018-09-19 14:22:44 UTC
Can you add lvmdump output for each of the involved nodes, please?

Is there any kernel log on either node showing some issues
wrt the mirror repair?

Comment 5 Corey Marthaler 2018-09-19 20:07:13 UTC
First a few data points. 

1. This exact test case (fail second mirror image with auto allocation) passes in exclusive activation lvmlockd mode.

2. A test case failing the primary leg (also with auto allocation) instead of the secondary mirror leg passes in clvmd mode.

3. It appears that in this scenario, the failing secondary leg and the subsequent auto-allocation of a new image cause GFS to withdraw and lvm commands to hang/deadlock.


### A few notes inline about what the test is checking and expects:

mckinley-01: pvcreate /dev/mapper/mpatha1 /dev/mapper/mpathb1 /dev/mapper/mpathc1 /dev/mapper/mpathd1 /dev/mapper/mpathe1 /dev/mapper/mpathf1 /dev/mapper/mpathg1
mckinley-01: vgcreate   helter_skelter /dev/mapper/mpatha1 /dev/mapper/mpathb1 /dev/mapper/mpathc1 /dev/mapper/mpathd1 /dev/mapper/mpathe1 /dev/mapper/mpathf1 /dev/mapper/mpathg1

Enabling mirror allocate fault policies on: mckinley-01 mckinley-02 mckinley-03
================================================================================
Iteration 0.1 started at Wed Sep 19 14:39:47 CDT 2018
================================================================================
Scenario kill_secondary_synced_3_legs: Kill secondary leg of synced 3 leg mirror(s)
********* Mirror hash info for this scenario *********
* names:              syncd_secondary_3legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/mapper/mpathd1 /dev/mapper/mpathc1 /dev/mapper/mpathf1
* log devices:        /dev/mapper/mpathe1
* no MDA devices:     
* failpv(s):          /dev/mapper/mpathc1
* failnode(s):        mckinley-01 mckinley-02 mckinley-03
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on mckinley-03...
mckinley-03: lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/mapper/mpathd1:0-2400 /dev/mapper/mpathc1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathe1:0-150
WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_3legs_1 at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.

Current mirror/raid device structure(s):
  LV                                 Attr       LSize   Cpy%Sync Devices
   syncd_secondary_3legs_1            mwi-a-m--- 500.00m 10.40    syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0)
   [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/mapper/mpathd1(0)
   [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/mapper/mpathc1(0)
   [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m          /dev/mapper/mpathf1(0)
   [syncd_secondary_3legs_1_mlog]     lwi-aom---   4.00m          /dev/mapper/mpathe1(0)

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Sleeping 15 sec
Sleeping 15 sec

Creating gfs2 on top of mirror(s) on mckinley-01...
mkfs.gfs2 -o align=0 -J 32M -j 3 -p lock_dlm -t MCKINLEY:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O
Warning: device is not properly aligned. This may harm performance.
Mounting mirrored gfs2 filesystems on mckinley-01...
Mounting mirrored gfs2 filesystems on mckinley-02...
Mounting mirrored gfs2 filesystems on mckinley-03...

PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1
PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1

Writing verification files (checkit) to mirror(s) on...
        ---- mckinley-01 ----
        ---- mckinley-02 ----
        ---- mckinley-03 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- mckinley-01 ----
        ---- mckinley-02 ----
        ---- mckinley-03 ----


### Failing the underlying devices associated w/ the secondary image /dev/mapper/mpathc1 on each of the three nodes in the cluster
Failing path sdd on mckinley-01
Failing path sdl on mckinley-01
Failing path sdt on mckinley-01
Failing path sdab on mckinley-01
Failing path sde on mckinley-02
Failing path sdm on mckinley-02
Failing path sdu on mckinley-02
Failing path sdac on mckinley-02
Failing path sde on mckinley-03
Failing path sdm on mckinley-03
Failing path sdu on mckinley-03
Failing path sdac on mckinley-03

Attempting I/O to cause mirror down conversion(s) on mckinley-01
dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.152766 s, 275 MB/s
Verifying current sanity of lvm after the failure


### Already, the failed image /dev/mapper/mpathc1 has been removed, healthy /dev/mapper/mpathf1 has moved to the secondary position, and /dev/mapper/mpatha1 has been added as the new third image. I'm not sure what the syncd_secondary_3legs_1_mimagetmp_3 device is all about now, though.
Current mirror/raid device structure(s):
  Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx.
  LV                                    Attr       LSize   Cpy%Sync Devices
   syncd_secondary_3legs_1               cwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0)
   [syncd_secondary_3legs_1_mimage_0]    iwi-aom--- 500.00m          /dev/mapper/mpathd1(0)
   [syncd_secondary_3legs_1_mimage_2]    iwi-aom--- 500.00m          /dev/mapper/mpathf1(0)
   [syncd_secondary_3legs_1_mimage_3]    iwi-aom--- 500.00m          /dev/mapper/mpatha1(0)
   [syncd_secondary_3legs_1_mimagetmp_3] mwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)   
   [syncd_secondary_3legs_1_mlog]        lwi-aom---   4.00m          /dev/mapper/mpathe1(0)


### The test is happy with the device placements, basically that the failed device/image is gone and a new device/image is in its place
Verifying FAILED device /dev/mapper/mpathc1 is *NOT* in the volume(s)
olog: 1
Verifying LOG device(s) /dev/mapper/mpathe1 *ARE* in the mirror(s)
Verifying LEG device /dev/mapper/mpathd1 *IS* in the volume(s)
Verifying LEG device /dev/mapper/mpathf1 *IS* in the volume(s)

verify the dm devices associated with /dev/mapper/mpathc1 have been removed as expected
Checking REMOVAL of syncd_secondary_3legs_1_mimage_1 on:  mckinley-01 mckinley-02 mckinley-03
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_secondary_3legs_1_mimage_3 on:  mckinley-01 mckinley-02 mckinley-03

Verify that the mirror image order remains the same after the down conversion
EXPECTED LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathf1 unknown
ACTUAL LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathf1 /dev/mapper/mpatha1

### Now, when the test goes to verify the filesystem integrity, it finds the data is gone!
### These errors mean the fs is no longer accessible (either force unmounted, or now corrupt)
Verifying files (checkit) on mirror(s) on...
        ---- mckinley-01 ----
        ---- mckinley-02 ----
Can not open wlrqyfaxxtlgdthwxklbphbh: Input/output error
checkit write verify failed
[root@mckinley-01 ~]# ls /mnt/syncd_secondary_3legs_1
ls: cannot access /mnt/syncd_secondary_3legs_1: Input/output error

### This never returns, lvm cmds are now deadlocked
[root@mckinley-01 ~]# lvs -a -o +devices
[DEADLOCK]


Sep 19 14:41:55 mckinley-01 lvm[3062]: Device failure in helter_skelter-syncd_secondary_3legs_1.
Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Disabling lvmetad cache for repair command.
Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Not using lvmetad because of repair.
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 331679
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 331823
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 334359
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 334583
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 330663
Sep 19 14:41:55 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0
Sep 19 14:41:55 mckinley-01 lvm[3062]: Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx.
Sep 19 14:41:55 mckinley-01 lvm[3062]: WARNING: Couldn't find all devices for LV helter_skelter/syncd_secondary_3legs_1_mimage_1 while checking used and assumed devices.
Sep 19 14:41:55 mckinley-01 qarshd[12914]: Running cmdline: dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
Sep 19 14:41:55 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48406).
Sep 19 14:41:55 mckinley-01 qarshd[12966]: Talking to peer ::ffff:10.15.80.218:48406 (IPv6)
Sep 19 14:41:55 mckinley-01 multipathd: dm-21: remove map (uevent)
Sep 19 14:41:55 mckinley-01 multipathd: dm-21: devmap not registered, can't remove
Sep 19 14:41:55 mckinley-01 multipathd: dm-21: remove map (uevent)
Sep 19 14:41:55 mckinley-01 multipathd: dm-24: remove map (uevent)
Sep 19 14:41:55 mckinley-01 multipathd: dm-24: devmap not registered, can't remove
Sep 19 14:41:55 mckinley-01 multipathd: dm-24: remove map (uevent)
Sep 19 14:41:55 mckinley-01 qarshd[12966]: Running cmdline: sync
Sep 19 14:41:56 mckinley-01 multipathd: sdl: unusable path
Sep 19 14:41:56 mckinley-01 multipathd: sdt: unusable path
Sep 19 14:41:56 mckinley-01 multipathd: sdab: unusable path
Sep 19 14:41:58 mckinley-01 lvm[3062]: Mirror status: 1 of 3 images failed.
Sep 19 14:41:58 mckinley-01 lvm[3062]: WARNING: Trying to up-convert to 3 images, 1 logs.
Sep 19 14:41:59 mckinley-01 systemd: Started LVM2 poll daemon.
Sep 19 14:41:59 mckinley-01 lvm[3062]: helter_skelter/syncd_secondary_3legs_1: Converted: 4.00%
Sep 19 14:41:59 mckinley-01 lvmpolld: W: #011LVPOLL: PID 13076: STDERR: '  Couldn't find device with uuid EJ2ttC-Mw13-H8PD-L72C-5lZs-LMTQ-LuZeBx.'
Sep 19 14:42:00 mckinley-01 multipathd: sdd: unusable path
Sep 19 14:42:01 mckinley-01 multipathd: sdl: unusable path
Sep 19 14:42:01 mckinley-01 multipathd: sdt: unusable path
Sep 19 14:42:01 mckinley-01 multipathd: sdab: unusable path
Sep 19 14:42:03 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48450).
Sep 19 14:42:03 mckinley-01 qarshd[13091]: Talking to peer ::ffff:10.15.80.218:48450 (IPv6)
Sep 19 14:42:03 mckinley-01 qarshd[13091]: Running cmdline: dmsetup ls | grep syncd_secondary_3legs_1_mimage_1
Sep 19 14:42:04 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48456).
Sep 19 14:42:04 mckinley-01 qarshd[13097]: Talking to peer ::ffff:10.15.80.218:48456 (IPv6)
Sep 19 14:42:04 mckinley-01 qarshd[13097]: Running cmdline: dmsetup ls | grep syncd_secondary_3legs_1_mimage_3
Sep 19 14:42:05 mckinley-01 multipathd: sdd: unusable path
Sep 19 14:42:06 mckinley-01 multipathd: sdl: unusable path
Sep 19 14:42:06 mckinley-01 multipathd: sdt: unusable path
Sep 19 14:42:06 mckinley-01 multipathd: sdab: unusable path
Sep 19 14:42:10 mckinley-01 multipathd: sdd: unusable path
Sep 19 14:42:11 mckinley-01 multipathd: sdl: unusable path
Sep 19 14:42:11 mckinley-01 multipathd: sdt: unusable path
Sep 19 14:42:11 mckinley-01 multipathd: sdab: unusable path
Sep 19 14:42:11 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:48918).
Sep 19 14:42:11 mckinley-01 qarshd[13134]: Talking to peer ::ffff:10.15.80.218:48918 (IPv6)
Sep 19 14:42:11 mckinley-01 qarshd[13134]: Running cmdline: /usr/tests/sts-rhel7.6/bin/checkit -w /mnt/syncd_secondary_3legs_1/checkit -f /tmp/checkit_syncd_secondary_3legs_1 -v
Sep 19 14:42:14 mckinley-01 lvm[3062]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00%
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: 25 callbacks suppressed
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 0
Sep 19 14:42:14 mckinley-01 kernel: blk_update_request: I/O error, dev dm-6, sector 40
Sep 19 14:42:14 mckinley-01 kernel: device-mapper: raid1: Primary mirror (253:21) failed while out-of-sync: Reads may fail.
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: error -5: withdrawing the file system to prevent further damage.
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0
Sep 19 14:42:14 mckinley-01 kernel: VFS:Filesystem freeze failed
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: about to withdraw this file system
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: telling LM to unmount
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: recover generation 5 done
Sep 19 14:42:14 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: withdrawn
Sep 19 14:42:14 mckinley-01 kernel: CPU: 7 PID: 12704 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1
Sep 19 14:42:14 mckinley-01 kernel: Hardware name: Dell Inc. PowerEdge R820/0RN9TC, BIOS 2.0.20 01/16/2014
Sep 19 14:42:14 mckinley-01 kernel: Call Trace:
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffaa361ca8>] dump_stack+0x19/0x1b
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08ff206>] gfs2_lm_withdraw+0x146/0x180 [gfs2]
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08e3e15>] gfs2_logd+0x315/0x340 [gfs2]
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc2d60>] ? wake_up_atomic_t+0x30/0x30
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffc08e3b00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2]
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1c91>] kthread+0xd1/0xe0
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1bc0>] ? insert_kthread_work+0x40/0x40
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffaa374c37>] ret_from_fork_nospec_begin+0x21/0x21
Sep 19 14:42:14 mckinley-01 kernel: [<ffffffffa9cc1bc0>] ? insert_kthread_work+0x40/0x40
Sep 19 14:42:15 mckinley-01 multipathd: sdd: unusable path
Sep 19 14:42:16 mckinley-01 multipathd: sdl: unusable path
Sep 19 14:42:16 mckinley-01 multipathd: sdt: unusable path


### In my reproduction, pcs seems to think the cluster is still fine:
[root@mckinley-01 ~]# pcs status
Cluster name: MCKINLEY
Stack: corosync
Current DC: mckinley-02 (version 1.1.19-7.el7-c3c624ea3d) - partition with quorum
Last updated: Wed Sep 19 14:55:21 2018
Last change: Wed Sep 19 13:35:51 2018 by root via cibadmin on mckinley-01

3 nodes configured
7 resources configured

Online: [ mckinley-01 mckinley-02 mckinley-03 ]

Full list of resources:

 mckinley-apc   (stonith:fence_apc):    Started mckinley-01
 Clone Set: dlm-clone [dlm]
     Started: [ mckinley-01 mckinley-02 mckinley-03 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ mckinley-01 mckinley-02 mckinley-03 ]

Comment 6 Corey Marthaler 2018-09-19 20:21:40 UTC
This seems eerily familiar to closed bugs 1039218 and 1038870 and may have something to do with previously run cmirror device failure allocation scenarios not fully cleaning up after themselves. However, if that's the case, what changed for us to have to deal with this again after 4+ years...

Comment 7 David Teigland 2018-09-19 21:00:28 UTC
"lvm[3062]" -- what process is this?  Is it one of the automatically run lvconvert --repair commands that operates so secretively?  If so we need to know exactly what that thing is doing.  Any kind of automatically run repair command needs to leave very obvious and verbose logs about exactly what it's doing.  Will it leave a full debug log of itself by setting lvm.conf log level and file?

Also, I'm surprised to see no trace of cmirror anywhere yet.  Shouldn't that be leaving some evidence of itself somewhere?
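
(For reference, the lvm.conf log settings that question refers to look roughly like this - the file path is just an example:)

  log {
      file = "/var/log/lvm2-debug.log"
      level = 7        # 7 is the most verbose; 0 disables file logging
      verbose = 0
  }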

Comment 8 Corey Marthaler 2018-09-19 21:25:57 UTC
Another data point: this test scenario passes in single-machine mode without clvmd/cmirror. The syncd_secondary_3legs_1_mimagetmp_3 image is present in this passing case too, so I don't think this is related to the old bugs 1039218 and 1038870 mentioned above after all.

Current mirror/raid device structure(s):
  LV                                 Attr       LSize   Cpy%Sync Devices                                                                                                    
   syncd_secondary_3legs_1            mwi-a-m--- 500.00m 7.20     syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0)
   [syncd_secondary_3legs_1_mimage_0] Iwi-aom--- 500.00m          /dev/sde1(0)                                                                                               
   [syncd_secondary_3legs_1_mimage_1] Iwi-aom--- 500.00m          /dev/sdi1(0)                                                                                               
   [syncd_secondary_3legs_1_mimage_2] Iwi-aom--- 500.00m          /dev/sdm1(0)                                                                                               
   [syncd_secondary_3legs_1_mlog]     lwi-aom---   4.00m          /dev/sdb1(0)                                                                                               

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec

Creating xfs on top of mirror(s) on hayes-02...
Mounting mirrored xfs filesystems on hayes-02...

PV=/dev/sdi1
        syncd_secondary_3legs_1_mimage_1: 4.1
PV=/dev/sdi1
        syncd_secondary_3legs_1_mimage_1: 4.1

Writing verification files (checkit) to mirror(s) on...
        ---- hayes-02 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- hayes-02 ----
Disabling device sdi on hayes-02rescan device...

Attempting I/O to cause mirror down conversion(s) on hayes-02
dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.0233912 s, 1.8 GB/s
Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  WARNING: Not using lvmetad because a repair command was run.
  /dev/sdi: open failed: No such device or address
  /dev/sdi1: open failed: No such device or address
  Couldn't find device with uuid YbtcAp-Z0tv-qBWL-GY2e-60LI-Yb3f-m1evQY.
  LV                                    Attr       LSize   Cpy%Sync Devices                                                                   
   syncd_secondary_3legs_1               cwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0)
   [syncd_secondary_3legs_1_mimage_0]    iwi-aom--- 500.00m          /dev/sde1(0)                                                              
   [syncd_secondary_3legs_1_mimage_2]    iwi-aom--- 500.00m          /dev/sdm1(0)                                                              
   [syncd_secondary_3legs_1_mimage_3]    iwi-aom--- 500.00m          /dev/sdo1(0)                                                              
   [syncd_secondary_3legs_1_mimagetmp_3] mwi-aom--- 500.00m 100.00   syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)   
   [syncd_secondary_3legs_1_mlog]        lwi-aom---   4.00m          /dev/sdb1(0)                                                              

Sep 19 16:06:33 hayes-02 lvm[10402]: Mirror status: 1 of 3 images failed.
Sep 19 16:06:33 hayes-02 lvm[10402]: WARNING: Trying to up-convert to 3 images, 1 logs.
Sep 19 16:06:33 hayes-02 systemd: Started LVM2 poll daemon.
Sep 19 16:06:33 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 0.80%
Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  WARNING: Not using lvmetad because a repair command was run.'
Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  /dev/sdi: open failed: No such device or address'
Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  /dev/sdi1: open failed: No such device or address'
Sep 19 16:06:33 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  Couldn't find device with uuid YbtcAp-Z0tv-qBWL-GY2e-60LI-Yb3f-m1evQY.'
Sep 19 16:06:39 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45410).
Sep 19 16:06:39 hayes-02 qarshd[23013]: Talking to peer ::ffff:10.15.80.218:45410 (IPv6)
Sep 19 16:06:39 hayes-02 qarshd[23013]: Running cmdline: lvs > /dev/null 2>&1
Sep 19 16:06:39 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45412).
Sep 19 16:06:39 hayes-02 qarshd[23023]: Talking to peer ::ffff:10.15.80.218:45412 (IPv6)
[...]
Sep 19 16:06:48 hayes-02 qarshd[23087]: Running cmdline: /usr/tests/sts-rhel7.6/bin/checkit -w /mnt/syncd_secondary_3legs_1/checkit -f /tmp/checkit_syncd_secondary_3legs_1 -v
Sep 19 16:06:48 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00%
Sep 19 16:06:48 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  /dev/sdi: open failed: No such device or address'
Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45448).
Sep 19 16:06:49 hayes-02 qarshd[23106]: Talking to peer ::ffff:10.15.80.218:45448 (IPv6)
Sep 19 16:06:49 hayes-02 qarshd[23106]: Running cmdline: dmstats list --histogram
Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45450).
Sep 19 16:06:49 hayes-02 qarshd[23110]: Talking to peer ::ffff:10.15.80.218:45450 (IPv6)
Sep 19 16:06:49 hayes-02 qarshd[23110]: Running cmdline: dmstats list --group
Sep 19 16:06:49 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45452).
Sep 19 16:06:49 hayes-02 qarshd[23114]: Talking to peer ::ffff:10.15.80.218:45452 (IPv6)
Sep 19 16:06:50 hayes-02 qarshd[23114]: Running cmdline: dmstats delete  /dev/helter_skelter/syncd_secondary_3legs_1 --allregions
Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45454).
Sep 19 16:06:50 hayes-02 qarshd[23118]: Talking to peer ::ffff:10.15.80.218:45454 (IPv6)
Sep 19 16:06:50 hayes-02 qarshd[23118]: Running cmdline: echo running > /sys/block/sdi/device/state
Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45456).
Sep 19 16:06:50 hayes-02 qarshd[23122]: Talking to peer ::ffff:10.15.80.218:45456 (IPv6)
Sep 19 16:06:50 hayes-02 qarshd[23122]: Running cmdline: pvscan --cache /dev/sdi1
Sep 19 16:06:50 hayes-02 systemd: Started qarsh Per-Connection Server (10.15.80.218:45458).
Sep 19 16:06:50 hayes-02 qarshd[23127]: Talking to peer ::ffff:10.15.80.218:45458 (IPv6)
Sep 19 16:06:50 hayes-02 qarshd[23127]: Running cmdline: vgs
Sep 19 16:06:56 hayes-02 lvm[10402]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 16:06:56 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  /dev/sdi1: open failed: No such device or address'
Sep 19 16:06:56 hayes-02 lvmpolld: W: #011LVPOLL: PID 23008: STDERR: '  WARNING: This metadata update is NOT backed up.'
Sep 19 16:06:56 hayes-02 lvm[10402]: helter_skelter-syncd_secondary_3legs_1 is now in-sync.
Sep 19 16:06:56 hayes-02 multipathd: dm-2: remove map (uevent)
Sep 19 16:06:56 hayes-02 multipathd: dm-2: devmap not registered, can't remove
Sep 19 16:06:56 hayes-02 multipathd: dm-2: remove map (uevent)
Sep 19 16:06:56 hayes-02 lvm[10402]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 16:07:04 hayes-02 lvm[10402]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 16:07:04 hayes-02 lvm[10402]: helter_skelter/syncd_secondary_3legs_1: Converted: 100.00%

Comment 9 Corey Marthaler 2018-09-19 21:51:32 UTC
Everything goes south after the upconvert is attempted - either by the user/test when the leg fault policy is "remove", or automatically by lvm/dmeventd when it is "allocate".

Comment 10 Corey Marthaler 2018-09-19 21:56:25 UTC
Created attachment 1484952 [details]
log of test run with "remove" policy

Comment 11 Corey Marthaler 2018-09-19 22:22:04 UTC
Created attachment 1484953 [details]
verbose lvconvert upconvert

# This is just before running an upconvert (long after the failure, the auto down-convert, the device being turned back on, and the vgextend). This is presumably also what lvm does automatically when in "allocate" policy.
[root@mckinley-01 ~]# lvs -a -o +devices
  LV                                 VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log                            Cpy%Sync Convert Devices
  syncd_secondary_3legs_1            helter_skelter   mwi-aom---  500.00m                                [syncd_secondary_3legs_1_mlog] 100.00           syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)
  [syncd_secondary_3legs_1_mimage_0] helter_skelter   iwi-aom---  500.00m                                                                                /dev/mapper/mpathg1(0)
  [syncd_secondary_3legs_1_mimage_2] helter_skelter   iwi-aom---  500.00m                                                                                /dev/mapper/mpathf1(0)
  [syncd_secondary_3legs_1_mlog]     helter_skelter   lwi-aom---    4.00m                                                                                /dev/mapper/mpathc1(0)


lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 helter_skelter/syncd_secondary_3legs_1 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpatha1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathc1:0-150

After this, everything hangs.

Comment 12 Corey Marthaler 2018-09-19 22:30:07 UTC
This is what is in the log as a direct result of the above lvconvert cmd being run:


Sep 19 17:16:16 mckinley-01 dmeventd[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: Error -5 writing to journal, jid=0
Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: error -5: withdrawing the file system to prevent further damage.
Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: about to withdraw this file system
Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: telling LM to unmount
Sep 19 17:16:17 mckinley-01 kernel: GFS2: fsid=MCKINLEY:gfs1.0: withdrawn
Sep 19 17:16:17 mckinley-01 corosync[2480]: [CPG   ] *** 0x55aaf14c3180 can't mcast to group  state:0, error:12
Sep 19 17:16:17 mckinley-01 kernel: CPU: 5 PID: 5856 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1
Sep 19 17:16:17 mckinley-01 kernel: Hardware name: Dell Inc. PowerEdge R820/0RN9TC, BIOS 2.0.20 01/16/2014
Sep 19 17:16:17 mckinley-01 kernel: Call Trace:
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffad161ca8>] dump_stack+0x19/0x1b
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc0797206>] gfs2_lm_withdraw+0x146/0x180 [gfs2]
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc077be15>] gfs2_logd+0x315/0x340 [gfs2]
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac2d60>] ? wake_up_atomic_t+0x30/0x30
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffc077bb00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2]
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1c91>] kthread+0xd1/0xe0
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1bc0>] ? insert_kthread_work+0x40/0x40
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffad174c37>] ret_from_fork_nospec_begin+0x21/0x21
Sep 19 17:16:17 mckinley-01 kernel: [<ffffffffacac1bc0>] ? insert_kthread_work+0x40/0x40
Sep 19 17:16:20 mckinley-01 lvm[2952]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 19 17:16:20 mckinley-01 lvm[2952]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 17:16:20 mckinley-01 systemd: Started LVM2 poll daemon.
Sep 19 17:16:24 mckinley-01 lvm[2952]: helter_skelter-syncd_secondary_3legs_1 is now in-sync.
Sep 19 17:16:35 mckinley-01 lvm[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 19 17:16:36 mckinley-01 dmeventd[2952]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 19 17:16:37 mckinley-01 qarshd[5940]: Sending child 5941 signal 2
Sep 19 17:16:43 mckinley-01 qarshd[5940]: Sending child 5941 signal 15
Sep 19 17:17:12 mckinley-01 systemd-logind: New session 3 of user root.
Sep 19 17:17:12 mckinley-01 systemd: Started Session 3 of user root.
Sep 19 17:17:35 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out
Sep 19 17:17:50 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: '  Error locking on node 1: Command timed out'
Sep 19 17:18:14 mckinley-01 systemd: Started qarsh Per-Connection Server (10.15.80.218:55888).
Sep 19 17:18:14 mckinley-01 qarshd[8035]: Talking to peer ::ffff:10.15.80.218:55888 (IPv6)
Sep 19 17:18:14 mckinley-01 qarshd[8035]: Sending file /tmp/lvconvert
Sep 19 17:18:19 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service stop-sigterm timed out. Killing.
Sep 19 17:18:50 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out
Sep 19 17:19:19 mckinley-01 systemd: Starting Cleanup of Temporary Directories...
Sep 19 17:19:19 mckinley-01 systemd: Started Cleanup of Temporary Directories.
Sep 19 17:19:49 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service still around after SIGKILL. Ignoring.
Sep 19 17:19:50 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out
Sep 19 17:20:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: '  Failed to lock logical volume helter_skelter/syncd_secondary_3legs_1.'
Sep 19 17:20:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: '  Error locking on node 1: Command timed out'
Sep 19 17:20:02 mckinley-01 kernel: INFO: task clvmd:2870 blocked for more than 120 seconds.
Sep 19 17:20:02 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 19 17:20:02 mckinley-01 kernel: clvmd           D ffff9c873eaba080     0  2870      1 0x00000080
Sep 19 17:20:02 mckinley-01 kernel: Call Trace:
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc44285>] sb_wait_write+0xb5/0xd0
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacac2d60>] ? wake_up_atomic_t+0x30/0x30
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc444b8>] freeze_super+0x68/0x130
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc812b5>] freeze_bdev+0x75/0xd0
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00a8b45>] __dm_suspend+0x105/0x240 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00ab27d>] dm_suspend+0xad/0xc0 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b1053>] dev_suspend+0x1c3/0x260 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b20ee>] ctl_ioctl+0x24e/0x550 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b0e90>] ? table_load+0x390/0x390 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffc00b23fe>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc56230>] do_vfs_ioctl+0x3a0/0x5a0
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacce1f0b>] ? unmerge_queues+0x1b/0x70
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacce02ce>] ? ipcget+0x14e/0x1e0
Sep 19 17:20:02 mckinley-01 kernel: [<ffffffffacc564d1>] SyS_ioctl+0xa1/0xc0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d15>] ? system_call_after_swapgs+0xa2/0x146
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174ddb>] system_call_fastpath+0x22/0x27
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146
Sep 19 17:20:03 mckinley-01 kernel: INFO: task xdoio:5944 blocked for more than 120 seconds.
Sep 19 17:20:03 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 19 17:20:03 mckinley-01 kernel: xdoio           D ffff9c977fb05140     0  5944      1 0x00000084
Sep 19 17:20:03 mckinley-01 kernel: Call Trace:
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc785dd>] ? __brelse+0x3d/0x50
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b70>] ? out_of_line_wait_on_atomic_t+0x100/0x100
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b81>] bit_wait+0x11/0x50
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad1656f7>] __wait_on_bit+0x67/0x90
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165b70>] ? out_of_line_wait_on_atomic_t+0x100/0x100
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad165861>] out_of_line_wait_on_bit+0x81/0xb0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacac2e20>] ? wake_bit_function+0x40/0x40
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0776c3e>] gfs2_glock_wait+0x3e/0x80 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0778720>] gfs2_glock_nq+0x250/0x420 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0796796>] gfs2_trans_begin+0xf6/0x1b0 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc079523f>] gfs2_dirty_inode+0x1bf/0x280 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc079526b>] ? gfs2_dirty_inode+0x1eb/0x280 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc6fedd>] __mark_inode_dirty+0x16d/0x270
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc5dfd9>] update_time+0x89/0xd0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffaca7956f>] ? gup_pud_range+0x13f/0x2a0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc5e0c0>] file_update_time+0xa0/0xf0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacbb8668>] __generic_file_aio_write+0x198/0x400
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacbb8929>] generic_file_aio_write+0x59/0xa0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0781130>] gfs2_file_aio_write+0xd0/0x160 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc406db>] do_sync_readv_writev+0x7b/0xd0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc4231e>] do_readv_writev+0xce/0x260
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc0781060>] ? gfs2_release+0x50/0x50 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc40580>] ? do_sync_read+0xe0/0xe0
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc408a4>] ? generic_file_llseek+0x24/0x30
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffc078156d>] ? gfs2_llseek+0x5d/0x120 [gfs2]
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc42545>] vfs_writev+0x35/0x60
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffacc426ff>] SyS_writev+0x7f/0x110
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174ddb>] system_call_fastpath+0x22/0x27
Sep 19 17:20:03 mckinley-01 kernel: [<ffffffffad174d21>] ? system_call_after_swapgs+0xae/0x146
Sep 19 17:21:00 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out
Sep 19 17:21:19 mckinley-01 systemd: qarshd.104.61:5016-10.15.80.218:54668.service stop-final-sigterm timed out. Killing.
Sep 19 17:22:00 mckinley-01 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 7520) has timed out
Sep 19 17:22:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: '  Attempt to drop cached metadata failed after reverted update for VG helter_skelter.'
Sep 19 17:22:00 mckinley-01 lvmpolld: W: #011LVPOLL: PID 7520: STDERR: '  Error locking on node 1: Command timed out'
Sep 19 17:22:03 mckinley-01 kernel: INFO: task clvmd:2870 blocked for more than 120 seconds.
Sep 19 17:22:03 mckinley-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 19 17:22:03 mckinley-01 kernel: clvmd           D ffff9c873eaba080     0  2870      1 0x00000080
Sep 19 17:22:03 mckinley-01 kernel: Call Trace:
Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffad167aa9>] schedule+0x29/0x70
Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc44285>] sb_wait_write+0xb5/0xd0
Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacac2d60>] ? wake_up_atomic_t+0x30/0x30
Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc444b8>] freeze_super+0x68/0x130
Sep 19 17:22:03 mckinley-01 kernel: [<ffffffffacc812b5>] freeze_bdev+0x75/0xd0

Comment 13 Heinz Mauelshagen 2018-09-20 12:00:07 UTC
Can we have "dmsetup ls --tree; dmsetup table; dmsetup status" output
from before and after the "lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 helter_skelter/syncd_secondary_3legs_1 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpatha1:0-2400 /dev/mapper/mpathf1:0-2400 /dev/mapper/mpathc1:0-150",
please?
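
(A sketch of one way to capture that; the output file names are arbitrary:)

  dmsetup ls --tree > /tmp/dm-before.txt; dmsetup table >> /tmp/dm-before.txt; dmsetup status >> /tmp/dm-before.txt
  # ... run the lvconvert from comment 11 ...
  dmsetup ls --tree > /tmp/dm-after.txt;  dmsetup table >> /tmp/dm-after.txt;  dmsetup status >> /tmp/dm-after.txt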

Comment 14 David Teigland 2018-09-20 15:33:23 UTC
The lvconvert in comment 11 is blocked because clvmd is not responding.
In comment 12, clvmd is blocked in the D state while suspending the device under gfs.
The suspend is blocked waiting for gfs to freeze.  The gfs freeze is blocked because gfs is in the middle of a withdraw due to I/O errors and/or the xdoio process that is in the middle of writing.

I don't think clvmd is supposed to be suspending devices.  We should get debug output from clvmd to see what it might be doing wrong.
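
(Not part of the original comment: one way to confirm where clvmd and the writer are stuck is to dump their kernel stacks on the affected node, for example:)

  cat /proc/$(pidof clvmd)/stack
  echo w > /proc/sysrq-trigger     # dumps all blocked (D-state) tasks to the kernel log
  dmesg | tail -n 200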

Comment 15 Zdenek Kabelac 2018-09-20 16:06:20 UTC
Looking at the 'flush' part of the problem here - there should be basically no flush in this particular case.

The flush could have been introduced by commit 672b8c196b91e4cf6567223ce44dfb902281b791, which fixed splitmirror preloading.

So it looks like the need for flushing is not properly determined in this case.  I'll try to provide a patch that detects when the flush must not happen in this preload case (a mirror leg in state 'D') and ensures the preload proceeds without it.

Not really sure it's the core issue - but it's definitely a bug, and it can probably block the cluster.
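
(To illustrate the flush vs. no-flush distinction at the device-mapper level - this is only an analogy for what the lvm preload code does internally, not the actual code path:)

  # the default suspend flushes queued I/O first; with a failed leg under a withdrawn gfs2 this can block indefinitely
  dmsetup suspend helter_skelter-syncd_secondary_3legs_1
  # suspending with --noflush skips waiting on outstanding I/O
  dmsetup suspend --noflush helter_skelter-syncd_secondary_3legs_1
  dmsetup resume helter_skelter-syncd_secondary_3legs_1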

Comment 16 Corey Marthaler 2018-09-20 18:21:17 UTC
Created attachment 1485256 [details]
requested dm information

FWIW, here's the dm info requested in comment #13.

Comment 17 David Teigland 2018-09-21 14:16:08 UTC
When things go bad, there are errors here in several layers (cluster, kernel, lvm, gfs) all at about the same time.  What's not yet clear is whether there is one initial issue that triggers everything else.

Sep 20 15:09:29 harding-02 qarshd[11771]: Running cmdline: vgextend helter_skelter /dev/mapper/mpathf1
Sep 20 15:11:11 harding-02 dmeventd[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:11:11 harding-02 kernel: device-mapper: raid1: Unable to read primary mirror during recovery
Sep 20 15:11:11 harding-02 kernel: device-mapper: raid1: Primary mirror (253:21) failed while out-of-sync: Reads may fail.
Sep 20 15:11:12 harding-02 kernel: GFS2: fsid=HARDING:gfs1.0: recover generation 3 done
Sep 20 15:11:15 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 20 15:11:15 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:11:15 harding-02 systemd: Started LVM2 poll daemon.
Sep 20 15:11:17 harding-02 lvm[5990]: Primary mirror device 253:21 sync failed.
Sep 20 15:11:17 harding-02 lvm[5990]: helter_skelter-syncd_secondary_3legs_1 is now in-sync.
Sep 20 15:11:30 harding-02 lvm[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 20 15:11:30 harding-02 dmeventd[5990]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:11:30 harding-02 cmirrord[7446]: Log entry already exists: LVM-4MhnZKHIY4FOEGVEtM2hszEiDk3YRGYsEi3SAeDOrLBJcV8FRuoZxnC4HYg03EtH
Sep 20 15:11:30 harding-02 cmirrord[7446]: clog_resume:  Failed to create cluster CPG
Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log presuspend failed
Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log presuspend failed
Sep 20 15:11:30 harding-02 kernel: device-mapper: raid1: log postsuspend failed
Sep 20 15:12:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out
Sep 20 15:13:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out
Sep 20 15:14:30 harding-02 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 12145) has timed out
Sep 20 15:14:30 harding-02 lvmpolld[12142]: LVMPOLLD: LVM2 cmd is unresponsive too long (PID 12145) (no output for 180 seconds)
Sep 20 15:14:45 harding-02 lvm[5990]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.


Sep 20 15:10:23 harding-03 crmd[11466]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Sep 20 15:10:23 harding-03 pengine[11465]:  notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-578.bz2
Sep 20 15:10:23 harding-03 crmd[11466]:  notice: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-578.bz2): Complete
Sep 20 15:10:23 harding-03 crmd[11466]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Sep 20 15:11:11 harding-03 dmeventd[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: Error -5 writing to journal, jid=1
Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: error -5: withdrawing the file system to prevent further damage.
Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: about to withdraw this file system
Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: telling LM to unmount
Sep 20 15:11:12 harding-03 kernel: GFS2: fsid=HARDING:gfs1.1: withdrawn
Sep 20 15:11:12 harding-03 kernel: CPU: 7 PID: 16487 Comm: gfs2_logd Kdump: loaded Not tainted 3.10.0-951.el7.x86_64 #1
Sep 20 15:11:12 harding-03 kernel: Hardware name: Dell Inc. PowerEdge R820/0YWR73, BIOS 1.5.0 03/08/2013
Sep 20 15:11:12 harding-03 kernel: Call Trace:
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa8361ca8>] dump_stack+0x19/0x1b
Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b3f206>] gfs2_lm_withdraw+0x146/0x180 [gfs2]
Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b23e15>] gfs2_logd+0x315/0x340 [gfs2]
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc2d60>] ? wake_up_atomic_t+0x30/0x30
Sep 20 15:11:12 harding-03 kernel: [<ffffffffc0b23b00>] ? gfs2_meta_syncfs+0x50/0x50 [gfs2]
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1c91>] kthread+0xd1/0xe0
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1bc0>] ? insert_kthread_work+0x40/0x40
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa8374c37>] ret_from_fork_nospec_begin+0x21/0x21
Sep 20 15:11:12 harding-03 kernel: [<ffffffffa7cc1bc0>] ? insert_kthread_work+0x40/0x40
Sep 20 15:11:15 harding-03 lvm[11964]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 20 15:11:15 harding-03 lvm[11964]: Monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:11:30 harding-03 lvm[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1_mimagetmp_3 for events.
Sep 20 15:11:30 harding-03 dmeventd[11964]: No longer monitoring mirror device helter_skelter-syncd_secondary_3legs_1 for events.
Sep 20 15:14:52 harding-03 kernel: INFO: task clvmd:14014 blocked for more than 120 seconds.

Comment 18 David Teigland 2018-09-21 14:34:38 UTC
Talking with Corey, it sounds like there's an lvm feature that automatically adds a mirror leg back into a mirrored LV after it has lost a leg.  It appears that this action is what causes everything to fall apart, but what is doing this action, how is it triggered, has it left any evidence of itself in the logs, and how can we get a handle on it to see exactly what it's doing?

Comment 19 Heinz Mauelshagen 2018-09-21 15:52:40 UTC
FWIW: the test passes on a single host, under load to an XFS file system, for both the "remove" and "allocate" mirror image fault policies.

Comment 20 Heinz Mauelshagen 2018-09-21 16:22:34 UTC
(In reply to David Teigland from comment #18)
> Talking with Corey, it sounds like there's an lvm feature that automatically
> adds a mirror leg back into a mirrored LV after it has lost a leg.  It
> appears that this action is what causes everything to fall apart, but what
> is doing this action, how is it triggered, has it left any evidence of
> itself in the logs, and how can we get a handle on it to see exactly what
> it's doing?

Dave,

when monitoring is enabled, failed "mirror" legs are either automatically removed or a replacement is allocated,
depending on whether the configuration option activation/mirror_image_fault_policy is set to "remove" or "allocate".

We have to think through the implications of comment #15.
Is the change referenced there what leads to this regression,
caused by bogus flushing?  Looking forward to Zdenek's patch...

Comment 21 David Teigland 2018-09-21 16:46:32 UTC
It looks like the problem is related to a hidden "lvconvert --repair" command that is being run by dmeventd, but which leaves basically no clear record of itself, of what it is doing, or of what might be wrong.  (A repair like this should leave a lot of data behind by default.)

To really debug this we should have the full -vvvv output from the lvconvert --repair command and clvmd, but it's not clear how to collect either of those in this environment.  I'll try to cobble together all these pieces on my own machines so I can run everything directly with full debugging.

(So many separate, complex moving parts all have to work together correctly that the compounding complexity looks incredibly fragile:
failing and replacing disks, dm-mirror, cmirrord, dmeventd, lvconvert --repair in the background, lvmpolld, clvmd, and corosync, all somehow working together simultaneously on three different nodes.)
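
(A sketch of one way to collect that output; the exact paths, levels and flags here are assumptions, not what was actually used:

    # lvm.conf: make every LVM command, including the dmeventd-spawned
    # repair (typically "lvconvert --repair --use-policies"), log to a file
    log {
        file = "/var/log/lvm2-debug.log"
        level = 7
        activation = 1
    }

    # run clvmd in the foreground with debug output on each node
    clvmd -f -d 1
)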

Comment 22 David Teigland 2018-09-21 19:41:11 UTC
I reproduced this manually on my own test machines so we should now have all the debugging that's available.  There's a lot of data from clvmd, cmirrord and the kernel on all the nodes.  There's a lot that looks bad, but it's all foreign to me so I can't say what could be an initial cause vs effects.

The basic sequence is:
- mirror LV with three legs active on three nodes
- kill the device under the middle leg on all three nodes
- dmeventd triggers recovery to remove that leg from the mirror
- later, manually add another leg back to the LV; this is where everything falls apart

In my test, the device I killed is brought back online and used to add the new leg again.  It's probably worth checking if adding a leg using a different device (not the one that failed) would have the same problem.
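
(A rough sketch of the setup and failure steps above; the VG/LV names, sizes and the method of killing the device are illustrative assumptions:

    # clustered mirror LV with three legs and a disk log
    lvcreate --type mirror -m 2 --mirrorlog disk -L 500m -n lvt dcl

    # fail the device backing one leg, on every node, e.g. by offlining
    # the underlying SCSI device
    echo offline > /sys/block/sdX/device/state

    # dmeventd then reacts per mirror_image_fault_policy and removes the
    # failed leg; the upconvert that triggers the hang is the lvconvert
    # command shown next.
)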

The problems start when this command is run from node1 to add a new mirror leg:
lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 dcl/lvt

the lvconvert command hangs here waiting for clvmd:
#locking/cluster_locking.c:511         Locking LV 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp PW (LV|NONBLOCK|CLUSTER) (0x9c)

and eventually times out:
#locking/cluster_locking.c:397     Error locking on node 2: Command timed out

clvmd on node1:
CLVMD[e10ff700]: Sep 21 18:27:59 do_lock_lv: resource '5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp', cmd = 0x9e LCK_LV
_RESUME (UNLOCK|LV|NONBLOCK|CLUSTER_VG), flags = 0x4 ( DMEVENTD_MONITOR ), critical_section = 1
CLVMD[e1232740]: Sep 21 18:27:59 1 got message from nodeid 1 for 0. len 84
#cache/lvmcache.c:203           lvmcache: no saved_vg for vgid "5F4GKTODs380YAx7e9GfesBNqaMkqR5P"
#cache/lvmcache.c:506           lvmcache: no saved vg latest 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp
#activate/activate.c:2493          Resuming LVID 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp did not find saved VG
#activate/activate.c:2501          Resuming LVID 5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp reading VG
...
#activate/dev_manager.c:2869          Adding new LV dcl/lvt to dtree
...
#libdm-deptree.c:2696      Loading table for dcl-lvt_mlog (253:3).
#libdm-deptree.c:2641          Adding target to (253:3): 0 8192 linear 8:80 2048
#ioctl/libdm-iface.c:1857          dm table   (253:3) [ opencount flush ]   [16384] (*1)
#libdm-deptree.c:2729      Suppressed dcl-lvt_mlog (253:3) identical table reload.
#libdm-deptree.c:2696      Loading table for dcl-lvt_mimage_0 (253:4).
#libdm-deptree.c:2641          Adding target to (253:4): 0 1024000 linear 8:17 2048
#ioctl/libdm-iface.c:1857          dm table   (253:4) [ opencount flush ]   [16384] (*1)
#libdm-deptree.c:2729      Suppressed dcl-lvt_mimage_0 (253:4) identical table reload.
#libdm-deptree.c:2696      Loading table for dcl-lvt_mimage_2 (253:6).
#libdm-deptree.c:2641          Adding target to (253:6): 0 1024000 linear 8:49 2048
#ioctl/libdm-iface.c:1857          dm table   (253:6) [ opencount flush ]   [16384] (*1)
#libdm-deptree.c:2729      Suppressed dcl-lvt_mimage_2 (253:6) identical table reload.
#libdm-deptree.c:2696      Loading table for dcl-lvt (253:7).
#libdm-deptree.c:2641          Adding target to (253:7): 0 1024000 mirror userspace 4 LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PzkTcLQzehPhzSB1hHZfQU8nZp4ZZ7JWN clustered-disk 253:3 4096 2 253:4 0 253:6 0 1 handle_errors
#ioctl/libdm-iface.c:1857          dm table   (253:7) [ opencount flush ]   [16384] (*1)
#ioctl/libdm-iface.c:1998    Internal error: Performing unsafe table load while 6 device(s) are known to be suspended:  (253:7)
#ioctl/libdm-iface.c:1857          dm reload   (253:7) [ noopencount flush ]   [16384] (*1)
#libdm-deptree.c:1302      Resuming dcl-lvt_mimage_3 (253:8).
#libdm-common.c:2325          Udev cookie 0xd4d020c (semid 425985) incremented to 3
#libdm-common.c:2575          Udev cookie 0xd4d020c (semid 425985) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK         (0x2e)
#ioctl/libdm-iface.c:1857          dm resume   (253:8) [ noopencount flush ]   [16384] (*1)
#libdm-common.c:1484          dcl-lvt_mimage_3: Stacking NODE_ADD (253,8) 0:6 0660 [trust_udev]
#libdm-common.c:1494          dcl-lvt_mimage_3: Stacking NODE_READ_AHEAD 0 (flags=0)
#libdm-common.c:288           Suspended device counter reduced to 5
#libdm-deptree.c:1302      Resuming dcl-lvt_mimagetmp_3 (253:5).
#libdm-common.c:2325          Udev cookie 0xd4d020c (semid 425985) incremented to 4
#libdm-common.c:2575          Udev cookie 0xd4d020c (semid 425985) assigned to RESUME task(5) with flags DISABLE_SUBSYSTEM_RULES DISABLE_DISK_RULES DISABLE_OTHER_RULES DISABLE_LIBRARY_FALLBACK         (0x2e)
#ioctl/libdm-iface.c:1857          dm resume   (253:5) [ noopencount flush ]   [16384] (*1)
CLVMD[e1232740]: Sep 21 18:29:00 Request to client 0x55ef9349ef30 timed-out (send: 1537572479, now: 1537572540)
CLVMD[e1232740]: Sep 21 18:29:00 (0x55ef9349ef30) Request timed-out. padding



On nodes 1 and 3, cmirrord is sitting still, but on node 2 it's spinning, repeatedly printing data like:
Data ready on 4
Issuing callback on local/4
[r6iV5sUp]  Request from kernel received: [DM_ULOG_IN_SYNC/144913]
[r6iV5sUp] Region is not in-sync: 0

kernel messages:

Sep 21 18:21:59 null-01 dmeventd[62194]: device-mapper: waitevent ioctl on  LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call
Sep 21 18:21:59 null-01 dmeventd[62194]: Unlocking memory
Sep 21 18:21:59 null-01 dmeventd[62194]: No longer monitoring mirror device dcl-lvt for events.
Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Read failure on mirror device 253:5.  Trying alternative device.
Sep 21 18:21:59 null-01 kernel: device-mapper: raid1: Mirror read failed.
Sep 21 18:22:59 null-01 systemd-udevd: worker [24271] /devices/virtual/block/dm-7 is taking a long time
Sep 21 18:24:59 null-01 systemd-udevd: worker [24271] /devices/virtual/block/dm-7 timeout; kill it
Sep 21 18:24:59 null-01 systemd-udevd: seq 3179 '/devices/virtual/block/dm-7' killed
Sep 21 18:27:57 null-01 kernel: INFO: task systemd-udevd:24271 blocked for more than 120 seconds.
...
Sep 21 18:28:15 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3438] - retrying
Sep 21 18:28:30 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3439] - retrying
Sep 21 18:28:45 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3440] - retrying
Sep 21 18:29:00 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3441] - retrying
Sep 21 18:29:15 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3442] - retrying
Sep 21 18:29:30 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3443] - retrying
Sep 21 18:29:45 null-01 kernel: device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/3444] - retrying


Sep 21 07:03:01 null-02 dmeventd[16703]: device-mapper: waitevent ioctl on  LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call
Sep 21 07:03:01 null-02 dmeventd[16703]: Unlocking memory
Sep 21 07:03:01 null-02 dmeventd[16703]: Device cache incomplete - not writing to /etc/lvm/cache/.cache
Sep 21 07:03:01 null-02 dmeventd[16703]: No longer monitoring mirror device dcl-lvt for events.
Sep 21 07:03:01 null-02 kernel: [1653475.201560] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail.
Sep 21 07:03:01 null-02 kernel: [1653475.211831] device-mapper: raid1: Read failure on mirror device 253:6.  Trying alternative device.
Sep 21 07:03:01 null-02 kernel: [1653475.222182] device-mapper: raid1: Mirror read failed from (null). Trying alternative device.
Sep 21 07:03:01 null-02 kernel: [1653475.231984] device-mapper: raid1: Mirror read failed from (null). Trying alternative device.
(the kernel spins printing these, repeating many thousands of times; it may be udev trying to read?)
Sep 21 07:08:44 null-02 kernel: [1653817.926216] device-mapper: raid1: Mirror read failed from (null). Trying alternative device.
Sep 21 07:08:50 null-02 kernel: [1653823.450292] INFO: task systemd-udevd:32267 blocked for more than 120 seconds.
Sep 21 07:08:59 null-02 kernel: [1653833.018474] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144916] - retrying
Sep 21 07:09:14 null-02 kernel: [1653848.027227] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144917] - retrying
Sep 21 07:09:29 null-02 kernel: [1653863.035980] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144918] - retrying
Sep 21 07:09:44 null-02 kernel: [1653878.044731] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144919] - retrying
Sep 21 07:09:59 null-02 kernel: [1653893.053482] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144920] - retrying
Sep 21 07:10:14 null-02 kernel: [1653908.061235] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144921] - retrying
Sep 21 07:10:29 null-02 kernel: [1653923.068986] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144922] - retrying
Sep 21 07:10:44 null-02 kernel: [1653938.076738] device-mapper: dm-log-userspace: [r6iV5sUp] Request timed out: [14/144923] - retrying


Sep 21 07:08:15 null-03 dmeventd[31812]: device-mapper: waitevent ioctl on  LVM-5F4GKTODs380YAx7e9GfesBNqaMkqR5PcQh3PEK2j6VI09mnN9eIuyv9r6iV5sUp failed: Interrupted system call
Sep 21 07:08:15 null-03 dmeventd[31812]: Unlocking memory
Sep 21 07:08:15 null-03 dmeventd[31812]: No longer monitoring mirror device dcl-lvt for events.
Sep 21 07:08:15 null-03 kernel: [1653447.996238] device-mapper: raid1: Unable to read primary mirror during recovery
Sep 21 07:08:15 null-03 kernel: [1653448.004306] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
Sep 21 07:08:15 null-03 kernel: [1653448.014076] device-mapper: raid1: Mirror read failed.
Sep 21 07:08:15 null-03 kernel: [1653448.020532] device-mapper: raid1: Mirror read failed.
Sep 21 07:08:15 null-03 kernel: [1653448.026364] buffer_io_error: 2 callbacks suppressed
Sep 21 07:08:15 null-03 kernel: [1653448.032017] Buffer I/O error on dev dm-7, logical block 127984, async page read
Sep 21 07:14:30 null-03 kernel: [1653823.389915] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1096] - retrying
Sep 21 07:14:45 null-03 kernel: [1653838.397113] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1097] - retrying
Sep 21 07:15:00 null-03 kernel: [1653853.404633] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1098] - retrying
Sep 21 07:15:15 null-03 kernel: [1653868.412175] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1099] - retrying
(continues forever)
Sep 21 07:53:47 null-03 kernel: [1656179.569617] device-mapper: dm-log-userspace: [p4ZZ7JWN] Request timed out: [5/1253] - retrying

Comment 23 David Teigland 2018-09-21 20:50:15 UTC
Changed the test so that the new leg is added from a new device (not the device that had previously failed.)  This produced a kernel panic on the second node:

[ 3351.179294] device-mapper: raid1: Primary mirror (253:4) failed while out-of-sync: Reads may fail.
[ 3351.189841] device-mapper: raid1: Read failure on mirror device 253:4.  Trying alternative device.
[ 3351.200476] general protection fault: 0000 [#1] SMP 
[ 3351.206772] Modules linked in: gfs2 dm_log_userspace dlm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi amd64_edac_mod edac_mce_amd edac_cord
[ 3351.283044] CPU: 11 PID: 66 Comm: kworker/11:0 Not tainted 3.10.0-663.el7.x86_64 #1
[ 3351.292200] Hardware name: Dell Inc. PowerEdge R415/08WNM9, BIOS 1.9.3 04/26/2012
[ 3351.301192] Workqueue: kmirrord do_mirror [dm_mirror]
[ 3351.307764] task: ffff9347dd29edd0 ti: ffff9347dd2d0000 task.ti: ffff9347dd2d0000
[ 3351.316780] RIP: 0010:[<ffffffffa452901d>]  [<ffffffffa452901d>] strnlen+0xd/0x40
[ 3351.325817] RSP: 0018:ffff9347dd2d3af0  EFLAGS: 00010086
[ 3351.332667] RAX: ffffffffa4b02b8e RBX: ffffffffa4f7b570 RCX: fffffffffffffffe
[ 3351.341347] RDX: 54495257225b2049 RSI: ffffffffffffffff RDI: 54495257225b2049
[ 3351.350018] RBP: ffff9347dd2d3af0 R08: 000000000000ffff R09: 000000000000ffff
[ 3351.358690] R10: 0000000000000000 R11: 0000000000000000 R12: 54495257225b2049
[ 3351.367348] R13: ffffffffa4f7b920 R14: 00000000ffffffff R15: 0000000000000000
[ 3351.376015] FS:  00007f4eabaa3700(0000) GS:ffff93495e740000(0000) knlGS:0000000000000000
[ 3351.385646] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3351.392935] CR2: 00007f4fbe739490 CR3: 000000041562a000 CR4: 00000000000407e0
[ 3351.401611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3351.410278] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3351.418926] Stack:
[ 3351.422420]  ffff9347dd2d3b28 ffffffffa452b50b ffffffffa4f7b570 ffffffffa4f7b920
[ 3351.431369]  ffff9347dd2d3c40 ffffffffc004d49a ffffffffc004d49a ffff9347dd2d3b98
[ 3351.440317]  ffffffffa452cc11 dead000000000200 000000008a09800c 0000000000000030
[ 3351.449272] Call Trace:
[ 3351.453234]  [<ffffffffa452b50b>] string.isra.7+0x3b/0xf0
[ 3351.460167]  [<ffffffffa452cc11>] vsnprintf+0x201/0x6a0
[ 3351.466932]  [<ffffffffa452d0bd>] vscnprintf+0xd/0x30
[ 3351.473516]  [<ffffffffa428a49b>] vprintk_emit+0x11b/0x510
[ 3351.480546]  [<ffffffffa428aab9>] vprintk_default+0x29/0x40
[ 3351.487615]  [<ffffffffa4899d0f>] printk+0x5d/0x74
[ 3351.493870]  [<ffffffffc004bd70>] mirror_end_io+0xa0/0x1d0 [dm_mirror]
[ 3351.501840]  [<ffffffffc0019beb>] clone_endio+0x4b/0x100 [dm_mod]
[ 3351.509336]  [<ffffffffa443b814>] bio_endio+0x64/0xa0
[ 3351.515778]  [<ffffffffc004adf7>] do_reads+0x1a7/0x200 [dm_mirror]
[ 3351.523341]  [<ffffffffc03cf496>] ? userspace_get_sync_count+0x56/0xa0 [dm_log_userspace]
[ 3351.532910]  [<ffffffffc004c292>] do_mirror+0x102/0x230 [dm_mirror]
[ 3351.540573]  [<ffffffffa42a874a>] process_one_work+0x17a/0x440
[ 3351.547794]  [<ffffffffa42a97e6>] worker_thread+0x126/0x3c0
[ 3351.554759]  [<ffffffffa42a96c0>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 3351.562691]  [<ffffffffa42b093f>] kthread+0xcf/0xe0
[ 3351.568977]  [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40
[ 3351.576493]  [<ffffffffa48b0d18>] ret_from_fork+0x58/0x90
[ 3351.583290]  [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40
[ 3351.590768] Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 2a 
[ 3351.613186] RIP  [<ffffffffa452901d>] strnlen+0xd/0x40
[ 3351.619713]  RSP <ffff9347dd2d3af0>
[ 3351.629783] ---[ end trace b8c67e0a02f0ed9c ]---
[ 3351.636543] Kernel panic - not syncing: Fatal exception
[ 3351.643062] Kernel Offset: 0x23200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3351.655946] ------------[ cut here ]------------
[ 3351.661808] WARNING: CPU: 11 PID: 66 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x5f/0x70
[ 3351.672375] Modules linked in: gfs2 dm_log_userspace dlm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi amd64_edac_mod edac_mce_amd edac_cord
[ 3351.748220] CPU: 11 PID: 66 Comm: kworker/11:0 Tainted: G      D        ------------   3.10.0-663.el7.x86_64 #1
[ 3351.759743] Hardware name: Dell Inc. PowerEdge R415/08WNM9, BIOS 1.9.3 04/26/2012
[ 3351.768681] Workqueue: kmirrord do_mirror [dm_mirror]
[ 3351.775196]  0000000000000000 000000008a09800c ffff93495e743d90 ffffffffa489fc54
[ 3351.784101]  ffff93495e743dd0 ffffffffa4287928 0000007f5e743dd0 0000000000000000
[ 3351.793002]  ffff93495e756cc0 00000001002e8e0a 000000000000000b ffff934757616cc0
[ 3351.801900] Call Trace:
[ 3351.805797]  <IRQ>  [<ffffffffa489fc54>] dump_stack+0x19/0x1b
[ 3351.813046]  [<ffffffffa4287928>] __warn+0xd8/0x100
[ 3351.819392]  [<ffffffffa4287a6d>] warn_slowpath_null+0x1d/0x20
[ 3351.826682]  [<ffffffffa4250aef>] native_smp_send_reschedule+0x5f/0x70
[ 3351.834671]  [<ffffffffa42d2cc7>] trigger_load_balance+0x177/0x1e0
[ 3351.842302]  [<ffffffffa42c2d84>] scheduler_tick+0x104/0x150
[ 3351.849397]  [<ffffffffa42f2e90>] ? tick_sched_do_timer+0x50/0x50
[ 3351.856924]  [<ffffffffa429a2e5>] update_process_times+0x65/0x80
[ 3351.864368]  [<ffffffffa42f2c55>] tick_sched_handle.isra.13+0x25/0x60
[ 3351.872255]  [<ffffffffa42f2ed3>] tick_sched_timer+0x43/0x70
[ 3351.879356]  [<ffffffffa42b4a94>] __hrtimer_run_queues+0xd4/0x260
[ 3351.886900]  [<ffffffffa42b502f>] hrtimer_interrupt+0xaf/0x1d0
[ 3351.894185]  [<ffffffffa4253815>] local_apic_timer_interrupt+0x35/0x60
[ 3351.902165]  [<ffffffffa48b34bd>] smp_apic_timer_interrupt+0x3d/0x50
[ 3351.909971]  [<ffffffffa48b1a1d>] apic_timer_interrupt+0x6d/0x80
[ 3351.917426]  <EOI>  [<ffffffffa4899c6e>] ? panic+0x1c9/0x20d
[ 3351.924560]  [<ffffffffa48a9095>] oops_end+0xc5/0xe0
[ 3351.930971]  [<ffffffffa422e97b>] die+0x4b/0x70
[ 3351.936931]  [<ffffffffa48a89fe>] do_general_protection+0x12e/0x1c0
[ 3351.944630]  [<ffffffffa48a82a8>] general_protection+0x28/0x30
[ 3351.951896]  [<ffffffffa452901d>] ? strnlen+0xd/0x40
[ 3351.958287]  [<ffffffffa452b50b>] string.isra.7+0x3b/0xf0
[ 3351.965111]  [<ffffffffa452cc11>] vsnprintf+0x201/0x6a0
[ 3351.971758]  [<ffffffffa452d0bd>] vscnprintf+0xd/0x30
[ 3351.978225]  [<ffffffffa428a49b>] vprintk_emit+0x11b/0x510
[ 3351.985123]  [<ffffffffa428aab9>] vprintk_default+0x29/0x40
[ 3351.992072]  [<ffffffffa4899d0f>] printk+0x5d/0x74
[ 3351.998211]  [<ffffffffc004bd70>] mirror_end_io+0xa0/0x1d0 [dm_mirror]
[ 3352.006078]  [<ffffffffc0019beb>] clone_endio+0x4b/0x100 [dm_mod]
[ 3352.013507]  [<ffffffffa443b814>] bio_endio+0x64/0xa0
[ 3352.019895]  [<ffffffffc004adf7>] do_reads+0x1a7/0x200 [dm_mirror]
[ 3352.027412]  [<ffffffffc03cf496>] ? userspace_get_sync_count+0x56/0xa0 [dm_log_userspace]
[ 3352.036919]  [<ffffffffc004c292>] do_mirror+0x102/0x230 [dm_mirror]
[ 3352.044490]  [<ffffffffa42a874a>] process_one_work+0x17a/0x440
[ 3352.051603]  [<ffffffffa42a97e6>] worker_thread+0x126/0x3c0
[ 3352.058421]  [<ffffffffa42a96c0>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 3352.066165]  [<ffffffffa42b093f>] kthread+0xcf/0xe0
[ 3352.072219]  [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40
[ 3352.079459]  [<ffffffffa48b0d18>] ret_from_fork+0x58/0x90
[ 3352.085966]  [<ffffffffa42b0870>] ? insert_kthread_work+0x40/0x40
[ 3352.093133] ---[ end trace b8c67e0a02f0ed9d ]---

Comment 24 David Teigland 2018-09-21 21:40:11 UTC
cmirrord log captured just before the panic:

Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_CTR/879]
Sep 21 09:24:13 null-02 cmirrord[1992]: Checking /dev/mapper for device 253:2
Sep 21 09:24:13 null-02 cmirrord[1992]:   .: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   ..: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   cvg-clv_mimagetmp_3: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   cvg-clv: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   cvg-clv_mimage_2: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   cvg-clv_mimage_0: NO
Sep 21 09:24:13 null-02 cmirrord[1992]:   cvg-clv_mlog: YES
Sep 21 09:24:13 null-02 cmirrord[1992]: Clustered log disk is /dev/mapper/cvg-clv_mlog
Sep 21 09:24:13 null-02 cmirrord[1992]: Disk log ready
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Cluster log created
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_GET_REGION_SIZE/880]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/881]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_CTR/882]
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Cluster log created
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_REGION_SIZE/883]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/884]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_PRESUSPEND/885]
Sep 21 09:24:13 null-02 cmirrord[1992]: WARNING: log still marked as 'touched' during suspend
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_POSTSUSPEND/886]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] clog_postsuspend: leaving CPG
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Finalizing leave...
Sep 21 09:24:13 null-02 cmirrord[1992]: Freeing up cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]:  used_pfds = 1, free_pfds = 2
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] clog_postsuspend: finalizing
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_RESUME/887]
Sep 21 09:24:13 null-02 cmirrord[1992]: New   handle: 8408462744751177728
Sep 21 09:24:13 null-02 cmirrord[1992]: New   name: LVM-4Zjg4QhQ7IwysAmEYDWwKfLe2uTjYrwzO9v3WtkD3KcdVX30Gt523KwhNuvivmCd
Sep 21 09:24:13 null-02 cmirrord[1992]: Adding cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]:  used_pfds = 2, free_pfds = 1
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Log not valid yet, storing request
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] loading sync_bits (sync_count = 0):
Sep 21 09:24:13 null-02 cmirrord[1992]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] loading clean_bits:
Sep 21 09:24:13 null-02 cmirrord[1992]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Processing delayed request: DM_ULOG_RESUME
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Non-master resume: bits pre-loaded
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/888]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b]  Request from kernel received: [DM_ULOG_DTR/889]
Sep 21 09:24:13 null-02 cmirrord[1992]: [ygewHh4b] Cluster log removed
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] storing clean_bits:
Sep 21 09:24:13 null-02 cmirrord[1992]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] storing sync_bits (sync_count = 0):
Sep 21 09:24:13 null-02 cmirrord[1992]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Checkpoint prepared for node 1:
Sep 21 09:24:13 null-02 cmirrord[1992]:   bitmap_size = 32
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/890]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_PRESUSPEND/891]
Sep 21 09:24:13 null-02 cmirrord[1992]: WARNING: log still marked as 'touched' during suspend
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/892]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_IN_SYNC/893]
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Region is not in-sync: 249
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 kernel: [ 3351.179294] device-mapper: raid1: Primary mirror (253:4) failed while out-of-sync: Reads may fail.
Sep 21 09:24:13 null-02 kernel: [ 3351.189841] device-mapper: raid1: Read failure on mirror device 253:4.  Trying alternative device.
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/894]
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 4
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on local/4
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd]  Request from kernel received: [DM_ULOG_IN_SYNC/895]
Sep 21 09:24:13 null-02 cmirrord[1992]: [NuvivmCd] Region is not in-sync: 249
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6
Sep 21 09:24:13 null-02 cmirrord[1992]: Data ready on 6
Sep 21 09:24:13 null-02 cmirrord[1992]: Issuing callback on cluster/6

Comment 26 David Teigland 2018-09-24 15:18:22 UTC
Retried with:

reverted commit 672b8c196b91e4cf6567223ce44dfb902281b791

used a new disk to restore the leg instead of the previously failed disk

ran the following command on null-01 to restore the leg, specifying only one (new) PV:
lvconvert -vvvv --yes --type mirror --mirrorlog disk -m 2 cvg/clv /dev/sdf

[root@null-01 ~]# lvs -a cvg -o+devices
  /dev/sdc: open failed: No such device or address
  LV             VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log        Cpy%Sync Convert Devices
  clv            cvg mwi-a-m--- 500.00m                                [clv_mlog] 100.00           clv_mimage_0(0),clv_mimage_2(0),clv_mimage_3(0)
  [clv_mimage_0] cvg iwi-aom--- 500.00m                                                            /dev/sdb(0)
  [clv_mimage_2] cvg iwi-aom--- 500.00m                                                            /dev/sdd(0)
  [clv_mimage_3] cvg iwi-aom--- 500.00m                                                            /dev/sdf(0)
  [clv_mlog]     cvg lwi-aom---   4.00m                                                            /dev/sde(0)


So, some of the problems we've seen did not appear (lvconvert didn't hang, no panic).  Given the kernel errors in the logs, I suspect that if gfs was mounted it still would have read bad data and panicked/withdrawn from io errors.

I've copied out some parts of the logs from each node beginning with the command that was restoring the mirror leg.
Some of these messages still look quite concerning, and we should have someone who is familiar with them evaluate
how bad they are, i.e. whether they are expected during recovery or whether they should not be appearing at all.
 

Sep 24 14:53:19 null-01 dmeventd[1831]: Repair of mirrored device cvg-clv finished successfully.
Sep 24 14:53:19 null-01 dmeventd[1831]: cvg-clv is now in-sync.
Sep 24 14:55:14 null-01 root: vgreduce removemissing
Sep 24 14:59:11 null-01 root: lvconvert m2 from sdf
Sep 24 14:59:11 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 14:59:11 null-01 dmeventd[1831]: Unlocking memory
Sep 24 14:59:11 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events.
...
Sep 24 14:59:15 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv_mimagetmp_3 for events.
Sep 24 14:59:15 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events.
Sep 24 14:59:30 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call
Sep 24 14:59:30 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events.
Sep 24 14:59:30 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 14:59:30 null-01 dmeventd[1831]: Unlocking memory
Sep 24 14:59:30 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events.
Sep 24 14:59:30 null-01 cmirrord[1810]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW
Sep 24 14:59:30 null-01 cmirrord[1810]: clog_resume:  Failed to create cluster CPG
Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log presuspend failed
Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log presuspend failed
Sep 24 14:59:30 null-01 kernel: device-mapper: raid1: log postsuspend failed
....
Sep 24 14:59:30 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events.
Sep 24 14:59:30 null-01 dmeventd[1831]: cvg-clv is now in-sync.
Sep 24 14:59:31 null-01 dmeventd[1831]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 14:59:31 null-01 dmeventd[1831]: No longer monitoring mirror device cvg-clv for events.
...
Sep 24 14:59:34 null-01 dmeventd[1831]: Monitoring mirror device cvg-clv for events.


Sep 24 03:40:15 null-02 root: lvconvert m2 from sdf run on null-01
Sep 24 03:40:15 null-02 cmirrord[1201]: Data ready on 6
Sep 24 03:40:15 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 24 03:40:15 null-02 cmirrord[1201]: Data ready on 6
Sep 24 03:40:15 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 24 03:40:16 null-02 cmirrord[1201]: Data ready on 4
Sep 24 03:40:16 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 03:40:16 null-02 dmeventd[1228]: Unlocking memory
Sep 24 03:40:16 null-02 dmeventd[1228]: Device cache incomplete - not writing to /etc/lvm/cache/.cache
Sep 24 03:40:16 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv for events.
Sep 24 03:40:16 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 24 03:40:16 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_CTR/100507]
Sep 24 03:40:16 null-02 cmirrord[1201]: Checking /dev/mapper for device 253:2
Sep 24 03:40:16 null-02 cmirrord[1201]:   .: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   ..: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   cvg-clv_mimagetmp_3: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   cvg-clv: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   cvg-clv_mimage_2: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   cvg-clv_mimage_0: NO
Sep 24 03:40:16 null-02 cmirrord[1201]:   cvg-clv_mlog: YES
Sep 24 03:40:16 null-02 cmirrord[1201]: Clustered log disk is /dev/mapper/cvg-clv_mlog
Sep 24 03:40:16 null-02 cmirrord[1201]: Disk log ready
Sep 24 03:40:16 null-02 cmirrord[1201]: [77KBKJNW] Cluster log created
...
Sep 24 03:40:29 null-02 dmeventd[1228]: cvg-clv is now in-sync.
Sep 24 03:40:35 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call
Sep 24 03:40:35 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events.
Sep 24 03:40:35 null-02 dmeventd[1228]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 03:40:35 null-02 dmeventd[1228]: No longer monitoring mirror device cvg-clv for events.
Sep 24 03:40:35 null-02 cmirrord[1201]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW
Sep 24 03:40:35 null-02 cmirrord[1201]: clog_resume:  Failed to create cluster CPG
Sep 24 03:40:35 null-02 kernel: [ 1703.100972] device-mapper: raid1: log presuspend failed
Sep 24 03:40:35 null-02 kernel: [ 1703.108117] device-mapper: raid1: log presuspend failed
Sep 24 03:40:35 null-02 kernel: [ 1703.114601] device-mapper: raid1: log postsuspend failed



Sep 24 03:39:33 null-03 dmeventd[1126]: cvg-clv is now in-sync.
Sep 24 03:45:33 null-03 root: lvconvert m2 from sdf run on null-01
Sep 24 03:45:33 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 03:45:33 null-03 dmeventd[1126]: Unlocking memory
Sep 24 03:45:33 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv for events.
Sep 24 03:45:33 null-03 kernel: [ 1685.367551] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
Sep 24 03:45:33 null-03 kernel: [ 1685.377763] device-mapper: raid1: Read failure on mirror device 253:5.  Trying alternative device.
Sep 24 03:45:33 null-03 kernel: [ 1685.387977] device-mapper: raid1: Unable to read primary mirror during recovery
Sep 24 03:45:33 null-03 kernel: [ 1685.396486] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
Sep 24 03:45:33 null-03 kernel: [ 1685.407294] device-mapper: raid1: Mirror read failed.
Sep 24 03:45:36 null-03 kernel: [ 1688.506127] device-mapper: raid1: Mirror read failed.
Sep 24 03:45:36 null-03 kernel: [ 1688.512437] Buffer I/O error on dev dm-7, logical block 127984, async page read
Sep 24 03:45:36 null-03 kernel: [ 1688.529859] device-mapper: raid1: Mirror read failed.
Sep 24 03:45:36 null-03 kernel: [ 1688.536808] device-mapper: raid1: Mirror read failed.
Sep 24 03:45:36 null-03 dmeventd[1126]: Monitoring mirror device cvg-clv_mimagetmp_3 for events.
Sep 24 03:45:36 null-03 kernel: [ 1688.543037] Buffer I/O error on dev dm-7, logical block 127984, async page read
Sep 24 03:45:36 null-03 dmeventd[1126]: Monitoring mirror device cvg-clv for events.
Sep 24 03:45:52 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWcALs3MMP7Ke2x9q4TQ7DHWGp1qtMrC9y failed: Interrupted system call
Sep 24 03:45:52 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv_mimagetmp_3 for events.
Sep 24 03:45:52 null-03 dmeventd[1126]: device-mapper: waitevent ioctl on  LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV failed: Interrupted system call
Sep 24 03:45:52 null-03 dmeventd[1126]: Unlocking memory
Sep 24 03:45:52 null-03 dmeventd[1126]: No longer monitoring mirror device cvg-clv for events.
Sep 24 03:45:52 null-03 cmirrord[1099]: Log entry already exists: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW
Sep 24 03:45:52 null-03 cmirrord[1099]: clog_resume:  Failed to create cluster CPG
Sep 24 03:45:52 null-03 kernel: [ 1704.096658] device-mapper: raid1: log presuspend failed
Sep 24 03:45:52 null-03 kernel: [ 1704.107810] device-mapper: raid1: log presuspend failed
Sep 24 03:45:52 null-03 kernel: [ 1704.114334] device-mapper: raid1: log postsuspend failed

Comment 27 David Teigland 2018-09-24 15:30:20 UTC
Created attachment 1486443 [details]
working lvconvert

vvvv output from the lvconvert command that completed in the last comment.

Comment 30 David Teigland 2018-09-25 15:19:46 UTC
Again, no evidence of any problem after recovering from the failed leg.  The problems appeared only when I added another leg:

[root@null-01 ~]# lvs -a -o+devices cvg
  /dev/sdd: open failed: No such device or address
  LV                VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log        Cpy%Sync Convert           Devices                           
  clv               cvg cwi-aom--- 500.00m                                           57.60    [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0)
  [clv_mimage_0]    cvg iwi-aom--- 500.00m                                                                      /dev/sdb(0)                       
  [clv_mimage_3]    cvg iwi-aom--- 500.00m                                                                      /dev/sdf(0)                       
  [clv_mimage_4]    cvg Iwi-aom--- 500.00m                                                                      /dev/sdg(0)                       
  [clv_mimagetmp_4] cvg mwi-aom--- 500.00m                                [clv_mlog] 100.00                     clv_mimage_0(0),clv_mimage_3(0)   
  [clv_mlog]        cvg lwi-aom---   4.00m                                                                      /dev/sde(0)                       


[root@null-02 ~]#  lvs -a -o+devices cvg
  /dev/sdd: open failed: No such device or address
  LV                VG  Attr       LSize   Log        Cpy%Sync Convert           Devices                           
  clv               cvg cwi-aom--- 500.00m            57.60    [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0)
  [clv_mimage_0]    cvg iwi-aom--- 500.00m                                       /dev/sdb(0)                       
  [clv_mimage_3]    cvg iwi-aom--- 500.00m                                       /dev/sdf(0)                       
  [clv_mimage_4]    cvg Iwi-aom--- 500.00m                                       /dev/sdg(0)                       
  [clv_mimagetmp_4] cvg mwi-aom--- 500.00m [clv_mlog] 100.00                     clv_mimage_0(0),clv_mimage_3(0)   
  [clv_mlog]        cvg lwi-aom---   4.00m                                       /dev/sde(0)                       


[root@null-03 ~]#  lvs -a -o+devices cvg
  /dev/sdd: open failed: No such device or address
  LV                VG  Attr       LSize   Log        Cpy%Sync Convert           Devices                           
  clv               cvg cwi-aom--- 500.00m            57.60    [clv_mimagetmp_4] clv_mimagetmp_4(0),clv_mimage_4(0)
  [clv_mimage_0]    cvg iwi-aom--- 500.00m                                       /dev/sdb(0)                       
  [clv_mimage_3]    cvg iwi-aom--- 500.00m                                       /dev/sdf(0)                       
  [clv_mimage_4]    cvg Iwi-aom--- 500.00m                                       /dev/sdg(0)                       
  [clv_mimagetmp_4] cvg mwi-aom--- 500.00m [clv_mlog] 100.00                     clv_mimage_0(0),clv_mimage_3(0)   
  [clv_mlog]        cvg lwi-aom---   4.00m                                       /dev/sde(0)                       


[root@null-01 ~]# dmsetup ls --tree
rhel_null--01-home (253:2)
 └─ (8:2)
cvg-clv (253:7)
 ├─cvg-clv_mimage_4 (253:6)
 │  └─ (8:96)
 └─cvg-clv_mimagetmp_4 (253:5)
    ├─cvg-clv_mimage_3 (253:8)
    │  └─ (8:80)
    ├─cvg-clv_mimage_0 (253:4)
    │  └─ (8:16)
    └─cvg-clv_mlog (253:3)
       └─ (8:64)
rhel_null--01-swap (253:1)
 └─ (8:2)
rhel_null--01-root (253:0)
 └─ (8:2)

[root@null-02 ~]# dmsetup ls --tree
rhel_null--02-swap (253:0)
 └─ (8:2)
rhel_null--02-root (253:1)
 └─ (8:2)
cvg-clv (253:6)
 ├─cvg-clv_mimage_4 (253:5)
 │  └─ (8:96)
 └─cvg-clv_mimagetmp_4 (253:4)
    ├─cvg-clv_mimage_3 (253:7)
    │  └─ (8:80)
    ├─cvg-clv_mimage_0 (253:3)
    │  └─ (8:16)
    └─cvg-clv_mlog (253:2)
       └─ (8:64)

[root@null-03 ~]# dmsetup ls --tree
rhel_null--03-home (253:2)
 └─ (8:2)
rhel_null--03-swap (253:0)
 └─ (8:2)
rhel_null--03-root (253:1)
 └─ (8:2)
cvg-clv (253:7)
 ├─cvg-clv_mimage_4 (253:6)
 │  └─ (8:96)
 └─cvg-clv_mimagetmp_4 (253:5)
    ├─cvg-clv_mimage_3 (253:8)
    │  └─ (8:80)
    ├─cvg-clv_mimage_0 (253:4)
    │  └─ (8:16)
    └─cvg-clv_mlog (253:3)
       └─ (8:64)

In status, note that null-02 says "AA" and the others say "SA".

[root@null-01 ~]# dmsetup status
cvg-clv_mimage_4: 0 1024000 linear 
cvg-clv_mlog: 0 8192 linear 
cvg-clv_mimage_3: 0 1024000 linear 
cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:4 253:8 250/250 1 AA 3 clustered-disk 253:3 A
cvg-clv_mimage_0: 0 1024000 linear 
rhel_null--01-home: 0 853286912 linear 
cvg-clv: 0 1024000 mirror 2 253:5 253:6 144/250 1 SA 1 clustered-core
rhel_null--01-swap: 0 16515072 linear 
rhel_null--01-root: 0 104857600 linear 

[root@null-02 ~]# dmsetup status
rhel_null--02-swap: 0 16416768 linear 
rhel_null--02-root: 0 104857600 linear 
cvg-clv_mimage_4: 0 1024000 linear 
cvg-clv_mlog: 0 8192 linear 
cvg-clv_mimage_3: 0 1024000 linear 
cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:3 253:7 250/250 1 AA 3 clustered-disk 253:3 A
cvg-clv_mimage_0: 0 1024000 linear 
cvg-clv: 0 1024000 mirror 2 253:4 253:5 144/250 1 AA 1 clustered-core

[root@null-03 ~]# dmsetup status
rhel_null--03-home: 0 854466560 linear
rhel_null--03-swap: 0 16416768 linear
rhel_null--03-root: 0 104857600 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimagetmp_4: 0 1024000 mirror 2 253:4 253:8 250/250 1 AA 3 clustered-disk 253:3 A
cvg-clv_mimage_0: 0 1024000 linear
cvg-clv: 0 1024000 mirror 2 253:5 253:6 144/250 1 SA 1 clustered-core


null-02 is where we see this, so it indicates the "AA" could be related:

[88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log
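
(In the mirror status lines above, the "AA"/"SA" field reports per-leg health as emitted by the dm-mirror target, one character per listed device: 'A' alive, 'D' write failure, 'S' sync failure, 'R' read failure. So on null-01 and null-03 the first listed device of cvg-clv (253:5, the mimagetmp mirror) has recorded a sync error, while null-02 reports both devices healthy. A quick way to re-check this per node:

    dmsetup status cvg-clv
)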

Comment 31 David Teigland 2018-09-25 15:28:08 UTC
More kernel log messages for comment 30.  null-02 with status "AA" does not have the dm kernel messages that the other two nodes with status "SA" show.


null-01:
[88525.369285] device-mapper: raid1: Unable to read primary mirror during recovery
[88525.376599] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
[88525.386834] device-mapper: raid1: Mirror read failed.
[88525.393633] device-mapper: raid1: Mirror read failed.
[88525.398680] Buffer I/O error on dev dm-7, logical block 127984, async page read
[88528.681717] device-mapper: raid1: Mirror read failed.
[88528.686768] Buffer I/O error on dev dm-7, logical block 127984, async page read


null-02:
[88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log


null-03:
[88508.205581] device-mapper: raid1: Unable to read primary mirror during recovery
[88508.210503] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
[88508.210505] device-mapper: raid1: Read failure on mirror device 253:5.  Trying alternative device.
[88508.210576] device-mapper: raid1: Mirror read failed.
[88508.210668] device-mapper: raid1: Mirror read failed.
[88508.210671] Buffer I/O error on dev dm-7, logical block 127984, async page read
[88508.254580] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
[88511.706583] device-mapper: raid1: Mirror read failed.
[88511.713702] device-mapper: raid1: Mirror read failed.
[88511.719832] Buffer I/O error on dev dm-7, logical block 127984, async page read

Comment 32 David Teigland 2018-09-25 15:47:40 UTC
null-02 is the node where cmirrord was active.  Here are the cmirrord logs surrounding the GFS2 io error message.


Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 144
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 144
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233270]
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233271]
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233272]
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IS_REMOTE_RECOVERING/233273]
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 102
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 102
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_PRESUSPEND/233274]
Sep 25 03:47:22 null-02 cmirrord[1201]: WARNING: log still marked as 'touched' during suspend
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits:
Sep 25 03:47:22 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:22 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0):
Sep 25 03:47:22 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:22 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 1:
Sep 25 03:47:22 null-02 cmirrord[1201]:   bitmap_size = 32
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233275]
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... [not remote recovering]: 41
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:22 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... [not remote recovering]: 144
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 kernel: [88511.157159] GFS2: fsid=dcl:clv.1: Error -5 writing to log
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233276]
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IS_REMOTE_RECOVERING/233277]
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Recovery halted... [not remote recovering]: 102
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233278]
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 102
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233279]
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 102
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_POSTSUSPEND/233280]
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233281]
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] clog_postsuspend: leaving CPG
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: Finalizing leave...
Sep 25 03:47:23 null-02 cmirrord[1201]: Freeing up cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]:  used_pfds = 1, free_pfds = 2
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV] clog_postsuspend: finalizing
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_RESUME/233282]
Sep 25 03:47:23 null-02 cmirrord[1201]: New   handle: 8782769015748558848
Sep 25 03:47:23 null-02 cmirrord[1201]: New   name: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGWdcwnQs0yU8HuJ8dlYfutK5xc77KBKJNW
Sep 25 03:47:23 null-02 cmirrord[1201]: Adding cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]:  used_pfds = 2, free_pfds = 1
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Master resume: reading disk log
Sep 25 03:47:23 null-02 cmirrord[1201]: Disk log initialized
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing clean_bits:
Sep 25 03:47:23 null-02 cmirrord[1201]: [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing sync_bits (sync_count = 250):
Sep 25 03:47:23 null-02 cmirrord[1201]: [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Checkpoint prepared for node 3:
Sep 25 03:47:23 null-02 cmirrord[1201]:   bitmap_size = 32
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing clean_bits:
Sep 25 03:47:23 null-02 cmirrord[1201]: [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] storing sync_bits (sync_count = 250):
Sep 25 03:47:23 null-02 cmirrord[1201]: [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Sep 25 03:47:23 null-02 cmirrord[1201]: [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 03
Sep 25 03:47:23 null-02 cmirrord[1201]: [77KBKJNW] Checkpoint prepared for node 1:
Sep 25 03:47:23 null-02 cmirrord[1201]:   bitmap_size = 32
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:23 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:23 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:23 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_RESUME/233283]
Sep 25 03:47:26 null-02 cmirrord[1201]: New   handle: 8450407966594564097
Sep 25 03:47:26 null-02 cmirrord[1201]: New   name: LVM-iLKoSiQC0Pbny8zPiAeB3HtswiXs1yGW1g5OasngM0qe7iWEjksvfp9kaYeLEPtV
Sep 25 03:47:26 null-02 cmirrord[1201]: Adding cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]:  used_pfds = 3, free_pfds = 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233284]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Master resume.
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits:
Sep 25 03:47:26 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0):
Sep 25 03:47:26 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 3:
Sep 25 03:47:26 null-02 cmirrord[1201]:   bitmap_size = 32
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_RESYNC_WORK/233285]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing clean_bits:
Sep 25 03:47:26 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] storing sync_bits (sync_count = 0):
Sep 25 03:47:26 null-02 cmirrord[1201]: [  0 -  15] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [ 16 -  31] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Checkpoint prepared for node 1:
Sep 25 03:47:26 null-02 cmirrord[1201]:   bitmap_size = 32
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233286]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 6
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/6
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233287]
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233288]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233289]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233290]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233291]
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 249
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233292]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233293]
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 193
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233294]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233295]
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 193
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233296]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233297]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233298]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233299]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233300]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233301]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 249
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233302]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 193
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233303]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 193
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote recovery: 41
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority list: 41
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 dmeventd[1228]: devices/ignore_lvm_mirrors not found in config: defaulting to 1
Sep 25 03:47:26 null-02 dmeventd[1228]: devices/cache not found in config: defaulting to /etc/lvm/cache/.cache
Sep 25 03:47:26 null-02 dmeventd[1228]: Loaded persistent filter cache from /etc/lvm/cache/.cache
Sep 25 03:47:26 null-02 dmeventd[1228]: Failed to load existing device cache from /etc/lvm/cache/.cache
Sep 25 03:47:26 null-02 dmeventd[1228]: Locking memory
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_GET_SYNC_COUNT/233304]
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV]  Request from kernel received: [DM_ULOG_IN_SYNC/233305]
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is not in-sync: 249
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233306]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 249
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233307]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233308]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: [aYeLEPtV] Region is currently remote recovering: 144
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233309]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 7
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on cluster/7
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233310]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW]  Request from kernel received: [DM_ULOG_IN_SYNC/233311]
Sep 25 03:47:26 null-02 cmirrord[1201]: [77KBKJNW] Region is in-sync: 0
Sep 25 03:47:26 null-02 cmirrord[1201]: Data ready on 4
Sep 25 03:47:26 null-02 cmirrord[1201]: Issuing callback on local/4

Comment 33 Jonathan Earl Brassow 2018-09-25 16:04:26 UTC
(In reply to David Teigland from comment #32)
> null-02 is the node where cmirrord was active.  Here are the cmirrord logs
> surrounding the GFS2 io error message.
> 
> 
> Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Region is pending remote
> recovery: 144
> Sep 25 03:47:22 null-02 cmirrord[1201]: [aYeLEPtV] Adding region to priority
> list: 144
> Sep 25 03:47:22 null-02 cmirrord[1201]: Data ready on 6

I don't see anything alarming in comment 32.  (Note that cmirrord should be active on all nodes.)

Comment 34 Jonathan Earl Brassow 2018-09-25 16:11:39 UTC
(In reply to David Teigland from comment #31)
> More kernel log messages for comment 30.  null-02 with status "AA" does not
> have the dm kernel messages that the other two nodes with status "SA" show.
> 
> 
> null-01:
> [88525.369285] device-mapper: raid1: Unable to read primary mirror during
> recovery

Yes, the 'S' in the status output means exactly what is printed in this error message.  For some reason, the primary mirror could not be read.

The primary mirror in this case is itself a mirror:
 └─cvg-clv_mimagetmp_4 (253:5) <------------ here.
    ├─cvg-clv_mimage_3 (253:8)
    │  └─ (8:80)
    ├─cvg-clv_mimage_0 (253:4)
    │  └─ (8:16)
    └─cvg-clv_mlog (253:3)
       └─ (8:64)

Comment 35 Jonathan Earl Brassow 2018-09-25 16:40:04 UTC
Reproducible without the device failure.
Steps to reproduce:

# Create 3-way cluster mirror
[root@bp-01 ~]# lvcreate --type mirror -L 1G -n lv -m 2 vg

# Find center mimage device (/dev/sdc1)
[root@bp-01 ~]# devices
  LV            Attr       Cpy%Sync Devices
  home          -wi-ao----          /dev/sda2(2016)
  root          -wi-ao----          /dev/sda2(106177)
  swap          -wi-ao----          /dev/sda2(0)
  lv            mwi-a-m--- 100.00   lv_mimage_0(0),lv_mimage_1(0),lv_mimage_2(0)
  [lv_mimage_0] iwi-aom---          /dev/sdb1(0)
  [lv_mimage_1] iwi-aom---          /dev/sdc1(0)
  [lv_mimage_2] iwi-aom---          /dev/sdd1(0)
  [lv_mlog]     lwi-aom---          /dev/sdi1(0)

# Remove center mimage device
[root@bp-01 ~]# lvconvert -m1 vg/lv /dev/sdc1
  Logical volume vg/lv converted.

# Observe that mimage_1 (/dev/sdc1) has been removed
[root@bp-01 ~]# devices
  LV            Attr       Cpy%Sync Devices
  home          -wi-ao----          /dev/sda2(2016)
  root          -wi-ao----          /dev/sda2(106177)
  swap          -wi-ao----          /dev/sda2(0)
  lv            mwi-a-m--- 100.00   lv_mimage_0(0),lv_mimage_2(0)
  [lv_mimage_0] iwi-aom---          /dev/sdb1(0)
  [lv_mimage_2] iwi-aom---          /dev/sdd1(0)
  [lv_mlog]     lwi-aom---          /dev/sdi1(0)

# Select a new device (/dev/sde1) to upconvert with
[root@bp-01 ~]# lvconvert -m 2 vg/lv /dev/sde1
  Logical volume vg/lv being converted.
  vg/lv: Converted: 2.73%
  vg/lv: Converted: 100.00%

# See that the logs contain messages that they should not
bp-01 (where the command was run):
[525708.909079] device-mapper: raid1: Unable to read primary mirror during recovery
[525708.916474] device-mapper: raid1: Primary mirror (253:5) failed while out-of-sync: Reads may fail.
[525708.925639] device-mapper: raid1: Mirror read failed.

bp-02:
[525716.599401] device-mapper: raid1: log presuspend failed
[525716.613783] device-mapper: raid1: log presuspend failed
[525716.619219] device-mapper: raid1: log postsuspend failed

bp-03:
[525690.181011] device-mapper: raid1: log presuspend failed
[525690.198245] device-mapper: raid1: log presuspend failed
[525690.203626] device-mapper: raid1: log postsuspend failed
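
For convenience, the reproduction above can be collected into a small shell sketch; the VG name and the /dev/sd*1 devices are the ones from the example and are otherwise assumptions for any other cluster:

# Reproducer sketch (assumes a clustered VG named 'vg' and the devices used above)
lvcreate --type mirror -L 1G -n lv -m 2 vg

# Drop the middle mimage device (/dev/sdc1), leaving a 2-way mirror
lvconvert -m1 vg/lv /dev/sdc1

# Up-convert back to 3-way using a new device (/dev/sde1)
lvconvert -m 2 vg/lv /dev/sde1

# Check all nodes for the unexpected kernel messages
dmesg | grep -E 'raid1: (Unable to read primary|log (pre|post)suspend failed)'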

Comment 36 David Teigland 2018-09-25 18:12:29 UTC
After rebooting I didn't remove any devices or any legs, I just added a new mirror leg and triggered the error.

[root@null-01 ~]# lvs -a -o+devices cvg
  LV             VG  Attr       LSize   Log        Cpy%Sync Devices
  clv            cvg mwi-a-m--- 500.00m [clv_mlog] 100.00   clv_mimage_0(0),clv_mimage_3(0),clv_mimage_4(0)
  [clv_mimage_0] cvg iwi-aom--- 500.00m                     /dev/sdb(0)
  [clv_mimage_3] cvg iwi-aom--- 500.00m                     /dev/sdf(0)
  [clv_mimage_4] cvg iwi-aom--- 500.00m                     /dev/sdg(0)
  [clv_mlog]     cvg lwi-aom---   4.00m                     /dev/sde(0)

[root@null-01 ~]# lvconvert -m 3 cvg/clv /dev/sdd
  Logical volume cvg/clv being converted.
  /run/lvm/lvmpolld.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmpolld. Proceeding with polling without using lvmpolld.
  WARNING: Check global/use_lvmpolld in lvm.conf or the lvmpolld daemon state.
  cvg/clv: Converted: 1.60%
  cvg/clv: Converted: 100.00%
  Logical volume cvg/clv converted.


[root@null-01 ~]# dmsetup status
cvg-clv_mimage_5: 0 1024000 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
rhel_null--01-home: 0 853286912 linear
cvg-clv: 0 1024000 mirror 4 253:4 253:5 253:7 253:9 250/250 1 AAAA 3 clustered-disk 253:3 A
rhel_null--01-swap: 0 16515072 linear
rhel_null--01-root: 0 104857600 linear

[root@null-02 ~]# dmsetup status
rhel_null--02-swap: 0 16416768 linear
rhel_null--02-root: 0 104857600 linear
cvg-clv_mimage_5: 0 1024000 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
cvg-clv: 0 1024000 mirror 4 253:3 253:4 253:6 253:8 250/250 1 AAAA 3 clustered-disk 253:3 A

[root@null-03 ~]# dmsetup status
rhel_null--03-home: 0 854466560 linear
rhel_null--03-swap: 0 16416768 linear
cvg-clv_mimage_5: 0 1024000 linear
rhel_null--03-root: 0 104857600 linear
cvg-clv_mimage_4: 0 1024000 linear
cvg-clv_mlog: 0 8192 linear
cvg-clv_mimage_3: 0 1024000 linear
cvg-clv_mimage_0: 0 1024000 linear
cvg-clv: 0 1024000 mirror 4 253:4 253:5 253:7 253:9 250/250 1 AAAA 3 clustered-disk 253:3 A


null-01:
[ 3778.178020] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 3794.007824] device-mapper: raid1: log presuspend failed
[ 3794.014788] device-mapper: raid1: log presuspend failed
[ 3794.020111] device-mapper: raid1: log postsuspend failed
[ 3980.771125] device-mapper: raid1: log presuspend failed
[ 3980.777218] device-mapper: raid1: log presuspend failed
[ 3980.782570] device-mapper: raid1: log postsuspend failed


null-02:
[ 4878.089705] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 4893.891383] device-mapper: raid1: log presuspend failed
[ 4893.898711] device-mapper: raid1: log presuspend failed
[ 4893.905391] device-mapper: raid1: log postsuspend failed
[ 5080.630072] device-mapper: raid1: log presuspend failed
[ 5080.637326] device-mapper: raid1: log presuspend failed
[ 5080.644137] device-mapper: raid1: log postsuspend failed


null-03:
[ 3766.366141] device-mapper: dm-log-userspace: version 1.3.0 loaded
[ 3782.170933] device-mapper: raid1: log presuspend failed
[ 3782.186368] device-mapper: raid1: log presuspend failed
[ 3782.193035] device-mapper: raid1: log postsuspend failed
[ 3950.081949] device-mapper: raid1: Unable to read primary mirror during recovery
[ 3950.086238] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail.
[ 3950.086239] device-mapper: raid1: Read failure on mirror device 253:6.  Trying alternative device.
[ 3950.086312] device-mapper: raid1: Mirror read failed.
[ 3950.086407] device-mapper: raid1: Mirror read failed.
[ 3950.086410] Buffer I/O error on dev dm-8, logical block 127984, async page read
[ 3950.132341] device-mapper: raid1: Primary mirror (253:6) failed while out-of-sync: Reads may fail.
[ 3953.272811] device-mapper: raid1: Mirror read failed.
[ 3953.279836] device-mapper: raid1: Mirror read failed.
[ 3953.286215] Buffer I/O error on dev dm-8, logical block 127984, async page read
[ 3968.913718] device-mapper: raid1: log presuspend failed
[ 3968.923439] device-mapper: raid1: log presuspend failed
[ 3968.930068] device-mapper: raid1: log postsuspend failed

Comment 37 David Teigland 2018-09-25 22:11:12 UTC
In my own testing, I found that the problems appear to come from commit
fade45b1d14c90eac7c6cc41019ef946659a8fb5

To cleanly revert fade45b1 from the stable branch, first revert 5f648406b0f2e322851fce36b062a298cb3d6736 which is fixing an unrelated regression from fade45b1 (see bug 1594916).

When I revert both of these on the stable branch I no longer see the problem.
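
For reference, a minimal sketch of that revert order (the stable branch name is an assumption; the hashes are the ones quoted above):

git checkout stable-2.02
# Revert the follow-up fix first so the second revert applies cleanly (see bug 1594916)
git revert 5f648406b0f2e322851fce36b062a298cb3d6736
# Then revert the commit that introduced the regression
git revert fade45b1d14c90eac7c6cc41019ef946659a8fb5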

Comment 38 Jonathan Earl Brassow 2018-09-26 13:17:23 UTC
(In reply to David Teigland from comment #37)
> In my own testing, I found that the problems appear to come from commit
> fade45b1d14c90eac7c6cc41019ef946659a8fb5
> 
> To cleanly revert fade45b1 from the stable branch, first revert
> 5f648406b0f2e322851fce36b062a298cb3d6736 which is fixing an unrelated
> regression from fade45b1 (see bug 1594916).
> 
> When I revert both of these on the stable branch I no longer see the problem.

I'll try that as soon as I can.

Yesterday, I ran into a lot of trouble trying to bisect - many checkins did not compile/work.  The commit I know to work is:
877c2f2ffb3813e3236a55f07958009633dc8558 - good Mon Apr 30 09:34:12 2018
However, I've been running into bcache issues and have been unable to purge the VG and try other commits... will keep at it.
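
A minimal sketch of that bisect, assuming the bad revision is the current stable-branch head and using the known-good commit above:

git bisect start
git bisect bad HEAD
git bisect good 877c2f2ffb3813e3236a55f07958009633dc8558
# build, run the upconvert reproducer from comment 35, then mark the result:
#   git bisect good    (no dm raid1 errors)
#   git bisect bad     (errors reproduce)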

Comment 39 Jonathan Earl Brassow 2018-09-26 13:41:03 UTC
(In reply to David Teigland from comment #37)
> In my own testing, I found that the problems appear to come from commit
> fade45b1d14c90eac7c6cc41019ef946659a8fb5
> 
> To cleanly revert fade45b1 from the stable branch, first revert
> 5f648406b0f2e322851fce36b062a298cb3d6736 which is fixing an unrelated
> regression from fade45b1 (see bug 1594916).
> 
> When I revert both of these on the stable branch I no longer see the problem.

I've landed on the exact same conclusion with 'git bisect' (after getting through the bcache issues).

I can also confirm that reverting those two commits fixes the problem for me.

Comment 40 Bob Handlin 2018-09-26 13:50:41 UTC
Set blocker+.  RE: Regression.

Comment 41 David Teigland 2018-09-26 14:26:16 UTC
Created attachment 1487280 [details]
lvconvert output

verbose output of lvconvert successfully adding a leg to a cluster mirror.  Using stable branch with the two identified commits reverted.

Comment 42 David Teigland 2018-09-26 14:45:31 UTC
Created attachment 1487285 [details]
lvconvert output

lvconvert successfully adding a leg to a cluster mirror with the two commits reverted.

Comment 43 David Teigland 2018-09-26 14:46:29 UTC
Created attachment 1487286 [details]
clvmd output node 1

for lvconvert in comment 42

Comment 44 David Teigland 2018-09-26 14:47:08 UTC
Created attachment 1487287 [details]
clvmd output node 2

for lvconvert in comment 42

Comment 45 David Teigland 2018-09-26 14:47:48 UTC
Created attachment 1487288 [details]
clvmd output node 3

for lvconvert in comment 42

Comment 46 Jonathan Earl Brassow 2018-09-26 16:01:57 UTC
Unfortunately, we need more time for analysis.  We are fairly certain we know the fix, but we have seen the possibility for another regression with it.  We are going to have to push for a 0-day.

We can get through this issue with a release note and a 0-day fix.

1. This issue only affects cluster mirror users
2. It can only manifest if a certain device in the mirror fails or is removed
3. It can only happen to 3+ way mirrors
4. It only happens when up-converting a mirror to a mirror
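
A quick, hedged way to check whether an existing setup matches these conditions (the cvg VG name is an assumption):

vgs -o vg_name,vg_attr cvg              # a 'c' in the 6th attr character means a clustered VG (condition 1)
lvs -a -o lv_name,segtype,devices cvg   # segtype 'mirror' with three or more mimage legs (condition 3)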

Comment 48 Oneata Mircea Teodor 2018-09-26 16:42:28 UTC
Hello,
We need PMApproved for the zero-day.

Comment 51 Zdenek Kabelac 2018-10-09 10:44:41 UTC
Hmm, after a period of tracing and hunting numerous bugs in my local fake cluster (while thinking our single-node cluster used to be a fairly precise emulation of cluster behavior), and after looking through the debug traces on Corey's cluster nodes, there is one major issue - the SUSPEND seems never to be made.

So my current findings:

The probable major issue is that there is always a PRELOAD operation followed by a direct resume (ACTIVATE), which is posted with 'flushing' - while the normal behavior was always a separate suspend without flushing in this case.
I'm not quite sure what has changed in the suspend processing in the cluster case.
A single node seems to go through the suspend->resume cycle as expected.

Other noticed issues - the DM status call for dm multipath is also using flushing - that should be eliminated (one missing dm_task_no_flush()).

Clvmd is doing scans while already holding the VG lock (thus, in the case of a missing PV, several new reprocessings of the metadata happen) - related to the whole new scanning code.

It seems the scan is able to issue a 'read-ahead' operation on a device that is later ignored by some filter.

Comment 22 shows an internal error, which could be related to the missing suspend.
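
One hedged way to see the missing suspend from the shell (the cvg/clv names are the ones from the earlier reproducer and are otherwise assumptions): poll the mirror's device-mapper state on the converting node while the upconvert runs; with the regression the new table is loaded and resumed without the state ever reporting SUSPENDED.

# terminal 1
lvconvert -m 3 cvg/clv /dev/sdd

# terminal 2, while the conversion is running
while sleep 0.2; do dmsetup info cvg-clv | grep '^State'; done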

Comment 52 Zdenek Kabelac 2018-10-09 11:45:37 UTC
ATM I'm mostly puzzled by this logged output from clvmd:

Using JUST a linear LV and creating its snapshot (so no relation to the mirror at all), I can see this:

Loading table for helter_skelter-lvol2 (253:22).
Adding target to (253:22): 0 24576 snapshot 253:20 253:21 P 8
dm table   (253:22) [ opencount flush ]   [16384] (*1)
dm reload   (253:22) [ noopencount flush ]   [16384] (*1)
Table size changed from 0 to 24576 for helter_skelter-lvol2 (253:22).
Requiring flush for LV helter_skelter/lvol1.
Leaving critical section (resumed).
dm info  LVM-p12xDsu6sa5K2mm8XD3MTZeZTa6ByIwb0XWlQHvGyfn0TyuEPFhpeeejWCmj8zwb [ opencount flush ]   [16384] (*1)
helter_skelter-lvol2: device not registered.
Monitoring helter_skelter/lvol2 with libdevmapper-event-lvm2snapshot.so.


I'm looking at how it is possible that inside the 'suspend' code we are suddenly reporting 'Leaving critical section (resumed).' - I've asked Marian for an -O0 build.



For comparison, here is the expected output on my Rawhide build:

libdm-deptree.c:1944    Creating @PREFIX@vg-snap
ioctl/libdm-iface.c:1857    dm create @PREFIX@vg-snap LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym91e2C99e50ZMorEbkUcdPlPhgx0LfO3f [ noopencount flush ]   [16384] (*1)
libdm-deptree.c:2696    Loading table for @PREFIX@vg-snap (253:9).
libdm-deptree.c:2641    Adding target to (253:9): 0 20480 snapshot 253:7 253:8 P 8
ioctl/libdm-iface.c:1857    dm table   (253:9) [ opencount flush ]   [16384] (*1)
ioctl/libdm-iface.c:1857    dm reload   (253:9) [ noopencount flush ]   [16384] (*1)
libdm-deptree.c:2746    Table size changed from 0 to 20480 for @PREFIX@vg-snap (253:9).
activate/activate.c:2345    Requiring flush for LV @PREFIX@vg/LV1.
mm/memlock.c:623     Entering critical section (suspending).
mm/memlock.c:583     Lock:   Memlock counters: prioritized:1 locked:1 critical:1 daemon:1 suspended:0
activate/dev_manager.c:3225    Creating SUSPEND_WITH_LOCKFS tree for @PREFIX@vg/LV1.
activate/dev_manager.c:779     Getting device info for @PREFIX@vg-LV1 [LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O].
ioctl/libdm-iface.c:1857    dm info  LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O [ opencount flush enablechecks ]   [16384] (*1)
ioctl/libdm-iface.c:1857    dm deps   (253:6) [ opencount flush ]   [16384] (*1)
ioctl/libdm-iface.c:1857    dm deps   (253:4) [ opencount flush ]   [16384] (*1)
activate/dev_manager.c:779     Getting device info for @PREFIX@vg-LV1-real [LVM-9V7ciyfXKFP8G7LvwPWpnc6dClye1nym9kfomEL72OoCYV4DVo1eydSyHMKXz35O-real].

Comment 53 Zdenek Kabelac 2018-10-09 16:00:13 UTC
So the missing critical-section log from comment 52 is due to the disabled 'activation' logging, which I completely forgot to watch for, since on my machine it's always set to 1...

Continuing to explore the differences.
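
For anyone retracing this, a sketch of enabling that logging for a single run (standard lvm.conf / --config usage; the command and names are just the earlier reproducer; set log/activation = 1 in /etc/lvm/lvm.conf to make it permanent):

lvconvert -vvvv -m 3 cvg/clv /dev/sdd --config 'log { activation = 1 }'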

Comment 54 Zdenek Kabelac 2018-10-10 15:10:41 UTC
Created attachment 1492603 [details]
Fixing clustered mirror patch

Patch fixes unwanted table reloading during lvconvert operation in cluster.

Comment 55 Marian Csontos 2018-10-10 15:25:19 UTC
Hi Steve, do we need this cloned? I cannot push the changes.

Comment 63 Zdenek Kabelac 2019-05-13 07:48:37 UTC
Upstreamed slightly modified version of the attached patch:

https://www.redhat.com/archives/lvm-devel/2019-May/msg00017.html

Comment 65 Jonathan Earl Brassow 2019-06-19 21:12:03 UTC
There is still a known issue in cluster mirrors that needs to be resolved - bug 1711427 (also a blocker).  It would be best if this bug was tested with that fix also (should be available soon).

Comment 66 Corey Marthaler 2019-07-01 21:19:19 UTC
This scenario is once again passing with the latest rpms. Marking verified.

3.10.0-1057.el7.x86_64

lvm2-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-libs-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-cluster-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-lockd-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-python-boom-0.9-18.el7    BUILT: Fri Jun 21 04:18:58 CDT 2019
cmirror-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-persistent-data-0.8.5-1.el7    BUILT: Mon Jun 10 03:58:20 CDT 2019




********* Mirror hash info for this scenario *********
* names:              syncd_secondary_3legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/mapper/mpathd1 /dev/mapper/mpathc1 /dev/mapper/mpathg1
* log devices:        /dev/mapper/mpathb1
* no MDA devices:     
* failpv(s):          /dev/mapper/mpathc1
* failnode(s):        harding-02 harding-03
* lvmetad:            0
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on harding-02...
harding-02: lvcreate -ay --type mirror -m 2 -n syncd_secondary_3legs_1 -L 500M helter_skelter /dev/mapper/mpathd1:0-2400 /dev/mapper/mpathc1:0-2400 /dev/mapper/mpathg1:0-2400 /dev/mapper/mpathb1:0-150
WARNING: gfs2 signature detected on /dev/helter_skelter/syncd_secondary_3legs_1 at offset 65536. Wipe it? [y/n]: [n]
  Aborted wiping of gfs2.
  1 existing signature left on the device.

Current mirror/raid device structure(s):
  LV                                 Attr       LSize    Cpy%Sync Devices                                                                                                    
   syncd_secondary_3legs_1            mwi-a-m---  500.00m 6.40     syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_1(0),syncd_secondary_3legs_1_mimage_2(0)
   [syncd_secondary_3legs_1_mimage_0] Iwi-aom---  500.00m          /dev/mapper/mpathd1(0)                                                                                     
   [syncd_secondary_3legs_1_mimage_1] Iwi-aom---  500.00m          /dev/mapper/mpathc1(0)                                                                                     
   [syncd_secondary_3legs_1_mimage_2] Iwi-aom---  500.00m          /dev/mapper/mpathg1(0)                                                                                     
   [syncd_secondary_3legs_1_mlog]     lwi-aom---    4.00m          /dev/mapper/mpathb1(0)                                                                                     

Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Sleeping 15 sec

Creating gfs2 on top of mirror(s) on harding-02...
mkfs.gfs2 -o align=0 -J 32M -j 2 -p lock_dlm -t HARDING:gfs1 /dev/helter_skelter/syncd_secondary_3legs_1 -O
Warning: device is not properly aligned. This may harm performance.
Mounting mirrored gfs2 filesystems on harding-02...
Mounting mirrored gfs2 filesystems on harding-03...

PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1
PV=/dev/mapper/mpathc1
        syncd_secondary_3legs_1_mimage_1: 4.1
Writing verification files (checkit) to mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Sleeping 15 seconds to get some outsanding I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----
        ---- harding-03 ----

Failing path sdn on harding-02
Failing path sdv on harding-02
Failing path sdn on harding-03
Failing path sdv on harding-03
Failing path sdad on harding-03
Failing path sdf on harding-03

Attempting I/O to cause mirror down conversion(s) on harding-02
dd if=/dev/zero of=/mnt/syncd_secondary_3legs_1/ddfile count=10 bs=4M
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.059059 s, 710 MB/s
Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  LV                                    Attr       LSize    Cpy%Sync Devices                                                                   
   syncd_secondary_3legs_1               cwi-aom---  500.00m 100.00   syncd_secondary_3legs_1_mimagetmp_3(0),syncd_secondary_3legs_1_mimage_3(0)
   [syncd_secondary_3legs_1_mimage_0]    Iwi-aom---  500.00m          /dev/mapper/mpathd1(0)                                                    
   [syncd_secondary_3legs_1_mimage_2]    Iwi-aom---  500.00m          /dev/mapper/mpathg1(0)                                                    
   [syncd_secondary_3legs_1_mimage_3]    iwi-aom---  500.00m          /dev/mapper/mpatha1(0)                                                    
   [syncd_secondary_3legs_1_mimagetmp_3] mwi---m---  500.00m          syncd_secondary_3legs_1_mimage_0(0),syncd_secondary_3legs_1_mimage_2(0)   
   [syncd_secondary_3legs_1_mlog]        lwi-aom---    4.00m          /dev/mapper/mpathb1(0)                                                    

Verifying FAILED device /dev/mapper/mpathc1 is *NOT* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
olog: 1
Verifying LOG device(s) /dev/mapper/mpathb1 *ARE* in the mirror(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
Verifying LEG device /dev/mapper/mpathd1 *IS* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
Verifying LEG device /dev/mapper/mpathg1 *IS* in the volume(s)
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
verify the dm devices associated with /dev/mapper/mpathc1 have been removed as expected
Checking REMOVAL of syncd_secondary_3legs_1_mimage_1 on:  harding-02 harding-03
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_secondary_3legs_1_mimage_3 on:  harding-02 harding-03

Verify that the mirror image order remains the same after the down conversion
EXPECTED LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathg1 unknown
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
  Couldn't find device with uuid dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb.
ACTUAL LEG ORDER: /dev/mapper/mpathd1 /dev/mapper/mpathg1 /dev/mapper/mpatha1
Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----

        ---- harding-03 ----

Enabling path sdn on harding-02
Enabling path sdv on harding-02
Enabling path sdn on harding-03
Enabling path sdv on harding-03
Enabling path sdad on harding-03
Enabling path sdf on harding-03

-------------------------------------------------------------------------------
Force a vgreduce to either clean up the corrupt additional LV or lack of PV MDA
-------------------------------------------------------------------------------
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 81 vs previous 89.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 81 vs previous 89.
  WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 89
Recreating PVs /dev/mapper/mpathc1 and then extending back into helter_skelter
harding-02 pvcreate /dev/mapper/mpathc1
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  Scan of VG helter_skelter from /dev/mapper/mpathc1 found metadata seqno 90 vs previous 91.
  WARNING: Inconsistent metadata found for VG helter_skelter - updating to use version 91
  WARNING: Removing PV /dev/mapper/mpathc1 (dRUaO0-VFyt-vt9v-qMHq-DNiM-km3D-928ADb) that no longer belongs to VG helter_skelter
harding-02 vgextend helter_skelter /dev/mapper/mpathc1


Verify that each of the raid repairs finished successfully

Checking for leftover '-missing_0_0' or 'unknown devices'

Checking for PVs marked as missing (a-m)...

Checking for PVs marked as missing (a-m)...
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Sleeping 15 sec
Sleeping 15 sec

Verifying files (checkit) on mirror(s) on...
        ---- harding-02 ----

        ---- harding-03 ----

Stopping the io load (collie/xdoio) on mirror(s)
Unmounting gfs and removing mnt point on harding-02...
Unmounting gfs and removing mnt point on harding-03...

Deactivating and removing mirror(s)
Checking for left over mirror dm devices on
         ...harding-02 ...harding-03

Comment 68 errata-xmlrpc 2019-08-06 13:10:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2253

