Bug 222502

Summary: failures with dmevent cause cmirror issues
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: cmirror
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: high
Version: 4
CC: agk, dwysocha, jbrassow, mbroz, prockai
Target Milestone: ---
Keywords: Reopened, TestBlocker
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-03-02 13:36:21 UTC

Description Corey Marthaler 2007-01-12 23:56:01 UTC
Description of problem:
I've tried this case many times with single-node mirrors and it always appears
to work, but when I tried it on a cmirror with the latest rpms, it failed to
down convert the cmirror to a linear. This causes the volume group and mirror to
appear corrupted and also causes I/O errors when attempting to do I/O to the mirror.

[root@link-02 ~]# lvs -a -o +devices
  LV                VG   Attr   LSize   Origin Snap%  Move Log         Copy%  Devices
  mirror            vg   mwi-a- 100.00M                    mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] vg   iwi-ao 100.00M                                       /dev/sdh1(0)
  [mirror_mimage_1] vg   iwi-ao 100.00M                                       /dev/sda1(0)
  [mirror_mlog]     vg   lwi-ao   4.00M                                       /dev/sdb1(0)

# Here I turned off /dev/sdh

[root@link-02 ~]# lvs -a -o +devices
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 104792064: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 1296954228736: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 1024 at 259384082432: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 512 at 259384082432: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 512 at 259384082432: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 1024 at 259384082432: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'KaBiI2-y4zs-4EPW-sPPS-X5uZ-gWLt-7zTctH'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'KaBiI2-y4zs-4EPW-sPPS-X5uZ-gWLt-7zTctH'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'KaBiI2-y4zs-4EPW-sPPS-X5uZ-gWLt-7zTctH'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh3: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdh5: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'KaBiI2-y4zs-4EPW-sPPS-X5uZ-gWLt-7zTctH'.
  Couldn't find all physical volumes for volume group vg.
  Volume group "vg" not found


# I then attempted to write to the mirror to force the down convert, but it
didn't work, so I then attempted a write from all nodes in the cluster, and
lvm still reports this volume group and mirror as corrupted.




Version-Release number of selected component (if applicable):
2.6.9-43.ELsmp
lvm2-cluster-2.02.18-1.el4
lvm2-2.02.18-1.el4
device-mapper-1.02.14-2.el4
cmirror-1.0.1-0
cmirror-kernel-smp-2.6.9-18.6

Comment 1 Corey Marthaler 2007-01-15 16:25:04 UTC
I tried that again with gfs on top of the 2 cmirrors and had I/O going to them
from all nodes in the cluster. After the primary leg failure, I again saw this
issue and all the I/O to those mirrors ended up deadlocking.

This bug needs to be on the blocker list for cmirrors. 

Comment 2 Kiersten (Kerri) Anderson 2007-01-15 16:33:13 UTC
Devel ACK for blocker-beta for cluster 4.5

Comment 3 Jonathan Earl Brassow 2007-01-15 16:43:01 UTC
Was the cluster mirror in-sync?

Did you get any core dumps?  (There have been changes made to dmeventd - check
to ensure that it is running.)


Comment 4 Corey Marthaler 2007-01-15 17:13:57 UTC
The mirrors have always been in-sync before attempting the failure case and I
didn't see any core dumps and have no reason to believe that dmeventd wasn't
running.

Comment 5 Corey Marthaler 2007-01-15 18:34:22 UTC
It appears that dmeventd is running right up until a write is attempted to the
failed mirror, at which time it stops for some reason.

Comment 6 Jonathan Earl Brassow 2007-01-15 19:07:56 UTC
where "for some reason" == seg fault

try:

sysctl kernel.core_pattern=/tmp/core

when you start up.  Then your core files will appear as /tmp/core.*
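
A minimal sketch for confirming that a crash actually left a core behind, assuming the /tmp/core pattern set above and that kernel.core_uses_pid is enabled so the files get a .PID suffix. list_cores is a hypothetical helper, not part of any LVM or cluster tool.

```shell
# List core files written under a given core_pattern prefix,
# e.g. the prefix /tmp/core matches /tmp/core.3107.
# Prints nothing if no core file exists yet.
list_cores() {
    prefix=$1
    for f in "$prefix".*; do
        # An unmatched glob stays literal, so check the file really exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    return 0
}
```

Usage: run list_cores /tmp/core after reproducing the failure. If nothing is listed, one thing worth checking is the core size limit (ulimit -c) of the shell that started dmeventd, since a limit of 0 suppresses core files.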



Comment 7 Corey Marthaler 2007-01-15 20:55:59 UTC
I am also seeing cmirror creation issues now due to what also appears to be
dmeventd failures. Changing title of this bug and bumping priority.

[root@link-08 ~]# lvcreate -m 1 -L 1G -n mirror vg /dev/sda1:0-2000 /dev/sdb1:0-2000 /dev/sdh1:0-100
  Error locking on node link-07: vg-mirror: event registration failed:
libdevmapper-event-lvm2mirror.so
LVM-RgEPrNphjR3yUmyDl8At2ccc3MBgeEYiq09j2Q60abIsnd8liJdela6Z05Sotbss 65280 0
  Error locking on node link-04: vg-mirror: event registration failed:
libdevmapper-event-lvm2mirror.so
LVM-RgEPrNphjR3yUmyDl8At2ccc3MBgeEYiq09j2Q60abIsnd8liJdela6Z05Sotbss 65280 0
  Error locking on node link-02: vg-mirror: event registration failed:
libdevmapper-event-lvm2mirror.so
LVM-RgEPrNphjR3yUmyDl8At2ccc3MBgeEYiq09j2Q60abIsnd8liJdela6Z05Sotbss 65280 0
  Error locking on node link-08: vg-mirror: event registration failed:
libdevmapper-event-lvm2mirror.so
LVM-RgEPrNphjR3yUmyDl8At2ccc3MBgeEYiq09j2Q60abIsnd8liJdela6Z05Sotbss 65280 0
  Failed to activate new LV.

Comment 8 Corey Marthaler 2007-01-29 20:39:18 UTC
Just a note that QA is still seeing this issue and that this is blocking any
extensive cmirror testing. 

dmeventd:
[...]
select(6, [5], NULL, NULL, {1, 0})      = 0 (Timeout)
select(6, [5], NULL, NULL, {1, 0})      = 0 (Timeout)
select(6, [5], NULL, NULL, {1, 0})      = 0 (Timeout)
select(6, [5], NULL, NULL, {1, 0})      = 0 (Timeout)
select(6, [5], NULL, NULL, {1, 0})      = ? ERESTARTNOHAND (To be restarted)
PANIC: attached pid 3107 exited
Process 3107 detached


lvm2-cluster-2.02.20-1.el4
lvm2-2.02.20-1.el4
device-mapper-1.02.16-1.el4
2.6.9-43.ELsmp

Comment 9 Corey Marthaler 2007-01-29 23:24:52 UTC
The locking model in lvm.conf changed. 
Woulda been nice to know 10 days ago. :)
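
A quick way to compare the setting across nodes, assuming the change in question is the locking_type value in /etc/lvm/lvm.conf (the comment above does not name the setting). lvm_locking_type is a hypothetical helper, not an LVM command.

```shell
# Print the uncommented locking_type value from an lvm.conf-style file.
# Defaults to /etc/lvm/lvm.conf; pass another path as the first argument.
# If several uncommented settings exist, the last one found is printed.
lvm_locking_type() {
    conf=${1:-/etc/lvm/lvm.conf}
    sed -n 's/^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1
}
```

Usage: run lvm_locking_type on every node in the cluster and check that all nodes report the same value, and that it is the one the installed lvm2-cluster package expects.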