Bug 236289 - simultaneous device failures on cmirrors from multiple VGs can lead to clvmd deadlock
Status: CLOSED WORKSFORME
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Version: 4
Platform: All Linux
Priority: medium  Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2007-04-12 17:06 EDT by Corey Marthaler
Modified: 2010-01-11 21:03 EST
Doc Type: Bug Fix
Last Closed: 2007-09-05 12:38:20 EDT

Attachments
  verbose output from failed convert (36.88 KB, text/plain) - 2007-04-24 13:02 EDT, Jonathan Earl Brassow
  corresponding clvmd output when attempting failed lock operation (1.14 KB, text/plain) - 2007-04-24 13:04 EDT, Jonathan Earl Brassow
  dmsetup info after the failed command (1.95 KB, application/octet-stream) - 2007-04-24 13:06 EDT, Jonathan Earl Brassow
  clvmd patch to fix return code (753 bytes, patch) - 2007-04-26 11:50 EDT, Christine Caulfield

Description Corey Marthaler 2007-04-12 17:06:24 EDT
Description of problem:
There appears to be an issue when multiple cmirrors from different volume
groups all experience leg device failures at the same time. The down
conversion to linear appears to work; however, the up conversion back to a
cmirror can time out and lead to future clvmd command hangs, as well as I/O to
those cmirrors deadlocking.

I started with 3 mirrors on 3 VGs:
[root@link-08 ~]# lvs -a -o +devices
  LV                  VG         Attr   LSize   Origin Snap%  Move Log           Copy%  Devices
  LogVol00            VolGroup00 -wi-ao  72.38G                                         /dev/hda2(0)
  LogVol01            VolGroup00 -wi-ao   1.94G                                         /dev/hda2(2316)
  cmirror1            feist1     Mwi-ao 500.00M                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] feist1     iwi-ao 500.00M                                         /dev/sda1(0)
  [cmirror1_mimage_1] feist1     iwi-ao 500.00M                                         /dev/sdb1(0)
  [cmirror1_mlog]     feist1     lwi-ao   4.00M                                         /dev/sdc1(0)
  cmirror2            feist2     Mwi-ao 500.00M                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] feist2     iwi-ao 500.00M                                         /dev/sda2(0)
  [cmirror2_mimage_1] feist2     iwi-ao 500.00M                                         /dev/sdd1(0)
  [cmirror2_mlog]     feist2     lwi-ao   4.00M                                         /dev/sde1(0)
  cmirror3            feist3     Mwi-ao 500.00M                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] feist3     iwi-ao 500.00M                                         /dev/sda3(0)
  [cmirror3_mimage_1] feist3     iwi-ao 500.00M                                         /dev/sdf1(0)
  [cmirror3_mlog]     feist3     lwi-ao   4.00M                                         /dev/sdh1(0)


I then failed /dev/sda (the primary leg of each cmirror).

The down conversion worked:
[root@link-08 ~]# lvs -a -o +devices
  /dev/sda: read failed after 0 of 4096 at 0: Input/output error
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  /dev/sda2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sda3: read failed after 0 of 2048 at 0: Input/output error
  LV       VG         Attr   LSize   Origin Snap%  Move Log Copy%  Devices
  LogVol00 VolGroup00 -wi-ao  72.38G                               /dev/hda2(0)
  LogVol01 VolGroup00 -wi-ao   1.94G                               /dev/hda2(2316)
  cmirror1 feist1     -wi-ao 500.00M                               /dev/sdb1(0)
  cmirror2 feist2     -wi-ao 500.00M                               /dev/sdd1(0)
  cmirror3 feist3     -wi-ao 500.00M                               /dev/sdf1(0)

I then brought /dev/sda back up, extended the 3 volume groups, and attempted to
up convert the three (now linear) volumes back to mirrors; the lvconvert
attempts are shown below.
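The pvcreate/vgextend commands for that re-extension step aren't shown in the
transcript; assuming sda1, sda2 and sda3 were returned to feist1, feist2 and
feist3 as in the original layout above, they would have looked roughly like
this (a sketch, not the exact commands run):

  pvcreate /dev/sda1 /dev/sda2 /dev/sda3
  vgextend feist1 /dev/sda1
  vgextend feist2 /dev/sda2
  vgextend feist3 /dev/sda3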

# WORKED:
[root@link-08 ~]# lvconvert -m 2 --corelog feist1/cmirror1
  Logical volume cmirror1 converted.

# FAILED:
[root@link-08 ~]# lvconvert -m 2 --corelog feist2/cmirror2
  Error locking on node link-02: Command timed out
  Problem reactivating cmirror2

# FAILED:
[root@link-08 ~]# lvconvert -m 2  feist3/cmirror3
  Error locking on node link-02: Command timed out
  Aborting. Failed to activate mirror log. Remove new LVs and retry.
  Failed to create mirror log.

From here I attempted to remove the zombie mirror log created from the failed
mirror up conversion:

[root@link-08 ~]# lvremove -f feist3/cmirror3_mlog
[HANG]

An lvs hung as well, as did I/O to those two failed mirrors.

From here I attempted to create another mirror and then failed that disk again.

I'll try to reproduce this more cleanly and include more information.

Version-Release number of selected component (if applicable):
2.6.9-50.ELsmp
cmirror-kernel-2.6.9-30.0
lvm2-cluster-2.02.21-3.el4
Comment 1 Corey Marthaler 2007-04-12 17:10:21 EDT
Current state according to dm:
[root@link-02 ~]# dmsetup ls --tree
feist1-cmirror1 (253:5)
 ├─feist1-cmirror1_mimage_2 (253:4)
 │  └─ (8:33)
 ├─feist1-cmirror1_mimage_1 (253:3)
 │  └─ (8:1)
 └─feist1-cmirror1_mimage_0 (253:2)
    └─ (8:17)
feist3-cmirror3 (253:13)
 └─ (8:81)
VolGroup00-LogVol01 (253:1)
 └─ (3:2)
VolGroup00-LogVol00 (253:0)
 └─ (3:2)
feist2-cmirror2 (253:9)
 ├─feist2-cmirror2_mimage_2 (253:8)
 │  └─ (8:65)
 ├─feist2-cmirror2_mimage_1 (253:7)
 │  └─ (8:2)
 └─feist2-cmirror2_mimage_0 (253:6)
    └─ (8:49)
[root@link-02 ~]# dmsetup info
Name:              feist1-cmirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      96
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU85DIlXwM9ob2sZDz2DutZPhEt1LJwGwLA

Name:              feist2-cmirror2_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjkjbHKZ9o4vZjOVwjKIQIq5ChSsHd4ywX

Name:              feist3-cmirror3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      11
Major, minor:      253, 13
Number of targets: 1
UUID: LVM-sfkiHiFe0zVVWyVd3khoI0zxEh9FpgMcpME61KCpsdYeLPlfZNGOo2zbgMgD8oyB

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-i9Sdf1SXACVF330Sb7WdqECxIr1LdJtDxCyO7AGloRnyFBWnRK03Ir4tIJVh1hdX

Name:              feist1-cmirror1_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8GeoQCiJmu8ffFpftl0UZmjbU148URkSK

Name:              feist2-cmirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjpoXz0hDcRA4im0JrngCd5I7enUcXHjxO

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-i9Sdf1SXACVF330Sb7WdqECxIr1LdJtDiNgxxFoLGSZ82yY0VHP39K4fbTxLg5Xn

Name:              feist1-cmirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8Bp1l0PRAoCGH2Nt4M1oFhIC5lESFi5c8

Name:              feist2-cmirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLjGHn8dzAFodD83lBnnenfpZSI0YLllsoA

Name:              feist2-cmirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      114
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-b3Ca4y43lc3RIehcmyUBu81nNPs8YPLj24BUZX1wO88bvfBjAXfRM9RhWMaAOjuT

Name:              feist1-cmirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-kfyqMdtVGsEVpN1cvtX3NDTsV2FeSKU8376uhV36hkbITljHBGjMmNoHWrTGseDX
Comment 2 Jonathan Earl Brassow 2007-04-17 11:49:33 EDT
likely similar to 236432
Comment 3 Jonathan Earl Brassow 2007-04-24 13:02:44 EDT
Created attachment 153366 [details]
verbose output from failed convert

Steps to get this (you may find a simpler way):
1) vg containing a mirror is reduced by one device (sda) due to its failure
2) turn device back on and do a pvcreate on it
3) remove the VG (and all its volumes)
4) create all new pv's
5) create new volume group of the same name
6) create linear device
7) try to up convert (the attachment is a log of this failed step)
Comment 4 Jonathan Earl Brassow 2007-04-24 13:04:44 EDT
Created attachment 153368 [details]
corresponding clvmd output when attempting failed lock operation
Comment 5 Jonathan Earl Brassow 2007-04-24 13:06:58 EDT
Created attachment 153369 [details]
dmsetup info after the failed command

Note that the log device (vg1-lv1-mlog) and the mirror device (vg1-lv1) are
still suspended... It's likely that the failed lock op was supposed to resume
these. The lock op UUID corresponds to the top-level mirror device.
Comment 6 Jonathan Earl Brassow 2007-04-24 13:10:47 EDT
The failed lock op at 0x9e == LCK_LV_RESUME
Comment 7 Christine Caulfield 2007-04-25 10:07:59 EDT
When I reproduce this (or something similar; the steps I use are the same),
the clvmd process is stuck in ioctl(11, DM_TABLE_LOAD...).

The sysrq-T output seems to indicate it's waiting for a mempool?

<ffffffff8015ca93>{mempool_resize+317}
<ffffffffa009fe5b>{:dm_mod:resize_pool
<ffffffff8030ad52>{__down_write+134}
<ffffffffa00e49c2>{:dm_mirror:mirror_ctr+118
<ffffffffa009d0bb>{:dm_mod:dm_table_add_target+328}
<ffffffffa009f29c>{:dm_mod:table_load+225}
<ffffffffa009f1bb>{:dm_mod:table_load+
<ffffffffa009fc34>{:dm_mod:ctl_ioctl+600}
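
For reference, a sysrq-T dump like the one above can be captured on demand on
the affected node (assuming the magic SysRq interface is available):

  echo t > /proc/sysrq-trigger

The task states and call traces then show up in the kernel log/dmesg.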
Comment 8 Jonathan Earl Brassow 2007-04-25 10:42:56 EDT
I recently checked in a fix for mempools in cluster mirror, but the trace seems
to end in dm_mod, so I don't think it's related. The fix was to clean up
(mempool_destroy) when mirrors were shut down.

Comment 9 Jonathan Earl Brassow 2007-04-25 10:47:01 EDT
here are my backtraces when it happens:

Apr 25 09:44:55 neo-04 kernel: clvmd         S C2301E98  2204  3375      1     
    3402  3363 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: c2301ec4 00000082 0000001e c2301e98 c011cd42
c2396230 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel:        c2396230 00000000 c201d740 c201cde0
00000000 00000000 d69f6800 000f4239
Apr 25 09:44:55 neo-04 kernel:        c2396230 f6244730 f624489c 00000002
00000246 00042c80 00042c80 00000000
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<c011cd42>] recalc_task_prio+0x128/0x133
Apr 25 09:44:55 neo-04 kernel:  [<c02d4616>] schedule_timeout+0x139/0x154
Apr 25 09:44:55 neo-04 kernel:  [<c012a73a>] process_timeout+0x0/0x5
Apr 25 09:44:55 neo-04 kernel:  [<c016bdbc>] do_select+0x291/0x2c6
Apr 25 09:44:55 neo-04 kernel:  [<c016b9f1>] __pollwait+0x0/0x95
Apr 25 09:44:55 neo-04 kernel:  [<c016c0e4>] sys_select+0x2e0/0x43a
Apr 25 09:44:55 neo-04 kernel:  [<c02799e1>] sock_ioctl+0x2a7/0x2b4
Apr 25 09:44:55 neo-04 kernel:  [<c015a9f0>] sys_close+0x86/0x92
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel: clvmd         S 00004480  3380  3402      1     
    3403  3375 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: f61e2f14 00000082 00000031 00004480 c2396230
f706d930 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel:        f706d930 00000000 c201d740 c201cde0
00000000 00000000 d69025c0 000f4239
Apr 25 09:44:55 neo-04 kernel:        f706d930 f62713b0 f627151c 00000002
f708f2a0 f708f298 f7049e80 f61e2f58
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<f89a2c91>] dlm_read+0xd5/0x3fc [dlm]
Apr 25 09:44:55 neo-04 kernel:  [<c011e851>] __wake_up+0x29/0x3c
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c0134d87>] wake_futex+0x3a/0x44
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c015b082>] vfs_read+0xb6/0xe2
Apr 25 09:44:55 neo-04 kernel:  [<c015b297>] sys_read+0x3c/0x62
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel: clvmd         S 00000003  2372  3403      1     
    3439  3402 (NOTLB)
Apr 25 09:44:55 neo-04 kernel: f5aa5eb4 00000082 00000006 00000003 f6244730
f6244730 00000019 c2024de0
Apr 25 09:44:55 neo-04 kernel:        f6244730 00000000 c201d740 c201cde0
00000000 00000000 d69f6800 000f4239
Apr 25 09:44:55 neo-04 kernel:        f6244730 f62708b0 f6270a1c 00000002
c0150845 00000000 7fffffff f5aa5000
Apr 25 09:44:55 neo-04 kernel: Call Trace:
Apr 25 09:44:55 neo-04 kernel:  [<c0150845>] find_extend_vma+0x12/0x4f
Apr 25 09:44:55 neo-04 kernel:  [<c02d4596>] schedule_timeout+0xb9/0x154
Apr 25 09:44:55 neo-04 kernel:  [<c0150845>] find_extend_vma+0x12/0x4f
Apr 25 09:44:55 neo-04 kernel:  [<c01203c5>] add_wait_queue+0x12/0x30
Apr 25 09:44:55 neo-04 kernel:  [<c01352dd>] futex_wait+0x133/0x194
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c011e7cb>] default_wake_function+0x0/0xc
Apr 25 09:44:55 neo-04 kernel:  [<c0162255>] blkdev_put+0x184/0x18f
Apr 25 09:44:55 neo-04 kernel:  [<c013558e>] do_futex+0x29/0x5a
Apr 25 09:44:55 neo-04 kernel:  [<c01356c0>] sys_futex+0x101/0x10c
Apr 25 09:44:55 neo-04 kernel:  [<c02d5e97>] syscall_call+0x7/0xb
Apr 25 09:44:55 neo-04 kernel:  [<c02d007b>] unix_dgram_recvmsg+0x5d/0x1da
Comment 10 Jonathan Earl Brassow 2007-04-25 12:19:28 EDT
If I do the steps in comment #3 *except* I use a different vgname in step 5, I
don't hit the problem.
Comment 11 Christine Caulfield 2007-04-26 11:50:44 EDT
Created attachment 153525 [details]
clvmd patch to fix return code

I'm having difficulty reproducing this when memory pressure is discounted.
However, I did spot a potential problem that could leave locks lying around.
Comment 12 Christine Caulfield 2007-04-27 08:50:41 EDT
With extreme care I can now reproduce this without putting memory pressure on
the machines (using touch rather than find|cpio). However, the traceback I get
is always identical to the one I posted in comment #7.

If I rename the VGs as Jon did in comment #10, the problem still occurs just
the same.
Comment 13 Jonathan Earl Brassow 2007-04-27 12:43:25 EDT
You don't need 3 VGs to do this, and you don't need a FS or I/O. Relevant
history (a sketch of the off.sh/on.sh helpers follows the history):



 1045  pvcreate /dev/sd*1
 1046  vgcreate vg /dev/sd*1
 1047  lvcreate -L 1G -n lv vg
 1049  lvconvert -m1 vg/lv # wait for sync
 1051  ./off.sh sda # on all machines
 1052  vgreduce --removemissing vg
 1054  ./on.sh sda # on all machines
 1055  pvcreate /dev/sda1
 1056  vgchange -an vg
 1057  lvremove -ff vg
 1059  vgremove vg
 1060  pvcreate /dev/sd*1
 1061  vgcreate vg /dev/sd*1
 1062  lvcreate -L 1G -n lv vg /dev/sda1
 1063  lvconvert -m1 vg/lv -vvvv /dev/sd[bc]1 >& output # BAD!
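
The off.sh/on.sh helpers above are not attached. A minimal sketch of what such
scripts typically do on a 2.6.9-era box, assuming the standard SCSI sysfs
interface (the real QE scripts may differ):

  #!/bin/bash
  # off.sh: take a SCSI disk offline so I/O to it fails, e.g. ./off.sh sda
  echo offline > /sys/block/$1/device/state

  #!/bin/bash
  # on.sh: bring the disk back, e.g. ./on.sh sda
  echo running > /sys/block/$1/device/state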
Comment 14 Jonathan Earl Brassow 2007-04-27 13:32:16 EDT
Killing and restarting clvmd between steps 1062 and 1063 allows the command to
succeed.
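
For the record, that restart is just bouncing the daemon on the node, e.g.
(assuming the clvmd init script shipped with RHCS; killing the process and
starting clvmd by hand works too):

  service clvmd restart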
Comment 15 Jonathan Earl Brassow 2007-04-27 14:13:54 EDT
I get a problem in _vg_read_by_vgid()

        /* Is corresponding vgname already cached? */
        if ((vginfo = vginfo_from_vgid(vgid)) &&
            vginfo->vgname && *vginfo->vgname) {
                if ((vg = _vg_read(cmd, vginfo->vgname, vgid,
                                   &consistent, precommitted)) &&
                    !strncmp((char *)vg->id.uuid, vgid, ID_LEN)) {
                        if (!consistent) {
                                log_error("Volume group %s metadata is "
                                          "inconsistent", vginfo->vgname);
                                if (!partial_mode())
                                        return NULL; ****> HERE
                        }
                        return vg;
                }
        }
Comment 16 Jonathan Earl Brassow 2007-04-27 14:56:55 EDT
Continuing from the code in comment #15 and looking into _vg_read():

        /* Find the vgname in the cache */
        /* If it's not there we must do full scan to be completely sure */
        if (!(fmt = fmt_from_vgname(vgname, vgid))) {

The contents of that if statement are not run... I think they should be,
especially since the resulting VG is missing a PV after it is done.

Comment 17 Jonathan Earl Brassow 2007-04-27 15:29:45 EDT
It seems that the cache is populated with a VG that is missing one of the PVs.
Comment 18 Jonathan Earl Brassow 2007-04-27 17:56:07 EDT
Another thing that puzzles me is that the mirror seems to be resumed before
the log sub-LV... This will certainly cause things to hang, because when the
mirror resumes it tries to read the log device. If the log device is
suspended, it will obviously hang.

That would likely explain your hang in comment #7.
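
To illustrate the ordering constraint with plain dmsetup (a sketch using the
vg1-lv1 names from the attachments, showing what the activation sequence is
expected to do rather than a fix):

  # the log must come out of suspend before the mirror that reads it
  dmsetup resume vg1-lv1-mlog
  dmsetup resume vg1-lv1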
Comment 19 Jonathan Earl Brassow 2007-04-27 18:17:41 EDT
you know...  I may be chasing a different bug.

I can't reproduce this issue when I simply pvcreate/vgextend the failed volumes
before doing an upconvert.

The timeouts in the original comment could be caused by the low timeout in
clvmd (which has been upped in the rpm via a command-line option to clvmd).
I'm not sure about what Patrick is seeing...
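
For reference, the clvmd timeout being referred to is presumably the one set
with clvmd's -t option; raising it looks roughly like this (the exact value
used by the updated rpm isn't stated here, so the number is illustrative):

  clvmd -t 90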

Could we all update to the latest rpms and retry? Corey, I'd especially like
to hear if you recreate it.
Comment 20 Christine Caulfield 2007-04-30 05:44:59 EDT
Those steps don't reproduce it for me; it works perfectly every time.
Obviously my device names are different, but I only need to substitute sdm
for sda.
Comment 21 Corey Marthaler 2007-05-02 01:22:13 EDT
I too was unable to re-create this issue with the latest build
(cmirror-kernel-2.6.9-32.0); this may have been fixed.


# After the failure and re-extension of the vg...
[root@link-08 ~]# lvconvert -m 2 --corelog vg1/mirror1
  Logical volume mirror1 converted.
[root@link-08 ~]# lvconvert -m 2 --corelog vg2/mirror2
  Logical volume mirror2 converted.
[root@link-08 ~]# lvconvert -m 2  vg3/mirror3
  Logical volume mirror3 converted.
Comment 22 Kiersten (Kerri) Anderson 2007-05-02 11:14:16 EDT
Moving to NEEDINFO until it can be recreated again.
Comment 23 Kiersten (Kerri) Anderson 2007-05-21 18:16:31 EDT
Moving defect to 4.6 list to track for the next release if more information is
uncovered.
Comment 24 Corey Marthaler 2007-09-05 12:38:20 EDT
I appear to have reproduced this, but it looks like that may just be
because I hit bz 227155 first. For now I'm going to close this bz and
make a note in bz 227155 about it possibly being related to this bug.
