Bug 1105732

Summary: "Attempted to decrement suspended device counter below zero" when deactivating VG containing cache volumes
Product: Red Hat Enterprise Linux 7
Component: lvm2
Sub component: Cache Logical Volumes
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Keywords: Reopened
Reporter: Corey Marthaler <cmarthal>
Assignee: Zdenek Kabelac <zkabelac>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cmarthal, heinzm, jbrassow, msnitzer, prajnoha, zkabelac
Target Milestone: rc
Doc Type: Bug Fix
Type: Bug
Clones: 1185916
Bug Blocks: 1119326, 1185916
Last Closed: 2015-01-26 15:58:20 UTC

Description Corey Marthaler 2014-06-06 20:45:36 UTC
Description of problem:
This may just be more fallout from bug 1091553.

lvremove -f /dev/cache_sanity/snap1
couldn't remove volume snap1

Internal error: Performing unsafe table load while 4 device(s) are known to be suspended:  (253:8) 
   device-mapper: resume ioctl on  failed: Invalid argument
   Unable to resume cache_sanity-rename_orig_A-real (253:8)
   Failed to resume rename_orig_A.
   libdevmapper exiting with 4 device(s) still suspended.

sleep 10

[root@host-034 ~]# lvs
  LV            VG            Attr       LSize   Pool          Origin                Data%
  rename_orig_A cache_sanity  owi-s-C---   4.00g rename_pool_A [rename_orig_A_corig]
  rename_pool_A cache_sanity  Cwi-s-C---   4.00g
  snap1         cache_sanity  -wi-a----- 500.00m
  snap2         cache_sanity  swi-s-s--- 500.00m               rename_orig_A           0.54

[root@host-034 ~]# lvremove -f /dev/cache_sanity/snap1
  Logical volume "snap1" successfully removed

[root@host-034 ~]# vgchange -an cache_sanity
  Attempted to decrement suspended device counter below zero.
  Attempted to decrement suspended device counter below zero.
  Attempted to decrement suspended device counter below zero.
  Attempted to decrement suspended device counter below zero.
  Attempted to decrement suspended device counter below zero.
  0 logical volume(s) in volume group "cache_sanity" now active
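
For context, the repeated warning above is a defensive clamp in libdevmapper's suspended-device bookkeeping: when resume events outnumber recorded suspends (e.g. after an interrupted operation left the counter inconsistent), the library warns instead of letting the counter go negative. The following is a minimal, hypothetical Python sketch of that kind of guard; the class and method names are illustrative only, not the actual lvm2/libdevmapper code.

```python
# Hypothetical sketch of a guarded suspend/resume counter, mirroring the
# defensive check behind the "Attempted to decrement suspended device
# counter below zero" warning. Not the actual lvm2 implementation.

class SuspendedCounter:
    def __init__(self):
        self.count = 0        # number of devices currently suspended
        self.warnings = []    # warnings emitted by unbalanced resumes

    def device_suspended(self):
        self.count += 1

    def device_resumed(self):
        if self.count == 0:
            # Unbalanced resume: warn and clamp rather than go negative.
            self.warnings.append(
                "Attempted to decrement suspended device counter below zero.")
            return
        self.count -= 1

c = SuspendedCounter()
c.device_suspended()
c.device_resumed()
c.device_resumed()  # unbalanced: triggers the warning, counter stays at 0
```

One warning is emitted per extra resume, which matches the transcript above: each LV deactivated by `vgchange -an` attempts a resume against bookkeeping that no longer records a matching suspend.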


I'll try and narrow this down and provide more details...



Version-Release number of selected component (if applicable):
3.10.0-123.el7.x86_64
lvm2-2.02.105-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-libs-2.02.105-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
lvm2-cluster-2.02.105-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-1.02.84-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-libs-1.02.84-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-1.02.84-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-event-libs-1.02.84-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014
device-mapper-persistent-data-0.3.2-1.el7    BUILT: Thu Apr  3 09:58:51 CDT 2014
cmirror-2.02.105-14.el7    BUILT: Wed Mar 26 08:29:41 CDT 2014

Comment 2 Jonathan Earl Brassow 2014-09-30 15:08:48 UTC
[root@bp-01 ~]# lvcreate -s -L 500M -n snap vg/lv
  Snapshot of cache LV is not yet supported.

Snapshots of cache LVs are not currently supported, so verifying that the operation is now properly rejected is enough to close this bug. A new feature bug will need to be filed to add snapshot support.

I am marking this bug a duplicate of bug 1091553.

*** This bug has been marked as a duplicate of bug 1091553 ***

Comment 3 Corey Marthaler 2015-01-07 22:53:50 UTC
This message still exists and apparently didn't require snapshots of cache volumes to hit. I'll try to look deeper into what's causing this, but could it be a well-timed interrupted remove attempt, followed by another remove?


# Test output (interrupted during remove)
Removing cache pool cache_sanity/1K_cache
^C


Jan  7 16:40:48 host-110 qarshd[4800]: Running cmdline: lvremove -f /dev/cache_sanity/1K_cache
Jan  7 16:40:49 host-110 qarshd[4800]: Sending child 4803 signal 2
Jan  7 16:41:09 host-110 multipathd: dm-5: mapname not found for 253:5
Jan  7 16:41:09 host-110 multipathd: uevent trigger error
Jan  7 16:41:09 host-110 multipathd: dm-4: mapname not found for 253:4
Jan  7 16:41:09 host-110 multipathd: uevent trigger error
Jan  7 16:41:09 host-110 multipathd: dm-3: mapname not found for 253:3
Jan  7 16:41:09 host-110 multipathd: uevent trigger error
Jan  7 16:41:09 host-110 multipathd: dm-5: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-4: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-5: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-3: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-4: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-3: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-2: remove map (uevent)
Jan  7 16:41:09 host-110 multipathd: dm-2: devmap not registered, can't remove
Jan  7 16:41:09 host-110 multipathd: dm-2: remove map (uevent)





[root@host-110 ~]# pvscan
  PV /dev/sdb1   VG cache_sanity    lvm2 [499.00 MiB / 486.92 MiB free]
  PV /dev/sdc1   VG cache_sanity    lvm2 [499.00 MiB / 486.95 MiB free]
  PV /dev/vda2   VG rhel_host-110   lvm2 [7.51 GiB / 40.00 MiB free]
  PV /dev/sdf1                      lvm2 [15.00 GiB]
  PV /dev/sdd1                      lvm2 [15.00 GiB]
  Total: 5 [38.48 GiB] / in use: 3 [8.48 GiB] / in no VG: 2 [30.00 GiB]

[root@host-110 ~]# vgremove cache_sanity
Do you really want to remove volume group "cache_sanity" containing 1 logical volumes? [y/n]: y
Do you really want to remove active logical volume 1K_origin? [y/n]: y
  Flushing cache for 1K_origin.
  Attempted to decrement suspended device counter below zero.
  Logical volume "1K_cache" successfully removed
  Logical volume "1K_origin" successfully removed
  Volume group "cache_sanity" successfully removed


[root@host-110 ~]# /usr/tests/sts-rhel7.1/lvm2/bin/lvm_rpms 
3.10.0-219.el7.x86_64

lvm2-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
lvm2-libs-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
lvm2-cluster-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-libs-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-event-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-event-libs-1.02.92-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 12:39:46 CST 2014
cmirror-2.02.114-4.el7    BUILT: Wed Jan  7 07:07:47 CST 2015

Comment 4 Zdenek Kabelac 2015-01-24 23:36:20 UTC
I'm not able to reproduce this bug with my lvm2 code.

Is this bug reproducible with the 2.02.115 build?

If so, please attach 'lvs -a' output taken before you run 'vgremove -vvvv vg', and append the vgremove debug trace.

The kernel messages from Comment 3 are unrelated: they come from multipathd, and while they look odd and may need fixes of their own, they should not be related to lvm2.

Unless multipath devices are involved as PVs in your VG? In that case, please also add 'dmsetup table' and 'pvs' output.

Comment 5 Corey Marthaler 2015-01-25 03:04:26 UTC
Yes, I've seen this a couple of times now with the latest rpms. I'm still looking for a definitive reproducer, however.

First reproduction:
[root@host-113 ~]# lvs -a -o +devices
  LV                         VG            Attr       LSize   Pool                 Origin                Data%  Meta% Cpy%Sync Devices
  cache_2_83500              cache_2_8350  Cwims-C---   4.00g [cache_2_83500_fast] [cache_2_83500_corig] 0.04   2.83  0.00     cache_2_83500_corig(0)
  [cache_2_83500_corig]      cache_2_8350  owi-aoC---   4.00g                                                                  /dev/sde1(0)
  [cache_2_83500_fast]       cache_2_8350  Cwi---C---   1.00g                                            0.04   2.83  0.00     cache_2_83500_fast_cdata(0)
  [cache_2_83500_fast_cdata] cache_2_8350  Cwi-ao----   1.00g                                                                  /dev/sdg1(0)
  [cache_2_83500_fast_cmeta] cache_2_8350  ewi-ao----   8.00m                                                                  /dev/sdg1(256)
  [lvol0_pmspare]            cache_2_8350  ewi-------   8.00m                                                                  /dev/sdb1(0)
[root@host-113 ~]# vgremove cache_2_8350
Do you really want to remove volume group "cache_2_8350" containing 1 logical volumes? [y/n]: y
Do you really want to remove active logical volume cache_2_83500? [y/n]: y
  Attempted to decrement suspended device counter below zero.
  Logical volume "cache_2_83500_fast" successfully removed
  Logical volume "cache_2_83500" successfully removed
  Volume group "cache_2_8350" successfully removed



Second reproduction:
[root@host-119 ~]# lvs -a -o +devices
  LV                        VG            Attr       LSize   Pool                Origin               Data%  Meta% Cpy%Sync Devices
  cache_5_4770              cache_5_477   CwimsoC---   4.00g [cache_5_4770_fast] [cache_5_4770_corig] 0.00   2.83  100.00   cache_5_4770_corig(0)
  [cache_5_4770_corig]      cache_5_477   owi-aoC---   4.00g                                                                /dev/sdb1(0)
  [cache_5_4770_fast]       cache_5_477   Cwi---C---   1.00g                                          0.00   2.83  100.00   cache_5_4770_fast_cdata(0)
  [cache_5_4770_fast_cdata] cache_5_477   Cwi-ao----   1.00g                                                                /dev/sdd1(0)
  [cache_5_4770_fast_cmeta] cache_5_477   ewi-ao----   8.00m                                                                /dev/sdd1(256)
  [lvol0_pmspare]           cache_5_477   ewi-------   8.00m                                                                /dev/sdc2(0)
[root@host-119 ~]# lvchange -prw /dev/cache_5_477/cache_5_4770
  Logical volume "cache_5_4770" is already writable
[root@host-119 ~]# lvchange -pr /dev/cache_5_477/cache_5_4770
  Attempted to decrement suspended device counter below zero.
  Logical volume "cache_5_4770" changed.



3.10.0-223.el7.x86_64
lvm2-2.02.115-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
lvm2-libs-2.02.115-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
lvm2-cluster-2.02.115-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
device-mapper-1.02.93-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
device-mapper-libs-1.02.93-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
device-mapper-event-1.02.93-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
device-mapper-event-libs-1.02.93-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015
device-mapper-persistent-data-0.4.1-2.el7    BUILT: Wed Nov 12 12:39:46 CST 2014
cmirror-2.02.115-2.el7    BUILT: Thu Jan 22 06:09:14 CST 2015

Comment 6 Jonathan Earl Brassow 2015-01-26 15:58:20 UTC
Beginning at comment #3, this bug has changed from its original problem (inability to resume the cache device) to a more benign, warning-only issue. The original bug has been fixed by disabling snapshots of cache LVs. I've cloned this bug to address the second issue separately.

*** This bug has been marked as a duplicate of bug 1091553 ***